Update_[18_Aug_2025]_[SHUtils]

inbravo · inbravo · commit 2d7c62c9bcad · 2025-08-18T14:38:41.000+01:00
diff --git a/README.md b/README.md
@@ -1,47 +1,53 @@
 # Python Language Feature Set
 
-## Basic features
+## Basic Features
 
-- [Hello world](com/inbravo/core/HelloWorld.py)
-- [Hello world using Jupitor Notebook](com/inbravo/core/HelloWorld.ipynb)
-- [Data types](com/inbravo/core/DataTypeTest.py)
-- [Variable types](com/inbravo/core/VariableTest.py)
-- [Why intendation matters in Python](com/inbravo/core/IntendationTest.py)
-- [Main function](com/inbravo/core/MainFunctionTest.py)
+- [Hello World](com/inbravo/core/HelloWorld.py)
+- [Hello World using Jupyter Notebook](com/inbravo/core/HelloWorld.ipynb)
+- [Data Types](com/inbravo/core/DataTypeTest.py)
+- [Variable Types](com/inbravo/core/VariableTest.py)
+- [Why Indentation Matters in Python](com/inbravo/core/IntendationTest.py)
+- [Main Function](com/inbravo/core/MainFunctionTest.py)
 
-## Data structures
+## File Operations
+
+- [shutil Examples ("shell utilities"; "sh" stands for shell)](com/inbravo/file/SHUtil_Test.py)
+- [Get File Metadata](com/inbravo/file/File_Meta_Data.py)
+- [Get Count of Files in a Folder](com/inbravo/file/File_Count.py)
+
+## Data Structures
 
 - [Tuples](com/inbravo/core/TupleTest.py)
 - [Sets](com/inbravo/core/SetTest.py)
 
-## String handling
+## String Handling
 
 - [Formatted Strings](com/inbravo/string/FString.py)
 
-## Regular expressions
+## Regular Expressions
 
-- [Regular expressions based string splitting](com/inbravo/regexp/Reg_Exp_Utils.py)
+- [Regular Expressions Based String Splitting](com/inbravo/regexp/Reg_Exp_Utils.py)
 
 ## System
 
-- [Operating system information](com/inbravo/system/OSInfo.py)
-- [Find the library version in environment](com/inbravo/system/LibVersion.py)
+- [Operating System Information](com/inbravo/system/OSInfo.py)
+- [Find the Library Version in Environment](com/inbravo/system/LibVersion.py)
 
-## Matlib
+## Matplotlib
 
-- [Create a two dimentional graph](com/inbravo/matplot/Graph_Test.py)
+- [Create a Two-Dimensional Graph](com/inbravo/matplot/Graph_Test.py)
 
 ## PySpark
 
-- [Calculate gross income of a super marker](com/inbravo/dbx/super-market/Gross_Income.ipynb)
+- [Calculate Gross Income of a Supermarket](com/inbravo/dbx/super-market/Gross_Income.ipynb)
 
-## VENV (Optional)
+## Virtual Environment (VENV) (Optional)
 
 1. Install Python: `brew install python@3.11`
 2. Install PIP: `pip3.11 install uv`
-3. Create VENV in the downloaded codebase: `python3.11 -m venv .venv`
-4. uv pip install -r requirements.txt
-5. Activate VENV: `source .venv/bin/activate`
+3. Create a virtual environment in the downloaded codebase: `python3.11 -m venv .venv`
+4. Activate the virtual environment: `source .venv/bin/activate`
+5. Install dependencies: `pip install -r requirements.txt`
 
 ## License
 
diff --git a/com/inbravo/.DS_Store b/com/inbravo/.DS_Store
diff --git a/com/inbravo/file/SHUtil_Test.py b/com/inbravo/file/SHUtil_Test.py
@@ -0,0 +1,47 @@
+import os
+import shutil
+
+# This script demonstrates common file operations with Python's shutil module,
+# using os for the steps shutil does not cover (mkdir, rename, remove):
+#    Directory and file operations
+#        copy and copytree examples
+#        move and rename examples
+#        rmtree example
+#    Archiving operations
+#        make_archive example
+#        unpack_archive example
+
+# Create a directory (no error if it is left over from an earlier run)
+os.makedirs('example_dir', exist_ok=True)
+
+# Create a file in the directory
+with open('example_dir/example_file.txt', 'w', encoding='utf-8') as f:
+    f.write('This is an example file.')
+
+# Copy a file
+shutil.copy('example_dir/example_file.txt', 'example_dir/copied_file.txt')
+
+# Copy a directory (dirs_exist_ok requires Python 3.8+)
+shutil.copytree('example_dir', 'example_dir_copy', dirs_exist_ok=True)
+
+# Move a file
+shutil.move('example_dir/copied_file.txt', 'example_dir/moved_file.txt')
+
+# Rename a file
+os.rename('example_dir/moved_file.txt', 'example_dir/renamed_file.txt')
+
+# Archive a directory (creates a zip file)
+shutil.make_archive('example_dir_archive', 'zip', 'example_dir')
+
+# Extract the archive
+shutil.unpack_archive('example_dir_archive.zip', 'extracted_dir')
+
+# Remove a directory tree
+shutil.rmtree('example_dir')
+shutil.rmtree('example_dir_copy')
+shutil.rmtree('extracted_dir')
+
+# Remove the archive
+os.remove('example_dir_archive.zip')
+
+print("All shutil operations completed successfully.")
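Review note on `SHUtil_Test.py`: the script above creates and deletes its working files in the current directory, so an interrupted run can leave `example_dir` behind. A minimal sketch of the same steps inside a temporary directory follows. The function name `run_shutil_demo` and the keys of the returned dictionary are hypothetical and are not part of the commit; the `shutil.get_terminal_size` call is an extra that the committed script does not make.

```python
import shutil
import tempfile
from pathlib import Path


def run_shutil_demo() -> dict:
    """Hypothetical helper: repeat the script's steps in a throwaway directory."""
    results = {}
    # TemporaryDirectory deletes its contents on exit, so no manual rmtree is needed.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src_dir = root / "example_dir"
        src_dir.mkdir()
        (src_dir / "example_file.txt").write_text(
            "This is an example file.", encoding="utf-8"
        )

        # Copy a single file, then copy the whole directory tree.
        shutil.copy(src_dir / "example_file.txt", src_dir / "copied_file.txt")
        shutil.copytree(src_dir, root / "example_dir_copy")

        # shutil.move acts as a rename when the destination is in the same directory.
        shutil.move(str(src_dir / "copied_file.txt"), str(src_dir / "renamed_file.txt"))

        # make_archive appends ".zip" itself and returns the path of the archive.
        archive = shutil.make_archive(
            str(root / "example_dir_archive"), "zip", root_dir=str(src_dir)
        )
        shutil.unpack_archive(archive, str(root / "extracted_dir"))

        results["archive_name"] = Path(archive).name
        results["extracted"] = sorted(p.name for p in (root / "extracted_dir").iterdir())
        results["copied"] = sorted(p.name for p in (root / "example_dir_copy").iterdir())

    # The temporary directory is gone once the with-block exits.
    results["cleaned_up"] = not root.exists()
    # Falls back to 80 columns when no terminal is attached.
    results["terminal_columns"] = shutil.get_terminal_size(fallback=(80, 24)).columns
    return results
```

Calling `run_shutil_demo()` returns a small summary instead of printing, which makes the steps easy to check in a test. The archive is written next to `example_dir` rather than inside it, so it does not end up archiving itself.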
diff --git a/com/inbravo/llm/dataloader.ipynb b/com/inbravo/llm/dataloader.ipynb
@@ -0,0 +1,202 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6e2a4891-c257-4d6b-afb3-e8fef39d0437",
+   "metadata": {},
+   "source": [
+    "<table style=\"width:100%\">\n",
+    "<tr>\n",
+    "<td style=\"vertical-align:middle; text-align:left;\">\n",
+    "<font size=\"2\">\n",
+    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
+    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
+    "</font>\n",
+    "</td>\n",
+    "<td style=\"vertical-align:middle; text-align:left;\">\n",
+    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
+    "</td>\n",
+    "</tr>\n",
+    "</table>\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f678e62-7bcb-4405-86ae-dce94f494303",
+   "metadata": {},
+   "source": [
+    "# The Main Data Loading Pipeline Summarized"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "070000fc-a7b7-4c56-a2c0-a938d413a790",
+   "metadata": {},
+   "source": [
+    "The complete chapter code is located in [ch02.ipynb](./ch02.ipynb).\n",
+    "\n",
+    "This notebook contains the main takeaway, the data loading pipeline without the intermediate steps."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b4e8f2d-cb81-41a3-8780-a70b382e18ae",
+   "metadata": {},
+   "source": [
+    "Packages that are being used in this notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c7ed6fbe-45ac-40ce-8ea5-4edb212565e1",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch version: 2.8.0\n",
+      "tiktoken version: 0.11.0\n"
+     ]
+    }
+   ],
+   "source": [
+    "# NBVAL_SKIP\n",
+    "from importlib.metadata import version\n",
+    "\n",
+    "print(\"torch version:\", version(\"torch\"))\n",
+    "print(\"tiktoken version:\", version(\"tiktoken\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ed4b7db-3b47-4fd3-a4a6-5f4ed5dd166e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tiktoken\n",
+    "import torch\n",
+    "import os\n",
+    "import urllib.request\n",
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "\n",
+    "class GPTDatasetV1(Dataset):\n",
+    "    def __init__(self, txt, tokenizer, max_length, stride):\n",
+    "        self.input_ids = []\n",
+    "        self.target_ids = []\n",
+    "\n",
+    "        # Tokenize the entire text\n",
+    "        token_ids = tokenizer.encode(txt, allowed_special={\"<|endoftext|>\"})\n",
+    "\n",
+    "        # Use a sliding window to chunk the book into overlapping sequences of max_length\n",
+    "        for i in range(0, len(token_ids) - max_length, stride):\n",
+    "            input_chunk = token_ids[i:i + max_length]\n",
+    "            target_chunk = token_ids[i + 1: i + max_length + 1]\n",
+    "            self.input_ids.append(torch.tensor(input_chunk))\n",
+    "            self.target_ids.append(torch.tensor(target_chunk))\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        return len(self.input_ids)\n",
+    "\n",
+    "    def __getitem__(self, idx):\n",
+    "        return self.input_ids[idx], self.target_ids[idx]\n",
+    "\n",
+    "\n",
+    "def create_dataloader_v1(txt, batch_size, max_length, stride,\n",
+    "                         shuffle=True, drop_last=True, num_workers=0):\n",
+    "    # Initialize the tokenizer\n",
+    "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
+    "\n",
+    "    # Create dataset\n",
+    "    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)\n",
+    "\n",
+    "    # Create dataloader\n",
+    "    dataloader = DataLoader(\n",
+    "        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)\n",
+    "\n",
+    "    return dataloader\n",
+    "\n",
+    "# Download the text file if it does not exist \n",
+    "if not os.path.exists(\"the-verdict.txt\"):\n",
+    "    print(\"Downloading the-verdict.txt...\")\n",
+    "    url = (\"https://raw.githubusercontent.com/rasbt/\"\n",
+    "           \"LLMs-from-scratch/main/ch02/01_main-chapter-code/\"\n",
+    "           \"the-verdict.txt\")\n",
+    "    file_path = \"the-verdict.txt\"\n",
+    "    urllib.request.urlretrieve(url, file_path)\n",
+    "\n",
+    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
+    "    raw_text = f.read()\n",
+    "\n",
+    "vocab_size = 50257\n",
+    "output_dim = 256\n",
+    "context_length = 1024\n",
+    "\n",
+    "\n",
+    "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
+    "pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)\n",
+    "\n",
+    "batch_size = 8\n",
+    "max_length = 4\n",
+    "dataloader = create_dataloader_v1(\n",
+    "    raw_text,\n",
+    "    batch_size=batch_size,\n",
+    "    max_length=max_length,\n",
+    "    stride=max_length\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "664397bc-6daa-4b88-90aa-e8fc1fbd5846",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for batch in dataloader:\n",
+    "    x, y = batch\n",
+    "\n",
+    "    token_embeddings = token_embedding_layer(x)\n",
+    "    pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
+    "\n",
+    "    input_embeddings = token_embeddings + pos_embeddings\n",
+    "\n",
+    "    break"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d3664332-e6bb-447e-8b96-203aafde8b24",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(input_embeddings.shape)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
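Review note on `dataloader.ipynb`: the sliding-window loop in `GPTDatasetV1` is easier to follow on a toy input. The sketch below reproduces only the chunking logic in plain Python, without `torch` or `tiktoken`. The function name `sliding_window_pairs` and the made-up token IDs are illustrative assumptions, not code from the notebook.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    # Same loop bounds as the notebook: stop early enough that the target chunk is full.
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


# Made-up IDs standing in for tokenizer output (assumption for illustration).
token_ids = list(range(10))
pairs = sliding_window_pairs(token_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# prints:
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With `stride` equal to `max_length`, as in the notebook, the input chunks do not overlap; a smaller stride such as 1 would yield overlapping chunks. With the notebook's settings (`batch_size = 8`, `max_length = 4`, `output_dim = 256`), the final cell should print `torch.Size([8, 4, 256])`, because the positional embeddings of shape `[4, 256]` broadcast across the batch dimension.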
diff --git a/requirements.txt b/requirements.txt
@@ -1,10 +1,9 @@
 matplotlib >= 3.10         # com.inbravo.matplot
 pandas >= 2.2.1            # com.inbravo.pandas
-torch >= 2.3.0             # all
-jupyterlab >= 4.0          # all
-tiktoken >= 0.5.1          # ch02; ch04; ch05
-matplotlib >= 3.7.1        # ch04; ch06; ch07
-tensorflow >= 2.18.0       # ch05; ch06; ch07
-tqdm >= 4.66.1             # ch05; ch07
+torch >= 2.3.0             # com.inbravo.llm
+jupyterlab >= 4.0          # com.inbravo.llm
+tiktoken >= 0.5.1          # com.inbravo.llm
+tensorflow >= 2.18.0       # com.inbravo.llm
+tqdm >= 4.66.1             # com.inbravo.llm
 numpy >= 1.26, < 2.1       # dependency of several other libraries like torch and pandas
-psutil >= 5.9.5            # ch07; already installed automatically as dependency of torch
+psutil >= 5.9.5            # com.inbravo.llm; also installed as a dependency of torch
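Review note on `requirements.txt`: after installing, it can help to confirm which of the listed packages are present in the active environment. The sketch below uses the same `importlib.metadata.version` call as the notebook's first code cell and adds handling for missing packages. The function name `installed_versions` is hypothetical, and the sketch only reports versions; it does not compare them against the `>=` constraints.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Names taken from requirements.txt; the output depends on the environment.
    for name, ver in installed_versions(["torch", "tiktoken", "tqdm", "psutil"]).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```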
diff --git a/the-verdict.txt b/the-verdict.txt