{"id":1,"date":"2025-08-29T15:04:12","date_gmt":"2025-08-29T19:04:12","guid":{"rendered":"https:\/\/wp.stgeorges.bc.ca\/connork\/?p=1"},"modified":"2025-10-17T13:02:14","modified_gmt":"2025-10-17T17:02:14","slug":"littlegpt","status":"publish","type":"post","link":"https:\/\/wp.stgeorges.bc.ca\/connork\/2025\/08\/29\/littlegpt\/","title":{"rendered":"Building ChatGPT and Adding My Own Twist"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">Introduction<\/h3>\n\n\n\n<p class=\"is-style-default wp-block-paragraph\">When I started this project, I wanted to challenge myself by rebuilding GPT-2 piece by piece, something I&#8217;d been putting off for a while.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">My journey began with micrograd, a tiny automatic differentiation engine (which I&#8217;ll explain later). It then escalated into coding a GPT-2 style transformer model, training it on small slices of WikiText and C4, and finally deploying it in a Streamlit app that could answer questions. I had seen a podcast clip of someone saying they wished they could have a private chatbot that didn&#8217;t send their data across the internet, so I set out to build one.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I did this by adding RAG (retrieval-augmented generation) so the chatbot could use my own uploaded notes, which I&#8217;ll talk more about later.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This wasn\u2019t easy. Micrograd was deceptively hard, and GPT-2 was even harder. But each step taught me something crucial about how modern AI systems are built, which I believe will really help me going forward.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Micrograd Was Supposed to Be Small<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Micrograd looks really simple (or at least it did when I was looking at the repo): just a few dozen lines of Python. 
But it was one of the hardest parts of this project because it forced me to change the way I thought about math, especially since understanding it well assumes college-level calculus.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>class Value:\n    def __init__(self, data, children=()):\n        self.data = data                # the raw number\n        self.grad = 0                   # gradient of the output w.r.t. this node\n        self._backward = lambda: None   # replaced by the operation that creates this node\n        self._prev = set(children)      # the inputs that produced this node\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This snippet is the <em>core building block<\/em> of micrograd.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>class Value<\/code>: Each number in your computation isn\u2019t just a number anymore. It\u2019s a node in a computation graph, which at a larger scale is a neural network.<\/li>\n\n\n\n<li><code>data<\/code>: the raw number (like <code>2.5<\/code>).<\/li>\n\n\n\n<li><code>grad<\/code>: where the gradient (the derivative of the output with respect to the node) will be stored once you run &#8220;backpropagation&#8221;.<\/li>\n\n\n\n<li><code>_backward<\/code>: a function placeholder that tells this node how to send its gradient backward to its inputs when <code>.backward()<\/code> is called.<\/li>\n\n\n\n<li><code>_prev<\/code>: the set of \u201cchildren\u201d (really, the inputs that produced this node). 
This is what makes the whole thing a graph.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">What I learned:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Graph thinking\n<ul class=\"wp-block-list\">\n<li>Computations are really a graph where each node tracks how it was created.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Backpropagation\n<ul class=\"wp-block-list\">\n<li>Calling <code>.backward()<\/code> sends gradients backward through every operation using the chain rule.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>System over calculator\n<ul class=\"wp-block-list\">\n<li>Once every operation knows its own local derivative, the same small engine can differentiate any expression built from those operations, which a basic calculator cannot do.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Extra explanation (micrograd snippet):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Think of each <code>Value<\/code> as a spreadsheet cell that not only holds a number but also remembers the formula that produced it.<\/li>\n\n\n\n<li>When an operation (add, mul, tanh, etc.) creates a new node, it sets that node&#8217;s <code>_backward<\/code> to a tiny function; during backprop, those functions push gradients into the <code>grad<\/code> fields of the inputs.<\/li>\n\n\n\n<li><code>_prev<\/code> lets you traverse the graph in reverse topological order to apply all those tiny <code>_backward<\/code> functions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">GPT-2 Was Another Level<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Micrograd in all of its glory prepared me for JUST THE BASICS. GPT-2 pushed me into deep water (and a bit into insanity). 
The architecture looked simple on paper but was full of tricky details that took me a while to grasp.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Attention Was Brutal<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Understanding self-attention was the hardest part.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\nimport torch.nn as nn\n\n\nclass Head(nn.Module):\n    def __init__(self, n_embd, head_size):\n        super().__init__()\n        self.key = nn.Linear(n_embd, head_size, bias=False)\n        self.query = nn.Linear(n_embd, head_size, bias=False)\n        self.value = nn.Linear(n_embd, head_size, bias=False)\n\n    def forward(self, x):\n        B, T, C = x.shape\n        k = self.key(x)    # (B, T, head_size)\n        q = self.query(x)  # (B, T, head_size)\n        # Scale by sqrt(head_size), the key dimension, so the softmax stays well-behaved.\n        wei = q @ k.transpose(-2, -1) * (k.size(-1) ** -0.5)  # (B, T, T)\n        # Causal mask: each token may attend only to itself and earlier tokens.\n        mask = torch.tril(torch.ones(T, T, device=x.device))\n        wei = wei.masked_fill(mask == 0, float('-inf'))\n        wei = torch.softmax(wei, dim=-1)\n        v = self.value(x)  # (B, T, head_size)\n        return wei @ v\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What I learned:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Queries, keys, values: queries are \u201cwhat I\u2019m looking for,\u201d keys are \u201cwhat I have,\u201d values are \u201cthe content.\u201d<\/li>\n\n\n\n<li>Shapes matter: batch size, sequence length, embedding dim; a single mismatch crashes the model.<\/li>\n\n\n\n<li>Multi-heads: learn different relationships in parallel.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (attention head):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The head projects the input into three versions:<\/li>\n\n\n\n<li><strong>Query (Q)<\/strong> = what the token is looking for.<\/li>\n\n\n\n<li><strong>Key (K)<\/strong> = what the token has to offer.<\/li>\n\n\n\n<li><strong>Value (V)<\/strong> = the actual info.\n<ul class=\"wp-block-list\">\n<li>The math compares Qs with Ks to find relevance, then uses those scores to mix together Vs. 
The output is each token rewritten with context from the others.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Positional Embeddings<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Transformers aren\u2019t naturally sequential. Positional embeddings give them a sense of order, which turns out to be foundational.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (positional embeddings, conceptually):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Without positions, the model can\u2019t tell \u201cthe dog bit the man\u201d from \u201cthe man bit the dog.\u201d<\/li>\n\n\n\n<li>Positional vectors are added to token embeddings so the model can learn order-sensitive patterns (like bigrams and syntax).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Training Was Such a Freaking Grind<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>for step in range(max_iters):\n    xb, yb = get_batch(\"train\")\n    logits, loss = model(xb, yb)\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n    if step % eval_interval == 0:\n        val_loss = estimate_loss(\"val\")\n        print(step, loss.item(), val_loss)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What I learned:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loss going down feels magical, even when it\u2019s just math.<\/li>\n\n\n\n<li>Hyperparameters (LR, batch size, dropout) make or break runs.<\/li>\n\n\n\n<li>Sometimes a single GPU and simpler tooling beat fighting with TPUs.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (training loop):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Grab a batch of inputs\/targets.<\/li>\n\n\n\n<li>Run the model to get predictions and calculate loss.<\/li>\n\n\n\n<li>Clear old gradients, backprop to compute new ones, and step the optimizer to update weights.<\/li>\n\n\n\n<li>Every so often, check validation loss to see if the model is learning or 
overfitting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">From Model to App: LittleGPT<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The end result is a Streamlit app that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loads a small HF model locally (e.g., Qwen 0.6B) with device\/precision controls and <code>@st.cache_resource<\/code>.<\/li>\n\n\n\n<li>Lets you chat and optionally ground answers in your own uploaded notes via embeddings + FAISS RAG.<\/li>\n\n\n\n<li>Supports quick LoRA fine\u2011tuning and simple evaluation.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">The Heartbeat: Generation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The core generation function uses plain <code>transformers<\/code> under the hood. It prepares tensors, calls <code>model.generate<\/code>, and decodes the output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\n\n\ndef generate(\n    model,\n    tokenizer,\n    prompt: str,\n    max_new_tokens: int = 128,\n    temperature: float = 0.0,\n    top_p: float = 0.9,\n    **extra_generate_kwargs,\n) -&gt; str:\n    \"\"\"Generate text from a prompt using lightweight decoding defaults.\"\"\"\n    device = next(model.parameters()).device\n    inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n    # Optional controls (e.g. top_k, repetition_penalty) pass through; unset (None) ones are dropped.\n    extra = {k: v for k, v in extra_generate_kwargs.items() if v is not None}\n    with torch.no_grad():\n        temp = max(float(temperature), 0.0)\n        do_sample = temp &gt; 0\n        sampling_temp = max(temp, 1e-5) if do_sample else 1.0\n        output_ids = model.generate(\n            **inputs,\n            max_new_tokens=max_new_tokens,\n            temperature=sampling_temp,\n            top_p=min(max(float(top_p), 0.1), 1.0) if do_sample else 1.0,\n            do_sample=do_sample,\n            pad_token_id=tokenizer.pad_token_id,\n            eos_token_id=tokenizer.eos_token_id,\n            **extra,\n        )\n    generated = tokenizer.decode(output_ids&#091;0], skip_special_tokens=True)\n    if generated.startswith(prompt):\n        return generated&#091;len(prompt) :].strip()\n    return generated.strip()\n<\/code><\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Code explanation (generation):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Take the user\u2019s prompt, tokenize it, and feed it to the model.<\/li>\n\n\n\n<li>If <code>temperature<\/code> is 0, decoding is greedy (always the most likely token). If <code>temperature &gt; 0<\/code>, the model samples. Lower = safer, higher = more creative.<\/li>\n\n\n\n<li><code>top_p<\/code> (nucleus sampling) restricts sampling to the smallest set of tokens whose combined probability reaches <code>top_p<\/code>.<\/li>\n\n\n\n<li>The model outputs tokens, we decode them back into text, and strip out the original prompt so only the continuation is shown.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">On the chat page, we compose a concise prompt template and call <code>generate<\/code> with decoding controls sourced from the sidebar:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prompt_template = (\n    \"You are Littlegpt, Connor's concise assistant. State your identity once at the start of a conversation and only repeat it if the user explicitly asks. Respond in at most four short sentences, without small talk, follow-up questions, or offers of extra help unless requested. Do not use emojis. 
Use the provided context when it helps.\\n\\n\"\n    \"### User Instruction:\\n{instruction}\\n\\n### Context:\\n{context}\\n\\n### Answer:\".format(\n        instruction=prompt.strip(),\n        context=input_block.strip(),\n    )\n)\n\nstart_time = time.time()\nwith st.chat_message(\"assistant\"):\n    with st.spinner(\"Generating response...\"):\n        raw_response = generate(\n            model,\n            tokenizer,\n            prompt_template,\n            max_new_tokens=int(st.session_state.get(\"max_new_tokens\", 128)),\n            temperature=float(st.session_state.get(\"temperature\", 0.7)),\n            top_p=float(st.session_state.get(\"top_p\", 0.9)),\n            top_k=int(st.session_state.get(\"top_k\", 0)) or None,\n            repetition_penalty=float(st.session_state.get(\"repetition_penalty\", 1.05)),\n            no_repeat_ngram_size=int(st.session_state.get(\"no_repeat_ngram_size\", 3)) or None,\n        )\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (chat page call):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The app wraps the user\u2019s question and any retrieved context into a prompt template.<\/li>\n\n\n\n<li>All the generation settings (tokens, temperature, top-p, etc.) come from the sidebar.<\/li>\n\n\n\n<li>While it runs, Streamlit shows a spinner, then prints the model\u2019s answer with latency.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Retrieval-Augmented Generation (RAG)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Uploads are chunked, embedded with a sentence-transformer, and searched via FAISS (if available). 
Top\u2011k snippets are stitched into the prompt.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Configure and add documents:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>@dataclass(frozen=True)\nclass RAGConfig:\n    embed_model: str = \"sentence-transformers\/all-MiniLM-L6-v2\"\n    chunk_size: int = 320\n    overlap: int = 48\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (RAGConfig):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Here is the code simplified:<\/strong><\/li>\n\n\n\n<li><code>embed_model<\/code>: the encoder that turns text into vectors.<\/li>\n\n\n\n<li><code>chunk_size<\/code>: how big each text piece is.<\/li>\n\n\n\n<li><code>overlap<\/code>: how much chunks overlap so no meaning is lost across boundaries.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>def add_document(self, doc_key: str, name: str, text: str) -&gt; Tuple&#091;int, int]:\n    \"\"\"Add a document if it's not already cached. Returns (chunks_added, total_chunks).\"\"\"\n    if doc_key in self._documents:\n        return 0, self.chunk_count\n\n    chunks = chunk_text(text, max_tokens=self.config.chunk_size, overlap=self.config.overlap)\n    if not chunks:\n        return 0, self.chunk_count\n\n    embedder = self._embedding_model()\n    embeddings = embedder.encode(\n        chunks,\n        batch_size=min(16, len(chunks)),\n        convert_to_numpy=True,\n        normalize_embeddings=True,\n    )\n\n    embeddings = _normalize(embeddings)\n\n    start_idx = len(self._chunk_texts)\n    new_indices = list(range(start_idx, start_idx + len(chunks)))\n\n    self._chunk_texts.extend(chunks)\n    self._chunk_sources.extend(&#091;name] * len(chunks))\n    self._append_embeddings(embeddings)\n    self._documents&#091;doc_key] = RAGDocument(name=name, chunk_indices=new_indices)\n    return len(chunks), self.chunk_count\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation 
(add_document):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Skip files you\u2019ve already added.<\/li>\n\n\n\n<li>Split text into chunks, embed them, normalize the vectors, and save them with their source name.<\/li>\n\n\n\n<li>This way the app can later search over them fast.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>def search(self, query: str, top_k: int = 3) -&gt; List&#091;Tuple&#091;str, str, float]]:\n    if not self._chunk_texts:\n        return &#091;]\n\n    embedder = self._embedding_model()\n    query_embedding = embedder.encode(\n        &#091;query], convert_to_numpy=True, normalize_embeddings=True\n    )\n    query_embedding = _normalize(query_embedding)\n\n    if faiss is not None and self._index is not None:\n        scores, indices = self._index.search(query_embedding.astype(\"float32\"), top_k)\n        results: List&#091;Tuple&#091;str, str, float]] = &#091;]\n        if indices is not None and len(indices) and len(indices&#091;0]):\n            for idx, score in zip(indices&#091;0], scores&#091;0]):\n                if idx is None or idx == -1:\n                    continue\n                if not (0 &lt;= int(idx) &lt; len(self._chunk_texts)):\n                    continue\n                j = int(idx)\n                results.append((self._chunk_sources&#091;j], self._chunk_texts&#091;j], float(score)))\n        return results\n\n    if self._embeddings is None:\n        return &#091;]\n\n    sims = np.dot(self._embeddings, query_embedding.squeeze(0))\n    if sims.ndim == 0:\n        sims = np.array(&#091;float(sims)])\n    top_indices = sims.argsort()&#091;::-1]&#091;:top_k]\n    return &#091;\n        (self._chunk_sources&#091;idx], self._chunk_texts&#091;idx], float(sims&#091;idx]))\n        for idx in top_indices\n        if sims&#091;idx] &gt; 0\n    ]\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (search):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Turn the query 
into a vector and compare it with stored chunks.<\/li>\n\n\n\n<li>If FAISS is available, use it for speed; otherwise just use dot products.<\/li>\n\n\n\n<li>Return the best-matching snippets with their sources.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>context_chunks: List&#091;tuple&#091;str, str, float]] = &#091;]\nif rag_store.chunk_count:\n    context_chunks = rag_store.search(prompt, top_k=rag_top_k)\n\nconversation_prefix = \"\\n\".join(\n    f\"{msg&#091;'role'].capitalize()}: {msg&#091;'content']}\"\n    for msg in st.session_state.chat_history&#091;-6:-1]\n    if msg&#091;\"role\"] != \"system\"\n)\n\ncontext_sections: List&#091;str] = &#091;]\nif conversation_prefix:\n    context_sections.append(\"Conversation so far:\\n\" + conversation_prefix)\nif context_chunks:\n    joined = \"\\n\\n\".join(f\"&#091;{src}] {text}\" for src, text, _ in context_chunks)\n    context_sections.append(\"Context documents:\\n\" + joined)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (wiring RAG):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Recent chat history is bundled with the retrieved document snippets.<\/li>\n\n\n\n<li>They get stitched into a \u201cContext\u201d block that the model sees alongside the user\u2019s instruction.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Under the hood, text chunking uses a simple token-approximate word stride:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def chunk_text(text: str, max_tokens: int = 320, overlap: int = 48) -&gt; List&#091;str]:\n    \"\"\"Split text into overlapping chunks sized ~max_tokens tokens (word-level).\"\"\"\n    words = text.split()\n    if not words:\n        return &#091;]\n\n    if max_tokens &lt;= 0:\n        return &#091;text]\n\n    stride = max(max_tokens - overlap, 1)\n    chunks: List&#091;str] = &#091;]\n    for start in range(0, len(words), stride):\n        chunk = \" \".join(words&#091;start : start + max_tokens])\n        if 
chunk:\n            chunks.append(chunk)\n    return chunks\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (chunking):<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-f56f613f wp-block-group-is-layout-flex\">\n<ul class=\"wp-block-list\">\n<li>Split the text into word windows of <code>max_tokens<\/code>, sliding forward by <code>max_tokens - overlap<\/code>.<\/li>\n\n\n\n<li>Overlap makes sure info that straddles boundaries isn\u2019t lost.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Fine\u2011Tuning with LoRA<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Fine-tuning means taking a pre-trained model (like GPT-2) and teaching it extra patterns on a smaller, domain-specific dataset (like customer support chats, medical notes, or your own writing style). Instead of training from scratch, you \u201cnudge\u201d the model so it adapts quickly to your use-case. Below is an image of me in a Jupyter notebook using a v5e1 (high end Colab GPU) to train GPT-2. The table being printed shows <strong>step_loss<\/strong> (how wrong the model was at each step) and <strong>elapsed<\/strong> (how long training has run).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Step 0: Loss was around <strong>10.97<\/strong>.<\/li>\n\n\n\n<li>By Step 30: Loss has dropped to around <strong>7.64<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">That steady downward trend means the model is <strong>learning<\/strong>. It\u2019s adjusting weights to better predict tokens from your dataset. 
The Colab GPU makes this possible in minutes rather than days with something like my laptop.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"733\" src=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-1024x733.png\" alt=\"\" class=\"wp-image-48\" srcset=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-1024x733.png 1024w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-300x215.png 300w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-768x550.png 768w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-1536x1099.png 1536w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Screenshot-2025-09-21-at-5.30.13-PM-2048x1466.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can prototype instruction-tuning with LoRA right from the app. 
The training module prepares a PEFT config over common projection layers and uses TRL\u2019s <code>SFTTrainer<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>lora_config = LoraConfig(\n    r=lora_r,\n    lora_alpha=lora_alpha,\n    lora_dropout=lora_dropout,\n    bias=\"none\",\n    task_type=\"CAUSAL_LM\",\n    target_modules=&#091;\n        \"q_proj\",\n        \"k_proj\",\n        \"v_proj\",\n        \"o_proj\",\n        \"gate_proj\",\n        \"down_proj\",\n        \"up_proj\",\n    ],\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (LoRA config):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instead of retraining all weights, LoRA slips tiny low-rank adapters into projection layers.\n<ul class=\"wp-block-list\">\n<li><strong><code>r<\/code> (rank):<\/strong> You can think of it as the width of the adapter. Bigger <code>r<\/code> = more capacity to learn, but also more compute.<\/li>\n\n\n\n<li><strong><code>alpha<\/code>:<\/strong> A scaling factor that sets how much influence the adapters have relative to the frozen base model weights.<\/li>\n\n\n\n<li><strong><code>dropout<\/code>:<\/strong> Randomly turns off parts of the adapter during training, which discourages the adapters from memorizing the training data and helps them generalize.<\/li>\n\n\n\n<li><strong>target_modules:<\/strong> Lists exactly which parts of the transformer get these adapters. 
In this case, that means the attention projections (q_proj, k_proj, v_proj, o_proj), which are the heart of self-attention, plus the feed-forward projections (gate_proj, up_proj, down_proj).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Training arguments and <code>SFTTrainer<\/code> setup:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>training_args = TrainingArguments(\n    output_dir=tmp_dir,\n    num_train_epochs=num_train_epochs,\n    per_device_train_batch_size=batch_size,\n    gradient_accumulation_steps=1,\n    warmup_ratio=0.03,\n    learning_rate=learning_rate,\n    fp16=model.dtype == torch.float16,\n    bf16=model.dtype == torch.bfloat16,\n    logging_steps=max(1, len(dataset) \/\/ max(batch_size, 1)),\n    save_strategy=\"no\",\n    report_to=&#091;],\n)\n\ntrainer = SFTTrainer(\n    model=model,\n    train_dataset=dataset,\n    args=training_args,\n    peft_config=lora_config,\n    tokenizer=tokenizer,\n    dataset_text_field=\"text\",\n    max_seq_length=max_seq_length,\n    packing=False,\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (training args + SFTTrainer):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Defines how to fine-tune: epochs, batch size, learning rate, precision (fp16\/bf16).<\/li>\n\n\n\n<li>SFTTrainer runs supervised fine-tuning on text, with LoRA activated.<\/li>\n\n\n\n<li><code>packing=False<\/code> keeps each training example separate, making it simpler on my end.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quick Evaluation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A small evaluation loop runs deterministic generations and computes exact match + a simple BLEU proxy:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def _batched_generate(model, tokenizer, prompts: List&#091;str], max_new_tokens: int) -&gt; List&#091;str]:\n    device = next(model.parameters()).device\n    encoded = tokenizer(prompts, return_tensors=\"pt\", padding=True, truncation=True).to(device)\n    outputs = model.generate(\n        **encoded,\n        
max_new_tokens=max_new_tokens,\n        do_sample=False,  # force greedy decoding; a checkpoint's default generation config may enable sampling\n        pad_token_id=tokenizer.pad_token_id,\n        eos_token_id=tokenizer.eos_token_id,\n    )\n    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)\n    responses: List&#091;str] = &#091;]\n    for prompt, text in zip(prompts, decoded):\n        if text.startswith(prompt):\n            responses.append(text&#091;len(prompt) :].strip())\n        else:\n            responses.append(text.strip())\n    return responses\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (evaluation generation):<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group is-nowrap is-layout-flex wp-container-core-group-is-layout-f56f613f wp-block-group-is-layout-flex\">\n<ul class=\"wp-block-list\">\n<li>Batch prompts, generate responses deterministically (no randomness), and clean up the outputs.<\/li>\n\n\n\n<li>This gives stable, repeatable results for scoring.<\/li>\n<\/ul>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Flowchart<\/h3>\n\n\n\n<h5 class=\"wp-block-heading\">How to Read These Flowcharts<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">The app has three main flows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Core Chat Flow<\/strong> &#8211; what happens when you open the app, type a prompt, and get an answer (with optional RAG if you\u2019ve uploaded notes).<\/li>\n\n\n\n<li><strong>LoRA Fine-Tuning Flow<\/strong> (left) &#8211; an optional path if you want to add adapters and specialize the model.<\/li>\n\n\n\n<li><strong>Evaluation Flow<\/strong> (right) &#8211; another optional path for testing the model on batches of prompts and scoring results.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" 
width=\"607\" height=\"1024\" data-id=\"57\" src=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-607x1024.png\" alt=\"\" class=\"wp-image-57\" style=\"width:634px;height:auto\" srcset=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-607x1024.png 607w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-178x300.png 178w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-768x1295.png 768w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-911x1536.png 911w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-1215x2048.png 1215w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-213935-scaled.png 1519w\" sizes=\"auto, (max-width: 607px) 100vw, 607px\" \/><\/figure>\n<\/figure>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"439\" height=\"1024\" data-id=\"59\" src=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-439x1024.png\" alt=\"\" class=\"wp-image-59\" srcset=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-439x1024.png 439w, 
https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-129x300.png 129w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-768x1790.png 768w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-659x1536.png 659w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-879x2048.png 879w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214148-scaled.png 1099w\" sizes=\"auto, (max-width: 439px) 100vw, 439px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"495\" height=\"1024\" data-id=\"61\" src=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-495x1024.png\" alt=\"\" class=\"wp-image-61\" srcset=\"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-495x1024.png 495w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-145x300.png 145w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-768x1589.png 768w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-742x1536.png 742w, https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-990x2048.png 990w, 
https:\/\/wp.stgeorges.bc.ca\/connork\/wp-content\/uploads\/sites\/21\/2025\/08\/Untitled-diagram-_-Mermaid-Chart-2025-09-25-214511-1-scaled.png 1237w\" sizes=\"auto, (max-width: 495px) 100vw, 495px\" \/><\/figure>\n<\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">TL;DR<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Chat: <\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Prompt &#8211; (optional RAG) &#8211; assemble final prompt &#8211; generate &#8211; decode\/return &#8211; show latency + tokens.<\/strong><br><em>(LoRA fine-tune and Evaluation are optional side flows.)<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Core chat flow<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open app &#8211; Sidebar<br>Pick device (CPU\/MPS\/CUDA), precision\/quantization, and gen params (max tokens, temperature, top-p).<\/li>\n\n\n\n<li>Load model &#8211; Cached base or LoRA adapter<\/li>\n\n\n\n<li>Chat input &#8211; User types prompt<\/li>\n\n\n\n<li>RAG branch (only if uploads exist)<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Yes &#8211; Chunk &#8211; Embed &#8211; Build\/Use FAISS &#8211; Top-k search &#8211; produce context snippets.<\/li>\n\n\n\n<li>No &#8211; Skip straight to prompt assembly.<\/li>\n<\/ul>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>Prompt assembly &#8211; Recent chat &#8211; Context snippets &#8211; Prompt template<\/li>\n\n\n\n<li>Model inference &#8211; transformers.generate &#8211; Decode\/strip echoed prompt &#8211; Return answer<\/li>\n\n\n\n<li>Metrics &#8211; Latency &#8211; Token counts<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading\">LoRA fine-tuning flow (optional)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Start fine-tune &#8211; LoRA config (r, alpha, dropout, targets) &#8211; SFTTrainer + TrainingArguments &#8211; Train adapters &#8211; Save\/Load adapter &#8211; Model ready for chat<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Evaluation flow 
(optional)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Open Evaluate page &#8211; Batch prompts &#8211; Deterministic generate (no sampling) &#8211; Score (Exact Match + BLEU proxy) &#8211; Report results<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading\">Ethics, Privacy &amp; Sustainability<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">When I first built LittleGPT, I thought privacy was something I didn\u2019t need to worry about, since everything ran locally on my laptop. But I realized that good engineering practice is about more than just \u201cdoes it work for me.\u201d If someone else ran this app, they might upload sensitive files without thinking about how embeddings or indices are stored. That made me stop and think about how to design responsibly, even for a private tool.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The RAG system is private by default. Documents are embedded in memory, and if they\u2019re cached, they sit in a local directory that can be cleared with a single command. Still, I started adding reminders in my write-up that people should avoid uploading files with personal data. If someone adapted this for public hosting, those warnings would be even more important.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I also learned that bias doesn\u2019t go away just because the model is small or running offline. WikiText and C4 are public web datasets, and they contain stereotypes and skewed information. Any model trained on them inherits those patterns. Acknowledging that openly makes the project stronger, not weaker.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The last piece was sustainability. Even a small Colab run uses electricity, and running dozens of experiments adds up. I began calculating the energy draw of my runs and realized that smaller models plus retrieval give you a better balance than just scaling up. 
That awareness made me more thoughtful about when to train, when to fine-tune, and when to rely on retrieval instead.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Reproducible Evaluation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">I wanted my evaluation to be something anyone could repeat. That meant fixing random seeds across Python, NumPy, and PyTorch, and also recording the environment details like which Python and PyTorch versions I was using, whether CUDA or MPS was available, and which model checkpoint I had loaded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For fairness, I used deterministic decoding: temperature at zero and top-p at one. That meant no random sampling and completely stable outputs. I also froze my validation set so it stayed constant.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The three metrics I tracked were training loss, validation loss, and perplexity. Perplexity is simply <code>exp(cross_entropy)<\/code>, but it feels more intuitive as a number because it tells you how many equally likely tokens the model is effectively choosing between at each step. A perplexity of 20, for example, means the model is about as uncertain as a uniform pick among 20 tokens.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To really see what worked, I also ran a couple of small ablations. Changing the number of attention heads or extending the context length changed performance in ways that were visible in the metrics. 
Summarizing them in a table gave me a clearer picture than just eyeballing the loss curve.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Example table from my runs:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Config<\/th><th>Params<\/th><th>Train Loss<\/th><th>Val Loss<\/th><th>Perplexity<\/th><th>Notes<\/th><\/tr><\/thead><tbody><tr><td>Base<\/td><td>emb=256, heads=2, layers=4<\/td><td>\u2026<\/td><td>\u2026<\/td><td>\u2026<\/td><td>baseline<\/td><\/tr><tr><td>Heads = 4<\/td><td>emb=256, heads=4, layers=4<\/td><td>\u2026<\/td><td>\u2026<\/td><td>\u2026<\/td><td>more heads<\/td><\/tr><tr><td>Context = 512<\/td><td>emb=256, heads=2, layers=4<\/td><td>\u2026<\/td><td>\u2026<\/td><td>\u2026<\/td><td>longer sequences<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Alongside this, I plotted a simple training vs. validation loss curve. It\u2019s one thing to see numbers in a table, but watching the validation curve flatten out while the training loss continues down is the clearest sign of overfitting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">UX Tightening with CLI and Errors<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Streamlit gave me a nice web interface, but I wanted something more reproducible and script-friendly. So I built a command-line entry point. 
Running <code>python -m littlegpt.cli --help<\/code> shows everything in one place.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Here\u2019s what my <code>--help<\/code> looked like:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>$ python -m littlegpt.cli --help\nusage: littlegpt.cli &#091;-h] --prompt PROMPT &#091;--model M] &#091;--device cpu|mps|cuda]\n                     &#091;--max_new_tokens N] &#091;--temperature T] &#091;--top_p P]\n                     &#091;--rag_index PATH] &#091;--top_k K] &#091;--show_sources]\n                     &#091;--seed 42]\n\nGenerate text locally with LittleGPT.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --prompt PROMPT       Text to continue\n  --model M             HF model id or local path (default: Qwen-0.5B)\n  --device DEV          cpu|mps|cuda (auto-detects)\n  --max_new_tokens N    Max tokens to generate (default: 128)\n  --temperature T       0 = deterministic\n  --top_p P             nucleus sampling (default: 0.9)\n  --rag_index PATH      Optional FAISS index dir\n  --top_k K             Retrieval chunks to include (default: 3)\n  --show_sources        Print sources under the answer\n  --seed S              RNG seed for reproducibility\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I also included clear example commands in my post. 
One shows a fully deterministic run that I used when grading evaluation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python -m littlegpt.cli --prompt \"Explain positional embeddings in two sentences.\" --temperature 0\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And another demonstrates retrieval with citations:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python -m littlegpt.cli --prompt \"Summarize my notes on ecosystems.\" \\\n  --rag_index ~\/.littlegpt\/index --top_k 4 --show_sources\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, I made sure the CLI spoke in plain English when things went wrong. Instead of PyTorch tracebacks, it now says things like \u201cCUDA not found, falling back to CPU\u201d or \u201cOut of memory at shape [B,T,C], try reducing context length.\u201d Those little touches made it feel much more polished.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">How to Run It<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Local quickstart:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python -m venv .venv &amp;&amp; source .venv\/bin\/activate\npip install -r littlegpt\/requirements.txt\nstreamlit run littlegpt\/app.py\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (local run):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Make a Python virtual environment, install requirements, run the Streamlit app.<\/li>\n\n\n\n<li>It\u2019ll launch on localhost:8501.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Run tests:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pytest -q\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (tests):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Run <code>pytest -q<\/code> to quickly check if things are working. If it returns no errors you&#8217;ll be good. 
If something does fail, in my experience it&#8217;s usually a missing package. <code>-q<\/code> runs pytest in quiet mode, which is handy for smoke checks in CI or local validation.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Expose on a remote host during testing:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>streamlit run app.py --server.address 0.0.0.0 --server.port 7860\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (expose app):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Running with <code>--server.address 0.0.0.0<\/code> makes the app visible from outside your machine, as long as the port is open.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Deploying on Hugging Face Spaces (Single Streamlit Space)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Follow the built-in guidance:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>huggingface-cli repo create &lt;org&gt;\/&lt;space-name&gt; --type=space --sdk=streamlit\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Code explanation (HF Spaces):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create a new Space with the Streamlit SDK, then push your copy of my repo (with requirements.txt and app.py).<\/li>\n\n\n\n<li>Spaces automatically runs the app online.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Copy this repository into the Space (or push via Git), ensuring <code>requirements.txt<\/code>, <code>app.py<\/code>, and the <code>pages\/<\/code> directory are included.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">What I Learned Building the App<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">I didn\u2019t expect hardware details like <strong>device choice and precision<\/strong> to matter as much as they did. Early on I thought, \u201cjust run it wherever it works,\u201d but the difference between CPU, MPS, and CUDA was night and day, not just in speed, but in how usable the app felt. 
Even applying quantization, along the lines of what I&#8217;ve seen online (INT8 on CPU, 4-bit on CUDA), was a lesson in raw power vs. speed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I also came away with a new respect for <strong>small models combined with retrieval<\/strong>. A 600M-parameter checkpoint on its own feels underwhelming, but once you give it the right snippets of context, it suddenly becomes sharp, even \u201csmart\u201d in ways that surprised me. That changed my view of model size, and I now feel like this is something I could use daily while maintaining privacy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">On the practical side, I learned how <strong>important caching is for user experience<\/strong>. Streamlit\u2019s <code>@st.cache_resource<\/code> was something I added almost as an afterthought, but without it the app felt sluggish and pretty brittle. With it, everything clicked. Loading a model once and serving it repeatedly made the app feel polished during my testing. I even added a \u201cpre-warm model cache\u201d step that runs as the app boots up, which is inconspicuous to the user but greatly helps with speed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, I found myself obsessing over <strong>the small UX details<\/strong>. A tight prompt template. Stripping out small talk. Keeping responses short and to the point. Showing latency and token counts in the corner.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">(At one point, it gave me responses like &#8220;Hi, I&#8217;m littlegpt. Hi! I&#8217;m littlegpt&#8230;&#8221; and this would go on for hundreds of lines.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Looking back, the lesson I learned was that the real work and learning are in the small details. I could never have done GPT-2 if I hadn&#8217;t perfected micrograd, as tedious as that was. Overall, I learned to really slow down, test every aspect of my learning, and feel confident with it. Only then would I move on. 
<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h5 class=\"wp-block-heading\">Credits<\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li>micrograd and nanoGPT (Andrej Karpathy) for inspiration.<\/li>\n\n\n\n<li>Hugging Face <code>transformers<\/code>, <code>datasets<\/code>, <code>peft<\/code>, <code>trl<\/code>.<\/li>\n\n\n\n<li>sentence-transformers and FAISS for retrieval.<\/li>\n\n\n\n<li>Streamlit for simple deployment.<\/li>\n<\/ul>\n\n\n\n<h5 class=\"wp-block-heading\">demo video <\/h5>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-loom wp-block-embed-loom\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Exploring Little GPT: Applications and Insights\" src=\"https:\/\/www.loom.com\/embed\/357d30b71c0340689527599db7c0e939\" frameborder=\"0\" width=\"500\" height=\"375\" webkitallowfullscreen mozallowfullscreen allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h5 class=\"wp-block-heading\">Code<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/cwklurks\/littlegpt\">https:\/\/github.com\/cwklurks\/littlegpt<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">AI Usage:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/chatgpt.com\/share\/68d34220-e990-8010-8955-0e733c2cbec5\">https:\/\/chatgpt.com\/share\/68d34220-e990-8010-8955-0e733c2cbec5<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/chatgpt.com\/c\/68d1db4d-410c-832e-bbb7-7db7eb4f1ff6\">https:\/\/chatgpt.com\/c\/68d1db4d-410c-832e-bbb7-7db7eb4f1ff6<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/chatgpt.com\/c\/68d16196-4758-832b-bfe1-0e44eacb23a8\">https:\/\/chatgpt.com\/c\/68d16196-4758-832b-bfe1-0e44eacb23a8<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a 
href=\"https:\/\/chatgpt.com\/c\/68d087b9-e030-8320-9627-cbff5e80d6e7\">https:\/\/chatgpt.com\/c\/68d087b9-e030-8320-9627-cbff5e80d6e7<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction When I started this project, I wanted to challenge myself by rebuilding GPT-2 piece by piece, which I&#8217;ve been holding off from doing for a while. My journey began with micrograd, a tiny automatic differentiation engine (which I&#8217;ll explain later). It then escalated into coding a GPT-2 style transformer model, training it on small [&hellip;]<\/p>\n","protected":false},"author":19,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/posts\/1","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/comments?post=1"}],"version-history":[{"count":13,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/posts\/1\/revisions"}],"predecessor-version":[{"id":107,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/posts\/1\/revisions\/107"}],"wp:attachment":[{"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/media?parent=1"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-json\/wp\/v2\/categories?post=1"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.stgeorges.bc.ca\/connork\/wp-
json\/wp\/v2\/tags?post=1"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}