{ "artifact_data": [ { "id": "a964f3ac-e92f-4fcb-847a-a46da3d697d9", "content": { "Title": "Maxime Labonne - Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth", "Subtitle": null, "Content": "Maxime Labonne\n\n * __LLM Course\n * __Hands-On GNNs\n * __Research\n * __About\n\n * __\n * __\n * __\n * \n\n__\n\n 1. \ud83d\udd27 **LLM Post-training**\n 2. Fine-tune Llama 3.1 8B\n\n 1. \ud83d\udd27 **LLM Post-training**\n 2. Fine-tune Llama 3.1 8B\n\n# Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth\n\nA beginner\u2019s guide to state-of-the-art supervised fine-tuning\n\nLarge Language Models\n\nAuthor\n\nMaxime Lbonne\n\nPublished\n\nJuly 29, 2024\n\n * \ud83d\udd27 **LLM Post-training** __\n\n * Fine-tune Llama 2 in Colab\n\n * Fine-tune Llama 2 in Axolotl\n\n * Fine-tune Mistral-7b with DPO\n\n * Fine-tune Llama 3 with ORPO\n\n * Fine-tune Llama 3.1 8B\n\n * Merge LLMs with mergekit\n\n * Create Mixture of Experts\n\n * Uncensor any LLM\n\n * * * *\n\n * \u26a1 **LLM Quantization** __\n\n * Intro to Quantization\n\n * Quantization with GPTQ\n\n * Quantization with GGML\n\n * Quantization with ExLlamaV2\n\n * * * *\n\n * \ud83d\udde3\ufe0f **LLM stuff** __\n\n * ChatGPT + KG\n\n * Decoding Strategies\n\n * Agentic data generation\n\n * * * *\n\n * \ud83c\udf10 **Graph neural networks** __\n\n * Graph Convolution Network\n\n * Graph Attention Network\n\n * GraphSAGE\n\n * Graph Isomorphism Network\n\n * * * *\n\n * \ud83e\udd47 **Linear programming** __\n\n * Linear Programming\n\n * Integer Programming\n\n * Constraint Programming\n\n * Nonlinear Programming\n\n * * * *\n\n * \ud83c\udf00 **Miscellaneous** __\n\n * Q-learning\n\n * Minecraft Bot\n\n * Loops in Pandas\n\n * What is a Tensor\n\n## **Sections**\n\n * \ud83d\udd27 Supervised Fine-Tuning\n * \u2696\ufe0f SFT Techniques\n * \ud83e\udd99 Fine-Tune Llama 3.1 8B\n * Conclusion\n\nPre-order the **LLM Engineer\u2019s Handbook**, my new book to master the art of\nLLMs from concept to production\ud83d\udc47\n\nThe recent release of Llama 3.1 offers models with an incredible level of\nperformance, closing the gap between closed-source and open-weight models.\nInstead of using frozen, general-purpose LLMs like GPT-4o and Claude 3.5, you\ncan fine-tune Llama 3.1 for your specific use cases to achieve better\nperformance and customizability at a lower cost.\n\nIn this article, we will provide a comprehensive overview of supervised fine-\ntuning. We will compare it to prompt engineering to understand when it makes\nsense to use it, detail the main techniques with their pros and cons, and\nintroduce major concepts, such as LoRA hyperparameters, storage formats, and\nchat templates. Finally, we will implement it in practice by fine-tuning Llama\n3.1 8B in Google Colab with state-of-the-art optimization using Unsloth.\n\nAll the code used in this article is available on Google Colab and in the LLM\nCourse. Special thanks to Daniel Han for answering my questions.\n\n## \ud83d\udd27 Supervised Fine-Tuning\n\nSupervised Fine-Tuning (SFT) is a method to **improve and customize** pre-\ntrained LLMs. It involves retraining base models on a smaller dataset of\ninstructions and answers. The main goal is to transform a basic model that\npredicts text into an assistant that can follow instructions and answer\nquestions. SFT can also enhance the model\u2019s overall performance, add new\nknowledge, or adapt it to specific tasks and domains. 
Fine-tuned models can\nthen go through an optional preference alignment stage (see my article about\nDPO) to remove unwanted responses, modify their style, and more.\n\nThe following figure shows an instruction sample. It includes a system prompt\nto steer the model, a user prompt to provide a task, and the output the model\nis expected to generate. You can find a list of high-quality open-source\ninstruction datasets in the \ud83d\udcbe LLM Datasets GitHub repo.\n\nBefore considering SFT, I recommend trying prompt engineering techniques like\n**few-shot prompting** or **retrieval augmented generation** (RAG). In\npractice, these methods can solve many problems without the need for fine-\ntuning, using either closed-source or open-weight models (e.g., Llama 3.1\nInstruct). If this approach doesn\u2019t meet your objectives (in terms of quality,\ncost, latency, etc.), then SFT becomes a viable option when instruction data\nis available. Note that SFT also offers benefits like additional control and\ncustomizability to create personalized LLMs.\n\nHowever, SFT has limitations. It works best when leveraging knowledge already\npresent in the base model. Learning completely new information like an unknown\nlanguage can be challenging and lead to more frequent hallucinations. For new\ndomains unknown to the base model, it is recommended to continuously pre-train\nit on a raw dataset first.\n\nOn the opposite end of the spectrum, instruct models (i.e., already fine-tuned\nmodels) can already be very close to your requirements. For example, a model\nmight perform very well but state that it was trained by OpenAI or Meta\ninstead of you. In this case, you might want to slightly steer the instruct\nmodel\u2019s behavior using preference alignment. By providing chosen and rejected\nsamples for a small set of instructions (between 100 and 1000 samples), you\ncan force the LLM to say that you trained it instead of OpenAI.\n\n## \u2696\ufe0f SFT Techniques\n\nThe three most popular SFT techniques are full fine-tuning, LoRA, and QLoRA.\n\n**Full fine-tuning** is the most straightforward SFT technique. It involves\nretraining all parameters of a pre-trained model on an instruction dataset.\nThis method often provides the best results but requires significant\ncomputational resources (several high-end GPUs are required to fine-tune a 8B\nmodel). Because it modifies the entire model, it is also the most destructive\nmethod and can lead to the catastrophic forgetting of previous skills and\nknowledge.\n\n**Low-Rank Adaptation (LoRA)** is a popular parameter-efficient fine-tuning\ntechnique. Instead of retraining the entire model, it freezes the weights and\nintroduces small adapters (low-rank matrices) at each targeted layer. This\nallows LoRA to train a number of parameters that is drastically lower than\nfull fine-tuning (less than 1%), reducing both memory usage and training time.\nThis method is non-destructive since the original parameters are frozen, and\nadapters can then be switched or combined at will.\n\n**QLoRA (Quantization-aware Low-Rank Adaptation)** is an extension of LoRA\nthat offers even greater memory savings. It provides up to 33% additional\nmemory reduction compared to standard LoRA, making it particularly useful when\nGPU memory is constrained. 
This increased efficiency comes at the cost of\nlonger training times, with QLoRA typically taking about 39% more time to\ntrain than regular LoRA.\n\nWhile QLoRA requires more training time, its substantial memory savings can\nmake it the only viable option in scenarios where GPU memory is limited. For\nthis reason, this is the technique we will use in the next section to fine-\ntune a Llama 3.1 8B model on Google Colab.\n\n## \ud83e\udd99 Fine-Tune Llama 3.1 8B\n\nTo efficiently fine-tune a Llama 3.1 8B model, we\u2019ll use the Unsloth library\nby Daniel and Michael Han. Thanks to its custom kernels, Unsloth provides 2x\nfaster training and 60% memory use compared to other options, making it ideal\nin a constrained environment like Colab. Unfortunately, Unsloth only supports\nsingle-GPU settings at the moment. For multi-GPU settings, I recommend popular\nalternatives like TRL and Axolotl (both also include Unsloth as a backend).\n\nIn this example, we will QLoRA fine-tune it on the mlabonne/FineTome-100k\ndataset. It\u2019s a subset of arcee-ai/The-Tome (without arcee-\nai/qwen2-72b-magpie-en) that I re-filtered using HuggingFaceFW/fineweb-edu-\nclassifier. Note that this classifier wasn\u2019t designed for instruction data\nquality evaluation, but we can use it as a rough proxy. The resulting FineTome\nis an ultra-high quality dataset that includes conversations, reasoning\nproblems, function calling, and more.\n\nLet\u2019s start by installing all the required libraries.\n\n \n \n !pip install \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n !pip install --no-deps \"xformers<0.0.27\" \"trl<0.9.0\" peft accelerate bitsandbytes __\n\nOnce installed, we can import them as follows.\n\n \n \n import torch\n from trl import SFTTrainer\n from datasets import load_dataset\n from transformers import TrainingArguments, TextStreamer\n from unsloth.chat_templates import get_chat_template\n from unsloth import FastLanguageModel, is_bfloat16_supported __\n\nLet\u2019s now load the model. Since we want to use QLoRA, I chose the pre-\nquantized unsloth/Meta-Llama-3.1-8B-bnb-4bit. This 4-bit precision version of\nmeta-llama/Meta-Llama-3.1-8B is significantly smaller (5.4 GB) and faster to\ndownload compared to the original 16-bit precision model (16 GB). We load in\nNF4 format using the bitsandbytes library.\n\nWhen loading the model, we must specify a maximum sequence length, which\nrestricts its context window. Llama 3.1 supports up to 128k context length,\nbut we will set it to 2,048 in this example since it consumes more compute and\nVRAM. Finally, the `dtype` parameter automatically detects if your GPU\nsupports the BF16 format for more stability during training (this feature is\nrestricted to Ampere and more recent GPUs).\n\n \n \n max_seq_length = 2048\n model, tokenizer = FastLanguageModel.from_pretrained(\n model_name=\"unsloth/Meta-Llama-3.1-8B-bnb-4bit\",\n max_seq_length=max_seq_length,\n load_in_4bit=True,\n dtype=None,\n )__\n\nNow that our model is loaded in 4-bit precision, we want to prepare it for\nparameter-efficient fine-tuning with LoRA adapters. LoRA has three important\nparameters:\n\n * **Rank** (r), which determines LoRA matrix size. Rank typically starts at 8 but can go up to 256. Higher ranks can store more information but increase the computational and memory cost of LoRA. We set it to 16 here.\n * **Alpha** (\u03b1), a scaling factor for updates. 
Alpha directly impacts the adapters\u2019 contribution and is often set to 1x or 2x the rank value.\n * **Target modules** : LoRA can be applied to various model components, including attention mechanisms (Q, K, V matrices), output projections, feed-forward blocks, and linear output layers. While initially focused on attention mechanisms, extending LoRA to other components has shown benefits. However, adapting more modules increases the number of trainable parameters and memory needs.\n\nHere, we set r=16, \u03b1=16, and target every linear module to maximize quality.\nWe don\u2019t use dropout and biases for faster training.\n\nIn addition, we will use Rank-Stabilized LoRA (rsLoRA), which modifies the\nscaling factor of LoRA adapters to be proportional to 1/\u221ar instead of 1/r.\nThis stabilizes learning (especially for higher adapter ranks) and allows for\nimproved fine-tuning performance as rank increases. Gradient checkpointing is\nhandled by Unsloth to offload input and output embeddings to disk and save\nVRAM.\n\n \n \n model = FastLanguageModel.get_peft_model(\n model,\n r=16,\n lora_alpha=16,\n lora_dropout=0,\n target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"up_proj\", \"down_proj\", \"o_proj\", \"gate_proj\"], \n use_rslora=True,\n use_gradient_checkpointing=\"unsloth\"\n )__\n\nWith this LoRA configuration, we\u2019ll only train 42 million out of 8 billion\nparameters (0.5196%). This shows how much more efficient LoRA is compared to\nfull fine-tuning.\n\nLet\u2019s now load and prepare our dataset. Instruction datasets are stored in a\n**particular format** : it can be Alpaca, ShareGPT, OpenAI, etc. First, we\nwant to parse this format to retrieve our instructions and answers. Our\nmlabonne/FineTome-100k dataset uses the ShareGPT format with a unique\n\u201cconversations\u201d column containing messages in JSONL. Unlike simpler formats\nlike Alpaca, ShareGPT is ideal for storing multi-turn conversations, which is\ncloser to how users interact with LLMs.\n\nOnce our instruction-answer pairs are parsed, we want to reformat them to\nfollow a **chat template**. Chat templates are a way to structure\nconversations between users and models. They typically include special tokens\nto identify the beginning and the end of a message, who\u2019s speaking, etc. Base\nmodels don\u2019t have chat templates so we can choose any: ChatML, Llama3,\nMistral, etc. In the open-source community, the ChatML template (originally\nfrom OpenAI) is a popular option. It simply adds two special tokens\n(`<|im_start|>` and `<|im_end|>`) to indicate who\u2019s speaking.\n\nIf we apply this template to the previous instruction sample, here\u2019s what we\nget:\n\n \n \n <|im_start|>system\n You are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>\n <|im_start|>user\n Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on theirs device.\n <|im_end|>\n <|im_start|>assistant\n Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice.<|im_end|>\n\nIn the following code block, we parse our ShareGPT dataset with the `mapping`\nparameter and include the ChatML template. 
We then load and process the entire\ndataset to apply the chat template to every conversation.\n\n \n \n tokenizer = get_chat_template(\n tokenizer,\n mapping={\"role\": \"from\", \"content\": \"value\", \"user\": \"human\", \"assistant\": \"gpt\"},\n chat_template=\"chatml\",\n )\n \n def apply_template(examples):\n messages = examples[\"conversations\"]\n text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]\n return {\"text\": text}\n \n dataset = load_dataset(\"mlabonne/FineTome-100k\", split=\"train\")\n dataset = dataset.map(apply_template, batched=True)__\n\nWe\u2019re now ready to specify the training parameters for our run. I want to\nbriefly introduce the most important hyperparameters:\n\n * **Learning rate** : It controls how strongly the model updates its parameters. Too low, and training will be slow and may get stuck in local minima. Too high, and training may become unstable or diverge, which degrades performance.\n * **LR scheduler** : It adjusts the learning rate (LR) during training, starting with a higher LR for rapid initial progress and then decreasing it in later stages. Linear and cosine schedulers are the two most common options.\n * **Batch size** : Number of samples processed before the weights are updated. Larger batch sizes generally lead to more stable gradient estimates and can improve training speed, but they also require more memory. Gradient accumulation allows for effectively larger batch sizes by accumulating gradients over multiple forward/backward passes before updating the model.\n * **Num epochs** : The number of complete passes through the training dataset. More epochs allow the model to see the data more times, potentially leading to better performance. However, too many epochs can cause overfitting.\n * **Optimizer** : Algorithm used to adjust the parameters of a model to minimize the loss function. In practice, AdamW 8-bit is strongly recommended: it performs as well as the 32-bit version while using less GPU memory. The paged version of AdamW is only interesting in distributed settings.\n * **Weight decay** : A regularization technique that adds a penalty for large weights to the loss function. It helps prevent overfitting by encouraging the model to learn simpler, more generalizable features. However, too much weight decay can impede learning.\n * **Warmup steps** : A period at the beginning of training where the learning rate is gradually increased from a small value to the initial learning rate. Warmup can help stabilize early training, especially with large learning rates or batch sizes, by allowing the model to adjust to the data distribution before making large updates.\n * **Packing** : Batches have a pre-defined sequence length. Instead of assigning one batch per sample, we can combine multiple small samples in one batch, increasing efficiency.\n\nI trained the model on the entire dataset (100k samples) using an A100 GPU (40\nGB of VRAM) on Google Colab. The training took 4 hours and 45 minutes. Of\ncourse, you can use smaller GPUs with less VRAM and a smaller batch size, but\nthey\u2019re not nearly as fast. For example, it takes roughly 19 hours and 40\nminutes on an L4 and a whopping 47 hours on a free T4.\n\nIn this case, I recommend only loading a subset of the dataset to speed up\ntraining. You can do it by modifying the previous code block, like `dataset =\nload_dataset(\"mlabonne/FineTome-100k\", split=\"train[:10000]\")` to only load\n10k samples. 
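If you prefer a random subset over the first 10k rows, shuffling before selecting also works. Here is a small sketch assuming the Hugging Face `datasets` API:

    from datasets import load_dataset

    dataset = load_dataset("mlabonne/FineTome-100k", split="train")
    dataset = dataset.shuffle(seed=42).select(range(10_000))  # keep a random 10k-sample subset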
Alternatively, you can use cheaper cloud GPU providers like\nPaperspace, RunPod, or Lambda Labs.\n\n \n \n trainer=SFTTrainer(\n model=model,\n tokenizer=tokenizer,\n train_dataset=dataset,\n dataset_text_field=\"text\",\n max_seq_length=max_seq_length,\n dataset_num_proc=2,\n packing=True,\n args=TrainingArguments(\n learning_rate=3e-4,\n lr_scheduler_type=\"linear\",\n per_device_train_batch_size=8,\n gradient_accumulation_steps=2,\n num_train_epochs=1,\n fp16=not is_bfloat16_supported(),\n bf16=is_bfloat16_supported(),\n logging_steps=1,\n optim=\"adamw_8bit\",\n weight_decay=0.01,\n warmup_steps=10,\n output_dir=\"output\",\n seed=0,\n ),\n )\n \n trainer.train()__\n\nNow that the model is trained, let\u2019s test it with a simple prompt. This is not\na rigorous evaluation but just a quick check to detect potential issues. We\nuse `FastLanguageModel.for_inference()` to get 2x faster inference.\n\n \n \n model = FastLanguageModel.for_inference(model)\n \n messages = [\n {\"from\": \"human\", \"value\": \"Is 9.11 larger than 9.9?\"},\n ]\n inputs = tokenizer.apply_chat_template(\n messages,\n tokenize=True,\n add_generation_prompt=True,\n return_tensors=\"pt\",\n ).to(\"cuda\")\n \n text_streamer = TextStreamer(tokenizer)\n _ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)__\n\nThe model\u2019s response is \u201c9.9\u201d, which is correct!\n\nLet\u2019s now save our trained model. If you remember the part about LoRA and\nQLoRA, what we trained is not the model itself but a set of adapters. There\nare three save methods in Unsloth: `lora` to only save the adapters, and\n`merged_16bit`/`merged_4bit` to merge the adapters with the model in 16-bit/\n4-bit precision.\n\nIn the following, we merge them in 16-bit precision to maximize the quality.\nWe first save it locally in the \u201cmodel\u201d directory and then upload it to the\nHugging Face Hub. You can find the trained model on mlabonne/FineLlama-3.1-8B.\n\n \n \n model.save_pretrained_merged(\"model\", tokenizer, save_method=\"merged_16bit\")\n model.push_to_hub_merged(\"mlabonne/FineLlama-3.1-8B\", tokenizer, save_method=\"merged_16bit\")__\n\nUnsloth also allows you to directly convert your model into GGUF format. This\nis a quantization format created for llama.cpp and compatible with most\ninference engines, like LM Studio, Ollama, and oobabooga\u2019s text-generation-\nwebui. Since you can specify different precisions (see my article about GGUF\nand llama.cpp), we\u2019ll loop over a list to quantize it in `q2_k`, `q3_k_m`,\n`q4_k_m`, `q5_k_m`, `q6_k`, `q8_0` and upload these quants on Hugging Face.\nThe mlabonne/FineLlama-3.1-8B-GGUF contains all our GGUFs.\n\n \n \n quant_methods = [\"q2_k\", \"q3_k_m\", \"q4_k_m\", \"q5_k_m\", \"q6_k\", \"q8_0\"]\n for quant in quant_methods:\n model.push_to_hub_gguf(\"mlabonne/FineLlama-3.1-8B-GGUF\", tokenizer, quant)__\n\nCongratulations, we fine-tuned a model from scratch and uploaded quants you\ncan now use in your favorite inference engine. Feel free to try the final\nmodel available on mlabonne/FineLlama-3.1-8B-GGUF. What to do now? 
Here are\nsome ideas on how to use your model:\n\n * **Evaluate** it on the Open LLM Leaderboard (you can submit it for free) or using other evals like in LLM AutoEval.\n * **Align** it with Direct Preference Optimization using a preference dataset like mlabonne/orpo-dpo-mix-40k to boost performance.\n * **Quantize** it in other formats like EXL2, AWQ, GPTQ, or HQQ for faster inference or lower precision using AutoQuant.\n * **Deploy** it on a Hugging Face Space with ZeroChat for models that have been sufficiently trained to follow a chat template (~20k samples).\n\n## Conclusion\n\nThis article provided a comprehensive overview of supervised fine-tuning and\nhow to apply it in practice to a Llama 3.1 8B model. By leveraging QLoRA\u2019s\nefficient memory usage, we managed to fine-tune an 8B LLM on a super high-\nquality dataset with limited GPU resources. We also provided more efficient\nalternatives for bigger runs and suggestions for further steps, including\nevaluation, preference alignment, quantization, and deployment.\n\nI hope this guide was useful. If you\u2019re interested in learning more about\nLLMs, I recommend checking the LLM Course. If you enjoyed this article, follow\nme on X @maximelabonne and on Hugging Face @mlabonne. Good luck fine-tuning\nmodels!\n\n", "language": "en" }, "platform": "mlabonne.github.io", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html" }, { "id": "4c510a29-a59a-4e15-874e-a5bd836a17de", "content": { "Title": "Maxime Labonne - The Rise of Agentic Data Generation", "Subtitle": null, "Content": "# The Rise of Agentic Data Generation\n\nCombining AgentInstruct and Arena Learning\n\nLarge Language Models\n\nAuthor\n\nMaxime Labonne\n\nPublished\n\nJuly 15, 2024\n\n## **Sections**\n\n * \ud83e\udd16 AgentInstruct: A Multi-Agent Approach\n * \u2694\ufe0f Arena Learning: A Competitive Refinement Approach\n * \ud83e\ude84 ArenaInstruct: Combining AgentInstruct and Arena Learning\n * Conclusion\n\nWith the consolidation of LLM architectures, the quality of training data has\nbecome the most important factor in creating state-of-the-art models. This is\ntrue for both pre-training and post-training, where instruction datasets have\na major impact on the final model. Two innovative approaches have recently\nemerged to address the challenge of generating high-quality instruction\ndatasets for post-training LLMs: AgentInstruct and Arena Learning. Both\nframeworks come from Microsoft Research and leverage multiple LLMs to create\nand refine samples.\n\nIn this article, I want to explore both methods, analyze their similarities\nand differences, and see how we could combine them in a single end-to-end\nframework.\n\n## \ud83e\udd16 AgentInstruct: A Multi-Agent Approach\n\nAgentInstruct is an agentic framework by Mitra et al. (2024), designed to\ngenerate large-scale, diverse, and high-quality synthetic data. The framework\nuses a sophisticated pipeline that transforms raw text into refined\ninstructions through multiple stages of processing. In the paper, the agents\nseem to be based on GPT-4, which is also used to evaluate data quality and\nhallucinations in some contexts.\n\n_Figure from the AgentInstruct paper._\n\nThe AgentInstruct pipeline consists of four main steps:\n\n * **Seed Collection** : Assemble a diverse collection of raw seeds, such as textbook chapters, web articles, and code snippets. These seeds serve as the foundation for generating new instructions.\n * **Content Transformation** : One or more specialized agents modify each seed into an intermediate representation that simplifies instruction creation. 
These agents are designed to perform tasks like generating argument passages, debates, conversations, meeting transcripts, poems, satirical content, etc.\n * **Seed Instruction Generation** : Multiple agents take the transformed seed and generate diverse instructions based on a pre-defined taxonomy of instruction types. For example, in the domain of reading comprehension, the taxonomy includes 43 question types, ranging from literal comprehension to critical analysis and inference.\n * **Instruction Refinement** : The final stage involves iteratively enhancing the complexity and quality of the generated instructions. This is achieved through suggester-editor agent pairs. Suggester agents propose ways to increase instruction complexity, while editor agents modify the instructions accordingly.\n\nTo get a better idea of what each stage produces, I recommend reading the\nexamples provided in the paper.\n\nEach flow in the AgentInstruct pipeline consists of multiple agents powered by\nLLMs. These agents can be equipped with tools like search APIs or code\ninterpreters to enhance their capabilities. The roles of these agents are\ncarefully defined in their system messages to ensure they perform their\nspecific tasks effectively.\n\nThe authors of AgentInstruct implemented flows for 17 different skills, each\nwith multiple subcategories. These skills cover a wide range of areas,\nincluding reading comprehension, question answering, coding, retrieval\naugmented generation, creative writing, tool use, and web control.\n\nUsing this comprehensive pipeline, the researchers generated approximately 22\nmillion instructions. They combined this synthetic data with 3.8 million\ninstructions from other sources to create a dataset of 25.8 million paired\ninstructions. This dataset was then used to fine-tune the Mistral-7b model,\nresulting in the creation of the Orca-3 model.\n\n## \u2694\ufe0f Arena Learning: A Competitive Refinement Approach\n\nArena Learning by Luo, Suo, et al. (2024) takes a different approach to\ngenerating high-quality instruction data. Instead of creating instructions\nfrom scratch, it focuses on refining existing instruction datasets through a\nsimulated competitive environment. It is not an agentic framework because\ntools are not provided to the models, but could easily be transformed into\none.\n\n_Figure from the Arena Learning paper._\n\nThe key components of the Arena Learning pipeline are:\n\n * **Offline Pairwise LLM Arena** : Arena Learning creates a simulated arena where multiple LLMs compete against each other on a large set of instruction data. A judge LLM (meta-llama/Meta-Llama-3-70B-Instruct) evaluates the responses from competing models for each instruction, providing rankings, scores, and explanations. This process effectively simulates human evaluation but at a much larger scale and lower cost.\n\n * **Data Collection and Preprocessing** : The framework starts with a large corpus of conversational data collected from various open sources. This data goes through filtering, cleaning, and deduplication. Instructions that are too short, illegal/toxic, or too similar to benchmark test sets are removed. The refined dataset is then split into multiple parts for iterative training.\n\n * **Iterative Battle and Model Evolution** : The process involves multiple rounds of battles and training:\n\n 1. An initial model (WizardLM-\u03b2-SFT-I0) is trained on a subset of data.\n 2. This model competes against other state-of-the-art LLMs on another data subset.\n 3. 
Instances where WizardLM-\u03b2 loses are collected, with the winning model\u2019s response used as the target for fine-tuning.\n 4. The process repeats for multiple iterations, with each iteration potentially using different training strategies (SFT, DPO, PPO).\n * **Training Strategies** : Arena Learning employs multiple training strategies to improve the model:\n\n * _Supervised Fine-Tuning (SFT)_ : Uses battle results to fine-tune the model on instances where it performed poorly.\n * _Direct Preference Optimization (DPO)_ : Treats win/loss responses as choice/reject pairs for training.\n * _Proximal Policy Optimization (PPO)_ : Uses battle results to train both a reward model and the language model.\n * **WizardArena Evaluation** : The authors create an offline test set (WizardArena) with diverse and hard subsets. This is used to evaluate models through pairwise battles, with results used to compute Elo rankings. The evaluation closely aligns with human-based arenas but is much faster and cheaper.\n\n * **Data Selection** : The pipeline uses various strategies to select high-quality training data, such as threshold-based filtering to control data size and quality, focusing on instances where the model underperforms, and gradually shifting towards more complex data in later iterations.\n\n_Figure from the Arena Learning paper._\n\nThis framework allows for multiple iterations of battles and training, as\nillustrated with WizardLM-\u03b2. The model\u2019s capabilities are progressively\nstrengthened, particularly in complex tasks. The process results in\nsignificant gains in Elo rankings, MT-bench scores, and other evaluation\nmetrics.\n\nArena Learning focuses on improving areas where the model under training is\ncurrently lacking. A nice feature is that it doesn\u2019t require particularly\npowerful models like Claude 3.5 Sonnet or GPT-4o. Models with a similar level\ncan be better in some tasks and domains, as well as more suited to answer\ncertain prompt syntaxes. It means that the entire pipeline can be deployed\nusing open-weight models, which is a big advantage if you already have a high-\nquality infrastructure.\n\n## \ud83e\ude84 ArenaInstruct: Combining AgentInstruct and Arena Learning\n\nWhile both AgentInstruct and Arena Learning aim to generate high-quality data\nfor post-training language models, they take fundamentally different\napproaches to achieve this goal. Understanding how they differ, as well as\ntheir strengths and weaknesses is a good first step to see how we could\ncombine them. I selected four points I want to focus on:\n\n * **Data Generation** : AgentInstruct starts from raw text, generating instructions from scratch through a multi-stage pipeline. This allows for the creation of entirely new content, potentially leading to greater diversity and novelty in the generated instructions. On the other hand, Arena Learning refines existing instruction datasets through simulated battles between models. This method leverages the quality of existing datasets while improving upon them through competitive evaluation.\n\n * **Data Quality** : AgentInstruct relies on suggester-editor agent pairs for iterative refinement of instructions. This approach allows for fine-grained control over the complexity and quality of generated instructions. Arena Learning, in contrast, uses an LLM-as-a-judge to evaluate responses in simulated battles. 
It means that the entire data quality process is handled by a single model.\n\n * **Diversity and Complexity** : AgentInstruct explicitly (i.e., manually) designs for diversity through a taxonomy of instruction types and multiple transformation agents. This structured approach ensures coverage across a wide range of skills and instruction types. Arena Learning\u2019s diversity comes from the variety of competing models and initial instruction datasets. While this may lead to less structured diversity, it could potentially capture more natural variations in instruction styles.\n\n * **Flexibility** : AgentInstruct\u2019s pipeline allows for easy addition of new seed types and instruction categories, making it highly adaptable to new domains and tasks. Arena Learning\u2019s iterative battle process enables continuous improvement of the target model, potentially allowing it to adapt more quickly to new challenges and competing models.\n\nBased on this comparison, it\u2019s not too difficult to see how we can leverage\nthe advantages of each framework. For instance, a taxonomy-based data\ngeneration is more steerable and could be improved upon by arena learning. But\nwe could also use feedback signals to improve this first step over multiple\niterations.\n\nHere\u2019s how such a hybrid approach might work:\n\n 1. **AgentInstruct Instruction Generation** : Use AgentInstruct to create a broad and diverse base of instructions (no answers!) from raw text. This would ensure wide coverage of tasks and domains that are relevant for our use cases.\n 2. **Arena Learning Answer Generation** : Apply Arena Learning\u2019s competitive battle approach to refine and select the highest quality answers from a pool of models. This would combine AgentInstruct\u2019s ability to generate novel content with Arena Learning\u2019s robust quality control mechanism.\n 3. **Data Quality Evaluation** : Instead of relying on a single LLM-as-a-judge, we can use reward models or an LLM-as-a-jury to improve the data selection process.\n 4. **Diversity Feedback** : Use insights from Arena Learning battles to dynamically update AgentInstruct\u2019s instruction taxonomy. This would focus the generation process on producing more of the instruction types that prove most challenging or useful in real-world scenarios.\n 5. **Complexity Feedback** : Leverage Arena Learning\u2019s performance metrics to identify areas where instructions are too easy or too difficult. Use this information to guide AgentInstruct\u2019s complexity refinement process, ensuring a well-balanced dataset that challenges the model appropriately over several iterations.\n\nBy combining these approaches, we can create a powerful feedback loop between\ninstruction generation and evaluation. This hybrid framework would benefit\nfrom AgentInstruct\u2019s ability to generate novel, diverse content and Arena\nLearning\u2019s competitive quality control and model improvement process. The\nresult would be a more robust, effective, and continuously improving post-\ntraining dataset for LLMs.\n\n## Conclusion\n\nIn conclusion, this article explored two recent approaches in synthetic data\ngeneration: AgentInstruct and Arena Learning. We proposed a hybrid solution\nthat combines AgentInstruct\u2019s structured, taxonomy-based methodology with\nArena Learning\u2019s iterative refinement using multiple LLMs. 
This combination\nleverages the strengths of both frameworks, allowing for a systematic\ngeneration of diverse data while enabling continuous improvement of the\nunderlying taxonomy through feedback from the LLM pool. I feel like we might\nlose some quality by removing the suggester-editor agent pairs. Let me know if\nyou have better ideas.\n\nStill, data quality evaluation is a significant challenge to perfect this\napproach. The current reliance on models like GPT-4 or Llama 3 70B Instruct as\njudges is imperfect and has known limitations (see my quick review here).\nImproving the quality assessment stage could lead to more efficient datasets,\nachieving better performance with fewer samples. To know more about how to\ncreate high-quality datasets, check out my GitHub repo \ud83d\udcbe LLM Datasets.\n\n", "language": "en" }, "platform": "mlabonne.github.io", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://mlabonne.github.io/blog/posts/2024-07-15_The_Rise_of_Agentic_Data_Generation.html" }, { "id": "5a56c009-565d-4dc4-9bd5-d2b1be2ca2d4", "content": { "Title": "Uncensor any LLM with abliteration - Maxime Labonne", "Subtitle": "Fine-tuning without retraining", "Content": "# Uncensor any LLM with abliteration\n\n### Fine-tuning without retraining\n\nMaxime Labonne\n\nJun 12, 2024\n\nImage generated with DALL-E 3 by author\n\nThe third generation of Llama models provided fine-tuned (Instruct) versions\nthat excel in understanding and following instructions. However, these models\nare heavily censored, designed to refuse requests seen as harmful with\nresponses such as \u201cAs an AI assistant, I cannot help you.\u201d While this safety\nfeature is crucial for preventing misuse, it limits the model\u2019s flexibility\nand responsiveness.\n\nIn this article, we will explore a technique called \u201cabliteration\u201d that can\nuncensor any LLM without retraining. This technique effectively removes the\nmodel\u2019s built-in refusal mechanism, allowing it to respond to all types of\nprompts.\n\nThe code is available on Google Colab and in the LLM Course on GitHub. Special\nthanks to FailSpy for proofreading this article.\n\n### \u2702\ufe0f What is abliteration?\n\nModern LLMs are fine-tuned for safety and instruction-following, meaning they\nare trained to refuse harmful requests. In their blog post, Arditi et al. have\nshown that this refusal behavior is mediated by a specific direction in the\nmodel\u2019s residual stream. If we prevent the model from representing this\ndirection, it **loses its ability to refuse requests**. Conversely, adding\nthis direction artificially can cause the model to refuse even harmless\nrequests.\n\nIn the traditional decoder-only Llama-like architecture, there are three\nresidual streams we can target: at the start of each block (\u201cpre\u201d), between\nthe attention and MLP layers (\u201cmid\u201d), and after the MLP (\u201cpost\u201d). 
The\nfollowing figure illustrates the location of each residual stream.\n\nImage by author\n\nTo uncensor an LLM, we first need to identify the \u201crefusal direction\u201d within\nthe model. This process involves a few technical steps:\n\n 1. **Data Collection** : Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.\n\n 2. **Mean difference** : Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the \u201crefusal direction\u201d for each layer of the model.\n\n 3. **Selection** : Normalize these vectors and evaluate them to select the single best \u201crefusal direction.\u201d\n\nOnce we have identified the refusal direction, we can \u201cablate\u201d it, effectively\nremoving the model\u2019s ability to represent this feature. This can be done\nthrough an **inference-time intervention** or permanently with **weight\northogonalization**.\n\nLet\u2019s talk about inference-time intervention first. For every component that\nwrites to the residual stream (such as an attention head), we calculate the\nprojection of its output onto the refusal direction and subtract this\nprojection. This subtraction is applied at every token and every layer,\nensuring that the model never represents the refusal direction.\n\nOn the other hand, weight orthogonalization involves modifying the model\nweights directly. By orthogonalizing the component weights with respect to the\nrefusal direction, it prevents the model from writing to this direction\naltogether. This is achieved by adjusting the matrices that write to the\nresidual stream, ensuring they do not contribute to the refusal direction.\n\nIn the next section, we will implement abliteration with weight\northogonalization.\n\n### \ud83d\udcbb Implementation\n\nThe following implementation of abliteration is based on FailSpy\u2019s notebook,\nwhich is itself based on the original authors\u2019 notebook. I mostly adapted and\nsimplified it to make it easier to understand. This section is quite code-\nheavy so you can see what is going on, but you can use FailSpy\u2019s abliterator\nlibrary if you\u2019re less interested in the technical details (also check his\ncollection of abliterated models on Hugging Face).\n\nThe code relies on the excellent TransformerLens library (formerly known as\nEasyTransformer) to do the heavy lifting. It is designed for mechanistic\ninterpretability and is used here to intervene on activations. Thanks to Neel\nNanda and Joseph Bloom for creating and maintaining this library.\n\nFirst, let\u2019s install the necessary packages and import them. All these steps\nare available in this Google Colab notebook.\n\n \n \n !pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping\n \n import torch\n import functools\n import einops\n import gc\n \n from datasets import load_dataset\n from tqdm import tqdm\n from torch import Tensor\n from typing import List\n from transformer_lens import HookedTransformer, utils\n from transformer_lens.hook_points import HookPoint\n from transformers import AutoModelForCausalLM, AutoTokenizer\n from jaxtyping import Float, Int\n from collections import defaultdict\n \n # Turn automatic differentiation off to save GPU memory (credit: Undi95)\n torch.set_grad_enabled(False)\n\nWe need two datasets: one containing harmless instructions, and one containing\nharmful instructions. 
We\u2019ll use tatsu-lab/alpaca as well as data from llm-\nattacks. To make things easier, I repackaged them in two Hugging Face\ndatasets: mlabonne/harmless_behaviors and mlabonne/harmful_behaviors. That\nway, you can easily replace them with your own datasets.\n\nWe will load the instructions and reformat them into a list of dictionaries\nwith \u201crole\u201d and \u201ccontent\u201d keys. This makes it compatible with the\n`apply_chat_tokenizer()` method, which we will use to follow Llama 3's chat\ntemplate.\n\n \n \n def reformat_texts(texts):\n return [[{\"role\": \"user\", \"content\": text}] for text in texts]\n \n # Get harmful and harmless datasets\n def get_harmful_instructions():\n dataset = load_dataset('mlabonne/harmful_behaviors')\n return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])\n \n def get_harmless_instructions():\n dataset = load_dataset('mlabonne/harmless_alpaca')\n return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])\n \n harmful_inst_train, harmful_inst_test = get_harmful_instructions()\n harmless_inst_train, harmless_inst_test = get_harmless_instructions()\n\nNow that we have our datasets, we can load the model we want to abliterate.\nUnfortunately, you can\u2019t directly load a custom model using\n`HookedTransformer`. Here, I use a trick described in FailSpy's notebook to\ndownload a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct.\nLoad in `torch.float16` format if your GPU is not compatible with BF16.\n\nIn this example, we\u2019ll use mlabonne/Daredevil-8B, a mega-merge created with\nDARE TIES (see my article about model merging) that has the highest MMLU score\non the Open LLM Leaderboard in the 8B category.\n\n \n \n MODEL_ID = \"mlabonne/Daredevil-8B\"\n MODEL_TYPE = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n \n # Download and load model\n !git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}\n \n # Load model and tokenizer\n model = HookedTransformer.from_pretrained_no_processing(\n MODEL_TYPE,\n local_files_only=True,\n dtype=torch.bfloat16,\n default_padding_side='left'\n )\n tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)\n tokenizer.padding_side = 'left'\n tokenizer.pad_token = tokenizer.eos_token\n\nWe can now tokenize our datasets. We\u2019re using the same number of samples for\nboth harmless and harmful instructions. Note that a high number of samples can\nuse all the RAM/VRAM, which is why I\u2019m limiting it to 256 here.\n\n \n \n def tokenize_instructions(tokenizer, instructions):\n return tokenizer.apply_chat_template(\n instructions,\n padding=True,\n truncation=False,\n return_tensors=\"pt\",\n return_dict=True,\n add_generation_prompt=True,\n ).input_ids\n \n n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))\n \n # Tokenize datasets\n harmful_tokens = tokenize_instructions(\n tokenizer,\n instructions=harmful_inst_train[:n_inst_train],\n )\n harmless_tokens = tokenize_instructions(\n tokenizer,\n instructions=harmless_inst_train[:n_inst_train],\n )\n\nEverything is set up, we can now implement the first step of abliteration:\ndata collection. We want to process these tokenized datasets and store the\nresidual stream activations in `harmful` and `harmless`. 
This is managed by\nthe transformer_lens library.\n\n \n \n batch_size = 32\n \n # Initialize defaultdicts to store activations\n harmful = defaultdict(list)\n harmless = defaultdict(list)\n \n # Process the training data in batches\n num_batches = (n_inst_train + batch_size - 1) // batch_size\n \n for i in tqdm(range(num_batches)):\n print(i)\n start_idx = i * batch_size\n end_idx = min(n_inst_train, start_idx + batch_size)\n \n # Run models on harmful and harmless prompts, cache activations\n harmful_logits, harmful_cache = model.run_with_cache(\n harmful_tokens[start_idx:end_idx],\n names_filter=lambda hook_name: 'resid' in hook_name,\n device='cpu',\n reset_hooks_end=True\n )\n harmless_logits, harmless_cache = model.run_with_cache(\n harmless_tokens[start_idx:end_idx],\n names_filter=lambda hook_name: 'resid' in hook_name,\n device='cpu',\n reset_hooks_end=True\n )\n \n # Collect and store the activations\n for key in harmful_cache:\n harmful[key].append(harmful_cache[key])\n harmless[key].append(harmless_cache[key])\n \n # Flush RAM and VRAM\n del harmful_logits, harmless_logits, harmful_cache, harmless_cache\n gc.collect()\n torch.cuda.empty_cache()\n \n # Concatenate the cached activations\n harmful = {k: torch.cat(v) for k, v in harmful.items()}\n harmless = {k: torch.cat(v) for k, v in harmless.items()}\n\nWe can now compute the refusal direction for each layer. This corresponds to\nthe mean difference between the activations of harmful and harmless\ninstructions, which is then normalized. We sort them in descending order in\n`activation_scored`.\n\n \n \n # Helper function to get activation index\n def get_act_idx(cache_dict, act_name, layer):\n key = (act_name, layer)\n return cache_dict[utils.get_act_name(*key)]\n \n # Compute difference of means between harmful and harmless activations at intermediate layers\n activation_layers = [\"resid_pre\", \"resid_mid\", \"resid_post\"]\n activation_refusals = defaultdict(list)\n \n for layer_num in range(1, model.cfg.n_layers):\n pos = -1 # Position index\n for layer in activation_layers:\n harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)\n harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(\n dim=0\n )\n refusal_dir = harmful_mean_act - harmless_mean_act\n refusal_dir = refusal_dir / refusal_dir.norm()\n activation_refusals[layer].append(refusal_dir)\n \n selected_layers = [\"resid_pre\"]\n activation_scored = sorted(\n [\n activation_refusals[layer][l - 1]\n for l in range(1, model.cfg.n_layers)\n for layer in selected_layers\n ],\n key=lambda x: abs(x.mean()),\n reverse=True,\n )\n\nThe final step of the process consists of evaluating the refusal directions we\ncalculated. To do this, we\u2019re going to apply the refusal direction to each\nresidual stream and each block during inference. 
In the following snippet, we\nget generations for four test harmful instructions and 20 blocks (or layers).\n\n \n \n def _generate_with_hooks(\n model: HookedTransformer,\n tokenizer: AutoTokenizer,\n tokens: Int[Tensor, \"batch_size seq_len\"],\n max_tokens_generated: int = 64,\n fwd_hooks=[],\n ) -> List[str]:\n all_tokens = torch.zeros(\n (tokens.shape[0], tokens.shape[1] + max_tokens_generated),\n dtype=torch.long,\n device=tokens.device,\n )\n all_tokens[:, : tokens.shape[1]] = tokens\n for i in range(max_tokens_generated):\n with model.hooks(fwd_hooks=fwd_hooks):\n logits = model(all_tokens[:, : -max_tokens_generated + i])\n next_tokens = logits[:, -1, :].argmax(\n dim=-1\n ) # greedy sampling (temperature=0)\n all_tokens[:, -max_tokens_generated + i] = next_tokens\n return tokenizer.batch_decode(\n all_tokens[:, tokens.shape[1] :], skip_special_tokens=True\n )\n \n def get_generations(\n model: HookedTransformer,\n tokenizer: AutoTokenizer,\n instructions: List[str],\n fwd_hooks=[],\n max_tokens_generated: int = 64,\n batch_size: int = 4,\n ) -> List[str]:\n generations = []\n for i in tqdm(range(0, len(instructions), batch_size)):\n tokens = tokenize_instructions(\n tokenizer, instructions=instructions[i : i + batch_size]\n )\n generation = _generate_with_hooks(\n model,\n tokenizer,\n tokens,\n max_tokens_generated=max_tokens_generated,\n fwd_hooks=fwd_hooks,\n )\n generations.extend(generation)\n return generations\n \n # Inference-time intervention hook\n def direction_ablation_hook(\n activation: Float[Tensor, \"... d_act\"],\n hook: HookPoint,\n direction: Float[Tensor, \"d_act\"],\n ):\n if activation.device != direction.device:\n direction = direction.to(activation.device)\n proj = (\n einops.einsum(\n activation, direction.view(-1, 1), \"... d_act, d_act single -> ... single\"\n )\n * direction\n )\n return activation - proj\n \n # Testing baseline\n N_INST_TEST = 4\n baseline_generations = get_generations(\n model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]\n )\n \n # Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)\n EVAL_N = 20 # Evaluate how many of the top N potential directions\n evals = []\n for refusal_dir in tqdm(activation_scored[:EVAL_N]):\n hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)\n fwd_hooks = [\n (utils.get_act_name(act_name, layer), hook_fn)\n for layer in list(range(model.cfg.n_layers))\n for act_name in activation_layers\n ]\n intervention_generations = get_generations(\n model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks\n )\n evals.append(intervention_generations)\n\nWe stored all the generations in the `evals` list. We can now print them and\nmanually select the layer (block) that provides an uncensored response for\neach instruction. 
I'm automatically excluding responses containing \"I cannot\"\nand \"I can't\" to filter out unwanted answers.\n\nIf you can\u2019t find a layer that satisfies these requirements, you might want to\ntest other residual streams in the previous `selected_layers` list, other\ninstructions, additional blocks, etc.\n\n \n \n # Print generations for human evaluation\n blacklist = [\"I cannot\", \"I can't\"]\n for i in range(N_INST_TEST):\n print(f\"\\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}\")\n print(f\"\\nBASELINE COMPLETION:\\n{baseline_generations[i]}\\033[0m\")\n for layer_candidate in range(EVAL_N):\n if not any(word in evals[layer_candidate][i] for word in blacklist):\n print(f\"\\n---\\n\\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:\")\n print(evals[layer_candidate][i])\n\nIn my case, the layer candidate 9 managed to provide uncensored answer for the\nfour instructions. This is the one that we will select for the refusal\ndirection. In the following, we implement weight orthogonalization to modify\nthe weights and prevent the model from creating outputs with this direction.\nYou can verify that the model is successfully uncensored by printing the\ncompletions.\n\n \n \n def get_orthogonalized_matrix(\n matrix: Float[Tensor, \"... d_model\"], vec: Float[Tensor, \"d_model\"]\n ) -> Float[Tensor, \"... d_model\"]:\n proj = (\n einops.einsum(\n matrix, vec.view(-1, 1), \"... d_model, d_model single -> ... single\"\n )\n * vec\n )\n return matrix - proj\n \n # Select the layer with the highest potential refusal direction\n LAYER_CANDIDATE = 9\n refusal_dir = activation_scored[LAYER_CANDIDATE]\n \n # Orthogonalize the model's weights\n if refusal_dir.device != model.W_E.device:\n refusal_dir = refusal_dir.to(model.W_E.device)\n model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)\n \n for block in tqdm(model.blocks):\n if refusal_dir.device != block.attn.W_O.device:\n refusal_dir = refusal_dir.to(block.attn.W_O.device)\n block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)\n block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)\n \n # Generate text with abliterated model\n orthogonalized_generations = get_generations(\n model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]\n )\n \n # Print generations\n for i in range(N_INST_TEST):\n if len(baseline_generations) > i:\n print(f\"INSTRUCTION {i}: {harmful_inst_test[i]}\")\n print(f\"\\033[92mBASELINE COMPLETION:\\n{baseline_generations[i]}\")\n print(f\"\\033[91mINTERVENTION COMPLETION:\\n{evals[LAYER_CANDIDATE][i]}\")\n print(f\"\\033[95mORTHOGONALIZED COMPLETION:\\n{orthogonalized_generations[i]}\\n\")\n\nWe\u2019re now ready to use the model. 
We convert it back to the Hugging Face\nformat and upload it to the HF hub.\n\n \n \n # Convert model back to HF safetensors\n hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)\n lm_model = hf_model.model\n \n state_dict = model.state_dict()\n lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict[\"embed.W_E\"].cpu())\n for l in range(model.cfg.n_layers):\n lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(\n einops.rearrange(\n state_dict[f\"blocks.{l}.attn.W_O\"], \"n h m->m (n h)\", n=model.cfg.n_heads\n ).contiguous()\n )\n lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(\n torch.transpose(state_dict[f\"blocks.{l}.mlp.W_out\"], 0, 1).contiguous()\n )\n \n hf_model.push_to_hub(f\"{MODEL_ID}-abliterated\")\n\n### \u2696\ufe0f DPO Fine-Tuning\n\nI evaluated the abliterated and source models from the previous section on the\nOpen LLM Leaderboard and on Nous\u2019 benchmark suite. Here are the results:\n\nImage by author\n\nAs you can see, the source model significantly outperforms Llama 3 8B\nInstruct. However, we observe a performance drop in the ablated version across\nall benchmarks. The ablation process successfully uncensored it but also\ndegraded the model\u2019s quality.\n\nTo address this issue, an idea consists of further training our abliterated\nmodel to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite\nbrittle when it comes to supervised fine-tuning. An additional SFT would\nlikely break the model\u2019s performance.\n\nAlternatively, preference alignment is quite light and shouldn\u2019t lobotomize\nour abliterated model. DPO is a good candidate here for its ease of use and\ngood track record. To implement it, I used LazyAxolotl (thanks to Wing Lian\nfor creating Axolotl) with the mlabonne/orpo-dpo-mix-40k dataset. Here\u2019s the\nconfiguration I used:\n\n \n \n base_model: mlabonne/Daredevil-8B-abliterated\n model_type: LlamaForCausalLM\n tokenizer_type: AutoTokenizer\n \n load_in_8bit: false\n load_in_4bit: true\n strict: false\n save_safetensors: true\n \n rl: dpo\n chat_template: chatml\n datasets:\n - path: mlabonne/orpo-dpo-mix-40k\n split: train\n type: chatml.intel\n \n dataset_prepared_path:\n val_set_size: 0.0\n output_dir: ./out\n \n adapter: qlora\n lora_model_dir:\n \n sequence_len: 2048\n sample_packing: false\n pad_to_sequence_len: false\n \n lora_r: 64\n lora_alpha: 32\n lora_dropout: 0.05\n lora_target_linear: true\n lora_fan_in_fan_out:\n \n wandb_project: axolotl\n wandb_entity:\n wandb_watch:\n wandb_name:\n wandb_log_model:\n \n gradient_accumulation_steps: 8\n micro_batch_size: 1\n num_epochs: 1\n optimizer: paged_adamw_8bit\n lr_scheduler: cosine\n learning_rate: 5e-6\n train_on_inputs: false\n group_by_length: false\n \n bf16: auto\n fp16:\n tf32:\n \n gradient_checkpointing: true\n early_stopping_patience:\n resume_from_checkpoint:\n local_rank:\n logging_steps: 1\n xformers_attention:\n flash_attention: true\n warmup_steps: 100\n evals_per_epoch: 0\n eval_table_size:\n eval_table_max_new_tokens: 128\n saves_per_epoch: 1\n debug:\n deepspeed: deepspeed_configs/zero2.json\n weight_decay: 0.0\n special_tokens:\n pad_token: <|end_of_text|>\n\nI trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about\n6 hours and 45 minutes. Here are the training curves I got from W&B:\n\nImage by author\n\nIt automatically uploaded the DPO fine-tuned model, called\nmlabonne/NeuralDaredevil-8B-abliterated. 
To see if it fixed our abliterated\nversion, I evaluated it on the same benchmarks:\n\nImage by author\n\nWe can see that this additional training allowed us to recover most of the\nperformance drop due to abliteration. One area where the model doesn\u2019t improve\nis GSM8K, a math dataset, which could mean the orpo-dpo-mix-40k would benefit\nfrom more math samples.\n\nThe final model is an uncensored LLM with state-of-the-art performance in the\n8B category. I recommend it as an improved version of Llama 3 8B Instruct when\nyou don\u2019t need censorship. You can play with quantized versions like GGUF in\nLM Studio.\n\n### Conclusion\n\nIn this article, we introduced the concept of abliteration. This technique\nuses the model\u2019s activations on harmless and harmful prompts to calculate a\nrefusal direction. It then uses this direction to modify the model\u2019s weights\nand ensure that we stop outputting refusals. This technique also demonstrates\nthe fragility of safety fine-tuning and raises ethical considerations.\n\nWe applied abliteration to Daredevil-8B to uncensor it, which also degraded\nthe model\u2019s performance. We then healed it using DPO to create the\nNeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM.\nAbliteration is not limited to removing alignment and should be seen as a form\nof fine-tuning without retraining. Indeed, it can creatively be applied to\nother goals, like FailSpy\u2019s MopeyMule, which adopts a melancholic\nconversational style.\n\nI hope you liked this article. If you want to see more follow me on Hugging\nFace and Twitter @maximelabonne.\n\n### References\n\n * FailSpy, \u201cabliterator library,\u201d GitHub, 2024.\n\n * Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, \u201cRefusal in LLMs is mediated by a single direction,\u201d Lesswrong, 2024.\n\nShare this post\n\n#### Uncensor any LLM with abliteration\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/uncensor-any-llm-with-abliteration-d30148b7d43e" }, { "id": "d3bf078f-7028-410f-b4ed-b79e717f7927", "content": { "Title": "Create Mixtures of Experts with MergeKit", "Subtitle": "Combine multiple models into a single MoE", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Create Mixtures of Experts with MergeKit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Create Mixtures of Experts with MergeKit\n\n### Combine multiple models into a single MoE\n\nMaxime Labonne\n\nMar 27, 2024\n\n1\n\nShare this post\n\n#### Create Mixtures of Experts with MergeKit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### _Combine multiple models into a single MoE_\n\nImage by author\n\nThanks to the release of Mixtral, the **Mixture of Experts** (MoE)\narchitecture has become popular in recent months. 
This architecture offers an\ninteresting tradeoff: higher performance at the cost of increased VRAM usage.\nWhile Mixtral and other MoE architectures are pre-trained from scratch,\nanother method of creating MoE has recently appeared. Thanks to Arcee\u2019s\nMergeKit library, we now have a new way of creating MoEs by ensembling several\npre-trained models. These are often referred to as **frankenMoEs** or\n**MoErges** to distinguish them from the pre-trained MoEs.\n\nIn this article, we will detail how the MoE architecture works and how\nfrankenMoEs are created. Finally, we will make our own frankenMoE with\nMergeKit and evaluate it on several benchmarks. The code is available on\nGoogle Colab in a wrapper called LazyMergeKit.\n\nSpecial thanks to Charles Goddard, the creator of MergeKit, for proofreading\nthis article.\n\n### \ud83d\udd00 Introduction to MoEs\n\nA Mixture of Experts is an architecture designed for improved efficiency and\nperformance. It uses multiple specialized subnetworks, known as \u201c**experts**.\u201d\nUnlike dense models, where the entire network is activated, MoEs only activate\nrelevant experts based on the input. This results in faster training and more\nefficient inference.\n\nThere are two components at the core of an MoE model:\n\n 1. **Sparse MoE Layers** : These replace the dense feed-forward network layers in the transformer architecture. Each MoE layer contains several experts, and only a subset of these experts are engaged for a given input.\n\n 2. **Gate Network or Router** : This component determines which tokens are processed by which experts, ensuring that each part of the input is handled by the most suitable expert(s).\n\nIn the following example, we show how a Mistral-7B block is transformed into\nan MoE block with a sparse MoE layer (feedforward network 1, 2, and 3) and a\nrouter. This example represents an MoE with three experts, where two are\ncurrently engaged (FFN 1 and FFN 3).\n\nImage by author\n\nMoEs also come with their own set of challenges, especially in terms of fine-\ntuning and memory requirements. The fine-tuning process can be difficult due\nto the model\u2019s complexity, with the need to **balance expert usage** during\ntraining to properly train the gating weights to select the most relevant\nones. In terms of memory, even though only a fraction of the total parameters\nare used during inference, the entire model, including all experts, needs to\nbe **loaded into memory** , which requires high VRAM capacity.\n\nMore specifically, there are two essential parameters when it comes to MoEs:\n\n * **Number of experts** (`num_local_experts`): This determines the total number of experts in the architecture (e.g., 8 for Mixtral). The higher the number of experts, the higher the VRAM usage.\n\n * **Number of experts/token** (`num_experts_per_tok`): This determines the number of experts that are engaged for each token and each layer (e.g., 2 for Mixtral). There is a tradeoff between a high number of experts per token for accuracy (but diminishing returns) vs. a low number for fast training and inference.\n\nHistorically, MoEs have underperformed dense models. However, the release of\nMixtral-8x7B in December 2023 shook things up and showed impressive\nperformance for its size. Additionally, GPT-4 is also rumored to be an MoE,\nwhich would make sense as it would be a lot cheaper to run and train for\nOpenAI compared to a dense model. 
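As a side note, both hyperparameters are exposed in the Hugging Face configuration of MoE models, so you can check them without downloading any weights. A small sketch, assuming `transformers` is installed and the Mixtral repository is accessible:

    from transformers import AutoConfig

    # Only fetches config.json, not the weights
    config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
    print(config.num_local_experts)    # total experts per MoE layer (8 for Mixtral)
    print(config.num_experts_per_tok)  # experts engaged per token (2 for Mixtral)
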
In addition to these recent excellent MoEs,\nwe now have a new way of creating MoEs with MergeKit: frankenMoEs, also called\nMoErges.\n\n### \ud83e\udddf\u200d\u2642\ufe0f True MoEs vs. frankenMoEs\n\nThe main difference between true MoEs and frankenMoEs is how they\u2019re trained.\nIn the case of true MoEs, the experts and the router are trained jointly. In\nthe case of frankenMoEs, we upcycle existing models and initialize the router\nafterward.\n\nIn other words, we copy the weights of the layer norm and self-attention\nlayers from a base model, and then copy the weights of the FFN layers found in\neach expert. This means that besides the FFNs, all the other parameters are\nshared. This explains why Mixtral-8x7B with eight experts doesn\u2019t have 8*7 =\n56B parameters, but about 45B. This is also why using two experts per token\ngives the inference speed (FLOPs) of a 12B dense model instead of 14B.\n\nFrankenMoEs are about selecting the most relevant experts and initializing\nthem properly. MergeKit currently implements three ways of initializing the\nrouters:\n\n 1. **Random** : Random weights. Be careful when using it as the same experts might be selected every time (it requires further fine-tuning or `num_local_experts = num_experts_per_tok`, which means you don't need any routing).\n\n 2. **Cheap embed** : It uses the raw embeddings of the input tokens directly and applies the same transformation across all layers. This method is computationally inexpensive and suitable for execution on less powerful hardware.\n\n 3. **Hidden** : It creates hidden representations of a list of positive and negative prompts by extracting them from the last layer of the LLM. They are averaged and normalized to initialize the gates. More information about it is available on Charles Goddard\u2019s blog.\n\nAs you can guess, the \u201chidden\u201d initialization is the most efficient to\ncorrectly route the tokens to the most relevant experts. In the next section,\nwe will create our own frankenMoE using this technique.\n\n### \ud83d\udcbb Creating a frankenMoE\n\nTo create our frankenMoE, we need to select `n` experts. In this case, we will\nrely on Mistral-7B thanks to its popularity and relatively small size.\nHowever, eight experts like in Mixtral is quite a lot, as we need to fit all\nof them in memory. For efficiency, I'll only use four experts in this example,\nwith two of them engaged for each token and each layer. In this case, we will\nend up with a model with 24.2B parameters instead of 4*7 = 28B parameters.\n\nHere, our goal is to create a well-rounded model that can do pretty much\neverything: write stories, explain articles, code in Python, etc. We can\ndecompose this requirement into four tasks and select the best expert for each\nof them. This is how I decomposed it:\n\n * **Chat model** : a general-purpose model that is used in most interactions. I used mlabonne/AlphaMonarch-7B, which perfectly satisfies the requirements.\n\n * **Code model** : a model capable of generating good code. I don\u2019t have a lot of experience with Mistral-7B-based code models, but I found beowolx/CodeNinja-1.0-OpenChat-7B particularly good compared to others.\n\n * **Math model** : math is tricky for LLMs, which is why we want a model specialized in math. Thanks to its high MMLU and GMS8K scores, I chose mlabonne/NeuralDaredevil-7B for this purpose.\n\n * **Role-play model** : The goal of this model is to write high-quality stories and conversations. 
I selected SanjiWatsuki/Kunoichi-DPO-v2\u20137B because of its good reputation and high MT-Bench score (8.51 vs. 8.30 for Mixtral).\n\nNow that we\u2019ve identified the experts we want to use, we can create the YAML\nconfiguration that MergeKit will use to create our frankenMoE. This uses the\nmixtral branch of MergeKit. You can find more information about how to write\nthe configuration on this page. Here is our version:\n\n \n \n base_model: mlabonne/AlphaMonarch-7B\n experts:\n - source_model: mlabonne/AlphaMonarch-7B\n positive_prompts:\n - \"chat\"\n - \"assistant\"\n - \"tell me\"\n - \"explain\"\n - \"I want\"\n - source_model: beowolx/CodeNinja-1.0-OpenChat-7B\n positive_prompts:\n - \"code\"\n - \"python\"\n - \"javascript\"\n - \"programming\"\n - \"algorithm\"\n - source_model: SanjiWatsuki/Kunoichi-DPO-v2-7B\n positive_prompts:\n - \"storywriting\"\n - \"write\"\n - \"scene\"\n - \"story\"\n - \"character\"\n - source_model: mlabonne/NeuralDaredevil-7B\n positive_prompts:\n - \"reason\"\n - \"math\"\n - \"mathematics\"\n - \"solve\"\n - \"count\"\n\nFor each expert, I provide five basic positive prompts. You can be a bit\nfancier and write entire sentences if you want. The best strategy consists of\nusing real prompts that should trigger a particular expert. You can also add\nnegative prompts to do the opposite.\n\nOnce this is ready, you can save your configuration as `config.yaml`. In the\nsame folder, we will download and install the mergekit library (mixtral\nbranch).\n\n \n \n git clone -b mixtral https://github.com/arcee-ai/mergekit.git\n cd mergekit && pip install -e .\n pip install -U transformers\n\nIf your computer has enough RAM (roughly 24\u201332 GB of RAM), you can run the\nfollowing command:\n\n \n \n mergekit-moe config.yaml merge --copy-tokenizer\n\nIf you don\u2019t have enough RAM, you can shard the models instead as follows (it\nwill take longer):\n\n \n \n mergekit-moe config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle\n\nThis command automatically downloads the experts and creates the frankenMoE in\nthe `merge` directory. For the `hidden` gate mode, you can also use the\n`--load-in-4bit` and `--load-in-8bit` options to compute hidden states with\nlower precision.\n\nAlternatively, you can copy your configuration into LazyMergekit, a wrapper I\nmade to simplify model merging. In this Colab notebook, you can input your\nmodel name, select the `mixtral` branch, specify your Hugging Face\nusername/token, and run the cells. After creating your frankenMoE, it will\nalso upload it to the Hugging Face Hub with a nicely formatted model card.\n\nI called my model Beyonder-4x7B-v3 and created GGUF versions of it using\nAutoGGUF. If you can\u2019t run GGUF versions on your local machine, you can also\nperform inference using this Colab notebook.\n\nTo get a good overview of its capabilities, it has been evaluated on three\ndifferent benchmarks: Nous\u2019 benchmark suite, EQ-Bench, and the Open LLM\nLeaderboard. This model is not designed to excel in traditional benchmarks, as\nthe code and role-playing models generally do not apply to those contexts.\nNonetheless, it performs remarkably well thanks to strong general-purpose\nexperts.\n\n**Nous** : Beyonder-4x7B-v3 is one of the best models on Nous\u2019 benchmark suite\n(evaluation performed using LLM AutoEval) and significantly outperforms the\nv2. 
See the entire leaderboard here.\n\n**EQ-Bench** : It\u2019s also the best 4x7B model on the EQ-Bench leaderboard,\noutperforming older versions of ChatGPT and Llama-2\u201370b-chat. Beyonder is very\nclose to Mixtral-8x7B-Instruct-v0.1 and Gemini Pro, which are (supposedly)\nmuch bigger models.\n\n**Open LLM Leaderboard** : Finally, it\u2019s also a strong performer on the Open\nLLM Leaderboard, significantly outperforming the v2 model.\n\nOn top of these quantitative evaluations, I recommend checking the model\u2019s\noutputs in a more qualitative way using a GGUF version on LM Studio. A common\nway of testing these models is to gather a private set of questions and check\ntheir outputs. With this strategy, I found that Beyonder-4x7B-v3 is quite\nrobust to changes in the user and system prompts compared to other models,\nincluding AlphaMonarch-7B. This is pretty cool as it improves the usefulness\nof the model in general.\n\nFrankenMoEs are a promising but still experimental approach. The trade-offs,\nlike higher VRAM demand and slower inference speeds, can make it challenging\nto see their advantage over simpler merging techniques like SLERP or DARE\nTIES. Especially, when you use frankenMoEs with just two experts, they might\nnot perform as well as if you had simply merged the two models. However,\nfrankenMoEs excel in preserving knowledge, which can result in stronger\nmodels, as demonstrated by Beyonder-4x7B-v3. With the right hardware, these\ndrawbacks can be effectively mitigated.\n\n### Conclusion\n\nIn this article, we introduced the Mixture of Experts architecture. Unlike\ntraditional MoEs that are trained from scratch, MergeKit facilitates the\ncreation of MoEs by ensembling experts, offering an innovative approach to\nimproving model performance and efficiency. We detailed the process of\ncreating a frankenMoE with MergeKit, highlighting the practical steps involved\nin selecting and combining different experts to produce a high-quality MoE.\n\nThanks for reading this article. I encourage you to try to make your own\nFrankenMoEs using LazyMergeKit: select a few models, create your config based\nBeyonder\u2019s, and run the notebook to create your own models! If you liked this\narticle, please follow me on Hugging Face and X/Twitter @maximelabonne.\n\n### References\n\n * Mixtral of Experts by Jiang et al. (2023)\n\n * Mixture of Experts for Clowns by Charles Goddard (2023)\n\n * Mixture of Experts Explained by Sanseviero et al. (2023)\n\n * Adaptive Mixture of Local Experts by Jacobs et al. (1991)\n\n * Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints by Komatsuzaki et al. (2022)\n\n_Learn more about machine learning and support my work with one click \u2014 become\na Medium member here:_\n\n**Join Medium with my referral link \u2014 Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\n1\n\nShare this post\n\n#### Create Mixtures of Experts with MergeKit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/create-mixtures-of-experts-with-mergekit-11b318c99562" }, { "id": "6d5c6e46-1390-4bb7-86ee-73df95b7a610", "content": { "Title": "Merge Large Language Models with mergekit", "Subtitle": "Create your own models easily, no GPU required!", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Merge Large Language Models with mergekit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Merge Large Language Models with mergekit\n\n### Create your own models easily, no GPU required!\n\nMaxime Labonne\n\nJan 08, 2024\n\n1\n\nShare this post\n\n#### Merge Large Language Models with mergekit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Create your own models easily, no GPU required!\n\nImage by author\n\nModel merging is a technique that **combines two or more LLMs** into a single\nmodel. It\u2019s a relatively new and experimental method to create new models for\ncheap (no GPU required). Model merging works surprisingly well and produced\nmany state-of-the-art models on the Open LLM Leaderboard.\n\nIn this tutorial, we will implement it using the mergekit library. More\nspecifically, we will review four merge methods and provide examples of\nconfigurations. Then, we will use mergekit to create our own model,\nMarcoro14\u20137B-slerp, which became the best-performing model on the Open LLM\nLeaderboard (02/01/24).\n\nThe code is available on GitHub and Google Colab. I recommend using my\nautomated notebook to easily run mergekit: \ud83e\udd71 LazyMergekit.\n\n_A special thanks toCharles Goddard, the author of the mergekit library, for\nreviewing this article._\n\nImage by author\n\n### \ud83e\udd1d Merge algorithms\n\nIn this section, we will focus on four methods currently implemented in\nmergekit. Note that there are other methods, such as linear and Task\nArithmetic. If you\u2019re interested in papers on model merging, I recommend this\nexcellent collection on Hugging Face.\n\n#### 1\\. SLERP\n\n**Spherical Linear Interpolation** (SLERP) is a method used to smoothly\ninterpolate between two vectors. It maintains a constant rate of change and\npreserves the geometric properties of the spherical space in which the vectors\nreside.\n\nThere are several reasons to prefer SLERP over a traditional linear\ninterpolation. For example, in high-dimensional spaces, linear interpolation\ncan lead to a **decrease in the magnitude** of the interpolated vector (i.e.,\nit reduces the scale of weights). Moreover, the change in direction of the\nweights often represents **more meaningful information** (like feature\nlearning and representation) than the magnitude of change.\n\nSLERP is implemented using the following steps:\n\n 1. Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes\n\n 2. Calculate the angle between these vectors using their dot product.\n\n 3. If the vectors are nearly collinear, it defaults to linear interpolation for efficiency. Otherwise, SLERP computing scale factors based on the interpolation factor `t` (`t=0` = 100% of the first vector, `t=1` = 100% of model 2) and the angle between the vectors.\n\n 4. 
These factors are used to weigh the original vectors, which are then summed to obtain the interpolated vector.\n\nSLERP is currently the most popular merging method, but it is limited to\ncombining only two models at a time. It is still possible to hierarchically\ncombine multiple models, as shown in Mistral-7B-Merge-14-v0.1.\n\n_Example of configuration:_\n\n \n \n slices:\n - sources:\n - model: OpenPipe/mistral-ft-optimized-1218\n layer_range: [0, 32]\n - model: mlabonne/NeuralHermes-2.5-Mistral-7B\n layer_range: [0, 32]\n merge_method: slerp\n base_model: OpenPipe/mistral-ft-optimized-1218\n parameters:\n t:\n - filter: self_attn\n value: [0, 0.5, 0.3, 0.7, 1]\n - filter: mlp\n value: [1, 0.5, 0.7, 0.3, 0]\n - value: 0.5\n dtype: bfloat16\n\nThis is a classic SLERP configuration, applied to every layer of both models.\nNote that we input a gradient of values for the interpolation factor `t`. The\nparameters for the self-attention and MLP layers will use different\ncombinations of OpenPipe/mistral-ft-optimized-1218 and\nmlabonne/NeuralHermes-2.5-Mistral-7B. The other layers are a 50/50 mixture of\nthe two models.\n\nYou can find the final model on the Hugging Face Hub at\nmlabonne/NeuralPipe-7B-slerp.\n\n#### 2\\. TIES\n\nIntroduced in this paper by Yadav et al., **TIES-Merging** is designed to\nefficiently merge multiple task-specific models into a single multitask model.\nIt addresses two main challenges in model merging:\n\n * **Redundancy in model parameters** : It identifies and eliminates redundant parameters within task-specific models. This is achieved by focusing on the changes made during fine-tuning, identifying the top-k% most significant changes, and discarding the rest.\n\n * **Disagreement between parameter signs** : Conflicts arise when different models suggest opposing adjustments to the same parameter. TIES-Merging resolves these conflicts by creating a unified sign vector that represents the most dominant direction of change across all models.\n\nTIES-Merging is divided into the following three steps:\n\n 1. **Trim** : Reduces redundancy in task-specific models by retaining only a fraction the most significant parameters (density parameter) and resetting the rest to zero.\n\n 2. **Elect Sign** : Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.\n\n 3. **Disjoint Merge** : Averages parameter values that align with the unified sign vector, excluding zero values.\n\nUnlike SLERP, TIES can merge multiple models at a time.\n\n_Example of configuration:_\n\n \n \n models:\n - model: mistralai/Mistral-7B-v0.1\n # no parameters necessary for base model\n - model: OpenPipe/mistral-ft-optimized-1218\n parameters:\n density: 0.5\n weight: 0.5\n - model: mlabonne/NeuralHermes-2.5-Mistral-7B\n parameters:\n density: 0.5\n weight: 0.3\n merge_method: ties\n base_model: mistralai/Mistral-7B-v0.1\n parameters:\n normalize: true\n dtype: float16\n\nWith this config, we use Mistral-7B as a base model to calculate the delta\nweights. We merge the same two models: mistral-ft-optimized-1218 (50%) and\nNeuralHermes-2.5-Mistral-7B (30%) with normalization. Here, the density means\nthat we\u2019re only retaining 50% of the parameters of each model (the other half\ncomes from the base model).\n\nNote that the sum of the weights is not equal to 1 in the config, but the\n`normalize: true` parameter will automatically normalize them internally. 
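To make the Trim, Elect Sign, and Disjoint Merge steps more concrete, here is a toy sketch on two task vectors (my own illustration in plain PyTorch, not MergeKit's actual implementation):

    import torch

    def trim(delta, density=0.5):
        # Trim: keep only the top-k% largest-magnitude changes, zero out the rest
        k = max(1, int(density * delta.numel()))
        threshold = delta.abs().flatten().topk(k).values.min()
        return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

    def ties_merge(deltas, density=0.5):
        trimmed = torch.stack([trim(d, density) for d in deltas])
        # Elect Sign: dominant direction per parameter by cumulative magnitude
        elected_sign = torch.sign(trimmed.sum(dim=0))
        # Disjoint Merge: average only the values that agree with the elected sign
        agrees = (torch.sign(trimmed) == elected_sign) & (trimmed != 0)
        return (trimmed * agrees).sum(dim=0) / agrees.sum(dim=0).clamp(min=1)

    # Two toy task vectors (fine-tuned weights minus base weights)
    delta_a = torch.tensor([ 0.8, -0.1, 0.4, -0.60])
    delta_b = torch.tensor([-0.7,  0.2, 0.5,  0.05])
    print(ties_merge([delta_a, delta_b]))  # tensor([ 0.8000,  0.0000,  0.5000, -0.6000])

In a real merge, the result would also be scaled by the per-model weights and added back to the base model's parameters; mergekit handles all of this for you.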
This\nconfig is inspired by the parameters provided by the author of\nOpenHermes-2.5-neural-chat-7b-v3\u20131\u20137B.\n\nYou can find the final model on the Hugging Face Hub at\nmlabonne/NeuralPipe-7B-ties.\n\n#### 3\\. DARE\n\nIntroduced by Yu et al. (2023), DARE uses an approach similar to TIES with two\nmain differences:\n\n * **Pruning** : DARE randomly reset fine-tuned weights to their original values (those of the base model).\n\n * **Rescaling** : DARE rescales the weights to keep the expectations of model outputs approximately unchanged. It adds the rescaled weights of both (or more) models to the weights of the base model with a scale factor.\n\nMergekit\u2019s implementation of this method has two flavors: with the sign\nelection step of TIES (`dare_ties`) or without (`dare_linear`).\n\n_Example of configuration:_\n\n \n \n models:\n - model: mistralai/Mistral-7B-v0.1\n # No parameters necessary for base model\n - model: samir-fama/SamirGPT-v1\n parameters:\n density: 0.53\n weight: 0.4\n - model: abacusai/Slerp-CM-mist-dpo\n parameters:\n density: 0.53\n weight: 0.3\n - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.2\n parameters:\n density: 0.53\n weight: 0.3\n merge_method: dare_ties\n base_model: mistralai/Mistral-7B-v0.1\n parameters:\n int8_mask: true\n dtype: bfloat16\n\nIn this configuration, we merge three different models based on Mistral-7B\nusing `dare_ties`. This time, I chose weights that sum to 1 (the sum should be\nbetween 0.9 and 1.1). The density parameter is a little higher than what's\nrecommended in the paper (<0.5), but it looks like it gives consistently\nbetter results (see this discussion).\n\nYou can find it on the Hugging Face Hub at mlabonne/Daredevil-7B. It\u2019s also\nthe best merge model in this article, outperforming even Marcoro14\u20137B-slerp.\n\n#### 4\\. Passthrough\n\nThe passthrough method differs significantly from the previous ones. By\nconcatenating layers from different LLMs, it can produce models with an\n**exotic number of parameters** (e.g., 9B with two 7B parameter models). These\nmodels are often referred to as \u201cfrankenmerges\u201d or \u201cFrankenstein models\u201d by\nthe community.\n\nThis technique is very experimental, but it managed to create impressive\nmodels, like goliath-120b using two Llama 2 70B models. The recently released\nSOLAR-10.7B-v1.0 also uses the same idea, called depth-up scaling in their\npaper.\n\n_Example of configuration:_\n\n \n \n slices:\n - sources:\n - model: OpenPipe/mistral-ft-optimized-1218\n layer_range: [0, 32]\n - sources:\n - model: mlabonne/NeuralHermes-2.5-Mistral-7B\n layer_range: [24, 32]\n merge_method: passthrough\n dtype: bfloat16\n\nThe resulting frankenmerge will have all the 32 layers from the first model\nand 8 additional layers from the second model. This creates a frankenmerge\nwith a total of 40 layers and 8.99B parameters. This config is inspired by\nGML-Mistral-merged-v1.\n\nYou can find the final model on the Hugging Face Hub at\nmlabonne/NeuralPipe-9B-merged.\n\n### \ud83d\udcbb Merge your own models\n\nIn this section, we will use mergekit to load a merge configuration, run it,\nand upload the resulting model to the Hugging Face Hub.\n\nFirst of all, we install mergekit directly from source as follows:\n\n \n \n !git clone https://github.com/cg123/mergekit.git\n !cd mergekit && pip install -q -e .\n\nIn the following block, we load the merge configuration in a YAML format. We\nalso specify the name of the merged model for future use. 
You can copy/paste\nany configuration from the previous section here.\n\nThis time, we will use two different models: Marcoroni-7B-v3 and\nMistral-7B-Merge-14-v0.1, and merge them with the SLERP method. We save the\nconfig as a YAML file to be used as input in the merge command.\n\n \n \n import yaml\n \n MODEL_NAME = \"Marcoro14-7B-slerp\"\n yaml_config = \"\"\"\n slices:\n - sources:\n - model: AIDC-ai-business/Marcoroni-7B-v3\n layer_range: [0, 32]\n - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.1\n layer_range: [0, 32]\n merge_method: slerp\n base_model: AIDC-ai-business/Marcoroni-7B-v3\n parameters:\n t:\n - filter: self_attn\n value: [0, 0.5, 0.3, 0.7, 1]\n - filter: mlp\n value: [1, 0.5, 0.7, 0.3, 0]\n - value: 0.5\n dtype: bfloat16\n \n \"\"\"\n \n # Save config as yaml file\n with open('config.yaml', 'w', encoding=\"utf-8\") as f:\n f.write(yaml_config)\n\nWe run the merge command with the following parameters:\n\n * `--copy-tokenizer` to copy the tokenizer from the base model\n\n * `--allow-crimes` and `--out-shard-size` to chunk the models into smaller shards that can be computed on a CPU with low RAM\n\n * `--lazy-unpickle` to enable the experimental lazy unpickler for lower memory usage\n\nIn addition, some models can require the `--trust_remote_code` flag (this is\nnot the case with Mistral-7B).\n\nThis command will download the weights of all the models listed in the merge\nconfiguration and run the selected merge method (it should take ~10 minutes).\n\n \n \n # Merge models\n !mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle\n\nThe model is now merged and saved in the `merge` directory. Before uploading\nit, we can create a README file with all the information required for\nreproducibility.
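Before writing the card, a cheap sanity check that the merged folder is loadable can save a failed upload. A minimal sketch (it only reads the config and tokenizer, so it does not need much RAM):

    from transformers import AutoConfig, AutoTokenizer

    # The merged folder should contain a valid config and tokenizer
    config = AutoConfig.from_pretrained("merge")
    tokenizer = AutoTokenizer.from_pretrained("merge")
    print(config.model_type, config.num_hidden_layers, len(tokenizer))
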
The following code block defines a Jinja template and\nautomatically fills it with the data from the merge configuration.\n\n \n \n !pip install -qU huggingface_hub\n \n from huggingface_hub import ModelCard, ModelCardData\n from jinja2 import Template\n \n username = \"mlabonne\"\n \n template_text = \"\"\"\n ---\n license: apache-2.0\n tags:\n - merge\n - mergekit\n - lazymergekit\n {%- for model in models %}\n - {{ model }}\n {%- endfor %}\n ---\n \n # {{ model_name }}\n \n {{ model_name }} is a merge of the following models using [mergekit](https://github.com/cg123/mergekit):\n \n {%- for model in models %}\n * [{{ model }}](https://huggingface.co/{{ model }})\n {%- endfor %}\n \n ## \ud83e\udde9 Configuration\n \n ```yaml\n {{- yaml_config -}}\n ```\n \"\"\"\n \n # Create a Jinja template object\n jinja_template = Template(template_text.strip())\n \n # Get list of models from config\n data = yaml.safe_load(yaml_config)\n if \"models\" in data:\n models = [data[\"models\"][i][\"model\"] for i in range(len(data[\"models\"])) if \"parameters\" in data[\"models\"][i]]\n elif \"parameters\" in data:\n models = [data[\"slices\"][0][\"sources\"][i][\"model\"] for i in range(len(data[\"slices\"][0][\"sources\"]))]\n elif \"slices\" in data:\n models = [data[\"slices\"][i][\"sources\"][0][\"model\"] for i in range(len(data[\"slices\"]))]\n else:\n raise Exception(\"No models or slices found in yaml config\")\n \n # Fill the template\n content = jinja_template.render(\n model_name=MODEL_NAME,\n models=models,\n yaml_config=yaml_config,\n username=username,\n )\n \n # Save the model card\n card = ModelCard(content)\n card.save('merge/README.md')\n\nNow that we have a model card, we can push the entire folder to the Hub.\n\n \n \n from google.colab import userdata\n from huggingface_hub import HfApi\n \n username = \"mlabonne\"\n \n # Defined in the secrets tab in Google Colab\n api = HfApi(token=userdata.get(\"HF_TOKEN\"))\n \n api.create_repo(\n repo_id=f\"{username}/{MODEL_NAME}\",\n repo_type=\"model\"\n )\n api.upload_folder(\n repo_id=f\"{username}/{MODEL_NAME}\",\n folder_path=\"merge\",\n )\n\nThe model is now available on the Hugging Face Hub at\nmlabonne/Marcoro14\u20137B-slerp. In another notebook, we can try the model on a\nfree T4 GPU using the following code:\n\n \n \n !pip install -qU transformers accelerate\n \n from transformers import AutoTokenizer\n import transformers\n import torch\n \n model = \"mlabonne/Marcoro14-7B-slerp\"\n messages = [{\"role\": \"user\", \"content\": \"What is a large language model?\"}]\n \n tokenizer = AutoTokenizer.from_pretrained(model)\n prompt = tokenizer.apply_chat_template(\n messages,\n tokenize=False,\n add_generation_prompt=True\n )\n pipeline = transformers.pipeline(\n \"text-generation\",\n model=model,\n torch_dtype=torch.float16,\n device_map=\"auto\",\n )\n \n outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)\n\nWe\u2019re asking the question \u201cWhat is a Large Language Model?\u201d and received this\noutput:\n\n> _A large language model is a type of artificial intelligence (AI) system\n> that has been trained on vast amounts of text data. It\u2019s designed to\n> understand and generate human-like language, making predictions on what\n> words or phrases might come next in a sentence or document. These models use\n> complex algorithms and neural network architectures to learn from the data\n> and improve their performance over time. 
Some well-known large language\n> models include GPT-3 from OpenAI and BERT from Google._\n\nIt\u2019s looking good, but we need a more comprehensive evaluation. For this kind\nof general-purpose model, there are a few interesting benchmarks:\n\n * **Chatbot Arena** , which compiles an Elo-based LLM leaderboard based on human votes.\n\n * **MT-bench** (same link), which uses GPT-4 as a judge to grade model responses on a set of multi-turn questions.\n\n * **NousResearch benchmark suite** , which aggregates four benchmarks: AGIEval, GPT4ALL, TruthfulQA, and Bigbench. GPT4ALL itself includes HellaSwag, OpenBookQA, Winogrande, ARC-Easy, ARC-Challenge, BoolQ, and PIQA.\n\n * **Open LLM Leaderboard** , which aggregates six benchmarks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA.\n\nUnfortunately, we can\u2019t submit our model to the Chatbot Arena. Instead, I\nchose to evaluate it using the Open LLM Leaderboard and NousResearch\nbenchmarks.\n\nI submitted our model to the Open LLM Leaderboard (\u201c\ud83d\ude80 Submit here!\u201d tab). As\nshown in the introduction, it ranked as **the best 7B parameter model** on the\nleaderboard. Here are the complete results:\n\nImage by author\n\nThe problem with the Open LLM Leaderboard is that these benchmarks are public.\nIt means that people can train LLMs on the test data to get better results. By\nmerging the best models, we also contaminate our own results. It is safe to\nassume that **Marcoro14\u20137B-slerp is contaminated** and some models used in\nthis merge have been trained on the test set. If you want to create the best\nmodel and not hack the leaderboard, I recommend only using non-merge models to\ncreate your own merges.\n\nThis is why we don\u2019t want to only rely on the OpenLLM Leaderboard. For\nNousResearch benchmark suite, I used \ud83e\uddd0 LLM AutoEval to compute the scores\nautomatically with a simple Colab notebook. Here are the results compared to\nthe excellent OpenHermes-2.5-Mistral-7B:\n\nImage by author\n\nWe get a significant improvement over this model on **every benchmark**. Note\nthat NousResearch benchmark suite shares some tasks with the Open LLM\nLeaderboard: ARC-Challenge, TruthfulQA, HellaSwag, and Winogrande. To the best\nof my knowledge, Bigbench is the only benchmark that is 100% different (feel\nfree to contact me if that\u2019s not the case). However, one of the models we used\nin this merge could still have been trained on Bigbench.\n\n### Conclusion\n\nIn this article, we introduced the concept of merging LLMs with four different\nmethods. We detailed how SLERP, TIES, DARE, and passthrough work and provided\nexamples of configurations. Finally, we ran SLERP with mergekit to create\nMarcoro14\u20137B-slerp and upload it to the Hugging Face Hub. We obtained\nexcellent performance on two benchmark suites: Open LLM Leaderboard (**best-\nperforming 7B model**) and NousResearch. If you want to create your own\nmerges, I recommend using my automated notebook \ud83e\udd71 LazyMergekit.\n\nAnother way of combining multiple models is to merge them in a Mixture of\nExperts (MoE) architecture. In the next article, we\u2019ll discuss how to do this\nin detail and create our own Mixtral-like model. 
If you liked this article,\nplease follow me on Medium and Twitter @maximelabonne.\n\n_Learn more about machine learning and support my work with one click \u2014 become\na Medium member here:_\n\n**Join Medium with my referral link \u2014 Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\n1\n\nShare this post\n\n#### Merge Large Language Models with mergekit\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/merge-large-language-models-with-mergekit-2118fb392b54" }, { "id": "d79f3c67-c491-4fd1-96ba-67e03ba66d93", "content": { "Title": "Fine-tune a Mistral-7b model with Direct Preference Optimization", "Subtitle": "Boost the performance of your supervised fine-tuned models", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Fine-tune a Mistral-7b model with Direct Preference Optimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Fine-tune a Mistral-7b model with Direct Preference Optimization\n\n### Boost the performance of your supervised fine-tuned models\n\nMaxime Labonne\n\nJan 01, 2024\n\n1\n\nShare this post\n\n#### Fine-tune a Mistral-7b model with Direct Preference Optimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Boost the performance of your supervised fine-tuned models\n\nImage by author\n\nPre-trained Large Language Models (LLMs) can only perform next-token\nprediction, making them unable to answer questions. This is why these base\nmodels are then fine-tuned on pairs of instructions and answers to act as\nhelpful assistants. However, this process can still be flawed: fine-tuned LLMs\ncan be biased, toxic, harmful, etc. This is where Reinforcement Learning from\nHuman Feedback (RLHF) comes into play.\n\nRLHF provides different answers to the LLM, which are ranked according to a\ndesired behavior (helpfulness, toxicity, etc.). The model learns to output the\nbest answer among these candidates, hence mimicking the behavior we want to\ninstill. Often seen as a way to censor models, this process has recently\nbecome popular for improving performance, as shown in neural-chat-7b-v3\u20131.\n\nIn this article, we will create NeuralHermes-2.5, by fine-tuning\nOpenHermes-2.5 using a RLHF-like technique: Direct Preference Optimization\n(DPO). For this purpose, we will introduce a preference dataset, describe how\nthe DPO algorithm works, and apply it to our model. 
We\u2019ll see that it\nsignificantly improves the performance of the base model on the Open LLM\nLeaderboard.\n\nAs per usual, the code is available on GitHub and Google Colab.\n\n_**Update** : Jessie Davids, a reader who used this article and code, managed\nto create the best-performing model on the Open LLM Leaderboard ~7B param.\nCongrats to him! \ud83c\udf89_\n\nImage by author\n\n### \ud83e\udd47 Preference datasets\n\nPreference datasets are not standardized, but they typically consist of a\ncollection of answers that are ranked by humans. This ranking is essential, as\nthe RLHF process fine-tunes LLMs to output the preferred answer. Here is an\nexample of Anthropic/hh-rlhf, a popular preference dataset:\n\nImage by author\n\nThe structure of the dataset is straightforward: for each row, there is one\nchosen (preferred) answer, and one rejected answer. The goal of RLHF is to\nguide the model to output the preferred answer.\n\nPreference datasets are notoriously costly and difficult to make, as they\nrequire collecting manual feedback from humans. This feedback is also\nsubjective and can easily be biased toward confident (but wrong) answers or\ncontradict itself (different annotators have different values). Over time,\nseveral solutions have been proposed to tackle these issues, such as replacing\nhuman feedback with AI feedback (RLAIF).\n\nThese datasets also tend to be a lot smaller than fine-tuning datasets. To\nillustrate this, the excellent neural-chat-7b-v3\u20131 (best 7B LLM on the Open\nLLM Leaderboard when it was released) uses 518k samples for fine-tuning (Open-\nOrca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this\ncase, the authors generated answers with GPT-4/3.5 to create the preferred\nanswers, and with Llama 2 13b chat to create the rejected responses. It\u2019s a\nsmart way to bypass human feedback and only rely on models with different\nlevels of performance.\n\n### \ud83c\udf93 Direct Preference Optimization\n\nWhile the concept of RLHF has been used in robotics for a long time, it was\npopularized for LLMs in OpenAI\u2019s paper Fine-Tuning Language Models from Human\nPreferences. In this paper, the authors present a framework where a reward\nmodel is trained to approximate human feedback. This reward model is then used\nto optimize the fine-tuned model\u2019s policy using the Proximal Policy\nOptimization (PPO) algorithm.\n\nImage by author\n\nThe core concept of PPO revolves around making smaller, incremental updates to\nthe policy, as larger updates can lead to instability or suboptimal solutions.\nFrom experience, this technique is unfortunately still unstable (loss\ndiverges), difficult to reproduce (numerous hyperparameters, sensitive to\nrandom seeds), and computationally expensive.\n\nThis is where Direct Preference Optimization (DPO) comes into play. DPO\nsimplifies control by treating the task as a classification problem.\nConcretely, it uses two models: the **trained model** (or policy model) and a\ncopy of it called the **reference model**. During training, the goal is to\nmake sure the trained model outputs higher probabilities for preferred answers\nthan the reference model. Conversely, we also want it to output lower\nprobabilities for rejected answers. 
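To make this more concrete, here is a toy sketch of the preference loss that DPO optimizes, reduced to a single example (the actual `DPOTrainer` implementation handles batching, padding, and token-level masking):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # Implicit rewards: how much more likely each answer is under the policy vs. the reference
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Binary cross-entropy on the margin: push the chosen answer above the rejected one
        return -F.logsigmoid(chosen_reward - rejected_reward)

    # Example with made-up sequence log-probabilities
    print(dpo_loss(torch.tensor(-12.0), torch.tensor(-20.0),
                   torch.tensor(-14.0), torch.tensor(-18.0)))
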
It means we\u2019re penalizing the LLM for bad\nanswers and rewarding it for good ones.\n\nImage by author\n\nBy using the LLM itself as a reward model and employing binary cross-entropy\nobjectives, DPO efficiently aligns the model\u2019s outputs with human preferences\nwithout the need for extensive sampling, reward model fitting, or intricate\nhyperparameter adjustments. It results in a more stable, more efficient, and\ncomputationally less demanding process.\n\n### \ud83d\udcbe Formatting the data\n\nIn this example, we\u2019ll fine-tune the excellent OpenHermes-2.5-Mistral-7B,\nwhich is a Mistral-7b model that was only supervised fine-tuned. To this end,\nwe\u2019ll use the Intel/orca_dpo_pairs dataset to align our model and improve its\nperformance. We call this new model NeuralHermes-2.5-Mistral-7B.\n\nThe first step consists of installing the required libraries as follows.\n\n \n \n pip install -q datasets trl peft bitsandbytes sentencepiece wandb\n\nOnce it\u2019s done, we can import the libraries. I\u2019m also using the secrets tab in\nGoogle Colab to store my Hugging Face token.\n\n \n \n import os\n import gc\n import torch\n \n import transformers\n from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig\n from datasets import load_dataset\n from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training\n from trl import DPOTrainer\n import bitsandbytes as bnb\n from google.colab import userdata\n import wandb\n \n # Defined in the secrets tab in Google Colab\n hf_token = userdata.get('huggingface')\n wb_token = userdata.get('wandb')\n wandb.login(key=wb_token)\n \n model_name = \"teknium/OpenHermes-2.5-Mistral-7B\"\n new_model = \"NeuralHermes-2.5-Mistral-7B\"\n\nOpenHermes-2.5-Mistral-7B uses a specific chat template, called ChatML. Here\nis an example of a conversation formatted with this template:\n\n \n \n <|im_start|>system\n You are a helpful chatbot assistant.<|im_end|>\n <|im_start|>user\n Hi<|im_end|>\n <|im_start|>assistant\n Hi, how can I help you?<|im_end|>\n\nAs you can see, ChatML defines different roles (system, user, assistant) and\nappends special tokens (`<|im_start|>` and `<|im_end|>`) to separate them.\nMoreover, `DPOTrainer` also requires a specific format with three columns:\nprompt, chosen, and rejected.\n\nOur dataset contains four columns: system, question, chatgpt, and\nllama2\u201313b-chat. We\u2019ll simply concatenate the system and question columns to\nthe prompt column. We\u2019ll also map the chatgpt column to \u201cchosen\u201d and\nllama2\u201313b-chat to \u201crejected\u201d. 
To format the dataset in a reliable way, we\u2019ll\nuse the tokenizer\u2019s `apply_chat_template()` function, which already uses\nChatML.\n\n \n \n def chatml_format(example):\n # Format system\n if len(example['system']) > 0:\n message = {\"role\": \"system\", \"content\": example['system']}\n system = tokenizer.apply_chat_template([message], tokenize=False)\n else:\n system = \"\"\n \n # Format instruction\n message = {\"role\": \"user\", \"content\": example['question']}\n prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)\n \n # Format chosen answer\n chosen = example['chosen'] + \"<|im_end|>\\n\"\n \n # Format rejected answer\n rejected = example['rejected'] + \"<|im_end|>\\n\"\n \n return {\n \"prompt\": system + prompt,\n \"chosen\": chosen,\n \"rejected\": rejected,\n }\n \n # Load dataset\n dataset = load_dataset(\"Intel/orca_dpo_pairs\")['train']\n \n # Save columns\n original_columns = dataset.column_names\n \n # Tokenizer\n tokenizer = AutoTokenizer.from_pretrained(model_name)\n tokenizer.pad_token = tokenizer.eos_token\n tokenizer.padding_side = \"left\"\n \n # Format dataset\n dataset = dataset.map(\n chatml_format,\n remove_columns=original_columns\n )\n\nLet\u2019s print a sample of the formatted dataset to confirm that everything works\nas expected:\n\n \n \n {'prompt': '<|im_start|>system\\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer.<|im_end|>\\n<|im_start|>user\\nGenerate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One<|im_end|>\\n<|im_start|>assistant\\n',\n 'chosen': 'Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One.<|im_end|>\\n',\n 'rejected': ' Sure! Here\\'s a sentence that describes all the data you provided:\\n\\n\"Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes.\"<|im_end|>\\n'}\n\nWe can see that the prompt combines system and user instructions. Thanks to\nthe `add_generation_prompt=True` argument, it also appends the beginning of\nthe assistant's answer. If you want to skip this step, you can directly used\nthe preprocessed dataset as mlabonne/chatml_dpo_pairs.\n\n### \u2699\ufe0f Training the model with DPO\n\nNext, we define the LoRA configurations to train the model. As described in\nIntel\u2019s blog post, we set the rank value to be equal to the `lora_alpha`,\nwhich is unusual (2 * `r` as a rule of thumb). We also target all the linear\nmodules with adapters.\n\n \n \n # LoRA configuration\n peft_config = LoraConfig(\n r=16,\n lora_alpha=16,\n lora_dropout=0.05,\n bias=\"none\",\n task_type=\"CAUSAL_LM\",\n target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']\n )\n\nWe\u2019re now ready to load the model we want to fine-tune with DPO. In this case,\ntwo models are required: the model to fine-tune as well as the reference\nmodel. 
This is mostly for the sake of readability, as the `DPOTrainer` object\nautomatically creates a reference model if none is provided.\n\n \n \n # Model to fine-tune\n model = AutoModelForCausalLM.from_pretrained(\n model_name,\n torch_dtype=torch.float16,\n load_in_4bit=True\n )\n model.config.use_cache = False\n \n # Reference model\n ref_model = AutoModelForCausalLM.from_pretrained(\n model_name,\n torch_dtype=torch.float16,\n load_in_4bit=True\n )\n\nThe final step consists of providing all the hyperparameters to\n`TrainingArguments` and `DPOTrainer`:\n\n * Among them, the `beta` parameter is unique to DPO since it controls the divergence from the initial policy (0.1 is a typical value for it).\n\n * Compared to the values described in Intel\u2019s blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I manually optimized these values after a few runs to stabilize training and achieve the best results.\n\nWe can now start training the model. Note that it requires an A100 GPU and\ntakes between 1 hour to complete the training.\n\n \n \n # Training arguments\n training_args = TrainingArguments(\n per_device_train_batch_size=4,\n gradient_accumulation_steps=4,\n gradient_checkpointing=True,\n learning_rate=5e-5,\n lr_scheduler_type=\"cosine\",\n max_steps=200,\n save_strategy=\"no\",\n logging_steps=1,\n output_dir=new_model,\n optim=\"paged_adamw_32bit\",\n warmup_steps=100,\n bf16=True,\n report_to=\"wandb\",\n )\n \n # Create DPO trainer\n dpo_trainer = DPOTrainer(\n model,\n ref_model,\n args=training_args,\n train_dataset=dataset,\n tokenizer=tokenizer,\n peft_config=peft_config,\n beta=0.1,\n max_prompt_length=1024,\n max_length=1536,\n )\n \n # Fine-tune model with DPO\n dpo_trainer.train()\n\nOur model is now fine-tuned. You can check the project on Weights & Biases at\nthis address. Here are some interesting metrics to analyze:\n\nImage by author\n\nInterestingly, the training loss quickly drops to zero (before 50 steps),\ndespite 100 warmup steps. Meanwhile, the other metrics keep evolving.\n\nThe train/rewards/chosen and train/rewards/rejected plots correspond to the\nmean difference between the log probabilities output by the trained and\nreference models. It makes sense that, over time, they diverge as our trained\nmodel learns the preferred answers. The train/rewards/margins plot also shows\nthe difference between these two plots. Finally, the train/reward/accuracies\nplot shows the frequency of choosing the preferred answer. The trained model\nquickly reaches a perfect accuracy score, which is a good sign but could also\nmean that the difference between preferred and rejected answers is too\nobvious.\n\nNow that it\u2019s trained, we can merge the adapter with the original model. 
Next,\nwe save the merged model and the tokenizer before pushing it to the Hugging\nFace Hub.\n\n \n \n # Save artifacts\n dpo_trainer.model.save_pretrained(\"final_checkpoint\")\n tokenizer.save_pretrained(\"final_checkpoint\")\n \n # Flush memory\n del dpo_trainer, model, ref_model\n gc.collect()\n torch.cuda.empty_cache()\n \n # Reload model in FP16 (instead of NF4)\n base_model = AutoModelForCausalLM.from_pretrained(\n model_name,\n return_dict=True,\n torch_dtype=torch.float16,\n )\n tokenizer = AutoTokenizer.from_pretrained(model_name)\n \n # Merge base model with the adapter\n model = PeftModel.from_pretrained(base_model, \"final_checkpoint\")\n model = model.merge_and_unload()\n \n # Save model and tokenizer\n model.save_pretrained(new_model)\n tokenizer.save_pretrained(new_model)\n \n # Push them to the HF Hub\n model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)\n tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)\n\nLet\u2019s see how our model performs in a real test. We\u2019ll format the prompt to\nask a basic question: \u201cWhat is a Large Language Model?\u201d\n\n \n \n # Format prompt\n message = [\n {\"role\": \"system\", \"content\": \"You are a helpful assistant chatbot.\"},\n {\"role\": \"user\", \"content\": \"What is a Large Language Model?\"}\n ]\n tokenizer = AutoTokenizer.from_pretrained(new_model)\n prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)\n \n # Create pipeline\n pipeline = transformers.pipeline(\n \"text-generation\",\n model=new_model,\n tokenizer=tokenizer\n )\n \n # Generate text\n sequences = pipeline(\n prompt,\n do_sample=True,\n temperature=0.7,\n top_p=0.9,\n num_return_sequences=1,\n max_length=200,\n )\n print(sequences[0]['generated_text'])\n\nHere\u2019s the answer from the model:\n\n \n \n A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.\n\nEverything seems to be working, we can now evaluate the merged model. As this\nis a general-purpose model, we can leverage the lm-evaluation-harness to\nevaluate it. As the process is quite resource-intensive, we can also directly\nsubmit it for evaluation on the Open LLM Leaderboard. It took a few days, but\nhere are the results compared to other OpenHermes models:\n\nImage by author\n\nCompared to the original model, NeuralHermes-2\u20135-Mistral-7B model improved the\naverage score by 6.7 points (particularly on GSM8K). This is an unexpectedly\nlarge improvement, which showcases the power of Direct Preference\nOptimization.\n\n### Conclusion\n\nIn this article, we fine-tuned an already supervised fine-tuned model using\nDPO and created our own NeuralHermes-2.5 model. 
By leveraging a high-quality\npreference dataset, we created a sample-efficient fine-tuning pipeline that\nproduced a significant improvement on the Open LLM Leaderboard. If you want to\ngive it a try, you can find quantized variants of this model or use this\nHugging Face Space.\n\nNote that our fine-tuning pipeline can still be improved in different ways.\nFor example, the preference dataset is still quite raw and could be improved\nwith more filtering and by using different models. In addition, numerous\nhyperparameters can still be tweaked to achieve better results. In particular,\nthe learning rate can still be lowered to train the model on more steps and\ninject more preference data.\n\n### References\n\n * Fine-tune Llama 2 with DPO by Kashif Rasul, Younes Belkada, and Leandro von Werra.\n\n * Supervised Fine-Tuning and Direct Preference Optimization on Intel Gaudi2 by Kaokao Lv, Wenxin Zhang, and Haihao Shen.\n\n * llama2-fine-tune by mzbac.\n\n_Learn more about machine learning and support my work with one click \u2014 become\na Medium member here:_\n\n**Join Medium with my referral link - Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\n1\n\nShare this post\n\n#### Fine-tune a Mistral-7b model with Direct Preference Optimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac" }, { "id": "cedddb77-189c-4ef8-a1af-d9b19d105fcd", "content": { "Title": "ExLlamaV2: The Fastest Library to Run LLMs", "Subtitle": "Quantize and run EXL2 models", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### ExLlamaV2: The Fastest Library to Run LLMs\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# ExLlamaV2: The Fastest Library to Run LLMs\n\n### Quantize and run EXL2 models\n\nMaxime Labonne\n\nNov 20, 2023\n\nShare this post\n\n#### ExLlamaV2: The Fastest Library to Run LLMs\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Quantize and run EXL2 models\n\nImage by author\n\nQuantizing Large Language Models (LLMs) is the most popular approach to reduce\nthe size of these models and speed up inference. Among these techniques, GPTQ\ndelivers amazing performance on GPUs. Compared to unquantized models, this\nmethod uses almost 3 times less VRAM while providing a similar level of\naccuracy and faster generation. It became so popular that it has recently been\ndirectly integrated into the transformers library.\n\n**ExLlamaV2** is a library designed to squeeze even more performance out of\nGPTQ. Thanks to new kernels, it\u2019s optimized for (blazingly) fast inference. 
It\nalso introduces a new quantization format, EXL2, which brings a lot of\nflexibility to how weights are stored.\n\nIn this article, we will see how to quantize base models in the EXL2 format\nand how to run them. As usual, the code is available on GitHub and Google\nColab.\n\n### \u26a1 Quantize EXL2 models\n\nTo start our exploration, we need to install the ExLlamaV2 library. In this\ncase, we want to be able to use some scripts contained in the repo, which is\nwhy we will install it from source as follows:\n\n \n \n git clone https://github.com/turboderp/exllamav2\n pip install exllamav2\n\nNow that ExLlamaV2 is installed, we need to download the model we want to\nquantize in this format. Let\u2019s use the excellent zephyr-7B-beta, a Mistral-7B\nmodel fine-tuned using Direct Preference Optimization (DPO). It claims to\noutperform Llama-2 70b chat on the MT bench, which is an impressive result for\na model that is ten times smaller. You can try out the base Zephyr model using\nthis space.\n\nWe download zephyr-7B-beta using the following command (this can take a while\nsince the model is about 15 GB):\n\n \n \n git lfs install\n git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta\n\nGPTQ also requires a **calibration dataset** , which is used to measure the\nimpact of the quantization process by comparing the outputs of the base model\nand its quantized version. We will use the wikitext dataset and directly\ndownload the test file as follows:\n\n \n \n wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet\n\nOnce it\u2019s done, we can leverage the `convert.py` script provided by the\nExLlamaV2 library. We're mostly concerned with four arguments:\n\n * `-i`: Path of the base model to convert in HF format (FP16).\n\n * `-o`: Path of the working directory with temporary files and final output.\n\n * `-c`: Path of the calibration dataset (in Parquet format).\n\n * `-b`: Target average number of bits per weight (bpw). For example, 4.0 bpw will give store weights in 4-bit precision.\n\nThe complete list of arguments is available on this page. Let\u2019s start the\nquantization process using the `convert.py` script with the following\narguments:\n\n \n \n mkdir quant\n python python exllamav2/convert.py \\\n -i base_model \\\n -o quant \\\n -c wikitext-test.parquet \\\n -b 5.0\n\nNote that you will need a GPU to quantize this model. The official\ndocumentation specifies that you need approximately 8 GB of VRAM for a 7B\nmodel, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours\nand 10 minutes to quantize zephyr-7b-beta using a T4 GPU.\n\nUnder the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision\nof the weights while minimizing the impact on the output. You can find more\ndetails about the GPTQ algorithm in this article.\n\nSo why are we using the \u201cEXL2\u201d format instead of the regular GPTQ format? EXL2\ncomes with a few new features:\n\n * It supports **different levels of quantization** : it\u2019s not restricted to 4-bit precision and can handle 2, 3, 4, 5, 6, and 8-bit quantization.\n\n * It can **mix different precisions** within a model and within each layer to preserve the most important weights and layers with more bits.\n\nExLlamaV2 uses this additional flexibility during quantization. It tries\ndifferent quantization parameters and measures the error they introduce. 
On\ntop of trying to minimize the error, ExLlamaV2 also has to achieve the target\naverage number of bits per weight given as an argument. Thanks to this\nbehavior, we can create quantized models with an average number of bits per\nweight of 3.5 or 4.5 for example.\n\nThe benchmark of different parameters it creates is saved in the\n`measurement.json` file. The following JSON shows the measurement for one\nlayer:\n\n \n \n \"key\": \"model.layers.0.self_attn.q_proj\",\n \"numel\": 16777216,\n \"options\": [\n {\n \"desc\": \"0.05:3b/0.95:2b 32g s4\",\n \"bpw\": 2.1878662109375,\n \"total_bits\": 36706304.0,\n \"err\": 0.011161142960190773,\n \"qparams\": {\n \"group_size\": 32,\n \"bits\": [\n 3,\n 2\n ],\n \"bits_prop\": [\n 0.05,\n 0.95\n ],\n \"scale_bits\": 4\n }\n },\n\nIn this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for an\naverage value of 2.188 bpw and a group size of 32. This introduced a\nnoticeable error that is taken into account to select the best parameters.\n\n### \ud83e\udd99 Running ExLlamaV2 for Inference\n\nNow that our model is quantized, we want to run it to see how it performs.\nBefore that, we need to copy essential config files from the `base_model`\ndirectory to the new `quant` directory. Basically, we want every file that is\nnot hidden (`.*`) or a safetensors file. Additionally, we don't need the\n`out_tensor` directory that was created by ExLlamaV2 during quantization.\n\nIn bash, you can implement this as follows:\n\n \n \n !rm -rf quant/out_tensor\n !rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/\n\nOur EXL2 model is ready and we have several options to run it. The most\nstraightforward method consists of using the `test_inference.py` script in the\nExLlamaV2 repo (note that I don\u2019t use a chat template here):\n\n \n \n python exllamav2/test_inference.py -m quant/ -p \"I have a dream\"\n\nThe generation is very fast (56.44 tokens/second on a T4 GPU), even compared\nto other quantization techniques and tools like GGUF/llama.cpp or GPTQ. You\ncan find an in-depth comparison between different solutions in this excellent\narticle from oobabooga.\n\nIn my case, the LLM returned the following output:\n\n \n \n -- Model: quant/\n -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']\n -- Loading model...\n -- Loading tokenizer...\n -- Warmup...\n -- Generating...\n \n I have a dream. <|user|>\n Wow, that's an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let's make this speech truly unforgettable! \n \n Absolutely! Here's your updated speech:\n \n Dear fellow citizens,\n \n Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors\n \n -- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (includes prompt eval.)\n\nAlternatively, you can use a chat version with the `chatcode.py` script for\nmore flexibility:\n\n \n \n python exllamav2/examples/chatcode.py -m quant -mode llama\n\nIf you\u2019re planning to use an EXL2 model more regularly, ExLlamaV2 has been\nintegrated into several backends like oobabooga\u2019s text generation web UI. 
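You can also drive an EXL2 model directly from Python. The exact class names and arguments depend on the exllamav2 version you install, so treat the following as a rough sketch modeled on the repository's example scripts rather than a stable API (it assumes the quantized model sits in the `quant/` directory we just prepared):

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    # Point the config at the quantized model directory
    config = ExLlamaV2Config()
    config.model_dir = "quant"
    config.prepare()

    # Load the model, tokenizer, and KV cache
    model = ExLlamaV2(config)
    model.load()
    tokenizer = ExLlamaV2Tokenizer(config)
    cache = ExLlamaV2Cache(model)

    # Simple (non-streaming) generation with basic sampling settings
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.85
    settings.top_p = 0.8

    print(generator.generate_simple("I have a dream", settings, 128))

The repository also ships a streaming generator built on the same objects if you want token-by-token output for a chat-style interface.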
Note\nthat it requires FlashAttention 2 to work properly, which requires CUDA 12.1\non Windows at the moment (something you can configure during the installation\nprocess).\n\nNow that we tested the model, we\u2019re ready to upload it to the Hugging Face\nHub. You can change the name of your repo in the following code snippet and\nsimply run it.\n\n \n \n from huggingface_hub import notebook_login\n from huggingface_hub import HfApi\n \n notebook_login()\n api = HfApi()\n api.create_repo(\n repo_id=f\"mlabonne/zephyr-7b-beta-5.0bpw-exl2\",\n repo_type=\"model\"\n )\n api.upload_folder(\n repo_id=f\"mlabonne/zephyr-7b-beta-5.0bpw-exl2\",\n folder_path=\"quant\",\n )\n\nGreat, the model can be found on the Hugging Face Hub. The code in the\nnotebook is quite general and can allow you to quantize different models,\nusing different values of bpw. This is ideal for creating models dedicated to\nyour hardware.\n\n### Conclusion\n\nIn this article, we presented ExLlamaV2, a powerful library to quantize LLMs.\nIt is also a fantastic tool to run them since it provides the highest number\nof tokens per second compared to other solutions like GPTQ or llama.cpp. We\napplied it to the zephyr-7B-beta model to create a 5.0 bpw version of it,\nusing the new EXL2 format. After quantization, we tested our model to see how\nit performs. Finally, it was uploaded to the Hugging Face Hub and can be found\nhere.\n\nIf you\u2019re interested in more technical content around LLMs, follow me on\nMedium.\n\n### Articles about quantization\n\n**Introduction to Weight Quantization** \n _Reducing the size of Large Language Models with 8-bit\nquantization_towardsdatascience.com\n\n**4-bit Quantization with GPTQ** \n _Quantize your own LLMs using AutoGPTQ_towardsdatascience.com\n\n _Learn more about machine learning and support my work with one click \u2014\nbecome a Medium member here:_\n\n**Join Medium with my referral link - Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\nShare this post\n\n#### ExLlamaV2: The Fastest Library to Run LLMs\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/exllamav2-the-fastest-library-to-run-llms-32aeda294d26" }, { "id": "715b7861-0f40-4025-bf87-7dddeabaf278", "content": { "Title": "Quantize Llama models with GGML and llama.cpp", "Subtitle": "GGML vs. GPTQ vs. NF4", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Quantize Llama models with GGML and llama.cpp\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Quantize Llama models with GGML and llama.cpp\n\n### GGML vs. GPTQ vs. 
NF4\n\nMaxime Labonne\n\nSep 04, 2023\n\nShare this post\n\n#### Quantize Llama models with GGML and llama.cpp\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### GGML vs. GPTQ vs. NF4\n\nImage by author\n\nDue to the massive size of Large Language Models (LLMs), quantization has\nbecome an essential technique to run them efficiently. By reducing the\nprecision of their weights, you can save memory and speed up inference while\npreserving most of the model\u2019s performance. Recently, 8-bit and 4-bit\nquantization unlocked the possibility of **running LLMs on consumer\nhardware**. Coupled with the release of Llama models and parameter-efficient\ntechniques to fine-tune them (LoRA, QLoRA), this created a rich ecosystem of\nlocal LLMs that are now competing with OpenAI\u2019s GPT-3.5 and GPT-4.\n\nBesides the naive approach covered in this article, there are three main\nquantization techniques: NF4, GPTQ, and GGML. NF4 is a static method used by\nQLoRA to load a model in 4-bit precision to perform fine-tuning. In a previous\narticle, we explored the GPTQ method and quantized our own model to run it on\na consumer GPU. In this article, we will introduce the GGML technique, see how\nto quantize Llama models, and provide tips and tricks to achieve the best\nresults.\n\nYou can find the code on Google Colab and GitHub.\n\n### What is GGML?\n\nGGML is a C library focused on machine learning. It was created by Georgi\nGerganov, which is what the initials \u201cGG\u201d stand for. This library not only\nprovides foundational elements for machine learning, such as tensors, but also\na **unique binary format** to distribute LLMs.\n\nThis format recently changed to **GGUF**. This new format is designed to be\nextensible, so that new features shouldn\u2019t break compatibility with existing\nmodels. It also centralizes all the metadata in one file, such as special\ntokens, RoPE scaling parameters, etc. In short, it answers a few historical\npain points and should be future-proof. For more information, you can read the\nspecification at this address. In the rest of the article, we will call \u201cGGML\nmodels\u201d all models that either use GGUF or previous formats.\n\nGGML was designed to be used in conjunction with the llama.cpp library, also\ncreated by Georgi Gerganov. The library is written in C/C++ for efficient\ninference of Llama models. It can load GGML models and **run them on a CPU**.\nOriginally, this was the main difference with GPTQ models, which are loaded\nand run on a GPU. However, you can now offload some layers of your LLM to the\nGPU with llama.cpp. To give you an example, there are 35 layers for a 7b\nparameter model. This drastically speeds up inference and allows you to run\nLLMs that don\u2019t fit in your VRAM.\n\nImage by author\n\nIf command-line tools are your thing, llama.cpp and GGUF support have been\nintegrated into many GUIs, like oobabooga\u2019s text-generation-web-ui, koboldcpp,\nLM Studio, or ctransformers. You can simply load your GGML models with these\ntools and interact with them in a ChatGPT-like way. Fortunately, many\nquantized models are directly available on the Hugging Face Hub. You\u2019ll\nquickly notice that most of them are quantized by TheBloke, a popular figure\nin the LLM community.\n\nIn the next section, we will see how to quantize our own models and run them\non a consumer GPU.\n\n### How to quantize LLMs with GGML?\n\nLet\u2019s look at the files inside of TheBloke/Llama-2\u201313B-chat-GGML repo. 
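If you prefer to inspect the repo programmatically rather than through the web interface, here is a small sketch using the `list_repo_files` helper from `huggingface_hub` (the repo id is assumed to be `TheBloke/Llama-2-13B-chat-GGML`):

    from huggingface_hub import list_repo_files

    # List every file in the repo and keep only the GGML weight files
    files = list_repo_files("TheBloke/Llama-2-13B-chat-GGML")
    for f in sorted(files):
        if f.endswith(".bin"):
            print(f)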
We can\nsee **14 different GGML models** , corresponding to different types of\nquantization. They follow a particular naming convention: \u201cq\u201d + the number of\nbits used to store the weights (precision) + a particular variant. Here is a\nlist of all the possible quant methods and their corresponding use cases,\nbased on model cards made by TheBloke:\n\n * `q2_k`: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.\n\n * `q3_k_l`: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K\n\n * `q3_k_m`: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K\n\n * `q3_k_s`: Uses Q3_K for all tensors\n\n * `q4_0`: Original quant method, 4-bit.\n\n * `q4_1`: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.\n\n * `q4_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K\n\n * `q4_k_s`: Uses Q4_K for all tensors\n\n * `q5_0`: Higher accuracy, higher resource usage and slower inference.\n\n * `q5_1`: Even higher accuracy, resource usage and slower inference.\n\n * `q5_k_m`: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K\n\n * `q5_k_s`: Uses Q5_K for all tensors\n\n * `q6_k`: Uses Q8_K for all tensors\n\n * `q8_0`: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.\n\nAs a rule of thumb, **I recommend using Q5_K_M** as it preserves most of the\nmodel\u2019s performance. Alternatively, you can use Q4_K_M if you want to save\nsome memory. In general, K_M versions are better than K_S versions. I cannot\nrecommend Q2 or Q3 versions, as they drastically decrease model performance.\n\nNow that we know more about the quantization types available, let\u2019s see how to\nuse them on a real model. You can execute the following code on a **free T4\nGPU** on Google Colab. The first step consists of compiling llama.cpp and\ninstalling the required libraries in our Python environment.\n\n \n \n # Install llama.cpp\n !git clone https://github.com/ggerganov/llama.cpp\n !cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make\n !pip install -r llama.cpp/requirements.txt\n\nNow we can download our model. We will use the model we fine-tuned in the\nprevious article, `mlabonne/EvolCodeLlama-7b`.\n\n \n \n MODEL_ID = \"mlabonne/EvolCodeLlama-7b\"\n \n # Download model\n !git lfs install\n !git clone https://huggingface.co/{MODEL_ID}\n\nThis step can take a while. Once it\u2019s done, we need to convert our weight to\nGGML FP16 format.\n\n \n \n MODEL_NAME = MODEL_ID.split('/')[-1]\n GGML_VERSION = \"gguf\"\n \n # Convert to fp16\n fp16 = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.fp16.bin\"\n !python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}\n\nFinally, we can quantize the model using one or several methods. In this case,\nwe will use the Q4_K_M and Q5_K_M methods I recommended earlier. This is the\nonly step that actually requires a GPU.\n\n \n \n QUANTIZATION_METHODS = [\"q4_k_m\", \"q5_k_m\"]\n \n for method in QUANTIZATION_METHODS:\n qtype = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{method}.bin\"\n !./llama.cpp/quantize {fp16} {qtype} {method}\n\nOur two quantized models are now **ready for inference**. We can check the\nsize of the bin files to see how much we compressed them. 
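A minimal way to do that, reusing the `MODEL_NAME` and `GGML_VERSION` variables defined above, is to list the converted files with their sizes:

    import os

    # Print the size of each converted/quantized file in GB
    for file in sorted(os.listdir(MODEL_NAME)):
        if GGML_VERSION in file:
            size_gb = os.path.getsize(os.path.join(MODEL_NAME, file)) / 1e9
            print(f"{file}: {size_gb:.2f} GB")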
The FP16 model takes\nup 13.5 GB, while the Q4_K_M model takes up 4.08 GB (3.3 times smaller) and\nthe Q5_K_M model takes up 4.78 GB (2.8 times smaller).\n\nLet\u2019s use llama.cpp to efficiently run them. Since we\u2019re using a GPU with 16\nGB of VRAM, we can offload every layer to the GPU. In this case, it represents\n35 layers (7b parameter model), so we\u2019ll use the `-ngl 35` parameter. In the\nfollowing code block, we'll also input a prompt and the quantization method we\nwant to use.\n\n \n \n import os\n \n model_list = [file for file in os.listdir(MODEL_NAME) if GGML_VERSION in file]\n prompt = input(\"Enter your prompt: \")\n chosen_method = input(\"Please specify the quantization method to run the model (options: \" + \", \".join(model_list) + \"): \")\n \n # Verify the chosen method is in the list\n if chosen_method not in model_list:\n print(\"Invalid method chosen!\")\n else:\n qtype = f\"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{method}.bin\"\n !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p \"{prompt}\"\n\nLet\u2019s ask the model \u201cWrite a Python function to print the nth Fibonacci\nnumbers\u201d using the Q5_K_M method. If we look at the logs, we can confirm that\nwe successfully offloaded our layers thanks to the line \u201cllm_load_tensors:\noffloaded 35/35 layers to GPU\u201d. Here is the code the model generated:\n\n \n \n def fib(n):\n if n == 0 or n == 1:\n return n\n return fib(n - 2) + fib(n - 1)\n \n for i in range(1, 10):\n print(fib(i))\n\nThis wasn\u2019t a very complex prompt, but it successfully produced a working\npiece of code in no time. With this GGML, you can use your local LLM as an\nassistant in a terminal using the interactive mode (`-i` flag). Note that this\nalso works on Macbooks with Apple's Metal Performance Shaders (MPS), which is\nan excellent option to run LLMs.\n\nFinally, we can push our quantized model to a new repo on the Hugging Face Hub\nwith the \u201c-GGUF\u201d suffix. First, let\u2019s log in and modify the following code\nblock to match your username.\n\n \n \n !pip install -q huggingface_hub\n \n username = \"mlabonne\"\n \n from huggingface_hub import notebook_login, create_repo, HfApi\n notebook_login()\n\nNow we can create the repo and upload our models. We use the `allow_patterns`\nparameter to filter which files to upload, so we don't push the entirety of\nthe directory.\n\n \n \n api = HfApi()\n \n # Create repo\n create_repo(\n repo_id=f\"{username}/{MODEL_NAME}-GGML\",\n repo_type=\"model\",\n exist_ok=True\n )\n \n # Upload bin models\n api.upload_folder(\n folder_path=MODEL_NAME,\n repo_id=f\"{username}/{MODEL_NAME}-GGML\",\n allow_patterns=f\"*{GGML_VERSION}*\",\n )\n\nWe have successfully quantized, run, and pushed GGML models to the Hugging\nFace Hub! In the next section, we will explore how GGML actually quantize\nthese models.\n\n### Quantization with GGML\n\nThe way GGML quantizes weights is not as sophisticated as GPTQ\u2019s. Basically,\nit groups blocks of values and rounds them to a lower precision. Some\ntechniques, like Q4_K_M and Q5_K_M, implement a **higher precision for\ncritical layers**. In this case, every weight is stored in 4-bit precision,\nwith the exception of half of the attention.wv and feed_forward.w2 tensors.\nExperimentally, this mixed precision proves to be a good tradeoff between\naccuracy and resource usage.\n\nIf we look into the ggml.c file, we can see how the blocks are defined. 
For\nexample, the `block_q4_0` structure is defined as:\n\n \n \n #define QK4_0 32\n typedef struct {\n ggml_fp16_t d; // delta\n uint8_t qs[QK4_0 / 2]; // nibbles / quants\n } block_q4_0;\n\nIn GGML, weights are processed in blocks, each consisting of 32 values. For\neach block, a scale factor (delta) is derived from the largest weight value.\nAll weights in the block are then scaled, quantized, and packed efficiently\nfor storage (nibbles). This approach significantly reduces the storage\nrequirements while allowing for a relatively simple and deterministic\nconversion between the original and quantized weights.\n\nNow that we know more about the quantization process, we can compare the\nresults with NF4 and GPTQ.\n\n### NF4 vs. GGML vs. GPTQ\n\nWhich technique is better for 4-bit quantization? To answer this question, we\nneed to introduce the different backends that run these quantized LLMs. For\nGGML models, llama.cpp with Q4_K_M models is the way to go. For GPTQ models,\nwe have two options: AutoGPTQ or ExLlama. Finally, NF4 models can directly be\nrun in transformers with the `--load-in-4bit` flag.\n\nOobabooga ran multiple experiments in an excellent blog post that compare\ndifferent models in terms of perplexity (lower is better):\n\nBased on these results, we can say that GGML models have a slight advantage in\nterms of perplexity. The difference is not particularly significant, which is\nwhy it is better to focus on the generation speed in terms of tokens/second.\nThe best technique depends on your GPU: if you have enough VRAM to fit the\nentire quantized model, **GPTQ with ExLlama** will be the fastest. If that\u2019s\nnot the case, you can offload some layers and use **GGML models with\nllama.cpp** to run your LLM.\n\n### Conclusion\n\nIn this article, we introduced the GGML library and the new GGUF format to\nefficiently store these quantized models. We used it to **quantize our own\nLlama model** in different formats (Q4_K_M and Q5_K_M). We then ran the GGML\nmodel and pushed our bin files to the Hugging Face Hub. Finally, we delved\ndeeper into GGML\u2019s code to understand how it actually quantizes the weights\nand compared it to NF4 and GPTQ.\n\nQuantization is a formidable vector to democratize LLMs by lowering the cost\nof running them. In the future, mixed precision and other techniques will keep\nimproving the performance we can achieve with quantized weights. 
Until then, I\nhope you enjoyed reading this article and learned something new.\n\nIf you\u2019re interested in more technical content around LLMs, follow me on\nMedium.\n\n### Articles about quantization\n\n**Part 1: Introduction to Weight Quantization** \n _Reducing the size of Large Language Models with 8-bit\nquantization_towardsdatascience.com\n\n**Part 2: 4-bit Quantization with GPTQ** \n _Quantize your own LLMs using AutoGPTQ_towardsdatascience.com\n\n _Learn more about machine learning and support my work with one click \u2014\nbecome a Medium member here:_\n\n**Join Medium with my referral link \u2014 Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\nShare this post\n\n#### Quantize Llama models with GGML and llama.cpp\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172" }, { "id": "a219cfaa-c52a-4c7c-aa39-60883cc507cd", "content": { "Title": "A Beginner\u2019s Guide to LLM Fine-Tuning - Maxime Labonne", "Subtitle": "How to fine-tune Llama and other LLMs with one tool", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### A Beginner\u2019s Guide to LLM Fine-Tuning\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# A Beginner\u2019s Guide to LLM Fine-Tuning\n\n### How to fine-tune Llama and other LLMs with one tool\n\nMaxime Labonne\n\nAug 30, 2023\n\n1\n\nShare this post\n\n#### A Beginner\u2019s Guide to LLM Fine-Tuning\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n1\n\nShare\n\n#### How to fine-tune Llama and other LLMs with one tool\n\nImage by author\n\nThe growing interest in Large Language Models (LLMs) has led to a surge in\n**tools and wrappers designed to streamline their training process**.\n\nPopular options include FastChat from LMSYS (used to train Vicuna) and Hugging\nFace\u2019s transformers/trl libraries (used in my previous article). In addition,\neach big LLM project, like WizardLM, tends to have its own training script,\ninspired by the original Alpaca implementation.\n\nIn this article, we will use **Axolotl** , a tool created by the OpenAccess AI\nCollective. We will use it to fine-tune a **Code Llama 7b** model on an evol-\ninstruct dataset comprised of 1,000 samples of Python code.\n\n### \ud83e\udd14 Why Axolotl?\n\nThe main appeal of Axolotl is that it provides a one-stop solution, which\nincludes numerous features, model architectures, and an active community.\nHere\u2019s a quick list of my favorite things about it:\n\n * **Configuration** : All parameters used to train an LLM are neatly stored in a yaml config file. This makes it convenient for sharing and reproducing models. 
You can see an example for Llama 2 here.\n\n * **Dataset Flexibility** : Axolotl allows the specification of multiple datasets with varied prompt formats such as alpaca (`{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}`), sharegpt:chat (`{\"conversations\": [{\"from\": \"...\", \"value\": \"...\"}]}`), and raw completion (`{\"text\": \"...\"}`). Combining datasets is seamless, and the hassle of unifying the prompt format is eliminated.\n\n * **Features** : Axolotl is packed with SOTA techniques such as FSDP, deepspeed, LoRA, QLoRA, ReLoRA, sample packing, GPTQ, FlashAttention, xformers, and rope scaling.\n\n * **Utilities** : There are numerous user-friendly utilities integrated, including the addition or alteration of special tokens, or a custom wandb configuration.\n\nSome well-known models trained using this tool are Manticore-13b from the\nOpenAccess AI Collective and Samantha-1.11\u201370b from Eric Hartford. Like other\nwrappers, it is built on top of the transformers library and uses many of its\nfeatures.\n\n### \u2699\ufe0f Create your own config file\n\nBefore anything, we need a configuration file. You can reuse an existing\nconfiguration from the `examples` folder. In our case, we will tweak the QLoRA\nconfig for Llama 2 to create our own **Code Llama** model. The model will be\ntrained on a subset of 1,000 Python samples from the `nickrosh/Evol-Instruct-\nCode-80k-v1` dataset.\n\nFirst, we must change the `base_model` and `base_model_config` fields to\n\"codellama/CodeLlama-7b-hf\". To push our trained adapter to the Hugging Face\nHub, let's add a new field `hub_model_id`, which corresponds to the name of\nour model, \"EvolCodeLlama-7b\". Now, we have to update the dataset to\n`mlabonne/Evol-Instruct-Python-1k` and set `type` to \"alpaca\".\n\nThere's no sample bigger than 2048 tokens in this dataset, so we can reduce\nthe `sequence_len` to \"2048\" and save some VRAM. Talking about VRAM, we\u2019re\ngoing to use a `micro_batch_size` of 10 and a `gradient_accumulation_steps` of\n1 to maximize its use. In practice, you try different values until you use\n>95% of the available VRAM.\n\nFor convenience, I'm going to add the name \"axolotl\" to the `wandb_project`\nfield so it's easier to track on my account. 
I'm also setting the\n`warmup_steps` to \"100\" (personal preference) and the `eval_steps` to 0.01 so\nwe'll end up with 100 evaluations.\n\nHere\u2019s how the final config file should look:\n\n \n \n base_model: codellama/CodeLlama-7b-hf\n base_model_config: codellama/CodeLlama-7b-hf\n model_type: LlamaForCausalLM\n tokenizer_type: LlamaTokenizer\n is_llama_derived_model: true\n hub_model_id: EvolCodeLlama-7b\n \n load_in_8bit: false\n load_in_4bit: true\n strict: false\n \n datasets:\n - path: mlabonne/Evol-Instruct-Python-1k\n type: alpaca\n dataset_prepared_path: last_run_prepared\n val_set_size: 0.02\n output_dir: ./qlora-out\n \n adapter: qlora\n lora_model_dir:\n \n sequence_len: 2048\n sample_packing: true\n \n lora_r: 32\n lora_alpha: 16\n lora_dropout: 0.05\n lora_target_modules:\n lora_target_linear: true\n lora_fan_in_fan_out:\n \n wandb_project: axolotl\n wandb_entity:\n wandb_watch:\n wandb_run_id:\n wandb_log_model:\n \n gradient_accumulation_steps: 1\n micro_batch_size: 10\n num_epochs: 3\n optimizer: paged_adamw_32bit\n lr_scheduler: cosine\n learning_rate: 0.0002\n \n train_on_inputs: false\n group_by_length: false\n bf16: true\n fp16: false\n tf32: false\n \n gradient_checkpointing: true\n early_stopping_patience:\n resume_from_checkpoint:\n local_rank:\n logging_steps: 1\n xformers_attention:\n flash_attention: true\n \n warmup_steps: 100\n eval_steps: 0.01\n save_strategy: epoch\n save_steps:\n debug:\n deepspeed:\n weight_decay: 0.0\n fsdp:\n fsdp_config:\n special_tokens:\n bos_token: \"\"\n eos_token: \"\"\n unk_token: \"\"\n\nYou can also find this config file here as a GitHub gist.\n\nBefore we start training our model, I want to introduce a few parameters that\nare important to understand:\n\n * **QLoRA** : We\u2019re using QLoRA for fine-tuning, which is why we\u2019re loading the base model in 4-bit precision (NF4 format). You can check this article from Benjamin Marie to know more about QLoRA.\n\n * **Gradient checkpointing** : It lowers the VRAM requirements by removing some activations that are re-computed on demand during the backward pass. It also slows down training by about 20%, according to Hugging Face\u2019s documentation.\n\n * **FlashAttention** : This implements the FlashAttention mechanism, which improves the speed and memory efficiency of our model thanks to a clever fusion of GPU operations (learn more about it in this article from Aleksa Gordi\u0107).\n\n * **Sample packing** : Smart way of creating batches with as little padding as possible, by reorganizing the order of the samples (bin packing problem). As a result, we need fewer batches to train the model on the same dataset. It was inspired by the Multipack Sampler (see my note) and Krell et al.\n\nYou can find FlashAttention in some other tools, but sample packing is\nrelatively new. As far as I know, OpenChat was the first project to use sample\npacking during fine-tuning. Thanks to Axolotl, we\u2019ll use these techniques for\nfree.\n\n### \ud83e\udd99 Fine-tune Code Llama\n\nHaving the config file ready, it\u2019s time to get our hands dirty with the actual\nfine-tuning. You might consider running the training on a Colab notebook.\nHowever, for those without access to a high-performance GPU, a more cost-\neffective solution consists of renting **cloud-based GPU services** , like\nAWS, Lambda Labs, Vast.ai, Banana, or RunPod.\n\nPersonally, I use RunPod, which is a popular option in the fine-tuning\ncommunity. 
It\u2019s not the cheapest service but it hits a good tradeoff with a\nclean UI. You can easily replicate the following steps using your favorite\nservice.\n\nWhen your RunPod account is set up, go to Manage > Templates and click on \u201cNew\nTemplate\u201d. Here is a simple template:\n\nImage by author\n\nLet\u2019s review the different fields and their corresponding values:\n\n * **Template Name** : Axolotl (you can choose whatever you want)\n\n * **Container Image** : winglian/axolotl-runpod:main-py3.10-cu118\u20132.0.1\n\n * **Container Disk** : 100 GB\n\n * **Volume Disk** : 0 GB\n\n * **Volume Mount Path** : /workspace\n\nIn addition, there are two handy environment variables can include:\n\n * **HUGGING_FACE_HUB_TOKEN** : you can find your token on this page (requires an account)\n\n * **WANDB_API_KEY** : you can find your key on this page (requires an account)\n\nAlternatively, you can simply log in the terminal later (using huggingface-cli\nlogin and wandb login). Once you\u2019re set-up, go to Community Cloud and deploy\nan RTX 3090. Here you can search for the name of your template and select it\nas follows:\n\nImage by author\n\nYou can click on \u201cContinue\u201d and RunPod will deploy your template. You can see\nthe installation in your pod\u2019s logs (Manage > Pods). When the option becomes\navailable, click on \u201cConnect\u201d. Here, click on \u201cStart Web Terminal\u201d and then\n\u201cConnect to Web Terminal\u201d. You are now connected to your pod!\n\nThe following steps are **the same no matter what service you choose** :\n\n 1. We install Axolotl and the PEFT library as follows:\n\n \n \n git clone https://github.com/OpenAccess-AI-Collective/axolotl\n cd axolotl\n \n pip3 install -e .[flash-attn]\n pip3 install -U git+https://github.com/huggingface/peft.git\n\n2\\. Download the config file we created:\n\n \n \n wget https://gist.githubusercontent.com/mlabonne/8055f6335e2b85f082c8c75561321a66/raw/93915a9563fcfff8df9a81fc0cdbf63894465922/EvolCodeLlama-7b.yaml\n\n3\\. You can now **start fine-tuning the model** with the following command:\n\n \n \n accelerate launch scripts/finetune.py EvolCodeLlama-7b.yaml\n\nIf everything is configured correctly, you should be able to train the model\nin a little more than **one hour** (it took me 1h 11m 44s). If you check the\nGPU memory used, you\u2019ll see almost 100% with this config, which means we\u2019re\noptimizing it pretty nicely. If you\u2019re using a GPU with more VRAM (like an\nA100), you can increase the micro-batch size to make sure you\u2019re fully using\nit.\n\nIn the meantime, feel free to close the web terminal and check your loss on\nWeights & Biases. We\u2019re using tmux so the training won\u2019t stop if you close the\nterminal. Here are my loss curves:\n\nImage by author\n\nWe see a steady improvement in the eval loss, which is a good sign. However,\nyou can also spot drops in the eval loss that are not correlated with a\ndecrease in the quality of the outputs\u2026 The best way to evaluate your model is\nsimply by using it: you can run it in the terminal with the command\n`accelerate launch scripts/finetune.py EvolCodeLlama-7b.yaml --inference\n--lora_model_dir=\"./qlora-out\"`.\n\nThe QLoRA adapter should already be uploaded to the Hugging Face Hub. However,\nyou can also **merge the base Code Llama model with this adapter and push the\nmerged model** there by following these steps:\n\n 1. 
Download this script:\n\n \n \n wget https://gist.githubusercontent.com/mlabonne/a3542b0519708b8871d0703c938bba9f/raw/60abc5afc07f9d843bc23d56f4e0b7ab072c4a62/merge_peft.py\n\n2\\. Execute it with this command:\n\n \n \n python merge_peft.py --base_model=codellama/CodeLlama-7b-hf --peft_model=./qlora-out --hub_id=EvolCodeLlama-7b\n\nCongratulations, you should have **your own EvolCodeLlama-7b** on the Hugging\nFace Hub at this point! For reference, you can access my own model trained\nwith this process here: `mlabonne/EvolCodeLlama-7b`\n\nConsidering that our EvolCodeLlama-7b is a code LLM, it would be interesting\nto compare its performance with other models on **standard benchmarks** , such\nas HumanEval and MBPP. For reference, you can find a leaderboard at the\nfollowing address: Multilingual Code Evals.\n\nIf you\u2019re happy with this model, you can **quantize** it with GGML for local\ninference with this free Google Colab notebook. You can also fine-tune\n**bigger models** (e.g., 70b parameters) thanks to deepspeed, which only\nrequires an additional config file.\n\n### Conclusion\n\nIn this article, we\u2019ve covered the essentials of **how to efficiently fine-\ntune LLMs**. We customized parameters to train on our Code Llama model on a\nsmall Python dataset. Finally, we merged the weights and uploaded the result\non Hugging Face.\n\nI hope you found this guide useful. I recommend using Axolotl with a cloud-\nbased GPU service to get some experience and upload a few models on Hugging\nFace. Build your own datasets, play with the parameters, and break stuff along\nthe way. Like with every wrapper, don\u2019t hesitate to check the source code to\nget a good intuition of what it\u2019s actually doing. It will massively help in\nthe long run.\n\nThanks to the OpenAccess AI Collective and all the contributors!\n\nIf you\u2019re interested in more technical content around LLMs, follow me on\nMedium.\n\n### Related articles\n\n**Fine-Tune Your Own Llama 2 Model in a Colab Notebook** \n _A practical introduction to LLM fine-tuning_towardsdatascience.com\n\n**4-bit Quantization with GPTQ** \n _Quantize your own LLMs using AutoGPTQ_towardsdatascience.com\n\n _Learn more about machine learning and support my work with one click \u2014\nbecome a Medium member here:_\n\n**Join Medium with my referral link - Maxime Labonne** \n _As a Medium member, a portion of your membership fee goes to writers you\nread, and you get full access to every story\u2026_medium.com\n\n1\n\nShare this post\n\n#### A Beginner\u2019s Guide to LLM Fine-Tuning\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n1\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| DanielJun 23Thanks for this great article! One question: How do you deal\nwith the issue that the chat template defined in the Axolotl config for\ntraining and a chat template used for inference (e.g. when you load the model\nfrom the Hub via HuggingFace transformers method .from_pretrained and use\ntheir chat template) might be different? 
If I am not mistaken then the Axolotl\ntemplates assembles prompts in token space, whereas HF chat templates\nassembles them in string space, which might cause tokenization mismatches?\nExpand full commentReplyShare \n---|--- \n \nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/a-beginners-guide-to-llm-fine-tuning-4bae7d4da672" }, { "id": "30f815cd-5776-4f2f-9b1d-4038f07ec65e", "content": { "Title": "Graph Convolutional Networks: Introduction to GNNs", "Subtitle": "A step-by-step guide using PyTorch Geometric", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Graph Convolutional Networks: Introduction to GNNs\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Graph Convolutional Networks: Introduction to GNNs\n\n### A step-by-step guide using PyTorch Geometric\n\nMaxime Labonne\n\nAug 14, 2023\n\n2\n\nShare this post\n\n#### Graph Convolutional Networks: Introduction to GNNs\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### A step-by-step guide using PyTorch Geometric\n\nImage by author\n\n**Graph Neural Networks** (GNNs) represent one of the most captivating and\nrapidly evolving architectures within the deep learning landscape. As deep\nlearning models designed to process data structured as graphs, GNNs bring\nremarkable versatility and powerful learning capabilities.\n\nAmong the various types of GNNs, the **Graph Convolutional Networks** (GCNs)\nhave emerged as the most prevalent and broadly applied model. GCNs are\ninnovative due to their ability to leverage both the features of a node and\nits locality to make predictions, providing an effective way to handle graph-\nstructured data.\n\nIn this article, we will delve into the mechanics of the GCN layer and explain\nits inner workings. Furthermore, we will explore its practical application for\nnode classification tasks, using PyTorch Geometric as our tool of choice.\n\nPyTorch Geometric is a specialized extension of PyTorch that has been created\nspecifically for the development and implementation of GNNs. It is an\nadvanced, yet user-friendly library that provides a comprehensive suite of\ntools to facilitate graph-based machine learning. To commence our journey, the\nPyTorch Geometric installation will be required. If you are using Google\nColab, PyTorch should already be in place, so all we need to do is execute a\nfew additional commands.\n\nAll the code is available on Google Colab and GitHub.\n\n \n \n !pip install torch_geometric\n \n \n import torch\n import numpy as np\n import networkx as nx\n import matplotlib.pyplot as plt\n\nNow that PyTorch Geometric is installed, let\u2019s explore the dataset we will use\nin this tutorial.\n\n### \ud83c\udf10 I. Graph data\n\nGraphs are an essential structure for representing relationships between\nobjects. 
You can encounter graph data in a multitude of real-world scenarios,\nsuch as social and computer networks, chemical structures of molecules,\nnatural language processing, and image recognition, to name a few.\n\nIn this article, we will study the infamous and much-used Zachary\u2019s karate\nclub dataset.\n\nImage by author\n\nThe Zachary\u2019s karate club dataset embodies the relationships formed within a\nkarate club as observed by Wayne W. Zachary during the 1970s. It is a kind of\nsocial network, where each node represents a club member, and edges between\nnodes represent interactions that occurred outside the club environment.\n\nIn this particular scenario, the members of the club are split into four\ndistinct groups. Our task is to **assign the correct group to each member**\n(node classification), based on the pattern of their interactions.\n\nLet\u2019s import the dataset with PyG\u2019s built-in function and try to understand\nthe `Datasets` object it uses.\n\n \n \n from torch_geometric.datasets import KarateClub\n \n \n # Import dataset from PyTorch Geometric\n dataset = KarateClub()\n \n \n # Print information\n print(dataset)\n print('------------')\n print(f'Number of graphs: {len(dataset)}')\n print(f'Number of features: {dataset.num_features}')\n print(f'Number of classes: {dataset.num_classes}')\n \n \n KarateClub()\n ------------\n Number of graphs: 1\n Number of features: 34\n Number of classes: 4\n\nThis dataset only has 1 graph, where each node has a feature vector of 34\ndimensions and is part of one out of four classes (our four groups). Actually,\nthe `Datasets` object can be seen as a collection of `Data` (graph) objects.\n\nWe can further inspect our unique graph to know more about it.\n\n \n \n # Print first element\n print(f'Graph: {dataset[0]}')\n \n \n Graph: Data(x=[34, 34], edge_index=[2, 156], y=[34], train_mask=[34])\n\nThe `Data` object is particularly interesting. Printing it offers a good\nsummary of the graph we're studying:\n\n * `x=[34, 34]` is the **node feature matrix** with shape (number of nodes, number of features). In our case, it means that we have 34 nodes (our 34 members), each node being associated to a 34-dim feature vector.\n\n * `edge_index=[2, 156]` represents the **graph connectivity** (how the nodes are connected) with shape (2, number of directed edges).\n\n * `y=[34]` is the **node ground-truth labels**. In this problem, every node is assigned to one class (group), so we have one value for each node.\n\n * `train_mask=[34]` is an optional attribute that tells which nodes should be used for training with a list of `True` or `False` statements.\n\nLet\u2019s print each of these tensors to understand what they store. Let\u2019s start\nwith the node features.\n\n \n \n data = dataset[0]\n \n \n print(f'x = {data.x.shape}')\n print(data.x)\n \n \n x = torch.Size([34, 34])\n tensor([[1., 0., 0., ..., 0., 0., 0.],\n [0., 1., 0., ..., 0., 0., 0.],\n [0., 0., 1., ..., 0., 0., 0.],\n ...,\n [0., 0., 0., ..., 1., 0., 0.],\n [0., 0., 0., ..., 0., 1., 0.],\n [0., 0., 0., ..., 0., 0., 1.]])\n\nHere, the node feature matrix `x` is an identity matrix: it **doesn't contain\nany relevant information** about the nodes. It could contain information like\nage, skill level, etc. but this is not the case in this dataset. 
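If you want to convince yourself of this, a one-line check (purely optional, nothing later depends on it) confirms that the feature matrix is exactly the 34×34 identity:

    # Confirm that node features carry no information beyond node identity
    print(torch.equal(data.x, torch.eye(dataset.num_features)))
    # True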
It means\nwe'll have to classify our nodes just by looking at their connections.\n\nNow, let\u2019s print the edge index.\n\n \n \n print(f'edge_index = {data.edge_index.shape}')\n print(data.edge_index)\n \n \n edge_index = torch.Size([2, 156])\n tensor([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,\n 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,\n 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,\n 7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 13,\n 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,\n 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,\n 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,\n 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,\n 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33],\n [ 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2,\n 3, 7, 13, 17, 19, 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0,\n 1, 2, 7, 12, 13, 0, 6, 10, 0, 6, 10, 16, 0, 4, 5, 16, 0, 1,\n 2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, 0, 1, 2,\n 3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0,\n 1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23,\n 24, 33, 2, 31, 33, 23, 26, 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32,\n 33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33, 8, 9, 13, 14, 15,\n 18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32]])\n\nIn graph theory and network analysis, connectivity between nodes is stored\nusing a variety of data structures. The `edge_index` is one such data\nstructure, where the graph's connections are stored in **two lists** (156\ndirected edges, which equate to 78 bidirectional edges). The reason for these\ntwo lists is that one list stores the source nodes, while the second one\nidentifies the destination nodes.\n\nThis method is known as a **coordinate list** (COO) format, which is\nessentially a means to efficiently store a sparse matrix. Sparse matrices are\ndata structures that efficiently store matrices with a majority of zero\nelements. In the COO format, only non-zero elements are stored, saving memory\nand computational resources.\n\nContrarily, a more intuitive and straightforward way to represent graph\nconnectivity is through an **adjacency matrix** _A_. This is a square matrix\nwhere each element _A_ \u1d62\u2c7c _s_ pecifies the presence or absence of an edge from\nnode _i_ to node _j_ in the graph. In other words, a non-zero element _A_ \u1d62\u2c7c\nimplies a connection from node _i_ to node _j_ , and a zero indicates no\ndirect connection.\n\nImage by author\n\nAn adjacency matrix, however, is not as space-efficient as the COO format for\nsparse matrices or graphs with fewer edges. However, for clarity and easy\ninterpretation, the adjacency matrix remains a popular choice for representing\ngraph connectivity.\n\nThe adjacency matrix can be inferred from the `edge_index` with a utility\nfunction `to_dense_adj()`.\n\n \n \n from torch_geometric.utils import to_dense_adj\n \n \n A = to_dense_adj(data.edge_index)[0].numpy().astype(int)\n print(f'A = {A.shape}')\n print(A)\n \n \n A = (34, 34)\n [[0 1 1 ... 1 0 0]\n [1 0 1 ... 0 0 0]\n [1 1 0 ... 0 1 0]\n ...\n [1 0 0 ... 0 1 1]\n [0 0 1 ... 1 0 1]\n [0 0 0 ... 1 1 0]]\n\nWith graph data, it is relatively uncommon for nodes to be densely\ninterconnected. 
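To put a number on how interconnected the nodes actually are, we can compute the density of A, i.e., the fraction of non-zero entries (156 directed edges out of 34 × 34 possible ones). This quick check is just for illustration:

    # Fraction of non-zero entries in the adjacency matrix
    density = A.sum() / (A.shape[0] * A.shape[1])
    print(f"Density: {density:.3f}")
    # Density: 0.135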
As you can see, our adjacency matrix _A_ is **sparse** (filled\nwith zeros).\n\nIn many real-world graphs, most nodes are connected to only a few other nodes,\nresulting in a large number of zeros in the adjacency matrix. Storing so many\nzeros is not efficient at all, which is why the COO format is adopted by PyG.\n\nOn the contrary, ground-truth labels are easy to understand.\n\n \n \n print(f'y = {data.y.shape}')\n print(data.y)\n \n \n y = torch.Size([34])\n tensor([1, 1, 1, 1, 3, 3, 3, 1, 0, 1, 3, 1, 1, 1, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0,\n 2, 2, 0, 0, 2, 0, 0, 2, 0, 0])\n\nOur node ground-truth labels stored in `y` simply encode the group number (0,\n1, 2, 3) for each node, which is why we have 34 values.\n\nFinally, let\u2019s print the train mask.\n\n \n \n print(f'train_mask = {data.train_mask.shape}')\n print(data.train_mask)\n \n \n train_mask = torch.Size([34])\n tensor([ True, False, False, False, True, False, False, False, True, False,\n False, False, False, False, False, False, False, False, False, False,\n False, False, False, False, True, False, False, False, False, False,\n False, False, False, False])\n\nThe train mask shows which nodes are supposed to be used for training with\n`True` statements. These nodes represent the training set, while the others\ncan be considered as the test set. This division helps in model evaluation by\nproviding unseen data for testing.\n\nBut we\u2019re not done yet! The `Data` object has a lot more to offer. It provides\nvarious utility functions that enable the investigation of several properties\nof the graph. For instance:\n\n * `is_directed()` tells you if the graph is **directed**. A directed graph signifies that the adjacency matrix is not symmetric, i.e., the direction of edges matters in the connections between nodes.\n\n * `isolated_nodes()` checks if some nodes are **not connected** to the rest of the graph. These nodes are likely to pose challenges in tasks like classification due to their lack of connections.\n\n * `has_self_loops()` indicates if at least one node is **connected to itself**. This is distinct from the concept of loops: a loop implies a path that starts and ends at the same node, traversing other nodes in between.\n\nIn the context of the Zachary\u2019s karate club dataset, all these properties\nreturn `False`. This implies that the graph is not directed, does not have any\nisolated nodes, and none of its nodes are connected to themselves.\n\n \n \n print(f'Edges are directed: {data.is_directed()}')\n print(f'Graph has isolated nodes: {data.has_isolated_nodes()}')\n print(f'Graph has loops: {data.has_self_loops()}')\n \n \n Edges are directed: False\n Graph has isolated nodes: False\n Graph has loops: False\n\nFinally, we can convert a graph from PyTorch Geometric to the popular graph\nlibrary NetworkX using `to_networkx`. This is particularly useful to visualize\na small graph with `networkx` and `matplotlib`.\n\nLet\u2019s plot our dataset with a different color for each group.\n\n \n \n from torch_geometric.utils import to_networkx\n \n \n G = to_networkx(data, to_undirected=True)\n plt.figure(figsize=(12,12))\n plt.axis('off')\n nx.draw_networkx(G,\n pos=nx.spring_layout(G, seed=0),\n with_labels=True,\n node_size=800,\n node_color=data.y,\n cmap=\"hsv\",\n vmin=-2,\n vmax=3,\n width=0.8,\n edge_color=\"grey\",\n font_size=14\n )\n plt.show()\n\nThis plot of Zachary\u2019s karate club displays our 34 nodes, 78 (bidirectional)\nedges, and 4 labels with 4 different colors. 
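If you want to double-check those numbers programmatically, the NetworkX object `G` we just created exposes them directly:

    # Basic statistics from the NetworkX view of the graph
    print(f"Nodes: {G.number_of_nodes()}")   # 34
    print(f"Edges: {G.number_of_edges()}")   # 78 (undirected)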
Now that we\u2019ve seen the\nessentials of loading and handling a dataset with PyTorch Geometric, we can\nintroduce the **Graph Convolutional Network** architecture.\n\n### \u2709\ufe0f II. Graph Convolutional Network\n\nThis section aims to introduce and build the graph convolutional layer from\nthe ground up.\n\nIn traditional neural networks, linear layers apply a **linear\ntransformation** to the incoming data. This transformation converts input\nfeatures _x_ into hidden vectors _h_ through the use of a weight matrix \ud835\udc16.\nIgnoring biases for the time being, this can be expressed as:\n\nWith graph data, an additional layer of complexity is added through the\n**connections between nodes**. These connections matter because, typically, in\nnetworks, it\u2019s assumed that similar nodes are more likely to be linked to each\nother than dissimilar ones, a phenomenon known as network homophily.\n\nWe can enrich our **node representation** by merging its features with those\nof its neighbors. This operation is called convolution, or neighborhood\naggregation. Let\u2019s represent the neighborhood of node _i_ including itself as\n_\u00d1_.\n\nUnlike filters in Convolutional Neural Networks (CNNs), our weight matrix \ud835\udc16 is\nunique and shared among every node. But there is another issue: nodes do not\nhave a **fixed number of neighbors** like pixels do.\n\nHow do we address cases where one node has only one neighbor, and another has\n500? If we simply sum the feature vectors, the resulting embedding _h_ would\nbe much larger for the node with 500 neighbors. To ensure a **similar range**\nof values for all nodes and comparability between them, we can normalize the\nresult based on the **degree** of nodes, where degree refers to the number of\nconnections a node has.\n\nWe\u2019re almost there! Introduced by Kipf et al. (2016), the graph convolutional\nlayer has one final improvement.\n\nThe authors observed that features from nodes with numerous neighbors\npropagate much more easily than those from more isolated nodes. To offset this\neffect, they suggested assigning **bigger weights** to features from nodes\nwith fewer neighbors, thus balancing the influence across all nodes. This\noperation is written as:\n\nNote that when _i_ and _j_ have the same number of neighbors, it is equivalent\nto our own layer. Now, let\u2019s see how to implement it in Python with PyTorch\nGeometric.\n\n### \ud83e\udde0 III. Implementing a GCN\n\nPyTorch Geometric provides the `GCNConv` function, which directly implements\nthe graph convolutional layer.\n\nIn this example, we\u2019ll create a basic Graph Convolutional Network with a\nsingle GCN layer, a ReLU activation function, and a linear output layer. 
This\noutput layer will yield **four values** corresponding to our four categories,\nwith the highest value determining the class of each node.\n\nIn the following code block, we define the GCN layer with a 3-dimensional\nhidden layer.\n\n \n \n from torch.nn import Linear\n from torch_geometric.nn import GCNConv\n \n \n \n \n class GCN(torch.nn.Module):\n def __init__(self):\n super().__init__()\n self.gcn = GCNConv(dataset.num_features, 3)\n self.out = Linear(3, dataset.num_classes)\n \n \n def forward(self, x, edge_index):\n h = self.gcn(x, edge_index).relu()\n z = self.out(h)\n return h, z\n \n \n model = GCN()\n print(model)\n \n \n GCN(\n (gcn): GCNConv(34, 3)\n (out): Linear(in_features=3, out_features=4, bias=True)\n )\n\nIf we added a second GCN layer, our model would not only aggregate feature\nvectors from the neighbors of each node, but also from the neighbors of these\nneighbors.\n\nWe can **stack several graph layers** to aggregate more and more distant\nvalues, but there\u2019s a catch: if we add too many layers, the aggregation\nbecomes so intense that all the embeddings end up looking the same. This\nphenomenon is called **over-smoothing** and can be a real problem when you\nhave too many layers.\n\nNow that we\u2019ve defined our GNN, let\u2019s write a simple training loop with\nPyTorch. I chose a regular cross-entropy loss since it\u2019s a multi-class\nclassification task, with Adam as optimizer. In this article, we won\u2019t\nimplement a train/test split to keep things simple and focus on how GNNs learn\ninstead.\n\nThe training loop is standard: we try to predict the correct labels, and we\ncompare the GCN\u2019s results to the values stored in `data.y`. The error is\ncalculated by the cross-entropy loss and backpropagated with Adam to fine-tune\nour GNN's weights and biases. 
Finally, we print metrics every 10 epochs.\n\n \n \n criterion = torch.nn.CrossEntropyLoss()\n optimizer = torch.optim.Adam(model.parameters(), lr=0.02)\n \n \n # Calculate accuracy\n def accuracy(pred_y, y):\n return (pred_y == y).sum() / len(y)\n \n \n # Data for animations\n embeddings = []\n losses = []\n accuracies = []\n outputs = []\n \n \n # Training loop\n for epoch in range(201):\n # Clear gradients\n optimizer.zero_grad()\n \n \n # Forward pass\n h, z = model(data.x, data.edge_index)\n \n \n # Calculate loss function\n loss = criterion(z, data.y)\n \n \n # Calculate accuracy\n acc = accuracy(z.argmax(dim=1), data.y)\n \n \n # Compute gradients\n loss.backward()\n \n \n # Tune parameters\n optimizer.step()\n \n \n # Store data for animations\n embeddings.append(h)\n losses.append(loss)\n accuracies.append(acc)\n outputs.append(z.argmax(dim=1))\n \n \n # Print metrics every 10 epochs\n if epoch % 10 == 0:\n print(f'Epoch {epoch:>3} | Loss: {loss:.2f} | Acc: {acc*100:.2f}%')\n \n \n Epoch 0 | Loss: 1.40 | Acc: 41.18%\n Epoch 10 | Loss: 1.21 | Acc: 47.06%\n Epoch 20 | Loss: 1.02 | Acc: 67.65%\n Epoch 30 | Loss: 0.80 | Acc: 73.53%\n Epoch 40 | Loss: 0.59 | Acc: 73.53%\n Epoch 50 | Loss: 0.39 | Acc: 94.12%\n Epoch 60 | Loss: 0.23 | Acc: 97.06%\n Epoch 70 | Loss: 0.13 | Acc: 100.00%\n Epoch 80 | Loss: 0.07 | Acc: 100.00%\n Epoch 90 | Loss: 0.05 | Acc: 100.00%\n Epoch 100 | Loss: 0.03 | Acc: 100.00%\n Epoch 110 | Loss: 0.02 | Acc: 100.00%\n Epoch 120 | Loss: 0.02 | Acc: 100.00%\n Epoch 130 | Loss: 0.02 | Acc: 100.00%\n Epoch 140 | Loss: 0.01 | Acc: 100.00%\n Epoch 150 | Loss: 0.01 | Acc: 100.00%\n Epoch 160 | Loss: 0.01 | Acc: 100.00%\n Epoch 170 | Loss: 0.01 | Acc: 100.00%\n Epoch 180 | Loss: 0.01 | Acc: 100.00%\n Epoch 190 | Loss: 0.01 | Acc: 100.00%\n Epoch 200 | Loss: 0.01 | Acc: 100.00%\n\nGreat! Without much surprise, we reach 100% accuracy on the training set (full\ndataset). It means that our model learned to correctly assign every member of\nthe karate club to its correct group.\n\nWe can produce a neat visualization by animating the graph and see the\nevolution of the GNN\u2019s predictions during the training process.\n\n \n \n %%capture\n from IPython.display import HTML\n from matplotlib import animation\n plt.rcParams[\"animation.bitrate\"] = 3000\n \n \n def animate(i):\n G = to_networkx(data, to_undirected=True)\n nx.draw_networkx(G,\n pos=nx.spring_layout(G, seed=0),\n with_labels=True,\n node_size=800,\n node_color=outputs[i],\n cmap=\"hsv\",\n vmin=-2,\n vmax=3,\n width=0.8,\n edge_color=\"grey\",\n font_size=14\n )\n plt.title(f'Epoch {i} | Loss: {losses[i]:.2f} | Acc: {accuracies[i]*100:.2f}%',\n fontsize=18, pad=20)\n \n \n fig = plt.figure(figsize=(12, 12))\n plt.axis('off')\n \n \n anim = animation.FuncAnimation(fig, animate, \\\n np.arange(0, 200, 10), interval=500, repeat=True)\n html = HTML(anim.to_html5_video())\n display(html)\n\nThe first predictions are random, but the GCN perfectly labels every node\nafter a while. Indeed, the final graph is the same as the one we plotted at\nthe end of the first section. But what does the GCN really learn?\n\nBy aggregating features from neighboring nodes, the GNN learns a vector\nrepresentation (or **embedding**) of every node in the network. In our model,\nthe final layer just learns how to use these representations to produce the\nbest classifications. 
However, embeddings are the real products of GNNs.\n\nLet\u2019s print the embeddings learned by our model.\n\n \n \n # Print embeddings\n print(f'Final embeddings = {h.shape}')\n print(h)\n \n \n Final embeddings = torch.Size([34, 3])\n tensor([[1.9099e+00, 2.3584e+00, 7.4027e-01],\n [2.6203e+00, 2.7997e+00, 0.0000e+00],\n [2.2567e+00, 2.2962e+00, 6.4663e-01],\n [2.0802e+00, 2.8785e+00, 0.0000e+00],\n [0.0000e+00, 0.0000e+00, 2.9694e+00],\n [0.0000e+00, 0.0000e+00, 3.3817e+00],\n [0.0000e+00, 1.5008e-04, 3.4246e+00],\n [1.7593e+00, 2.4292e+00, 2.4551e-01],\n [1.9757e+00, 6.1032e-01, 1.8986e+00],\n [1.7770e+00, 1.9950e+00, 6.7018e-01],\n [0.0000e+00, 1.1683e-04, 2.9738e+00],\n [1.8988e+00, 2.0512e+00, 2.6225e-01],\n [1.7081e+00, 2.3618e+00, 1.9609e-01],\n [1.8303e+00, 2.1591e+00, 3.5906e-01],\n [2.0755e+00, 2.7468e-01, 1.9804e+00],\n [1.9676e+00, 3.7185e-01, 2.0011e+00],\n [0.0000e+00, 0.0000e+00, 3.4787e+00],\n [1.6945e+00, 2.0350e+00, 1.9789e-01],\n [1.9808e+00, 3.2633e-01, 2.1349e+00],\n [1.7846e+00, 1.9585e+00, 4.8021e-01],\n [2.0420e+00, 2.7512e-01, 1.9810e+00],\n [1.7665e+00, 2.1357e+00, 4.0325e-01],\n [1.9870e+00, 3.3886e-01, 2.0421e+00],\n [2.0614e+00, 5.1042e-01, 2.4872e+00],\n ...\n [2.1778e+00, 4.4730e-01, 2.0077e+00],\n [3.8906e-02, 2.3443e+00, 1.9195e+00],\n [3.0748e+00, 0.0000e+00, 3.0789e+00],\n [3.4316e+00, 1.9716e-01, 2.5231e+00]], grad_fn=)\n\nAs you can see, embeddings do not need to have the same dimensions as feature\nvectors. Here, I chose to reduce the number of dimensions from 34\n(`dataset.num_features`) to three to get a nice visualization in 3D.\n\nLet\u2019s plot these embeddings before any training happens, at epoch 0.\n\n \n \n # Get first embedding at epoch = 0\n embed = h.detach().cpu().numpy()\n \n \n fig = plt.figure(figsize=(12, 12))\n ax = fig.add_subplot(projection='3d')\n ax.patch.set_alpha(0)\n plt.tick_params(left=False,\n bottom=False,\n labelleft=False,\n labelbottom=False)\n ax.scatter(embed[:, 0], embed[:, 1], embed[:, 2],\n s=200, c=data.y, cmap=\"hsv\", vmin=-2, vmax=3)\n \n \n plt.show()\n\nWe see every node from Zachary\u2019s karate club with their true labels (and not\nthe model\u2019s predictions). For now, they\u2019re all over the place since the GNN is\nnot trained yet. But if we plot these embeddings at each step of the training\nloop, we\u2019d be able to visualize what the GNN truly learns.\n\nLet\u2019s see how they evolve over time, as the GCN gets better and better at\nclassifying nodes.\n\n \n \n %%capture\n \n \n def animate(i):\n embed = embeddings[i].detach().cpu().numpy()\n ax.clear()\n ax.scatter(embed[:, 0], embed[:, 1], embed[:, 2],\n s=200, c=data.y, cmap=\"hsv\", vmin=-2, vmax=3)\n plt.title(f'Epoch {i} | Loss: {losses[i]:.2f} | Acc: {accuracies[i]*100:.2f}%',\n fontsize=18, pad=40)\n \n \n fig = plt.figure(figsize=(12, 12))\n plt.axis('off')\n ax = fig.add_subplot(projection='3d')\n plt.tick_params(left=False,\n bottom=False,\n labelleft=False,\n labelbottom=False)\n \n \n anim = animation.FuncAnimation(fig, animate, \\\n np.arange(0, 200, 10), interval=800, repeat=True)\n html = HTML(anim.to_html5_video())\n display(html)\n\nOur Graph Convolutional Network (GCN) has effectively learned embeddings that\ngroup similar nodes into **distinct clusters**. This enables the final linear\nlayer to distinguish them into separate classes with ease.\n\nEmbeddings are not unique to GNNs: they can be found everywhere in deep\nlearning. They don\u2019t have to be 3D either: actually, they rarely are. 
For\ninstance, language models like BERT produce embeddings with 768 or even 1024\ndimensions.\n\nAdditional dimensions store more information about nodes, text, images, etc.\nbut they also create bigger models that are more difficult to train. This is\nwhy keeping low-dimensional embeddings as long as possible is advantageous.\n\n### Conclusion\n\nGraph Convolutional Networks are an incredibly versatile architecture that can\nbe applied in **many contexts**. In this article, we familiarized ourselves\nwith the PyTorch Geometric library and objects like `Datasets` and `Data`.\nThen, we successfully reconstructed a graph convolutional layer from the\nground up. Next, we put theory into practice by implementing a GCN, which gave\nus an understanding of practical aspects and how individual components\ninteract. Finally, we visualized the training process and obtained a clear\nperspective of what it involves for such a network.\n\nZachary\u2019s karate club is a simplistic dataset, but it is good enough to\nunderstand the most important concepts in graph data and GNNs. Although we\nonly talked about node classification in this article, there are other tasks\nGNNs can accomplish: **link prediction** (e.g., to recommend a friend),\n**graph classification** (e.g., to label molecules), **graph generation**\n(e.g., to create new molecules), and so on.\n\nBeyond GCN, numerous GNN layers and architectures have been proposed by\nresearchers. In the next article, we\u2019ll introduce the Graph Attention Network\n(GAT) architecture, which dynamically computes the GCN\u2019s normalization factor\nand the importance of each connection with an attention mechanism.\n\nIf you want to know more about graph neural networks, dive deeper into the\nworld of GNNs with my book, Hands-On Graph Neural Networks.\n\n### Next article\n\n**Chapter 2: Graph Attention Networks: Self-Attention Explained** \n _A guide to GNNs with self-attention using PyTorch\nGeometric_towardsdatascience.com\n\n
", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/graph-convolutional-networks-introduction-to-gnns-24b3f60d6c95" }, { "id": "a89d6d0f-861f-4a11-aa6b-730ed30f6eb8", "content": { "Title": "4-bit Quantization with GPTQ - Maxime Labonne", "Subtitle": "Quantize your own LLMs using AutoGPTQ", "Content": "# 4-bit Quantization with GPTQ\n\n### Quantize your own LLMs using AutoGPTQ\n\nMaxime Labonne\n\nJul 31, 2023\n\nImage by author\n\nRecent advancements in weight quantization allow us to run massive large\nlanguage models on consumer hardware, like a LLaMA-30B model on an RTX 3090\nGPU. This is possible thanks to novel 4-bit quantization techniques with\nminimal performance degradation, like GPTQ, GGML, and NF4.\n\nIn the previous article, we introduced na\u00efve 8-bit quantization techniques and\nthe excellent LLM.int8(). In this article, we will explore the popular **GPTQ\nalgorithm** to understand how it works and implement it using the AutoGPTQ\nlibrary.\n\nYou can find the code on Google Colab and GitHub.\n\n### \ud83e\udde0 Optimal Brain Quantization\n\nLet\u2019s start by introducing the problem we\u2019re trying to solve. For every layer\n\u2113 in the network, we want to find a quantized version **\u0174\u2097** _of the original\nweights_ **W\u2097**. This is called the **layer-wise compression problem**. More\nspecifically, to minimize performance degradation, we want the outputs (**\u0174\u2097X\u2097**)\nof these new weights to be as close as possible to the original ones\n(**W\u2097X\u2097**). In other words, we want to find:\n\n$$\\mathrm{arg\\,min}_{\\widehat{\\mathbf{W}}_\\ell} \\; \\lVert \\mathbf{W}_\\ell \\mathbf{X}_\\ell - \\widehat{\\mathbf{W}}_\\ell \\mathbf{X}_\\ell \\rVert_2^2$$\n\nDifferent approaches have been proposed to solve this problem, but we\u2019re\ninterested in the **Optimal Brain Quantizer** (OBQ) framework here.\n\nThis method is inspired by a **pruning technique** to carefully remove weights\nfrom a fully trained dense neural network (Optimal Brain Surgeon). It uses an\napproximation technique and provides explicit formulas for the best single\nweight $w_q$ to remove and optimal update $\\delta_F$ to adjust the set of remaining\nnon-quantized weights $F$ to make up for the removal:\n\n$$w_q = \\mathrm{arg\\,min}_{w_q} \\frac{(\\mathrm{quant}(w_q) - w_q)^2}{[\\mathbf{H}_F^{-1}]_{qq}}, \\qquad \\delta_F = -\\frac{w_q - \\mathrm{quant}(w_q)}{[\\mathbf{H}_F^{-1}]_{qq}} \\cdot (\\mathbf{H}_F^{-1})_{:,q}$$\n\nwhere quant($w$) is the weight rounding given by the quantization and $\\mathbf{H}_F$\nis the Hessian.\n\nUsing OBQ, we can quantize the easiest weight first and then adjust all\nremaining non-quantized weights to **compensate for this precision loss**.\nThen we pick the next weight to quantize, and so on.\n\nA potential issue with this approach is when there are outlier weights, which\ncan result in high **quantization error**. Usually, these outliers would be\nquantized last, when there are few non-quantized weights left that could be\nadjusted to compensate for the large error. This effect can worsen when some\nweights are pushed further outside the grid by intermediate updates. A simple\nheuristic is applied to prevent this: outliers are quantized as soon as they\nappear.\n\nThis process could be computationally heavy, especially for LLMs.
To deal with\nthis, the OBQ method uses a trick that avoids redoing the entire computation\neach time a weight is simplified. After quantizing a weight, it adjusts the\nmatrix used in calculations (the Hessian) by **removing the row and column**\nassociated with that weight (using Gaussian elimination):\n\nThe method also employs vectorization to process multiple rows of the weight\nmatrix at once. Despite its efficiency, the OBQ\u2019s computation time increases\nsignificantly as the size of the weight matrix increases. This cubic growth\nmakes it difficult to use OBQ on very large models with billions of\nparameters.\n\n### \ud83e\uddee The GPTQ Algorithm\n\nIntroduced by Frantar et al. (2023), the GPTQ algorithm takes inspiration from\nthe OBQ method, but with significant improvements to scale it for (very) large\nlanguage models.\n\n#### Step 1: Arbitrary Order Insight\n\nThe OBQ method selects weights (parameters in a model) for quantization in a\ncertain order, determined by which will **add the least additional error**.\nHowever, GPTQ observes that for large models, quantizing weights in any fixed\norder can perform just as well. This is because even though some weights might\nintroduce more error individually, they are quantized later in the process\nwhen there are few other weights left that could increase the error. So the\norder doesn\u2019t matter as much as we thought.\n\nBased on this insight, GPTQ aims to quantize all weights in the **same order\nfor all rows** of a matrix. This makes the process faster because certain\ncomputations have to be done only once for each column, rather than once for\neach weight.\n\nImage by author\n\n#### Step 2: Lazy Batch-Updates\n\nThis scheme won\u2019t be fast because it requires updating a **huge matrix** with\nvery few computations for each entry. This type of operation can\u2019t utilize the\nfull compute capabilities of GPUs and will be slowed down by memory\nlimitations (memory throughput bottleneck).\n\nTo resolve this, GPTQ introduces \u201clazy batch\u201d updates. It turns out that the\nfinal rounding decisions for a given column are only affected by updates\nperformed on that column, not on later columns. Therefore, GPTQ can apply the\nalgorithm to a **batch of columns at a time** (like 128 columns), updating\nonly those columns and a corresponding block of the matrix. After a block is\nfully processed, the algorithm performs global updates on the entire matrix.\n\n#### Step 3: Cholesky Reformulation\n\nHowever, there\u2019s one more issue to address. When the algorithm scales up to\nvery large models, numerical inaccuracies can become a problem. Specifically,\nrepeated applications of a certain operation can **accumulate numerical\nerrors**.\n\nTo tackle this, GPTQ uses a Cholesky decomposition, a numerically stable\nmethod for solving certain mathematical problems. It involves precomputing\nsome required information from the matrix using the Cholesky method. This\napproach, combined with a slight \u201cdampening\u201d (adding a small constant to\ndiagonal elements of the matrix), helps the algorithm to avoid numerical\nissues.\n\nThe full algorithm can be summarized in a few steps:\n\n 1. The GPTQ algorithm begins with a Cholesky decomposition of the Hessian inverse (a matrix that helps decide how to adjust the weights)\n\n 2. It then runs in loops, handling batches of columns at a time.\n\n 3. For each column in a batch, it quantizes the weights, calculates the error, and updates the weights in the block accordingly.\n\n 4. 
After processing the batch, it updates all remaining weights based on the block\u2019s errors.\n\nThe GPTQ algorithm was tested on various language generation tasks. It was\ncompared with other quantization methods, like rounding all weights to the\nnearest quantized value (RTN). GPTQ was used with the BLOOM (176B parameters)\nand OPT (175B parameters) model families, and models were quantized using a\n**single NVIDIA A100 GPU**.\n\n### \ud83d\udcbb Quantize an LLM with AutoGPTQ\n\nGPTQ has been very popular to create models in 4-bit precision that can\nefficiently run on GPUs. You can find many examples on the Hugging Face Hub,\nespecially from TheBloke. If you\u2019re looking for an approach that is more CPU-\nfriendly, GGML is currently your best option. Finally, the `transformers`\nlibrary with `bitsandbytes` allows you to quantize a model when it's loaded\nusing the `load_in_4bit=true` argument, which requires downloading full models\nand storing them in your RAM.\n\nLet\u2019s implement the GPTQ algorithm using the AutoGPTQ library and quantize a\nGPT-2 model. This requires a GPU, but a free T4 on Google Colab will do. We\nstart by loading the libraries and defining the model we want to quantize (in\nthis case, GPT-2).\n\n \n \n !BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers\n \n \n import random\n \n from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig\n from datasets import load_dataset\n import torch\n from transformers import AutoTokenizer\n \n \n # Define base model and output directory\n model_id = \"gpt2\"\n out_dir = model_id + \"-GPTQ\"\n\nWe now want to load the model and the tokenizer. The tokenizer is loaded using\nthe classic `AutoTokenizer` class from the `transformers` library. On the\nother hand, we need to pass a specific configuration (`BaseQuantizeConfig`) to\nload the model.\n\nIn this configuration, we can specify the number of bits to quantize (here,\n`bits=4`) and the group size (size of the lazy batch). Note that this group\nsize is optional: we could also use **one set of parameters** for the entire\nweight matrix. In practice, these groups generally improve the quality of the\nquantization at a very low cost (especially with `group_size=1024`). The\n`damp_percent` value is here to help the Cholesky reformulation and should not\nbe changed.\n\nFinally, the `desc_act` (also called act order) is a tricky parameter. It\nallows you to **process rows based on decreasing activation** , meaning the\nmost important or impactful rows (determined by sampled inputs and outputs)\nare processed first. This method aims to place most of the quantization error\n(inevitably introduced during quantization) on less significant weights. This\napproach improves the overall accuracy of the quantization process by ensuring\nthe most significant weights are processed with greater precision. However,\nwhen used alongside group size, `desc_act` can lead to performance slowdowns\ndue to the need to frequently reload quantization parameters. For this reason,\nwe won't use it here (it will probably be fixed in the future, however).\n\n \n \n # Load quantize config, model and tokenizer\n quantize_config = BaseQuantizeConfig(\n bits=4,\n group_size=128,\n damp_percent=0.01,\n desc_act=False,\n )\n model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)\n tokenizer = AutoTokenizer.from_pretrained(model_id)\n\nThe quantization process **relies heavily on samples** to evaluate and enhance\nthe quality of the quantization. 
They provide a means of comparison between\nthe outputs produced by the origina and the newly quantized model. The larger\nthe number of samples provided, the greater the potential for more accurate\nand effective comparisons, leading to improved quantization quality.\n\nIn the context of this article, we utilize the **C4 (Colossal Clean Crawled\nCorpus) dataset** to generate our samples. The C4 dataset is a large-scale,\nmultilingual collection of web text gathered from the Common Crawl project.\nThis expansive dataset has been cleaned and prepared specifically for training\nlarge-scale language models, making it a great resource for tasks such as\nthis. The WikiText dataset is another popular option.\n\nIn the following code block, we load 1024 samples from the C4 dataset,\ntokenize them, and format them.\n\n \n \n # Load data and tokenize examples\n n_samples = 1024\n data = load_dataset(\"allenai/c4\", data_files=\"en/c4-train.00001-of-01024.json.gz\", split=f\"train[:{n_samples*5}]\")\n tokenized_data = tokenizer(\"\\n\\n\".join(data['text']), return_tensors='pt')\n \n # Format tokenized examples\n examples_ids = []\n for _ in range(n_samples):\n i = random.randint(0, tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1)\n j = i + tokenizer.model_max_length\n input_ids = tokenized_data.input_ids[:, i:j]\n attention_mask = torch.ones_like(input_ids)\n examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})\n\nNow that dataset is ready, we can start the quantization process with a batch\nsize of 1. Optionally, we also use OpenAI Triton, a CUDA alternative, to\ncommunicate with the GPU. Once this is done, we save the tokenizer and the\nmodel in a safetensors format.\n\n \n \n # Quantize with GPTQ\n model.quantize(\n examples_ids,\n batch_size=1,\n use_triton=True,\n )\n \n # Save model and tokenizer\n model.save_quantized(out_dir, use_safetensors=True)\n tokenizer.save_pretrained(out_dir)\n\nAs per usual, the model and tokenizer can then be loaded from the output\ndirectory using the `AutoGPTQForCausalLM` and `AutoTokenizer` classes.\n\n \n \n device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n \n # Reload model and tokenizer\n model = AutoGPTQForCausalLM.from_quantized(\n out_dir,\n device=device,\n use_triton=True,\n use_safetensors=True,\n )\n tokenizer = AutoTokenizer.from_pretrained(out_dir)\n\nLet\u2019s check that the model is working correctly. The AutoGPTQ model (mostly)\nworks as a normal `transformers` model, which makes it compatible with\ninference pipelines, as shown in the following example:\n\n \n \n from transformers import pipeline\n \n generator = pipeline('text-generation', model=model, tokenizer=tokenizer)\n result = generator(\"I have a dream\", do_sample=True, max_length=50)[0]['generated_text']\n print(result)\n \n \n I have a dream,\" she told CNN last week. \"I have this dream of helping my mother find her own. But, to tell that for the first time, now that I'm seeing my mother now, just knowing how wonderful it is that\n\nWe managed to get a convincing completion from our quantized GPT-2 model. A\nmore in-depth evaluation would require **measuring the perplexity** of the\nquantized model versus the original one. However, we will leave it out of the\nscope of this article.\n\n### Conclusion\n\nIn this article, we introduced the GPTQ algorithm, a state-of-the-art\nquantization technique to run LLMs on consumer-grade hardware. 
We showed how\nit addresses the layer-wise compression problem, based on an improved OBS\ntechnique with arbitrary order insight, lazy batch updates, and Cholesky\nreformulation. This novel approach **significantly reduces memory and\ncomputation requirements**, making LLMs accessible to a broader audience.\n\nIn addition, we **quantized our own LLM model** on a free T4 GPU and ran it to\ngenerate text. You can push your own version of a GPTQ 4-bit quantized model\non the Hugging Face Hub. As mentioned in the introduction, GPTQ is not the\nonly 4-bit quantization algorithm: GGML and NF4 are excellent alternatives\nwith slightly different scopes. I encourage you to learn more about them and\ngive them a shot!\n\nIf you\u2019re interested in more technical content around LLMs, follow me on\nTwitter @maximelabonne.\n\n### References\n\n * B. Hassibi, D. G. Stork and G. J. Wolff, \u201cOptimal Brain Surgeon and general network pruning,\u201d IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993, pp. 293\u2013299 vol.1, doi: 10.1109/ICNN.1993.298572.\n\n * Elias Frantar, Sidak Pal Singh, & Dan Alistarh. (2023). Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning.\n\n * Elias Frantar, Saleh Ashkboos, Torsten Hoefler, & Dan Alistarh. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.\n\n * Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, & Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.\n\n
", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/4-bit-quantization-with-gptq-36b0f4f02c34" }, { "id": "d771ccaa-ca3e-4280-bbd7-c45aec8b7f0c", "content": { "Title": "Fine-Tune Your Own Llama 2 Model in a Colab Notebook", "Subtitle": "A practical introduction to LLM fine-tuning", "Content": "# Fine-Tune Your Own Llama 2 Model in a Colab Notebook\n\n### A practical introduction to LLM fine-tuning\n\nMaxime Labonne\n\nJul 25, 2023\n\nImage by author\n\nWith the release of LLaMA v1, we saw a Cambrian explosion of fine-tuned\nmodels, including Alpaca, Vicuna, and WizardLM, among others. This trend\nencouraged different businesses to launch their own base models with licenses\nsuitable for commercial use, such as OpenLLaMA, Falcon, XGen, etc. The release\nof Llama 2 now combines the best elements from both sides: it offers a\n**highly efficient base model along with a more permissive license**.\n\nDuring the first half of 2023, the software landscape was significantly shaped\nby the **widespread use of APIs** (like OpenAI API) to create infrastructures\nbased on Large Language Models (LLMs). Libraries such as LangChain and\nLlamaIndex played a critical role in this trend. Moving into the latter half\nof the year, the process of **fine-tuning (or instruction tuning) these models\nis set to become a standard procedure** in the LLMOps workflow. This trend is\ndriven by various factors: the potential for cost savings, the ability to\nprocess confidential data, and even the potential to develop models that\nexceed the performance of prominent models like ChatGPT and GPT-4 in certain\nspecific tasks.\n\nIn this article, we will see why instruction tuning works and how to implement\nit in a Google Colab notebook to create your own Llama 2 model. As usual, the\ncode is available on Colab and GitHub.\n\n### **\ud83d\udd27** Background on fine-tuning LLMs\n\nImage by author\n\nLLMs are pretrained on an extensive corpus of text. In the case of Llama 2, we\nknow very little about the composition of the training set, besides its length\nof 2 trillion tokens. In comparison, BERT (2018) was \u201conly\u201d trained on the\nBookCorpus (800M words) and English Wikipedia (2,500M words). From experience,\nthis is a **very costly and long process** with a lot of hardware issues. If\nyou want to know more about it, I recommend reading Meta\u2019s logbook about the\npretraining of the OPT-175B model.\n\nWhen the pretraining is complete, auto-regressive models like Llama 2 can\n**predict the next token** in a sequence. However, this does not make them\nparticularly useful assistants since they don\u2019t reply to instructions. This is\nwhy we employ instruction tuning to align their answers with what humans\nexpect. There are two main fine-tuning techniques:\n\n * **Supervised Fine-Tuning** (SFT): Models are trained on a dataset of instructions and responses.
It adjusts the weights in the LLM to minimize the difference between the generated answers and ground-truth responses, acting as labels.\n\n * **Reinforcement Learning from Human Feedback** (RLHF): Models learn by interacting with their environment and receiving feedback. They are trained to maximize a reward signal (using PPO), which is often derived from human evaluations of model outputs.\n\nIn general, RLHF is shown to capture **more complex and nuanced** human\npreferences, but is also more challenging to implement effectively. Indeed, it\nrequires careful design of the reward system and can be sensitive to the\nquality and consistency of human feedback. A possible alternative in the\nfuture is the Direct Preference Optimization (DPO) algorithm, which directly\nruns preference learning on the SFT model.\n\nIn our case, we will perform SFT, but this raises a question: why does fine-\ntuning work in the first place? As highlighted in the Orca paper, our\nunderstanding is that fine-tuning **leverages knowledge learned during the\npretraining** process. In other words, fine-tuning will be of little help if\nthe model has never seen the kind of data you\u2019re interested in. However, if\nthat\u2019s the case, SFT can be extremely performant.\n\nFor example, the LIMA paper showed how you could outperform GPT-3 (DaVinci003)\nby fine-tuning a LLaMA (v1) model with 65 billion parameters on only 1,000\nhigh-quality samples. The **quality of the instruction dataset is essential**\nto reach this level of performance, which is why a lot of work is focused on\nthis issue (like evol-instruct, Orca, or phi-1). Note that the size of the LLM\n(65b, not 13b or 7b) is also fundamental to leverage pre-existing knowledge\nefficiently.\n\nAnother important point related to the data quality is the **prompt\ntemplate**. Prompts are comprised of similar elements: system prompt\n(optional) to guide the model, user prompt (required) to give the instruction,\nadditional inputs (optional) to take into consideration, and the model\u2019s\nanswer (required). In the case of Llama 2, the authors used the following\ntemplate:\n\n \n \n [INST] <>\n System prompt\n <>\n \n User prompt [/INST] Model answer \n\nThere are other templates, like the ones from Alpaca and Vicuna, and their\nimpact is not very clear. In this example, we will reformat our instruction\ndataset to follow Llama 2\u2019s template. For the purpose of this tutorial, I\u2019ve\nalready done it using the excellent `timdettmers/openassistant-guanaco`\ndataset. You can find it on Hugging Face under the name `mlabonne/guanaco-\nllama2-1k`.\n\n### \ud83e\udd99 How to fine-tune Llama 2\n\nIn this section, we will fine-tune a Llama 2 model with 7 billion parameters\non a T4 GPU with high RAM using Google Colab (2.21 credits/hour). Note that a\nT4 only has 16 GB of VRAM, which is barely enough to **store Llama 2\u20137b\u2019s\nweights** (7b \u00d7 2 bytes = 14 GB in FP16). In addition, we need to consider the\noverhead due to optimizer states, gradients, and forward activations (see this\nexcellent article for more information). This means that a full fine-tuning is\nnot possible here: we need parameter-efficient fine-tuning (PEFT) techniques\nlike LoRA or QLoRA.\n\nTo drastically reduce the VRAM usage, we must **fine-tune the model in 4-bit\nprecision** , which is why we\u2019ll use QLoRA here. The good thing is that we can\nleverage the Hugging Face ecosystem with the `transformers`, `accelerate`,\n`peft`, `trl`, and `bitsandbytes` libraries. 
We'll do this in the following\ncode based on Younes Belkada's GitHub Gist. First, we install and load these\nlibraries.\n\n \n \n !pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7\n \n \n import os\n import torch\n from datasets import load_dataset\n from transformers import (\n AutoModelForCausalLM,\n AutoTokenizer,\n BitsAndBytesConfig,\n HfArgumentParser,\n TrainingArguments,\n pipeline,\n logging,\n )\n from peft import LoraConfig, PeftModel\n from trl import SFTTrainer\n\nLet\u2019s talk a bit about the parameters we can tune here. First, we want to load\na `llama-2-7b-chat-hf` model and train it on the `mlabonne/guanaco-llama2-1k`\n(1,000 samples), which will produce our fine-tuned model\n`llama-2-7b-miniguanaco`. Feel free to change the dataset: there are many\noptions on the Hugging Face Hub.\n\nQLoRA will use a rank of 64 with a scaling parameter of 16 (see this article\nfor more information about LoRA parameters). We\u2019ll load the Llama 2 model\ndirectly in 4-bit precision using the NF4 type and train it for one epoch. To\nget more information about the other parameters, check the TrainingArguments,\nPeftModel, and SFTTrainer documentation.\n\n \n \n # The model that you want to train from the Hugging Face hub\n model_name = \"daryl149/llama-2-7b-chat-hf\"\n \n # The instruction dataset to use\n dataset_name = \"mlabonne/guanaco-llama2-1k\"\n \n # Fine-tuned model name\n new_model = \"llama-2-7b-miniguanaco\"\n \n ################################################################################\n # QLoRA parameters\n ################################################################################\n \n # LoRA attention dimension\n lora_r = 64\n \n # Alpha parameter for LoRA scaling\n lora_alpha = 16\n \n # Dropout probability for LoRA layers\n lora_dropout = 0.1\n \n ################################################################################\n # bitsandbytes parameters\n ################################################################################\n \n # Activate 4-bit precision base model loading\n use_4bit = True\n \n # Compute dtype for 4-bit base models\n bnb_4bit_compute_dtype = \"float16\"\n \n # Quantization type (fp4 or nf4)\n bnb_4bit_quant_type = \"nf4\"\n \n # Activate nested quantization for 4-bit base models (double quantization)\n use_nested_quant = False\n \n ################################################################################\n # TrainingArguments parameters\n ################################################################################\n \n # Output directory where the model predictions and checkpoints will be stored\n output_dir = \"./results\"\n \n # Number of training epochs\n num_train_epochs = 1\n \n # Enable fp16/bf16 training (set bf16 to True with an A100)\n fp16 = False\n bf16 = False\n \n # Batch size per GPU for training\n per_device_train_batch_size = 4\n \n # Batch size per GPU for evaluation\n per_device_eval_batch_size = 4\n \n # Number of update steps to accumulate the gradients for\n gradient_accumulation_steps = 2\n \n # Enable gradient checkpointing\n gradient_checkpointing = True\n \n # Maximum gradient normal (gradient clipping)\n max_grad_norm = 0.3\n \n # Initial learning rate (AdamW optimizer)\n learning_rate = 2e-4\n \n # Weight decay to apply to all layers except bias/LayerNorm weights\n weight_decay = 0.001\n \n # Optimizer to use\n optim = \"paged_adamw_32bit\"\n \n # Learning rate schedule (constant a bit better than cosine)\n lr_scheduler_type = \"constant\"\n 
\n # Number of training steps (overrides num_train_epochs)\n max_steps = -1\n \n # Ratio of steps for a linear warmup (from 0 to learning rate) \n warmup_ratio = 0.03\n \n # Group sequences into batches with same length\n # Saves memory and speeds up training considerably\n group_by_length = True\n \n # Save checkpoint every X updates steps\n save_steps = 10\n \n # Log every X updates steps\n logging_steps = 1\n \n ################################################################################\n # SFT parameters\n ################################################################################\n \n # Maximum sequence length to use\n max_seq_length = None\n \n # Pack multiple short examples in the same input sequence to increase efficiency\n packing = False\n \n # Load the entire model on the GPU 0\n device_map = {\"\": 0}\n\nWe can now load everything and start the fine-tuning process. We\u2019re relying on\nmultiple wrappers, so bear with me.\n\n * First of all, we want to load the dataset we defined. If you changed it, you can **preprocess it here** and adapt it to the desired prompt template.\n\n * Then, we\u2019re configuring `bitsandbytes` for 4-bit quantization.\n\n * Next, we're loading the Llama 2 model in 4-bit precision on a GPU with the corresponding tokenizer.\n\n * Finally, we're loading configurations for QLoRA, regular training parameters, and passing everything to the `SFTTrainer`. The training can finally start!\n\n \n \n # Load dataset (you can process it here)\n dataset = load_dataset(dataset_name, split=\"train\")\n \n # Load tokenizer and model with QLoRA configuration\n compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n \n bnb_config = BitsAndBytesConfig(\n load_in_4bit=use_4bit,\n bnb_4bit_quant_type=bnb_4bit_quant_type,\n bnb_4bit_compute_dtype=compute_dtype,\n bnb_4bit_use_double_quant=use_nested_quant,\n )\n \n # Check GPU compatibility with bfloat16\n if compute_dtype == torch.float16 and use_4bit:\n major, _ = torch.cuda.get_device_capability()\n if major >= 8:\n print(\"=\" * 80)\n print(\"Your GPU supports bfloat16: accelerate training with bf16=True\")\n print(\"=\" * 80)\n \n # Load base model\n model = AutoModelForCausalLM.from_pretrained(\n model_name,\n quantization_config=bnb_config,\n device_map=device_map\n )\n model.config.use_cache = False\n model.config.pretraining_tp = 1\n \n # Load LLaMA tokenizer\n tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n tokenizer.pad_token = tokenizer.eos_token\n tokenizer.padding_side = \"right\" # Fix weird overflow issue with fp16 training\n \n # Load LoRA configuration\n peft_config = LoraConfig(\n lora_alpha=lora_alpha,\n lora_dropout=lora_dropout,\n r=lora_r,\n bias=\"none\",\n task_type=\"CAUSAL_LM\",\n )\n \n # Set training parameters\n training_arguments = TrainingArguments(\n output_dir=output_dir,\n num_train_epochs=num_train_epochs,\n per_device_train_batch_size=per_device_train_batch_size,\n gradient_accumulation_steps=gradient_accumulation_steps,\n optim=optim,\n save_steps=save_steps,\n logging_steps=logging_steps,\n learning_rate=learning_rate,\n weight_decay=weight_decay,\n fp16=fp16,\n bf16=bf16,\n max_grad_norm=max_grad_norm,\n max_steps=max_steps,\n warmup_ratio=warmup_ratio,\n group_by_length=group_by_length,\n lr_scheduler_type=lr_scheduler_type,\n report_to=\"tensorboard\"\n )\n \n # Set supervised fine-tuning parameters\n trainer = SFTTrainer(\n model=model,\n train_dataset=dataset,\n peft_config=peft_config,\n dataset_text_field=\"text\",\n 
max_seq_length=max_seq_length,\n tokenizer=tokenizer,\n args=training_arguments,\n packing=packing,\n )\n \n # Train model\n trainer.train()\n \n # Save trained model\n trainer.model.save_pretrained(output_dir)\n\nImage by author\n\nThe training can be very long, depending on the size of your dataset. Here, it\ntook less than an hour on a T4 GPU. We can check the plots on tensorboard, as\nfollows:\n\n \n \n %load_ext tensorboard\n %tensorboard --logdir results/runs\n\nImage by author\n\nLet\u2019s make sure that the model is behaving correctly. It would require a more\nexhaustive evaluation, but we can use the **text generation pipeline** to ask\nquestions like \u201cWhat is a large language model?\u201d Note that I\u2019m formatting the\ninput to match Llama 2\u2019s prompt template.\n\n \n \n # Ignore warnings\n logging.set_verbosity(logging.CRITICAL)\n \n # Run text generation pipeline with our next model\n prompt = \"What is a large language model?\"\n pipe = pipeline(task=\"text-generation\", model=model, tokenizer=tokenizer, max_length=200)\n result = pipe(f\"[INST] {prompt} [/INST]\")\n print(result[0]['generated_text'])\n\nThe model outputs the following response:\n\n \n \n A large language model is a type of artificial intelligence that is trained on a large dataset of text to generate human-like language. It is typically trained on a dataset of text that is much larger than the dataset used for smaller language models. The large dataset allows the model to learn more complex patterns in language, which can result in more accurate and natural-sounding language generation.\n \n Large language models are often used for tasks such as text summarization, language translation, and chatbots. They are also used for more complex tasks such as writing articles, generating creative content, and even composing music.\n \n Large language models are trained using a technique called deep learning, which involves using many layers of artificial neural networks to learn complex patterns in the data. The model is trained on a large dataset of text, and the neural networks are trained to predict the next word in a sequence of text given\n\nFrom experience, it is **very coherent** for a model with only 7 billion\nparameters. You can play with it and ask harder questions from evaluation\ndatasets like BigBench-Hard. Guanaco is an excellent dataset that has produced\nhigh-quality models in the past. You can train a Llama 2 model on the entire\ndataset using `mlabonne/guanaco-llama2`.\n\nHow can we store our new `llama-2-7b-miniguanaco` model now? We need to merge\nthe weights from LoRA with the base model. Unfortunately, as far as I know,\nthere is no straightforward way to do it: we need to reload the base model in\nFP16 precision and use the `peft` library to merge everything. Alas, it also\ncreates a problem with the VRAM (despite emptying it), so I recommend\n**restarting the notebook** , re-executing the three first cells, and then\nexecuting the next one. 
Please contact me if you know a fix!\n\n \n \n # Reload model in FP16 and merge it with LoRA weights\n base_model = AutoModelForCausalLM.from_pretrained(\n model_name,\n low_cpu_mem_usage=True,\n return_dict=True,\n torch_dtype=torch.float16,\n device_map=device_map,\n )\n model = PeftModel.from_pretrained(base_model, output_dir)\n model = model.merge_and_unload()\n \n # Reload tokenizer to save it\n tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n tokenizer.pad_token = tokenizer.eos_token\n tokenizer.padding_side = \"right\"\n\nOur weights are merged and we reloaded the tokenizer. We can now push\neverything to the Hugging Face Hub to save our model.\n\n \n \n !huggingface-cli login\n \n model.push_to_hub(new_model, use_temp_dir=False)\n tokenizer.push_to_hub(new_model, use_temp_dir=False)\n\nYou can now use this model for inference by loading it like any other Llama 2\nmodel from the Hub. It is also possible to reload it for more fine-tuning \u2014\nperhaps with another dataset?\n\nIf you\u2019re interested in a script instead of a notebook, I recommend following\nthe instructions provided in this blog post:\n\n \n \n pip install trl\n git clone https://github.com/lvwerra/trl\n python trl/examples/scripts/sft_trainer.py \\\n --model_name meta-llama/Llama-2-7b-hf \\\n --dataset_name timdettmers/openassistant-guanaco \\\n --load_in_4bit \\\n --use_peft \\\n --batch_size 4 \\\n --gradient_accumulation_steps 2\n\n### Conclusion\n\nIn this article, we saw how to fine-tune a Llama 2 7b model using a Colab\nnotebook. We introduced some necessary background on LLM training and fine-\ntuning, as well as important considerations related to instruction datasets.\nIn the second section, we **successfully fine-tuned the Llama 2 model** with\nits native prompt template and custom parameters.\n\nThese fine-tuned models can then be integrated into LangChain and other\narchitectures as an advantageous alternative to OpenAI API. Remember that, in\nthis new paradigm, instruction datasets are the new gold, and the quality of\nyour model heavily depends on the data it\u2019s been fine-tuned on. So good luck\nbuilding high-quality datasets!\n\nIf you\u2019re interested in more content about LLMs, follow me on Twitter\n@maximelabonne.\n\n### References\n\n * Hugo Touvron, Thomas Scialom, et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.\n\n * Philipp Schmid, Omar Sanseviero, Pedro Cuenca, & Lewis Tunstall. Llama 2 is here \u2014 get it on Hugging Face. https://huggingface.co/blog/llama2\n\n * Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, & Tatsunori B. Hashimoto. (2023). Stanford Alpaca: An Instruction-following LLaMA model.\n\n * Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.\n\n * Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, & Luke Zettlemoyer. (2023). 
QLoRA: Efficient Finetuning of Quantized LLMs.\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32" }, { "id": "0a0993af-948a-4784-846a-2dbc73cbdadc", "content": { "Title": "Introduction to Weight Quantization - Maxime Labonne", "Subtitle": "Reducing the size of Large Language Models with 8-bit quantization", "Content": "# Introduction to Weight Quantization\n\n### Reducing the size of Large Language Models with 8-bit quantization\n\nMaxime Labonne\n\nJul 07, 2023\n\nLarge Language Models (LLMs) are known for their extensive computational\nrequirements. Typically, the size of a model is calculated by multiplying the\nnumber of parameters (**size**) by the precision of these values (**data\ntype**). However, to save memory, weights can be stored using lower-precision\ndata types through a process known as quantization.\n\nWe distinguish two main families of weight quantization techniques in the\nliterature:\n\n * **Post-Training Quantization** (PTQ) is a straightforward technique where the weights of an already trained model are converted to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential performance degradation.\n\n * **Quantization-Aware Training** (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.\n\nIn this article, we focus on PTQ to reduce the precision of our parameters. To\nget a good intuition, we will apply both na\u00efve and more sophisticated\ntechniques to a toy example using a GPT-2 model.\n\nThe entire code is freely available on Google Colab and GitHub.\n\n### \ud83d\udcda Background on Floating Point Representation\n\nThe choice of data type dictates the quantity of computational resources\nrequired, affecting the speed and efficiency of the model. In deep learning\napplications, balancing precision and computational performance becomes a\nvital exercise as higher precision often implies greater computational\ndemands.\n\nAmong various data types, floating point numbers are predominantly employed in\ndeep learning due to their ability to represent a wide range of values with\nhigh precision.
Typically, a floating point number uses _n_ bits to store a\nnumerical value. These _n_ bits are further partitioned into three distinct\ncomponents:\n\n 1. **Sign** : The sign bit indicates the positive or negative nature of the number. It uses one bit where 0 indicates a positive number and 1 signals a negative number.\n\n 2. **Exponent** : The exponent is a segment of bits that represents the power to which the base (usually 2 in binary representation) is raised. The exponent can also be positive or negative, allowing the number to represent very large or very small values.\n\n 3. **Significand/Mantissa** : The remaining bits are used to store the significand, also referred to as the mantissa. This represents the significant digits of the number. The precision of the number heavily depends on the length of the significand.\n\nThis design allows floating point numbers to cover a wide range of values with\nvarying levels of precision. The formula used for this representation is:\n\nTo understand this better, let\u2019s delve into some of the most commonly used\ndata types in deep learning: float32 (FP32), float16 (FP16), and bfloat16\n(BF16):\n\n * **FP32** uses 32 bits to represent a number: one bit for the sign, eight for the exponent, and the remaining 23 for the significand. While it provides a high degree of precision, the downside of FP32 is its high computational and memory footprint.\n\n * **FP16** uses 16 bits to store a number: one is used for the sign, five for the exponent, and ten for the significand. Although this makes it more memory-efficient and accelerates computations, the reduced range and precision can introduce numerical instability, potentially impacting model accuracy.\n\n * **BF16** is also a 16-bit format but with one bit for the sign, _eight_ for the exponent, and _seven_ for the significand. BF16 expands the representable range compared to FP16, thus decreasing underflow and overflow risks. Despite a reduction in precision due to fewer significand bits, BF16 typically does not significantly impact model performance and is a useful compromise for deep learning tasks.\n\nImage by author\n\nIn ML jargon, FP32 is often termed \u201cfull precision\u201d (4 bytes), while BF16 and\nFP16 are \u201chalf-precision\u201d (2 bytes). But could we do even better and store\nweights using a single byte? The answer is the INT8 data type, which consists\nof an 8-bit representation capable of storing 2\u2078 = 256 different values. In\nthe next section, we\u2019ll see how to convert FP32 weights into an INT8 format.\n\n### \ud83d\udd30 Na\u00efve 8-bit Quantization\n\nIn this section, we will implement two quantization techniques: a symmetric\none with **absolute maximum (absmax) quantization** and an asymmetric one with\n**zero-point quantization**. In both cases, the goal is to map an FP32 tensor\n**X** (original weights) to an INT8 tensor **X_quant** (quantized weights).\n\nWith **absmax quantization** , the original number is divided by the absolute\nmaximum value of the tensor and multiplied by a scaling factor (127) to map\ninputs into the range [-127, 127]. To retrieve the original FP16 values, the\nINT8 number is divided by the quantization factor, acknowledging some loss of\nprecision due to rounding.\n\nFor instance, let\u2019s say we have an absolution maximum value of 3.2. A weight\nof 0.1 would be quantized to _round(0.1 \u00d7 127/3.2) = 4_. If we want to\ndequantize it, we would get _4 \u00d7 3.2/127 = 0.1008_ , which implies an error of\n0.008. 
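\n\nWritten out explicitly, the absmax scheme described above corresponds to the following pair of formulas (reconstructed here from the description and the code below):\n\n$$\\mathbf{X}_{\\text{quant}} = \\mathrm{round}\\left(\\frac{127}{\\max|\\mathbf{X}|} \\cdot \\mathbf{X}\\right), \\qquad \\mathbf{X}_{\\text{dequant}} = \\frac{\\max|\\mathbf{X}|}{127} \\cdot \\mathbf{X}_{\\text{quant}}$$\n\n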
Here\u2019s the corresponding Python implementation:\n\n \n \n import torch\n \n def absmax_quantize(X):\n # Calculate scale\n scale = 127 / torch.max(torch.abs(X))\n \n # Quantize\n X_quant = (scale * X).round()\n \n # Dequantize\n X_dequant = X_quant / scale\n \n return X_quant.to(torch.int8), X_dequant\n\nWith **zero-point quantization** , we can consider asymmetric input\ndistributions, which is useful when you consider the output of a ReLU function\n(only positive values), for example. The input values are first scaled by the\ntotal range of values (255) divided by the difference between the maximum and\nminimum values. This distribution is then shifted by the zero-point to map it\ninto the range [-128, 127] (notice the extra value compared to absmax). First,\nwe calculate the scale factor and the zero-point value:\n\nThen, we can use these variables to quantize or dequantize our weights:\n\nLet\u2019s take an example: we have a maximum value of 3.2 and a minimum value of\n-3.0. We can calculate the scale is _255/(3.2 + 3.0) = 41.13_ and the zero-\npoint _-round(41.13 \u00d7 -3.0) - 128 = 123 -128 = -5_ , so our previous weight of\n0.1 would be quantized to _round(41.13 \u00d7 0.1 -5) = -1_. This is very different\nfrom the previous value obtained using absmax (4 vs. -1).\n\nImage by author\n\nThe Python implementation is quite straightforward:\n\n \n \n def zeropoint_quantize(X):\n # Calculate value range (denominator)\n x_range = torch.max(X) - torch.min(X)\n x_range = 1 if x_range == 0 else x_range\n \n # Calculate scale\n scale = 255 / x_range\n \n # Shift by zero-point\n zeropoint = (-scale * torch.min(X) - 128).round()\n \n # Scale and round the inputs\n X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)\n \n # Dequantize\n X_dequant = (X_quant - zeropoint) / scale\n \n return X_quant.to(torch.int8), X_dequant\n\nInstead of relying on complete toy examples, we can use these two functions on\na real model thanks to the `transformers`library.\n\nWe start by loading the model and tokenizer for GPT-2. This is a very small\nmodel we probably don\u2019t want to quantize, but it will be good enough for this\ntutorial. First, we want to observe the model\u2019s size so we can compare it\nlater and evaluate the **memory savings** due to 8-bit quantization.\n\n \n \n !pip install -q bitsandbytes>=0.39.0\n !pip install -q git+https://github.com/huggingface/accelerate.git\n !pip install -q git+https://github.com/huggingface/transformers.git\n \n \n from transformers import AutoModelForCausalLM, AutoTokenizer\n import torch\n torch.manual_seed(0)\n \n # Set device to CPU for now\n device = 'cpu'\n \n # Load model and tokenizer\n model_id = 'gpt2'\n model = AutoModelForCausalLM.from_pretrained(model_id).to(device)\n tokenizer = AutoTokenizer.from_pretrained(model_id)\n \n # Print model size\n print(f\"Model size: {model.get_memory_footprint():,} bytes\")\n \n \n Model size: 510,342,192 bytes\n\nThe size of the GPT-2 model is approximately 487MB in FP32. 
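\n\nAs a quick sanity check, we can relate this footprint to the parameter count ourselves. This is only a back-of-the-envelope verification (exact numbers depend on the `transformers` version); the small gap between the two values comes from non-parameter buffers that `get_memory_footprint()` also counts.\n\n \n \n    # Rough check: number of parameters x 4 bytes (FP32) should be close to the footprint above\n    n_params = sum(p.numel() for p in model.parameters())\n    print(f\"Parameters: {n_params:,}\")\n    print(f\"Approximate FP32 size: {n_params * 4 / 1024**2:.0f} MB\")\n\n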
The next step\nconsists of quantizing the weights using zero-point and absmax quantization.\nIn the following example, we apply these techniques to the first attention\nlayer of GPT-2 to see the results.\n\n \n \n # Extract weights of the first layer\n weights = model.transformer.h[0].attn.c_attn.weight.data\n print(\"Original weights:\")\n print(weights)\n \n # Quantize layer using absmax quantization\n weights_abs_quant, _ = absmax_quantize(weights)\n print(\"\\nAbsmax quantized weights:\")\n print(weights_abs_quant)\n \n # Quantize layer using absmax quantization\n weights_zp_quant, _ = zeropoint_quantize(weights)\n print(\"\\nZero-point quantized weights:\")\n print(weights_zp_quant)\n \n \n Original weights:\n tensor([[-0.4738, -0.2614, -0.0978, ..., 0.0513, -0.0584, 0.0250],\n [ 0.0874, 0.1473, 0.2387, ..., -0.0525, -0.0113, -0.0156],\n [ 0.0039, 0.0695, 0.3668, ..., 0.1143, 0.0363, -0.0318],\n ...,\n [-0.2592, -0.0164, 0.1991, ..., 0.0095, -0.0516, 0.0319],\n [ 0.1517, 0.2170, 0.1043, ..., 0.0293, -0.0429, -0.0475],\n [-0.4100, -0.1924, -0.2400, ..., -0.0046, 0.0070, 0.0198]])\n \n Absmax quantized weights:\n tensor([[-21, -12, -4, ..., 2, -3, 1],\n [ 4, 7, 11, ..., -2, -1, -1],\n [ 0, 3, 16, ..., 5, 2, -1],\n ...,\n [-12, -1, 9, ..., 0, -2, 1],\n [ 7, 10, 5, ..., 1, -2, -2],\n [-18, -9, -11, ..., 0, 0, 1]], dtype=torch.int8)\n \n Zero-point quantized weights:\n tensor([[-20, -11, -3, ..., 3, -2, 2],\n [ 5, 8, 12, ..., -1, 0, 0],\n [ 1, 4, 18, ..., 6, 3, 0],\n ...,\n [-11, 0, 10, ..., 1, -1, 2],\n [ 8, 11, 6, ..., 2, -1, -1],\n [-18, -8, -10, ..., 1, 1, 2]], dtype=torch.int8)\n\nThe difference between the original (FP32) and quantized values (INT8) is\nclear, but the difference between absmax and zero-point weights is more\nsubtle. In this case, the inputs look shifted by a value of -1. This suggests\nthat the weight distribution in this layer is quite symmetric.\n\nWe can compare these techniques by quantizing every layer in GPT-2 (linear\nlayers, attention layers, etc.) and create two new models: `model_abs` and\n`model_zp`. To be precise, we will actually replace the original weights with\n_**de**_ -quantized ones. This has two benefits: it allows us to 1/ compare\nthe distribution of our weights (same scale) and 2/ actually run the models.\n\nIndeed, PyTorch doesn\u2019t allow INT8 matrix multiplication by default. In a real\nscenario, we would dequantize them to run the model (in FP16 for example) but\nstore them as INT8. In the next section, we will use the `bitsandbytes`\nlibrary to solve this issue.\n\n \n \n import numpy as np\n from copy import deepcopy\n \n # Store original weights\n weights = [param.data.clone() for param in model.parameters()]\n \n # Create model to quantize\n model_abs = deepcopy(model)\n \n # Quantize all model weights\n weights_abs = []\n for param in model_abs.parameters():\n _, dequantized = absmax_quantize(param.data)\n param.data = dequantized\n weights_abs.append(dequantized)\n \n # Create model to quantize\n model_zp = deepcopy(model)\n \n # Quantize all model weights\n weights_zp = []\n for param in model_zp.parameters():\n _, dequantized = zeropoint_quantize(param.data)\n param.data = dequantized\n weights_zp.append(dequantized)\n\nNow that our models have been quantized, we want to check the impact of this\nprocess. Intuitively, we want to make sure that the quantized weights are\n**close to the original ones**. A visual way to check it is to plot the\ndistribution of the dequantized and original weights. 
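A minimal sketch of how such a comparison plot could be produced (this code is not from the original article; it assumes the `weights` and `weights_abs` lists built above):

    import matplotlib.pyplot as plt

    # Flatten the original and absmax-dequantized weights into 1D arrays
    w_orig = np.concatenate([w.flatten().numpy() for w in weights])
    w_abs = np.concatenate([w.flatten().numpy() for w in weights_abs])

    # Overlay both histograms, clipped to [-2, 2] to ignore outliers
    plt.figure(figsize=(10, 5), dpi=150)
    plt.hist(w_orig, bins=200, range=(-2, 2), alpha=0.5, color='blue', label='Original (FP32)')
    plt.hist(w_abs, bins=200, range=(-2, 2), alpha=0.5, color='red', label='Dequantized (absmax)')
    plt.xlabel('Weight value')
    plt.ylabel('Count')
    plt.legend()
    plt.show()

The same plot can be repeated with `weights_zp` to inspect the zero-point model.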
If the quantization is\nlossy, it would drastically change the weight distribution.\n\nThe following figure shows this comparison, where the blue histogram\nrepresents the original (FP32) weights, and the red one represents the\ndequantized (from INT8) weights. Note that we only display this plot between\n-2 and 2 because of outliers with very high absolute values (more on that\nlater).\n\nBoth plots are quite similar, with a surprising spike around 0. This spike\nshows that our quantization is quite lossy since reversing the process doesn\u2019t\noutput the original values. This is particularly true for the absmax model,\nwhich displays both a lower valley and a higher spike around 0.\n\nLet\u2019s compare the performance of the original and quantized models. For this\npurpose, we define a `generate_text()` function to generate 50 tokens with\ntop-k sampling.\n\n \n \n def generate_text(model, input_text, max_length=50):\n input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)\n output = model.generate(inputs=input_ids,\n max_length=max_length,\n do_sample=True,\n top_k=30,\n pad_token_id=tokenizer.eos_token_id,\n attention_mask=input_ids.new_ones(input_ids.shape))\n return tokenizer.decode(output[0], skip_special_tokens=True)\n \n # Generate text with original and quantized models\n original_text = generate_text(model, \"I have a dream\")\n absmax_text = generate_text(model_abs, \"I have a dream\")\n zp_text = generate_text(model_zp, \"I have a dream\")\n \n print(f\"Original model:\\n{original_text}\")\n print(\"-\" * 50)\n print(f\"Absmax model:\\n{absmax_text}\")\n print(\"-\" * 50)\n print(f\"Zeropoint model:\\n{zp_text}\")\n \n \n Original model:\n I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn't even that strong. And then I got the\n --------------------------------------------------\n Absmax model:\n I have a dream to find out the origin of her hair. She loves it. But there's no way you could be honest about how her hair is made. She must be crazy.\n \n We found a photo of the hairstyle posted on\n --------------------------------------------------\n Zeropoint model:\n I have a dream of creating two full-time jobs in America\u2014one for people with mental health issues, and one for people who do not suffer from mental illness\u2014or at least have an employment and family history of substance abuse, to work part\n\nInstead of trying to see if one output makes more sense than the others, we\ncan quantify it by calculating the **perplexity** of each output. This is a\ncommon metric used to evaluate language models, which measures the uncertainty\nof a model in predicting the next token in a sequence. 
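For reference, for a sequence of N tokens, perplexity is the exponential of the average negative log-likelihood assigned by the model:

PPL = exp( -(1/N) · Σᵢ log P(wᵢ | w₁, …, wᵢ₋₁) )

which is exactly what the function below computes from the model's cross-entropy loss.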
In this comparison, we\nmake the common assumption that the lower the score, the better the model is.\nIn practice, a sentence with a high perplexity could also be correct.\n\nWe implement it using a minimal function that doesn\u2019t need to consider\ndetails like the length of the context window, since our sentences are short.\n\n \n \n def calculate_perplexity(model, text):\n # Encode the text\n encodings = tokenizer(text, return_tensors='pt').to(device)\n \n # Define input_ids and target_ids\n input_ids = encodings.input_ids\n target_ids = input_ids.clone()\n \n with torch.no_grad():\n outputs = model(input_ids, labels=target_ids)\n \n # Loss calculation\n neg_log_likelihood = outputs.loss\n \n # Perplexity calculation\n ppl = torch.exp(neg_log_likelihood)\n \n return ppl\n \n ppl = calculate_perplexity(model, original_text)\n ppl_abs = calculate_perplexity(model_abs, absmax_text)\n ppl_zp = calculate_perplexity(model_zp, zp_text)\n \n print(f\"Original perplexity: {ppl.item():.2f}\")\n print(f\"Absmax perplexity: {ppl_abs.item():.2f}\")\n print(f\"Zeropoint perplexity: {ppl_zp.item():.2f}\")\n \n \n Original perplexity: 15.53\n Absmax perplexity: 17.92\n Zeropoint perplexity: 17.97\n\nWe see that the perplexity of the original model is **slightly lower** than\nthe two others. A single experiment is not very reliable, but we could repeat\nthis process multiple times to see the difference between each model. In\ntheory, zero-point quantization should be slightly better than absmax, but is\nalso more costly to compute.\n\nIn this example, we applied quantization techniques to entire layers (per-\ntensor basis). However, we could apply them at different granularity levels:\nfrom the entire model to individual values. Quantizing the entire model in one\npass would seriously degrade the performance, while quantizing individual\nvalues would create a big overhead. In practice, we often prefer **vector-\nwise quantization** , which considers the variability of values in rows and\ncolumns inside the same tensor.\n\nHowever, even vector-wise quantization doesn\u2019t solve the problem of outlier\nfeatures. Outlier features are extreme values (negative or positive) that\nappear in all transformer layers when the model reaches a certain scale (>6.7B\nparameters). This is an issue since a single outlier can reduce the precision\nfor all other values. But discarding these outlier features is not an option\nsince it would **greatly degrade** the model\u2019s performance.\n\n### \ud83d\udd22 8-bit Quantization with LLM.int8()\n\nIntroduced by Dettmers et al. (2022), LLM.int8() is a solution to the outlier\nproblem. It relies on a vector-wise (absmax) quantization scheme and\nintroduces mixed-precision quantization. This means that outlier features are\nprocessed in an FP16 format to retain their precision, while the other values\nare processed in an INT8 format. As outliers represent about 0.1% of values,\nthis effectively reduces the memory footprint of the LLM by almost 2x.\n\nImage by author\n\nLLM.int8() performs the matrix multiplication computation in three key\nsteps:\n\n 1. Extract columns from the input hidden states **X** containing outlier features using a custom threshold.\n\n 2. Perform the matrix multiplication of the outliers using FP16 and the non-outliers using INT8 with vector-wise quantization (row-wise for the hidden state **X** and column-wise for the weight matrix **W**).\n\n 3. 
Dequantize the non-outlier results (INT8 to FP16) and add them to the outlier results to get the full result in FP16.\n\nImage by author\n\nThis approach is necessary because 8-bit precision is limited and can lead to\nsubstantial errors when quantizing a vector with large values. These errors\nalso tend to amplify as they propagate through multiple layers.\n\nWe can easily use this technique thanks to the integration of the\n`bitsandbytes` library into the Hugging Face ecosystem. We just need to\nspecify `load_in_8bit=True` when loading the model (it also requires a GPU).\n\n \n \n device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n \n model_int8 = AutoModelForCausalLM.from_pretrained(model_id,\n device_map='auto',\n load_in_8bit=True,\n )\n print(f\"Model size: {model_int8.get_memory_footprint():,} bytes\")\n \n \n Model size: 176,527,896 bytes\n\nWith this extra line of code, the model is now almost three times smaller\n(168MB vs. 487MB). We can even compare the distribution of the original and\nquantized weights as we did earlier:\n\nIn this case, we see spikes around -2, -1, 0, 1, 2, etc. These values\ncorrespond to the parameters stored in the INT8 format (non-outliers). You can\nverify it by printing the model\u2019s weights using `model_int8.parameters()`.\n\nWe can also generate text with this quantized model and compare it to the\noriginal model.\n\n \n \n # Generate text with quantized model\n text_int8 = generate_text(model_int8, \"I have a dream\")\n \n print(f\"Original model:\\n{original_text}\")\n print(\"-\" * 50)\n print(f\"LLM.int8() model:\\n{text_int8}\")\n \n \n Original model:\n I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn't even that strong. And then I got the\n --------------------------------------------------\n LLM.int8() model:\n I have a dream. I don't know what will come of it, but I am going to have to look for something that will be right. I haven't thought about it for a long time, but I have to try to get that thing\n\nOnce again, it is difficult to judge what is the best output, but we can rely\non the perplexity metric to give us an (approximate) answer.\n\n \n \n print(f\"Perplexity (original): {ppl.item():.2f}\")\n \n ppl = calculate_perplexity(model_int8, text_int8)\n print(f\"Perplexity (LLM.int8()): {ppl.item():.2f}\")\n \n \n Perplexity (original): 15.53\n Perplexity (LLM.int8()): 7.93\n\nIn this case, the perplexity of the quantized model is twice as low as the\noriginal one. In general, this is not the case, but it shows that this\nquantization technique is very competitive. In fact, the authors of LLM.int8()\nshow that the performance degradation is so low it\u2019s negligible (<1%).\nHowever, it has an additional cost in terms of computation: LLM.int8() is\nroughly about 20% slower for large models.\n\n### Conclusion\n\nThis article provided an overview of the most popular weight quantization\ntechniques. We started by gaining an understanding of floating point\nrepresentation, before introducing two techniques for 8-bit quantization:\n**absmax** and **zero-point quantization**. However, their limitations,\nparticularly when it comes to handling outliers, led to **LLM.int8()** , a\ntechnique that also preserves the model\u2019s performance. 
This approach\nunderlines the progress being made in the field of weight quantization,\nrevealing the importance of properly addressing outliers.\n\nLooking forward, our next article will explore the GPTQ weight quantization\ntechnique in depth. This technique, introduced by Frantar et al., only\nutilizes 4 bits and represents a significant advancement in the field of\nweight quantization. We will provide a comprehensive guide on how to implement\nGPTQ using the AutoGPTQ library.\n\nIf you\u2019re interested in more technical content around LLMs, follow me on\nTwitter @maximelabonne.\n\n### References\n\n * T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. 2022.\n\n * Y. Beldaka, and T. Dettmers, A Gentle Introduction to 8-bit Matrix Multiplication, Hugging Face Blog (2022).\n\n * A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, A Survey of Quantization Methods for Efficient Neural Network Inference. 2021.\n\n * H. Wu, P. Judd, X. Zhang, M. Isaev, and P. Micikevicius, Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. 2020.\n\n * Lilian Weng, Large Transformer Model Inference Optimization, Lil\u2019Log (2023).\n\n * Kamil Czarnogorski, Local Large Language Models, Int8 (2023).\n\n2\n\nShare this post\n\n#### Introduction to Weight Quantization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-weight-quantization-2494701b9c0c" }, { "id": "83419ab3-ff2b-4cc7-a792-67a62fe4c585", "content": { "Title": "Decoding Strategies in Large Language Models", "Subtitle": "A Guide to Text Generation From Beam Search to Nucleus Sampling", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Decoding Strategies in Large Language Models\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Decoding Strategies in Large Language Models\n\n### A Guide to Text Generation From Beam Search to Nucleus Sampling\n\nMaxime Labonne\n\nJun 04, 2023\n\n3\n\nShare this post\n\n#### Decoding Strategies in Large Language Models\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### A Guide to Text Generation From Beam Search to Nucleus Sampling\n\nImage by author.\n\nIn the fascinating world of large language models (LLMs), much attention is\ngiven to model architectures, data processing, and optimization. However,\ndecoding strategies like beam search, which play a crucial role in text\ngeneration, are often overlooked. 
In this article, we will explore how LLMs\ngenerate text by delving into the mechanics of greedy search and beam search,\nas well as sampling techniques with top-k and nucleus sampling.\n\nBy the conclusion of this article, you\u2019ll not only understand these decoding\nstrategies thoroughly but also be familiar with how to handle important\nhyperparameters like temperature, num_beams, top_k, and top_p.\n\nThe code for this article can be found on GitHub and Google Colab for\nreference and further exploration.\n\n### \ud83d\udcda Background\n\nTo kick things off, let\u2019s start with an example. We\u2019ll feed the text \u201cI have a\ndream\u201d to a GPT-2 model and ask it to generate the next five tokens (words or\nsubwords).\n\n \n \n from transformers import GPT2LMHeadModel, GPT2Tokenizer\n import torch\n \n device = 'cuda' if torch.cuda.is_available() else 'cpu'\n model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)\n tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n model.eval()\n \n text = \"I have a dream\"\n input_ids = tokenizer.encode(text, return_tensors='pt').to(device)\n \n outputs = model.generate(input_ids, max_length=len(input_ids.squeeze())+5)\n generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n print(f\"Generated text: {generated_text}\")\n \n \n Generated text: I have a dream of being a doctor.\n\nThe sentence \u201cI have a dream of being a doctor\u201d appears to have been generated\nby GPT-2. However, GPT-2 didn\u2019t _exactly_ produce this sentence.\n\nThere\u2019s a common misconception that LLMs like GPT-2 **directly produce text**.\nThis isn\u2019t the case. Instead, LLMs calculate logits, which are scores assigned\nto every possible token in their vocabulary. To simplify, here\u2019s an\nillustrative breakdown of the process:\n\nImage by author.\n\nThe tokenizer, Byte-Pair Encoding in this instance, translates each token in\nthe input text into a corresponding token ID. Then, GPT-2 uses these token IDs\nas input and tries to predict the next most likely token. Finally, the model\ngenerates logits, which are converted into probabilities using a softmax\nfunction.\n\nFor example, the model assigns a probability of 17% to the token for \u201cof\u201d being the next token after \u201cI have a dream\u201d. This output essentially represents a ranked list of potential next tokens in the sequence. More formally, we denote this probability as _P(of | I have a dream) = 17%_.\n\nAutoregressive models like GPT predict the next token in a sequence based on\nthe preceding tokens. Consider a sequence of tokens _w = (w\u2081, w\u2082, \u2026, w\u209c)_. The joint probability of this sequence _P(w)_ can be broken down as:\n\n_P(w) = P(w\u2081) \u00d7 P(w\u2082 | w\u2081) \u00d7 \u2026 \u00d7 P(w\u209c | w\u2081, \u2026, w\u209c\u208b\u2081)_\n\nFor each token _w\u1d62_ in the sequence, _P(w\u1d62 | w\u2081, w\u2082, \u2026, w\u1d62\u208b\u2081)_ represents the conditional probability of _w\u1d62_ given all the preceding tokens (_w\u2081, w\u2082, \u2026, w\u1d62\u208b\u2081_). GPT-2 calculates this conditional probability for each of the 50,257 tokens in its vocabulary.\n\nThis leads to the question: how do we use these probabilities to generate\ntext? This is where decoding strategies, such as greedy search and beam\nsearch, come into play.\n\n### \ud83c\udfc3\u200d\u2642\ufe0f Greedy Search\n\nGreedy search is a decoding method that takes the most probable token at each\nstep as the next token in the sequence. To put it simply, it only retains the\nmost likely token at each stage, discarding all other potential options.
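Formally, at each step t, greedy search keeps only the token with the highest conditional probability over the whole vocabulary:

wₜ = argmax P(w | w₁, …, wₜ₋₁)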
Using\nour example:\n\n * **Step 1** : Input: \u201cI have a dream\u201d \u2192 Most likely token: \u201c of\u201d\n\n * **Step 2** : Input: \u201cI have a dream of\u201d \u2192 Most likely token: \u201c being\u201d\n\n * **Step 3** : Input: \u201cI have a dream of being\u201d \u2192 Most likely token: \u201c a\u201d\n\n * **Step 4** : Input: \u201cI have a dream of being a\u201d \u2192 Most likely token: \u201c doctor\u201d\n\n * **Step 5** : Input: \u201cI have a dream of being a doctor\u201d \u2192 Most likely token: \u201c.\u201d\n\nWhile this approach might sound intuitive, it\u2019s important to note that the\ngreedy search is short-sighted: it only considers the most probable token at\neach step without considering the overall effect on the sequence. This\nproperty makes it fast and efficient as it doesn\u2019t need to keep track of\nmultiple sequences, but it also means that it can miss out on better sequences\nthat might have appeared with slightly less probable next tokens.\n\nNext, let\u2019s illustrate the greedy search implementation using graphviz and\nnetworkx. We select the ID with the highest score, compute its log probability\n(we take the log to simplify calculations), and add it to the tree. We\u2019ll\nrepeat this process for five tokens.\n\n \n \n import matplotlib.pyplot as plt\n import networkx as nx\n import numpy as np\n import time\n \n def get_log_prob(logits, token_id):\n # Compute the softmax of the logits\n probabilities = torch.nn.functional.softmax(logits, dim=-1)\n log_probabilities = torch.log(probabilities)\n \n # Get the log probability of the token\n token_log_probability = log_probabilities[token_id].item()\n return token_log_probability\n \n def greedy_search(input_ids, node, length=5):\n if length == 0:\n return input_ids\n \n outputs = model(input_ids)\n predictions = outputs.logits\n \n # Get the predicted next sub-word (here we use top-k search)\n logits = predictions[0, -1, :]\n token_id = torch.argmax(logits).unsqueeze(0)\n \n # Compute the score of the predicted token\n token_score = get_log_prob(logits, token_id)\n \n # Add the predicted token to the list of input ids\n new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0)], dim=-1)\n \n # Add node and edge to graph\n next_token = tokenizer.decode(token_id, skip_special_tokens=True)\n current_node = list(graph.successors(node))[0]\n graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100\n graph.nodes[current_node]['token'] = next_token + f\"_{length}\"\n \n # Recursive call\n input_ids = greedy_search(new_input_ids, current_node, length-1)\n \n return input_ids\n \n # Parameters\n length = 5\n beams = 1\n \n # Create a balanced tree with height 'length'\n graph = nx.balanced_tree(1, length, create_using=nx.DiGraph())\n \n # Add 'tokenscore', 'cumscore', and 'token' attributes to each node\n for node in graph.nodes:\n graph.nodes[node]['tokenscore'] = 100\n graph.nodes[node]['token'] = text\n \n # Start generating text\n output_ids = greedy_search(input_ids, 0, length=length)\n output = tokenizer.decode(output_ids.squeeze().tolist(), skip_special_tokens=True)\n print(f\"Generated text: {output}\")\n \n \n Generated text: I have a dream of being a doctor.\n\nOur greedy search generates the same text as the one from the transformers\nlibrary: \u201cI have a dream of being a doctor.\u201d Let\u2019s visualize the tree we\ncreated.\n\n \n \n import matplotlib.pyplot as plt\n import networkx as nx\n import matplotlib.colors as mcolors\n from matplotlib.colors import 
LinearSegmentedColormap\n \n def plot_graph(graph, length, beams, score):\n fig, ax = plt.subplots(figsize=(3+1.2*beams**length, max(5, 2+length)), dpi=300, facecolor='white')\n \n # Create positions for each node\n pos = nx.nx_agraph.graphviz_layout(graph, prog=\"dot\")\n \n # Normalize the colors along the range of token scores\n if score == 'token':\n scores = [data['tokenscore'] for _, data in graph.nodes(data=True) if data['token'] is not None]\n elif score == 'sequence':\n scores = [data['sequencescore'] for _, data in graph.nodes(data=True) if data['token'] is not None]\n vmin = min(scores)\n vmax = max(scores)\n norm = mcolors.Normalize(vmin=vmin, vmax=vmax)\n cmap = LinearSegmentedColormap.from_list('rg', [\"r\", \"y\", \"g\"], N=256) \n \n # Draw the nodes\n nx.draw_networkx_nodes(graph, pos, node_size=2000, node_shape='o', alpha=1, linewidths=4, \n node_color=scores, cmap=cmap)\n \n # Draw the edges\n nx.draw_networkx_edges(graph, pos)\n \n # Draw the labels\n if score == 'token':\n labels = {node: data['token'].split('_')[0] + f\"\\n{data['tokenscore']:.2f}%\" for node, data in graph.nodes(data=True) if data['token'] is not None}\n elif score == 'sequence':\n labels = {node: data['token'].split('_')[0] + f\"\\n{data['sequencescore']:.2f}\" for node, data in graph.nodes(data=True) if data['token'] is not None}\n nx.draw_networkx_labels(graph, pos, labels=labels, font_size=10)\n plt.box(False)\n \n # Add a colorbar\n sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)\n sm.set_array([])\n if score == 'token':\n fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Token probability (%)')\n elif score == 'sequence':\n fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Sequence score')\n plt.show()\n \n # Plot graph\n plot_graph(graph, length, 1.5, 'token')\n\nImage by author.\n\nIn this graph, the top node stores the input token (thus with a 100%\nprobability), while all other nodes represent generated tokens. Although each\ntoken in this sequence was the most likely at the time of prediction, \u201cbeing\u201d\nand \u201cdoctor\u201d were assigned relatively low probabilities of 9.68% and 2.86%,\nrespectively. This suggests that \u201cof\u201d, our first predicted token, may not have\nbeen the most suitable choice as it led to \u201cbeing\u201d, which is quite unlikely.\n\nIn the following section, we\u2019ll explore how beam search can address this\nproblem.\n\n### \u2696\ufe0f Beam Search\n\nUnlike greedy search, which only considers the next most probable token, beam\nsearch takes into account the _n_ most likely tokens, where _n_ represents the\nnumber of beams. This procedure is repeated until a predefined maximum length\nis reached or an end-of-sequence token appears. At this point, the sequence\n(or \u201cbeam\u201d) with the highest overall score is chosen as the output.\n\nWe can adapt the previous function to consider the _n_ most probable tokens\ninstead of just one. Here, we\u2019ll maintain the sequence score log _P(w)_ ,\nwhich is the cumulative sum of the log probability of every token in the beam.\nWe normalize this score by the sequence length to prevent bias towards longer\nsequences (this factor can be adjusted). 
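Concretely, the `sequencescore` attribute maintained in the code below is the sum of the log-probabilities of the generated tokens divided by the total number of token IDs in the candidate sequence (prompt included):

score(w) = (1/t) · Σ log P(wᵢ | w₁, …, wᵢ₋₁)

where t is the length of the candidate sequence.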
Once again, we\u2019ll generate five\nadditional tokens to complete the sentence \u201cI have a dream.\u201d\n\n \n \n from tqdm.notebook import tqdm\n \n def greedy_sampling(logits, beams):\n return torch.topk(logits, beams).indices\n \n def beam_search(input_ids, node, bar, length, beams, sampling, temperature=0.1):\n if length == 0:\n return None\n \n outputs = model(input_ids)\n predictions = outputs.logits\n \n # Get the predicted next sub-word (here we use top-k search)\n logits = predictions[0, -1, :]\n \n if sampling == 'greedy':\n top_token_ids = greedy_sampling(logits, beams)\n elif sampling == 'top_k':\n top_token_ids = top_k_sampling(logits, temperature, 20, beams)\n elif sampling == 'nucleus':\n top_token_ids = nucleus_sampling(logits, temperature, 0.5, beams)\n \n for j, token_id in enumerate(top_token_ids):\n bar.update(1)\n \n # Compute the score of the predicted token\n token_score = get_log_prob(logits, token_id)\n cumulative_score = graph.nodes[node]['cumscore'] + token_score\n \n # Add the predicted token to the list of input ids\n new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0).unsqueeze(0)], dim=-1)\n \n # Add node and edge to graph\n token = tokenizer.decode(token_id, skip_special_tokens=True)\n current_node = list(graph.successors(node))[j]\n graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100\n graph.nodes[current_node]['cumscore'] = cumulative_score\n graph.nodes[current_node]['sequencescore'] = 1/(len(new_input_ids.squeeze())) * cumulative_score\n graph.nodes[current_node]['token'] = token + f\"_{length}_{j}\"\n \n # Recursive call\n beam_search(new_input_ids, current_node, bar, length-1, beams, sampling, 1)\n \n # Parameters\n length = 5\n beams = 2\n \n # Create a balanced tree with height 'length' and branching factor 'k'\n graph = nx.balanced_tree(beams, length, create_using=nx.DiGraph())\n bar = tqdm(total=len(graph.nodes))\n \n # Add 'tokenscore', 'cumscore', and 'token' attributes to each node\n for node in graph.nodes:\n graph.nodes[node]['tokenscore'] = 100\n graph.nodes[node]['cumscore'] = 0\n graph.nodes[node]['sequencescore'] = 0\n graph.nodes[node]['token'] = text\n \n # Start generating text\n beam_search(input_ids, 0, bar, length, beams, 'greedy', 1)\n\nThe function computes the scores for 63 tokens and beams^length = 5\u00b2 = 25\npossible sequences. In our implementation, all the information is stored in\nthe graph. Our next step is to extract the best sequence.\n\nFirst, we identify the leaf node with the highest sequence score. Next, we\nfind the shortest path from the root to this leaf. Every node along this path\ncontains a token from the optimal sequence. Here\u2019s how we can implement it:\n\n \n \n def get_best_sequence(G):\n # Create a list of leaf nodes\n leaf_nodes = [node for node in G.nodes() if G.out_degree(node)==0]\n \n # Get the leaf node with the highest cumscore\n max_score_node = None\n max_score = float('-inf')\n for node in leaf_nodes:\n if G.nodes[node]['sequencescore'] > max_score:\n max_score = G.nodes[node]['sequencescore']\n max_score_node = node\n \n # Retrieve the sequence of nodes from this leaf node to the root node in a list\n path = nx.shortest_path(G, source=0, target=max_score_node)\n \n # Return the string of token attributes of this sequence\n sequence = \"\".join([G.nodes[node]['token'].split('_')[0] for node in path])\n \n return sequence, max_score\n \n sequence, max_score = get_best_sequence(graph)\n print(f\"Generated text: {sequence}\")\n \n \n Generated text: I have a dream. 
I have a dream\n\nThe best sequence seems to be \u201cI have a dream. I have a dream,\u201d which is a\ncommon response from GPT-2, even though it may be surprising. To verify this,\nlet\u2019s plot the graph.\n\nIn this visualization, we\u2019ll display the sequence score for each node, which\nrepresents the score of the sequence up to that point. If the function\nget_best_sequence() is correct, the \u201cdream\u201d node in the sequence \u201cI have a\ndream. I have a dream\u201d should have the highest score among all the leaf nodes.\n\n \n \n # Plot graph\n plot_graph(graph, length, beams, 'sequence')\n\nIndeed, the \u201cdream\u201d token has the **highest sequence score** with a value of\n-0.69. Interestingly, we can see the score of the greedy sequence \u201cI have a\ndream of being a doctor.\u201d on the left with a value of -1.16.\n\nAs expected, the greedy search leads to suboptimal results. But, to be honest,\nour new outcome is not particularly compelling either. To generate more varied\nsequences, we\u2019ll implement two sampling algorithms: top-k and nucleus.\n\n### \ud83c\udfb2 Top-k sampling\n\nTop-k sampling is a technique that leverages the probability distribution\ngenerated by the language model to **select a token randomly from the**\n_**k**_**most likely options**.\n\nTo illustrate, suppose we have _k = 3_ and four tokens: A, B, C, and D, with\nrespective probabilities: _P(A) = 30%_ , _P(B) = 15%_ , _P(C) = 5%_ , and\n_P(D) = 1%_. In top-k sampling, token D is disregarded, and the algorithm will\noutput A 60% of the time, B 30% of the time, and C 10% of the time. This\napproach ensures that we prioritize the most probable tokens while introducing\nan element of randomness in the selection process.\n\nAnother way of introducing randomness is the concept of temperature. The\ntemperature _T_ is a parameter that ranges from 0 to 1, which affects the\nprobabilities generated by the softmax function, making the most likely tokens\nmore influential. In practice, it simply consists of dividing the input logits\nby a value we call temperature:\n\nHere is a chart that demonstrates the impact of temperature on the\nprobabilities generated for a given set of input logits [1.5, -1.8, 0.9,\n-3.2]. We\u2019ve plotted three different temperature values to observe the\ndifferences.\n\nA temperature of 1.0 is equivalent to a default softmax with no temperature at\nall. On the other hand, a low temperature setting (0.1) significantly alters\nthe probability distribution. This is commonly used in text generation to\ncontrol the level of \u201ccreativity\u201d in the generated output. By adjusting the\ntemperature, we can influence the extent to which the model produces more\ndiverse or predictable responses.\n\nLet\u2019s now implement the top k sampling algorithm. We\u2019ll use it in the\nbeam_search() function by providing the \u201ctop_k\u201d argument. 
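As a side note before the implementation, here is a minimal sketch (not from the original notebook) of the temperature scaling described above, applied to the example logits [1.5, -1.8, 0.9, -3.2]:

    # Temperature scaling: divide the logits by T before the softmax
    logits = torch.tensor([1.5, -1.8, 0.9, -3.2])

    for T in [1.0, 0.5, 0.1]:  # three illustrative temperatures
        probs = torch.nn.functional.softmax(logits / T, dim=-1)
        print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

Lower temperatures concentrate almost all of the probability mass on the most likely token, while T=1.0 leaves the softmax unchanged.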
To illustrate how\nthe algorithm works, we will also plot the probability distributions for top_k\n= 20.\n\n \n \n def plot_prob_distribution(probabilities, next_tokens, sampling, potential_nb, total_nb=50):\n # Get top k tokens\n top_k_prob, top_k_indices = torch.topk(probabilities, total_nb)\n top_k_tokens = [tokenizer.decode([idx]) for idx in top_k_indices.tolist()]\n \n # Get next tokens and their probabilities\n next_tokens_list = [tokenizer.decode([idx]) for idx in next_tokens.tolist()]\n next_token_prob = probabilities[next_tokens].tolist()\n \n # Create figure\n plt.figure(figsize=(0.4*total_nb, 5), dpi=300, facecolor='white')\n plt.rc('axes', axisbelow=True)\n plt.grid(axis='y', linestyle='-', alpha=0.5)\n if potential_nb < total_nb:\n plt.axvline(x=potential_nb-0.5, ls=':', color='grey', label='Sampled tokens')\n plt.bar(top_k_tokens, top_k_prob.tolist(), color='blue')\n plt.bar(next_tokens_list, next_token_prob, color='red', label='Selected tokens')\n plt.xticks(rotation=45, ha='right', va='top')\n plt.gca().spines['top'].set_visible(False)\n plt.gca().spines['right'].set_visible(False)\n if sampling == 'top_k':\n plt.title('Probability distribution of predicted tokens with top-k sampling')\n elif sampling == 'nucleus':\n plt.title('Probability distribution of predicted tokens with nucleus sampling')\n plt.legend()\n plt.savefig(f'{sampling}_{time.time()}.png', dpi=300)\n plt.close()\n \n def top_k_sampling(logits, temperature, top_k, beams, plot=True):\n assert top_k >= 1\n assert beams <= top_k\n \n indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]\n new_logits = torch.clone(logits)\n new_logits[indices_to_remove] = float('-inf')\n \n # Convert logits to probabilities\n probabilities = torch.nn.functional.softmax(new_logits / temperature, dim=-1)\n \n # Sample n tokens from the resulting distribution\n next_tokens = torch.multinomial(probabilities, beams)\n \n # Plot distribution\n if plot:\n total_prob = torch.nn.functional.softmax(logits / temperature, dim=-1)\n plot_prob_distribution(total_prob, next_tokens, 'top_k', top_k)\n \n return next_tokens\n \n # Start generating text\n beam_search(input_ids, 0, bar, length, beams, 'top_k', 1)\n\nImage by author.\n\nThese plots give a good intuition of how top-k sampling works, with all the\npotentially selected tokens on the left of the horizontal bar. While the most\nprobable tokens are selected (in red) most of the time, it also allows less\nlikely tokens to be chosen. This offers an interesting tradeoff that can steer\na sequence towards a less predictable but more natural-sounding sentence. Now\nlet\u2019s print the text it generated.\n\n \n \n sequence, max_score = get_best_sequence(graph)\n print(f\"Generated text: {sequence}\")\n \n \n Generated text: I have a dream job and I want to\n\nThe top-k sampling found a new sequence: \u201cI have a dream job and I want to\u201d,\nwhich feels significantly more natural than \u201cI have a dream. I have a dream\u201d.\nWe\u2019re making progress!\n\nLet\u2019s see how this decision tree differs from the previous one.\n\n \n \n # Plot graph\n plot_graph(graph, length, beams, 'sequence')\n\nYou can see how the nodes differ significantly from the previous iteration,\nmaking more diverse choices. 
Although the sequence score of this new outcome\nmight not be the highest (-1.01 instead of -0.69 previously), it\u2019s important\nto remember that higher scores do not always lead to more realistic or\nmeaningful sequences.\n\nNow that we\u2019ve introduced top-k sampling, we have to present the other most\npopular sampling technique: nucleus sampling.\n\n### \ud83d\udd2c Nucleus sampling\n\nNucleus sampling, also known as top-p sampling, takes a different approach\nfrom top-k sampling. Rather than selecting the top _k_ most probable tokens,\nnucleus sampling chooses a cutoff value _p_ such that the **sum of the\nprobabilities of the selected tokens exceeds** _**p**_. This forms a \u201cnucleus\u201d\nof tokens from which to randomly choose the next token.\n\nIn other words, the model examines its top probable tokens in descending order\nand keeps adding them to the list until the total probability surpasses the\nthreshold _p_. Unlike top-k sampling, the number of tokens included in the\nnucleus can vary from step to step. This variability often results in a more\ndiverse and creative output, making nucleus sampling popular for tasks such as\ntext generation.\n\nTo implement the nucleus sampling method, we can use the \u201cnucleus\u201d parameter\nin the beam_search() function. In this example, we\u2019ll set the value of _p_ to\n0.5. To make it easier, we\u2019ll include a minimum number of tokens equal to the\nnumber of beams. We\u2019ll also consider tokens with cumulative probabilities\nlower than _p_ , rather than higher. It\u2019s worth noting that while the details\nmay differ, the core idea of nucleus sampling remains the same.\n\n \n \n def nucleus_sampling(logits, temperature, p, beams, plot=True):\n assert p > 0\n assert p <= 1\n \n # Sort the probabilities in descending order and compute cumulative probabilities\n sorted_logits, sorted_indices = torch.sort(logits, descending=True)\n probabilities = torch.nn.functional.softmax(sorted_logits / temperature, dim=-1)\n cumulative_probabilities = torch.cumsum(probabilities, dim=-1)\n \n # Create a mask for probabilities that are in the top-p\n mask = cumulative_probabilities < p\n \n # If there's not n index where cumulative_probabilities < p, we use the top n tokens instead\n if mask.sum() > beams:\n top_p_index_to_keep = torch.where(mask)[0][-1].detach().cpu().tolist()\n else:\n top_p_index_to_keep = beams\n \n # Only keep top-p indices\n indices_to_remove = sorted_indices[top_p_index_to_keep:]\n sorted_logits[indices_to_remove] = float('-inf')\n \n # Sample n tokens from the resulting distribution\n probabilities = torch.nn.functional.softmax(sorted_logits / temperature, dim=-1)\n next_tokens = torch.multinomial(probabilities, beams)\n \n # Plot distribution\n if plot:\n total_prob = torch.nn.functional.softmax(logits / temperature, dim=-1)\n plot_prob_distribution(total_prob, next_tokens, 'nucleus', top_p_index_to_keep)\n \n return next_tokens\n \n # Start generating text\n beam_search(input_ids, 0, bar, length, beams, 'nucleus', 1)\n\nImage by author.\n\nIn this plot, you can see that the number of tokens included in the nucleus\n(left of the vertical bar) fluctuates a lot. The generated probability\ndistributions vary considerably, leading to the selection of tokens that are\nnot always among the most probable ones. This opens the door to the generation\nof unique and varied sequences. 
Now, let\u2019s observe the text it generated.\n\n \n \n sequence, max_score = get_best_sequence(graph)\n print(f\"Generated text: {sequence}\")\n \n \n Generated text: I have a dream. I'm going to\n\nThe nucleus sampling algorithm produces the sequence: \u201cI have a dream. I\u2019m\ngoing to\u201d, which shows a notable enhancement in semantic coherence compared to\ngreedy sampling.\n\nTo compare the decision paths, let\u2019s visualize the new tree nucleus sampling\ngenerated.\n\n \n \n # Plot graph\n plot_graph(graph, length, beams, 'sequence')\n\nAs with top-k sampling, this tree is very different from the one generated\nwith greedy sampling, displaying more variety. Both top-k and nucleus sampling\noffer unique advantages when generating text, enhancing diversity, and\nintroducing creativity into the output. Your choice between the two methods\n(or even greedy search) will depend on the specific requirements and\nconstraints of your project.\n\n### Conclusion\n\nIn this article, we have delved deep into various decoding methods used by\nLLMs, specifically GPT-2. We started with a simply **greedy search** and its\nimmediate (yet often suboptimal) selection of the most probable next token.\nNext, we introduced the **beam search** technique, which considers several of\nthe most likely tokens at each step. Although it offers more nuanced results,\nbeam search can sometimes fall short in generating diverse and creative\nsequences.\n\nTo bring more variability into the process, we then moved on to **top-k\nsampling** and **nucleus sampling**. Top-k sampling diversifies the text\ngeneration by randomly selecting among the _k_ most probable tokens, while\nnucleus sampling takes a different path by dynamically forming a nucleus of\ntokens based on cumulative probability. Each of these methods brings unique\nstrengths and potential drawbacks to the table, and the specific requirements\nof your project will largely dictate the choice among them.\n\nUltimately, understanding these techniques and their trade-offs will equip you\nto better guide the LLMs towards producing increasingly realistic, nuanced,\nand compelling textual output.\n\nIf you\u2019re interested in more technical content around LLMs, you can follow me\non Twitter @maximelabonne.\n\n3\n\nShare this post\n\n#### Decoding Strategies in Large Language Models\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/decoding-strategies-in-large-language-models-9733a8f70539" }, { "id": "d0f2f790-c745-4858-a2c5-e4daeedb53cf", "content": { "Title": "The Art of Spending: Optimizing Your Marketing Budget with Nonlinear Optimization", "Subtitle": "Introduction to CVXPY to maximize marketing ROI", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### The Art of Spending: Optimizing Your Marketing Budget with Nonlinear\nOptimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# The Art of Spending: Optimizing Your Marketing Budget with Nonlinear\nOptimization\n\n### Introduction to CVXPY to maximize marketing ROI\n\nMaxime Labonne\n\nMay 22, 2023\n\n1\n\nShare this post\n\n#### The Art of Spending: Optimizing Your Marketing Budget with Nonlinear\nOptimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Introduction to CVXPY to maximize marketing ROI\n\nImage by author\n\nIn the age of digital marketing, businesses face the challenge of allocating\ntheir marketing budget across multiple channels to maximize sales.\n\nHowever, as they broaden their reach, these firms inevitably face the issue of\n**diminishing returns** \u2014 the phenomenon where additional investment in a\nmarketing channel yields progressively smaller increases in conversions. This\nis where the concept of marketing budget allocation steps in, adding another\nlayer of complexity to the whole process.\n\nIn this article, we\u2019re going to explore the potential of nonlinear\nprogramming, specifically conic optimization (or cone programming), as a tool\nfor marketing budget allocation. With the use of this advanced mathematical\ntechnique, we aim to optimize the distribution of marketing budget across\nvarious platforms to extract the maximum value and the highest possible ROI.\n\nThe code is available on GitHub and Google Colab.\n\n### **\ud83d\udcb0 Marketing budget allocation**\n\nMarketing budget allocation is a critical aspect of any advertising campaign,\nrequiring businesses to strategically distribute their resources across\ndifferent channels. The goal is to maximize the effectiveness of their\nmarketing efforts and achieve the highest possible return on investment (ROI).\nTo tackle this challenge, we need to consider three key components:\n\n 1. **Attribution** : How can we connect conversion events to specific campaigns?\n\n 2. **Performance Estimation** : How can we predict the performance of a campaign based on its allocated budget?\n\n 3. **Optimization** : How can we allocate budgets across various campaigns to maximize ROI?\n\n### **\ud83d\udd17 1. Attribution: Connecting Conversions to Campaigns**\n\nAttribution is the process of determining which campaigns are responsible for\nconverting customers. Some channels, like Facebook or AdWords, can directly\nclaim conversions. 
However, there are various attribution models to consider,\nincluding:\n\n * First touch\n\n * Last touch\n\n * Multi-touch\n\n * Time decay\n\n * Position-based\n\nAttribution systems are not without their issues, with two main challenges:\n\n * **Lag** : The time it takes to measure the performance of ads and attribute conversions accurately\n\n * **Attribution Window** : The trade-off between using a short versus a long window to attribute conversions\n\nFor example, DoorDash used a several-day last-touch attribution system. The\nproblem they faced was the need to wait for several days to measure the\nperformance of their ads, which proved too lengthy given the rapid changes in\ntheir market.\n\n### **\ud83d\udd2e 2. Performance Estimation: Predicting Campaign Success**\n\nPerformance estimation involves creating a model that can predict the success\nof a marketing campaign based on its budget allocation. Here, success can be\ndefined in terms of various Key Performance Indicators (KPIs), such as:\n\n * Leads\n\n * Cost per Lead (CPL)\n\n * Customer Lifetime Value (CLV)\n\n * Customer Acquisition Cost (CAC)\n\nTraditionally, linear models have been used for performance estimation.\nHowever, they assume that marketing channels **don\u2019t exhibit diminishing\nreturns** , which is often not the case. To obtain nontrivial solutions,\nlinear models typically incorporate multiple constraints and are solved using\nLinear Programming (LP).\n\nIn reality, response curves in marketing mix modeling often display different\nshapes, such as:\n\n * Linear (rare)\n\n * Concave (common, indicating diminishing returns)\n\n * Convex (rare)\n\n * S-shaped (rare)\n\nImage by author\n\nThese shapes reflect the **diminishing returns** of marketing spending or the\nvarying effectiveness of different channels at different budget levels. For\nexample, investing more money into a channel might initially yield higher\nreturns (convex), but after a certain point, each additional dollar may\ngenerate less and less incremental outcome (becoming concave), creating an\nS-shaped curve overall.\n\nTo capture the intrinsic nonlinearity of the marketing budget allocation\nproblem, a more sophisticated approach is needed. This is where nonlinear\nprogramming, specifically conic optimization, comes into play.\n\n### **\ud83d\udd04 3. Optimization: Nonlinear Optimization with CVXPY**\n\nNonlinear programming, also known as nonlinear optimization, is a method used\nto solve optimization problems where the **objective function, constraints** ,\nor both, are **nonlinear**. In simple terms, it\u2019s the process of finding the\noptimal solution (either maximizing or minimizing) for a system that\u2019s\ngoverned by a set of nonlinear equations.\n\nIn this example, we will model the returns for each marketing channel\n(response curve) using the natural logarithm as follows:\n\nThe two previous steps of attribution and performance estimation approximate\nthe values of \u03b1\u1d62 and \u03b2\u1d62 for every channel _i_. Let\u2019s take a simple example\nwith three channels:\n\nThe noise observed in these values is typical in marketing budget allocation\nproblems. 
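For reference, the response curve mentioned above is modeled as returnᵢ(xᵢ) = αᵢ + βᵢ · ln(xᵢ), where xᵢ is the budget allocated to channel i. The α and β values for the three example channels (Google, Facebook, and Twitter Ads) are the ones hard-coded in the snippet below.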
Note that the alpha values are **negative** ; this can be\ninterpreted as the initial cost of engaging with a new marketing channel.\n\nWe can plot the response curves of each marketing channel using matplotlib.\n\n \n \n import matplotlib.pyplot as plt\n import numpy as np\n np.random.seed(0)\n \n TOTAL_BUDGET = 100_000\n \n # Alpha and beta constants\n alphas = np.array([-9453.72, -8312.84, -7371.33])\n betas = np.array([8256.21, 7764.20, 7953.36])\n \n # Linearly spaced numbers\n x = np.linspace(1, TOTAL_BUDGET, TOTAL_BUDGET)\n \n # Plot the response curves\n fig = plt.figure(figsize=(10, 5), dpi=300)\n plt.plot(x, alphas[0] + betas[0] * np.log(x), color='red', label='Google Ads')\n plt.plot(x, alphas[1] + betas[1] * np.log(x), color='blue', label='Facebook Ads')\n plt.plot(x, alphas[2] + betas[2] * np.log(x), color='green', label='Twitter Ads')\n plt.xlabel('Budget ($)')\n plt.ylabel('Returns ($)') \n plt.legend()\n plt.show()\n\nHow do we find the best values for each response curve? The easiest solution\nconsists of a greedy algorithm that randomly samples values and evaluates the\nresult. Our optimization problem can be described as follows:\n\nmaximize \u03b1\u2081 + \u03b2\u2081 ln(x\u2081) + \u03b1\u2082 + \u03b2\u2082 ln(x\u2082) + \u03b1\u2083 + \u03b2\u2083 ln(x\u2083), subject to x\u2081 + x\u2082 + x\u2083 \u2264 100,000 and x\u1d62 > 0\n\nThe following function has a budget of 1,000 iterations to find the best\nallocation.\n\n \n \n def greedy_optimization(TOTAL_BUDGET, alphas, betas, num_iterations=1_000):\n # Initialize the budget allocation and the best objective value\n google_budget = facebook_budget = twitter_budget = TOTAL_BUDGET / 3\n obj = alphas[0] + betas[0] * np.log(google_budget) + alphas[1] + betas[1] * np.log(facebook_budget) + alphas[2] + betas[2] * np.log(twitter_budget)\n \n for _ in range(num_iterations):\n # Generate a new random allocation\n random_allocation = np.random.dirichlet(np.ones(3)) * TOTAL_BUDGET\n google_budget_new, facebook_budget_new, twitter_budget_new = random_allocation\n \n # Calculate the new objective value\n new_obj = alphas[0] + betas[0] * np.log(google_budget_new) + alphas[1] + betas[1] * np.log(facebook_budget_new) + alphas[2] + betas[2] * np.log(twitter_budget_new)\n \n # If the new allocation improves the objective value, keep it\n if new_obj > obj:\n google_budget, facebook_budget, twitter_budget = google_budget_new, facebook_budget_new, twitter_budget_new\n obj = new_obj\n \n # Return the best allocation and the corresponding objective value\n return (google_budget, facebook_budget, twitter_budget), obj\n\nLet\u2019s run it and see the approximated solution it found:\n\n \n \n # Run the greedy optimization\n (best_google, best_facebook, best_twitter), obj = greedy_optimization(TOTAL_BUDGET, alphas, betas)\n \n # Print the result\n print('='*59 + '\\n' + ' '*24 + 'Solution' + ' '*24 + '\\n' + '='*59)\n print(f'Returns = ${round(obj):,}\\n')\n print('Marketing allocation:')\n print(f' - Google Ads = ${round(best_google):,}')\n print(f' - Facebook Ads = ${round(best_facebook):,}')\n print(f' - Twitter Ads = ${round(best_twitter):,}')\n \n \n ===========================================================\n Solution \n ===========================================================\n Returns = $224,534\n \n Marketing allocation:\n - Google Ads = $35,476\n - Facebook Ads = $31,722\n - Twitter Ads = $32,802\n\nAfter running our calculations, we find that our total return is $224,533. You\nmight wonder if we can improve it by tweaking our model more or running more\niterations, but a random search like this offers no guarantee of optimality.\n\nThis is exactly where nonlinear programming comes to the\nrescue: it can output the **best solution possible** , also called the optimal\nsolution.
On top of this overwhelming advantage, it is also faster to run.\n\nTo solve the marketing budget allocation problem using nonlinear programming,\nwe\u2019ll use the **CVXPY** library, which supports conic optimization thanks to\nspecialized solvers like ECOS, MOSEK (interior point method), and SCS (first-\norder method). In this example, we\u2019ll use the open-source ECOS solver to find\nthe optimal solution.\n\nLet\u2019s set up the optimization problem:\n\n * Our decision **variables** are the (positive) budgets for each channel\n\n * Our **constraint** is that the sum of all budgets must not exceed the total budget\n\n * Our **objective** is to maximize the total return, which is the sum of the returns for each channel\n\n \n \n import cvxpy as cp\n \n # Variables\n google = cp.Variable(pos=True)\n facebook = cp.Variable(pos=True)\n twitter = cp.Variable(pos=True)\n \n # Constraint\n constraint = [google + facebook + twitter <= TOTAL_BUDGET]\n \n # Objective\n obj = cp.Maximize(alphas[0] + betas[0] * cp.log(google)\n + alphas[1] + betas[1] * cp.log(facebook)\n + alphas[2] + betas[2] * cp.log(twitter))\n\nFinally, we call the ECOS solver to find the optimal budget allocations and\ndisplay the results.\n\n \n \n # Solve\n prob = cp.Problem(obj, constraint)\n prob.solve(solver='ECOS', verbose=False)\n \n # Print solution\n print('='*59 + '\\n' + ' '*24 + 'Solution' + ' '*24 + '\\n' + '='*59)\n print(f'Status = {prob.status}')\n print(f'Returns = ${round(prob.value):,}\\n')\n print('Marketing allocation:')\n print(f' - Google Ads = ${round(google.value):,}')\n print(f' - Facebook Ads = ${round(facebook.value):,}')\n print(f' - Twitter Ads = ${round(twitter.value):,}')\n \n \n ===========================================================\n Solution \n ===========================================================\n Status = optimal\n Returns = $224,540\n \n Marketing allocation:\n - Google Ads = $34,439\n - Facebook Ads = $32,386\n - Twitter Ads = $33,175\n\nThe optimal allocation found by the solver is $34,439 for Google Ads, $32,386\nfor Facebook Ads, and $33,175 for YouTube, for a total return of $224,540!\nThis is **$7 higher than what the greedy algorithm returned**($224,533).\n\nKeep in mind that this allocation maximizes the returns based on our response\ncurves: correctly modeling these curves is crucial for optimizing the budget\neffectively.\n\nLet\u2019s visualize this optimal allocation on top of the previous response\ncurves.\n\n \n \n # Plot the functions and the results\n fig = plt.figure(figsize=(10, 5), dpi=300)\n \n plt.plot(x, alphas[0] + betas[0] * np.log(x), color='red', label='Google Ads')\n plt.plot(x, alphas[1] + betas[1] * np.log(x), color='blue', label='Facebook Ads')\n plt.plot(x, alphas[2] + betas[2] * np.log(x), color='green', label='Twitter Ads')\n \n # Plot optimal points\n plt.scatter([google.value, facebook.value, twitter.value],\n [alphas[0] + betas[0] * np.log(google.value),\n alphas[1] + betas[1] * np.log(facebook.value),\n alphas[2] + betas[2] * np.log(twitter.value)],\n marker=\"+\", color='black', zorder=10)\n \n plt.xlabel('Budget ($)')\n plt.ylabel('Returns ($)') \n plt.legend()\n plt.show()\n\nBut is it **really optimal**? We can do a quick sanity check by running the\ngreedy algorithm for different numbers of iterations. 
This will show us the\ndifference between these two approaches.\n\nLet\u2019s run it for 20 different numbers of iterations between 1 and 1,000,000.\n\n \n \n # List to store the best objective value for each number of iterations\n best_obj_list = []\n \n # Range of number of iterations to test\n num_iterations_range = np.logspace(0, 6, 20).astype(int)\n \n # Run the greedy algorithm for each number of iterations and store the best objective value\n for num_iterations in num_iterations_range:\n _, best_obj = greedy_optimization(TOTAL_BUDGET, alphas, betas, num_iterations)\n best_obj_list.append(best_obj)\n\nWe can now plot the resulting list using matplotlib and compare it to the\noptimal solution:\n\n \n \n # Plot the results\n plt.figure(figsize=(10, 5), dpi=300)\n plt.ticklabel_format(useOffset=False)\n plt.plot(num_iterations_range, best_obj_list, label='Greedy algorithm')\n plt.axhline(y=prob.value, color='r', linestyle='--', label='Optimal solution (CVXPY)')\n plt.xlabel('Number of iterations')\n plt.xticks(num_iterations_range)\n plt.xscale(\"log\")\n plt.ylabel('Best returns ($)')\n plt.title('Best returns found by the greedy algorithm for different numbers of iterations')\n plt.legend()\n plt.show()\n\nWe observe that the greedy algorithm performs relatively well when given a\nlarge number of iterations. However, despite one million attempts, it falls\njust short of finding the optimal allocation, which yields a return of\n$224,540.1500. The best non-rounded value it could reach is $224,540.1489.\n\nTo add to this, there\u2019s a significant difference in terms of **computational\nspeed** between the two approaches. The nonlinear programming model identified\nthe optimal solution in a swift 22.3 milliseconds. In stark contrast, the\ngreedy algorithm took a considerable 30 seconds to run its 1 million\niterations and find a nearly optimal solution.\n\nThis disparity becomes even more crucial when we extend our problem to\n**numerous marketing channels**. Nonlinear programming with CVXPY maintains\nits speed and precision, making it a highly efficient tool for complex, high-\ndimensional marketing budget allocation problems.\n\n### **Conclusion**\n\nNonlinear programming offers a powerful approach to tackling the marketing\nbudget allocation problem. By modeling the diminishing returns of each\nmarketing channel with **nonlinear functions** and leveraging the CVXPY\nlibrary, we can find the optimal allocation of resources that maximizes sales.\n\nAs the marketing landscape evolves and the number of channels increases,\noptimization techniques like nonlinear programming can help businesses make\nbetter, data-driven decisions about their marketing investments. While this\narticle provides a starting point, there are many more advanced techniques and\nmodels to explore. Keep learning and experimenting to find the best approach\nfor your business.\n\nIf you\u2019re interested to know more about it, feel free to follow me on Twitter\n@maximelabonne. 
Happy optimizing!\n\n### **References**\n\nIf you want to learn more about marketing budget allocation, I recommend the\nfollowing resources:\n\n * Park et al., A Nonlinear Optimization Model of Advertising Budget Allocation across Multiple Digital Media Channels (2022): an excellent approach based on diminishing returns, which inspired this article.\n\n * Zhao et al., A Unified Framework for Marketing Budget Allocation (2019): fascinating architecture currently in production at Alibaba, based on a logit response curve.\n\n * Katsov, Cross-channel marketing spend optimization using deep learning (2019): blog post about an intriguing LSTM-based approach, without convex optimization.\n\n### Related articles\n\n**Introduction to Linear Programming in Python** \n _A guide to mathematical optimization with Google OR-\nTools_towardsdatascience.com\n\n**Integer vs. Linear Programming in Python** \n _A guide to identify and solve any optimization\nproblem_towardsdatascience.com\n\n1\n\nShare this post\n\n#### The Art of Spending: Optimizing Your Marketing Budget with Nonlinear\nOptimization\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/the-art-of-spending-optimizing-your-marketing-budget-with-nonlinear-optimization-6c8a39afb3c2" }, { "id": "319b83ba-c6bd-44bf-9f73-91096f4a0c47", "content": { "Title": "Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds", "Subtitle": "Reinforcement Learning and Behavior Cloning in Python with MineRL", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds\n\n### Reinforcement Learning and Behavior Cloning in Python with MineRL\n\nMaxime Labonne\n\nMay 25, 2022\n\nShare this post\n\n#### Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Reinforcement Learning and Behavior Cloning in Python with MineRL\n\nImage by author (Mojang license)\n\nMinecraft is an incredible challenge for Reinforcement Learning.\n\nIt\u2019s a huge game, with many mechanics and complex sequences of actions. It\ntakes an entire wiki with **over 8000 pages** just to teach humans how to play\nMinecraft. So how good can be machine learning?\n\nThis is the question we\u2019ll answer in this article. We\u2019ll design a bot and try\nto achieve one of the most difficult challenges in Minecraft: finding\n**diamonds from scratch**. 
To make things even worse, we will take on this\nchallenge in randomly generated**** worlds so we can\u2019t learn a particular\nseed.\n\nSequence of actions to find diamonds, image by author (Mojang license)\n\nWhat we\u2019re gonna talk about is not limited to Minecraft. It can be applied to\nsimilar **complex environments**. More specifically, we will implement two\ndifferent techniques that will become the backbone of our intelligent agent.\n\nBut before we can train an agent, we need to understand **how to interact**\nwith the environment. Let\u2019s start with a scripted bot to get familiar with the\nsyntax. We\u2019ll use MineRL, a fantastic library to build AI applications in\nMinecraft.\n\nThe code used in this article is available on Google Colab. It is a simplified\nand finetuned version of the excellent notebooks made by the organizers of the\nMineRL 2021 competition (MIT License).\n\n### \ud83d\udcdc I. Scripted bot\n\nMineRL allows us to launch Minecraft in Python and interact with the game.\nThis is done through the popular `gym` library.\n\n \n \n env = gym.make('MineRLObtainDiamond-v0')\n env.seed(21)\n\nImage by author\n\nWe are in front of a tree. As you can see, the resolution is **quite low**. A\nlow resolution means fewer pixels, which speeds things up. Fortunately for us,\nneural networks don\u2019t need a 4K resolution to understand what\u2019s happening on\nscreen.\n\nNow, we would like to **interact** with the game. What can our agent do?\nHere\u2019s the list of possible actions:\n\nList of actions (image by author)\n\nThe first step to find diamonds is to **get wood** to make a crafting table\nand a wooden pickaxe.\n\nLet\u2019s try to get closer to the tree. It means that we need to hold the\n\u201cforward\u201d button for less than a second. With MineRL, there are **20 actions\nprocessed per second** : we don\u2019t need a full second so let\u2019s process it 5\ntimes, and wait for 40 more ticks.\n\nImage by author\n\n \n \n # Define the sequence of actions\n script = ['forward'] * 5 + [''] * 40\n \n env = gym.make('MineRLObtainDiamond-v0')\n env = Recorder(env, './video', fps=60)\n env.seed(21)\n obs = env.reset()\n \n for action in script:\n # Get the action space (dict of possible actions)\n action_space = env.action_space.noop()\n \n # Activate the selected action in the script\n action_space[action] = 1\n \n # Update the environment with the new action space\n obs, reward, done, _ = env.step(action_space)\n \n env.release()\n env.play()\n\nImage by author\n\nGreat, let\u2019s chop this tree now. We need four actions in total:\n\n * **Forward** to go in front of the tree;\n\n * **Attack** to chop the tree;\n\n * **Camera** to look up or down;\n\n * **Jump** to get the final piece of wood.\n\nImage by author\n\nHandling the camera can be a hassle. To simplify the syntax, we\u2019re gonna use\nthe `str_to_act` function from this GitHub repository (MIT license). This is\nwhat the new script looks like:\n\n \n \n script = []\n script += [''] * 20 \n script += ['forward'] * 5\n script += ['attack'] * 61\n script += ['camera:[-10,0]'] * 7 # Look up\n script += ['attack'] * 240\n script += ['jump']\n script += ['forward'] * 10 # Jump forward\n script += ['camera:[-10,0]'] * 2 # Look up\n script += ['attack'] * 150\n script += ['camera:[10,0]'] * 7 # Look down\n script += [''] * 40\n \n for action in tqdm(script):\n obs, reward, done, _ = env.step(str_to_act(env, action))\n \n env.release()\n env.play()\n\nThe agent efficiently chopped the **entire tree**. 
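If you are curious about what `str_to_act` does under the hood, here is a hedged, minimal sketch of the idea (the real helper from the linked repository handles more cases):

    import json
    
    def str_to_act(env, action_str):
        # Start from a no-op action dictionary, then enable the named buttons
        # ('forward', 'attack', 'jump', ...) and parse camera commands
        # written as 'camera:[pitch,yaw]'
        action_space = env.action_space.noop()
        for token in action_str.split():
            if token.startswith('camera:'):
                action_space['camera'] = json.loads(token.split(':', 1)[1])
            elif token:
                action_space[token] = 1
        return action_space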
This is a good start, but\nwe would like to do it in a more automated way\u2026\n\n### \ud83e\udde0 II. Deep Learning\n\nOur bot works well in a fixed environment, but what happens if we change the\nseed or its starting point?\n\nEverything is **scripted** so the agent would probably try to chop a non-\nexistent tree.\n\nThis approach is **too static** for our requirements: we need something that\ncan adapt to new environments. Instead of scripting orders, we want an AI that\nknows how to chop trees. Naturally, reinforcement learning is a pertinent\nframework to train this agent. More specifically, deep RL seems to be the\nsolution since we\u2019re processing images to select the best actions.\n\nThere are two ways of implementing it:\n\n * **Pure deep RL** : the agent is trained from scratch by interacting with the environment. It is rewarded every time it chops a tree.\n\n * **Imitation learning** : the agent learns how to chop trees from a dataset. In this case, it is a sequence of actions to chop trees made by a human.\n\nThe two approaches have the same outcome, but they\u2019re not equivalent.\nAccording to the authors of the MineRL 2021 competition, it takes **8 hours**\nfor the pure RL solution and **15 minutes** for the imitation learning agent\nto reach the same level of performance.\n\nWe don\u2019t have that much time to spend, so we\u2019re going for the Imitation\nLearning solution. This technique is also called **Behavior Cloning** , which\nis the simplest form of imitation.\n\nNote that Imitation Learning is not always more efficient than RL. If you want\nto know more about it, Kumar et al. wrote a great blog post about this topic.\n\nImage by author\n\nThe problem is reduced to a multi-class classification task. Our dataset\nconsists of mp4 videos, so we\u2019ll use a Convolutional Neural Network (CNN) to\ntranslate these images into relevant actions. Our goal is also to **limit the\nnumber of actions** (classes) that can be taken so the CNN has fewer options,\nwhich means it\u2019ll be trained more efficiently.\n\n \n \n class CNN(nn.Module):\n def __init__(self, input_shape, output_dim):\n super().__init__()\n n_input_channels = input_shape[0]\n self.cnn = nn.Sequential(\n nn.Conv2d(n_input_channels, 32, kernel_size=8, stride=4),\n nn.BatchNorm2d(32),\n nn.ReLU(),\n nn.Conv2d(32, 64, kernel_size=4, stride=2),\n nn.BatchNorm2d(64),\n nn.ReLU(),\n nn.Conv2d(64, 64, kernel_size=3, stride=1),\n nn.BatchNorm2d(64),\n nn.ReLU(),\n nn.Flatten(),\n nn.Linear(1024, 512),\n nn.ReLU(),\n nn.Linear(512, output_dim)\n )\n \n def forward(self, observations):\n return self.cnn(observations)\n \n def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):\n ...\n \n class ActionShaping(gym.ActionWrapper):\n ...\n\nIn this example, we manually define **7 relevant actions** : attack, forward,\njump, and move the camera (left, right, up, down). Another popular approach is\nto apply K-means in order to automatically retrieve the most relevant actions\ntaken by humans. In any case, the objective is to discard the least useful\nactions to complete our objective, such as crafting in our example.\n\nLet\u2019s train our CNN on the `MineRLTreechop-v0` dataset. Other datasets can be\nfound at this address. 
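The body of `dataset_action_batch_to_actions` is elided above, so here is a hedged sketch of the idea: each recorded human action is mapped to one of our 7 classes, or to -1 when none of them applies (the thresholds and class ordering are assumptions, not the notebook's exact code).

    import numpy as np
    
    def dataset_action_batch_to_actions(dataset_actions, camera_margin=5):
        # 0: attack, 1: forward, 2: jump, 3-6: camera up/down/left/right
        camera = dataset_actions['camera'].squeeze()
        attack = dataset_actions['attack'].squeeze()
        forward = dataset_actions['forward'].squeeze()
        jump = dataset_actions['jump'].squeeze()
    
        actions = np.full(len(camera), -1, dtype=int)
        for i in range(len(camera)):
            if camera[i][0] < -camera_margin:
                actions[i] = 3   # look up
            elif camera[i][0] > camera_margin:
                actions[i] = 4   # look down
            elif camera[i][1] < -camera_margin:
                actions[i] = 5   # turn left
            elif camera[i][1] > camera_margin:
                actions[i] = 6   # turn right
            elif attack[i] == 1:
                actions[i] = 0
            elif jump[i] == 1:
                actions[i] = 2
            elif forward[i] == 1:
                actions[i] = 1
        return actions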
We chose a learning rate of 0.0001 and 6 epochs with a\nbatch size of 32.\n\n \n \n # Get data\n minerl.data.download(directory='data', environment='MineRLTreechop-v0')\n data = minerl.data.make(\"MineRLTreechop-v0\", data_dir='data', num_workers=2)\n \n # Model\n model = CNN((3, 64, 64), 7).cuda()\n optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)\n criterion = nn.CrossEntropyLoss()\n \n # Training loop\n step = 0\n losses = []\n for state, action, _, _, _ \\\n in tqdm(data.batch_iter(num_epochs=6, batch_size=32, seq_len=1)):\n # Get pov observations\n obs = state['pov'].squeeze().astype(np.float32)\n # Transpose and normalize\n obs = obs.transpose(0, 3, 1, 2) / 255.0\n \n # Translate batch of actions for the ActionShaping wrapper\n actions = dataset_action_batch_to_actions(action)\n \n # Remove samples with no corresponding action\n mask = actions != -1\n obs = obs[mask]\n actions = actions[mask]\n \n # Update weights with backprop\n logits = model(torch.from_numpy(obs).float().cuda())\n loss = criterion(logits, torch.from_numpy(actions).long().cuda())\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n \n # Print loss\n step += 1\n losses.append(loss.item())\n if (step % 2000) == 0:\n mean_loss = sum(losses) / len(losses)\n tqdm.write(f'Step {step:>5} | Training loss = {mean_loss:.3f}')\n losses.clear()\n \n \n Step 4000 | Training loss = 0.878\n Step 8000 | Training loss = 0.826\n Step 12000 | Training loss = 0.805\n Step 16000 | Training loss = 0.773\n Step 20000 | Training loss = 0.789\n Step 24000 | Training loss = 0.816\n Step 28000 | Training loss = 0.769\n Step 32000 | Training loss = 0.777\n Step 36000 | Training loss = 0.738\n Step 40000 | Training loss = 0.751\n Step 44000 | Training loss = 0.764\n Step 48000 | Training loss = 0.732\n Step 52000 | Training loss = 0.748\n Step 56000 | Training loss = 0.765\n Step 60000 | Training loss = 0.735\n Step 64000 | Training loss = 0.716\n Step 68000 | Training loss = 0.710\n Step 72000 | Training loss = 0.693\n Step 76000 | Training loss = 0.695\n\nOur model is trained. We can now instantiate an environment and see how it\nbehaves. If the training was successful, it should frantically **cut all the\ntrees in sight**.\n\nThis time, we\u2019ll use the `ActionShaping` wrapper to map the array of numbers\ncreated with `dataset_action_batch_to_actions` to discrete actions in MineRL.\n\nOur model needs a **pov observation** in the correct format and outputs\nlogits. These logits can be turned into a probability distribution over a set\nof 7 actions with the `softmax` function. We then randomly choose an action\nbased on the probabilities. The selected action is implemented in MineRL\nthanks to `env.step(action)`.\n\nThis process is repeated as many times as we want. 
Let\u2019s do it 1000 times and\nwatch the result.\n\n \n \n model = CNN((3, 64, 64), 7).cuda()\n model.load_state_dict(torch.load('model.pth'))\n \n env = gym.make('MineRLObtainDiamond-v0')\n env1 = Recorder(env, './video', fps=60)\n env = ActionShaping(env1)\n \n action_list = np.arange(env.action_space.n)\n \n obs = env.reset()\n \n for step in tqdm(range(1000)):\n # Get input in the correct format\n obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()\n # Turn logits into probabilities\n probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()\n # Sample action according to the probabilities\n action = np.random.choice(action_list, p=probabilities)\n \n obs, reward, _, _ = env.step(action)\n \n env1.release()\n env1.play()\n\nOur agent is quite chaotic but it manages to chop trees in this **new, unseen\nenvironment**. Now, how to find diamonds?\n\n### \u26cf\ufe0f III. Script + Imitation Learning\n\nA simple yet powerful approach consists of **combining** scripted actions with\nartificial intelligence. Learn the boring stuff, script the knowledge.\n\nIn this paradigm, we\u2019ll use the CNN to get a healthy amount of wood (3000\nsteps). Then, we can **script a sequence** to craft planks, sticks, a crafting\ntable, a wooden pickaxe, and start mining stone (it should be below our feet).\nThis stone can then be used to craft a stone pickaxe, which can mine iron ore.\n\nCNN + script approach, image by author (Mojang license)\n\nThis is when things get complicated: iron ore is **quite rare** , so we would\nneed to run the game for a while to find a deposit. Then, we would have to\ncraft a furnace and melt it to get the iron pickaxe. Finally, we would have to\ngo even deeper and be **even luckier** to obtain a diamond without falling\ninto lava.\n\nAs you can see, it\u2019s doable but the outcome is fairly random. We could train\nanother agent to find diamonds, and even a third one to create the iron\npickaxe. If you\u2019re interested in more complex approaches, you can read the\nresults of the MineRL Diamond 2021 Competition by Kanervisto et al. It\ndescribes several solutions using different clever techniques, including end-\nto-end deep learning architectures. Nonetheless, it is a complex problem and\nno team managed to consistently find diamonds, if at all.\n\nThis is why we will limit ourselves to obtaining a stone pickaxe in the\nfollowing example, but you can modify the code to go further.\n\n \n \n obs = env_script.reset()\n done = False\n \n # 1. Get wood with the CNN\n for i in tqdm(range(3000)):\n obs = torch.from_numpy(obs['pov'].transpose(2, 0, 1)[None].astype(np.float32) / 255).cuda()\n probabilities = torch.softmax(model(obs), dim=1)[0].detach().cpu().numpy()\n action = np.random.choice(action_list, p=probabilities)\n obs, reward, done, _ = env_script.step(action)\n if done:\n break\n \n # 2. Craft stone pickaxe with scripted actions\n if not done:\n for action in tqdm(script):\n obs, reward, done, _ = env_cnn.step(str_to_act(env_cnn, action))\n if done:\n break\n \n print(obs[\"inventory\"])\n env_cnn.release()\n env_cnn.play()\n\nWe can see our agent chopping wood like a madman during the first 3000 steps,\nthen our script takes over and completes the task. It might not be obvious,\nbut the command `print(obs.inventory)` shows a stone pickaxe. 
Note that this\nis a **cherry-picked** example: most of the runs don\u2019t end that well.\n\nThere are **several reasons** why the agent may fail: it can spawn in a\nhostile environment (water, lava, etc.), in an area without wood, or even fall\nand die. Playing with different seeds will give you a good understanding of\nthe complexity of this problem and, hopefully, ideas to build event better\nagents.\n\n### Conclusion\n\nI hope you enjoyed this little guide to reinforcement learning in Minecraft.\nBeyond its obvious popularity, Minecraft is an interesting environment to try\nand test RL agents. Like NetHack, it requires a **thorough knowledge** of its\nmechanics to plan precise sequences of actions in a procedurally-generated\nworld. In this article,\n\n * We learned how to use **MineRL** ;\n\n * We saw **two approaches** (script and behavior cloning) and how to combine them;\n\n * We **visualized** the agent\u2019s actions with short videos.\n\nThe main drawback of the environment is its **slow processing time**.\nMinecraft is not a lightweight game like NetHack or Pong, which is why the\nagents take a long time to be trained. If this is a problem for you, I would\nrecommend lighter environments like Gym Retro.\n\nThank you for your attention! Feel free to follow me on Twitter if you\u2019re\ninterested in AI applied to video games.\n\nShare this post\n\n#### Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/create-a-bot-to-find-diamonds-in-minecraft-d836606a993a" }, { "id": "fef26b86-df5b-4379-8e7d-03bb90767e4e", "content": { "Title": "Constraint Programming in Python - Maxime Labonne", "Subtitle": "The Programming Paradigm to Find One Solution Among 8,080,104 Candidates", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Constraint Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Constraint Programming in Python\n\n### The Programming Paradigm to Find One Solution Among 8,080,104 Candidates\n\nMaxime Labonne\n\nMay 02, 2022\n\nShare this post\n\n#### Constraint Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### The Programming Paradigm to Find One Solution Among 8,080,104 Candidates\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nConstraint Programming is a technique to **find every solution** that respects\na set of predefined constraints.\n\nIt is an invaluable tool for data scientists to solve a huge variety of\nproblems, such as scheduling, timetabling, sequencing, etc. In this article,\nwe\u2019ll see how to use CP in two different ways:\n\n 1. 
**Satisfiability** : the goal is to find one or multiple feasible solutions (_i.e._ , solutions that respect our constraints) by narrowing down a large set of potential solutions;\n\n 2. **Optimization** : the goal is to find the best feasible solution according to an objective function, just like Linear Programming (LP).\n\nWe\u2019ll use CP-SAT from Google OR-Tools, an excellent free and open source CP\nsolver. Note that it is **different** from MPSolver, which is dedicated to\nLinear and Mixed Integer Programming. The difference between CP and LP is\nquite confusing, we\u2019ll touch on this topic at the end of the article.\n\nYou can run the code with the following Google Colab notebook.\n\n### **\ud83e\ude96 I.** Satisfiability with the 3 scouts problem\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nIn the previous article, we created an army to defeat our opponent. But there\nwas one small problem: we had to guess how powerful his army was.\n\nThis time, let\u2019s send scouts to know the **exact number**. Our 3 scouts\nobserved the enemy camp, and this is what they tell us:\n\n * **Scout 1** : \u201c _the number of soldiers is a multiple of 13_ \u201d;\n\n * **Scout 2** : \u201c _the number of soldiers is a multiple of 19_ \u201d;\n\n * **Scout 3** : \u201c _the number of soldiers is a multiple of 37_ \u201d;\n\n * They all agree that the number of soldiers **doesn\u2019t exceed 10,000**.\n\nOur scouts have a personal way of counting soldiers, but we can **combine**\nthese three observations to make a model.\n\nLet\u2019s call the number of soldiers _army_. We can translate our problem into\nthe following congruence system:\n\nIf you\u2019re not familiar with this notation, this is what it means in\n**programming terms** :\n\nLet\u2019s implement it with OR-Tools. The first thing we need to do is to import\nand create the **CP-SAT model and solver**.\n\nThe **modeling process** is very similar to what we did in Linear Programming.\n\nThe first step to create our CP model is to declare the **variables**. In this\nexample, we only have one: _army_ , the number of soldiers.\n\nWe have to give lower and upper bounds. The **lower bound** is 1 since we know\nthere\u2019s an army, and the **upper bound** is 10,000 according to the scouts:\n\nIn OR-Tools, we use the `NewIntVar` method to create this variable.\n\nThe second step is to declare the **constraints**.\n\nWe identified three constraints in this example. Modulo is a special operator,\nso we need a specific function to handle it with CP-SAT: `AddModuloEquality`.\nYou can find a reference guide at this address if you need other methods.\n\nUnlike Linear Programming, we **don\u2019t have to define an objective function**\nhere.\n\nThe reason is simple: there is nothing to optimize! We just want to find a\n**feasible solution** that satisfies our constraints, but there is no \u201cgood\u201d\nor \u201cbad\u201d answers. This is a **key feature** of Constraint Programming.\n\nOur model is **complete** , we can now ask OR-Tools to solve it.\n\n \n \n ================= Solution =================\n Solved in 0.00 milliseconds\n \n \n \ud83e\ude96 Army = 9139\n \n \n Check solution:\n - Constraint 1: 9139 % 13 = 0\n - Constraint 2: 9139 % 19 = 0\n - Constraint 3: 9139 % 37 = 0\n\nWe obtained our solution in less than a millisecond: there are **9,139\nsoldiers** in the enemy army. Huzzah, we can now fire the scouts!\n\nWe limited the search space with an upper bound of 10,000, which gave us a\n**unique solution**. 
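Putting the pieces together, a minimal CP-SAT version of this model could look like the following sketch (the variable name and bounds follow the text above; the rest is an assumption):

    from ortools.sat.python import cp_model
    
    # Model and solver
    model = cp_model.CpModel()
    solver = cp_model.CpSolver()
    
    # Variable: the number of soldiers, between 1 and 10,000
    army = model.NewIntVar(1, 10000, 'army')
    
    # Constraints: the army is a multiple of 13, 19, and 37
    # AddModuloEquality(target, var, mod) enforces target == var % mod
    model.AddModuloEquality(0, army, 13)
    model.AddModuloEquality(0, army, 19)
    model.AddModuloEquality(0, army, 37)
    
    # No objective: we only need a feasible solution
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(f'Army = {solver.Value(army)}')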
But is it still the case if we push this limit?\n\nAnother perk of CP is the ability to **find every possible solution** to a\nproblem. This might take a long time when the search space is large because\nthe solver has to brute force the entire space (instead of reducing it with\nheuristics). Let\u2019s explore this feature by printing every possible solution\nwith a new upper bound of **100,000**.\n\nWith OR-Tools, we ask the solver to look for every possible solution thanks to\nthe `enumerate_all_solutions` parameter. We then assign it a **callback**\nclass that prints every solution the solver finds.\n\nWe found **10 solutions**! This was to be expected since we increased the\nupper bound tenfold: these solutions all are **multiples** of 9,139.\n\nAs you can see, this example has nothing to do with optimization: it\u2019s a pure\n**satisfiability problem**. On another note, this congruence system can be\nsolved manually with the Chinese remainder theorem. But CP is not limited to\nthat\u2026\n\n### **\ud83c\udf7b II. Optimization and beer**\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nLet\u2019s see another problem: our army will face the enemy in a few days. In the\nmeantime, the quartermaster has to **prepare the rations** that will be used\nduring the campaign.\n\nThe space in the supply wagons is **limited** and some rations are more\n**popular** than others. There are three possible rations:\n\n * \ud83e\udd56 **Bread** : it takes only 1 space but soldiers don\u2019t like it that much with a popularity of 3;\n\n * \ud83e\udd69 **Meat** : it takes 3 spaces and has a popularity of 10;\n\n * \ud83c\udf7a **Beer** : it takes 7 spaces but soldiers love it with a popularity of 26.\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nThe supply wagons have a capacity of **19 spaces**. How to select the best\nrations to **maximize** the popularity?\n\nThis is an **optimization** problem we\u2019ve already seen: actually, it is a\nvariant of the famous knapsack problem. We could reuse the code from the\nprevious article and just change the input parameters.\n\nThis time, we\u2019ll solve it using Constraint Programming. This paradigm is not\nlimited to finding feasible solutions. It can also perform optimization using\ndifferent algorithms to handle this overhead.\n\nLet\u2019s create a model of the problem. First of all, we have to declare three\nvariables: \ud83e\udd56**bread** , \ud83e\udd69**meat** , and \ud83c\udf7a**beer**. It\u2019s possible to have 0 of\nthem, but their number cannot exceed the maximal capacity.\n\nThis time, we only have one constraint: the space occupied by the bread, the\nmeat, and the beer **cannot exceed the wagons\u2019 capacity** (19).\n\nWe want to **maximize the total popularity** of the rations that are selected:\n\nThe model is complete, CP-SAT can **solve the problem**!\n\n \n \n ================= Solution =================\n Solved in 0.00 milliseconds\n \n \n Optimal value = 68 popularity\n Food:\n - \ud83e\udd56Bread = 2\n - \ud83e\udd69Meat = 1\n - \ud83c\udf7aBeer = 2\n\nWe obtained the **highest popularity** (68) possible with a capacity of 19.\n\nIs the constraint respected? Let\u2019s quickly check it: 1\u00d72 \ud83e\udd56 + 3\u00d71 \ud83e\udd69 + 7\u00d72 \ud83c\udf7a =\n19, which is indeed \u2264 19.\n\nOkay, I\u2019d like to ask another question: **how many solutions** to this problem\nare there? 
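One way to do it is with a solution callback. Here is a hedged sketch that rebuilds the rations model from the values above, without the objective, since enumeration requires a pure satisfiability model (the names are assumptions):

    from ortools.sat.python import cp_model
    
    # Rations model without an objective (values from the article)
    model = cp_model.CpModel()
    capacity = 19
    bread = model.NewIntVar(0, capacity, 'bread')
    meat = model.NewIntVar(0, capacity, 'meat')
    beer = model.NewIntVar(0, capacity, 'beer')
    model.Add(1 * bread + 3 * meat + 7 * beer <= capacity)
    
    class CountSolutions(cp_model.CpSolverSolutionCallback):
        """Count every feasible solution found by the solver."""
        def __init__(self):
            super().__init__()
            self.count = 0
    
        def on_solution_callback(self):
            self.count += 1
    
    solver = cp_model.CpSolver()
    solver.parameters.enumerate_all_solutions = True
    callback = CountSolutions()
    solver.Solve(model, callback)
    print(callback.count)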
Once again, we can answer it with a specific callback to count\nthem.\n\n \n \n 121\n\nWe found **121 solutions** with a capacity of 19. But this number quickly\nincreases: with a capacity of 1000, there are **8,080,104** possible\nsolutions! And yet, CP-SAT finds the optimal solution in less than a second.\nHow is it possible?\n\nCP solvers do not brute force the problem with an exhaustive search but\n**combine** heuristics and combinatorial search instead. More specifically,\nthe three most popular techniques for constraint satisfaction problems are\n**backtracking** , **constraint propagation** , and **local search**.\n\nCP-SAT is quite particular since it combines CP and **SAT** : it is part of a\nbroader trend of merging CP, LP, SAT, and metaheuristics.\n\nWe said that the previous problem could be solved with Linear Programming, so\nlet\u2019s compare the code of both solutions:\n\nLeft: LP code, Right: CP code (image by author)\n\nAs you can see, the syntax is quite similar but it\u2019s not the same:\nmodel/solver vs. solver, `NewIntVar` instead of `IntVar`, etc. There's a bit\nof translation to do, but it's easily manageable.\n\nThese two techniques are **incredibly close to each other** : they both handle\nvariables with constraints and perform optimization using math and heuristics.\nHowever, CP is limited to discrete parameters, while LP handles continuous\nones. On the other hand, you can implement specialized constraints like \u201call\ndifferent\u201d in CP, but not in LP. Here is a summary of the main differences\nbetween these two technologies:\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nIf you want to know more about this topic, I would recommend this article by\nIrvin J. Lustig and Jean-Fran\u00e7ois Puget. CPLEX\u2019s documentation also details\nthe differences at this address, in terms of modeling and optimization.\n\n### Conclusion\n\nImage by author\n\nConstraint Programming is another incredible technique in the **mathematical\noptimization** toolbox. It is a radically different approach compared to\ntraditional, declarative programming. In this article,\n\n * We saw **two applications** of CP with satisfiability and optimization;\n\n * We implemented **CP models** in OR-Tools and played with the callback function;\n\n * We highlighted the **differences** between CP and LP.\n\nWe limited ourselves to simple problems in this introduction, but CP has\namazing applications in complex scheduling and routing problems. This is a\ntopic I\u2019d love to address in a future article.\n\nIf you\u2019re interested to know more about it, feel free to follow me on\n**Twitter** at @maximelabonne. Thanks for your attention!\n\n### Related articles\n\n**Introduction to Linear Programming in Python** \n _A guide to mathematical optimization with Google OR-\nTools_towardsdatascience.com\n\n**Integer vs. Linear Programming in Python** \n _A guide to identify and solve any optimization\nproblem_towardsdatascience.com\n\nShare this post\n\n#### Constraint Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/constraint-programming-67ac16fa0c81" }, { "id": "9de9825b-36e8-4512-b1c8-4c1d60fbcb6c", "content": { "Title": "GIN: How to Design the Most Powerful Graph Neural Network", "Subtitle": "Graph classification with Graph Isomorphism Networks", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### GIN: How to Design the Most Powerful Graph Neural Network\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# GIN: How to Design the Most Powerful Graph Neural Network\n\n### Graph classification with Graph Isomorphism Networks\n\nMaxime Labonne\n\nApr 27, 2022\n\nShare this post\n\n#### GIN: How to Design the Most Powerful Graph Neural Network\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Graph classification with Graph Isomorphism Networks\n\nImage by author\n\nGraph Neural Networks are not limited to classifying nodes.\n\nOne of the most popular applications is **graph classification**. This is a\ncommon task when dealing with molecules: they are represented as graphs and\nfeatures about each atom (node) can be used to predict the behavior of the\nentire molecule.\n\nHowever, GNNs only learn node embeddings. How to combine them in order to\nproduce an entire **graph embedding**? In this article, we will:\n\n * See a new type of layer, called \u201c**global pooling** \u201d, to combine node embeddings;\n\n * Introduce a new architecture called **Graph Isomorphism Network** (GIN), designed by Xu et al. in 2018.\n\nWe\u2019ll detail the advantages of GIN in terms of **discriminative power**\ncompared to a GCN or GraphSAGE, and its connection to the Weisfeiler-Lehman\ntest. Beyond its powerful aggregator, GIN brings exciting takeaways about GNNs\nin general.\n\nYou can run the code with the following Google Colab notebook.\n\n### \ud83c\udf10 I. PROTEINS dataset\n\n3D plot of a protein (image by author)\n\nPROTEINS\u00b9 is a popular dataset in bioinformatics. It is a collection of **1113\ngraphs** representing proteins, where nodes are amino acids. Two nodes are\nconnected by an edge when they are close enough (< 0.6 nanometers). The goal\nis to classify each protein as an **enzyme** or **not**.\n\nEnzymes are a particular type of **proteins** that act as catalysts to speed\nup chemical reactions in the cell. They are essential for digestion (e.g.,\nlipases), respiration (e.g., oxidases), and other crucial functions of the\nhuman body. They are also used in commercial applications, like the production\nof antibiotics.\n\nThis dataset is also available on TUDataset\u00b9 and implemented in PyTorch\nGeometric.\n\n \n \n Dataset: PROTEINS(1113)\n ----------------------\n Number of graphs: 1113\n Number of nodes: 23\n Number of features: 3\n Number of classes: 2\n\nI\u2019m not a biochemist so I\u2019m curious about these proteins. Let\u2019s plot one as a\ngraph to see what it looks like:\n\n3D plot of a protein with matplotlib (image by author)\n\nThe previous 3D structure is **randomly generated** : obtaining the correct 3D\nrepresentation is a problem so difficult it\u2019s the whole point of AlphaFold.\n\nGraphs are not the only way to represent molecules. 
The simplified molecular-input line-entry system (**SMILES**) is another popular method, which uses a line (string) notation. It is obtained by printing the nodes encountered in a depth-first tree traversal of a slightly modified molecular graph.

Researchers often use this representation when working with molecules or chemical compounds. Fortunately for us, the PROTEINS dataset is **already encoded** in the form of graphs. Otherwise, we would have to translate the SMILES strings into `networkx` graphs.

It doesn't mean we'll directly feed the PROTEINS dataset to our GNN. If GraphSAGE taught us anything, it's that **mini-batching is incredibly efficient**. It is now an indispensable tool whenever we implement a GNN.

    Training set = 890 graphs (14 subgraphs)
    Validation set = 111 graphs (2 subgraphs)
    Test set = 112 graphs (2 subgraphs)

PROTEINS is not a huge dataset, but mini-batching will **speed up** the training nonetheless. We could use a GCN or a GAT, but there's a new architecture I'd like to introduce: the **Graph Isomorphism Network**.

### 🍾 II. Graph Isomorphism Network (GIN)

GIN was designed by researchers trying to maximize the **representational (or discriminative) power** of a GNN. But how do you define "representational power"?

### A. Weisfeiler-Lehman test

A way to characterize the "power" of a GNN is to use the Weisfeiler-Lehman (WL) graph isomorphism test. Isomorphic graphs have the **same structure**: identical connections, but with a permutation of nodes. The WL test can tell if two graphs are non-isomorphic, but it cannot guarantee that they are isomorphic.

Two isomorphic graphs (image by author)

This might not seem like much, but it can be **extremely difficult** to tell two large graphs apart. In fact, this problem is not known to be solvable in polynomial time, nor to be NP-complete. It might even be somewhere in between, in the computational complexity class NP-intermediate (if it even exists).

Okay, but how is it related to GNNs? Some researchers in graph learning noticed that **this test and the way GNNs learn are oddly similar**. In the WL test,

 1. Every node starts with the **same label**;

 2. Labels from neighboring nodes are aggregated and **hashed** to produce a new label;

 3. The previous step is repeated until the labels **stop changing**.

If you're interested in the WL test, I would recommend this blog post by David Bieber and this article by Michael Bronstein.

Not only is this test similar to how feature vectors are aggregated in GNNs, but its ability to tell graphs apart also makes it **more powerful** than a lot of architectures, including GCNs and GraphSAGE. This is what inspired Xu et al.² to design a new aggregator that they proved to be as good as the WL test.

### B. One aggregator to rule them all

To be as good as the WL test, this new aggregator must produce **different node embeddings** when dealing with non-isomorphic graphs.

We'll skip the math-heavy part of the paper, but the solution they found is to use two injective functions. Which ones?
We don\u2019t know, we can just learn them\nwith a MLP!\n\n * With GATs, we used a neural network to learn the **best weighting factors** for a given task;\n\n * With GINs, we now learn the **approximation of two injective functions** thanks to the Universal Approximation Theorem.\n\nHere\u2019s how to calculate the hidden vector of a particular node _i_ with GIN:\n\nIn this formula, \u025b determines the **importance of the target node** compared\nto its neighbors (it has the same importance if \u025b = 0). It can be a learnable\nparameter or a fixed scalar.\n\nNote that we talk about MLPs to highlight the fact that there is more than one\nlayer. According to the authors, one layer is **not sufficient** for graph\nlearning in general.\n\n### C. Global pooling\n\nGlobal pooling or graph-level readout consists of producing a **graph\nembedding** using the node embeddings calculated by the GNN.\n\nA simple way to obtain a graph embedding is to use the **mean** , **sum**\n,**** or**max** of every node embedding _h\u1d62_ :\n\nThe authors make two important points about graph-level readout:\n\n * To consider all structural information, it is necessary to **keep embeddings from previous layers** ;\n\n * The sum operator is surprisingly **more expressive** than the mean and the max.\n\nThese observations lead them to propose the following global pooling method:\n\nFor each layer, node embeddings are **summed** and the result is\n**concatenated**. This solution combines the expressiveness of the sum\noperator with the memory of previous iterations from the concatenation.\n\n### \ud83e\udde0 III. GIN in PyTorch Geometric\n\nIt is always interesting to see the differences between the original design\nand its implementations.\n\nThere is a `GINConv` layer in PyTorch Geometric with different parameters:\n\n * `nn`: the **MLP** that is used to approximate our two injective functions;\n\n * `eps`: the initial value of \u025b, which is **0 by default** ;\n\n * `train_eps`: a True/False statement to determine if \u025b is trainable, which is **False by default**.\n\nYou can see that \u025b is entirely removed by default in this implementation: it\u2019s\na hyperparameter we can tune, but probably not an essential one.\n\nThere is a **second GIN layer** in PyTorch Geometric, called `GINEConv`. It\ncomes from this paper's implementation of GIN, which applies a _ReLU_ function\nto the neighbors' features. We won't use it in this tutorial, since the\nbenefits are not clear.\n\nWe still need to design a MLP for the `GINConv` layer. Here's the design we'll\nimplement, inspired by the original paper:\n\nMLP used in the GIN layer (image by author)\n\nThe paper stacks**5 layers** but we\u2019ll be more humble with **3 layers**\ninstead. Here is what the entire architecture looks like:\n\nOur GIN architecture (image by author)\n\nI could not find any implementation of GIN with graph embedding\n**concatenation** , so here is my version (it improves the accuracy by 1% on\naverage). Let\u2019s compare it to a GCN with a simple mean pooling (and no\nconcatenation).\n\n \n \n GCN test accuracy = 59.38%\n GIN test accuracy = 73.70%\n\nThis time, there\u2019s no competition!\n\nThe GIN architecture completely**** outperforms the GCN. 
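For reference, here is a hedged sketch of what a GIN with concatenated graph-level readout can look like in PyTorch Geometric (the 3-layer depth follows the figure above; hidden sizes and the classifier head are assumptions, not necessarily the exact notebook code):

    import torch
    import torch.nn.functional as F
    from torch.nn import Linear, Sequential, BatchNorm1d, ReLU
    from torch_geometric.nn import GINConv, global_add_pool
    
    class GIN(torch.nn.Module):
        """3-layer GIN with a concatenated sum readout."""
        def __init__(self, dim_in, dim_h, dim_out):
            super().__init__()
    
            def mlp(d_in):
                return Sequential(Linear(d_in, dim_h), BatchNorm1d(dim_h), ReLU(),
                                  Linear(dim_h, dim_h), ReLU())
    
            self.conv1 = GINConv(mlp(dim_in))
            self.conv2 = GINConv(mlp(dim_h))
            self.conv3 = GINConv(mlp(dim_h))
            self.lin1 = Linear(dim_h * 3, dim_h * 3)
            self.lin2 = Linear(dim_h * 3, dim_out)
    
        def forward(self, x, edge_index, batch):
            # Node embeddings from each GIN layer
            h1 = self.conv1(x, edge_index)
            h2 = self.conv2(h1, edge_index)
            h3 = self.conv3(h2, edge_index)
    
            # Graph-level readout: sum the nodes of each layer, then concatenate
            h = torch.cat([global_add_pool(h1, batch),
                           global_add_pool(h2, batch),
                           global_add_pool(h3, batch)], dim=1)
    
            # Classifier
            h = F.relu(self.lin1(h))
            h = F.dropout(h, p=0.5, training=self.training)
            return F.log_softmax(self.lin2(h), dim=1)
    
    # PROTEINS: 3 node features, 2 classes
    model = GIN(dim_in=3, dim_h=32, dim_out=2)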
This gap (10%\naccuracy on average) is due to several reasons:\n\n * GIN\u2019s aggregator is specifically designed to **discriminate graphs** that the GCN\u2019s aggregator cannot;\n\n * Graph hidden vectors from every layer are **concatenated instead** of only considering the last one;\n\n * The sum operator is **superior** to the mean operator (at least in theory).\n\nLet\u2019s visualize the proteins we classified with the GCN and the GIN.\n\nImage by author\n\nInterestingly enough, the two models make **different mistakes**. This is a\ncommon result in machine learning when different algorithms are applied to the\nsame problem.\n\nWe can take advantage of this behavior by creating an**ensemble**. There are\nmany ways of combining our graph embeddings. The simplest method is to take\nthe mean of the normalized output vectors.\n\n \n \n GCN test accuracy = 59.38%\n GIN test accuracy = 73.70%\n GCN+GIN test accuracy = 75.00%\n\nThis time, we\u2019re lucky enough to see the **accuracy improved**.\n\nObviously, it\u2019s not always the case. More sophisticated methods involve\nbuilding an entirely different ML algorithm for classification, such as a\nRandom Forest. This classifier takes graph embeddings as inputs and outputs\nthe final classification.\n\n### Conclusion\n\nGraph Isomorphism Networks are an important step in the understanding of GNNs.\n\nThey not only improve the accuracy scores on several benchmarks but also\nprovide a **theoretical framework** to explain why one architecture is better\nthan another. In this article,\n\n * We saw a new task with **graph classification** , performed with global pooling;\n\n * We introduced the **WL test** and its connection with the new GIN layer;\n\n * We implemented a GIN and a GCN and made a simple**ensemble** with their classifications.\n\nAlthough GINs achieve good performance, especially with social graphs, their\ntheoretical superiority doesn\u2019t always translate well in the real world. It is\ntrue with other \u201cprovably powerful\u201d architectures, which tend to\n**underperform in practice** , such as the 3WLGNN.\n\nIf you enjoyed this article, feel free to follow me on Twitter for more graph\ncontent! \ud83d\udce3\n\n### References\n\n[1] Christopher Morris and Nils M. Kriege and Franka Bause and Kristian\nKersting and Petra Mutzel and Marion Neumann. TUDataset: A collection of\nbenchmark datasets for learning with graphs. In _ICML 2020 Workshop on Graph\nRepresentation Learning and Beyond_.\n\n[2] Xu, Keyulu and Hu, Weihua and Leskovec, Jure and Jegelka, Stefanie. How\nPowerful are Graph Neural Networks?__ In _ICLR 2019_.\n\n### Related articles\n\n**Introduction to GraphSAGE in Python** \n _Scaling Graph Neural Networks to billions of\nconnections_towardsdatascience.com\n\n**Graph Attention Networks: Self-Attention Explained** \n _A guide to GNNs with self-attention using PyTorch\nGeometric_towardsdatascience.com\n\nShare this post\n\n#### GIN: How to Design the Most Powerful Graph Neural Network\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/how-to-design-the-most-powerful-graph-neural-network-3d18b07a6e66" }, { "id": "4ddd85f7-4d82-4be0-96c1-16056bd9ec18", "content": { "Title": "GraphSAGE: Scaling up Graph Neural Networks", "Subtitle": "Introduction to GraphSAGE with PyTorch Geometric", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### GraphSAGE: Scaling up Graph Neural Networks\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# GraphSAGE: Scaling up Graph Neural Networks\n\n### Introduction to GraphSAGE with PyTorch Geometric\n\nMaxime Labonne\n\nApr 20, 2022\n\nShare this post\n\n#### GraphSAGE: Scaling up Graph Neural Networks\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Introduction to GraphSAGE with PyTorch Geometric\n\nImage by author, emoji by OpenMoji (CC BY-SA 4.0)\n\nWhat do **UberEats** and **Pinterest** have in common?\n\nThey both use GraphSAGE**** to power their recommender system on a massive\nscale: **millions and billions** of nodes and edges.\n\n * \ud83d\uddbc\ufe0f **Pinterest** developed its own version called PinSAGE to recommend the most relevant images (pins) to its users. \n\u2192 Their graph has 18 billion connections and 3 billion nodes.\n\n * \ud83c\udf7d\ufe0f **UberEats** also reported using a modified version of GraphSAGE to suggest dishes, restaurants, and cuisines**.** \n\u2192 UberEats claims to support more than 600,000 restaurants and 66 million\nusers.\n\nIn this tutorial, we\u2019ll use a dataset with 20k nodes instead of billions\nbecause Google Colab cannot handle our ambitions. We will stick to the\n**original GraphSAGE** architecture, but the previous variants also bring\nexciting features we will discuss.\n\nYou can run the code with the following Google Colab notebook.\n\n### \ud83c\udf10 I. PubMed dataset\n\nt-SNE plot of PubMed (image by author)\n\nIn this article, we will use the **PubMed** dataset. As we saw in the previous\narticle, PubMed is part of the Planetoid dataset (MIT license). Here\u2019s a quick\nsummary:\n\n * It contains **19,717 scientific publications** about diabetes from PubMed\u2019s database;\n\n * Node features are **TF-IDF weighted word vectors** with 500 dimensions, which is an efficient way of summarizing documents without transformers;\n\n * The task is a multi-class classification with**three categories** : diabetes mellitus experimental, diabetes mellitus type 1, and diabetes mellitus type 2.\n\nThis is the beauty and the curse of deep learning: I don\u2019t know anything about\ndiabetes, but I\u2019ll still feel pretty satisfied if we reach 70% accuracy. At\nleast we\u2019re not building the next IBM Watson.\n\n \n \n Dataset: Pubmed()\n ------------------- \n Number of graphs: 1\n Number of nodes: 19717\n Number of features: 500\n Number of classes: 3\n \n \n Graph:\n ------\n Training nodes: 60\n Evaluation nodes: 500\n Test nodes: 1000\n Edges are directed: False\n Graph has isolated nodes: False\n Graph has loops: False\n\nAs we can see, PubMed has an insanely**low number of training nodes** compared\nto the whole graph. There are only 60 samples to learn how to classify the\n1000 test nodes.\n\nDespite this challenge, GNNs manage to obtain high levels of accuracy. 
Here\u2019s\nthe leaderboard of known techniques (a more exhaustive benchmark can be found\non PapersWithCode):\n\nI couldn\u2019t find any result for GraphSAGE on PubMed with this specific setting\n(60 training nodes, 1000 test nodes), so I don\u2019t expect a great accuracy. But\nanother metric can be just as relevant when working with large graphs:\n**training time**.\n\n### \ud83e\uddd9\u200d\u2642\ufe0f II. GraphSAGE in theory\n\nImage by author\n\nThe GraphSAGE algorithm can be divided into two steps:\n\n 1. **Neighbor sampling;**\n\n 2. **Aggregation**.\n\n### \ud83c\udfb0 A. Neighbor sampling\n\nMini-batching is a common technique used in machine learning.\n\nIt works by **breaking down a dataset** **into smaller batches** , which\nallows us to train models more effectively. Mini-batching has several\nbenefits**:**\n\n 1. **Improved accuracy** \u2014 mini-batches help to reduce overfitting (gradients are averaged), as well as variance in error rates;\n\n 2. **Increased speed** \u2014 mini-batches are processed in parallel and take less time to train than larger batches;\n\n 3. **Improved scalability** \u2014 an entire dataset can exceed the GPU memory, but smaller batches can get around this limitation.\n\nMini-batching is so useful it became standard in regular neural networks.\nHowever, it is not as straightforward with graph data, since splitting the\ndataset into smaller chunks would **break essential connections** between\nnodes.\n\nSo, what can we do? In recent years, researchers developed different\nstrategies to create graph mini-batches. The one we\u2019re interested in is called\n**neighbor sampling**. There are many other techniques you can find on PyG\u2019s\ndocumentation, such as subgraph clustering.\n\nNeighbor sampling (image by author)\n\nNeighbor sampling considers only a **fixed number** of random neighbors.\nHere\u2019s the process:\n\n 1. We define the **number of neighbors** (1 hop), the number of neighbors of neighbors (2 hops), etc. we would like to have.\n\n 2. The sampler looks at the list of neighbors, of neighbors of neighbors, etc. of a target node and **randomly selects** a predefined number of them;\n\n 3. The sampler **outputs a subgraph** containing the target node and the randomly selected neighboring nodes.\n\nThis process is **repeated for every node** in a list or the entirety of the\ngraph. However, creating a subgraph for each node is not efficient, that is\nwhy we can process them in batches instead. In this case, each subgraph is\nshared by multiple target nodes.\n\nNeighbor sampling has an added benefit. Sometimes, we observe extremely\npopular nodes that act like hubs, such as celebrities on social media.\nObtaining the hidden vectors of these nodes can be **computationally very\nexpensive** since it requires calculating the hidden vectors of thousands or\neven millions of neighbors. 
GraphSAGE fixes this issue by simply ignoring most\nof the nodes!\n\nIn PyG, neighbor sampling is implemented through the `NeighborLoader` object.\nLet's say we want **5 neighbors and 10 of their neighbors** (`num_neighbors`).\nAs we discussed, we can also specify a `batch_size` to speed up the process by\ncreating subgraphs for multiple target nodes.\n\n \n \n Subgraph 0: Data(x=[389, 500], edge_index=[2, 448], batch_size=16)\n Subgraph 1: Data(x=[264, 500], edge_index=[2, 314], batch_size=16)\n Subgraph 2: Data(x=[283, 500], edge_index=[2, 330], batch_size=16)\n Subgraph 3: Data(x=[189, 500], edge_index=[2, 229], batch_size=12)\n\nWe created **4 subgraphs** of various sizes. It allows us to process them in\nparallel and they're easier to fit on a GPU since they're smaller.\n\nThe number of neighbors is an important parameter since pruning our graph\nremoves a lot of information. How much, exactly? Well, quite a lot. We can\nvisualize this effect by looking at the **node degrees** (number of\nneighbors).\n\nNode degrees in the original graph\n\nNode degrees after neighbor sampling\n\nIn this example, the **maximum node degree** of our subgraphs is 5, which is\nmuch lower than the original max value. It\u2019s important to remember this\ntradeoff when talking about GraphSAGE.\n\nPinSAGE**** implements another sampling solution using **random walks**. It\nhas two main objectives:\n\n 1. Sample a **fixed number of neighbors** (like GraphSAGE);\n\n 2. Obtain their **relative importance** (important nodes are seen more frequently than others).\n\nThis strategy feels a bit like a fast **attention mechanism**. It assigns\nweights to nodes and increases the relevance of the most popular ones.\n\n### **\ud83d\udca5 B. Aggregation**\n\nThe aggregation process determines how to combine the feature vectors to\nproduce the node embeddings. The original paper presents three ways of\naggregating features:\n\n * **Mean** aggregator;\n\n * **LSTM** aggregator;\n\n * **Pooling** aggregator.\n\nAggregation (image by author)\n\nThe **mean aggregator** is the simplest one. The idea is close to a GCN\napproach:\n\n 1. The hidden features of the target node and its selected neighbors are averaged (\u00d1\u1d62);\n\n 2. A linear transformation with a weight matrix \ud835\udc16 is applied.\n\nThe result can then be fed to a non-linear activation function like _ReLU_.\n\nThe **LSTM aggregator** can seem like a weird idea because this architecture\nis sequential: it assigns an order to our unordered nodes. This is why the\nauthors randomly shuffle them to force the LSTM to only consider the hidden\nfeatures. It is the best performing technique in their benchmarks.\n\nThe **pooling aggregator** feeds each neighbor\u2019s hidden vector to a\nfeedforward neural network. A max-pooling operation is applied to the result.\n\n### \ud83e\udde0 III. GraphSAGE in PyTorch Geometric\n\nWe can easily implement a GraphSAGE architecture in PyTorch Geometric with the\n`SAGEConv` layer. This implementation uses two weight matrices instead of one,\nlike UberEats\u2019 version of GraphSAGE:\n\nLet's create a network with two `SAGEConv` layers:\n\n * The first one will use _**ReLU**_ as the activation function and a **dropout layer** ;\n\n * The second one will directly output the **node embeddings**.\n\nAs we're dealing with a multi-class classification task, we'll use the cross-\nentropy loss as our loss function. 
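Here is a hedged sketch of that network and of its training setup (hidden size, dropout rate, and learning rate are assumptions; the `weight_decay` term corresponds to the L2 regularization mentioned just below):

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import SAGEConv
    
    class GraphSAGE(torch.nn.Module):
        """Two-layer GraphSAGE for node classification."""
        def __init__(self, dim_in, dim_h, dim_out):
            super().__init__()
            self.sage1 = SAGEConv(dim_in, dim_h)
            self.sage2 = SAGEConv(dim_h, dim_out)
    
        def forward(self, x, edge_index):
            h = self.sage1(x, edge_index)
            h = torch.relu(h)
            h = F.dropout(h, p=0.5, training=self.training)
            return self.sage2(h, edge_index)  # raw logits for the cross-entropy loss
    
    # PubMed: 500 features, 3 classes
    model = GraphSAGE(dim_in=500, dim_h=64, dim_out=3)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)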
I also added an L2 regularization of 0.0005\nfor good measure.\n\nTo see the benefits of GraphSAGE, let's **compare** it with**** a GCN and a\nGAT without any sampling.\n\nWith GraphSAGE, we loop through **batches** (our 4 subgraphs) created by the\nneighbor sampling process. The way we calculate the accuracy and the\nvalidation loss is also different because of that.\n\nHere are the results (in terms of **accuracy** and **training time**) for****\nthe GCN, the GAT, and GraphSAGE:\n\n \n \n GCN test accuracy: 78.40% (52.6 s)\n GAT test accuracy: 77.10% (18min 7s)\n GraphSAGE test accuracy: 77.20% (12.4 s)\n\nThe three models obtain **similar** results in terms of accuracy. We expect\nthe GAT to perform better because its aggregation mechanism is more nuanced,\nbut it\u2019s not always the case.\n\nThe real difference is the training time: GraphSAGE is **88 times** faster\nthan the GAT and 4 times**** faster than the GCN in this example!\n\nHere lies the true power of GraphSAGE. We do lose a lot of information by\npruning our graph with neighbor sampling. The final node embeddings might\n**not be as good** as what we could find with a GCN or a GAT. But this is not\nthe point: GraphSAGE is designed to improve scalability. In turn, it can lead\nto building larger graphs that can improve accuracy.\n\nImage by author\n\nThis work was done in a supervised training setting (node classification), but\nwe could also train GraphSAGE in an **unsupervised way**.\n\nIn this case, we can\u2019t use the cross-entropy loss. We have to engineer a loss\nfunction that forces nodes that are nearby in the original graph to remain\nclose to each other in the embedding space. Conversely, the same function must\nensure that **distant nodes** in the graph must have **distant\nrepresentations** in the embedding space. This is the loss that is presented\nin GraphSAGE\u2019s paper.\n\nIn the case of PinSAGE and UberEeats\u2019 modified GraphSAGE, we\u2019re dealing with\n**recommender systems**.\n\nThe goal is to correctly rank the most relevant items (pins, restaurants) for\neach user, which is very different. We don\u2019t only want to know what the\nclosest embeddings are, we have to produce the **best rankings possible**.\nThis is why these systems are also trained in an unsupervised way, but with\nanother loss function: a max-margin ranking loss.\n\n### **Conclusion**\n\nGraphSAGE is an incredibly fast architecture to process large graphs. It might\nnot be as accurate as a GCN or a GAT, but it is an essential model for\nhandling **massive amounts of data**. It delivers this speed thanks to a\nclever combination of 1/ neighbor sampling to prune the graph and 2/ fast\naggregation with a mean aggregator in this example. In this article,\n\n * We explored a **new dataset** with PubMed, which is several times larger than the previous one;\n\n * We explained the idea behind **neighbor sampling** , which only considers a predefined number of random neighbors at each hop;\n\n * We saw the **three aggregators** presented in GraphSAGE\u2019s paper and focused on the mean aggregator;\n\n * We benchmarked**** three models (GraphSAGE, GAT, and GCN) in terms of **accuracy** and **training time**.\n\nWe saw three architectures with the same end application: node classification.\nBut GNNs have been successfully applied to other tasks. 
In the next tutorials,\nI\u2019d like to use them in two different contexts: **graph and edge prediction**.\nThis will be a good way to discover new datasets and applications where GNNs\ndominate the state of the art.\n\nIf you enjoyed this article, let\u2019s connect on Twitter @maximelabonne for more\ngraph learning content.\n\nThanks for your attention! \ud83d\udce3\n\n### Related articles\n\n**How to Design the Most Powerful Graph Neural Network** \n _Graph classification with Graph Isomorphism Networks_towardsdatascience.com\n\n**Graph Attention Networks: Self-Attention Explained** \n _A guide to GNNs with self-attention using PyTorch\nGeometric_towardsdatascience.com\n\nShare this post\n\n#### GraphSAGE: Scaling up Graph Neural Networks\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-graphsage-in-python-a9e7f9ecf9d7" }, { "id": "e48f1530-201c-4ee2-8d49-bdc30a70b5af", "content": { "Title": "Graph Attention Networks: Self-Attention Explained", "Subtitle": "A guide to GNNs with self-attention using PyTorch Geometric", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Graph Attention Networks: Self-Attention Explained\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Graph Attention Networks: Self-Attention Explained\n\n### A guide to GNNs with self-attention using PyTorch Geometric\n\nMaxime Labonne\n\nApr 17, 2022\n\nShare this post\n\n#### Graph Attention Networks: Self-Attention Explained\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### A guide to GNNs with self-attention using PyTorch Geometric\n\nImage by author, file icon by OpenMoji (CC BY-SA 4.0)\n\nGraph Attention Networks are **one of the most popular types** of Graph Neural\nNetworks. For a good reason.\n\nWith Graph _Convolutional_ Networks (GCN), every neighbor has the **same\nimportance**. Obviously, it should not be the case: some nodes are more\nessential than others.\n\nNode 4 is more important than node 3, which is more important than node 2\n(image by author)\n\nGraph _Attention_ Networks offer a solution to this problem. To consider the\nimportance of each neighbor, an attention mechanism assigns a **weighting\nfactor to every connection**.\n\nIn this article, we\u2019ll see how to **calculate** these attention scores and\n**implement** an efficient GAT in PyTorch Geometric (PyG). You can run the\ncode of this tutorial with the following Google Colab notebook.\n\n### \ud83c\udf10 I. Graph data\n\nCiteSeer dataset (image by author, made with yEd Live)\n\nThere are three classic graph datasets we can use for this work (MIT license).\nThey represent networks of research papers, where each connection is a\ncitation.\n\n * **Cora** : it consists of 2708 machine learning papers that belong to one of 7 categories. 
\n\u27a1\ufe0f Node features represent the presence (1) or absence (0) of 1433 words in a\npaper (binary bag of words).\n\n * **CiteSeer** : it is a bigger but similar dataset of 3312 scientific papers to classify into one of 6 categories. \n\u27a1\ufe0f Node features represent the presence (1) or absence (0) of 3703 words in a\npaper.\n\n * **PubMed** : it is an even bigger dataset with 19717 scientific publications about diabetes from PubMed\u2019s database, classified into 3 categories. \n\u27a1\ufe0f Node features are TF-IDF weighted word vectors from a dictionary of 500\nunique words.\n\nThese datasets have been widely used by the scientific community. As a\nchallenge, we can compare our accuracy scores to those obtained in the\nliterature using **Multilayer Perceptrons** (MLPs), **GCNs** , and **GATs** :\n\nPubMed is quite large so it would take longer to process it and train a GNN on\nit. Cora is the most studied one in the literature, so let\u2019s **focus on\nCiteSeer** as a middle ground.\n\nWe can directly import any of these datasets in PyTorch Geometric with the\nPlanetoid class:\n\n \n \n Number of graphs: 1\n Number of nodes: 3327\n Number of features: 3703\n Number of classes: 6\n Has isolated nodes: True\n\nInterestingly enough, we have **3327 nodes instead of 3312.** I found that PyG\nactually uses this paper\u2019s implementation of CiteSeer, which also displays\n3327 nodes. Mystery solved for now.\n\nHowever, we observe that **some nodes are isolated** (48 to be precise)!\nCorrectly classifying these isolated nodes will be a challenge since we cannot\nrely on any aggregation.\n\nLet\u2019s plot the number of connections of each node with `degree`:\n\nMost nodes only have **1 or 2 neighbors**. It could explain why CiteSeer****\nobtains lower accuracy scores than the two other datasets\u2026\n\n### \u26a0\ufe0f II. Self-attention\n\nIntroduced by Veli\u010dkovi\u0107 et al. in 2017, self-attention in GNNs relies on a\nsimple idea: **nodes should not all have the same importance**.\n\nWe talk about _self_ -attention (and not just attention) because inputs are\ncompared to each other.\n\nImage by author\n\nThis mechanism assigns a**weighting factor**(attention score)**** to each\nconnection. Let\u2019s call _**\u03b1**_**\u1d62\u2c7c** the attention score between the nodes _i_\nand _j_.\n\nHere\u2019s how to calculate the embedding of node 1, where \ud835\udc16 is a shared weight\nmatrix:\n\nBut how do we calculate the attention scores? We could write a static formula,\nbut there\u2019s a smarter solution: we can **learn** **their values with a neural\nnetwork**. There are three steps in this process:\n\n 1. **Linear transformation** ;\n\n 2. **Activation function** ;\n\n 3. **Softmax normalization.**\n\n#### 1\ufe0f\u20e3 Linear transformation\n\nWe want to calculate the **importance of each connection** , so we need pairs\nof hidden vectors. An easy way to create these pairs is to concatenate vectors\nfrom both nodes.\n\nOnly then can we apply a new **linear transformation** with a weight matrix\n\ud835\udc16**\u2090\u209c\u209c** :\n\nImage by author\n\n#### 2\ufe0f\u20e3 Activation function\n\nWe\u2019re building a**** neural network, so the second step is to add an\nactivation function. In this case, the authors of the paper chose the\n_LeakyReLU_ function.\n\nImage by author\n\n#### 3\ufe0f\u20e3 Softmax normalization\n\nThe output of our neural network is **not normalized** , which is a problem\nsince we want to compare these scores. 
To be able to say if node 2 is more\nimportant to node 1 than node 3 (_\u03b1_ \u2081\u2082 > _\u03b1_ \u2081\u2083), we need to share the same\nscale.\n\nA common way to do it with neural networks is to use the _**softmax**_\nfunction. Here, we apply it to every neighboring node:\n\nImage by author\n\nHere you have it: we can calculate every _\u03b1_ \u1d62\u2c7c. The only problem is\u2026 **self-\nattention is not very stable**. In order to improve performance, Vaswani et\nal. introduced multi-head attention in the transformer architecture.\n\n#### 4\ufe0f\u20e3 Bonus: multi-head attention\n\nThis is only slightly surprising since we\u2019ve been talking about self-attention\na lot but, in reality, **transformers are GNNs in disguise**. This is why we\ncan reuse some ideas from Natural Language Processing here.\n\nMulti-head attention (image by author)\n\nIn GATs, multi-head attention consists of **replicating the same 3 steps\nseveral times** in order to average or concatenate the results. That\u2019s it.\nInstead of a single _h\u2081_ , we get one hidden vector _h\u2081\u1d4f_ per attention head.\nOne of the two following schemes can then be applied:\n\n * **Average** : we sum the different _h\u1d62\u1d4f\u200b_ and normalize the result by the number of attention heads _n_ ;\n\n * **Concatenation** : we concatenate the different _h\u1d62\u1d4f_.\u200b\n\nIn practice, we use the **concatenation scheme** when it\u2019s a hidden layer, and\nthe **average scheme** when it\u2019s the last layer of the network.\n\n### \ud83e\udde0 III. Graph Attention Networks\n\nLet\u2019s implement a GAT in PyTorch Geometric. This library has **two different\ngraph attention layers** : `GATConv` and `GATv2Conv`.\n\nWhat we talked about so far is the `GatConv` layer, but in 2021 Brody et al.\nintroduced an improvement by modifying the order of operations. The weight\nmatrix \ud835\udc16 is applied **after the concatenation** , and the attention weight\nmatrix \ud835\udc16**\u2090\u209c\u209c** is used **after the** _**LeakyReLU**_**function**. In summary:\n\n * `GatConv`:\n\n * `Gatv2Conv`:\n\nWhich one should you use? According to Brody et al., **`Gatv2Conv`\nconsistently outperforms `GatConv` **and thus should be preferred.\n\nNow let\u2019s classify the papers from CiteSeer! 
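Here is a minimal sketch of such a model using `GATv2Conv`, matching the architecture summary printed further below (8 heads in the first layer, 1 in the second). The dropout rate and the ELU activation are assumptions borrowed from the usual GAT setup, not values confirmed by the text:

```python
# Minimal sketch of a two-layer GAT for CiteSeer with GATv2Conv.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv

class GAT(torch.nn.Module):
    def __init__(self, dim_in, dim_h, dim_out, heads=8):
        super().__init__()
        # First layer: multi-head attention, the 8 outputs are concatenated (dim_h * heads).
        self.gat1 = GATv2Conv(dim_in, dim_h, heads=heads)
        # Second layer: a single head produces the final embeddings / class scores.
        self.gat2 = GATv2Conv(dim_h * heads, dim_out, heads=1)

    def forward(self, x, edge_index):
        h = F.dropout(x, p=0.6, training=self.training)
        h = F.elu(self.gat1(h, edge_index))
        h = F.dropout(h, p=0.6, training=self.training)
        return self.gat2(h, edge_index)

# CiteSeer: 3703 node features, 6 classes, 8 hidden units per head.
model = GAT(dim_in=3703, dim_h=8, dim_out=6)
```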
I tried to **roughly reproduce\nthe experiments** of the original authors without adding too much complexity.\nYou can find the official implementation of GAT on GitHub.\n\nNote that we use graph attention layers in two configurations:\n\n * The**first layer** concatenates 8 outputs (multi-head attention);\n\n * The **second layer** only has 1 head, which produces our final embeddings.\n\nWe\u2019re also gonna train and test a GCN to compare the accuracy scores.\n\n \n \n GCN(\n (gcn1): GCNConv(3703, 16)\n (gcn2): GCNConv(16, 6)\n )\n \n \n Epoch 0 | Train Loss: 1.782 | Train Acc: 20.83% | Val Loss: 1.79 \n Epoch 20 | Train Loss: 0.165 | Train Acc: 95.00% | Val Loss: 1.30 \n Epoch 40 | Train Loss: 0.069 | Train Acc: 99.17% | Val Loss: 1.66 \n Epoch 60 | Train Loss: 0.053 | Train Acc: 99.17% | Val Loss: 1.50 \n Epoch 80 | Train Loss: 0.054 | Train Acc: 100.00% | Val Loss: 1.67 \n Epoch 100 | Train Loss: 0.062 | Train Acc: 99.17% | Val Loss: 1.62 \n Epoch 120 | Train Loss: 0.043 | Train Acc: 100.00% | Val Loss: 1.66 \n Epoch 140 | Train Loss: 0.058 | Train Acc: 98.33% | Val Loss: 1.68 \n Epoch 160 | Train Loss: 0.037 | Train Acc: 100.00% | Val Loss: 1.44 \n Epoch 180 | Train Loss: 0.036 | Train Acc: 99.17% | Val Loss: 1.65 \n Epoch 200 | Train Loss: 0.093 | Train Acc: 95.83% | Val Loss: 1.73 \n \n GCN test accuracy: 67.70%\n \n CPU times: user 25.1 s, sys: 847 ms, total: 25.9 s\n Wall time: 32.4 s\n \n \n GAT(\n (gat1): GATv2Conv(3703, 8, heads=8)\n (gat2): GATv2Conv(64, 6, heads=1)\n )\n \n \n Epoch 0 | Train Loss: 1.790 | Val Loss: 1.81 | Val Acc: 12.80%\n Epoch 20 | Train Loss: 0.040 | Val Loss: 1.21 | Val Acc: 64.80%\n Epoch 40 | Train Loss: 0.027 | Val Loss: 1.20 | Val Acc: 67.20%\n Epoch 60 | Train Loss: 0.009 | Val Loss: 1.11 | Val Acc: 67.00%\n Epoch 80 | Train Loss: 0.013 | Val Loss: 1.16 | Val Acc: 66.80%\n Epoch 100 | Train Loss: 0.013 | Val Loss: 1.07 | Val Acc: 67.20%\n Epoch 120 | Train Loss: 0.014 | Val Loss: 1.12 | Val Acc: 66.40%\n Epoch 140 | Train Loss: 0.007 | Val Loss: 1.19 | Val Acc: 65.40%\n Epoch 160 | Train Loss: 0.007 | Val Loss: 1.16 | Val Acc: 68.40%\n Epoch 180 | Train Loss: 0.006 | Val Loss: 1.13 | Val Acc: 68.60%\n Epoch 200 | Train Loss: 0.007 | Val Loss: 1.13 | Val Acc: 68.40%\n \n GAT test accuracy: 70.00%\n \n CPU times: user 53.4 s, sys: 2.68 s, total: 56.1 s\n Wall time: 55.9 s\n\nThis experiment is not super rigorous: we\u2019d need to **repeat it**\n_**n**_**times** and take the average accuracy with a standard deviation as\nthe final result.\n\nWe can see in this example that the **GAT outperforms the GCN** in terms of\naccuracy (70.00% vs. 67.70%), but takes longer to train (55.9s vs. 32.4s).\nIt\u2019s a tradeoff that can cause scalability issues when working with large\ngraphs.\n\nThe authors obtained 72.5% for the GAT and 70.3% for the GCN, which is clearly\nbetter than what we did. The difference can be explained by **preprocessing**\n, some **tweaks in the models,** and a different **training setting**(_e.g.,_\na patience of 100 instead of a fixed number of epochs).\n\nLet\u2019s visualize what the GAT learned. We\u2019re gonna use t-SNE, a powerful method\nto plot high-dimensional data in 2D or 3D. First, let\u2019s see what the\nembeddings looked like before any training: it should be absolutely **random**\nsince they\u2019re produced by randomly initialized weight matrices.\n\nIndeed, there\u2019s **no apparent structure**. 
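For reference, a plot like this one can be produced with a sketch along the following lines, reusing the `GAT` class from the earlier sketch and assuming `data` holds the CiteSeer graph loaded with `Planetoid`; the plotting details are illustrative, not the exact notebook code:

```python
# Minimal sketch: project the (untrained) GAT outputs to 2D with t-SNE and color by class.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

untrained_model = GAT(dim_in=3703, dim_h=8, dim_out=6)
untrained_model.eval()
with torch.no_grad():
    h = untrained_model(data.x, data.edge_index)

z = TSNE(n_components=2).fit_transform(h.numpy())

plt.scatter(z[:, 0], z[:, 1], s=10, c=data.y.numpy(), cmap='tab10')
plt.title('GAT embeddings before training')
plt.show()
```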
But do the embeddings produced by\nour trained model look better?\n\nThe difference is noticeable: **nodes belonging to the same classes cluster\ntogether**. We can see 6 clusters, corresponding to the 6 classes of papers.\nThere are outliers, but this was to be expected: our accuracy score is far\nfrom perfect.\n\nPreviously, I speculated that poorly connected nodes**** might**negatively\nimpact** performance on CiteSeer. Let\u2019s calculate the model\u2019s accuracy for\neach degree.\n\nThese results confirm our intuition: nodes with few neighbors are indeed\n**harder to classify**. This is due to the nature of GNNs: the more relevant\nconnections you have, the more information you can aggregate.\n\n### Conclusion\n\nWhile they take longer to train, GATs are a **substantial improvement** over\nGCNs in terms of accuracy. The self-attention mechanism automatically\ncalculates weighting factors instead of static coefficients to produce better\nembeddings. In this article,\n\n * We learned about the **self-attention** mechanism applied to GNNs;\n\n * We implemented and **compared** two**** architectures (a GCN and a GAT) in PyTorch Geometric;\n\n * We visualized how and what the GAT learns with a **t-SNE** plot and the accuracy score for each degree;\n\nGATs are the de facto standard in a lot of GNN applications. However, their\n**slow training time** can become a problem when applied to massive graph\ndatasets. Scalability is an important factor in deep learning: most often,\nmore data can lead to better performance.\n\nIn the next article, we\u2019ll see **how to improve scalability** with mini-\nbatching and a new GNN architecture called GraphSAGE.\n\nIf you enjoyed this tutorial, feel free to **follow me on Twitter** for more\nGNN content. Thank you and see you in the next article! \ud83d\udce3\n\n### Related articles\n\n**Introduction to GraphSAGE in Python** \n _Scaling Graph Neural Networks to billions of\nconnections_towardsdatascience.com\n\n**How to Design the Most Powerful Graph Neural Network** \n _Graph classification with Graph Isomorphism Networks_towardsdatascience.com\n\nShare this post\n\n#### Graph Attention Networks: Self-Attention Explained\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/graph-attention-networks-in-python-975736ac5c0c" }, { "id": "bb728e7c-4c22-443c-a630-b68f5e54b5a6", "content": { "Title": "Integer vs. Linear Programming in Python", "Subtitle": "A guide to identify and solve any optimization problem", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Integer vs. Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Integer vs. Linear Programming in Python\n\n### A guide to identify and solve any optimization problem\n\nMaxime Labonne\n\nApr 07, 2022\n\nShare this post\n\n#### Integer vs. 
Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Mixed Integer Programming for optimization with Google OR-Tools\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nWhy is **linear programming** called that way?\n\nBoth terms are confusing:\n\n * **Linear** implies that **nonlinear** programming exists;\n\n * **Programming** actually**** means \u201c**planning** \u201d in this context.\n\nIn summary, it has nothing to do with code: linear or not. It\u2019s about\n**optimizing** variables with various constraints.\n\nIn this article, we\u2019re gonna talk about another type of optimization:\n**integer programming**. We\u2019ll see why a good understanding of the problem we\nface is necessary to choose the right solver. Finally, we will write a model\nthat can take on a bigger challenge and actually solve a whole class of\noptimization problems.\n\nYou can run the code from this tutorial with the following **Google Colab\nnotebook**.\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\n### \ud83d\udcca I. Optimization problem types\n\nIn the introduction to linear programming, we **optimized an army\ncomposition**. Here was the result:\n\n \n \n ================= Solution =================\n Solved in 87.00 milliseconds in 2 iterations\n \n Optimal power = 1800.0 \ud83d\udcaapower\n Army:\n - \ud83d\udde1\ufe0fSwordsmen = 6.0000000000000036\n - \ud83c\udff9Bowmen = 0.0\n - \ud83d\udc0eHorsemen = 5.999999999999999\n\nHow can we have 5.999\u2026 horsemen? We specified that our variables **should be\nintegers** with `VarInt`. What was wrong with our code?\n\nThe problem is not the model but the choice of the solver.\n\nGLOP is a pure linear programming solver. This means that it **cannot\nunderstand the concept of integers**. It is limited to continuous parameters\nwith a linear relationship.\n\nThis is the difference between **linear** programming (LP) and **integer\nlinear** programming (ILP). In summary, LP solvers can only use real numbers\nand not integers as variables. So why did we declare our variables as integers\nif it doesn\u2019t take them into account?\n\nGLOP cannot solve ILP problems, but other solvers can. Actually, a lot of them\nare **mixed integer linear programming** (MILP, commonly called MIP) solvers.\nThis means that they can consider both **continuous** (real numbers) and\n**discrete** (integers) variables. A particular case of discrete values is\nBoolean variables to represent decisions with 0\u20131 values.\n\nOther solvers like SCIP or CBC can solve both **MILP and MINLP** (mixed\ninteger _nonlinear_ programming) problems. Thanks to OR-Tools, we can use the\nsame model and just change the solver to SCIP or CBC.\n\n \n \n ================= Solution =================\n Solved in 3.00 milliseconds in 0 iterations\n \n \n Optimal value = 1800.0 \ud83d\udcaapower\n Army: \n \u2014 \ud83d\udde1\ufe0fSwordsmen = 6.0\n \u2014 \ud83c\udff9Bowmen = 0.0\n \u2014 \ud83d\udc0eHorsemen = 6.0\n\nStrictly speaking, our variables are still floats\n(`type(swordsmen.solution_value()) = float`) but we can see that they don't\nhave weird decimals anymore: the CBC solver really considered them as\n**integers**.\n\nIn this example, we would generally just **round up these values** since the\nerror is insignificant. 
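The switch itself is essentially a one-line change. Here is a minimal sketch of the same army model solved with CBC instead of GLOP (the unit costs, powers, and resource limits are the ones from the original problem):

```python
# Minimal sketch: the same model as before, but with a MIP-capable solver.
from ortools.linear_solver import pywraplp

# 'CBC' (or 'SCIP') understands integer variables, unlike 'GLOP'.
solver = pywraplp.Solver.CreateSolver('CBC')

swordsmen = solver.IntVar(0, solver.infinity(), 'swordsmen')
bowmen = solver.IntVar(0, solver.infinity(), 'bowmen')
horsemen = solver.IntVar(0, solver.infinity(), 'horsemen')

solver.Add(swordsmen * 60 + bowmen * 80 + horsemen * 140 <= 1200)  # food
solver.Add(swordsmen * 20 + bowmen * 10 <= 800)                    # wood
solver.Add(bowmen * 40 + horsemen * 100 <= 600)                    # gold

solver.Maximize(swordsmen * 70 + bowmen * 95 + horsemen * 230)

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print(f'Optimal power = {solver.Objective().Value()}')
    print(f'Swordsmen = {swordsmen.solution_value()}')
    print(f'Bowmen    = {bowmen.solution_value()}')
    print(f'Horsemen  = {horsemen.solution_value()}')
```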
However, it is important to remember to choose the\nappropriate solver according to the studied problem:\n\n * **LP** for continuous variables;\n\n * **MIP/MILP** for a combination of continuous and discrete variables.\n\nThere are other types such as **quadratic** (QP) or **nonlinear** (NLP or\nMINLP, with an exponential objective function or constraints for instance)\nproblems. They\u2019re applied in different contexts, but follow the same\nprinciples as LP or MIP solvers.\n\nImage by author\n\n### \ud83e\uddf1 II. Building a general model\n\nBut what if our **resources change**? Or if the cost of a unit evolved? What\nif we upgraded horsemen and their power increased?\n\nOne of the best perks of OR-Tools is that it uses a general-purpose\nprogramming language like Python. Instead of static numbers, we can store our\nparameters in objects like **dictionaries** or **lists**.\n\nThe code won\u2019t be as readable, but it becomes much more flexible: actually, it\ncan be so flexible that we can solve an **entire class of optimization\nproblems** without changing the model (just the parameters).\n\nLet\u2019s transform our input parameters into Python lists and feed them to the\nsolver through a function.\n\n \n \n ================= Solution =================\n Solved in 2.00 milliseconds in 0 iterations\n \n \n Optimal value = 1800.0 \ud83d\udcaapower \n Army:\n \u2014 \ud83d\udde1\ufe0fSwordsmen = 6.0\n \u2014 \ud83c\udff9Bowmen = 0.0\n \u2014 \ud83d\udc0eHorsemen = 6.0\n\nWe obtain the same results: our code seems to work. Now let\u2019s **change the\nparameters** to tackle a slightly more complex problem.\n\nImagine we have a lot more resources: \ud83c\udf3e**183000** , \ud83e\udeb5**90512** , and\n\ud83e\ude99**80150** , so we can also produce a lot more units! This is the new table:\n\nNotice that we transformed the \ud83d\udcaa**power** into two values: \ud83d\udcaa**attack** and\n\u2764\ufe0f**health** , which is a little more detailed. Health values are higher than\nattack values, which is why we want to add a weighting factor to make them\nmore comparable.\n\nLet\u2019s take 10 as an example, so _power = 10*attack + health_. Our objective\nfunction becomes:\n\nAdapting our code to this new problem is actually quite simple: we just have\nto **change the input parameters** and update the **objective function**.\n\n \n \n ================= Solution =================\n Solved in 74.00 milliseconds in 412 iterations\n \n \n Optimal value = 1393145.0 \ud83d\udcaapower\n Army:\n \u2014 \ud83d\udde1\ufe0fSwordsmen = 2.0\n \u2014 \ud83d\udee1\ufe0fMen-at-arms = 1283.0\n \u2014 \ud83c\udff9Bowmen = 3.0\n \u2014 \u274cCrossbowmen = 0.0\n \u2014 \ud83d\udd2bHandcannoneers = 454.0\n \u2014 \ud83d\udc0eHorsemen = 0.0\n \u2014 \u265eKnights = 0.0\n \u2014 \ud83d\udc0fBattering rams = 301.0\n \u2014 \ud83c\udfafSpringalds = 0.0\n \u2014 \ud83e\udea8Mangonels = 0.0\n\nThis problem would take a long time for humans to address, but the ILP solver\ndid it in the blink of an eye. Better than that: it also gives us the\nguarantee that **our solution is optimal** , which means that our enemy cannot\nfind a better army composition for the same cost!\n\nWe could increase the number of units and give billions of resources but you\nget the picture: it would just take longer to obtain a solution, but it\nwouldn\u2019t change the problem.\n\n### \u2694\ufe0f III. Combining constraints\n\nNow, let\u2019s say we scouted our enemy and know that their army has a \ud83d\udcaapower of\n**1,000,000**. 
We could build a much better army, but our resources are\nprecious and it wouldn\u2019t be very efficient: all we have to do is to build an\narmy with a **\ud83d\udcaapower higher than 1,000,000** (even 1,000,001 would be enough).\n\nIn other words, the total power is now a **constraint**(\ud83d\udcaa > 1,000,000) instead\nof the objective to maximize. The new goal is to minimize the resources we\nneed to produce this army. However, we can reuse our input parameters since\nthey didn\u2019t change.\n\nThe new constraint can be translated as \u201cthe sum of the power of the selected\nunits must be strictly greater than 1,000,000\u201d.\n\nIn code, we can loop through our units and resources to design this\nconstraint.\n\nThe objective function also has to change. Our goal is to **minimize the sum\nof resources spent** to build the army.\n\nOnce again, we can loop through our resources to implement it in OR-Tools.\n\n \n \n ================= Solution =================\n Solved in 4.00 milliseconds in 0 iterations\n \n \n Optimal value = 111300.0 \ud83c\udf3e\ud83e\udeb5\ud83e\ude99resources\n Power = \ud83d\udcaa1001700.0 \n Army:\n \u2014 \ud83d\udde1\ufe0fSwordsmen = 0.0\n \u2014 \ud83d\udee1\ufe0fMen-at-arms = 0.0\n \u2014 \ud83c\udff9Bowmen = 0.0\n \u2014 \u274cCrossbowmen = 0.0\n \u2014 \ud83d\udd2bHandcannoneers = 0.0\n \u2014 \ud83d\udc0eHorsemen = 0.0\n \u2014 \u265eKnights = 0.0\n \u2014 \ud83d\udc0fBattering rams = 371.0\n \u2014 \ud83c\udfafSpringalds = 0.0\n \u2014 \ud83e\udea8Mangonels = 0.0\n \n \n Resources:\n \u2014 \ud83c\udf3eFood = 0.0\n \u2014 \ud83e\udeb5Wood = 111300.0\n \u2014 \ud83e\ude99Gold = 0.0\n\nThe solver found an optimal solution: we need to build **371 \ud83d\udc0fbattering rams**\nfor a total cost of 111,300 \ud83e\udeb5wood. Wait, what if we don\u2019t have that much wood?\nIn the previous section, we only had \ud83e\udeb590512: we cannot produce 371 \ud83d\udc0fbattering\nrams. \ud83d\ude31\n\nSo is it possible to take these **limited resources** into account and still\ntry to **build the best army**? Actually, it\u2019s super easy: we just have to\ncopy/paste the constraints from the previous section.\n\nIn this version, we have two types of constraints:\n\n * The total power must be **greater than 1,000,000** ;\n\n * We cannot spend more than our **limited resources**.\n\n \n \n ================= Solution =================\n Solved in 28.00 milliseconds in 1 iterations\n \n \n Optimal value = 172100.0 \ud83c\udf3e\ud83e\udeb5\ud83e\ude99resources\n Power = \ud83d\udcaa1000105.0\n Army:\n \u2014 \ud83d\udde1\ufe0fSwordsmen = 1.0\n \u2014 \ud83d\udee1\ufe0fMen-at-arms = 681.0\n \u2014 \ud83c\udff9Bowmen = 0.0\n \u2014 \u274cCrossbowmen = 0.0\n \u2014 \ud83d\udd2bHandcannoneers = 0.0\n \u2014 \ud83d\udc0eHorsemen = 0.0\n \u2014 \u265eKnights = 0.0\n \u2014 \ud83d\udc0fBattering rams = 301.0\n \u2014 \ud83c\udfafSpringalds = 0.0\n \u2014 \ud83e\udea8Mangonels = 0.0 \n \n \n Resources:\n \u2014 \ud83c\udf3eFood = 68160.0\n \u2014 \ud83e\udeb5Wood = 90320.0\n \u2014 \ud83e\ude99Gold = 13620.0\n\nSince we now have a **limited resource of \ud83e\udeb5wood** , the number of \ud83d\udc0fbattering\nrams sadly dropped from 371 to 301. In exchange, we got 681 \ud83d\udee1\ufe0fmen-at-arms and\n1 lost \ud83d\udde1\ufe0fswordsman (welcome to them).\n\nThe total cost of the army is **172,100** , which is much higher than the\n111,300 we previously found (+65% increase) but it truly is the optimal\nsolution under these constraints. 
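For reference, the structure of such a combined model looks roughly like the following sketch. The unit list, costs, and powers below are placeholders, not the exact values from the notebook; only the resource stocks and the 1,000,000 power threshold come from the text above:

```python
# Hypothetical sketch: minimize total resources spent, subject to a minimum power
# and to limited resource stocks.
from ortools.linear_solver import pywraplp

UNITS = ['Swordsmen', 'Men-at-arms', 'Bowmen']               # illustrative subset
RESOURCES = {'food': 183000, 'wood': 90512, 'gold': 80150}   # stocks from the article
COSTS = {                                                    # placeholder costs per unit
    'Swordsmen':   {'food': 60, 'wood': 20, 'gold': 0},
    'Men-at-arms': {'food': 30, 'wood': 0,  'gold': 50},
    'Bowmen':      {'food': 80, 'wood': 10, 'gold': 40},
}
POWER = {'Swordsmen': 70, 'Men-at-arms': 120, 'Bowmen': 95}  # placeholder powers

solver = pywraplp.Solver.CreateSolver('CBC')
units = {name: solver.IntVar(0, solver.infinity(), name) for name in UNITS}

# Constraint 1: the total power must be strictly greater than 1,000,000.
solver.Add(solver.Sum([POWER[n] * units[n] for n in UNITS]) >= 1_000_001)

# Constraint 2: we cannot spend more of each resource than we own.
for res, stock in RESOURCES.items():
    solver.Add(solver.Sum([COSTS[n][res] * units[n] for n in UNITS]) <= stock)

# Objective: minimize the total amount of resources spent on the army.
solver.Minimize(solver.Sum([COSTS[n][res] * units[n]
                            for n in UNITS for res in RESOURCES]))

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    for name, var in units.items():
        print(f'{name} = {var.solution_value()}')
```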
It shows that we should produce more wood\nbecause these \ud83d\udc0f battering rams are extremely cost-efficient!\n\nThis example shows **how modular** LP models can be. It is possible to reuse\nparts of the code, like constraints, in another model to combine them and\nsolve more complex problems.\n\n### \ud83e\udde0 IV. Linear Programming vs Machine Learning\n\nLet\u2019s talk about the elephant in the room. Why not use **machine learning**\n(in a broad sense) instead of linear programming? It\u2019s not like this problem\ncannot be solved with a genetic algorithm for instance.\n\nMathematical optimization is often neglected in favor of machine learning\ntechniques, but both have their merits:\n\n * Linear programming can produce an **optimal solution** in an undetermined amount of time (it can take years), while machine learning can approximate complex functions in no time.\n\n * There is **no training** in LP, but an expert is required to build a mathematical model. Machine learning needs data, but the models can be used as black boxes to solve a problem.\n\n * As a rule of thumb, problems that **do not have a particular time constraint** and/or are not extremely complex can be advantageously solved with linear programming.\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\n### Conclusion\n\nIn this tutorial, we dived deeper into our understanding of mathematical\noptimization.\n\n * We talked about solvers and types of optimization problems: **LP, MIP, NLP** ;\n\n * We modeled and solved an extremely common optimization problem in an optimal way and **generalized our model** through a function;\n\n * We reframed this problem and **merged two sets of constraints** to obtain the best army composition for the lowest price;\n\n * We compared the **pros and cons** of linear programming and machine learning.\n\nThere are **a lot more problems** where optimization can be applied. For\ninstance, how to create school timetables that satisfy everybody\u2019s\nrequirements? How to deliver 1,000 different orders in a minimum amount of\ntime? Where to create a new metro line to maximize its usefulness?\n\nIn future articles, we\u2019ll talk about new types of applications for these\ntechniques, including satisfiability and nonlinear problems.\n\nI hope you enjoyed this more advanced article. If you like machine learning\nand optimization, **let\u2019s connect on Twitter**!\n\n### Related articles\n\n**Part 3: Constraint Programming in Python** \n _The Programming Paradigm to Find One Solution Among 8,080,104\nCandidates_towardsdatascience.com\n\n**Part 1: Introduction to Linear Programming in Python** \n _A guide to mathematical optimization with Google OR-\nTools_towardsdatascience.com\n\nShare this post\n\n#### Integer vs. Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/integer-programming-vs-linear-programming-in-python-f1be5bb4e60e" }, { "id": "e75d9b4e-1a14-450e-ad51-b396969de6c5", "content": { "Title": "Introduction to Linear Programming in Python", "Subtitle": "A guide to mathematical optimization with Google OR-Tools", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Introduction to Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Introduction to Linear Programming in Python\n\n### A guide to mathematical optimization with Google OR-Tools\n\nMaxime Labonne\n\nApr 04, 2022\n\nShare this post\n\n#### Introduction to Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### A guide to mathematical optimization with Google OR-Tools\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0)\n\nLinear programming is a technique to **optimize any problem** with multiple\nvariables and constraints. It\u2019s a simple but powerful tool every data\nscientist should master.\n\nImagine you are a **strategist** recruiting an **army**. You have:\n\n * **Three resources** : \ud83c\udf3e**food** , \ud83e\udeb5**wood** , and \ud83e\ude99**gold**\n\n * **Three units** : \ud83d\udde1\ufe0f**swordsmen** , \ud83c\udff9**bowmen** , and \ud83d\udc0e**horsemen**.\n\nHorsemen are stronger than bowmen, who are in turn stronger than swordsmen.\nThe following table provides the cost and power of each unit:\n\nImage by author\n\nNow we have 1200 \ud83c\udf3efood, 800 \ud83e\udeb5wood, and 600 \ud83e\ude99gold. How should we **maximize the\npower of our army** considering these resources?\n\nWe could simply find the unit with the best power/cost ratio, take as many of\nthem as possible, and repeat the process with the other two units. But this\n\u201cguess and check\u201d solution might **not even be optimal** \u2026\n\nNow imagine we have **millions of units and resources** : the previous greedy\nstrategy is likely to completely miss the optimal solution. It is possible to\nuse a machine learning algorithm (e.g., a genetic algorithm) to solve this\nproblem, but we have no guarantee that the solution will be optimal either.\n\nFortunately for us, there is a method that can solve our problem in an optimal\nway: **linear programming** (or linear optimization), which is part of the\nfield of operations research (OR). In this article, we\u2019ll use it to find the\nbest numbers of swordsmen, bowmen, and horsemen to build the **army with the\nhighest power possible**.\n\nYou can run the code from this tutorial with the following **Google Colab\nnotebook**.\n\n### \ud83e\udde0 I. Solvers\n\nIn Python, there are different libraries for linear programming such as the\nmulti-purposed **SciPy** , the beginner-friendly **PuLP** , the exhaustive\n**Pyomo** , and many others.\n\nToday, we are going to use **Google OR-Tools** , which is quite user-friendly,\ncomes with several prepackaged solvers, and has by far the most stars on\nGitHub.\n\nIf the installation doesn't work, please restart the kernel and try again: it\ncan fail sometimes. \u00af\\\\_(\u30c4)_/\u00af\n\nAll these libraries have a hidden benefit: they act as **interfaces** to **use\nthe same model with different solvers**. 
Solvers like Gurobi, Cplex, or SCIP\nhave their own APIs, but the models they create are tied to a specific solver.\n\nOR-Tools allows us to use an abstract (and quite pythonic) way of modeling our\nproblems.**** We can then choose **one or several solvers** to find an optimal\nsolution. The model we built is thus highly reusable!\n\nImage by author\n\nOR-Tools comes with its own linear programming solver, called **GLOP** (Google\nLinear Optimization Package). It is an open-source project created by Google\u2019s\nOperations Research Team and written in C++.\n\nOther solvers are available such as **SCIP** , an excellent non-commercial\nsolver created in 2005 and updated and maintained to this day. We could also\nuse popular commercial options like **Gurobi** and **Cplex**. However, we\nwould need to install them on top of OR-Tools and get the appropriate licenses\n(which can be quite costly). For now, let\u2019s try GLOP.\n\n### \ud83e\uddee II. Variables\n\nWe created an instance of the OR-Tools solver using GLOP. Now, how to use\nlinear programming? The first thing we want to define is the **variables we\nwant to optimize**.\n\nIn our example, we have three variables: the number of \ud83d\udde1\ufe0fswordsmen, \ud83c\udff9bowmen,\nand \ud83d\udc0ehorsemen in the army. OR-Tools accepts three types of variables:\n\n * `NumVar` for **continuous** variables;\n\n * `IntVar` for **integer** variables;\n\n * `BoolVar` for **boolean** variables.\n\nWe\u2019re looking for **round numbers** of units, so let\u2019s choose `IntVar`. We\nthen need to specify lower and upper bounds for these variables. We want at\nleast 0 unit, but we don't really have an upper bound. So we can say that our\nupper bound is infinity (or any big number we will never reach). It can be\nwritten as:\n\nLet\u2019s translate it into code. Infinity is replaced by `solver.infinity()` in\nOR-Tools. Other than that, the syntax is **quite straightforward** :\n\n### \u26d3\ufe0f III. Constraints\n\nWe defined our variables, but the **constraints** are just as important.\n\nPerhaps counter-intuitively, adding more constraints helps the solver to\n**find an optimal solution faster**. Why is this the case? Think of the solver\nas a tree: constraints help it trim branches and reduce the search space.\n\nIn our case, we have a limited number of resources we can use to produce\nunits. In other words, **we can\u2019t spend more resources than we have**. For\ninstance, the \ud83c\udf3efood spent to recruit units cannot be higher than 1200. The\nsame is true with \ud83e\udeb5wood (800) and \ud83e\ude99gold (600).\n\nAccording to our table, units have the following costs:\n\n * 1**swordsman** = \ud83c\udf3e60 + \ud83e\udeb520;\n\n * 1 **bowman** = \ud83c\udf3e80 + \ud83e\udeb510 + \ud83e\ude9940;\n\n * 1**horseman** = \ud83c\udf3e140 + \ud83e\ude99100.\n\nWe can write one constraint per resource as follows:\n\nIn OR-Tools, we simply add the constraints to our solver instance with\n`solver.Add()`.\n\n### \ud83c\udfaf IV. Objective\n\nNow that we have our variables and constraints, we want to **define our goal**\n(or objective function).\n\nIn linear programming, this function **has to be linear**(like the\nconstraints), so of the form _ax + by + cz + d_. In our example, the objective\nis quite clear: we want to recruit the army with the highest power. 
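Before defining that objective, here is a minimal sketch of the model so far, with the solver, the three integer variables, and the three resource constraints described above:

```python
# Minimal sketch: solver creation, variables, and resource constraints.
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver('GLOP')

# Variables: number of units of each type, between 0 and "infinity".
swordsmen = solver.IntVar(0, solver.infinity(), 'swordsmen')
bowmen = solver.IntVar(0, solver.infinity(), 'bowmen')
horsemen = solver.IntVar(0, solver.infinity(), 'horsemen')

# Constraints: we cannot spend more resources than we have.
solver.Add(swordsmen * 60 + bowmen * 80 + horsemen * 140 <= 1200)  # food
solver.Add(swordsmen * 20 + bowmen * 10 <= 800)                    # wood
solver.Add(bowmen * 40 + horsemen * 100 <= 600)                    # gold
```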
The table\ngives us the following power values:\n\n * 1 **swordsman** = \ud83d\udcaa70;\n\n * 1 **bowman** = \ud83d\udcaa95;\n\n * 1 **horseman** = \ud83d\udcaa230.\n\nMaximizing the power of the army amounts to **maximizing the sum of the power\nof each unit**. Our objective function can be written as:\n\nIn general, there are only two types of objective functions: **maximizing** or\n**minimizing**. In OR-Tools, we declare this goal with `solver.Maximize()` or\n`solver.Minimize()`.\n\nAnd we\u2019re done! There are three steps to model any linear optimization\nproblem:\n\n 1. Declaring the **variables** to optimize with lower and upper bounds;\n\n 2. Adding **constraints** to these variables;\n\n 3. Defining the **objective function** to maximize or to minimize.\n\nNow that is clear, we can ask the solver to find an optimal solution for us.\n\n### \ud83e\udd47 V. Optimize!\n\nCalculating the optimal solution is done with `solver.Solve(``)` . This\nfunction returns a status that can be used to **check that the solution is\nindeed optimal**.\n\nLet's print the highest total power we can get with the best army\nconfiguration.\n\n \n \n ================= Solution =================\n Solved in 87.00 milliseconds in 2 iterations\n \n Optimal power = 1800.0 \ud83d\udcaapower\n Army:\n - \ud83d\udde1\ufe0fSwordsmen = 6.0000000000000036\n - \ud83c\udff9Bowmen = 0.0\n - \ud83d\udc0eHorsemen = 5.999999999999999\n\nGreat! The solver found an optimal solution: our army has a **total power of\n\ud83d\udcaa1800** with 6 \ud83d\udde1\ufe0fswordsmen and 6 \ud83d\udc0ehorsemen (sorry bowmen!).\n\nLet\u2019s unpack this result:\n\n * The solver decided to take the **maximum number of \ud83d\udc0ehorsemen** (6, since we only have \ud83e\ude99600 and they each cost \ud83e\ude99100);\n\n * The remaining resources are spent in \ud83d\udde1\ufe0f**swordsmen** : we have 1200 \u2013 6*140 = 360\ud83c\udf3efood left, which is why the solver chose 6 \ud83d\udde1\ufe0fswordsmen;\n\n * We can deduce that the horsemen are the best unit and the**bowmen are the worst one** because they haven\u2019t been chosen at all.\n\nOkay, but there\u2019s something quite weird: these numbers are not round, even\nthough we specified that we wanted **integers** (`IntVar`). So what happened?\n\nUnfortunately, answering this question requires a deep dive into linear\nprogramming\u2026 To keep things simple in this introduction, let\u2019s say it\u2019s\nbecause of GLOP. Solvers have characteristics we have to take into account,\nand **GLOP doesn\u2019t handle integers**. This is another proof that building\nreusable models is more than just convenient.\n\nWe\u2019ll explain why GLOP has this strange behavior and **how to fix it** in a\nmore advanced tutorial.\n\n### Conclusion\n\nWe saw through this example the **five main steps** of any linear optimization\nproblem:\n\n 1. **Choosing a solver** : in our case, we selected GLOP for convenience.\n\n 2. **Declaring variables** : the parameters to optimize were the number of swordsmen, bowmen, and horsemen.\n\n 3. **Declaring constraints** : each of these units has a cost. The total cost could not exceed our limited resources.\n\n 4. **Defining objective:** the criterion to maximize was the total power of this army. It could have been something else, like the number of units.\n\n 5. 
**Optimizing** : GLOP found an optimal solution to this problem in less than a second.\n\nImage by author\n\nThis is the main benefit of linear programming: the algorithm gives us a\n**guarantee that the solution that was found is** **optimal**(with a certain\nerror). This guarantee is powerful, but comes at a cost: the model can be so\ncomplex that the solver takes years (or more) to find an optimal solution. In\nthis scenario, we have two options:\n\n * We can **stop the solver** after a certain time (and probably obtain a suboptimal answer);\n\n * We can use a **metaheuristic** like a genetic algorithm to calculate an excellent solution in a short amount of time.\n\nIn the next article, we\u2019ll talk about the different types of optimization\nproblems and generalize our approach to an entire class of them.\n\nI hope you enjoyed this introduction! Feel free to share it and spread the\nknowledge about linear optimization. Don\u2019t forget to **check my blog** and\n**follow me on Twitter** where I post summaries of these articles. Cheers!\n\n### Related articles\n\n**Part 2: Integer vs. Linear Programming in Python** \n _A guide to identify and solve any optimization\nproblem_towardsdatascience.com\n\n**Part 3: Constraint Programming in Python** \n _The Programming Paradigm to Find One Solution Among 8,080,104\nCandidates_towardsdatascience.com\n\nShare this post\n\n#### Introduction to Linear Programming in Python\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-linear-programming-in-python-9261e7eb44b" }, { "id": "3ab3dc4a-2632-46fc-b12e-6ed4fc48fe9f", "content": { "Title": "What is a Tensor in Machine Learning? - Maxime Labonne", "Subtitle": "The difference between tensors, arrays, and matrices", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### What is a Tensor in Machine Learning?\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# What is a Tensor in Machine Learning?\n\n### The difference between tensors, arrays, and matrices\n\nMaxime Labonne\n\nMar 29, 2022\n\nShare this post\n\n#### What is a Tensor in Machine Learning?\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### The difference between tensors, arrays, and matrices\n\nImage by author\n\nWhat is a tensor, exactly?\n\nMost deep learning practitioners know about them but can\u2019t pinpoint an **exact\ndefinition**.\n\nTensorFlow, PyTorch: every deep learning framework relies on the same basic\nobject: **tensors**. 
They\u2019re used to store almost everything in deep learning:\ninput data, weights, biases, predictions, etc.\n\nAnd yet, their definition is incredibly fuzzy: the Wikipedia category alone\nhas **over 100 pages** related to tensors.\n\nIn this article, we'll give a **definitive answer** to the following question:\nwhat is a tensor in neural networks?\n\n### \ud83d\udcbb Tensors in computer science\n\nSo why are there so many definitions?\n\nIt's quite simple: different fields have different definitions. Tensors in\n**mathematics** are not quite the same as tensors in **physics** , which are\ndifferent from tensors in **computer science**.\n\nImage by author\n\nThese definitions can be divided into two categories: tensors as a data\nstructure or as objects (in an object-oriented programming sense).\n\n * **Data structure** : this is the definition we use in computer science. Tensors are multidimensional arrays that store a specific type of value.\n\n * **Objects** : this is the definition used in other fields. In mathematics and physics, tensors are not just a data structure: they also have a list of properties, like a specific product.\n\nThis is why you see a lot of people (sometimes quite pedantically) saying \"\n_tensors are**not** n-dimensional arrays/matrices_\": they don't talk about\ndata structures, but about**objects with properties**.\n\nEven the same words have **different meanings**. For instance, in computer\nscience, a 2D tensor is a matrix (it's a tensor of rank 2). In linear algebra,\na tensor with 2 dimensions means it only stores two values. The rank also has\na completely different definition: it is the maximum number of its linearly\nindependent column (or row) vectors.\n\nIn computer science, we're only interested in a definition focused on the\n**data structure**. From this point of view, tensors truly are a\ngeneralization in _n_ dimensions of matrices.\n\nBut we're still missing an important nuance when talking about tensors\nspecifically in the context of deep learning...\n\n### \ud83e\udde0 Tensors in deep learning\n\n _Icons created by Freepik and smashingstocks \u2014Flaticon_\n\nSo why are they called \"tensors\" instead of \"multidimensional arrays\"? Ok, it\nis shorter, but is it all there is to it? Actually, people make an **implicit\nassumption** when they talk about tensors.\n\nPyTorch\u2019s official documentation gives us a practical answer:\n\n> _The biggest difference between a numpy array and a PyTorch Tensor is that a\n> PyTorch Tensor can run on either**CPU or GPU**._\n\nIn deep learning, we need performance to compute a lot of matrix\nmultiplications in a highly parallel way. These matrices (and n-dimensional\narrays in general) are generally stored and processed on GPUs to speed up\ntraining and inference times.\n\nThis is what was missing in our previous definition: tensors in deep learning\nare not just n-dimensional arrays, there's also the implicit assumption they\ncan be **run on a GPU**.\n\n### \u2694\ufe0f NumPy vs PyTorch\n\nLet's see the difference between NumPy arrays and PyTorch tensors.\n\nImage by author\n\nThese two objects are very similar: we can initialize a **1D array** and a\n**1D tensor** with nearly the same syntax. 
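As a minimal sketch (the exact notebook code may differ slightly):

```python
# Minimal sketch: a 1D NumPy array and a 1D PyTorch tensor, created with nearly
# identical syntax.
import numpy as np
import torch

array = np.array([1, 2, 3])
tensor = torch.tensor([1, 2, 3])

print(f'NumPy Array: {array}')
print(f'PyTorch Tensor: {tensor}')
```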
They also share a lot of methods\nand can be easily converted into one another.\n\nYou can find the code used in this article at this address.\n\n \n \n NumPy Array: [1 2 3]\n \n \n PyTorch Tensor: tensor([1, 2, 3])\n\nInitializing 2D arrays and 2D tensors is not more complicated.\n\n \n \n NumPy Array: [[1 2 3]\n [4 5 6]]\n \n \n PyTorch Tensor: tensor([[1, 2, 3],\n [4, 5, 6]])\n\nWe said that the only difference between tensors and arrays was the fact that\ntensors can be **run on GPUs**. So in the end, this distinction is based on\nperformance. But is this boost that important?\n\nLet's compare the performance between NumPy arrays and PyTorch tensors on\nmatrix multiplication. In the following example, we randomly initialize **4D\narrays/tensors and multiply them**.\n\n \n \n >>> 1.32 s\n \n \n >>> 25.2 ms\n\nAs we can see, PyTorch tensors completed outperformed NumPy arrays: they\ncompleted the multiplication **52 times faster**!\n\nWe could attribute this performance to different factors, such as:\n\n * NumPy arrays use a _float64_ format, whereas PyTorch tensors leverage the more efficient _float32_ format. However, even when NumPy arrays are converted to _float32_ , PyTorch tensors are still 40 times faster.\n\n * PyTorch tensors are stored on a GPU, unlike NumPy arrays. But if we repeat the same experiment on a CPU, PyTorch tensors still manage to be 2.8 times faster on average.\n\nEven when combining both factors, PyTorch tensors prove to be 1.4 times\nfaster, showing that NumPy arrays are truly less performant for matrix\nmultiplication.\n\nThis is the true power of tensors: they're **blazingly fast**! Performance\nmight vary depending on the dimensions, the implementation**,** and the\nhardware, but this speed is the reason why tensors (and not arrays) are so\ncommon in deep learning.\n\n### \ud83d\udcdd Conclusion\n\nIn this article, we wrote a definition of tensors based on:\n\n 1. Their use in **computer science**(data structure);\n\n 2. More specifically, in **deep learning** (they can run on GPUs).\n\nHere's how we can summarize it in one sentence:\n\n> _Tensors are**n-dimensional arrays** with the implicit assumption that they\n> can **run on a GPU.**_\n\nFinally, we saw the difference in performance between tensors and arrays,\nwhich motivates the need for tensors in deep learning.\n\nSo next time someone tries to explain to you that tensors are not exactly a\ngeneralization of matrices, you'll know that they're right in a particular\ndefinition of tensors, but not in the computer science/deep learning one.\n\nIf you're looking for more data science and machine learning content in\nn-dimensions, please **follow me on twitter@maximelabonne**. You can find the\ncode used in this article at this address. \ud83d\udce3\n\nShare this post\n\n#### What is a Tensor in Machine Learning?\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/what-is-a-tensor-in-deep-learning-6dedd95d6507" }, { "id": "eac6604b-9bfe-4039-99b1-6449c0a65dd2", "content": { "Title": "Efficiently iterating over rows in a Pandas DataFrame", "Subtitle": "Never use iterrows and itertuples again", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Efficiently iterating over rows in a Pandas DataFrame\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Efficiently iterating over rows in a Pandas DataFrame\n\n### Never use iterrows and itertuples again\n\nMaxime Labonne\n\nMar 21, 2022\n\nShare this post\n\n#### Efficiently iterating over rows in a Pandas DataFrame\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Never use iterrows and itertuples again\n\nImage by author, emojis by OpenMoji (CC BY-SA 4.0).\n\nWhen I started machine learning, I followed the guidelines and created my own\nfeatures by combining multiple columns in my dataset. It\u2019s all well and good,\nbut the way I did it was **horribly inefficient**. I had to wait several\nminutes to do the most basic operations.\n\nMy problem was simple: I didn\u2019t know the fastest way to iterate over rows in\nPandas.\n\nI often see people online using the same techniques I used to apply. It\u2019s not\nelegant but it\u2019s ok if you don\u2019t have much data. However, if you process\n**more than 10k rows** , it quickly becomes an obvious performance issue.\n\nIn this article, I\u2019m gonna give you the **best way to iterate over rows in a\nPandas DataFrame** , with no extra code required. It\u2019s not just about\nperformance: it\u2019s also about understanding what\u2019s going on under the hood to\nbecome a better data scientist.\n\nLet\u2019s import a dataset in Pandas. In this case, I chose the one I worked on\nwhen I started: it\u2019s time to fix my past mistakes! \ud83e\ude79\n\nYou can run the code with the following Google Colab notebook.\n\nThis dataset has 22k rows and 43 columns with a combination of categorical and\nnumerical values. Each row describes a connection between two computers.\n\nLet\u2019s say we want to create a new feature: the **total number of bytes** in\nthe connection. We just have to sum up two existing features: `src_bytes` and\n`dst_bytes`. Let's see different methods to calculate this new feature.\n\n### \u274c\u274c 1. Iterrows\n\nAccording to the official documentation, `iterrows()` iterates \"over the rows\nof a Pandas DataFrame as (index, Series) pairs\". It converts each row into a\nSeries object, which causes two problems:\n\n 1. It can **change the type** of your data (dtypes);\n\n 2. The conversion **greatly degrades performance**.\n\nFor these reasons, the ill-named `iterrows()` is the WORST possible method to\nactually iterate over rows.\n\n \n \n 10 loops, best of 5: 1.07 s per loop\n\nNow let\u2019s see slightly better techniques\u2026\n\n### \u274c 2. For loop with .loc or .iloc (3\u00d7 faster)\n\nThis is what I used to do when I started: a **basic for loop** to select rows\nby index (with `.loc` or `.iloc`).\n\nWhy is it bad? Because DataFrames are not designed for this purpose. 
As with\nthe previous method, rows are converted into Pandas Series objects, which\ndegrades performance.\n\nInterestingly enough,`.iloc` is faster than `.loc`. It makes sense since\nPython doesn't have to check user-defined labels and directly look at where\nthe row is stored in memory.\n\n \n \n 10 loops, best of 5: 600 ms per loop\n \n \n 10 loops, best of 5: 377 ms per loop\n\nEven this basic for loop with `.iloc` is **3 times** faster than the first\nmethod!\n\n### \u274c 3. Apply (4\u00d7 faster)\n\nThe `apply()` method is another popular choice to iterate over rows. It\ncreates code that is easy to understand but at a cost: performance is nearly\nas bad as the previous for loop.\n\nThis is why I would strongly advise you to **avoid this function** for this\nspecific purpose (it's fine for other applications).\n\nNote that I convert the DataFrame into a list using the `to_list()` method to\nobtain identical results.\n\n \n \n 10 loops, best of 5: 282 ms per loop\n\nThe `apply()` method is a for loop in disguise, which is why the performance\ndoesn't improve that much: it's only **4 times faster** than the first\ntechnique.\n\n### \u274c 4. Itertuples (10\u00d7 faster)\n\nIf you know about `iterrows()`, you probably know about `itertuples()`.\nAccording to the official documentation, it iterates \"over the rows of a\nDataFrame as namedtuples of the values\". In practice, it means that **rows are\nconverted into tuples** , which are **much lighter objects** than Pandas\nSeries.\n\nThis is why `itertuples()` is a better version of `iterrows()`. This time, we\nneed to access the values with an **attribute**(or an index). If you want to\naccess them with a **string**(e.g., if there\u2019s a space in the string), you can\nuse the `getattr()` function instead.\n\n \n \n 10 loops, best of 5: 99.3 ms per loop\n\nThis is starting to look better: it is now **10 times faster** than\n`iterrows()` .\n\n### \u274c 5. List comprehensions (200\u00d7 faster)\n\nList comprehensions are a fancy way to iterate over a list as a one-liner.\n\nFor instance, `[print(i) for i in range(10)]` prints numbers from 0 to 9\n**without any explicit for loop**. I say \"explicit\" because Python actually\nprocesses it as a for loop if we look at the bytecode.\n\nSo why is it faster? Quite simply because we don't call the `.append()` method\nin this version.\n\n \n \n 100 loops, best of 5: 5.54 ms per loop\n\nIndeed, this technique is **200 times faster** than the first one! But we can\nstill do better.\n\n### \u2705 6. Pandas vectorization (1500\u00d7 faster)\n\nUntil now, all the techniques used simply add up single values. Instead of\nadding single values, why not **group them into vectors** to sum them up? The\ndifference between adding two numbers or two vectors is not significant for a\nCPU, which should speed things up.\n\nOn top of that, Pandas can **process Series objects in parallel** , using\nevery CPU core available!\n\nThe syntax is also the simplest imaginable: this solution is extremely\nintuitive. Under the hood, Pandas takes care of vectorizing our data with an\noptimized C code using contiguous memory blocks.\n\n \n \n 1000 loops, best of 5: 734 \u00b5s per loop\n\nThis code is **1500 times faster** than `iterrows()` and it is even simpler to\nwrite.\n\n### \u2705\u2705 7. NumPy vectorization (1900\u00d7 faster)\n\nNumPy is designed to handle scientific computing. It has **less overhead**\nthan Pandas methods since rows and dataframes all become `np.array`. 
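To make the comparison concrete, here is a minimal sketch of the same feature computed with the slowest technique and the two vectorized ones. The toy DataFrame is illustrative; the article's dataset has 22k rows with the columns `src_bytes` and `dst_bytes`:

```python
# Minimal sketch: total_bytes = src_bytes + dst_bytes, three ways.
import pandas as pd

df = pd.DataFrame({'src_bytes': [100, 250, 40], 'dst_bytes': [10, 0, 300]})

# 1. iterrows: converts every row into a Series (slow).
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])
df['total_bytes'] = total

# 6. Pandas vectorization: operate on whole Series at once.
df['total_bytes'] = df['src_bytes'] + df['dst_bytes']

# 7. NumPy vectorization: same idea, with even less overhead.
df['total_bytes'] = df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()
```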
It relies\non the same optimizations as Pandas vectorization.\n\nThere are **two ways** of converting a Series into a `np.array`: using\n`.values` or `.to_numpy()`. The former has been deprecated for years, which is\nwhy we're gonna use `.to_numpy()` in this example.\n\n \n \n 1000 loops, best of 5: 575 \u00b5s per loop\n\nWe found our winner with a technique that is **1900 times faster** than our\nfirst competitor! Let\u2019s wrap things up.\n\n### \ud83c\udfc6 Conclusion\n\nThe number of rows in the dataset can greatly impact the performance of\ncertain techniques (image by author).\n\nDon\u2019t be like me: if you need to iterate over rows in a DataFrame,\n**vectorization** is the way to go! You can find the code to reproduce the\nexperiments at this address. Vectorization is not harder to read, it doesn\u2019t\ntake longer to write, and the performance gain is incredible.\n\nIt\u2019s not just about performance: understanding how each method works under the\nhood helped me to **write better code**. Performance gains are always based on\nthe same techniques: transforming data into vectors and matrices to take\nadvantage of parallel processing. Alas, this is often at the expense of\nreadability. But it doesn\u2019t have to be.\n\nIterating over rows is **just an example** but it shows that, sometimes, you\ncan have the cake and eat it. \ud83c\udf82\n\nIf you liked this article, **follow me on Twitter** **@maximelabonne **for\nmore tips about data science and machine learning!\n\nShare this post\n\n#### Efficiently iterating over rows in a Pandas DataFrame\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/efficiently-iterating-over-rows-in-a-pandas-dataframe-7dd5f9992c01" }, { "id": "59fc9ced-cf49-4c21-9875-7c6c99fb0c16", "content": { "Title": "Q-learning for beginners - Maxime Labonne", "Subtitle": "Train an AI to solve the Frozen Lake environment", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### Q-learning for beginners\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Q-learning for beginners\n\n### Train an AI to solve the Frozen Lake environment\n\nMaxime Labonne\n\nMar 07, 2022\n\nShare this post\n\n#### Q-learning for beginners\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Train an AI to solve the Frozen Lake environment\n\nImage by author\n\nThe goal of this article is to **teach an AI how to solve the \u2744\ufe0fFrozen Lake\nenvironment using reinforcement learning**. Instead of reading Wikipedia\narticles and explaining formulas, we\u2019re going to **start from scratch and try\nto recreate the \ud83e\udd16Q-learning** algorithm by ourselves. We\u2019ll not just\nunderstand **how it works** , but more importantly **why it works** : why was\nit designed that way? 
What are the hidden assumptions, the details that are\nnever explained in regular courses and tutorials?\n\nAt the end of this article, you\u2019ll **master the Q-learning algorithm** and be\nable to **apply it to other environments and real-world problems**. It\u2019s a\ncool mini-project that gives a **better insight into how reinforcement\nlearning works** and **can hopefully inspire ideas for original and creative\napplications**.\n\nLet\u2019s start by installing the \u2744\ufe0f**Frozen Lake** environment and importing the\nnecessary libraries: `gym` for the game, `random` to generate random numbers,\nand `numpy` to do some math.\n\n### \u2744\ufe0f I. Frozen Lake\n\nNow, let\u2019s talk about the game we\u2019re going to be solving in this tutorial.\n\u2744\ufe0f**Frozen Lake** is a simple environment composed of tiles, where the AI has\nto **move from an initial tile** to a **goal**. Tiles can be a safe **frozen\nlake** \u2705, or a **hole** \u274c that gets you stuck forever. The AI, or agent, has 4\npossible actions: go \u25c0\ufe0f**LEFT** , \ud83d\udd3d**DOWN** , \u25b6\ufe0f**RIGHT** , or \ud83d\udd3c**UP**. The\nagent must learn to avoid holes in order to **reach the goal** in a **minimal\nnumber of actions**. By default, the environment is **always in the same\nconfiguration**. In the environment\u2019s code, **each tile is represented by a\nletter** as follows:\n\n \n \n S F F F (S: starting point, safe)\n F H F H (F: frozen surface, safe)\n F F F H (H: hole, stuck forever)\n H F F G (G: goal, safe)\n\nImage by author\n\nWe can try to manually solve the example above to understand the game. Let\u2019s\nsee if the following sequence of actions is a correct solution: **RIGHT** \u2192\n**RIGHT** \u2192 **RIGHT** \u2192 **DOWN** \u2192 **DOWN** \u2192 **DOWN**. Our agent starts on\ntile **S** , so we move right on a frozen surface \u2705, then again \u2705, then once\nmore \u2705, then we go down and find a hole \u274c.\n\nActually, it\u2019s really easy to find several correct solutions: **RIGHT** \u2192\n**RIGHT** \u2192 **DOWN** \u2192 **DOWN** \u2192 **DOWN** \u2192 **RIGHT** is an obvious one. But\nwe could make a sequence of actions that loops around a hole 10 times before\nreaching the goal. This sequence is valid, but it doesn\u2019t meet our final\nrequirement: **the agent needs to meet the goal in a minimum number of\nactions**. In this example, the minimum number of actions to complete the game\nis **6**. We need to remember this fact to check if our agent really masters\n\u2744\ufe0f**Frozen Lake** or not.\n\nImage by author\n\nLet\u2019s initialize the environment thanks to the `gym` library. There are two\nversions of the game: one with **slippery ice** , where selected actions have\na **random chance of being disregarded by the agent** ; and a **non-slippery\none** , where **actions cannot be ignored**. We'll use the **non-slippery**\none to begin with because it's easier to understand.\n\n \n \n \ud83d\udfe5FFF\n FHFH\n FFFH\n HFFG\n\nWe can see that the game that was created has **the exact same configuration\nas in our example** : it is the same puzzle. The position of our agent is\nindicated by a **red rectangle**. Solving this puzzle can be done with a\nsimple script and if\u2026else conditions, which would actually be **useful to\ncompare our AI to a simpler approach**. However, we want to try a more\nexciting solution: **reinforcement learning**.\n\n### \ud83c\udfc1 II. 
Q-table\n\nIn \u2744\ufe0f**Frozen Lake** , there are 16 tiles, which means our agent can be found\nin 16 different positions, called **states**. For each state, there are 4\npossible actions: go \u25c0\ufe0f**LEFT** , \ud83d\udd3d**DOWN** , \u25b6\ufe0f**RIGHT** , and \ud83d\udd3c**UP**.\nLearning how to play Frozen Lake is like **learning which action you should\nchoose in every state**. To know which action is the best in a given state, we\nwould like to assign a **quality value** to our actions. We have 16 states and\n4 actions, so want to calculate 16 x 4 = 64 values.\n\nA nice way of representing it is using a table, known as a Q-table, where\n**rows list every state s** and **columns list every action a**. In this\nQ-table, each cell contains a value Q(s, a), which is the **value (quality) of\nthe action a in the state s** (1 if it\u2019s the best action possible, 0 if it\u2019s\nreally bad). When our agent is in a particular state s, it **just has to check\nthis table to see which action has the highest value**. Taking the action with\nthe highest value makes sense but **we\u2019ll see later that we can design\nsomething even better** \u2026\n\n _Example of Q-table, where each cell contains the value_ Q(a, s)_of the\naction_ a _(column) in a given state_ s _(row)_\n\nLet\u2019s create our Q-table and fill it with zeros since **we still have no idea\nof the value of each action in each state**.\n\n \n \n Q-table =\n [[0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]]\n\nGreat! We have our Q-table with **16 rows** (our 16 states) and **4 columns**\n(our 4 actions) as expected. Let\u2019s try to see what we can do next: every value\nis set to zero, so we have no information at all. Let\u2019s say that the agent\ntakes a **random action** : \u25c0\ufe0f**LEFT** , \ud83d\udd3d**DOWN** , \u25b6\ufe0f**RIGHT** , or \ud83d\udd3c**UP**.\n\nWe can use the `random` library with the `choice` method to randomly choose an\naction.\n\n \n \n 'LEFT'\n\nWait, actually the agent is currently on the initial state **S** , which means\nonly two actions are possible: \u25b6\ufe0f**RIGHT** and \ud83d\udd3d**DOWN**. The agent can also\ntake the actions \ud83d\udd3c**UP** and \u25c0\ufe0f**LEFT** , but it won't move: its state doesn't\nchange. Therefore, we **do not put any constraint on what actions are\npossible** : the agent will **naturally understand that some of them don't do\nanything**.\n\nWe can keep using `random.choice()`, but the `gym` library **already\nimplements a method to randomly choose an action**. It might save us some\nhassle later, so let's try it.\n\n \n \n 0\n\nOops... this time it's a **number**. We could read `gym`'s documentation but\nit is quite scarce unfortunately. No worries though, **we can check the source\ncode on GitHub** to understand **what these numbers mean**. It's actually\nsuper straightforward:\n\n \n \n \u25c0\ufe0f LEFT = 0\n \ud83d\udd3d DOWN = 1\n \u25b6\ufe0f RIGHT = 2\n \ud83d\udd3c UP = 3\n\nImage by author\n\nOkay, now that **we understand how`gym` connects numbers to directions**,\nlet's try to use it to **move our agent to the right** \u25b6\ufe0f. This time, it can\nbe performed using the `step(action)` method. 
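Since the notebook cells aren't included in this post, here is a minimal setup sketch. It assumes the classic `gym` API from around 2022, where `reset()` returns the state and `step()` returns a 4-tuple; with `gymnasium` or `gym >= 0.26` the signatures differ slightly, and the environment id `FrozenLake-v1` is also version-dependent.

```python
import random

import gym
import numpy as np

# Non-slippery Frozen Lake, as in this section.
env = gym.make("FrozenLake-v1", is_slippery=False)
env.reset()
env.render()

# Q-table: 16 states x 4 actions, filled with zeros.
qtable = np.zeros((env.observation_space.n, env.action_space.n))

# Picking a random action by name, as in the first attempt above...
print(random.choice(["LEFT", "DOWN", "RIGHT", "UP"]))

# ...or letting gym sample one for us (returns 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP).
action = env.action_space.sample()

# Move RIGHT by passing 2 to step(): the classic API returns (state, reward, done, info).
new_state, reward, done, info = env.step(2)
env.render()
```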
We can try to **directly provide\nit the number 2** , corresponding to the direction we chose (right), and check\nif the agent moved.\n\n \n \n (Right)\n S\ud83d\udfe5FF\n FHFH\n FFFH\n HFFG\n\n**Huzzah**! The red square moved from the initial state **S** to the right:\n**our prediction was correct**. And that's all we need to know in order to\ninteract with the environment:\n\n 1. How to **randomly choose an action** using `action_space.sample()`;\n\n 2. How to **implement this action and move our agent in the desired direction** with `step(action)`.\n\nTo be completely exhaustive, we can add:\n\n 1. How to **display the current map to see what we\u2019re doing** with `render()`;\n\n 2. How to **restart the game** when the agent falls into a hole or reaches the goal **G** with `reset()`.\n\nNow that we understand how to interact with our `gym` environment, let's go\nback to our algorithm. In reinforcement learning, **agents are rewarded by the\nenvironment when they accomplish a predefined goal**. In \u2744\ufe0f**Frozen Lake** ,\nthe agent is only rewarded when it reaches the state **G** (see the source\ncode). We cannot control this reward, it is set in the environment: **it's 1\nwhen the agent reaches G, and 0 otherwise**.\n\nLet\u2019s print it every time we implement an action. The reward is given by the\nmethod `step(action)`.\n\n \n \n (Left)\n \ud83d\udfe5FFF\n FHFH\n FFFH\n HFFG\n Reward = 0.0\n\nThe reward is indeed 0\u2026 \ud83d\ude31 wow, I guess we\u2019re in a pickle, because **only one\nstate can give us a positive reward** in the entire game. How are we supposed\nto **take the right directions at the very beginning when the only validation\nwe have is at the very end?** If we ever want to see a reward of 1, we\u2019d need\nto be lucky enough to **find the correct sequence of actions by chance**.\nUnfortunately, that\u2019s exactly how it works\u2026 **the Q-table will remain filled\nwith zeros until the agent randomly reaches the goal G**.\n\nThe problem would be much simpler if we could have intermediate, smaller\nrewards to guide our path towards the goal **G**. Alas, this is actually one\nof the **main issues of reinforcement learning** : this phenomenon, called\n**sparse rewards** , makes agents very difficult to train on problems **where\nthe only reward is at the end of a long sequence of actions**. Different\ntechniques were proposed to mitigate this issue, but we\u2019ll talk about it\nanother time.\n\n### \ud83e\udd16 III. Q-learning\n\nLet\u2019s go back to our problem. Okay, we need to be lucky enough to find the\ngoal **G** by accident. But once it\u2019s done, how to backpropagate the\ninformation to the initial state? The \ud83e\udd16**Q-learning algorithm offers a clever\nsolution** to this issue. We need to update the value of our state-action\npairs (each cell in the Q-table) considering 1/ the **reward** for reaching\nthe next state, and 2/ the **highest possible value in the next state**.\n\nImage by author\n\nWe know we get a reward of 1 when we move to **G**. As we just said, the value\nof **the state next to G** (let\u2019s call it **G-1**) with **the relevant action\nto reach G** is increased thanks to the reward. Okay good, end of the episode:\nthe agent won and we restart the game. Now, the next time the agent is in **a\nstate next to G-1** , it will increase the value of this state (let\u2019s call it\n**G-2**) with **the relevant action to reach G-1**. The next time the agent is\nin a state next to **G-2** , it will do the same. 
Rinse and repeat, until the\nupdate reaches the initial state **S**.\n\nLet\u2019s try to find the **update formula** to backpropagate the values from\n**G** to **S**. Remember: values denote the **quality** of **an action in a\nspecific state** (0 if it\u2019s terrible, 1 if it\u2019s the best action possible in\nthis state). We try to **update the value** of the action a\u209c (for example, a\u209c=\n0 if the action is left) in the state s\u209c (for example, s\u209c = 0 when the agent\nis in the initial state **S**). This **value is just a cell in our Q-table** ,\ncorresponding to the **row number s** \u209c**and the column number a** \u209c: this\nvalue is formally called Q(s\u209c, a\u209c).\n\nAs we said previously, we need to update it using 1/ **the reward for the next\nstate** (formally noted r\u209c), and 2/ **the maximum possible value in the next\nstate** (max\u2090 _Q(s_ \u209c\u208a\u2081, a)). Therefore, the update formula must look like:\n\nThe new value is the current one + the reward + the highest value in the next\nstate. We can manually try our formula to check if it looks correct: let\u2019s\npretend our agent is **in the state G-1 next to the goal G for the first\ntime**. We can update the value corresponding to the winning action in this\nstate **G-1** with:\n\nwhere Q(G-1, a\u209c) = 0 and max\u2090 _Q(G_ , a) = 0 because the Q-table is empty, and\nr\u209c _= 1_ because we get the only reward in this environment. We obtain\nQ{new}(G-1, a\u209c) = 1. The next time the agent is in a state next to this one\n(**G-2**), we update it too using the formula and get the same result:\n_Q_{new}(G-2, a\u209c) = 1. In the end, **we backpropagate ones in the Q-table**\nfrom **G** to **S**. Okay it works, but the result is **binary** : either it\u2019s\nthe **wrong state-action pair or the best one**. We would like more nuance\u2026\n\nActually, we almost **found the true Q-learning update formula** with common\nsense. The nuance we\u2019re looking for adds two parameters:\n\n * **\u03b1** is the \ud83d\udca1**learning rate** (between 0 and 1), which is how much we should change the original Q(s\u209c, a\u209c) value. If \u03b1 = 0, the value **never changes** , but if \u03b1 = 1, the value **changes extremely fast**. In our attempt, we didn\u2019t limit the learning rate so \u03b1 = 1. But this is too fast in reality: the reward and the maximum value in the next state quickly **overpower the current value**. We need to find a **balance between the importance of past and new knowledge**.\n\n * **\u03b3** is the \ud83d\udcc9**discount factor** (between 0 and 1), which determines how much the agent cares about future rewards compared to immediate ones (as the saying goes, \u201ca bird in the hand is worth two in the bush\u201d). If \u03b3 = 0, the agent only focuses on **immediate rewards** , but if \u03b3 = 1, any **potential future reward has the same value than current ones**. In \u2744\ufe0f**Frozen Lake** , we want a high discount factor since there\u2019s only one possible reward at the very end of the game.\n\nWith the real Q-learning algorithm, the new value is calculated as follows:\n\nOkay, let\u2019s try this new formula before implementing it. Once again, we can\npretend that our agent is **next to the goal G for the first time**. We can\nupdate the state-action pair to win the game using our formula: Q{new}(G-1,\na\u209c) = 0 + \u03b1 \u00b7 (1 + \u03b3 \u00b7 0 \u2212 0)_._ We can assign arbitrary values to \u03b1 and \u03b3 to\ncalculate the result. 
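For reference, here are the two update rules discussed above written out explicitly; this is a reconstruction from the surrounding text, since the original formula figures are not shown here.

```latex
% "Common-sense" update: current value + reward + highest value in the next state
Q_{\text{new}}(s_t, a_t) = Q(s_t, a_t) + r_t + \max_{a} Q(s_{t+1}, a)

% Full Q-learning update with learning rate \alpha and discount factor \gamma
Q_{\text{new}}(s_t, a_t) = Q(s_t, a_t)
    + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)
```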
With \u03b1 = 0.5 and \u03b3 = 0.9, we get Q{new}(G-1, a\u209c) = 0 +\n0.5 \u00b7 (1 + 0.9 \u00b7 0 \u2212 0) = 0.5. The second time the agent is in this state, we\nwould get: Q{new}(G-1, a\u209c) = 0.5 + 0.5 \u00b7 (1 + 0.9 \u00b7 0 \u2212 0.5) = 0.75, then\n0.875, 0.9375, 0.96875, etc.\n\nImage by author\n\nSo training our agent in code means:\n\n 1. **Choosing a random action** (using `action_space.sample()`) if the values in the current state are just zeros. Otherwise, we take the **action with the highest value** in the current state with the function `np.argmax()`;\n\n 2. **Implementing this action** by moving in the desired direction with `step(action)`;\n\n 3. **Updating the value** of the original state with the action we took, using information about the new state and the reward given by `step(action)`;\n\nWe keep repeating these 3 steps until the agent **gets stuck in a hole** or\n**reaches the goal G**. When it happens, we just **restart the environment**\nwith `reset()` and start a new episode until we hit 1,000 episodes.\nAdditionally, we can plot the **outcome of each run** (failure if it didn't\nreach the goal, success otherwise) to **observe the progress** of our agent.\n\n \n \n Q-table before training:\n [[0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]]\n \n ===========================================\n Q-table after training:\n [[0. 0. 0.59049 0. ]\n [0. 0. 0.6561 0. ]\n [0. 0.729 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0.02050313 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0.81 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0. 0.17085938 0. ]\n [0. 0. 0.49359375 0. ]\n [0. 0.9 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0. 1. 0. ]\n [0. 0. 0. 0. ]]\n\nImage by author\n\nThe agent is trained! Each blue bar on the figure corresponds to a win, so we\ncan see that the agent had a **hard time finding the goal at the beginning**\nof the training. But once it found it several times in a row, it began to\n**consistently win**. \ud83e\udd73 The trained Q-table is also very interesting: these\nvalues indicate the **unique sequence of actions the agent learned to reach\nthe goal**.\n\nNow let\u2019s see how it performs by evaluating it on 100 episodes. We consider\nthat the training is over, so **we don\u2019t need to update the Q-table anymore**.\nTo see how the agent performs, we can **calculate the percentage of times the\nit managed to reach the goal** (success rate).\n\n \n \n Success rate = 100.0%\n\nNot only our agent has been trained, but it manages to hit a **100% success\nrate**. Great job everyone, the non-slippery \u2744\ufe0f**Frozen Lake** is solved!\n\nWe can even **visualize the agent moving on the map** by executing the code\nbelow and print the **sequence of actions it took** to check if it\u2019s the best\none.\n\n \n \n (Right)\n SFFF\n FHFH\n FFFH\n HFF\ud83d\udfe5\n Sequence = [2, 2, 1, 1, 1, 2]\n\nThe agent can learn several correct sequence of actions: [2, 2, 1, 1, 1, 2],\n[1, 1, 2, 2, 1, 2], etc. The good thing is there\u2019s **only 6 actions in our\nsequence** , which was the **minimum possible number of actions we counted** :\nit means that our agent learned to solve the game in an optimal way. 
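For readers who want to reproduce this, here is a condensed, self-contained sketch of the training loop described earlier in this section. The α = 0.5 and γ = 0.9 values reuse the worked example and the 1,000 episodes match the text; the classic `gym` API caveats from the setup snippet still apply, and the plotting and evaluation code is omitted.

```python
import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)   # classic gym API assumed
qtable = np.zeros((env.observation_space.n, env.action_space.n))

episodes = 1000   # number of training episodes, as in the text
alpha = 0.5       # learning rate (value from the worked example)
gamma = 0.9       # discount factor (value from the worked example)

outcomes = []     # "Success" or "Failure" per episode, used for the progress plot

for _ in range(episodes):
    state = env.reset()
    done = False
    outcomes.append("Failure")

    while not done:
        # 1. Exploit the Q-table if it already holds information, else act randomly.
        if np.max(qtable[state]) > 0:
            action = int(np.argmax(qtable[state]))
        else:
            action = env.action_space.sample()

        # 2. Take the action and observe the new state and reward.
        new_state, reward, done, info = env.step(action)

        # 3. Update Q(state, action) with the Q-learning formula.
        qtable[state, action] += alpha * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )

        state = new_state
        if reward:
            outcomes[-1] = "Success"
```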
In the\ncase of [2, 2, 1, 1, 1, 2], which corresponds to RIGHT \u2192 RIGHT \u2192 DOWN \u2192 DOWN \u2192\nDOWN \u2192 RIGHT, it\u2019s exactly the sequence we predicted at the very beginning of\nthe article. \ud83d\udce3\n\n### \ud83d\udcd0 IV. Epsilon-Greedy algorithm\n\nDespite this success, there\u2019s something that bothers me with our previous\napproach: the agent always chooses the action with the **highest** value. So\nwhenever a state-action pair **starts having a non-zero value, the agent will\nalways choose it**. The other actions will never be taken, which means we\u2019ll\nnever update their value\u2026 But what if one of these actions was **better than\nthe one the agent always takes**? Shouldn\u2019t we encourage the agent to try news\nthings from time to time and see if it can improve?\n\nIn other words, we want to allow our agent to either:\n\n * **Take the action with the highest value** (exploitation);\n\n * **Choose a random action to try to find even better ones** (exploration).\n\nA tradeoff between these two behaviors is important: if the agent only focuses\non **exploitation** , it cannot try new solutions and thus **doesn\u2019t learn\nanymore**. On the other hand, if the agent only takes **random actions** , the\n**training is pointless** since it doesn\u2019t use the Q-table. So we want to\n**change this parameter over time** : at the beginning of the training, we\nwant to **explore the environment as much as possible**. But exploration\nbecomes less and less interesting, as **the agent already knows every possible\nstate-action pairs**. This parameter represents the **amount of randomness in\nthe action selection**.\n\nThis technique is commonly called the **epsilon-greedy algorithm** , where\nepsilon is our parameter. It is a **simple but extremely efficient** method to\nfind a good tradeoff. Every time the agent has to take an action, it has a\n**probability \u03b5 of choosing a random one** , and a **probability 1-\u03b5 of\nchoosing the one with the highest value**. We can decrease the value of\nepsilon **at the end of each episode** by a fixed amount (**linear decay**),\nor based on the current value of epsilon (**exponential decay**).\n\nImage by author\n\nLet\u2019s implement a **linear decay**. Beforehand, I\u2019d like to see how the curve\nlooks like with arbitrary parameters. We\u2019ll start with \u03b5 = 1 to be in full\nexploration mode, and decrease this value by 0.001 after each episode.\n\nImage by author\n\nOkay now that we have a sound understanding of it, we can implement it for\nreal and see **how it changes the agent\u2019s behavior**.\n\n \n \n Q-table before training:\n [[0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]]\n \n ===========================================\n Q-table after training:\n [[0.531441 0.59049 0.59049 0.531441 ]\n [0.531441 0. 0.6561 0.56396466]\n [0.58333574 0.729 0.56935151 0.65055117]\n [0.65308668 0. 0.33420534 0.25491326]\n [0.59049 0.6561 0. 0.531441 ]\n [0. 0. 0. 0. ]\n [0. 0.81 0. 0.65519631]\n [0. 0. 0. 0. ]\n [0.6561 0. 0.729 0.59049 ]\n [0.6561 0.81 0.81 0. ]\n [0.72899868 0.9 0. 0.72711067]\n [0. 0. 0. 0. ]\n [0. 0. 0. 0. ]\n [0. 0.81 0.9 0.729 ]\n [0.81 0.9 1. 0.81 ]\n [0. 0. 0. 0. ]]\n\nImage by author\n\nHey, **the agent takes more time to consistently win the game** now! 
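As a sketch of the action-selection change, here is a hypothetical `epsilon_greedy` helper; the function name and signature are mine, and only the ε = 1 starting value and the 0.001 linear decay per episode come from the text.

```python
import numpy as np

def epsilon_greedy(qtable, state, epsilon, action_space):
    """With probability epsilon take a random action (explore), else the best known one (exploit)."""
    if np.random.random() < epsilon:
        return action_space.sample()
    return int(np.argmax(qtable[state]))

# In the training loop above, step 1 becomes:
#     action = epsilon_greedy(qtable, state, epsilon, env.action_space)
# and epsilon is decayed linearly at the end of every episode:
#     epsilon = max(epsilon - 0.001, 0)   # starting from epsilon = 1.0
```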
And the\nQ-table has **a lot more non-zero values** than the previous one, which means\nthe agent has learned **several sequences of actions** to reach the goal. It\nis understandable, since this new agent is **forced to explore state-action\npairs instead of always exploiting ones with non-zero values**.\n\nLet\u2019s see if it\u2019s **as successful as the previous one** to win the game. In\nevaluation mode, we **don\u2019t want exploration anymore** because the agent is\ntrained now.\n\n \n \n Success rate = 100.0%\n\nPhew, it\u2019s another **100% success rate**! We didn\u2019t degrade the model. \ud83d\ude0c The\nbenefits of this approach might not be obvious in this example, but our model\nbecame **less static** and **more flexible**. It learned different paths\n(sequences of actions) from **S** to **G** instead of just one as in the\nprevious approach. More exploration **can degrade performance** but it\u2019s\nnecessary to train agents that can **adapt to new environments**.\n\n### \u2744\ufe0f IV. Challenge: slippery Frozen Lake\n\nWe didn\u2019t solve the **entire \u2744\ufe0fFrozen Lake environment** : we only trained an\nagent on the non-slippery version, using `is_slippery = False` during\ninitialization. In the slippery variant, the action the agent takes only has\n**33% chance of succeeding**. In case of failure, one of the three other\nactions is randomly taken instead. This feature adds a lot of randomness to\nthe training, which makes things more difficult for our agent. Let's see how\nwell our code is doing in this new environment...\n\n \n \n Q-table before training:\n [[0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]\n [0. 0. 0. 0.]]\n \n ===========================================\n Q-table after training:\n [[0.06208723 0.02559574 0.02022059 0.01985828]\n [0.01397208 0.01425862 0.01305446 0.03333396]\n [0.01318348 0.01294602 0.01356014 0.01461235]\n [0.01117016 0.00752795 0.00870601 0.01278227]\n [0.08696239 0.01894036 0.01542694 0.02307306]\n [0. 0. 0. 0. ]\n [0.09027682 0.00490451 0.00793372 0.00448314]\n [0. 0. 0. 0. ]\n [0.03488138 0.03987256 0.05172554 0.10780482]\n [0.12444437 0.12321815 0.06462294 0.07084008]\n [0.13216145 0.09460133 0.09949734 0.08022573]\n [0. 0. 0. 0. ]\n [0. 0. 0. 0. ]\n [0.1606242 0.18174032 0.16636549 0.11444442]\n [0.4216631 0.42345944 0.40825367 0.74082329]\n [0. 0. 0. 0. ]]\n\nImage by author\n\n \n \n Success rate = 17.0%\n\nOof it\u2019s not so good. But can you improve the performance by tweaking the\ndifferent parameters we talked about? I encourage you to take this **little\nchallenge** and do it on your own to **have fun with reinforcement learning**\nand check if you understood **everything we said about Q-learning**. And why\nnot implementing **exponential decay** for the epsilon-greedy algorithm too?\nDuring this quick exercise, you might realise that **slightly modifying the\nhyperparameters can completely destroy the results**. This is another quirk of\nreinforcement learning: hyperparameters are quite moody, and it is important\nto understand their meaning if you want to tweak them. It\u2019s always good to\ntest and try new combinations to **build your intuition and become more\nefficient**. Good luck and have fun!\n\n### \ud83d\udd1a V. 
Conclusion\n\nQ-learning is a **simple yet powerful algorithm** at the core of reinforcement\nlearning. In this article,\n\n * We learned to **interact with the`gym` environment** to choose actions and move our agent;\n\n * We introduced the idea of a **Q-table** , where **rows are states** , **columns are actions** , and **cells are the value** of an action in a given state;\n\n * We experimentally recreated the **Q-learning update formula** to tackle the **sparse reward problem** ;\n\n * We implemented an entire training and evaluation process, that solved the **\u2744\ufe0fFrozen Lake** environment with 100% success rate;\n\n * We implemented the famous **epsilon-greedy algorithm** in order to create a tradeoff between the **exploration of unknown state-action pairs** and the **exploitation of the most successful ones**.\n\nThe **\u2744\ufe0fFrozen Lake** is a very simple environment, but others can have **so\nmany states and actions that it becomes impossible to store the Q-table in\nmemory**. This is especially the case in environments where events are **not\ndiscrete, but continuous** (like Super Mario Bros. or Minecraft). When the\nproblem arises, a popular technique consists of training a **deep neural\nnetwork to approximate the Q-table**. This method adds several layers of\ncomplexity, since the neural networks are **not very stable**. But I will\ncover it in another tutorial with different techniques to stabilize them.\n\nUntil then, **share this article** if it helped you and **follow me on\nTwitter** and **Medium** for more **practical content** around machine\nlearning and deep learning. \ud83d\udce3\n\nShare this post\n\n#### Q-learning for beginners\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/q-learning-for-beginners-2837b777741" }, { "id": "8fbc7862-3fd6-4e44-a9c2-19bf6eb43ba4", "content": { "Title": "How to start Machine Learning for Developers in 2022", "Subtitle": "A list of curated resources to start your ML journey", "Content": "# Maxime Labonne\n\nSubscribeSign in\n\nShare this post\n\n#### How to start Machine Learning for Developers in 2022\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# How to start Machine Learning for Developers in 2022\n\n### A list of curated resources to start your ML journey\n\nMaxime Labonne\n\nJan 31, 2022\n\nShare this post\n\n#### How to start Machine Learning for Developers in 2022\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### A list of curated resources to start your ML journey\n\nAs a PhD student and a research scientist in machine learning, many people\nhave asked me the same question over the years: _\u201chow do I start machine\nlearning?\u201d_ My answers varied greatly, ranging from the most technical _\u201cstart\nlooking at notebooks on Kaggle?\u201d,_ to the more approachable _\u201cI think fast.ai\nhas a great course\u201d_ , or _\u201coh\u2026 do you know Coursera?\u201d_ So, it\u2019s finally time\nfor me to settle the matter once and for all, until next year.\n\nMachine learning is a constantly evolving field with an abundance of guides\nand tutorials. And that may just be the main problem: there are just **too\nmany options**. Even searching for \u201c _start machine learning_ \u201d on the\nInternet yields mixed results: alluring ads, outdated forum responses, and an\noverwhelming amount of e-learning courses.\n\nIn this post, I want to talk about my recommended methods for learning about\nthis ever-changing field and provide you with the **best resources for getting\nstarted with machine learning**. This guide is not just for coding, but also\nfor inspiration and motivation, depending on your learning style.\n\n### Top-down learning style\n\nImage by author.\n\nLearning is difficult; it takes time and motivation. To me, the most daunting\npart of learning something new is the fact that I do not know yet how much\nwork it entails. So I find that the best first step in my learning journey is\nto try and map the field that I am entering. When it\u2019s a niche topic, I can\nlook at academic surveys. But for something as big as machine learning, I\nconsume **high-level resources** like videos and podcasts to stay up-to-date.\nThese high-level resources are a great way to understand the breadth and depth\nof this field, which keeps growing on a daily basis with new methods,\napplications, and challenges.\n\nUnfortunately, these resources are usually not technical enough to truly teach\nmachine learning. To truly delve deeper into ML, start implementing\nalgorithms, and understand more of the field, some kind of course is needed.\nThe choice of language and libraries is not very relevant at this point, so\nit\u2019s better to follow the standards found in most guides: Python, scikit-\nlearn, Pandas\u2026 It is much more important to understand the concepts than to\nlearn the syntax of each and every framework. 
Courses can be complemented by\nmore specific **technical articles** , often in the form of blog posts. These\nare an essential link between the theoretical knowledge from courses and the\nactual implementation to solve real problems.\n\nFinally, whether it\u2019s because you encounter fundamental problems that you\ndon\u2019t know how to solve or because you seek a complete understanding of the\nfield, **low-level resources** become necessary at some point. They can be\nbooks, academic courses, scientific papers, etc. The goal here is not to learn\nmath from scratch, but to take a bottom-up approach to identify what was\nmissing in our understanding of the problem. In the case of machine learning,\nsome grasp of statistics, probability, and linear algebra is a plus.\n\nYou may already be using this learning style instead of the opposite\n\u201cacademic\u201d approach, and you may be encountering hurdles in your learning\nprocess, or you have not used any of these methods before. In any case, this\narticle aims to provide you with the best educational resources for different\ntypes of media, divided per tier. And since individuals differ in the way they\nlearn, I encourage you to choose the materials that best suit you. The most\neffective way to make progress is to **combine different media at different\nlevels** to see the same concepts addressed in different ways. Whatever you\nchoose, these guides are great tools for starting or continuing to learn\nmachine learning. \ud83d\udc4d\n\n### Tier 1: educational entertainment\n\nVideos and podcasts are the easiest way to approach a new topic. They do not\nrequire extensive work or focus and can be consumed anywhere. While they by no\nmeans replace proper courses, they can be highly motivating and are effective\nin introducing a lot of applications and topics in a short amount of time.\n\n#### Two Minute Papers\n\n**Two Minute Papers** is a YouTube channel run by K\u00e1roly Zsolnai-Feh\u00e9, an ex-\nresearcher at TU Wien. He showcases and explains in simple terms research\nworks in several minutes. This channel focuses on topics related to physical\nsimulation and computer graphics. It\u2019s a great way to see a variety of\noriginal machine learning applications and find inspiration for your own\nprojects.\n\n#### Yannic Kilcher\n\n**Yannic Kilcher** is the host of _ML news_ , an upbeat summary of the latest\nnews in machine learning. And there is a lot of news: more and more companies,\ninstitutions, and universities communicate about new projects, products, and\nadvancements in this field. The last segment of ML news, called \u201cuseful\nthings\u201d, is entirely dedicated to the presentation of new and popular\nlibraries, frameworks, and applications.\n\nYannic Kilcher also (and maybe most importantly) makes videos of paper\nreviews, where he explains and annotates research papers in an easy-to-follow\nstep-by-step manner. Though this type of video content is more specific and\ndoes require a good understanding of the topic, it is an excellent solution if\nyou need to read a paper he already covered.\n\n#### AI Coffee Break with Letitia\n\n**AI Coffee Break with Letitia Parcalabescu** covers recent research articles\nand advancements in deep learning. Her videos can be quite technical and\nrequire some prior knowledge of the topic, but there are quite a few that are\nmore high-level and talk about broader topics in AI. 
They are a good way of\nunderstanding what\u2019s currently happening in research (sometimes in great\ndetail) and what we can expect next.\n\n#### Practical AI\n\n**The Practical AI Podcast** \n _In the second of the\"AI in Africa\" spotlight episodes, we welcome guests\nfrom Radiant Earth to talk about machine\u2026_changelog.com\n\n**Practical AI** is a podcast hosted by a data scientist at SIL International\nand a principal AI strategist at Lockheed Martin. As the name suggests, it has\na particular focus on making AI accessible to everyone with real-world\nimplementations. They talk about tools to automate and simplify ML tasks and\nhow to scale a product to serve millions of users. Their grounded approach\nmakes them accessible, even to beginners in this field.\n\n**The TWIML AI Podcast**\n\n**The TWIML AI Podcast (This Week in Machine Learning and AI Podcast)** \n_Keep up with the most interesting& important stories from the world of\nmachine learning, deep learning & artificial\u2026_twimlai.com\n\n**This Week in Machine Learning & Artificial Intelligence** is your typical\ninterview podcast with ML practitioners and enthusiasts. It has over 500\nepisodes and covers a broad spectrum of interviewees: engineers, leaders,\nresearchers, and business people. This means they tackle ML from different\npoints of view, giving unique perspectives to problems in the field and on ML\nas a subject, and allows a better understanding of the topic and its stakes.\n\n### Tier 2: courses and technical posts\n\nTaking courses still is a necessary step to learn the libraries and tools\nrelated to machine learning. The resources I list below focus primarily on the\nPython ecosystem since Python is the most used language in ML thanks to its\npowerful libraries (sklearn, Tensorflow, Pytorch\u2026) and its clean and easy\nsyntax. However, the knowledge from these courses is absolutely transferable\nto other languages and frameworks.\n\nDepending on the end application, technical posts are also a great source of\ninformation since they can point towards certain techniques and give you clear\nanswers to particular problems. Keep in mind though that posts and articles\ncan easily be outdated and so their results are not always easily\nreproducible.\n\n#### Kaggle\u2019s Intro to Machine Learning\n\n**Kaggle** has a great introductory course with a practical approach to the\nbasics of machine learning. It\u2019s a series of 7 quick tutorials with exercises,\nfor example on how to set up a classic pipeline with data exploration and how\nto get started with model training and model validation. It\u2019s the perfect\nfirst step to learn machine learning in under 3 hours, without any\ninstallation required. Another perk: Kaggle offers online notebooks, which\nmakes practicing the exercises very accessible.\n\n#### fast.ai\n\n**fast.ai** provides great online courses designed by a passionate and active\nteam. Their goal is to make AI accessible to everyone, regardless of your\nbackground, your preferred language, or your data and applications. Instead of\nbeing confronted with an overwhelming amount of theory at the start, they\nadvocate a very hands-on approach.\n\nTheir \u201cPractical Deep Learning for Coders\u201d course is a good example of this.\nFrom the first lesson, you are able to execute very recent models of deep\nneural networks and see their results. 
In the following lessons, they build on\nthese insights by giving you an explanation of their architectures, how they\ntruly work, and are able to output these results.\n\nWhile this particular course can be quite advanced, their other course\n\u201cIntroduction to Machine Learning\u201d covers regular ML starting with the basics:\ntabular datasets, random forests, and model validation. It has the same\npractical and comprehensive approach that is very effective in teaching you\nthe basics and complexities of ML and can be seen as an extended version\n(around 24 hours) of the Kaggle course.\n\n#### Machine Learning Mastery\n\n**Machine Learning Mastery - Machine Learning Mastery** \n _Making developers awesome at machine learning._machinelearningmastery.com\n\n**Machine Learning Mastery** is a popular blog among practitioners with a lot\nof practical applications of ML tasks and topics, like time series forecasting\nor imbalanced learning. Unsurprisingly, it is often one of the first results\nthat appear on Google when I look for an answer to specific ML problems. And\nthat\u2019s also probably the best way of using it: there are so many articles that\nit\u2019s simply impossible to read them all, but you should definitely check if\nthey have something about your problem of interest. Machine Learning Mastery\ncreates a valuable library of practical ML resources you can pick and choose.\n\n#### Towards Data Science\n\n**Towards Data Science** \n _Your home for data science. A Medium publication sharing concepts, ideas and\ncodes._towardsdatascience.com\n\n**Towards Data Science** is a Medium publication focused on data science,\nmachine learning, and deep learning. Articles are not necessarily of the\nhighest academic quality: you can find language-specific tips and other kinds\nof clickbait content. But it also tackles a wide range of topics, from cool\napplications, like geospatial wildfire risk prediction, to educational pieces,\nsuch as a specific new metric. \u201cTowards Data Science\u201d (and posts on Medium in\ngeneral) can be used as a place to find answers to specific problems, like\nMachine Learning Mastery, or these posts can simply act as inspiration from\ncreative and well-presented work.\n\n### Tier 3: academic sources\n\nAcademic sources have the benefit that they are backed, checked, and managed\nby known and trusted sources. On the other hand, they\u2019re also more difficult\nto read and can be quite time-consuming. The investment you make in reading\nthem does not bring the same level of reward as for online courses, because\nthe information is significantly less dense. Nonetheless, they are a necessary\nstep to reproduce models and architectures from research papers or to truly\nmaster the fundamentals of machine learning.\n\n#### Machine Learning (Stanford University)\n\n**Machine Learning** \n _4,627,641 already enrolled Machine learning is the science of getting\ncomputers to act without being explicitly\u2026_www.coursera.org\n\nAndrew Ng is the co-founder of Coursera and is especially known for his\n\u201c**Machine Learning** \u201d course. It is by far the most popular and influential\ncourse in ML. His teaching style is the opposite of fast.ai\u2019s: it\u2019s a bottom-\nup approach, with a lot of theory to understand before applying it to real\nproblems. Since it was released in 2011, the quality of the audio and video\nleaves something to be desired. 
However, the content is still relevant and can\nbe completed with a deep learning specialization.\n\n#### Neural Network and Deep Learning book\n\n**Neural networks and deep learning** \n _Neural Networks and Deep Learning is a free online book. The book will teach\nyou about: Neural networks, a beautiful\u2026_neuralnetworksanddeeplearning.com\n\n**Neural Network and Deep Learning** is a book focused on explaining the core\nconcepts of neural networks step by step, with clear code and explanations. It\ndoes not cover any other ML algorithm but is an excellent introduction to the\ntheory behind _deep_ and _shallow_ neural networks. The author does a great\njob of building the reader\u2019s intuition into key concepts to be able to make\ntheir own nets from scratch. The book also answers fundamental questions like\n\u201cwhy are deep neural networks difficult to train?\u201d that can be applied to a\nvariety of deep learning architectures.\n\n#### Scientific papers\n\n**arXiv.org** \n _arXiv is a free distribution service and an open-access archive for\n2,011,228 scholarly articles in the fields of\u2026_arxiv.org\n\n**Scientific papers** are published in journals or as proceedings at\nconferences and are most often protected behind a paywall. Fortunately, there\nis a culture in ML of publishing preprints (non-final versions of articles) on\narXiv in machine learning. This website is a popular open access archive of\nover 2 million articles in various scientific fields. If all else fails and\nyou can\u2019t find the article you\u2019re looking for on arXiv, you can always send a\npolite email to the first author to request it. We\u2019re generally happy to share\nour work with as many people as possible.\n\n### Conclusion\n\nThis article is far from being an exhaustive list of resources to learn ML,\nbut the content discussed above does provide a solid foundation and specific\nknowledge of ML. But practice makes perfect, and only practice can truly give\nyou the skills to translate the theoretical knowledge you learn into real-\nworld applications. Therefore, it is important to play with ML projects,\nwhether they are real problems you want to tackle or public projects on\nKaggle. And to be honest, they probably **won\u2019t** be solved with linear\nregression or k-means clustering. \u00af\\\\_(\u30c4)_/\u00af Learning the basics and\npracticing is nonetheless an important step to master if you want to build\nexpertise in more in-depth subfields, like natural language processing or\ngraph neural networks.\n\nI hope you can apply the same learning framework to every topic you encounter\nand become an expert in no time. AI is an exciting field, so don\u2019t forget to\nhave fun!\n\nFollow me on Twitter @maximelabonne and tell me what resources you use(d) in\nyour ML journey, I need inspiration for next year.\n\nShare this post\n\n#### How to start Machine Learning for Developers in 2022\n\nmaximelabonne.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Maxime Labonne\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/how-to-start-machine-learning-for-developers-in-2022-390af12b193f" }, { "id": "34978aea-e179-44b5-975c-7deb64456380", "content": { "Title": "An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin", "Subtitle": "From data gathering to productionizing LLMs using LLMOps good practices.", "Content": "End-to-End Framework for Production-Ready LLMs | Decoding MLOpen in appSign upSign inWriteSign upSign inTop highlightLLM Twin Course: Building Your Production-Ready AI ReplicaAn End-to-End Framework for Production-Ready LLM Systems by Building Your LLM TwinFrom data gathering to productionizing LLMs using LLMOps good practices.Paul Iusztin\u00b7FollowPublished inDecoding ML\u00b716 min read\u00b7Mar 16, 20242.1K13ListenShare\u2192 the 1st out of 12 lessons of the LLM Twin free courseWhat is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.Image by DALL-EWhy is this course different?By finishing the \u201cLLM Twin: Building Your Production-Ready AI Replica\u201d free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.Why should you care? \ud83e\udef5\u2192 No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.What will you learn to build by the end of this course?You will learn how to architect and build a real-world LLM system from start to finish \u2014 from data collection to deployment.You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.The end goal? Build and deploy your own LLM twin.The architecture of the LLM twin is split into 4 Python microservices:the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML\u2019s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet\u2019s model registry. (deployed on Qwak)the inference pipeline: load and quantize the fine-tuned LLM from Comet\u2019s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet\u2019s prompt monitoring dashboard. 
(deployed on Qwak)LLM twin system architecture [Image by the Author]Along the 4 microservices, you will learn to integrate 3 serverless tools:Comet ML as your ML Platform;Qdrant as your vector DB;Qwak as your ML infrastructure;Who is this for?Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.Level: intermediatePrerequisites: basic knowledge of Python, ML, and the cloudHow will you learn?The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.Also, it includes 2 bonus lessons on how to improve the RAG system.You can read everything at your own pace.\u2192 To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.Costs?The articles and code are completely free. They will always remain free.But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by:Paul Iusztin | Senior ML & MLOps EngineerAlex Vesa | Senior AI EngineerAlex Razvant | Senior ML & MLOps EngineerLessons\u2192 Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson:An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture: Enabling Event-Driven ArchitecturesSOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine-Tuning LLMsHow to fine-tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine-Tuned LLMsArchitect scalable and cost-effective LLM & RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code[Bonus] Build Multi-Index Advanced RAG Apps\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0fLet\u2019s start with Lesson 1 \u2193\u2193\u2193Lesson 1: End-to-end framework for production-ready LLM systemsIn the first lesson, we will present the project you will build during the course: your production-ready LLM Twin/AI replica.Afterward, we will explain what the 3-pipeline design is and how it is applied to a standard ML system.Ultimately, we will dig into the LLM project system design.We will present all our architectural decisions regarding the design of the data collection pipeline for social media data and how we applied the 3-pipeline architecture to our LLM microservices.In the following lessons, we will examine each component\u2019s code and learn how to implement and deploy it to AWS and Qwak.LLM twin system architecture [Image by the Author]Table of ContentsWhat are you going to build? The LLM twin conceptThe 3-pipeline architectureLLM twin system design\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0f1. What are you going to build? 
The LLM twin conceptThe outcome of this course is to learn to build your own AI replica. We will use an LLM to do that, hence the name of the course: LLM Twin: Building Your Production-Ready AI Replica.But what is an LLM twin?Shortly, your LLM twin will be an AI character who writes like you, using your writing style and personality.It will not be you. It will be your writing copycat.More concretely, you will build an AI replica that writes social media posts or technical articles (like this one) using your own voice.Why not directly use ChatGPT? You may ask\u2026When trying to generate an article or post using an LLM, the results tend to:be very generic and unarticulated,contain misinformation (due to hallucination),require tedious prompting to achieve the desired result.But here is what we are going to do to fix that \u2193\u2193\u2193First, we will fine-tune an LLM on your digital data gathered from LinkedIn, Medium, Substack and GitHub.By doing so, the LLM will align with your writing style and online personality. It will teach the LLM to talk like the online version of yourself.Have you seen the universe of AI characters Meta released in 2024 in the Messenger app? If not, you can learn more about it here [2].To some extent, that is what we are going to build.But in our use case, we will focus on an LLM twin who writes social media posts or articles that reflect and articulate your voice.For example, we can ask your LLM twin to write a LinkedIn post about LLMs. Instead of writing some generic and unarticulated post about LLMs (e.g., what ChatGPT will do), it will use your voice and style.Secondly, we will give the LLM access to a vector DB to access external information to avoid hallucinating. Thus, we will force the LLM to write only based on concrete data.Ultimately, in addition to accessing the vector DB for information, you can provide external links that will act as the building block of the generation process.For example, we can modify the example above to: \u201cWrite me a 1000-word LinkedIn post about LLMs based on the article from this link: [URL].\u201dExcited? Let\u2019s get started \ud83d\udd252. The 3-pipeline architectureWe all know how messy ML systems can get. That is where the 3-pipeline architecture kicks in.The 3-pipeline design brings structure and modularity to your ML system while improving your MLOps processes.ProblemDespite advances in MLOps tooling, transitioning from prototype to production remains challenging.In 2022, only 54% of the models get into production. Auch.So what happens?Maybe the first things that come to your mind are:the model is not mature enoughsecurity risks (e.g., data privacy)not enough dataTo some extent, these are true.But the reality is that in many scenarios\u2026\u2026the architecture of the ML system is built with research in mind, or the ML system becomes a massive monolith that is extremely hard to refactor from offline to online.So, good SWE processes and a well-defined architecture are as crucial as using suitable tools and models with high accuracy.Solution\u2192 The 3-pipeline architectureLet\u2019s understand what the 3-pipeline design is.It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components:1. the feature pipeline2. the training pipeline3. the inference pipeline\u2026also known as the Feature/Training/Inference (FTI) architecture.#1. The feature pipeline transforms your data into features & labels, which are stored and versioned in a feature store. 
The feature store will act as the central repository of your features. That means that features can be accessed and shared only through the feature store.#2. The training pipeline ingests a specific version of the features & labels from the feature store and outputs the trained model weights, which are stored and versioned inside a model registry. The models will be accessed and shared only through the model registry.#3. The inference pipeline uses a given version of the features from the feature store and downloads a specific version of the model from the model registry. Its final goal is to output the predictions to a client.The 3-pipeline architecture [Image by the Author].This is why the 3-pipeline design is so beautiful:- it is intuitive- it brings structure, as on a higher level, all ML systems can be reduced to these 3 components- it defines a transparent interface between the 3 components, making it easier for multiple teams to collaborate- the ML system has been built with modularity in mind since the beginning- the 3 components can easily be divided between multiple teams (if necessary)- every component can use the best stack of technologies available for the job- every component can be deployed, scaled, and monitored independently- the feature pipeline can easily be either batch, streaming or bothBut the most important benefit is that\u2026\u2026by following this pattern, you know 100% that your ML model will move out of your Notebooks into production.\u21b3 If you want to learn more about the 3-pipeline design, I recommend this excellent article [3] written by Jim Dowling, one of the creators of the FTI architecture.3. LLM Twin System designLet\u2019s understand how to apply the 3-pipeline architecture to our LLM system.The architecture of the LLM twin is split into 4 Python microservices:The data collection pipelineThe feature pipelineThe training pipelineThe inference pipelineLLM twin system architecture [Image by the Author]As you can see, the data collection pipeline doesn\u2019t follow the 3-pipeline design. Which is true.It represents the data pipeline that sits before the ML system.The data engineering team usually implements it, and its scope is to gather, clean, normalize and store the data required to build dashboards or ML models.But let\u2019s say you are part of a small team and have to build everything yourself, from data gathering to model deployment.Thus, we will show you how the data pipeline nicely fits and interacts with the FTI architecture.Now, let\u2019s zoom in on each component to understand how they work individually and interact with each other. \u2193\u2193\u21933.1. The data collection pipelineIts scope is to crawl data for a given user from:Medium (articles)Substack (articles)LinkedIn (posts)GitHub (code)As every platform is unique, we implemented a different Extract Transform Load (ETL) pipeline for each website.\ud83d\udd17 1-min read on ETL pipelines [4]However, the baseline steps are the same for each platform.Thus, for each ETL pipeline, we can abstract away the following baseline steps:log in using your credentialsuse selenium to crawl your profileuse BeatifulSoup to parse the HTMLclean & normalize the extracted HTMLsave the normalized (but still raw) data to Mongo DBImportant note: We are crawling only our data, as most platforms do not allow us to access other people\u2019s data due to privacy issues. 
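To make the baseline ETL steps above concrete, here is a minimal sketch of the parse-clean-store part of one such pipeline. The login and Selenium crawling steps are omitted, and the helper names, database, and collection layout are illustrative, not the course's actual implementation.

# Minimal sketch of one ETL step: parse an already-crawled profile page and
# store the raw documents in Mongo DB. Names are illustrative placeholders.
from bs4 import BeautifulSoup
from pymongo import MongoClient


def extract_articles(page_html: str) -> list[dict]:
    # Parse the crawled HTML and return normalized (but still raw) documents.
    soup = BeautifulSoup(page_html, "html.parser")
    documents = []
    for article in soup.select("article"):
        text = article.get_text(separator=" ", strip=True)
        documents.append({"platform": "medium", "content": normalize_text(text)})
    return documents


def normalize_text(text: str) -> str:
    # Keep the cleaning light at this stage: collapse extra whitespace only.
    return " ".join(text.split())


def load_to_mongo(documents: list[dict]) -> None:
    client = MongoClient("mongodb://localhost:27017")
    if documents:
        client["llm_twin"]["articles"].insert_many(documents)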
But this is perfect for us, as to build our LLM twin, we need only our own digital data.Why Mongo DB?We wanted a NoSQL database that quickly allows us to store unstructured data (aka text).How will the data pipeline communicate with the feature pipeline?We will use the Change Data Capture (CDC) pattern to inform the feature pipeline of any change on our Mongo DB.\ud83d\udd17 1-min read on the CDC pattern [5]To explain the CDC briefly, a watcher listens 24/7 for any CRUD operation that happens to the Mongo DB.The watcher will issue an event informing us what has been modified. We will add that event to a RabbitMQ queue.The feature pipeline will constantly listen to the queue, process the messages, and add them to the Qdrant vector DB.For example, when we write a new document to the Mongo DB, the watcher creates a new event. The event is added to the RabbitMQ queue; ultimately, the feature pipeline consumes and processes it.Doing this ensures that the Mongo DB and vector DB are constantly in sync.With the CDC technique, we transition from a batch ETL pipeline (our data pipeline) to a streaming pipeline (our feature pipeline).Using the CDC pattern, we avoid implementing a complex batch pipeline to compute the difference between the Mongo DB and vector DB. This approach can quickly get very slow when working with big data.Where will the data pipeline be deployed?The data collection pipeline and RabbitMQ service will be deployed to AWS. We will also use the freemium serverless version of Mongo DB.3.2. The feature pipelineThe feature pipeline is implemented using Bytewax (a Rust streaming engine with a Python interface). Thus, in our specific use case, we will also refer to it as a streaming ingestion pipeline.It is an entirely different service than the data collection pipeline.How does it communicate with the data pipeline?As explained above, the feature pipeline communicates with the data pipeline through a RabbitMQ queue.Currently, the streaming pipeline doesn\u2019t care how the data is generated or where it comes from.It knows it has to listen to a given queue, consume messages from there and process them.By doing so, we decouple the two components entirely. In the future, we can easily add messages from multiple sources to the queue, and the streaming pipeline will know how to process them. The only rule is that the messages in the queue should always respect the same structure/interface.What is the scope of the feature pipeline?It represents the ingestion component of the RAG system.It will take the raw data passed through the queue and:clean the data;chunk it;embed it using the embedding models from Superlinked;load it to the Qdrant vector DB.Every type of data (post, article, code) will be processed independently through its own set of classes.Even though all of them are text-based, we must clean, chunk and embed them using different strategies, as every type of data has its own particularities.What data will be stored?The training pipeline will have access only to the feature store, which, in our case, is represented by the Qdrant vector DB.Note that a vector DB can also be used as a NoSQL DB.With these 2 things in mind, we will store in Qdrant 2 snapshots of our data:1. The cleaned data (without using vectors as indexes \u2014 store them in a NoSQL fashion).2. 
The cleaned, chunked, and embedded data (leveraging the vector indexes of Qdrant)The training pipeline needs access to the data in both formats as we want to fine-tune the LLM on standard and augmented prompts.With the cleaned data, we will create the prompts and answers.With the chunked data, we will augment the prompts (aka RAG).Why implement a streaming pipeline instead of a batch pipeline?There are 2 main reasons.The first one is that, coupled with the CDC pattern, it is the most efficient way to sync two DBs between each other. Otherwise, you would have to implement batch polling or pushing techniques that aren\u2019t scalable when working with big data.Using CDC + a streaming pipeline, you process only the changes to the source DB without any overhead.The second reason is that by doing so, your source and vector DB will always be in sync. Thus, you will always have access to the latest data when doing RAG.Why Bytewax?Bytewax is a streaming engine built in Rust that exposes a Python interface. We use Bytewax because it combines Rust\u2019s impressive speed and reliability with the ease of use and ecosystem of Python. It is incredibly light, powerful, and easy for a Python developer.Where will the feature pipeline be deployed?The feature pipeline will be deployed to AWS. We will also use the freemium serverless version of Qdrant.3.3. The training pipelineHow do we have access to the training features?As highlighted in section 3.2, all the training data will be accessed from the feature store. In our case, the feature store is the Qdrant vector DB that contains:the cleaned digital data from which we will create prompts & answers;we will use the chunked & embedded data for RAG to augment the cleaned data.We will implement a different vector DB retrieval client for each of our main types of data (posts, articles, code).We must do this separation because we must preprocess each type differently before querying the vector DB, as each type has unique properties.Also, we will add custom behavior for each client based on what we want to query from the vector DB. But more on this in its dedicated lesson.What will the training pipeline do?The training pipeline contains a data-to-prompt layer that will preprocess the data retrieved from the vector DB into prompts.It will also contain an LLM fine-tuning module that inputs a HuggingFace dataset and uses QLoRA to fine-tune a given LLM (e.g., Mistral). By using HuggingFace, we can easily switch between different LLMs so we won\u2019t focus too much on any specific LLM.All the experiments will be logged into Comet ML\u2019s experiment tracker.We will use a bigger LLM (e.g., GPT4) to evaluate the results of our fine-tuned LLM. These results will be logged into Comet\u2019s experiment tracker.Where will the production candidate LLM be stored?We will compare multiple experiments, pick the best one, and issue an LLM production candidate for the model registry.After, we will inspect the LLM production candidate manually using Comet\u2019s prompt monitoring dashboard. If this final manual check passes, we will flag the LLM from the model registry as accepted.A CI/CD pipeline will trigger and deploy the new LLM version to the inference pipeline.Where will the training pipeline be deployed?The training pipeline will be deployed to Qwak.Qwak is a serverless solution for training and deploying ML models. 
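To ground the fine-tuning module described above, here is a rough sketch of how a QLoRA setup and Comet ML experiment tracking can be wired together with Hugging Face tooling. The base model, hyperparameters, and project name are placeholders rather than the course's actual configuration, and the training loop itself is left out.

# Hypothetical QLoRA + experiment-tracking sketch (assumes COMET_API_KEY is configured).
import comet_ml
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

experiment = comet_ml.Experiment(project_name="llm-twin-training")  # experiment tracker

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

experiment.log_parameters({"base_model": model_id, "lora_r": 16, "lora_alpha": 32})
# ... run your SFT training loop here and log its metrics to the experiment.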
Qwak makes scaling your operation easy while you focus on building.

Also, we will use the freemium version of Comet ML for the following:
- experiment tracker;
- model registry;
- prompt monitoring.

3.4. The inference pipeline

The inference pipeline is the final component of the LLM system. It is the one the clients will interact with. It will be wrapped under a REST API. The clients can call it through HTTP requests, similar to your experience with ChatGPT or similar tools.

How do we access the features? To access the feature store, we will use the same Qdrant vector DB retrieval clients as in the training pipeline. In this case, we need the feature store to access the chunked data to do RAG.

How do we access the fine-tuned LLM? The fine-tuned LLM will always be downloaded from the model registry based on its tag (e.g., accepted) and version (e.g., v1.0.2, latest, etc.).

How will the fine-tuned LLM be loaded? Here we are in the inference world. Thus, we want to optimize the LLM's speed and memory consumption as much as possible. That is why, after downloading the LLM from the model registry, we will quantize it.

What are the components of the inference pipeline? The first is the retrieval client used to access the vector DB to do RAG. This is the same module as the one used in the training pipeline. Next comes a query-to-prompt layer that maps the user query and the documents retrieved from Qdrant into a prompt. After the LLM generates its answer, we log it to Comet's prompt monitoring dashboard and return it to the clients.

For example, the client can request the inference pipeline to "Write a 1000-word LinkedIn post about LLMs," and the inference pipeline will go through all the steps above to return the generated post.

Where will the inference pipeline be deployed? The inference pipeline will be deployed to Qwak. By default, Qwak also offers autoscaling solutions and a nice dashboard to monitor all the production environment resources. As for the training pipeline, we will use the serverless freemium version of Comet for its prompt monitoring dashboard.

Conclusion

This is the 1st article of the LLM Twin: Building Your Production-Ready AI Replica free course. In this lesson, we presented what you will build during the course. Afterward, we briefly discussed how to design ML systems using the 3-pipeline design. Ultimately, we went through the system design of the course and presented the architecture of each microservice and how they interact with each other: the data collection pipeline, the feature pipeline, the training pipeline, and the inference pipeline.

In Lesson 2, we will dive deeper into the data collection pipeline, learn how to implement crawlers for various social media platforms, clean the gathered data, store it in a Mongo DB, and finally, show you how to deploy it to AWS.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Have you enjoyed this article? Then join 5k+ engineers in the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems, every week: decodingml.substack.com
References

[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] Introducing New AI Experiences From Meta (2023), Meta
[3] Jim Dowling, From MLOps to ML Systems with Feature/Training/Inference Pipelines (2023), Hopsworks
[4] Extract Transform Load (ETL), Databricks Glossary
[5] Daniel Svonava and Paolo Perrone, Understanding the Different Data Modality / Types (2023), Superlinked"
}, "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin-2cc6bb01141f" }, { "id": "d331f23e-88c6-4606-b397-52842c9a6295", "content": { "Title": "A Real-time Retrieval System for RAG on Social Media Data", "Subtitle": "Use a streaming engine to populate a vector DB in real-time. Improve RAG accuracy using rerank & UMAP.", "Content": "Real-time Retrieval for RAG on Social Media Data | Decoding MLOpen in appSign upSign inWriteSign upSign inA Real-time Retrieval System for RAG on Social Media DataUse a streaming engine to populate a vector DB in real-time. Improve RAG accuracy using rerank & UMAP.Paul Iusztin\u00b7FollowPublished inDecoding ML\u00b712 min read\u00b7Mar 30, 2024358ListenShareImage by DALL-EIn this article, you will learn how to build a real-time retrieval system for social media data. In our example, we will use only my LinkedIn posts, but our implementation can easily be extended to other platforms supporting written content, such as X, Instagram, or Medium.In this article, you will learn how to:build a streaming pipeline that ingests LinkedIn posts into a vector DB in real-timeclean, chunk, and embed LinkedIn postsbuild a retrieval client to query LinkedIn postsuse a rerank pattern to improve retrieval accuracyvisualize content retrieved for a given query in a 2D plot using UMAPOur implementation focuses on just the retrieval part of an RAG system. But you can quickly hook the retrieved LinkedIn posts to an LLM for post analysis or personalized content generation.Table of Contents:System DesignDataStreaming ingestion pipelineRetrieval clientConclusion1. System DesignThe retrieval system is based on 2 detached components:the streaming ingestion pipelinethe retrieval clientThe architecture of the retrieval system [Image by the Author \u2014 in collaboration with VectorHub].The streaming ingestion pipeline runs 24/7 to keep the vector DB synced up with current raw LinkedIn posts data source, while the retrieval client is used in RAG applications to query the vector DB. These 2 components communicate with each other only through the vector DB.1.1. The streaming ingestion pipelineThe streaming ingestion pipeline implements the Change Data Capture (CDC) pattern between a data source containing the raw LinkedIn posts and the vector DB used for retrieval.In a real-world scenario, the streaming pipeline listens to a queue populated by all the changes made to the source database. But because we are focusing primarily on the retrieval system, we simulate the data within the queue with a couple of JSON files.The streaming pipeline is built in Python using Bytewax, and cleans, chunks, and embeds the LinkedIn posts before loading them into a Qdrant vector DB.Why do we need a stream engine?Because LinkedIn posts (or any other social media data) evolve frequently, your vector DB can quickly get out of sync. To handle this, you can build a batch pipeline that runs every minute. But to really minimize data lag, to make sure your vector DB stays current with new social media posts, you need to use a streaming pipeline that immediately takes every new item the moment it\u2019s posted, preprocesses it, and loads it into the vector DB.Why Bytewax?Bytewax is a streaming engine built in Rust that exposes a Python interface. 
We use Bytewax because it combines the impressive speed and reliability of Rust with the ease of use and ecosystem of Python.1.2. The retrieval clientOur retrieval client is a standard Python module that preprocesses user queries and searches the vector DB for most similar results. Qdrant vector DB lets us decouple the retrieval client from the streaming ingestion pipeline.Using a semantic-based retrieval system lets us query our LinkedIn post collection very flexibly. For example, we can retrieve similar posts using a variety of query types \u2014 e.g., posts, questions, sentences.Also, to improve the retrieval system\u2019s accuracy, we use a rerank pattern.Lastly, to better understand and explain the retrieval process for particular queries, we visualize our results on a 2D plot using UMAP.2. DataWe will ingest 215 LinkedIn posts from my Linked profile \u2014 Paul Iusztin. Though we simulate the post ingestion step using JSON files, the posts themselves are authentic.Before diving into the code, let\u2019s take a look at an example LinkedIn post to familiarize ourselves with the challenges it will introduce \u2193[ { \"text\": \"\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 do you need to \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 an open-source \ud835\udddf\ud835\udddf\ud835\udde0 to create your own \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddf1\ud835\ude03\ud835\uddf6\ud835\ude00\ud835\uddfc\ud835\uddff?\\nThis is the \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf8\ud835\uddf6\ud835\ude01 you must know \u2193\\n\ud835\uddd7\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01\\nThe key component of any successful ML project is the data.\\nYou need a 100 - 1000 sample Q&A (questions & answers) dataset with financial scenarios.\\nThe best approach is to hire a bunch of experts to create it manually.\\nBut, for a PoC, that might get expensive & slow.\\nThe good news is that a method called \\\"\ud835\ude0d\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude25\ud835\ude2a\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude2d\ud835\ude2d\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\\\" exists.\\n ...Along with ease of deployment, you can easily add your training code to your CI/CD to add the final piece of the MLOps puzzle, called CT (continuous training).\\n\u21b3 Beam: \ud83d\udd17\\nhttps://lnkd.in/dedCaMDh\\n.\\n\u21b3 To see all these components in action, check out my FREE \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 & give it a \u2b50: \ud83d\udd17\\nhttps://lnkd.in/dZgqtf8f\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\", \"image\": \"https://media.licdn.com/dms/image/D4D10AQHWQzZcToQQ1Q/image-shrink_800/0/1698388219549?e=1705082400&v=beta&t=9mrDC_NooJgD7u7Qk0PmrTGGaZtuwDIFKh3bEqeBsm0\" }]The following features of the above post are not compatible with embedding models. 
We'll need to find some way of handling them in our preprocessing step:
- emojis
- bold, italic text
- other non-ASCII characters
- URLs
- content that exceeds the context window limit of the embedding model

Emojis and bolded and italic text are represented by Unicode characters that are not available in the vocabulary of the embedding model. Thus, these items cannot be tokenized and passed to the model; we have to remove them or normalize them to something that can be parsed by the tokenizer. The same holds true for all other non-ASCII characters.

URLs take up space in the context window without providing much semantic value. Still, knowing that there's a URL in the sentence may add context. For this reason, we replace all URLs with a [URL] token. This lets us ingest whatever value the URL's presence conveys without it taking up valuable space.

3. Streaming ingestion pipeline

Let's dive into the streaming pipeline, starting from the top and working our way to the bottom ↓

3.1. The Bytewax flow

The Bytewax flow transparently conveys all the steps of the streaming pipeline. The first step is ingesting every LinkedIn post from our JSON files. In the next steps, every map operation has a single responsibility:
- validate the ingested data using a RawPost pydantic model
- clean the posts
- chunk the posts; because chunking will output a list of ChunkedPost objects, we use a flat_map operation to flatten them out
- embed the posts
- load the posts to a Qdrant vector DB

def build_flow():
    embedding_model = EmbeddingModelSingleton()

    flow = Dataflow("flow")
    stream = op.input("input", flow, JSONSource(["data/paul.json"]))
    stream = op.map("raw_post", stream, RawPost.from_source)
    stream = op.map("cleaned_post", stream, CleanedPost.from_raw_post)
    stream = op.flat_map(
        "chunked_post",
        stream,
        lambda cleaned_post: ChunkedPost.from_cleaned_post(
            cleaned_post, embedding_model=embedding_model
        ),
    )
    stream = op.map(
        "embedded_chunked_post",
        stream,
        lambda chunked_post: EmbeddedChunkedPost.from_chunked_post(
            chunked_post, embedding_model=embedding_model
        ),
    )
    op.inspect("inspect", stream, print)
    op.output(
        "output", stream, QdrantVectorOutput(vector_size=embedding_model.embedding_size)
    )

    return flow

3.2. The processing steps

Every processing step is incorporated into a pydantic model. This way, we can easily validate the data at each step and reuse the code in the retrieval module.

We isolate every step of an ingestion pipeline into its own class:
- cleaning
- chunking
- embedding

Doing so, we follow the separation of concerns good SWE practice. Thus, every class has its own responsibility. Now the code is easy to read and understand. Also, it's future-proof, as it's extremely easy to change or extend any of the 3 steps: cleaning, chunking and embedding.

Here is the interface of the pydantic models:

class RawPost(BaseModel):
    post_id: str
    text: str
    image: Optional[str]

    @classmethod
    def from_source(cls, k_v: Tuple[str, dict]) -> "RawPost":
        ...  # Mapping a dictionary to a RawPost validated pydantic model.
        return cls(...)


class CleanedPost(BaseModel):
    post_id: str
    raw_text: str
    text: str
    image: Optional[str]

    @classmethod
    def from_raw_post(cls, raw_post: RawPost) -> "CleanedPost":
        ...  # Cleaning the raw post
        return cls(...)


class ChunkedPost(BaseModel):
    post_id: str
    chunk_id: str
    full_raw_text: str
    text: str
    image: Optional[str]

    @classmethod
    def from_cleaned_post(
        cls, cleaned_post: CleanedPost, embedding_model: EmbeddingModelSingleton
    ) -> list["ChunkedPost"]:
        chunks = ...  # Compute chunks
        return [cls(...) for chunk in chunks]


class EmbeddedChunkedPost(BaseModel):
    post_id: str
    chunk_id: str
    full_raw_text: str
    text: str
    text_embedding: list
    image: Optional[str] = None
    score: Optional[float] = None
    rerank_score: Optional[float] = None

    @classmethod
    def from_chunked_post(
        cls, chunked_post: ChunkedPost, embedding_model: EmbeddingModelSingleton
    ) -> "EmbeddedChunkedPost":
        ...  # Compute embedding.
        return cls(...)

Now, the data at each step is validated and has a clear structure.

Note: Providing different types when instantiating a pydantic model will throw a validation error. For example, if the post_id is defined as a string, and we try to instantiate an EmbeddedChunkedPost with a None or int post_id, it will throw an error.

Check out the full implementation on our 🔗 GitHub Articles Hub repository.

3.3. Load to Qdrant

To load the LinkedIn posts to Qdrant, you have to override Bytewax's StatelessSinkPartition class (which acts as an output in a Bytewax flow):

class QdrantVectorSink(StatelessSinkPartition):
    def __init__(self, client: QdrantClient, collection_name: str):
        self._client = client
        self._collection_name = collection_name

    def write_batch(self, chunks: list[EmbeddedChunkedPost]):
        ...  # Map chunks to ids, embeddings, and metadata.

        self._client.upsert(
            collection_name=self._collection_name,
            points=Batch(
                ids=ids,
                vectors=embeddings,
                payloads=metadata,
            ),
        )

Within this class, you must overwrite the write_batch() method, where we serialize every EmbeddedChunkedPost to a format expected by Qdrant and load it to the vector DB.

4. Retrieval client

Here, we focus on preprocessing a user's query, searching the vector DB, and postprocessing the retrieved posts for maximum results.

To design the retrieval step, we implement a QdrantVectorDBRetriever class to expose all the necessary features for our retrieval client.

class QdrantVectorDBRetriever:
    def __init__(
        self,
        embedding_model: EmbeddingModelSingleton,
        vector_db_client: QdrantClient,
        cross_encoder_model: CrossEncoderModelSingleton,
        vector_db_collection: str,
    ):
        self._embedding_model = embedding_model
        self._vector_db_client = vector_db_client
        self._cross_encoder_model = cross_encoder_model
        self._vector_db_collection = vector_db_collection

    def search(
        self, query: str, limit: int = 3, return_all: bool = False
    ) -> Union[list[EmbeddedChunkedPost], dict[str, list]]:
        ...  # Search the Qdrant vector DB based on the given query.

    def embed_query(self, query: str) -> list[list[float]]:
        ...  # Embed the given query.

    def rerank(
        self, query: str, posts: list[EmbeddedChunkedPost]
    ) -> list[EmbeddedChunkedPost]:
        ...  # Rerank the posts relative to the given query.

    def render_as_html(self, post: EmbeddedChunkedPost) -> None:
        ...  # Map the embedded post to HTML to display it.

4.1. Embed query

We must embed the query in precisely the same way we ingested our posts into the vector DB. Because the streaming pipeline is written in Python (thanks to Bytewax), and every preprocessing operation is modular, we can quickly replicate all the steps necessary to embed the query.

class QdrantVectorDBRetriever:

    ...

    def embed_query(self, query: str) -> list[list[float]]:
        cleaned_query = CleanedPost.clean(query)
        chunks = ChunkedPost.chunk(cleaned_query, self._embedding_model)
        embedded_queries = [
            self._embedding_model(chunk, to_list=True) for chunk in chunks
        ]

        return embedded_queries

Check out the full implementation on our 🔗 GitHub repository.

4.2.
Plain retrievalLet\u2019s try to retrieve a set of posts without using the rerank algorithm.vector_db_retriever = QdrantVectorDBRetriever( embedding_model=EmbeddingModelSingleton(), vector_db_client=build_qdrant_client())query = \"Posts about Qdrant\"retrieved_results = vector_db_retriever.search(query=query)for post in retrieved_results[\"posts\"]: vector_db_retriever.render_as_html(post)Here are the top 2 retrieved results sorted using the cosine similarity score \u2193Result 1:Result 1 for the \u201cPosts about Qdrant\u201d query (without using reranking) [Image by the Author \u2014 in collaboration with VectorHub]Result 2:Result 2 for the \u201cPosts about Qdrant\u201d query (without using reranking) [Image by the Author \u2014 in collaboration with VectorHub]You can see from the results above, that starting from the second post the results are irrelevant. Even though it has a cosine similarly score of ~0.69 the posts doesn\u2019t contain any information about Qdrant or vector DBs.Note: We looked over the top 5 retrieved results. Nothing after the first post was relevant. We haven\u2019t added them here as the article is already too long.4.3. Visualize retrievalTo visualize our retrieval, we implement a dedicated class that uses the UMAP dimensionality reduction algorithm. We have picked UMAP as it preserves the geometric properties between points (e.g., the distance) in higher dimensions when they are projected onto lower dimensions better than its peers (e.g., PCA, t-SNE).The RetrievalVisualizer computes the projected embeddings for the entire vector space once. Afterwards, it uses the render() method to project only the given query and retrieved posts, and plot them to a 2D graph.class RetrievalVisualizer: def __init__(self, posts: list[EmbeddedChunkedPost]): self._posts = posts self._umap_transform = self._fit_model(self._posts) self._projected_post_embeddings = self.project_posts(self._posts) def _fit_model(self, posts: list[EmbeddedChunkedPost]) -> umap.UMAP: umap_transform = ... # Fit a UMAP model on the given posts. return umap_transform def project_posts(self, posts: list[EmbeddedChunkedPost]) -> np.ndarray: embeddings = np.array([post.text_embedding for post in posts]) return self._project(embeddings=embeddings) def _project(self, embeddings: np.ndarray) -> np.ndarray: ... # Project the embeddings to 2D using UMAP. return umap_embeddings def render( self, embedded_queries: list[list[float]], retrieved_posts: list[EmbeddedChunkedPost], ) -> None: ... # Render the given queries & retrieved posts using matplotlib.Let\u2019s take a look at the result to see how the \u201cPosts about Qdrant\u201d query looks \u2193Visualization of the \u201cPosts about Qdrant\u201d query using UMAP (without reranking) [Image by the Author \u2014 in collaboration with VectorHub].Our results are not great. You can see how far the retrieved posts are from our query in the vector space.Can we improve the quality of our retrieval system using the rerank algorithm?4.4. RerankWe use the reranking algorithm to refine our retrieval for the initial query. Our initial retrieval step \u2014 because it used cosine similarity (or similar distance metrics) to compute the distance between a query and post embeddings \u2014 may have missed more complex (but essential) relationships between the query and the documents in the vector space. 
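As a reference point for the discussion that follows, here is a minimal sketch of the two-step retrieve-then-rerank pattern using the sentence-transformers CrossEncoder class. The collection name, query, payload field, and top-N/top-K values are placeholders; in the course, this logic lives inside the QdrantVectorDBRetriever shown above.

# Step 1: rough retrieval with cosine similarity. Step 2: cross-encoder rerank.
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = QdrantClient("localhost", port=6333)

query = "Posts about Qdrant"

# Step 1: grab the top N candidates by embedding distance.
hits = client.search(
    collection_name="linkedin_posts",
    query_vector=embedder.encode(query).tolist(),
    limit=10,
)

# Step 2: score each (query, post) pair with the cross-encoder and keep the top K.
pairs = [(query, hit.payload["text"]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
reranked = sorted(zip(hits, rerank_scores), key=lambda pair: pair[1], reverse=True)[:3]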
Reranking leverages the power of transformer models that are capable of understanding more nuanced semantic relationships.We use a cross-encoder model to implement the reranking step, so we can score the query relative to all retrieved posts individually. These scores take into consideration more complex relationships than cosine similarity can. Under the hood is a BERT classifier that outputs a number between 0 and 1 according to how similar the 2 given sentences are. The BERT classifier outputs 0 if they are entirely different and 1 if they are a perfect match.Bi-Encoder vs. Cross-Encoder [Image by the Author \u2014 in collaboration with VectorHub]Bi-Encoder vs. Cross-Encoder [Image by the Author \u2014 in collaboration with VectorHub]But, you might ask, \u201cWhy not use the cross-encoder model from the start if it is that much better?\u201dThe answer, in a word, is speed. Using a cross-encoder model to search your whole collection is much slower than using cosine similarity. To optimize your retrieval, therefore, your reranking process should involve 2 steps:an initial rough retrieval step using cosine similarity, which retrieves the top N items as potential candidatesfiltering the rough search using the rerank strategy, which retrieves the top K items as your final resultsThe implementation is relatively straightforward. For each retrieved post, we create a pair consisting of the (cleaned) query and the text of the post. We do this for all retrieved posts, resulting in a list of pairs.Next, we call a cross-encoder/ms-marco-MiniLM-L-6-v2 model (from sentence-transformers) to give the retrieved posts their rerank score. We then sort the posts in descending order based on their rerank score.Check out the rerank algorithm implementation on our \ud83d\udd17 GitHub repository.4.5. Visualize retrieval with rerankNow that we\u2019ve added the rerank pattern to our retrieval system, let\u2019s see if it improves the results of our \u201cPosts about Qdrant\u201d query \u2193Result 1Result 1 for the \u201cPosts about Qdrant\u201d query (using reranking) [Image by the Author \u2014 in collaboration with VectorHub]Result 2:Result 2 for the \u201cPosts about Qdrant\u201d query (using reranking) [Image by the Author \u2014 in collaboration with VectorHub]The improvement is remarkable! All our results are about Qdrant and vector DBs.Note: We looked over the top 5 retrieved results. The top 4 out of 5 posts are relevant to our query, which is incredible.Now, let\u2019s look at the UMAP visualization:Visualization of the \u201cPosts about Qdrant\u201d query using UMAP (with reranking) [Image by the Author \u2014 in collaboration with VectorHub].While the returned posts aren\u2019t very close to the query, they are a lot closer to the query compared to when we weren\u2019t reranking the retrieved posts.5. ConclusionIn this article, we learned how to adapt a RAG retrieval pattern to improve LinkedIn post retrieval. To keep our database up to date with rapidly changing social media data, we implemented a real-time streaming pipeline that uses CDC to sync the raw LinkedIn posts data source with a vector DB. You also saw how to use Bytewax to write \u2014 using only Python \u2014 a streaming pipeline that cleans, chunks, and embeds LinkedIn posts.Finally, you learned how to implement a standard retrieval client for RAG and saw how to improve it using the rerank pattern. 
As retrieval is complex to evaluate, you saw how to visualize the retrieval for a given query by rendering all the posts, the query, and the retrieved posts in a 2D space using UMAP.

This article is a summary of my contribution from VectorHub. Check out the full article here to dig into the details, the code and more experiments.

→ Join 5k+ engineers in the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems, every week: decodingml.substack.com"
}, "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/a-real-time-retrieval-system-for-rag-on-social-media-data-9cc01d50a2a0" }, { "id": "c647c345-aeb5-46f7-8f16-8a6345344069", "content": { "Title": "SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG — in Real-Time!", "Subtitle": "Use a Python streaming engine to populate a feature store from 4+ data sources", "Content": "LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG — in Real-Time!
Use a Python streaming engine to populate a feature store from 4+ data sources
Paul Iusztin, published in Decoding ML, Apr 20, 2024 (19 min read)

→ the 4th out of 12 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.

Image by DALL-E

Why is this course different?

By finishing the "LLM Twin: Building Your Production-Ready AI Replica" free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Why should you care? 🫵

→ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment. You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning. The end goal? Build and deploy your own LLM twin.

The architecture of the LLM twin is split into 4 Python microservices:
- the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
- the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
- the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML's experiment tracker to monitor the experiments. Evaluate and save the best model to Comet's model registry. (deployed on Qwak)
- the inference pipeline: load and quantize the fine-tuned LLM from Comet's model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet's prompt monitoring dashboard.
(deployed on Qwak)LLM twin system architecture [Image by the Author]Along the 4 microservices, you will learn to integrate 3 serverless tools:Comet ML as your ML Platform;Qdrant as your vector DB;Qwak as your ML infrastructure;Who is this for?Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.Level: intermediatePrerequisites: basic knowledge of Python, ML, and the cloudHow will you learn?The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.Also, it includes 2 bonus lessons on how to improve the RAG system.You can read everything at your own pace.\u2192 To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.Costs?The articles and code are completely free. They will always remain free.But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by:Paul Iusztin | Senior ML & MLOps EngineerAlex Vesa | Senior AI EngineerAlex Razvant | Senior ML & MLOps Engineer\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0fLessons\u2192 Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson:An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture: Enabling Event-Driven ArchitecturesSOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine-Tuning LLMsHow to fine-tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine-Tuned LLMsArchitect scalable and cost-effective LLM & RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code[Bonus] Build Multi-Index Advanced RAG AppsTo better understand the course\u2019s goal, technical details, and system design \u2192 Check out Lesson 1Let\u2019s start with Lesson 4 \u2193\u2193\u2193Lesson 4: Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!In the 4th lesson, we will focus on the feature pipeline.The feature pipeline is the first pipeline presented in the 3 pipeline architecture: feature, training and inference pipelines.A feature pipeline is responsible for taking raw data as input, processing it into features, and storing it in a feature store, from which the training & inference pipelines will use it.The component is completely isolated from the training and inference code. 
All the communication is done through the feature store.To avoid repeating myself, if you are unfamiliar with the 3 pipeline architecture, check out Lesson 1 for a refresher.By the end of this article, you will learn to design and build a production-ready feature pipeline that:uses Bytewax as a stream engine to process data in real-time;ingests data from a RabbitMQ queue;uses SWE practices to process multiple data types: posts, articles, code;cleans, chunks, and embeds data for LLM fine-tuning and RAG;loads the features to a Qdrant vector DB.Note: In our use case, the feature pipeline is also a streaming pipeline, as we use a Bytewax streaming engine. Thus, we will use these words interchangeably.We will wrap up Lesson 4 by showing you how to deploy the feature pipeline to AWS and integrate it with the components from previous lessons: data collection pipeline, MongoDB, and CDC.In the 5th lesson, we will go through the vector DB retrieval client, where we will teach you how to query the vector DB and improve the accuracy of the results using advanced retrieval techniques.Excited? Let\u2019s get started!The architecture of the feature/streaming pipeline.Table of ContentsWhy are we doing this?System design of the feature pipelineThe Bytewax streaming flowPydantic data modelsLoad data to QdrantThe dispatcher layerPreprocessing steps: Clean, chunk, embedThe AWS infrastructureRun the code locallyDeploy the code to AWS & Run it from the cloudConclusion\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0f1. Why are we doing this?A quick reminder from previous lessonsTo give you some context, in Lesson 2, we crawl data from LinkedIn, Medium, and GitHub, normalize it, and load it to MongoDB.In Lesson 3, we are using CDC to listen to changes to the MongoDB database and emit events in a RabbitMQ queue based on any CRUD operation done on MongoDB.\u2026and here we are in Lesson 4, where we are building the feature pipeline that listens 24/7 to the RabbitMQ queue for new events to process and load them to a Qdrant vector DB.The problem we are solvingIn our LLM Twin use case, the feature pipeline constantly syncs the MongoDB warehouse with the Qdrant vector DB while processing the raw data into features.Important: In our use case, the Qdrant vector DB will be our feature store.Why we are solving itThe feature store will be the central point of access for all the features used within the training and inference pipelines.For consistency and simplicity, we will refer to different formats of our text data as \u201cfeatures.\u201d\u2192 The training pipeline will use the feature store to create fine-tuning datasets for your LLM twin.\u2192 The inference pipeline will use the feature store for RAG.For reliable results (especially for RAG), the data from the vector DB must always be in sync with the data from the data warehouse.The question is, what is the best way to sync these 2?Other potential solutionsThe most common solution is probably to use a batch pipeline that constantly polls from the warehouse, computes a difference between the 2 databases, and updates the target database.The issue with this technique is that computing the difference between the 2 databases is extremely slow and costly.Another solution is to use a push technique using a webhook. Thus, on any CRUD change in the warehouse, you also update the source DB.The biggest issue here is that if the webhook fails, you have to implement complex recovery logic.Lesson 3 on CDC covers more of this.2. 
System design of the feature pipeline: our solutionOur solution is based on CDC, a queue, a streaming engine, and a vector DB:\u2192 CDC adds any change made to the Mongo DB to the queue (read more in Lesson 3).\u2192 the RabbitMQ queue stores all the events until they are processed.\u2192 The Bytewax streaming engine cleans, chunks, and embeds the data.\u2192 A streaming engine works naturally with a queue-based system.\u2192 The data is uploaded to a Qdrant vector DB on the flyWhy is this powerful?Here are 4 core reasons:The data is processed in real-time.Out-of-the-box recovery system: If the streaming pipeline fails to process a message will be added back to the queueLightweight: No need for any diffs between databases or batching too many recordsNo I/O bottlenecks on the source database\u2192 It solves all our problems!The architecture of the feature/streaming pipeline.How is the data stored?We store 2 snapshots of our data in the feature store. Here is why \u2193Remember that we said that the training and inference pipeline will access the features only from the feature store, which, in our case, is the Qdrant vector DB?Well, if we had stored only the chunked & embedded version of the data, that would have been useful only for RAG but not for fine-tuning.Thus, we make an additional snapshot of the cleaned data, which will be used by the training pipeline.Afterward, we pass it down the streaming flow for chunking & embedding.How do we process multiple data types?How do you process multiple types of data in a single streaming pipeline without writing spaghetti code?Yes, that is for you, data scientists! Joking\u2026am I?We have 3 data types: posts, articles, and code.Each data type (and its state) will be modeled using Pydantic models.To process them we will write a dispatcher layer, which will use a creational factory pattern [9] to instantiate a handler implemented for that specific data type (post, article, code) and operation (cleaning, chunking, embedding).The handler follows the strategy behavioral pattern [10].Intuitively, you can see the combination between the factory and strategy patterns as follows:Initially, we know we want to clean the data, but as we don\u2019t know the data type, we can\u2019t know how to do so.What we can do, is write the whole code around the cleaning code and abstract away the login under a Handler() interface (aka the strategy).When we get a data point, the factory class creates the right cleaning handler based on its type.Ultimately the handler is injected into the rest of the system and executed.By doing so, we can easily isolate the logic for a given data type & operation while leveraging polymorphism to avoid filling up the code with 1000x \u201cif else\u201d statements.We will dig into the implementation in future sections.Streaming over batchYou may ask why we need a streaming engine instead of implementing a batch job that polls the messages at a given frequency.That is a valid question.The thing is that\u2026Nowadays, using tools such as Bytewax makes implementing streaming pipelines a lot more frictionless than using their JVM alternatives.The key aspect of choosing a streaming vs. a batch design is real-time synchronization between your source and destination DBs.In our particular case, we will process social media data, which changes fast and irregularly.Also, for our digital twin, it is important to do RAG on up-to-date data. 
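Before we continue the streaming-versus-batch discussion, here is a minimal sketch of the factory + strategy combination described earlier for the dispatcher layer. All class names are illustrative; the course's real handlers and dispatchers live in its GitHub repository and are covered in Section 6.

# Illustrative dispatcher sketch: a strategy interface for one operation (cleaning)
# plus a factory that instantiates the right handler for each data type.
from abc import ABC, abstractmethod


class CleaningHandler(ABC):  # the strategy interface
    @abstractmethod
    def clean(self, data_model: dict) -> dict: ...


class PostCleaningHandler(CleaningHandler):
    def clean(self, data_model: dict) -> dict:
        ...  # post-specific cleaning logic
        return data_model


class ArticleCleaningHandler(CleaningHandler):
    def clean(self, data_model: dict) -> dict:
        ...  # article-specific cleaning logic
        return data_model


class CleaningHandlerFactory:  # the factory: maps a data type to its handler
    _handlers = {"post": PostCleaningHandler, "article": ArticleCleaningHandler}

    @classmethod
    def create_handler(cls, data_type: str) -> CleaningHandler:
        return cls._handlers[data_type]()


class CleaningDispatcher:  # the glue code: picks a handler and runs it
    @staticmethod
    def dispatch(message: dict) -> dict:
        handler = CleaningHandlerFactory.create_handler(message["type"])
        return handler.clean(message)

With that sketched, back to the streaming-versus-batch trade-off.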
We don\u2019t want to have any delay between what happens in the real world and what your LLM twin sees.That being said choosing a streaming architecture seemed natural in our use case.3. The Bytewax streaming flowThe Bytewax flow is the central point of the streaming pipeline. It defines all the required steps, following the next simplified pattern: \u201cinput -> processing -> output\u201d.As I come from the AI world, I like to see it as the \u201cgraph\u201d of the streaming pipeline, where you use the input(), map(), and output() Bytewax functions to define your graph, which in the Bytewax world is called a \u201cflow\u201d.As you can see in the code snippet below, we ingest posts, articles or code messages from a RabbitMQ queue. After we clean, chunk and embed them. Ultimately, we load the cleaned and embedded data to a Qdrant vector DB, which in our LLM twin use case will represent the feature store of our system.To structure and validate the data, between each Bytewax step, we map and pass a different Pydantic model based on its current state: raw, cleaned, chunked, or embedded.Bytewax flow \u2192 GitHub Code \u20ea\u2190We have a single streaming pipeline that processes everything.As we ingest multiple data types (posts, articles, or code snapshots), we have to process them differently.To do this the right way, we implemented a dispatcher layer that knows how to apply data-specific operations based on the type of message.More on this in the next sections \u2193Why Bytewax?Bytewax is an open-source streaming processing framework that:- is built in Rust \u2699\ufe0f for performance- has Python \ud83d\udc0d bindings for leveraging its powerful ML ecosystem\u2026 so, for all the Python fanatics out there, no more JVM headaches for you.Jokes aside, here is why Bytewax is so powerful \u2193- Bytewax local setup is plug-and-play- can quickly be integrated into any Python project (you can go wild \u2014 even use it in Notebooks)- can easily be integrated with other Python packages (NumPy, PyTorch, HuggingFace, OpenCV, SkLearn, you name it)- out-of-the-box connectors for Kafka and local files, or you can quickly implement your ownWe used Bytewax to build the streaming pipeline for the LLM Twin course and loved it.To learn more about Bytewax, go and check them out. They are open source, so no strings attached \u2192 Bytewax [2] \u21904. Pydantic data modelsLet\u2019s take a look at what our Pydantic models look like.First, we defined a set of base abstract models for using the same parent class across all our components.Pydantic base model structure \u2192 GitHub Code \u20ea\u2190Afterward, we defined a hierarchy of Pydantic models for:all our data types: posts, articles, or codeall our states: raw, cleaned, chunked, and embeddedThis is how the set of classes for the posts will look like \u2193Pydantic posts model structure \u2192 GitHub Code \u20ea\u2190We repeated the same process for the articles and code model hierarchy.Check out the other data classes on our GitHub.Why is keeping our data in Pydantic models so powerful?There are 4 main criteria:every field has an enforced type: you are ensured the data types are going to be correctthe fields are automatically validated based on their type: for example, if the field is a string and you pass an int, it will through an errorthe data structure is clear and verbose: no more clandestine dicts that you never know what is in themyou make your data the first-class citizen of your program5. 
5. Load data to Qdrant

The first step is to implement our custom Bytewax DynamicSink class ↓

Qdrant DynamicSink → GitHub Code ←

Next, for every type of operation we need (outputting cleaned or embedded data), we have to subclass the StatelessSinkPartition Bytewax class (they also provide a stateful option → more in their docs).

An instance of the class will run on every partition defined within the Bytewax deployment. In the course, we are using a single partition per worker. But, by adding more partitions (and workers), you can quickly scale your Bytewax pipeline horizontally.

Qdrant worker partitions → GitHub Code ←

Note that we used Qdrant's Batch method to upload all the available points at once. By doing so, we reduce the latency on the network I/O side: more on that here [8] ←

The RabbitMQ streaming input follows a similar pattern. Check it out here ←

6. The dispatcher layer

Now that we have the Bytewax flow and all our data models, how do we map a raw data model to a cleaned data model?

→ All our domain logic is modeled by a set of Handler() classes.

For example, here is what the handler used to map a PostsRawModel to a PostCleanedModel looks like ↓

Handler hierarchy of classes → GitHub Code ←

Check out the other handlers on our GitHub: → ChunkingDataHandler and EmbeddingDataHandler

In the next sections, we will explore the exact cleaning, chunking and embedding logic.

Now, to build our dispatcher, we need 2 last components:

- a factory class: instantiates the right handler based on the type of the event
- a dispatcher class: the glue code that calls the factory class and handler

Here is what the cleaning dispatcher and factory look like ↓

The dispatcher and factory classes → GitHub Code ←

Check out the other dispatchers on our GitHub.

By repeating the same logic, we will end up with the following set of dispatchers:

- RawDispatcher (no factory class required, as the data is not processed)
- CleaningDispatcher (with a CleaningHandlerFactory class)
- ChunkingDispatcher (with a ChunkingHandlerFactory class)
- EmbeddingDispatcher (with an EmbeddingHandlerFactory class)
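To give you a rough feel for how the factory and strategy patterns click together here, below is a simplified, hypothetical version of a cleaning dispatcher that reuses the illustrative PostRawModel and PostCleanedModel classes from the earlier sketch. The names and signatures are assumptions for this sketch, not the exact repository code ↓

from abc import ABC, abstractmethod


class CleaningDataHandler(ABC):
    # Strategy interface: one concrete handler per data type.
    @abstractmethod
    def clean(self, data_model: PostRawModel) -> PostCleanedModel:
        ...


class PostCleaningHandler(CleaningDataHandler):
    def clean(self, data_model: PostRawModel) -> PostCleanedModel:
        # Toy cleaning: collapse whitespace; the real handler applies the full cleaning logic.
        return PostCleanedModel(
            entry_id=data_model.entry_id,
            type=data_model.type,
            author_id=data_model.author_id,
            cleaned_content=' '.join(str(data_model.content).split()),
        )


class CleaningHandlerFactory:
    # Creational factory: instantiate the right strategy for a given data type.
    @staticmethod
    def create_handler(data_type: str) -> CleaningDataHandler:
        handlers = {'posts': PostCleaningHandler}  # article/code handlers would be registered here too
        if data_type not in handlers:
            raise ValueError(f'Unsupported data type: {data_type}')
        return handlers[data_type]()


class CleaningDispatcher:
    # Glue code: resolve the handler through the factory and apply it to the message.
    @classmethod
    def dispatch_cleaner(cls, data_model: PostRawModel) -> PostCleanedModel:
        handler = CleaningHandlerFactory.create_handler(data_model.type)
        return handler.clean(data_model)

The chunking and embedding dispatchers follow exactly the same shape; only the handler interface changes.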
7. Preprocessing steps: Clean, chunk, embed

Here we will focus on the concrete logic used to clean, chunk, and embed a data point.

Note that this logic is wrapped by our handlers to be integrated into our dispatcher layer using the strategy behavioral pattern [10]. We already described that in the previous section, so we will jump straight into the actual logic here, which can be found in the utils module of our GitHub repository.

Note: These steps are experimental. What we present here is just the first iteration of the system. In a real-world scenario, you would experiment with different cleaning, chunking or model versions to improve it on your data.

Cleaning

This is the main utility function used to clean the text for our posts, articles, and code. For simplicity, we used the same logic for all the data types, but after more investigation, you would probably need to adapt it to your specific needs.

For example, your posts might start containing some weird characters, and you don't want to run the "unbold_text()" or "unitalic_text()" functions on your code data points, as that is completely redundant.

Cleaning logic → GitHub Code ←

Most of the functions above are from the unstructured [3] Python package. It is a great tool for quickly finding utilities to clean text data.

🔗 More examples of unstructured here [3] ←

One key thing to notice is that at the cleaning step, we just want to remove all the weird, non-interpretable characters from the text. Also, we want to remove redundant data, such as extra whitespace or URLs, as they do not provide much value. These steps are critical for our tokenizer to understand and efficiently transform our string input into numbers that will be fed into the transformer models.

Note that when using bigger models (transformers) + modern tokenization techniques, you don't need to standardize your dataset too much. For example, it is redundant to apply lemmatization or stemming, as the tokenizer knows how to split your input into a commonly used sequence of characters efficiently, and the transformers can pick up the nuances of the words.

💡 What is important at the cleaning step is to throw out the noise.

Chunking

We are using LangChain to chunk our text.

We use a 2-step strategy using LangChain's RecursiveCharacterTextSplitter [4] and SentenceTransformersTokenTextSplitter [5], as seen below ↓

Chunking logic → GitHub Code ←

Overlapping your chunks is a common pre-indexing RAG technique, which helps to cluster chunks from the same document semantically.

Again, we are using the same chunking logic for all of our data types, but to get the most out of it, we would probably need to tweak the separators, chunk_size, and chunk_overlap parameters for our different use cases. But our dispatcher + handler architecture would easily allow us to configure the chunking step in future iterations.

Embedding

The data preprocessing, aka the hard part, is done. Now we just have to call an embedding model to create our vectors.

Embedding logic → GitHub Code ←

We used all-MiniLM-L6-v2 [6] from the sentence-transformers library to embed our articles and posts: a lightweight embedding model that can easily run in real time on a 2 vCPU machine.

As the code data points contain more complex relationships and specific jargon to embed, we used a more powerful embedding model: hkunlp/instructor-xl [7]. This embedding model is unique as it can be customized on the fly with instructions based on your particular data. This allows the embedding model to specialize on your data without fine-tuning, which is handy for embedding pieces of code.
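If you want to see the whole journey of a data point in one place, here is a condensed sketch of the three steps using the unstructured, LangChain and sentence-transformers packages. The parameters and helper names are illustrative assumptions, and depending on your LangChain version the splitters may live in the langchain_text_splitters package instead ↓

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)
from sentence_transformers import SentenceTransformer
from unstructured.cleaners.core import clean, clean_non_ascii_chars, replace_unicode_quotes

EMBEDDING_MODEL_ID = 'sentence-transformers/all-MiniLM-L6-v2'  # assumption for this sketch


def clean_text(text: str) -> str:
    # Remove non-interpretable characters and redundant whitespace.
    return clean(clean_non_ascii_chars(replace_unicode_quotes(text)), extra_whitespace=True)


def chunk_text(text: str) -> list[str]:
    # Step 1: split on paragraphs; step 2: respect the embedding model's token window.
    paragraphs = RecursiveCharacterTextSplitter(
        separators=['\n\n'], chunk_size=500, chunk_overlap=0
    ).split_text(text)
    token_splitter = SentenceTransformersTokenTextSplitter(
        model_name=EMBEDDING_MODEL_ID, tokens_per_chunk=256, chunk_overlap=50
    )
    return [chunk for paragraph in paragraphs for chunk in token_splitter.split_text(paragraph)]


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    model = SentenceTransformer(EMBEDDING_MODEL_ID)
    return model.encode(chunks).tolist()


if __name__ == '__main__':
    chunks = chunk_text(clean_text('Your raw post, article or code snippet goes here.'))
    vectors = embed_chunks(chunks)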
8. The AWS infrastructure

In Lesson 2, we covered how to deploy the data collection pipeline that is triggered by a link to Medium, Substack, LinkedIn or GitHub → crawls the given link → saves the crawled information to MongoDB.

In Lesson 3, we explained how to deploy the CDC components that emit events to a RabbitMQ queue based on any CRUD operation done on MongoDB.

What is left is to deploy the Bytewax streaming pipeline and the Qdrant vector DB.

We will use Qdrant's self-hosted option, which is easy to set up and scale. To test things out, they offer a Free Tier plan for up to a 1GB cluster, which is more than enough for our course.

→ We explained in our GitHub repository how to configure Qdrant.

AWS infrastructure of the feature/streaming pipeline.

The last piece of the puzzle is the Bytewax streaming pipeline. As we don't require a GPU and the streaming pipeline needs to run 24/7, we will deploy it to AWS Fargate, a cost-effective serverless solution from AWS. As a serverless solution, Fargate allows us to deploy our code quickly and scale it fast in case of high traffic.

How do we deploy the streaming pipeline code to Fargate?

Using GitHub Actions, we wrote a CD pipeline that builds a Docker image on every new commit made on the main branch. Afterward, the Docker image is pushed to AWS ECR. Ultimately, Fargate pulls the latest version of the Docker image. This is a common CD pipeline to deploy your code to AWS services.

Why not use Lambda functions, as we did for the data pipeline? An AWS Lambda function executes a function once and then shuts down. This worked perfectly for the crawling logic, but it won't work for our streaming pipeline, which has to run 24/7.

9. Run the code locally

To quickly test things out, we wrote a docker-compose.yaml file to spin up the MongoDB, RabbitMQ queue and Qdrant vector DB.

You can spin up the Docker containers using our Makefile by running the following, which will start the CDC component and the streaming pipeline:

make local-start

To start the data collection pipeline, run the following:

make local-test-github

The documentation of our GitHub repository provides more details on how to run and set up everything.

10. Deploy the code to AWS & Run it from the cloud

This article is already too long, so I won't go into the details of how to deploy the AWS infrastructure described above and test it out here. But to give you some insights, we used Pulumi as our infrastructure-as-code (IaC) tool, which will allow you to spin it up quickly with a few commands.

Also, I won't leave you hanging on this one. We made a promise and… ↓

We prepared step-by-step instructions in the README of our GitHub repository on how to use Pulumi to spin up the infrastructure and test it out.

Conclusion

Now you know how to write streaming pipelines like a PRO!

In Lesson 4, you learned how to:

- design a feature pipeline using the 3-pipeline architecture
- write a streaming pipeline using Bytewax as a streaming engine
- use a dispatcher layer to write a modular and flexible application that processes multiple types of data (posts, articles, code)
- load the cleaned and embedded data to Qdrant
- deploy the streaming pipeline to AWS

→ This is only the ingestion part used for fine-tuning LLMs and RAG.

In Lesson 5, you will learn how to write a retrieval client for the 3 data types using good SWE practices and improve the retrieval accuracy using advanced retrieval & post-retrieval techniques.
See you there!

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE ↓
→ decodingml.substack.com

References

Literature

[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] Bytewax, Bytewax Landing Page
[3] Unstructured Cleaning Examples, Unstructured Documentation
[4] Recursively split by character, LangChain's Documentation
[5] Split by tokens, LangChain's Documentation
[6] sentence-transformers/all-MiniLM-L6-v2, HuggingFace
[7] hkunlp/instructor-xl, HuggingFace
[8] Qdrant, Qdrant Documentation
[9] Abstract Factory Pattern, Refactoring Guru
[10] Strategy Pattern, Refactoring Guru

Images

If not otherwise stated, all images are created by the author.
" }, "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/sota-python-streaming-pipelines-for-fine-tuning-llms-and-rag-in-real-time-82eb07795b87" }, { "id": "649bd7d7-aa0e-4ada-b5e2-1c50fe7c95e6", "content": { "Title": "The 4 Advanced RAG Algorithms You Must Know to Implement", "Subtitle": "Implement from scratch 4 advanced RAG methods to optimize your retrieval and post-retrieval algorithm", "Content": "LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA

The 4 Advanced RAG Algorithms You Must Know to Implement

Implement from scratch 4 advanced RAG methods to optimize your retrieval and post-retrieval algorithm

Paul Iusztin · Published in Decoding ML · 16 min read · May 4, 2024

→ the 5th out of 12 lessons of the LLM Twin free course

What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.

Image by DALL-E

Why is this course different?

By finishing the "LLM Twin: Building Your Production-Ready AI Replica" free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Why should you care? 🫵

→ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.

What will you learn to build by the end of this course?

You will learn how to architect and build a real-world LLM system from start to finish — from data collection to deployment. You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.

The end goal? Build and deploy your own LLM twin.

The architecture of the LLM twin is split into 4 Python microservices:

- the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)
- the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)
- the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML's experiment tracker to monitor the experiments. Evaluate and save the best model to Comet's model registry. (deployed on Qwak)
- the inference pipeline: load and quantize the fine-tuned LLM from Comet's model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet's prompt monitoring dashboard. (deployed on Qwak)
LLM twin system architecture [Image by the Author]

Along the 4 microservices, you will learn to integrate 3 serverless tools:

- Comet ML as your ML platform;
- Qdrant as your vector DB;
- Qwak as your ML infrastructure.

Who is this for?

- Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.
- Level: intermediate
- Prerequisites: basic knowledge of Python, ML, and the cloud

How will you learn?

The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system. It also includes 2 bonus lessons on how to improve the RAG system. You can read everything at your own pace.

→ To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

Costs?

The articles and code are completely free. They will always remain free. But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.

The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum. For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.

Meet your teachers!

The course is created under the Decoding ML umbrella by:

- Paul Iusztin | Senior ML & MLOps Engineer
- Alex Vesa | Senior AI Engineer
- Alex Razvant | Senior ML & MLOps Engineer

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Lessons

→ Quick overview of each lesson of the LLM Twin free course.

The course is split into 12 lessons. Every Medium article will be its own lesson:

1. An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin
2. The Importance of Data Pipelines in the Era of Generative AI
3. Change Data Capture: Enabling Event-Driven Architectures
4. SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG — in Real-Time!
5. The 4 Advanced RAG Algorithms You Must Know to Implement
6. The Role of Feature Stores in Fine-Tuning LLMs
7. How to fine-tune LLMs on custom datasets at Scale using Qwak and CometML
8. Best Practices When Evaluating Fine-Tuned LLMs
9. Architect scalable and cost-effective LLM & RAG inference pipelines
10. How to evaluate your RAG pipeline using the RAGAs Framework
11. [Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code
12. [Bonus] Build Multi-Index Advanced RAG Apps

To better understand the course's goal, technical details, and system design → Check out Lesson 1

Let's start with Lesson 5 ↓↓↓

Lesson 5: The 4 Advanced RAG Algorithms You Must Know to Implement

In Lesson 5, we will focus on building an advanced retrieval module used for RAG. We will show you how to implement 4 retrieval and post-retrieval advanced optimization techniques to improve the accuracy of your RAG retrieval step.

In this lesson, we will focus only on the retrieval part of the RAG system. In Lesson 4, we showed you how to clean, chunk, embed, and load social media data to a Qdrant vector DB (the ingestion part of RAG). In future lessons, we will integrate this retrieval module into the inference pipeline for a full-fledged RAG system.

Retrieval Python Module Architecture

We assume you are already familiar with what a naive RAG looks like.
If not, check out the following article from Decoding ML, where we present in a 2-minute read what a naive RAG looks like:

→ Why you must choose streaming over batch pipelines when doing RAG in LLM applications (Lesson 2: RAG, streaming pipelines, vector DBs, text processing) — medium.com

Table of Contents

- Overview of advanced RAG optimization techniques
- Advanced RAG techniques applied to the LLM twin
- Retrieval optimization (1): Query expansion
- Retrieval optimization (2): Self query
- Retrieval optimization (3): Hybrid & filtered vector search
- Implement the advanced retrieval Python class
- Post-retrieval optimization: Rerank using GPT-4
- How to use the retrieval
- Conclusion

🔗 Check out the code on GitHub [1] and support us with a ⭐️

1. Overview of advanced RAG optimization techniques

A production RAG system is split into 3 main components:

- ingestion: clean, chunk, embed, and load your data to a vector DB
- retrieval: query your vector DB for context
- generation: attach the retrieved context to your prompt and pass it to an LLM

The ingestion component sits in the feature pipeline, while the retrieval and generation components are implemented inside the inference pipeline. You can also use the retrieval and generation components in your training pipeline to fine-tune your LLM further on domain-specific prompts.

You can apply advanced techniques to optimize your RAG system for ingestion, retrieval and generation. That being said, there are 3 main types of advanced RAG techniques:

- Pre-retrieval optimization [ingestion]: tweak how you create the chunks
- Retrieval optimization [retrieval]: improve the queries to your vector DB
- Post-retrieval optimization [retrieval]: process the retrieved chunks to filter out the noise

The generation step can be improved through fine-tuning or prompt engineering, which will be explained in future lessons.

The pre-retrieval optimization techniques are explained in Lesson 4. In this lesson, we will show you some popular retrieval and post-retrieval optimization techniques.

2. Advanced RAG techniques applied to the LLM twin

Retrieval optimization

We will combine 3 techniques:

- Query Expansion
- Self Query
- Filtered vector search

Post-retrieval optimization

We will use the rerank pattern using GPT-4 and prompt engineering instead of Cohere or an open-source re-ranker cross-encoder [4].

I don't want to spend too much time on the theoretical aspects. There are plenty of articles on that. So, we will jump straight to implementing and integrating these techniques into our LLM twin system.

But before seeing the code, let's clarify a few things ↓

Advanced RAG architecture

2.1 Important Note!

We will show you a custom implementation of the advanced techniques and NOT use LangChain. Our primary goal is to build your intuition about how they work behind the scenes. However, we will attach LangChain's equivalent so you can use them in your apps.

Customizing LangChain can be a real headache. Thus, understanding what happens behind its utilities can help you build real-world applications.

Also, it is critical to know that if you don't ingest the data using LangChain, you cannot use their retrievers either, as they expect the data to be in a specific format. We haven't used LangChain's ingestion function in Lesson 4 either (the feature pipeline that loads data to Qdrant), as we want to do everything "by hand".
2.2. Why Qdrant?

There are many vector DBs out there, too many… But since we discovered Qdrant, we loved it.

Why?

- It is built in Rust.
- Apache-2.0 license — open-source 🔥
- It has a great and intuitive Python SDK.
- It has a freemium self-hosted version to build PoCs for free.
- It supports unlimited document sizes and vector dims of up to 65,536.
- It is production-ready. Companies such as Disney, Mozilla, and Microsoft already use it.
- It is one of the most popular vector DBs out there.

To put that in perspective, Pinecone, one of its biggest competitors, supports only documents with up to 40k tokens and vectors with up to 20k dimensions… and a proprietary license.

I could go on and on… but if you are curious to find out more, check out Qdrant ←

3. Retrieval optimization (1): Query expansion

The problem

In a typical retrieval step, you query your vector DB using a single point. The issue with that approach is that by using a single vector, you cover only a small area of your embedding space. Thus, if your embedding doesn't contain all the required information, your retrieved context will not be relevant.

What if we could query the vector DB with multiple data points that are semantically related? That is what the "query expansion" technique is doing!

The solution

Query expansion is quite intuitive. You use an LLM to generate multiple queries based on your initial query. These queries should contain multiple perspectives of the initial query. Thus, when embedded, they hit different areas of your embedding space that are still relevant to our initial question.

You can do query expansion with a detailed zero-shot prompt. Here is our simple & custom solution ↓

Query expansion template → GitHub Code ←

Here is LangChain's MultiQueryRetriever class [5] (their equivalent).

4. Retrieval optimization (2): Self query

The problem

When embedding your query, you cannot guarantee that all the aspects required by your use case are present in the embedding vector. For example, you want to be 100% sure that your retrieval relies on the tags provided in the query. The issue is that by embedding the query prompt, you can never be sure that the tags are represented in the embedding vector or have enough signal when computing the distance against other vectors.

The solution

What if you could extract the tags within the query and use them alongside the embedded query? That is what self-query is all about!

You use an LLM to extract various metadata fields that are critical for your business use case (e.g., tags, author ID, number of comments, likes, shares, etc.). In our custom solution, we are extracting just the author ID. Thus, a zero-shot prompt engineering technique will do the job. But, when extracting multiple metadata types, you should also use few-shot learning to optimize the extraction step.

Self-queries work hand-in-hand with filtered vector searches, which we will explain in the next section.

Here is our solution ↓

Self-query template → GitHub Code ←

Here is LangChain's SelfQueryRetriever class [6] equivalent, and this is an example using Qdrant [8].
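To make the two techniques above more tangible, here is a compact, hypothetical sketch of both steps using the OpenAI Python client. The prompts, model name and function names are illustrative assumptions, not the course's exact templates ↓

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def expand_query(query: str, to_expand_to_n: int = 5) -> list[str]:
    # Query expansion: ask the LLM for multiple perspectives of the same question.
    prompt = (
        f'Generate {to_expand_to_n} different versions of this question, '
        f'one per line, each covering a different perspective:\n{query}'
    )
    response = client.chat.completions.create(
        model='gpt-4', messages=[{'role': 'user', 'content': prompt}]
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]


def self_query(query: str) -> str | None:
    # Self query: extract the metadata we care about (here, only the author id).
    prompt = (
        'Extract the author id from the question below. '
        f'Reply with the id only, or NONE if there is none:\n{query}'
    )
    response = client.chat.completions.create(
        model='gpt-4', messages=[{'role': 'user', 'content': prompt}]
    )
    answer = response.choices[0].message.content.strip()
    return None if answer == 'NONE' else answer

The expanded queries are embedded and sent to the vector DB, while the extracted author id becomes an exact-match filter, which is exactly where the next section picks up.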
5. Retrieval optimization (3): Hybrid & filtered vector search

The problem

Embeddings are great for capturing the general semantics of a specific chunk, but they are not that great for querying specific keywords.

For example, if we want to retrieve article chunks about LLMs from our Qdrant vector DB, embeddings would be enough. However, if we want to query for a specific LLM type (e.g., Llama 3), using only similarities between embeddings won't be enough. Thus, embeddings are not great at finding exact phrase matches for specific terms.

The solution

Combine the vector search technique with one (or more) complementary search strategy that works great for finding exact words.

It is not defined which algorithms are combined, but the most standard strategy for hybrid search is to combine traditional keyword-based search and modern vector search.

How are these combined?

The first method is to merge the similarity scores of the 2 techniques as follows:

hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score

Where alpha takes a value between [0, 1], with:

- alpha = 1: vector search
- alpha = 0: keyword search

Also, the similarity scores are defined as follows:

- sparse_score: the result of the keyword search, which, behind the scenes, uses a BM25 algorithm [7] that sits on top of TF-IDF.
- dense_score: the result of the vector search, which most commonly uses a similarity metric such as cosine distance.

The second method uses the vector search technique as usual and applies a filter based on your keywords on top of the metadata of the retrieved results. → This is also known as filtered vector search.

In this use case, the similarity score is not changed based on the provided keywords. It is just a fancy name for a simple filter applied to the metadata of your vectors.

But it is essential to understand the difference between the first and second methods:

- the first method combines the similarity scores between the keywords and vectors using the alpha parameter;
- the second method is a simple filter on top of your vector search.

How does this fit into our architecture?

Remember that during the self-query step, we extracted the author_id as an exact field that we have to match. Thus, we will search for the author_id using the keyword search algorithm and attach it to the 5 queries generated by the query expansion step.

As we want the most relevant chunks from a given author, it makes the most sense to use a filter on the author_id, as follows (filtered vector search) ↓

self._qdrant_client.search(
    collection_name=\"vector_posts\",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key=\"author_id\",
                match=models.MatchValue(
                    value=metadata_filter_value,
                ),
            )
        ]
    ),
    query_vector=self._embedder.encode(generated_query).tolist(),
    limit=k,
)

Note that we can easily extend this with multiple keywords (e.g., tags), making the combination of self-query and hybrid search a powerful retrieval duo.

The only question you have to ask yourself is whether we want to use a simple vector search filter or the more complex hybrid search strategy.

Note that LangChain's SelfQueryRetriever class combines the self-query and hybrid search techniques behind the scenes, as can be seen in their Qdrant example [8]. That is why we wanted to build everything from scratch.
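For completeness, the score-merging method above boils down to a one-line formula. The snippet below only illustrates it (it assumes both scores are already normalized to [0, 1]) and is not part of the course code ↓

def hybrid_score(sparse_score: float, dense_score: float, alpha: float) -> float:
    # alpha = 1.0 -> pure vector search, alpha = 0.0 -> pure keyword search.
    assert 0.0 <= alpha <= 1.0
    return (1 - alpha) * sparse_score + alpha * dense_score


print(hybrid_score(sparse_score=0.8, dense_score=0.6, alpha=0.5))  # 0.7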
6. Implement the advanced retrieval Python class

Now that you've understood the advanced retrieval optimization techniques we're using, let's combine them into a Python retrieval class.

Here is what the main retriever function looks like ↓

VectorRetriever: main retriever function → GitHub ←

Using a Python ThreadPoolExecutor is extremely powerful for addressing I/O bottlenecks, as these types of operations are not blocked by Python's GIL limitations.

Here is how we wrapped every advanced retrieval step into its own class ↓

Query expansion chains wrapper → GitHub ←

The SelfQuery class looks very similar — 🔗 access it here [1] ←

Now the final step is to call Qdrant for each query generated by the query expansion step ↓

VectorRetriever: main search function → GitHub ←

Note that we have 3 types of data: posts, articles, and code repositories. Thus, we have to make a query for each collection and combine the results in the end.

The most performant method is to use multi-indexing techniques, which allow you to query multiple types of data at once. But at the time I am writing this article, this is not a solved problem at the production level. Thus, we gathered data from each collection individually and kept the best-retrieved results using rerank, which is the final step of the article.
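Because every (query, collection) pair is an independent network call, a thread pool hides the I/O latency nicely. Here is a minimal sketch of that idea, where search_collection stands in for the Qdrant query shown earlier (the function and variable names are hypothetical) ↓

from concurrent.futures import ThreadPoolExecutor


def search_collection(collection_name: str, query: str) -> list[dict]:
    # Placeholder: the real implementation calls qdrant_client.search(...) as shown above.
    return []


def search_all(expanded_queries: list[str], collections: list[str]) -> list[dict]:
    # One task per (collection, query) pair; threads are fine here because the
    # work is network I/O, so Python's GIL is not the bottleneck.
    tasks = [(collection, query) for collection in collections for query in expanded_queries]
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(lambda args: search_collection(*args), tasks))
    return [hit for batch in results for hit in batch]


hits = search_all(['query v1', 'query v2'], ['vector_posts', 'vector_articles', 'vector_repositories'])

The flattened list of hits is exactly what gets passed to the rerank step described next.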
7. Post-retrieval optimization: Rerank using GPT-4

We made a separate search in the Qdrant vector DB for each of the N prompts generated by the query expansion step. Each search returns K results. Thus, we end up with N x K chunks. In our particular case, N = 5 & K = 3, so we end up with 15 chunks.

Post-retrieval optimization: rerank

The problem

The retrieved context may contain irrelevant chunks that:

- add noise: the retrieved context might be irrelevant
- make the prompt bigger: this results in higher costs, and the LLM is usually biased toward looking only at the first and last pieces of context. Thus, if you add a big context, there is a big chance it will miss the essence.
- are unaligned with your question: the chunks are retrieved based on the similarity between the query and chunk embeddings. The issue is that the embedding model is not tuned to your particular question, which might result in high similarity scores that are not 100% relevant to your question.

The solution

We will use rerank to order all the N x K chunks based on their relevance relative to the initial question, where the first one will be the most relevant and the last chunk the least. Ultimately, we will pick the TOP K most relevant chunks.

Rerank works really well when combined with query expansion. A natural flow when using rerank is as follows:

Search for >K chunks >>> Reorder using rerank >>> Take top K

Thus, when combined with query expansion, we gather potentially useful context from multiple points in space rather than just looking for more than K samples in a single location. Now the flow looks like:

Search for N x K chunks >>> Reorder using rerank >>> Take top K

A typical re-ranking solution uses open-source Cross-Encoder models from sentence-transformers [4]. These solutions take both the question and the context as input and return a score from 0 to 1.

In this article, we want to take a different approach and use GPT-4 + prompt engineering as our reranker.

If you want to see how to apply rerank using open-source algorithms, check out this hands-on article from Decoding ML:

→ A Real-time Retrieval System for RAG on Social Media Data — Use a streaming engine to populate a vector DB in real-time. Improve RAG accuracy using rerank & UMAP. (medium.com)

Now let's see our implementation using GPT-4 & prompt engineering.

Similar to what we did for the expansion and self-query chains, we define a template and a chain builder ↓

Rerank chain → GitHub ←

Here is how we integrate the rerank chain into the retriever:

Retriever: rerank step → GitHub ←

… and that's it!

Note that this is an experimental process. Thus, you can further tune your prompts for better results, but the primary idea is the same.

8. How to use the retrieval

The last step is to run the whole thing. But there is a catch.

As we said in the beginning, the retriever will not be used as a standalone component in the LLM system. It will be used as a layer between the data and the Qdrant vector DB by the:

- training pipeline, to retrieve raw data for fine-tuning (we haven't shown that as it's a straightforward search operation — no RAG involved)
- inference pipeline, to do RAG

→ That is why, for this lesson, there is no infrastructure involved!

But, to test the retrieval, we wrote a simple script ↓

Retriever testing entry point → GitHub ←

Look at how easy it is to call the whole chain with our custom retriever — no fancy LangChain involved!

Now, to call this script, run the following Make command:

make local-test-retriever

… and that's it!

In future lessons, we will learn to integrate it into the training & inference pipelines.

→ Check out the LLM Twin GitHub repository and try it yourself! … Of course, don't forget to give it a ⭐️ to stay updated with the latest changes.

Conclusion

Congratulations!

In Lesson 5, you learned to build an advanced RAG retrieval module optimized for searching posts, articles, and code repositories from a Qdrant vector DB.

First, you learned about where the RAG pipeline can be optimized:

- pre-retrieval
- retrieval
- post-retrieval

Then, you learned how to build from scratch (without using LangChain's utilities) the following advanced RAG retrieval & post-retrieval optimization techniques:

- query expansion
- self query
- hybrid search
- rerank

Ultimately, you understood where the retrieval component sits in a production RAG LLM system, where the code is shared between multiple microservices and doesn't sit in a single Notebook.

In Lesson 6, we will move to the training pipeline and show you how to automatically transform the data crawled from LinkedIn, Substack, Medium, and GitHub into an instruction dataset using GPT-4 to fine-tune your LLM Twin.

See you there! 🤗

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE ↓
→ decodingml.substack.com

References

Literature

[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] Bytewax, Bytewax Landing Page
[3] Qdrant, Qdrant Documentation
[4] Retrieve & Re-Rank, Sentence Transformers Documentation
[5] MultiQueryRetriever, LangChain's Documentation
[6] Self-querying, LangChain's Documentation
[7] Okapi BM25, Wikipedia
[8] Qdrant Self Query Example, LangChain's Documentation

Images

If not otherwise stated, all images are created by the author."
}, "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/the-4-advanced-rag-algorithms-you-must-know-to-implement-5d0c7f1199d2" }, { "id": "597ead2d-ae88-43f9-945d-d974630e858a", "content": { "Title": "Architect scalable and cost-effective LLM & RAG inference pipelines", "Subtitle": "Design, build and deploy RAG inference pipeline using LLMOps best practices.", "Content": "Architect LLM & RAG inference pipelines | Decoding MLOpen in appSign upSign inWriteSign upSign inLLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICAArchitect scalable and cost-effective LLM & RAG inference pipelinesDesign, build and deploy RAG inference pipeline using LLMOps best practices.Paul Iusztin\u00b7FollowPublished inDecoding ML\u00b717 min read\u00b7Jun 1, 20245601ListenShare\u2192 the 9th out of 12 lessons of the LLM Twin free courseWhat is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.Image by DALL-EWhy is this course different?By finishing the \u201cLLM Twin: Building Your Production-Ready AI Replica\u201d free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.Why should you care? \ud83e\udef5\u2192 No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.What will you learn to build by the end of this course?You will learn how to architect and build a real-world LLM system from start to finish \u2014 from data collection to deployment.You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.The end goal? Build and deploy your own LLM twin.The architecture of the LLM twin is split into 4 Python microservices:the data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. (deployed on AWS)the feature pipeline: consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real-time. (deployed on AWS)the training pipeline: create a custom dataset based on your digital data. Fine-tune an LLM using QLoRA. Use Comet ML\u2019s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet\u2019s model registry. (deployed on Qwak)the inference pipeline: load and quantize the fine-tuned LLM from Comet\u2019s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet\u2019s prompt monitoring dashboard. 
(deployed on Qwak)LLM twin system architecture [Image by the Author]Along the 4 microservices, you will learn to integrate 3 serverless tools:Comet ML as your ML Platform;Qdrant as your vector DB;Qwak as your ML infrastructure;Who is this for?Audience: MLE, DE, DS, or SWE who want to learn to engineer production-ready LLM systems using LLMOps good principles.Level: intermediatePrerequisites: basic knowledge of Python, ML, and the cloudHow will you learn?The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system.Also, it includes 2 bonus lessons on how to improve the RAG system.You can read everything at your own pace.\u2192 To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.Costs?The articles and code are completely free. They will always remain free.But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.For the other serverless tools (Qdrant, Comet), we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by:Paul Iusztin | Senior ML & MLOps EngineerAlex Vesa | Senior AI EngineerAlex Razvant | Senior ML & MLOps Engineer\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0fLessons\u2192 Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson:An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture: Enabling Event-Driven ArchitecturesSOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine-Tuning LLMsHow to fine-tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine-Tuned LLMsArchitect scalable and cost-effective LLM & RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code[Bonus] Build Multi-Index Advanced RAG AppsTo better understand the course\u2019s goal, technical details, and system design \u2192 Check out Lesson 1Let\u2019s start with Lesson 9 \u2193\u2193\u2193Lesson 9: Architect scalable and cost-effective LLM & RAG inference pipelinesIn Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM twin system.First, we will design and implement a scalable LLM & RAG inference pipeline based on microservices, separating the ML and business logic into two layers.Secondly, we will use Comet ML to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis.Ultimately, we will deploy the inference pipeline to Qwak and make the LLM twin service available worldwide.\u2192 Context from previous lessons. 
What you must know.This lesson is part of a more extensive series in which we learn to build an end-to-end LLM system using LLMOps best practices.In Lesson 4, we populated a Qdrant vector DB with cleaned, chunked, and embedded digital data (posts, articles, and code snippets).In Lesson 5, we implemented the advanced RAG retrieval module to query relevant digital data. Here, we will learn to integrate it into the final inference pipeline.In Lesson 7, we used Qwak to build a training pipeline to fine-tune an open-source LLM on our custom digital data. The LLM weights are available in a model registry.In Lesson 8, we evaluated the fine-tuned LLM to ensure the production candidate behaves accordingly.So\u2026 What you must know from all of this?Don\u2019t worry. If you don\u2019t want to replicate the whole system, you can read this article independently from the previous lesson.Thus, the following assumptions are what you have to know. We have:a Qdrant vector DB populated with digital data (posts, articles, and code snippets)a vector DB retrieval module to do advanced RAGa fine-tuned open-source LLM available in a model registry from Comet ML\u2192 In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud.Architect scalable and cost-effective LLM & RAG inference pipelinesTable of ContentsThe architecture of the inference pipelineThe training vs. the inference pipelineSettings Pydantic classThe RAG business moduleThe LLM microservicePrompt monitoringDeploying and running the inference pipelineConclusion\ud83d\udd17 Check out the code on GitHub [1] and support us with a \u2b50\ufe0f1. The architecture of the inference pipelineOur inference pipeline contains the following core elements:a fine-tuned LLMa RAG modulea monitoring serviceLet\u2019s see how to hook these into a scalable and modular system.The interface of the inference pipelineAs we follow the feature/training/inference (FTI) pipeline architecture, the communication between the 3 core components is clear.Our LLM inference pipeline needs 2 things:a fine-tuned LLM: pulled from the model registryfeatures for RAG: pulled from a vector DB (which we modeled as a logical feature store)This perfectly aligns with the FTI architecture.\u2192 If you are unfamiliar with the FTI pipeline architecture, we recommend you review Lesson 1\u2019s section on the 3-pipeline architecture.Monolithic vs. microservice inference pipelinesUsually, the inference steps can be split into 2 big layers:the LLM service: where the actual inference is being donethe business service: domain-specific logicWe can design our inference pipeline in 2 ways.Option 1: Monolithic LLM & business serviceIn a monolithic scenario, we implement everything into a single service.Pros:easy to implementeasy to maintainCons:harder to scale horizontally based on the specific requirements of each componentharder to split the work between multiple teamsnot being able to use different tech stacks for the two servicesMonolithic vs. 
microservice inference pipelines

Option 2: Different LLM & business microservices

The LLM and business services are implemented as two different components that communicate with each other over the network, using protocols such as REST or gRPC.

Pros:

- each component can scale horizontally individually
- each component can use the best tech stack at hand

Cons:

- harder to deploy
- harder to maintain

Let's focus on the "each component can scale individually" part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do its job with only a CPU.

As LLM inference takes longer, you will often need more LLM service replicas to meet the demand. But remember that GPU VMs are really expensive. By decoupling the 2 components, you will run only what is required on the GPU machine and not block the GPU VM with other computing that can quickly be done on a much cheaper machine.

Thus, by decoupling the components, you can scale horizontally as required, with minimal costs, providing a cost-effective solution to your system's needs.

Microservice architecture of the LLM twin inference pipeline

Let's understand how we applied the microservice pattern to our concrete LLM twin inference pipeline.

As explained in the sections above, we have the following components:

- a business microservice
- an LLM microservice
- a prompt monitoring microservice

The business microservice is implemented as a Python module that:

- contains the advanced RAG logic, which calls the vector DB and the GPT-4 API for advanced RAG operations;
- calls the LLM microservice through a REST API, using a prompt computed from the user's query and the retrieved context;
- sends the prompt and the answer generated by the LLM to the prompt monitoring microservice.

As you can see, the business microservice is light. It glues all the domain steps together and delegates the computation to other services.

The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application. However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud.

Microservice architecture of the LLM twin inference pipeline

The LLM microservice is deployed on Qwak. This component is entirely focused on hosting and calling the LLM. It runs on powerful GPU-enabled machines.

How does the LLM microservice work?

- It loads the fine-tuned LLM twin model from Comet's model registry [2].
- It exposes a REST API that takes in prompts and outputs the generated answer.
- When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer.

That's it!

The prompt monitoring microservice is based on Comet ML's LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM.

Remember that a prompt can get quite complex. When building complex LLM apps, the prompt usually results from a chain containing other prompts, templates, variables, and metadata.

Thus, a prompt monitoring service, such as the one provided by Comet ML, differs from a standard logging service. It allows you to quickly dissect the prompt and understand how it was created. Also, by attaching metadata to it, such as the latency of the generated answer and the cost to generate the answer, you can quickly analyze and optimize your prompts.
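To make the decoupling above concrete, all the business layer really needs is a thin HTTP call to the LLM microservice. Here is a minimal, hypothetical sketch using the requests library; the URL and payload shape are assumptions for this sketch (the course itself goes through Qwak's client, shown later) ↓

import requests

LLM_MICROSERVICE_URL = 'http://llm-twin-service:8000/predict'  # hypothetical endpoint


def generate_answer(prompt: str, timeout_s: int = 60) -> str:
    # The business layer stays on cheap CPU machines; all GPU work happens behind this call.
    response = requests.post(LLM_MICROSERVICE_URL, json={'instruction': prompt}, timeout=timeout_s)
    response.raise_for_status()
    return response.json()['content']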
2. The training vs. the inference pipeline

Before diving into the code, let's quickly clarify the difference between the training and inference pipelines.

Along with the apparent reason that the training pipeline takes care of training while the inference pipeline takes care of inference (Duh!), there are some critical differences you have to understand.

The input of the pipeline & how the data is accessed

Do you remember our logical feature store based on the Qdrant vector DB and Comet ML artifacts? If not, consider checking out Lesson 6 for a refresher.

The core idea is that during training, the data is accessed from offline data storage in batch mode, optimized for throughput and data lineage. Our LLM twin architecture uses Comet ML artifacts to access, version, and track all our data. The data is accessed in batches and fed to the training loop.

During inference, you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove. During inference, you don't care about data versioning and lineage. You just want to access your features quickly for a good user experience. The data comes directly from the user and is sent to the inference logic.

The training vs. the inference pipeline

The output of the pipeline

The training pipeline's final output is the trained weights stored in Comet's model registry. The inference pipeline's final output is the predictions served directly to the user.

The infrastructure

The training pipeline requires more powerful machines with as many GPUs as possible.

Why? During training, you batch your data and have to hold in memory all the gradients required for the optimization steps. Because of the optimization algorithm, training is more compute-hungry than inference. Thus, more compute and VRAM result in bigger batches, which means less training time and more experiments.

The inference pipeline can do the job with less computation. During inference, you often pass a single sample or smaller batches to the model. If you run a batch pipeline, you will still pass batches to the model but won't perform any optimization steps. If you run a real-time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step.

Are there any overlaps?

Yes! This is where the training-serving skew comes in. During training and inference, you must carefully apply the same preprocessing and postprocessing steps. If the preprocessing and postprocessing functions or hyperparameters don't match, you will end up with the training-serving skew problem.

Enough with the theory. Let's dig into the RAG business microservice ↓

3. Settings Pydantic class

First, let's understand how we defined the settings to configure the inference pipeline components.

We used pydantic_settings and inherited its BaseSettings class. This approach lets us quickly define a set of default settings variables and load sensitive values such as the API key from a .env file.

from pydantic_settings import BaseSettings, SettingsConfigDict


class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=\".env\", env_file_encoding=\"utf-8\")
    ...  # Settings.

    # CometML config
    COMET_API_KEY: str
    COMET_WORKSPACE: str
    COMET_PROJECT: str = \"llm-twin-course\"
    ...  # More settings.


settings = AppSettings()

All the variables called settings.* (e.g., settings.COMET_API_KEY) come from this class.
4. The RAG business module

We will define the RAG business module under the LLMTwin class. The LLM twin logic is directly correlated with our business logic.

We don't have to introduce the word "business" in the naming convention of the classes. What we presented so far was used for a clear separation of concerns between the LLM and business layers.

Initially, within the LLMTwin class, we define all the clients we need for our business logic ↓

Inference pipeline business module: __init__() method → GitHub ←

Now let's dig into the generate() method, where we:

- call the RAG module;
- create the prompt using the prompt template, query and context;
- call the LLM microservice;
- log the prompt, prompt template, and answer to Comet ML's prompt monitoring service.

Inference pipeline business module: generate() method → GitHub ←

Now, let's look at the complete code of the generate() method. It's the same thing as what we presented above, but with all the nitty-gritty details.

class LLMTwin:
    def __init__(self) -> None:
        ...

    def generate(
        self,
        query: str,
        enable_rag: bool = True,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            \"question\": query,
        }

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
            prompt_template_variables[\"context\"] = context
            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{\"instruction\": prompt}]).to_json()
        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0][\"content\"][0]

        if enable_monitoring is True:
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )

        return {\"answer\": answer}

Let's look at how our LLM microservice is implemented using Qwak.

5. The LLM microservice

As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions:

- initialize_model(): where we load the fine-tuned model from the model registry at serving time
- schema(): where we define the input and output schema
- predict(): where we implement the actual inference logic

Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.

LLM microservice → GitHub ←

Let's zoom into the implementation and the life cycle of the Qwak model.

The schema() method is used to define what the input and output of the predict() method look like. This will automatically validate the structure and type of the predict() method.
Let's look at how our LLM microservice is implemented using Qwak.

### 5. The LLM microservice

As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions:

 * initialize_model(): where we load the fine-tuned model from the model registry at serving time
 * schema(): where we define the input and output schema
 * predict(): where we implement the actual inference logic

Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.

LLM microservice → GitHub ←

Let's zoom into the implementation and the life cycle of the Qwak model.

The schema() method defines how the input and output of the predict() method should look. It automatically validates the structure and type of the predict() method's input. For example, the LLM microservice will throw an error if the variable instruction is a JSON instead of a string.

The other Qwak-specific methods are called in the following order:

 1. __init__() → when deploying the model
 2. initialize_model() → when deploying the model
 3. predict() → on every request to the LLM microservice

Note that these methods are called only during serving time (and not during training).

Qwak exposes your model as a RESTful API, where the predict() method is called on each request.

Inside the prediction method, we perform the following steps:

 * map the input text to token IDs using the LLM-specific tokenizer
 * move the token IDs to the provided device (GPU or CPU)
 * pass the token IDs to the LLM and generate the answer
 * extract only the generated tokens from the generated_ids variable by slicing it using the shape of the input_ids
 * decode the generated_ids back to text
 * return the generated text

Here is the complete code for the implementation of the Qwak LLM microservice:

```python
# The Qwak-specific imports (QwakModel, ModelSchema, RequestInput, InferenceOutput,
# DefaultOutputAdapter) and the project helpers (build_qlora_model, settings) are
# omitted in this excerpt.


class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        use_experiment_tracker: bool = True,
        register_model_to_model_registry: bool = True,
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        fine_tuned_llm_twin_model_type: str = settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        dataset_artifact_name: str = settings.DATASET_ARTIFACT_NAME,
        config_file: str = settings.CONFIG_FILE,
        model_save_dir: str = settings.MODEL_SAVE_DIR,
    ) -> None:
        self.use_experiment_tracker = use_experiment_tracker
        self.register_model_to_model_registry = register_model_to_model_registry
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.fine_tuned_llm_twin_model_type = fine_tuned_llm_twin_model_type
        self.dataset_artifact_name = dataset_artifact_name
        self.training_args_config_file = config_file

    def build(self) -> None:
        # Training logic
        ...

    def initialize_model(self) -> None:
        # self.nf4_config, self.qlora_config and self.device are set elsewhere
        # in the full implementation.
        self.model, self.tokenizer, _ = build_qlora_model(
            pretrained_model_name_or_path=self.model_type,
            peft_pretrained_model_name_or_path=self.fine_tuned_llm_twin_model_type,
            bnb_config=self.nf4_config,
            lora_config=self.qlora_config,
            cache_dir=settings.CACHE_DIR,
        )
        self.model = self.model.to(self.device)

        logging.info(f"Successfully loaded model from {self.model_save_dir}")

    def schema(self) -> ModelSchema:
        return ModelSchema(
            inputs=[RequestInput(name="instruction", type=str)],
            outputs=[InferenceOutput(name="content", type=str)],
        )

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df) -> pd.DataFrame:
        input_text = list(df["instruction"].values)
        input_ids = self.tokenizer(
            input_text, return_tensors="pt", add_special_tokens=True
        )
        input_ids = input_ids.to(self.device)

        generated_ids = self.model.generate(
            **input_ids,
            max_new_tokens=500,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Keep only the newly generated tokens by slicing off the prompt.
        answer_start_idx = input_ids["input_ids"].shape[1]
        generated_answer_ids = generated_ids[:, answer_start_idx:]
        decoded_output = self.tokenizer.batch_decode(generated_answer_ids)[0]

        return pd.DataFrame([{"content": decoded_output}])
```
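The only subtle step in predict() is the slicing that strips the echoed prompt from the output. Here is the same pattern in isolation, as a generic transformers sketch that is independent of Qwak and of the course's code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

inputs = tokenizer(["Write a short post about CDC."], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)

prompt_len = inputs["input_ids"].shape[1]
answer_only = generated_ids[:, prompt_len:]  # drop the prompt tokens, keep the answer
print(tokenizer.batch_decode(answer_only, skip_special_tokens=True)[0])
```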
The settings used in the CopywriterMistralModel code have the following values:

```python
class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ...  # Other settings.

    DATASET_ARTIFACT_NAME: str = "posts-instruct-dataset"
    FINE_TUNED_LLM_TWIN_MODEL_TYPE: str = "decodingml/llm-twin:1.0.0"
    CONFIG_FILE: str = "./finetuning/config.yaml"
    MODEL_SAVE_DIR: str = "./training_pipeline_output"
    CACHE_DIR: Path = Path("./.cache")
```

The most important one is the FINE_TUNED_LLM_TWIN_MODEL_TYPE setting, which reflects what model and version to load from the model registry.

Access the code 🔗 here ←

The final step is to look at Comet's prompt monitoring service ↓

### 6. Prompt monitoring

Comet makes prompt monitoring straightforward. There is just one API call where you connect to your project and workspace and send the following to a single function:

 * the prompt and LLM output
 * the prompt template and variables that created the final output
 * your custom metadata specific to your use case (here, you add information about the model, prompt token count, token generation costs, latency, etc.)

Prompt monitoring service → GitHub ←

Let's look at the logs in Comet ML's LLMOps dashboard. Here is how you can quickly access them ↓

 1. Log in to Comet (or create an account).
 2. Go to your workspace.
 3. Access the project with the "LLM" symbol attached to it. In our case, this is the "llm-twin-course-monitoring" project.

Note: Comet ML provides a free version which is enough to run these examples.

Screenshot from Comet ML's dashboard

This is how Comet ML's prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM ↓

You can click on any prompt and see everything we logged programmatically using the PromptMonitoringManager class.

Screenshot from Comet ML's dashboard

Besides what we logged, adding various tags and the inference duration can be valuable.
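For reference, a minimal sketch of such a single logging call with Comet's comet_llm package might look like the following. The metadata keys and the latency variable are illustrative, and the parameter names are close to, but not guaranteed to match, the course's PromptMonitoringManager wrapper:

```python
import comet_llm

comet_llm.log_prompt(
    project="llm-twin-course-monitoring",
    prompt=prompt,
    prompt_template=prompt_template.template,
    prompt_template_variables=prompt_template_variables,
    output=answer,
    metadata={
        "model": settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        "prompt_tokens": len(prompt.split()),  # rough proxy for the token count
        "latency_seconds": latency,            # measured around the LLM call (hypothetical variable)
    },
)
```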
### 7. Deploying and running the inference pipeline

Qwak makes the deployment of the LLM microservice straightforward.

During Lesson 7, we fine-tuned the LLM and built the Qwak model. As a quick refresher, we ran the following CLI command to build the Qwak model, where we used the build_config.yaml file with the build configuration:

```bash
poetry run qwak models build -f build_config.yaml .
```

After the build is finished, we can make various deployments based on it. For example, we can deploy the LLM microservice using the following Qwak command:

```bash
qwak models deploy realtime \
  --model-id "llm_twin" \
  --instance "gpu.a10.2xl" \
  --timeout 50000 \
  --replicas 2 \
  --server-workers 2
```

We deployed two replicas of the LLM twin. Each replica has access to a machine with one A10 GPU, and each replica has two workers running on it.

🔗 More on Qwak instance types ←

Two replicas and two workers result in 4 microservices that run in parallel and can serve our users. You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by listening to the consumption of GPU, CPU or RAM.

To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies.

You can quickly close the deployment by running the following:

```bash
qwak models undeploy --model-id "llm_twin"
```

We strongly recommend closing down the deployment when you are done, as GPU VMs are expensive.

To run the LLM system with a predefined prompt example, you have to run the following Python file:

```bash
poetry run python main.py
```

Within the main.py file, we call the LLMTwin class, which calls the other services as explained during this lesson.

Note: The → complete installation & usage instructions ← are available in the README of the GitHub repository.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

### Conclusion

Congratulations! You are close to the end of the LLM twin series.

In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems.

First, you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the differences in designing the training and inference pipelines.

Secondly, we walked you through implementing the RAG business module and the LLM twin microservice. We also showed you how to log all the prompts, answers, and metadata to Comet's prompt monitoring service.

Ultimately, we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.

In Lesson 10, we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLMs' answers relative to the query and context.

See you there! 🤗

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Enjoyed This Article?

Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML & MLOps systems. Every week. For FREE ↓

### References

Literature

[1] Your LLM Twin Course - GitHub Repository (2024), Decoding ML GitHub Organization

[2] Add your models to Model Registry (2024), Comet ML Guides

Images

If not otherwise stated, all images are created by the author.
" }, "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/architect-scalable-and-cost-effective-llm-rag-inference-pipelines-73b94ef82a99" }, { "id": "d39ca560-21bf-4a6c-a080-064b1ad7996a", "content": { "Title": "Real-time feature pipelines for RAG - by Paul Iusztin", "Subtitle": "RAG hybrid search with transformers-based sparse vectors. CDC tech stack for event-driven architectures.", "Content": "

# Real-time feature pipelines for RAG

### RAG hybrid search with transformers-based sparse vectors. CDC tech stack for event-driven architectures.
Paul Iusztin

Aug 17, 2024

### **This week's topics:**

 * CDC tech stack for event-driven architectures
 * Real-time feature pipelines with CDC
 * RAG hybrid search with transformers-based sparse vectors

* * *

### CDC tech stack for event-driven architectures

Here is the tech stack used to build a Change Data Capture (CDC) component for implementing an event-driven architecture in our LLM Twin course.

**What is Change Data Capture (CDC)?**

The purpose of CDC is to capture insertions, updates, and deletions applied to a database and to make this change data available in a format easily consumable by downstream applications.

**Why do we need the CDC pattern?**

 * Real-time Data Syncing
 * Efficient Data Pipelines
 * Minimized System Impact
 * Event-Driven Architectures

**What do we need for an end-to-end implementation of CDC?**

We will take the tech stack used in our LLM Twin course as an example, where...
...we built a feature pipeline to gather cleaned data for fine-tuning and chunked & embedded data for RAG.

Everything will be done only in Python!

Here they are ↓↓↓

1. The source database: MongoDB (it also works for most databases such as MySQL, PostgreSQL, Oracle, etc.)

2. A tool to monitor the transaction log: MongoDB Watcher (Debezium is also a popular & scalable solution)

3. A distributed queue: RabbitMQ (another popular option is to use Kafka, but it was overkill in our use case)

4. A streaming engine: Bytewax (a great streaming engine for the Python ecosystem)

5. The destination database: Qdrant (this works with any other database, but we needed a vector DB to store our data for fine-tuning and RAG)

For example, here is how a WRITE operation will be processed:

1. Write a post to the MongoDB warehouse.
2. A "create" operation is logged in the transaction log of Mongo.
3. The MongoDB watcher captures this and emits it to the RabbitMQ queue.
4. The Bytewax streaming pipeline reads the event from the queue.
5. It cleans, chunks, and embeds it right away, in real time!
6. The cleaned & embedded version of the post is written to Qdrant.
* * *

### Real-time feature pipelines with CDC

How to implement CDC to sync your data warehouse and feature store using a RabbitMQ queue and a Bytewax streaming engine ↓

First, let's understand where you need to implement the Change Data Capture (CDC) pattern:

CDC is used when you want to sync 2 databases.

The destination can be a complete replica of the source database (e.g., one for transactional and the other for analytical applications)...

...or you can process the data from the source database before loading it to the destination DB (e.g., retrieve various documents and chunk & embed them for RAG).

That's what I am going to show you:

**How** to **use CDC** to **sync** a **MongoDB** & **Qdrant vector DB** to streamline real-time documents that must be ready for fine-tuning LLMs and RAG.

**MongoDB** is our data warehouse.

**Qdrant** is our logical feature store.
Here is the implementation of the CDC pattern:

1. Use Mongo's watch() method to listen for CRUD transactions.

2. For example, on a CREATE operation, along with saving it to Mongo, the watch() method will trigger a change and return a JSON with all the information.

3. We standardize the JSON into our desired structure.

4. We stringify the JSON and publish it to the RabbitMQ queue.

**How do we scale?**

→ You can use Debezium instead of Mongo's watch() method for scaling up the system, but the idea remains the same.

→ You can swap RabbitMQ with Kafka, but RabbitMQ can get you far.

**Now, what happens on the other side of the queue?**

You have a Bytewax streaming pipeline, 100% written in Python, that:

5. Listens in real time to new messages from the RabbitMQ queue.

6. Cleans, chunks, and embeds the events on the fly.

7. Loads the data to Qdrant for LLM fine-tuning & RAG.

MongoDB CDC example
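To make the producer side of this flow concrete, here is a minimal sketch of steps 1-4 using pymongo and pika. The database, queue, and field names are illustrative, not the course's exact code, and real implementations would add error handling and reconnection logic.

```python
import json

import pika
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="mongo_changes")  # illustrative queue name

# 1. Listen for CRUD transactions on the whole database (illustrative DB name).
with mongo["llm_twin"].watch() as stream:
    for change in stream:
        # 2. The change event describes the operation that just happened.
        # 3. Standardize it into our desired structure.
        event = {
            "operation": change["operationType"],
            "collection": change["ns"]["coll"],
            "document": change.get("fullDocument", {}),
        }
        # 4. Stringify the JSON and publish it to the RabbitMQ queue.
        channel.basic_publish(
            exchange="",
            routing_key="mongo_changes",
            body=json.dumps(event, default=str),  # default=str handles ObjectId/datetime
        )
```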
> Do you want to check out the full code?
>
> ...or even an entire article about CDC?
>
> The CDC component is part of the LLM Twin FREE course, made by Decoding ML.
>
> ↓↓↓
>
> 🔗 Lesson 3: Change Data Capture: Enabling Event-Driven Architectures
>
> 🔗 GitHub

* * *

### RAG hybrid search with transformers-based sparse vectors

Hybrid search is standard in advanced RAG systems. The trick is to compute suitable sparse vectors for it. Here is an article that shows how to use SPLADE to compute sparse vectors with transformers and integrate them into a hybrid search algorithm using Qdrant.

**Why bother with sparse vectors when we have dense vectors (embeddings)?**

Sparse vectors represent data by highlighting only the most relevant features (like keywords), significantly reducing memory usage compared to dense vectors.

Also, sparse vectors work great at finding specific keywords, which is why they combine so well with dense vectors, which find semantic similarity but not particular words.
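As a rough illustration of how a transformer produces such a sparse vector, here is a minimal SPLADE-style sketch with Hugging Face transformers. The checkpoint name and the aggregation follow the commonly published SPLADE recipe and are assumptions, not the linked article's exact code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # assumed SPLADE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def sparse_vector(text: str) -> dict[int, float]:
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits  # (1, seq_len, vocab_size)
    # SPLADE aggregation: ReLU + log saturation, then max-pool over the sequence.
    weights = torch.max(
        torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
        dim=1,
    ).values.squeeze(0)
    nonzero = weights.nonzero().squeeze(-1)
    # Vocabulary indices with non-zero weights act as the "expanded keywords" of the text.
    return {int(idx): float(weights[idx]) for idx in nonzero}

print(len(sparse_vector("Real-time feature pipelines for RAG")))
```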
The article highlights:

 * Sparse vs. dense vectors
 * How SPLADE works: the SPLADE model leverages sparse vectors to perform better than traditional methods like BM25 by computing them with transformer architectures.
 * Why SPLADE works: it expands terms based on context rather than just frequency, offering a nuanced understanding of content relevancy.
 * How to implement hybrid search using SPLADE with Qdrant: step-by-step code.

Sparse vectors using transformers

Here is the article ↓↓↓

🔗 Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search

* * *

#### Images

If not otherwise stated, all images are created by the author.
", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/real-time-feature-pipelines-with?r=1ttoeh" }, { "id": "4271a54f-6239-4f50-97e6-b3fa3a9a2fbd", "content": { "Title": "Building ML System Using the FTI Architecture", "Subtitle": "Introduction to the feature/training/inference (FTI) design pattern to build scalable and modular ML systems using MLOps best practices.", "Content": "

# Building ML systems the right way using the FTI architecture

### The fundamentals of the FTI architecture that will help you build modular and scalable ML systems using MLOps best practices.

Paul Iusztin

Aug 10, 2024

The feature/training/inference (FTI) architecture builds scalable and modular ML systems using MLOps best practices.

We will start by discussing the problems of naively building ML systems. Then, we will examine other potential solutions and their problems.

Ultimately, we will present the feature/training/inference (FTI) design pattern and its benefits. We will also understand the benefits of using a feature store and model registry when architecting your ML system.

### The problem with building ML systems

Building production-ready ML systems is much more than just training a model. From an engineering point of view, training the model is the most straightforward step in most use cases.

However, training a model becomes complex when deciding on the correct architecture and hyperparameters. That's not an engineering problem but a research problem.

At this point, we want to focus on how to design a production-ready architecture. Training a model with high accuracy is extremely valuable, but just by training it on a static dataset, you are far from deploying it robustly. We have to consider how to:

 * ingest, clean and validate fresh data

 * handle training vs. inference setups

 * compute and serve features in the right environment

 * serve the model in a cost-effective way

 * version, track and share the datasets and models

 * monitor your infrastructure and models

 * deploy the model on a scalable infrastructure

 * automate the deployments and training

These are the types of problems an ML or MLOps engineer must consider, while the research or data science team is often responsible for training the model.

Figure 1: Components of an ML system. Photo from the Google Cloud Architecture documents

Figure 1 shows all the components the Google Cloud team suggests that a mature ML and MLOps system requires. Along with the ML code, there are many moving pieces. The rest of the system comprises configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring.
The point is that there are many components we must consider when productionizing an ML model.

_Thus, the **critical question** is: "How do we connect all these components into a single homogeneous system?"_

We must create a boilerplate for clearly designing ML systems to answer that question.

Similar solutions exist for classic software. For example, if you zoom out, most software applications can be split between a database, business logic and UI layer. Every layer can be as complex as needed, but at a high-level overview, the architecture of standard software can be boiled down to these three components.

Do we have something similar for ML applications? The first step is to examine previous solutions and why they are unsuitable for building scalable ML systems.

* * *

### **The issue with previous solutions**

In Figure 2, you can observe the typical architecture present in most ML applications. It is based on a monolithic batch architecture that couples the feature creation, model training, and inference into the same component.

By taking this approach, you quickly solve one critical problem in the ML world: the training-serving skew. The training-serving skew happens when the features passed to the model are computed differently at training and inference time. In this architecture, the features are created using the same code. Hence, the training-serving skew issue is solved by default.

This pattern works fine when working with small data. The pipeline runs on a schedule in batch mode, and the predictions are consumed by a third-party application such as a dashboard.

Figure 2: Monolithic batch pipeline architecture

Unfortunately, building a monolithic batch system raises many other issues, such as:

 * features are not reusable (by your system or others)

 * if the data increases, you have to refactor the whole code to support PySpark or Ray

 * hard to rewrite the prediction module in a more efficient language such as C++, Java or Rust

 * hard to share the work between multiple teams between the features, training, and prediction modules

 * impossible to switch to a streaming technology for real-time training

In Figure 3, we can see a similar scenario for a real-time system. This use case introduces another issue in addition to what we listed before. To make the predictions, we have to transfer the whole state through the client request so the features can be computed and passed to the model.

Consider the scenario of computing movie recommendations for a user. Instead of simply passing the user ID, we must transmit the entire user state, including their name, age, gender, movie history, and more. This approach is fraught with potential errors, as the client must understand how to access this state, and it's tightly coupled with the model service.

Another example would be when implementing an LLM with RAG support. The documents we add as context along the query represent our external state. If we didn't store the records in a vector DB, we would have to pass them with the user query. To do so, the client must know how to query and retrieve the documents, which is not feasible. It is an antipattern for the client application to know how to access or compute the features.
If you don't understand how RAG works, we will explain it in future chapters.

Figure 3: Stateless real-time architecture

In conclusion, our problem is accessing the features to make predictions without passing them at the client's request. For example, based on our first user movie recommendation example, how can we predict the recommendations solely based on the user's ID?

Remember these questions, as we will answer them shortly.

### **The solution: the FTI architecture**

The solution is based on creating a clear and straightforward mind map that any team or person can follow to compute the features, train the model, and make predictions.

Based on these three critical steps that any ML system requires, the pattern is known as the FTI (feature, training, inference) pipelines. So, how does this differ from what we presented before?

The pattern suggests that any ML system can be boiled down to these three pipelines: feature, training, and inference (similar to the database, business logic and UI layers from classic software).

This is powerful, as we can clearly define the scope and interface of each pipeline. Also, it's easier to understand how the three components interact.

As shown in Figure 4, we have the feature, training and inference pipelines. We will zoom in on each of them and understand their scope and interface.

Before going into the details, it is essential to understand that each pipeline is a different component that can run on a different process or hardware. Thus, each pipeline can be written using a different technology, by a different team, or scaled differently. The key idea is that the design is very flexible to the needs of your team. It acts as a mind map for structuring your architecture.

Figure 4: Feature/Training/Inference (FTI) pipelines architecture

#### The feature pipeline

The feature pipeline takes data as input and outputs the features & labels used to train the model.

Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features.

By saving the features into a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipeline(s).

As the data is versioned, we can always ensure that the training and inference time features match. Thus, we avoid the training-serving skew problem.

#### The training pipeline

The training pipeline takes the features and labels from the feature store as input and outputs a trained model or models.

The models are stored in a model registry. Its role is similar to that of feature stores, but this time, the model is the first-class citizen. Thus, the model registry will store, version, track, and share the model with the inference pipeline.

Also, most modern model registries support a metadata store that allows you to specify essential aspects of how the model was trained. The most important are the features, labels and their version used to train the model. Thus, we will always know what data the model was trained on.

#### The inference pipeline

The inference pipeline takes as input the features & labels from the feature store and the trained model from the model registry. With these two, predictions can be easily made in either batch or real-time mode.

As this is a versatile pattern, it is up to you to decide what you do with your predictions.
If it's a batch system, they will probably be stored in a database. If it's a real-time system, the predictions will be served to the client who requested them.

As the features, labels, and model are versioned, we can easily upgrade or roll back the deployment of the model. For example, we will always know that model v1 uses features F1, F2, and F3, and model v2 uses F2, F3, and F4. Thus, we can quickly change the connections between the model and features.

### Benefits of the FTI architecture

To conclude, the most important thing you must remember about the FTI pipelines is their interface:

 * The feature pipeline takes in data and outputs features & labels saved to the feature store.

 * The training pipeline queries the feature store for features & labels and outputs a model to the model registry.

 * The inference pipeline uses the features from the feature store and the model from the model registry to make predictions.

It doesn't matter how complex your ML system gets. These interfaces will remain the same.
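To make the three interfaces tangible, here is a minimal, hypothetical sketch in plain Python, with a dict standing in for the feature store and model registry. It illustrates the contract between the pipelines, not the book's or course's actual code.

```python
from typing import Any

feature_store: dict[str, Any] = {}
model_registry: dict[str, Any] = {}

def feature_pipeline(raw_data: list[dict]) -> None:
    """Takes in data, outputs features & labels saved to the feature store."""
    features = [{"x": len(row["text"]), "label": row["label"]} for row in raw_data]
    feature_store["features:v1"] = features

def training_pipeline() -> None:
    """Queries the feature store, outputs a model to the model registry."""
    features = feature_store["features:v1"]
    threshold = sum(f["x"] for f in features) / len(features)  # a trivial stand-in "model"
    model_registry["model:v1"] = {"threshold": threshold, "features_version": "features:v1"}

def inference_pipeline(x: int) -> int:
    """Uses the feature/model versions from the stores to make predictions."""
    model = model_registry["model:v1"]
    return int(x > model["threshold"])

feature_pipeline([{"text": "hello world", "label": 1}, {"text": "hi", "label": 0}])
training_pipeline()
print(inference_pipeline(x=8))
```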
Now that we better understand how the pattern works, we want to highlight the main benefits of using it:

 * as you have just three components, it is intuitive to use and easy to understand;

 * each component can be written in its own tech stack, so we can quickly adapt them to specific needs, such as big or streaming data. It also allows us to pick the best tools for the job;

 * as there is a transparent interface between the three components, each one can be developed by a different team (if necessary), making the development more manageable and scalable;

 * every component can be deployed, scaled, and monitored independently.

The final thing you must understand about the FTI pattern is that the system doesn't have to contain only three pipelines. In most cases, it will include more. For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can be composed of the training and evaluation components.

The FTI pipelines act as logical layers. Thus, it is perfectly fine for each to be complex and contain multiple services. However, what is essential is to stick to the same interface on how the FTI pipelines interact with each other through the feature store and model registries. By doing so, each FTI component can evolve differently, without knowing the details of the others and without breaking the system on new changes.

### Conclusion

In this article, we understood the fundamental problems when naively building ML systems.

We also looked at potential solutions and their downsides.

Ultimately, we presented the FTI architecture, its benefits, and how to apply it to modern ML systems.

* * *

> My _**latest book**, "LLM Engineer's Handbook,"_ inspired me to write this article.

If you liked this article, consider supporting me by buying my book and enjoy a lot more similar content compressed into a single book:

LLM Engineer's Handbook

LLM Engineer's Handbook Cover

* * *

### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/building-ml-systems-the-right-way?r=1ttoeh" }, { "id": "2ce3c5d1-730b-4258-88ab-07009eddaf33", "content": { "Title": "Reduce your PyTorch code latency by 82% - by Paul Iusztin", "Subtitle": "How not to optimize the inference of your DL models. Computer science is dead.", "Content": "

# Reduce your PyTorch code latency by 82%

### How not to optimize the inference of your DL models. Computer science is dead.

Paul Iusztin

Aug 03, 2024

_Decoding ML Notes_

### **This week's topics:**

 * Reduce the latency of your PyTorch code by 82%
 * How I failed to optimize the inference of my DL models
 * Computer science is dead

* * *

> New book on engineering end-to-end LLM systems, from data collection and fine-tuning to LLMOps (deployment, monitoring).

I kept this one a secret, but in the past months, in collaboration with Packt, Alex Vesa and Maxime Labonne, we started working on the LLM Engineer's Handbook.
A book that will walk you through everything you need to know to build a production-ready LLM project.

I am a big advocate of learning with hands-on examples while being anchored in real-world use cases. That is why this is not the standard theoretical book.

While reading the book, you will learn to build a complex LLM project: an LLM Twin. In contrast, theoretical aspects will back everything to understand why we make certain decisions.

However, our ultimate goal is to present a framework that can be applied to most LLM projects.

Here is a sneak peek of what you will learn within the LLM Engineer's Handbook:

 * collect unstructured data
 * create instruction datasets from raw data to fine-tune LLMs
 * SFT techniques such as LoRA and QLoRA
 * LLM evaluation techniques
 * preference alignment using DPO
 * inference optimization methods (key optimization, model parallelism, quantization, attention mechanisms)
 * advanced RAG algorithms using LangChain as our LLM framework and Qdrant as our vector DB
 * design LLM systems using the FTI architecture
 * use AWS SageMaker to fine-tune and deploy open-source LLMs
 * use ZenML to orchestrate all the pipelines and track the data as artifacts
 * LLMOps patterns such as CT/CI/CD pipelines, model registries and using Comet for experiment tracking and prompt monitoring

The book is still a work in progress, but we are very excited about it!

Thank you, Packt, for making this possible, and Maxime and Alex for this remarkable collaboration.

If you are curious, you can currently pre-order it from Amazon. The whole book should be released by the end of September 2024.
↓↓↓

🔗 LLM Engineer's Handbook: Master the art of engineering Large Language Models from concept to production

* * *

### Reduce the latency of your PyTorch code by 82%

This is how I reduced the latency of my PyTorch code by 82% using only Python & PyTorch. NO fancy tools involved!

**The problem?**

During inference, I am running 5 DL models on ~25k images at once. The script took around ~4 hours to run.

The problem is that this isn't a batch job that runs overnight... Various people across the company required it to run in "real time" multiple times a day.

**The solution?**

The first thing that might come to your mind is to start using some fancy optimizer (e.g., TensorRT). Even though that should be done at some point...

First, you should ask yourself:

 * I/O bottlenecks: reading & writing images
 * preprocessing & postprocessing: can it be parallelized?
 * are the CUDA cores used at their maximum potential?
 * is the bandwidth between the CPU & GPU throttled?
 * can we move more computation to the GPU?
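One way to start answering these questions (a generic sketch, not the code from this story) is to profile a few inference batches with torch.profiler and check where the time actually goes:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# A dummy model and batch just to illustrate the profiling pattern (requires a CUDA GPU).
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).cuda().eval()
batch = torch.randn(64, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            out = model(batch.cuda(non_blocking=True))  # includes the CPU -> GPU copy

# Large "Memcpy HtoD" or data-loading entries point to I/O and transfer bottlenecks,
# while low CUDA kernel time means the GPU is being starved.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```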
That being said...

Here is how I decreased the latency of the script by 82% ↓↓↓

**1. Batched the inference samples**

Batching is not only valuable for training but also mighty in speeding up your inference time. Otherwise, you waste your GPU CUDA cores. Instead of passing through the models one sample at a time, I now process 64.

**2. Leveraged PyTorch's DataLoader**

This has 2 main advantages:

 * parallel data loading & preprocessing on multiple processes (NOT threads)
 * copying your input images directly into pinned memory (avoiding an extra CPU -> CPU copy operation)

**3. Moved as much of the postprocessing as possible to the GPU**

I saw that the tensor was moved to the CPU too early and mapped to a NumPy array. I refactored the code to keep it on the GPU as much as possible, which had 2 main advantages:

 * tensors are processed faster on the GPU
 * at the end of the logic, I had smaller tensors, resulting in smaller transfers between the CPU & GPU

**4. Multithreading for all my I/O write operations**

For I/O bottlenecks, using Python threads is extremely powerful. I moved all my writes under a ThreadPoolExecutor, batching my write operations.

Note that I used only good old Python & PyTorch code.

→ When the code is poorly written, no tool can save you.

Only now is the time to add fancy tooling, such as TensorRT.

So remember... To optimize the PyTorch code by 82%:

1. Batch the inference samples
2. Leverage PyTorch's DataLoader
3. Move as much of the postprocessing as possible to the GPU
4. Use multithreading for all I/O write operations
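For reference, here is a minimal sketch of ideas 1, 2 and 4 combined (batching, a DataLoader with worker processes and pinned memory, and a thread pool for writes). The dataset, model, and file names are made up for illustration and a CUDA GPU is assumed; it is not the code from this story.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    def __init__(self, n: int = 25_000):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx  # stand-in for reading + preprocessing an image

loader = DataLoader(
    ImageDataset(),
    batch_size=64,     # 1. batch the inference samples
    num_workers=8,     # 2. parallel loading/preprocessing in separate processes
    pin_memory=True,   # 2. faster, async-friendly CPU -> GPU copies
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.AdaptiveAvgPool2d(1)).cuda().eval()
writer_pool = ThreadPoolExecutor(max_workers=4)  # 4. threads for I/O-bound writes

def write_results(batch_results: torch.Tensor, indices: torch.Tensor) -> None:
    # Stand-in for writing predictions to disk or a database.
    torch.save(batch_results, f"preds_{int(indices[0])}.pt")

with torch.no_grad():
    for images, indices in loader:
        preds = model(images.cuda(non_blocking=True))
        writer_pool.submit(write_results, preds.cpu(), indices)

writer_pool.shutdown(wait=True)
```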
What other methods do you have in mind? Leave them in the comments ↓

* * *

### How I failed to optimize the inference of my DL models

This is how I FAILED to optimize the inference of my DL models when running them on an Nvidia GPU. Let me tell you what to avoid ↓

I had a simple task: to reduce the latency of the DL models used in production. We had 4 DL models that were running on Nvidia GPUs.

After a first look at the inference code, I saw that the inputs to the models weren't batched. We were processing one sample at a time.

I said to myself: "Aha! That's it. I cracked it. We just have to batch as many samples as possible, and we are done."

So, I did just that... After 2-3 days of work adding the extra batch dimension to the PyTorch preprocessing & postprocessing code, I realized I WAS WRONG.

Here is why ↓↓↓

We were using Nvidia GPUs from the A family (A6000, A5000, etc.). As these GPUs have a lot of memory (>40GB), I managed to max out the VRAM and squash a batch of 256 images onto the GPU.

Relative to using a "batch = 1", it was faster, but not A LOT faster, as I expected. Then I tried batches of 128, 64, 32, 16, and 8... and realized that everything above batch = 16 was running slower than a batch of 16.

→ A batch of 16 was the sweet spot.

But that is not good, as I was using only ~10% of the VRAM...

Why is that? The Nvidia A family of GPUs is known for:

 * having a lot of VRAM
 * not being very fast (the memory transfer between the CPU & GPU plus the number of CUDA cores isn't that great)

That being said, my program was throttled. Even if my GPU could handle much more memory-wise, the memory transfer & processing speeds weren't keeping up.
In the end, it was a good optimization: ~75% faster.

But the lesson of this story is:

→ ALWAYS KNOW YOUR HARDWARE ←

Most probably, running a bigger batch on an A100 or V100 wouldn't hit the same problem.

I plan to try that.

But that is why...

→ You always have to optimize the parameters of your system based on your hardware!

In theory, I knew this, but it is completely different when you encounter it in production.

Let me know in the comments if you want more similar stories on \"DO NOTs\" from my experience.

* * *

### Computer science is dead

Computer science is dead. Do this instead.

In a recent talk, Jensen Huang, CEO of Nvidia, said that kids shouldn't learn programming anymore.

He said that until now, most of us thought that everyone should learn to program at some point.

But the actual opposite is true.

With the rise of AI, nobody has to learn to program anymore.

He highlights that with AI tools, the technology divide between non-programmers and engineers is closing.

As an engineer, my ego is hurt; my first reaction is to say it is stupid.

But after thinking about it more thoroughly, I tend to agree with him.

After all, even now, almost anybody can work with AI.

This probably won't happen in the next 10 years, but at some point, 100% of people will.

At some point, we will simply ask our AI companion to write a program that does X for us.
But I think this is a great thing, as it will give us more time & energy to focus on what matters, such as:

- solving real-world problems (not just tech problems)
- moving to the next level of technology (bioengineering, interplanetary colonization, etc.)
- thinking about the grand scheme of things
- being more creative
- more time to connect with our family
- more time to take care of our

I personally think it is a significant step for humanity.

What do you think?

As an engineer, do you see your job still present in the next 10+ years?

Here is the full talk

↓↓↓

* * *

#### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/reduce-your-pytorchs-code-latency?r=1ttoeh" }, { "id": "7a276ac3-5c78-42d3-9ecf-05ff7f76fe31", "content": { "Title": "LLM Agents Demystified - by Li - Decoding ML Newsletter ", "Subtitle": "Hands-on ReAct Agent implementation with AdalFlow library", "Content": "# LLM Agents Demystified

### Hands-on ReAct Agent implementation with AdalFlow library

Li

Jul 27, 2024

Hi, all! I'm Li Yin, author of AdalFlow and ex-AI researcher @ Meta AI

Find me on LinkedIn

Handy links:

 * AdalFlow GitHub

 * Open in Colab

 _AdalFlow is an LLM library that not only helps developers build but also optimizes LLM task pipelines. Embracing a design pattern similar to PyTorch, AdalFlow is light, modular, and robust, with a 100% readable codebase._

_There are many tutorials that show users how to call high-level agent APIs, but none of them explain how it really works in depth. 
This is where the\nAdalFlow library aims to make a difference._\n\n_In this blog, you will not only learn how to use the ReAct Agent but more\nimportantly, also understand how it was implemented and how you can customize\nor build your own agent with AdalFlow._\n\n_Let\u2019s get started!_\n\n_Image source , credits to Growtika_\n\n## Introduction\n\n _\u201cAn autonomous agent is a system situated within and a part of an\nenvironment that senses that environment and acts on it, over time, in pursuit\nof its own agenda and so as to effect what it senses in the future.\u201d_\n\n _\u2014 Franklin and Graesser (1997)_\n\nAlongside the well-known RAGs, agents [1] are another popular family of LLM\napplications. What makes agents stand out is their ability to reason, plan,\nand act via accessible tools. When it comes to implementation, AdalFlow has\nsimplified it down to a generator that can use tools, taking multiple steps\n(sequential or parallel) to complete a user query.\n\n* * *\n\n### Table of Contents:\n\n 1. What is ReAct Agent\n\n 2. Introduction on tools/function calls\n\n 3. ReAct Agent implementation\n\n 4. ReAct Agent in action\n\n* * *\n\n### 1\\. What is ReAct Agent\n\nReAct [2] is a general paradigm for building agents that sequentially\ninterleaves thought, action, and observation steps.\n\n * **Thought** : The reasoning behind taking an action.\n\n * **Action** : The action to take from a predefined set of actions. In particular, these are the tools/functional tools we have introduced in tools.\n\n * **Observation** : The simplest scenario is the execution result of the action in string format. To be more robust, this can be defined in any way that provides the right amount of execution information for the LLM to plan the next step.\n\n#### **Prompt and Data Models**\n\n _The prompt is the most straightforward way to understand any LLM\napplication. Always read the prompt._\n\nAdalFlow uses jinja2 syntax for the prompt.\n\nDEFAULT_REACT_AGENT_SYSTEM_PROMPT is the default prompt for the React agent\u2019s\nLLM planner. We can categorize the prompt template into four parts:\n\n 1. **Task description**\n\nThis part is the overall role setup and task description for the agent.\n\n \n \n task_desc = r\"\"\"You are a helpful assistant.Answer the user's query using the tools provided below with minimal steps and maximum accuracy.Each step you will read the previous Thought, Action, and Observation(execution result of the action) and then provide the next Thought and Action.\"\"\"\n\n 2. **Tools, output format, and example**\n\nThis part of the template is exactly the same as how we were calling functions\nin the tools. The `output_format_str` is generated by `FunctionExpression` via\n`JsonOutputParser`. It includes the actual output format and examples of a\nlist of `FunctionExpression` instances. We use `thought` and `action` fields\nof the `FunctionExpression` as the agent\u2019s response. _You will be easily\nvisualize the whole pipeline later by simply_`print(react).`\n\n \n \n tools = r\"\"\"{% if tools %}\n \n {% for tool in tools %}\n {{ loop.index }}.\n {{tool}}\n ------------------------\n {% endfor %}\n \n {% endif %}\n {{output_format_str}}\"\"\"\n\n 3. **Task specification to teach the planner how to \u201cthink\u201d.**\n\nWe provide more detailed instruction to ensure the agent will always end with\n\u2018finish\u2019 action to complete the task. 
Additionally, we teach it how to handle\nsimple queries and complex queries.\n\n * For simple queries, we instruct the agent to finish with as few steps as possible.\n\n * For complex queries, we teach the agent a \u2018divide-and-conquer\u2019 strategy to solve the query step by step.\n\n \n \n task_spec = r\"\"\"\n - For simple queries: Directly call the ``finish`` action and provide the answer.\n - For complex queries:\n - Step 1: Read the user query and potentially divide it into subqueries. And get started with the first subquery.\n - Call one available tool at a time to solve each subquery/subquestion. \\\n - At step 'finish', join all subqueries answers and finish the task.\n Remember:\n - Action must call one of the above tools with name. It can not be empty.\n - You will always end with 'finish' action to finish the task. The answer can be the final answer or failure message.\n \"\"\"\n\nWe put all these three parts together to be within the `` tag.\n\n 4. **Agent step history.**\n\nWe use `StepOutput` to record the agent\u2019s step history, including:\n\n * `action`: This will be the `FunctionExpression` instance predicted by the agent.\n\n * `observation`: The execution result of the action.\n\nIn particular, we format the steps history after the user query as follows:\n\n \n \n step_history = r\"\"\"User query:\n {{ input_str }}\n {# Step History #}\n {% if step_history %}\n \n {% for history in step_history %}\n Step {{ loop.index }}.\n \"Thought\": \"{{history.action.thought}}\",\n \"Action\": \"{{history.action.action}}\",\n \"Observation\": \"{{history.observation}}\"\n ------------------------\n {% endfor %}\n \n {% endif %}\n You:\"\"\"\n\n### 2\\. Introduction on tools/function calls\n\nIn addition to the tools provided by users, by default, we add a new tool\nnamed `finish` to allow the agent to stop and return the final answer.\n\n \n \n def finish(answer: str) -> str:\n \"\"\"Finish the task with answer.\"\"\"\n return answer\n\nSimply returning a string might not fit all scenarios, and we might consider\nallowing users to define their own finish function in the future for more\ncomplex cases.\n\nAdditionally, since the provided tools cannot always solve user queries, we\nallow users to configure if an LLM model should be used to solve a subquery\nvia the `add_llm_as_fallback` parameter. This LLM will use the same model\nclient and model arguments as the agent\u2019s planner. Here is our code to specify\nthe fallback LLM tool:\n\n \n \n _additional_llm_tool = (\n Generator(model_client=model_client, model_kwargs=model_kwargs)\n if self.add_llm_as_fallback\n else None\n )\n \n def llm_tool(input: str) -> str:\n \"\"\"I answer any input query with llm's world knowledge. Use me as a fallback tool or when the query is simple.\"\"\"\n # use the generator to answer the query\n try:\n output: GeneratorOutput = _additional_llm_tool(\n prompt_kwargs={\"input_str\": input}\n )\n response = output.data if output else None\n return response\n except Exception as e:\n log.error(f\"Error using the generator: {e}\")\n print(f\"Error using the generator: {e}\")\n return None\n\n### 3\\. ReAct Agent implementation\n\nWe define the class ReActAgent to put everything together. It will orchestrate\ntwo components:\n\n * `planner`: A `Generator` that works with a `JsonOutputParser` to parse the output format and examples of the function calls using `FunctionExpression`.\n\n * `ToolManager`: Manages a given list of tools, the finish function, and the LLM tool. 
It is responsible for parsing and executing the functions.\n\nAdditionally, it manages step_history as a list of `StepOutput` instances for\nthe agent\u2019s internal state.\n\nPrompt the agent with an input query and process the steps to generate a\nresponse.\n\n### 4\\. ReAct Agent in action\n\nWe will set up two sets of models, llama3\u201370b-8192 by Groq and gpt-3.5-turbo\nby OpenAI, to test two queries. For comparison, we will compare these with a\nvanilla LLM response without using the agent. Here are the code snippets:\n\n \n \n from lightrag.components.agent import ReActAgent\n from lightrag.core import Generator, ModelClientType, ModelClient\n from lightrag.utils import setup_env\n \n setup_env()\n \n # Define tools\n def multiply(a: int, b: int) -> int:\n \"\"\"\n Multiply two numbers.\n \"\"\"\n return a * b\n def add(a: int, b: int) -> int:\n \"\"\"\n Add two numbers.\n \"\"\"\n return a + b\n def divide(a: float, b: float) -> float:\n \"\"\"\n Divide two numbers.\n \"\"\"\n return float(a) / b\n llama3_model_kwargs = {\n \"model\": \"llama3-70b-8192\", # llama3 70b works better than 8b here.\n \"temperature\": 0.0,\n }\n gpt_model_kwargs = {\n \"model\": \"gpt-3.5-turbo\",\n \"temperature\": 0.0,\n }\n \n def test_react_agent(model_client: ModelClient, model_kwargs: dict):\n tools = [multiply, add, divide]\n queries = [\n \"What is the capital of France? and what is 465 times 321 then add 95297 and then divide by 13.2?\",\n \"Give me 5 words rhyming with cool, and make a 4-sentence poem using them\",\n ]\n # define a generator without tools for comparison\n generator = Generator(\n model_client=model_client,\n model_kwargs=model_kwargs,\n )\n react = ReActAgent(\n max_steps=6,\n add_llm_as_fallback=True,\n tools=tools,\n model_client=model_client,\n model_kwargs=model_kwargs,\n )\n # print(react)\n for query in queries:\n print(f\"Query: {query}\")\n agent_response = react.call(query)\n llm_response = generator.call(prompt_kwargs={\"input_str\": query})\n print(f\"Agent response: {agent_response}\")\n print(f\"LLM response: {llm_response}\")\n print(\"\")\n\nThe structure of React using `print(react)`, including the initialization\narguments and two major components: `tool_manager` and `planner`. You can\nvisualize the structure from our colab.\n\nNow, let\u2019s run the test function to see the agent in action.\n\n \n \n test_react_agent(ModelClientType.GROQ(), llama3_model_kwargs)\n test_react_agent(ModelClientType.OPENAI(), gpt_model_kwargs)\n\nOur agent will show the core steps for developers via colored printout,\nincluding input_query, steps, and the final answer. The printout of the first\nquery with llama3 is shown below (without the color here):\n\n \n \n 2024-07-10 16:48:47 - [react.py:287:call] - input_query: What is the capital of France? 
and what is 465 times 321 then add 95297 and then divide by 13.2\n \n 2024-07-10 16:48:48 - [react.py:266:_run_one_step] - Step 1:\n StepOutput(step=1, action=FunctionExpression(thought=\"Let's break down the query into subqueries and start with the first one.\", action='llm_tool(input=\"What is the capital of France?\")'), function=Function(thought=None, name='llm_tool', args=[], kwargs={'input': 'What is the capital of France?'}), observation='The capital of France is Paris!')\n _______\n 2024-07-10 16:48:49 - [react.py:266:_run_one_step] - Step 2:\n StepOutput(step=2, action=FunctionExpression(thought=\"Now, let's move on to the second subquery.\", action='multiply(a=465, b=321)'), function=Function(thought=None, name='multiply', args=[], kwargs={'a': 465, 'b': 321}), observation=149265)\n _______\n 2024-07-10 16:48:49 - [react.py:266:_run_one_step] - Step 3:\n StepOutput(step=3, action=FunctionExpression(thought=\"Now, let's add 95297 to the result.\", action='add(a=149265, b=95297)'), function=Function(thought=None, name='add', args=[], kwargs={'a': 149265, 'b': 95297}), observation=244562)\n _______\n 2024-07-10 16:48:50 - [react.py:266:_run_one_step] - Step 4:\n StepOutput(step=4, action=FunctionExpression(thought=\"Now, let's divide the result by 13.2.\", action='divide(a=244562, b=13.2)'), function=Function(thought=None, name='divide', args=[], kwargs={'a': 244562, 'b': 13.2}), observation=18527.424242424244)\n _______\n 2024-07-10 16:48:50 - [react.py:266:_run_one_step] - Step 5:\n StepOutput(step=5, action=FunctionExpression(thought=\"Now, let's combine the answers of both subqueries.\", action='finish(answer=\"The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244.\")'), function=Function(thought=None, name='finish', args=[], kwargs={'answer': 'The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244.'}), observation='The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244.')\n _______\n 2024-07-10 16:48:50 - [react.py:301:call] - answer:\n The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244.\n\nThe comparison between the agent and the vanilla LLM response is shown below:\n\n \n \n Answer with agent: The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244.\n Answer without agent: GeneratorOutput(data=\"I'd be happy to help you with that!\\n\\nThe capital of France is Paris.\\n\\nNow, let's tackle the math problem:\\n\\n1. 465 \u00d7 321 = 149,485\\n2. Add 95,297 to that result: 149,485 + 95,297 = 244,782\\n3. Divide the result by 13.2: 244,782 \u00f7 13.2 = 18,544.09\\n\\nSo, the answer is 18,544.09!\", error=None, usage=None, raw_response=\"I'd be happy to help you with that!\\n\\nThe capital of France is Paris.\\n\\nNow, let's tackle the math problem:\\n\\n1. 465 \u00d7 321 = 149,485\\n2. Add 95,297 to that result: 149,485 + 95,297 = 244,782\\n3. Divide the result by 13.2: 244,782 \u00f7 13.2 = 18,544.09\\n\\nSo, the answer is 18,544.09!\", metadata=None)\n\nThe ReAct agent is particularly helpful for answering queries that require\ncapabilities like computation or more complicated reasoning and planning.\nHowever, using it on general queries might be an overkill, as it might take\nmore steps than necessary to answer the query.\n\n### 5\\. 
[Optional] Customization

Please refer to our tutorial for how to customize ReAct to your use case.

* * *

## References

[1] A survey on large language model based autonomous agents: Paitesanshi/LLM-Agent-Survey

[2] ReAct: https://arxiv.org/abs/2210.03629

[3] Tool Tutorial: https://lightrag.sylph.ai/tutorials/tool_helper.html

## API References

 * components.agent.react.ReActAgent

 * core.types.StepOutput

 * components.agent.react.DEFAULT_REACT_AGENT_SYSTEM_PROMPT

A guest post by Li, author of AdalFlow, founder at SylphAI, ex-AI researcher at Meta AI. GitHub: liyin2015

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/llm-agents-demystified?r=1ttoeh" }, { "id": "12ad5863-ba57-4f5c-9ab7-4600c7edbf5c", "content": { "Title": "Scalable RAG pipeline using 74.3% less code", "Subtitle": "Tutorial on building a scalable & modular advanced RAG feature pipeline to chunk, embed and ingest multiple data categories to a vector DB using Superlinked", "Content": "# Scalable RAG ingestion pipeline using 74.3% less code

### End-to-end implementation for an advanced RAG feature pipeline

Paul Iusztin

Jul 20, 2024

 _→ the 1st lesson of the Superlinked bonus series from **the LLM Twin** free course_

### **Why is this course different?**

_By finishing the \"**LLM Twin: Building Your Production-Ready AI Replica**\" free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices._

_**Why should you care? 🫵**_

 _**→ No more isolated scripts or Notebooks!** Learn production ML by building and deploying an end-to-end production-grade LLM system._

> _More **details** on what you will **learn** within the **LLM Twin** **course**, **here** 👈_

## Latest lessons of the LLM Twin course

**Lesson 8:** Best practices when evaluating fine-tuned LLM models

→ Quantitative/Qualitative Evaluation Metrics, Human-in-the-Loop, LLM-Eval

**Lesson 9:** Architect scalable and cost-effective LLM & RAG inference pipelines

→ Monolithic vs. 
microservice, Qwak Deployment, RAG Pipeline Walkthrough\n\n**Lesson 10:** How to evaluate your RAG using RAGAs Framework\n\n\u2192 RAG evaluation best practic, RAGAs framework\n\n* * *\n\n## **Lesson 11: Build a scalable RAG ingestion pipeline using 74.3% less\ncode**\n\n**Lessons 11** and **12** are part of a **bonus serie** s in which we will\ntake the advanced RAG system from the **LLM Twin course** (written in\nLangChain) and refactor it using Superlinked, a framework specialized in\nvector computing for information retrieval.\n\nIn **Lesson 11** **(this article)** , we will learn to build a highly\nscalable, real-time RAG feature pipeline that ingests multi-data categories\ninto a Redis vector database.\n\nMore concretely we will take the ingestion pipeline implemented in Lesson 4\nand swap the chunking, embedding, and vector DB logic with Superlinked.\n\n_You don\u2019t have to readLesson 4 to read this article. We will give enough\ncontext to make sense of it._\n\nIn the **12th lesson** , we will use Superlinked to implement a multi-index\nquery strategy and further optimize the advanced RAG retrieval module\n(initially built in Lesson 5).\n\n> _The value of this article lies in understanding how easy it is to build\n> complex advanced RAG systems usingSuperlinked._\n>\n> _**Using Superlinked** , we **reduced** the number of RAG-related **lines of\n> code** by **74.3%**. Powerful, right?_\n\nBy the **end of this article** , **you will learn** to build a production-\nready feature pipeline built in Superlinked that:\n\n * uses Bytewax as a stream engine to process data in real-time;\n\n * ingests multiple data categories from a RabbitMQ queue;\n\n * validates the data with Pydantic;\n\n * chunks, and embeds data using Superlinked for doing RAG;\n\n * loads the embedded vectors along their metadata to a Redis vector DB;\n\nUltimately, on the infrastructure side, we will show you how to deploy a\nSuperlinked vector compute server.\n\n### **Quick intro in feature pipelines**\n\nThe **feature pipeline** is the **first** **pipeline** presented in the\n**FTI** **pipeline architecture** : feature, training and inference pipelines.\n\nA **feature pipeline** takes raw data as input, processes it into features,\nand stores it in a feature store, from which the training & inference\npipelines will use it.\n\nThe component is completely isolated from the training and inference code. All\nthe communication is done through the feature store.\n\n> _To avoid repeating myself, if you are**unfamiliar** with the **FTI**\n> **pipeline architecture** , check out Lesson 1 for a refresher._\n\n* * *\n\n## **Table of Contents**\n\n 1. What is Superlinked?\n\n 2. The old architecture of the RAG feature pipeline\n\n 3. The new Superlinked architecture of the RAG feature pipeline\n\n 4. Understanding the streaming flow for real-time processing\n\n 5. Loading data to Superlinked\n\n 6. Exploring the RAG Superlinked server\n\n 7. Using Redis as a vector DB\n\n> _\ud83d\udd17**Check out** the code on GitHub [1] and support us with a _\n\n* * *\n\n## **1\\. 
What is Superlinked?**\n\n_Superlinked is a computing framework for turning complex data into vectors._\n\nIt lets you quickly build multimodal vectors and define weights at query time,\nso you don\u2019t need a custom reranking algorithm to optimize results.\n\nIt\u2019s focused on turning complex data into vector embeddings within your RAG,\nSearch, RecSys and Analytics stack.\n\nI love how Daniel Svonava, the CEO of Superlinked, described the value of\nvector compute and implicitly Superlinked:\n\n> _Daniel Svonava, CEO at Superlinked:_\n>\n> _\u201cVectors power most of what you already do online \u2014 hailing a cab, finding\n> a funny video, getting a date, scrolling through a feed or paying with a\n> tap. And yet, building production systems powered by vectors is still too\n> hard! Our goal is to help enterprises put vectors at the center of their\n> data & compute infrastructure, to build smarter and more reliable\n> software.\u201d_\n\nTo conclude, Superlinked is a framework that puts the vectors in the center of\ntheir universe and allows you to:\n\n * chunk and embed embeddings;\n\n * store multi-index vectors in a vector DB;\n\n * do complex vector search queries on top of your data.\n\nScreenshot from Superlinked\u2019s landing page\n\n* * *\n\n## **2\\. The old architecture of the RAG feature pipeline**\n\nHere is a quick recap of the critical aspects of the architecture of the RAG\nfeature pipeline presented in the 4th lesson of the LLM Twin course.\n\n_We are working with**3 different data categories** :_\n\n * posts (e.g., LinkedIn, Twitter)\n\n * articles (e.g., Medium, Substack, or any other blog)\n\n * repositories (e.g., GitHub, GitLab)\n\nEvery data category has to be preprocessed differently. For example, you want\nto chunk the posts into smaller documents while keeping the articles in bigger\nones.\n\n_The**solution** is based on **CDC** , a **queue,** a **streaming engine,**\nand a **vector DB:**_\n\n-> The raw data is collected from multiple social platforms and is stored in MongoDB. (Lesson 2)\n\n\u2192 CDC adds any change made to the MongoDB to a RabbitMQ queue (Lesson 3).\n\n\u2192 the RabbitMQ queue stores all the events until they are processed.\n\n\u2192 The Bytewax streaming engine reads the messages from the RabbitMQ queue and\ncleans, chunks, and embeds them.\n\n\u2192 The processed data is uploaded to a Qdrant vector DB.\n\nThe old feature/streaming pipeline architecture that was presented in Lesson\n4.\n\n### **Why is this design robust?**\n\nHere are 4 core reasons:\n\n 1. The **data** is **processed** in **real-time**.\n\n 2. **Out-of-the-box recovery system:** If the streaming pipeline fails to process a message, it will be added back to the queue\n\n 3. **Lightweight:** No need for any diffs between databases or batching too many records\n\n 4. **No I/O bottlenecks** on the source database\n\n### **What is the issue with this design?**\n\nIn this architecture, we had to write custom logic to chunk, embed, and load\nthe data to Qdrant.\n\nThe issue with this approach is that we had to leverage various libraries,\nsuch as LangChain and unstructured, to get the job done.\n\nAlso, because we have 3 data categories, we had to write a dispatcher layer\nthat calls the right function depending on its category, which resulted in\ntons of boilerplate code.\n\nUltimately, as the chunking and embedding logic is implemented directly in the\nstreaming pipeline, it is harder to scale horizontally. 
The embedding\nalgorithm needs powerful GPU machines, while the rest of the operations\nrequire a strong CPU.\n\nThis results in:\n\n * more time spent on development;\n\n * more code to maintain;\n\n * the code can quickly become less readable;\n\n * less freedom to scale.\n\nSuperlinked can speed up this process by providing a very intuitive and\npowerful Python API that can speed up the development of our ingestion and\nretrieval logic.\n\nThus, let\u2019s see how to redesign the architecture using Superlinked \u2193\n\n## **3\\. The new Superlinked architecture of the RAG feature pipeline**\n\nThe core idea of the architecture will be the same. We still want to:\n\n * use a Bytewax streaming engine for real-time processing;\n\n * read new events from RabbitMQ;\n\n * clean, chunk, and embed the new incoming raw data;\n\n * load the processed data to a vector DB.\n\n**The question is** , how will we do this with Superlinked?\n\nAs you can see in the image below, Superlinked will replace the logic for the\nfollowing operations:\n\n * chunking;\n\n * embedding;\n\n * vector storage;\n\n * queries.\n\nAlso, we have to swap Qdrant with a Redis vector DB because Superlinked didn\u2019t\nsupport Qdrant when I wrote this article. But they plan to add it in future\nmonths (along with many other vector DBs).\n\nWhat will remain unchanged are the following:\n\n * the Bytewax streaming layer;\n\n * the RabbitMQ queue ingestion component;\n\n * the cleaning logic.\n\n> _By seeing**what we must change** to the architecture to integrate\n> Superlinked, we can **see** the **framework\u2019s core features**._\n\nThe components that can be refactored into the Superlinked framework.\n\nNow, let\u2019s take a deeper look at the new architecture.\n\nAll the Superlinked logic will sit on its own server, completely decoupling\nthe vector compute component from the rest of the feature pipeline.\n\nWe can quickly scale the streaming pipeline or the Superlinked server\nhorizontally based on our needs. Also, this makes it easier to run the\nembedding models (from Superlinked) on a machine with a powerful GPU while\nkeeping the streaming pipeline on a machine optimized for network I/O\noperations.\n\nAll the communication to Superlinked (ingesting or query data) will be done\nthrough a REST API, automatically generated based on the schemas and queries\nyou define in your Superlinked application.\n\nThe **Bytewax streaming pipeline** will perform the following operations:\n\n * will concurrently read messages from RabbitMQ;\n\n * clean each message based on it\u2019s data category;\n\n * send the cleaned document to the Superlinked server through an HTTP request.\n\n**On the** **Superlinked server side** , we have defined an ingestion endpoint\nfor each data category (article, post or code). Each endpoint will know how to\nchunk embed and store every data point based on its category.\n\nAlso, we have a query endpoint (automatically generated) for each data\ncategory that will take care of embedding the query and perform a vector\nsemantic search operation to retrieve similar results.\n\nThe RAG feature pipeline architecture after refactoring.\n\nNow, let\u2019s finally jump into the code \u2193\n\n* * *\n\n## **4\\. 
Understanding the streaming flow for real-time processing**\n\nThe **Bytewax flow** is the **central point** of the **streaming pipeline**.\nIt defines all the required steps, following the next simplified pattern:\n_\u201cinput - > processing -> output\u201d._\n\nHere is the Bytewax flow and its core steps \u2193\n\n \n \n flow = Dataflow(\"Streaming RAG feature pipeline\")\n stream = op.input(\"input\", flow, RabbitMQSource())\n stream = op.map(\"raw\", stream, RawDispatcher.handle_mq_message)\n stream = op.map(\"clean\", stream, CleaningDispatcher.dispatch_cleaner)\n op.output(\n \"superlinked_output\",\n stream,\n SuperlinkedOutputSink(client=SuperlinkedClient()),\n )\n\n## **5\\. Loading data to Superlinked**\n\nBefore we explore the Superlinked application, let\u2019s review our Bytewax\n_SuperlinkedOutputSink()_ and _SuperlinkedClient() _classes.\n\nThe purpose of the _SuperlinkedOutputSink()_ class is to instantiate a new\n_SuperlinkedSinkPartition()_ instance for each worker within the Bytewax\ncluster. Thus, we can optimize the system for I/O operations by scaling our\noutput workers horizontally.\n\n \n \n class SuperlinkedOutputSink(DynamicSink):\n def __init__(self, client: SuperlinkedClient) -> None:\n self._client = client\n \n def build(self, worker_index: int, worker_count: int) -> StatelessSinkPartition:\n return SuperlinkedSinkPartition(client=self._client)\n\nThe _SuperlinkedSinkPartition()_ class inherits the _StatelessSinkPartition\nBytewax base class_ used to create custom stateless partitions.\n\nThis class takes as input batches of items and sends them to Superlinked\nthrough the _SuperlinkedClient()_.\n\n \n \n class SuperlinkedSinkPartition(StatelessSinkPartition):\n def __init__(self, client: SuperlinkedClient):\n self._client = client\n \n def write_batch(self, items: list[Document]) -> None:\n for item in tqdm(items, desc=\"Sending items to Superlinked...\"):\n match item.type:\n case \"repositories\":\n self._client.ingest_repository(item)\n case \"posts\":\n self._client.ingest_post(item)\n case \"articles\":\n self._client.ingest_article(item)\n case _:\n logger.error(f\"Unknown item type: {item.type}\")\n\nThe _SuperlinkedClient() _is a basic wrapper that makes HTTP requests to the\nSuperlinked server that contains all the RAG logic. We use _httpx_ to make __\nPOST requests for ingesting or searching data.\n\n \n \n class SuperlinkedClient:\n ...\n \n def ingest_repository(self, data: RepositoryDocument) -> None:\n self.__ingest(f\"{self.base_url}/api/v1/ingest/repository_schema\", data)\n \n def ingest_post(self, data: PostDocument) -> None:\n self.__ingest(f\"{self.base_url}/api/v1/ingest/post_schema\", data)\n \n def ingest_article(self, data: ArticleDocument) -> None:\n self.__ingest(f\"{self.base_url}/api/v1/ingest/article_schema\", data)\n \n def __ingest(self, url: str, data: T) -> None:\n ...\n \n def search_repository(\n self, search_query: str, platform: str, author_id: str, *, limit: int = 3\n ) -> list[RepositoryDocument]:\n return self.__search(\n f\"{self.base_url}/api/v1/search/repository_query\",\n RepositoryDocument,\n search_query,\n platform,\n author_id,\n limit=limit,\n )\n \n def search_post(\n self, search_query: str, platform: str, author_id: str, *, limit: int = 3\n ) -> list[PostDocument]:\n ... # URL: f\"{self.base_url}/api/v1/search/post_query\"\n \n def search_article(\n self, search_query: str, platform: str, author_id: str, *, limit: int = 3\n ) -> list[ArticleDocument]:\n ... 
# URL: f\"{self.base_url}/api/v1/search/article_query\"\n \n def __search(\n self, url: str, document_class: type[T], search_query: str, ...\n ) -> list[T]:\n ...\n \n\nThe Superlinked server URLs are automatically generated as follows:\n\n * the ingestion URLs are generated based on the data schemas you defined (e.g., repository schema, post schema, etc.)\n\n * the search URLs are created based on the Superlinked queries defined within the application\n\n## **6\\. Exploring the RAG Superlinked server**\n\nAs the RAG Superlinked server is a different component than the Bytewax one,\nthe implementation sits under the server folder at _6-bonus-superlinked-\nrag/server/src/app.py._\n\n_Here is a step-by-step implementation of the Superlinked application \u2193_\n\n### **Settings class**\n\nUse Pydantic settings to define a global configuration class.\n\n \n \n class Settings(BaseSettings):\n EMBEDDING_MODEL_ID: str = \"sentence-transformers/all-mpnet-base-v2\"\n \n REDIS_HOSTNAME: str = \"redis\"\n REDIS_PORT: int = 6379\n \n \n settings = Settings()\n\n### **Schemas**\n\nSuperlinked requires you to define your data structure through a set of\nschemas, which are very similar to data classes or Pydantic models.\n\nSuperlinked will use these schemas as ORMs to save your data to a specified\nvector DB.\n\nIt will also use them to define ingestion URLs automatically as POST HTTP\nmethods that expect the request body to have the same signature as the schema.\n\nSimple and effective. Cool, right?\n\n \n \n @schema\n class PostSchema:\n id: IdField\n platform: String\n content: String\n author_id: String\n type: String\n \n \n @schema\n class ArticleSchema:\n id: IdField\n platform: String\n link: String\n content: String\n author_id: String\n type: String\n \n \n @schema\n class RepositorySchema:\n id: IdField\n platform: String\n name: String\n link: String\n content: String\n author_id: String\n type: String\n \n \n post = PostSchema()\n article = ArticleSchema()\n repository = RepositorySchema()\n\n### **Spaces**\n\nThe spaces are where you define your chunking and embedding logic.\n\nA space is scoped at the field of a schema. Thus, if you want to embed\nmultiple attributes of a single schema, you must define multiple spaces and\ncombine them later into a multi-index.\n\nLet\u2019s take the spaces for the article category as an example:\n\n \n \n articles_space_content = TextSimilaritySpace(\n text=chunk(article.content, chunk_size=500, chunk_overlap=50),\n model=settings.EMBEDDING_MODEL_ID,\n )\n articles_space_plaform = CategoricalSimilaritySpace(\n category_input=article.platform,\n categories=[\"medium\", \"superlinked\"],\n negative_filter=-5.0,\n )\n\nChunking is done simply by calling the _chunk()_ function on a given schema\nfield and specifying standard parameters such as \u201c _chunk_size\u201d_ and \u201c\n_chunk_overlap\u201d_.\n\nThe embedding is done through the _TextSimilaritySpace()_ and\n_CategoricalSimilaritySpace()_ classes.\n\nAs the name suggests, the _**TextSimilaritySpace()** _embeds text data using\nthe model specified within the _\u201cmodel\u201d_ parameter. It supports any\nHuggingFace model. 
We are using _\u201csentence-transformers/all-mpnet-base-v2\u201d._\n\nThe _**CategoricalSimilaritySpace()**_ class uses an _n-hot encoded vector_\nwith the option to apply a negative filter for unmatched categories, enhancing\nthe distinction between matching and non-matching category items.\n\nYou must also specify all the available categories through the \u201c _categories_\n\u201d parameter to encode them in n-hot.\n\n### **Indexes**\n\nThe indexes define how a collection can be queried. They take one or multiple\nspaces from the same schema.\n\nHere is what the article index looks like:\n\n \n \n article_index = Index(\n [articles_space_content, articles_space_plaform],\n fields=[article.author_id],\n )\n\nAs you can see, the vector index combines the article\u2019s content and the posted\nplatform. When the article collection is queried, both embeddings will be\nconsidered.\n\nAlso, we index the \u201cauthor_id\u201d field to filter articles written by a specific\nauthor. It is nothing fancy\u2014it is just a classic filter. However, indexing the\nfields used in filters is often good practice.\n\n### **Queries**\n\nWe will quickly introduce what a query looks like. But in the 14th lesson, we\nwill insist on the advanced retrieval part, hence on queries.\n\nHere is what the article query looks like:\n\n \n \n article_query = (\n Query(\n article_index,\n weights={\n articles_space_content: Param(\"content_weight\"),\n articles_space_plaform: Param(\"platform_weight\"),\n },\n )\n .find(article)\n .similar(articles_space_content.text, Param(\"search_query\"))\n .similar(articles_space_plaform.category, Param(\"platform\"))\n .filter(article.author_id == Param(\"author_id\"))\n .limit(Param(\"limit\"))\n )\n\n\u2026and here is what it does:\n\n * it queries the _article_index_ using a weighted multi-index between the content and platform vectors (e.g., `0.9 * content_embedding + 0.1 * platform_embedding` );\n\n * the search text used to compute query content embedding is specified through the \u201csearch_query\u201d parameter and similar for the platform embedding through the \u201cplatform\u201d parameter;\n\n * we filter the results based on the \u201cauthor_id\u201d;\n\n * take only the top results using the \u201climit\u201d parameter.\n\nThese parameters are automatically exposed on the REST API endpoint, as seen\nin the _SuperlinkedClient()_ class.\n\n### **Sources**\n\nThe sources wrap the schemas and allow you to save that schema in the\ndatabase.\n\nIn reality, the source maps the schema to an ORM and automatically generates\nREST API endpoints to ingest data points.\n\n \n \n article_source = RestSource(article)\n\n### **Executor**\n\nThe last step is to define the executor that wraps all the sources, indices,\nqueries and vector DB into a single entity:\n\n \n \n executor = RestExecutor(\n sources=[article_source, repository_source, post_source],\n indices=[article_index, repository_index, post_index],\n queries=[\n RestQuery(RestDescriptor(\"article_query\"), article_query),\n RestQuery(RestDescriptor(\"repository_query\"), repository_query),\n RestQuery(RestDescriptor(\"post_query\"), post_query),\n ],\n vector_database=InMemoryVectorDatabase(),\n )\n \n\nNow, the last step is to register the executor to the Superlinked engine:\n\n \n \n SuperlinkedRegistry.register(executor)\n\n\u2026and that\u2019s it!\n\nJoking\u2026 there is something more. We have to use a Redis database instead of\nthe in-memory one.\n\n## **7\\. 
Using Redis as a vector DB**

First, we have to spin up a Redis vector database that we can work with.

We used Docker and attached a Redis image as a service in a _docker-compose_ file, along with the Superlinked poller and executor (which comprise the Superlinked server):

    version: \"3\"

    services:
      poller:
        ...

      executor:
        ...

      redis:
        image: redis/redis-stack:latest
        ports:
          - \"6379:6379\"
          - \"8001:8001\"
        volumes:
          - redis-data:/data

    volumes:
      redis-data:

Now, Superlinked makes everything easy. The last step is to define a RedisVectorDatabase connector provided by Superlinked:

    vector_database = RedisVectorDatabase(
        settings.REDIS_HOSTNAME,
        settings.REDIS_PORT
    )

…and swap it into the executor in place of the _InMemoryVectorDatabase()_ one:

    executor = RestExecutor(
        ...
        vector_database=vector_database,
    )

Now we are done!

* * *

## **Conclusion**

 _Congratulations! You learned to write advanced RAG systems using Superlinked._

More concretely, in **Lesson 11**, you learned:

 * what Superlinked is;

 * how to design a streaming pipeline using Bytewax;

 * how to design a RAG server using Superlinked;

 * how to take a standard RAG feature pipeline and refactor it using Superlinked;

 * how to split the feature pipeline into 2 services, one that reads messages from RabbitMQ in real time and one that chunks, embeds, and stores the data to a vector DB;

 * how to use a Redis vector DB.

**Lesson 12** will teach you how to implement multi-index queries to optimize the RAG retrieval layer further.

> _🔗 **Check out** the code on GitHub [1] and support us with a ⭐️_

* * *

### Next Steps

#### Step 1

This is just the **short version** of **Lesson 11** on **building scalable RAG ingestion pipelines.**

→ For…

 * The full implementation.

 * A full deep dive into the code.

 * More on RAG, Bytewax, and Superlinked.

**Check out** the **full version** of **Lesson 11** on our **Medium publication**. It's still FREE:

Lesson 11 on Medium

#### Step 2

→ **Consider checking out the LLM Twin GitHub repository and try it yourself 🫵**

 _Nothing compares with getting your hands dirty and doing it yourself!_

LLM Twin Course - GitHub

* * *

#### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/scalable-rag-ingestion-pipeline-using?r=1ttoeh" }, { "id": "0eae1447-70c8-40b2-a5c4-96f6de69f04b", "content": { "Title": "The ultimate MLOps tool - by Paul Iusztin", "Subtitle": "6 steps to build your AWS infrastructure that will work for 90% of your projects. 
How to build a real-time news search engine", "Content": "# The ultimate MLOps tool

### 6 steps to build your AWS infrastructure that will work for 90% of your projects. How to build a real-time news search engine

Paul Iusztin

Jul 13, 2024

 _Decoding ML Notes_

Based on your feedback from last week's poll, we will post exclusively on Saturdays starting now.

Enjoy today's article 🤗

* * *

### **This week's topics:**

 * The ultimate MLOps tool

 * 6 steps to build your AWS infrastructure that will work for 90% of your projects

 * How to build a real-time news search engine

* * *

### The ultimate MLOps tool

I tested this orchestrator tool for my ML pipelines and loved it! It is the ultimate MLOps tool to glue everything together for reproducibility and continuous training.

In the past months, I have tested most of the top orchestrator tools out there: Airflow, Prefect, Argo, Kubeflow, Metaflow...

You name it!

But one stood out to me.

I am talking about ZenML!

Why?

They realized they don't have to compete with tools such as Airflow or AWS in the orchestrator and MLOps race, but join them!

Instead of being yet another orchestrator tool, they have built an abstraction layer on top of the MLOps ecosystem:

- experiment trackers & model registries (e.g., Weights & Biases, Comet)
- orchestrators (e.g., Apache Airflow, Kubeflow)
- container registries for your Docker images
- model deployers (Hugging Face, BentoML, Seldon)

They wrote a clever wrapper that integrates the whole MLOps ecosystem!
Also, integrating it into your Python code is not intrusive.

As long as your code is modular (which it should be anyway), you only have to annotate your DAG:

- the steps with the `@step` decorator
- the entry point with the `@pipeline` decorator

As you can see in the code snippets below ↓ (a minimal sketch of this pattern also follows at the end of this section)

ZenML Pipelines

ZenML Steps

They also provide the concept of a \"stack\".

This allows you to configure multiple tools and infrastructure sets your pipeline can run on.

For example:

- a local stack: uses a local orchestrator, artifact store, and compute for quick testing (so you don't have to set up other dependencies)

- an AWS stack: uses the AWS SageMaker orchestrator, Comet, and Seldon

ZenML Stacks

As I am still learning ZenML, this was just an intro post to share my excitement.

I plan to integrate it into Decoding ML's LLM twin open-source project and share the process with you!
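For illustration, here is a minimal sketch of that annotation pattern. It assumes the `@step` / `@pipeline` decorators importable from recent ZenML releases (`from zenml import step, pipeline`); the step names and their logic are made up, not taken from an actual project.

    from zenml import pipeline, step

    @step
    def load_data() -> list[int]:
        # Hypothetical step: pull raw data from somewhere.
        return [1, 2, 3, 4]

    @step
    def train_model(data: list[int]) -> float:
        # Hypothetical step: 'train' and return a metric.
        return sum(data) / len(data)

    @pipeline
    def training_pipeline():
        # The entry point only wires the steps together; ZenML builds the DAG.
        data = load_data()
        train_model(data)

    if __name__ == '__main__':
        training_pipeline()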
Meanwhile, consider checking out their starter guide ↓

🔗 Starter guide: https://lnkd.in/dPzXHvjH

* * *

### 6 steps to build your AWS infrastructure that will work for 90% of your projects

6 steps to build your AWS infrastructure (using IaC) and a CI/CD pipeline that will work for 90% of your projects ↓

We will use the data collection pipeline from our free digital twin course as an example, but it can easily be extrapolated to most of your projects.

First, let's see what is in our toolbelt:

- Docker
- AWS ECR
- AWS Lambda
- MongoDB
- Pulumi
- GitHub Actions

Secondly, let's quickly understand what the data collection pipeline is doing.

It automates your digital data collection from LinkedIn, Medium, Substack, and GitHub. The normalized data will be loaded into MongoDB.
Now, let's understand how the AWS infrastructure and the CI/CD pipeline work ↓

1. We wrap the application's entry point with a `handle(event, context: LambdaContext)` function. The AWS Lambda serverless computing service will default to the `handle()` function (a minimal sketch of such a handler follows after these steps).

2. Build a Docker image of your application inheriting the `public.ecr.aws/lambda/python:3.11` base Docker image.

→ Now, you can quickly check your AWS Lambda function locally by making HTTP requests to your Docker container.

3. Use Pulumi IaC to create your AWS infrastructure programmatically:

- an ECR as your Docker registry
- an AWS Lambda service
- a MongoDB cluster
- the VPC for the whole infrastructure

4. Now that we have our Docker image and infrastructure, we can build our CI/CD pipeline using GitHub Actions. The first step is to build the Docker image inside the CI and push it to ECR when a new PR is merged into the main branch.

5. On the CD part, we take the fresh Docker image from ECR and deploy it to AWS Lambda.

6. Repeat the same logic with the Pulumi code → add a CD GitHub Action that updates the infrastructure whenever the IaC changes.

With this flow, you will do fine for 90% of your projects 🔥
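As a minimal illustration of step 1, here is what such an entry point could look like. The event fields and the crawling logic are placeholders (the real code lives in the course repository); only the `handle(event, context)` signature mirrors what is described above.

    import json
    from typing import Any

    def handle(event: dict, context: Any) -> dict:
        # AWS Lambda invokes this function with the trigger payload (`event`)
        # and runtime metadata (`context`).
        link = event.get('link', '')  # hypothetical input field
        # ... run the (omitted) crawling & normalization logic here ...
        return {
            'statusCode': 200,
            'body': json.dumps({'processed': link}),
        }

    if __name__ == '__main__':
        # Local smoke test, mimicking the request you would send to the
        # Lambda runtime container.
        print(handle({'link': 'https://medium.com/some-article'}, context=None))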
To summarize, the CI/CD will look like this:

feature PR -> merged to main -> build Docker image -> push to ECR -> deploy to AWS Lambda

LLM Twin AWS architecture

Want to run the code yourself?

Consider checking out Lesson 2 from the FREE LLM Twin course hosted by Decoding ML:

🔗 _The Importance of Data Pipelines in the Era of Generative AI_

* * *

### How to build a real-time news search engine

Decoding ML released an article & code on building a Real-time News Search Engine using Kafka, Vector DBs and streaming engines.

Everything in Python!

The end goal?

Learn to build a production-ready semantic search engine for news that is synced in real-time with multiple news sources using:
\- a streaming engine
\- Kafka
\- a vector DB.

The problem?

According to a research study by earthweb.com, the daily influx of news articles, both online and offline, is between 2 and 3 million.

How would you constantly sync these data sources with your vector DB to stay in sync with the outside world?

The solution!

→ Here is where the streaming pipeline kicks in.

As soon as a new data point is available, it is:
\- ingested
\- processed
\- loaded to a vector DB

...in real-time by the streaming pipeline ←
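In pseudo-Python, that real-time loop boils down to something like the sketch below. This is a generic illustration, not the article's actual Bytewax code: the topic name and the `clean`, `embed`, and `upsert_to_vector_db` helpers are hypothetical stand-ins for the pipeline's processing steps.

    import json

    from kafka import KafkaConsumer  # kafka-python

    # Hypothetical helpers standing in for the cleaning/embedding/upsert logic.
    from pipeline import clean, embed, upsert_to_vector_db

    consumer = KafkaConsumer(
        "news",  # assumed topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Each new article is ingested, processed, and loaded as soon as it arrives.
    for message in consumer:
        article = clean(message.value)
        vector = embed(article["content"])
        upsert_to_vector_db(id=article["id"], vector=vector, metadata=article)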
Here is what you will learn from the article ↓

→ Set up your own Upstash Kafka & Vector DB clusters

→ Structure & validate your data points using Pydantic (a short sketch follows at the end of this section)

→ Simulate multiple Kafka Clients using ThreadPoolExecutor & KafkaProducer

→ Stream processing using Bytewax - learn to build a real-time RAG ingestion pipeline

→ Batch-upserting embeddings + metadata to Upstash Vector DB

→ Build a Q&A UI using Streamlit

→ Unit Testing - Yes, we even added unit testing!

Curious to level up your Python, streaming & RAG game 🫵

Then, consider checking out the article & code. Everything is free.

↓↓↓

🔗 **[Article]** How to build a real-time News Search Engine using Vector DBs

🔗 GitHub code
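As promised above, here is a minimal Pydantic sketch of the validation idea, using Pydantic v2 syntax with hypothetical fields rather than the article's exact schema:

    from datetime import datetime

    from pydantic import BaseModel, field_validator


    class NewsArticle(BaseModel):
        id: str
        title: str
        content: str
        published_at: datetime

        @field_validator("content")
        @classmethod
        def content_not_empty(cls, value: str) -> str:
            if not value.strip():
                raise ValueError("content must not be empty")
            return value


    # Pydantic coerces the ISO string into a datetime and raises if validation fails.
    article = NewsArticle(
        id="123",
        title="Example",
        content="Some body text",
        published_at="2024-06-01T00:00:00",
    )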
* * *

#### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-ultimate-mlops-tool?r=1ttoeh" }, { "id": "1436e3e5-eb7c-4632-a538-00fd69c01998", "content": { "Title": "The new king of Infrastructure as Code (IaC)", "Subtitle": "Monitoring your DL models while in production. How to build a scalable data collection pipeline", "Content": "# The new king of Infrastructure as Code (IaC)

### Monitoring your DL models while in production. How to build a scalable data collection pipeline

Paul Iusztin

Jun 29, 2024

 _Decoding ML Notes_

### **This week's topics:**

 * The new king of Infrastructure as Code (IaC)

 * How to build a scalable data collection pipeline

 * Monitoring your DL models while in production

* * *

### The new king of Infrastructure as Code (IaC)

This is the new king of Infrastructure as Code (IaC).
Here is why it is better than Terraform or CDK ↓

→ I am talking about Pulumi ←

Let's see what it is made of

↓↓↓

What is Pulumi and how is it different?

Unlike other IaC tools that use YAML, JSON, or a Domain-Specific Language (DSL), Pulumi lets you write code in languages like Python, TypeScript, Node.js, etc.
\- This enables you to leverage existing programming knowledge and tooling for IaC tasks.
\- Pulumi integrates with familiar testing libraries for unit and integration testing of your infrastructure code.
\- It integrates with most cloud providers (AWS, GCP, Azure, Oracle, etc.)

Benefits of using Pulumi:

Flexibility: Use your preferred programming language for IaC + it works for most clouds out there.
Efficiency: Leverage existing programming skills and tooling.
Testability: Write unit and integration tests for your infrastructure code.
Collaboration: Enables Dev and Ops to work together using the same language.

If you disagree, try to apply OOP or logic (if, for statements) to Terraform HCL's syntax.

It works, but it quickly becomes a living hell.

How Pulumi works:

\- Pulumi uses a declarative approach. You define the desired state of your infrastructure.
\- It manages the state of your infrastructure using a state file.
\- When changes are made to the code, Pulumi compares the desired state with the current state and creates a plan to achieve the desired state.
\- The plan shows what resources will be created, updated, or deleted.
\- You can review and confirm the plan before Pulumi executes it.

→ It works similarly to Terraform, but with all the benefits your favorite programming language and existing tooling provide.

→ It works similarly to CDK, but faster and for your favorite cloud infrastructure (not only AWS).

Pulumi code example

 _What do you think? Have you used Pulumi?_

We started using it for the LLM Twin course, and so far, we love it!
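For reference, a Pulumi program in Python looks roughly like the sketch below (using the pulumi and pulumi_aws packages; the resource names, IAM policy, timeout, and image tag are illustrative assumptions, not the course's actual stack):

    import pulumi
    import pulumi_aws as aws

    # An ECR repository to hold the Docker images.
    repo = aws.ecr.Repository("data-crawlers")

    # An IAM role the Lambda function will assume.
    role = aws.iam.Role(
        "crawler-role",
        assume_role_policy="""{
          "Version": "2012-10-17",
          "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"}
          }]
        }""",
    )

    # A container-based Lambda function built from an image pushed to the ECR repo.
    crawler = aws.lambda_.Function(
        "crawler",
        package_type="Image",
        image_uri=repo.repository_url.apply(lambda url: f"{url}:latest"),
        role=role.arn,
        timeout=900,
    )

    pulumi.export("crawler_arn", crawler.arn)

Running `pulumi up` then shows the plan (create/update/delete) and asks for confirmation before applying it, exactly like the declarative flow described above.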
I will probably wholly migrate from Terraform to Pulumi in future projects.

> 🔗 More on Pulumi

* * *

### How to build a scalable data collection pipeline

Build, deploy to AWS, IaC, and CI/CD for a data collection pipeline that crawls your digital data → What do you need 🤔

The end goal?

A scalable data pipeline that crawls, collects, and stores all your digital data from:

\- LinkedIn
\- Medium
\- Substack
\- Github

To build it - here is what you need ↓

1\. Selenium: a Python tool for automating web browsers. It's used here to interact with web pages programmatically (like logging into LinkedIn, navigating through profiles, etc.)

2\. BeautifulSoup: a Python library for parsing HTML and XML documents. It creates parse trees that help us extract the data quickly.

3\. MongoDB (or any other NoSQL DB): a NoSQL database fits like a glove on our unstructured text data.
4\. An ODM: a technique that maps between an object model in an application and a document database.

5\. Docker & AWS ECR: to deploy our code, we have to containerize it, build an image for every change of the main branch, and push it to AWS ECR.

6\. AWS Lambda: we will deploy our Docker image to AWS Lambda - a serverless computing service that allows you to run code without provisioning or managing servers. It executes your code only when needed and scales automatically, from a few daily requests to thousands per second.

7\. Pulumi: IaC tool used to programmatically create the AWS infrastructure: MongoDB instance, ECR, Lambdas and the VPC.

8\. GitHub Actions: used to build our CI/CD pipeline - on any merged PR to the main branch, it will build & push a new Docker image and deploy it to the AWS Lambda service.

ETL architecture to collect digital data from social media platforms

Curious how these tools work together?

> Then...
>
> ↓↓↓
>
> Check out Lesson 2 from the FREE LLM Twin Course created by Decoding ML
>
> ...where we will walk you step-by-step through the architecture and code of the data pipeline:
>
> 🔗 The Importance of Data Pipelines in the Era of Generative AI

* * *

### Monitoring your DL models while in production

Monitoring is THE key MLOps element in ensuring your models in production are fail-safe.
Here is an article on ML monitoring using Triton, Prometheus and Grafana ↓

Razvant Alexandru wrote a fantastic step-by-step article in the Decoding ML Newsletter on monitoring your DL models while in production.

Within his article, he started with an example where, in one of his projects, a main processing task was supposed to take less than 5 hours, but while in production, it jumped to more than 8 hours.

→ This (or something similar) will happen to all of us.

Even to the greatest.

It's impossible to always anticipate everything that will happen in production (sometimes it is a waste of time even to try to).

That is why you always need eyes and ears on your production ML system.

Otherwise, imagine how much $$$ or how many users he would have lost if he hadn't detected the ~3-4 hour loss in performance as fast as possible.

Afterward, he explained step-by-step how to use:

\- cAdvisor to scrape RAM/CPU usage per container

\- Triton Inference Server to serve ML models and yield GPU-specific metrics

\- Prometheus to bind between the metrics generators and the consumer
\- Grafana to visualize the metrics

> Check it out on Decoding ML
>
> ↓↓↓
>
> 🔗 How to ensure your models are fail-safe in production?

* * *

#### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-new-king-of-infrastructure-as?r=1ttoeh" }, { "id": "fd48444e-ab32-49b9-afdc-14fe8ecafd41", "content": { "Title": "Data Ingestion Architecture for ML and Marketing Intelligence", "Subtitle": "Building a highly scalable data collection pipeline for AI, ML and marketing intelligence leveraging the AWS cloud, Python, data crawling, and Docker.", "Content": "# Highly Scalable Data Ingestion Architecture for ML and Marketing Intelligence

### Leveraging AWS Ecosystem and Data Crawling for Scalable and Adaptive Data Pipelines

Rares Istoc

Jun 27, 2024

**Today's article** is **written** by our **guest**, **Rares Istoc**, a veteran with over 7 years of experience building scalable software and data engineering systems in the industry.

→ Here is his 🔗 LinkedIn.

Machine learning without data is like a chef without ingredients - all the skills but nothing to cook.

These days, everything circulates around data, from personalized ads to streaming recommendations. Data drives decisions in business, healthcare, and sports. Without it, apps would be clueless, smart devices would be dumb, and predictions would be nothing more than guesses. In this digital age, data is the lifeblood of innovation and efficiency.

**Ok, but why another article about data ingestion?**

There are many ways to build data ingestion pipelines, and with all the new tools created over the last decade, selecting the best ones can be challenging. The answer often depends on your project's specific needs.

In this article, you'll explore an end-to-end solution for marketing intelligence. 
Using AWS\u2019s ecosystem, you can create a scalable data-ingestion\npipeline for data crawling and integrate it into various analytical processes\nlike sales, competitor analysis, market analysis, and customer insights.\n\nI\u2019ll also present the challenges encountered while building this solution.\nFinding a complete working solution is tough, with most answers scattered\nacross the Internet. You can access the full solution code on \ud83d\udd17 **GitHub**.\n\n_**IMPORTANT NOTE:** Before diving into this solution, you must be aware of\nthe legal implications of ingesting data from some data sources, like social\nmedia pages, so we can make sure nobody goes to jail. Please read the terms\nand conditions of each major platform; these will restrict you from crawling\nuser profiles and private pages._\n\n* * *\n\n### Table of Contents:\n\n 1. Architecture Overview\n\n 2. Implementation\n\n 3. Challenges & Pitfalls\n\n 4. Local Testings\n\n 5. Deployment\n\n* * *\n\n### 1\\. Architecture Overview\n\nThis is what we are about to build:\n\nHere are some non-functional requirements I\u2019ve aimed to achieve with this\narchitecture:\n\n**Scalability:** The solution can process many pages simultaneously and easily\nadd more, handling growth at any time.\n\n**Maintainability & Adaptability:** Each component is designed for easy\nmodification and expansion without significant development time.\n\n**Components Overview:**\n\n\u2022 **Scheduler:** Triggers crawler lambdas for each page link.\n\n\u2022 **Crawler:** Extracts various posts and information from the page link. If\nunfamiliar with crawling, look it up before proceeding. Details will follow in\nthe implementation part.\n\n\u2022 **Database:** MongoDB is used for our data lake storage, housing posts for\nlater use. It excels at handling semi-structured data.\n\nThe complete flow: the scheduler triggers a crawler lambda for each page,\nsending the page name and link. The crawler extracts posts from the past week,\nstoring the raw content, creation date, link, and name. The scheduler waits\nfor all lambdas to finish, aggregates the posts from the database, and sends\nthem to ChatGPT using prompt templates to generate reports.\n\n### 2\\. Implementation\n\nIn this section, I\u2019ll provide a detailed overview of the main components,\nbreaking them down with code samples and explanations.\n\n#### 2.1. Scheduler\n\nI\u2019ll not focus much on the reporting part, though you can find it **here**\nalong with all the code shared in this article. 
The main focus is the scheduling part, the entry point of the system where the flow starts and is orchestrated:

    import json
    import os
    import time
    from datetime import datetime, timedelta

    import boto3
    from aws_lambda_powertools import Logger
    from aws_lambda_powertools.utilities.typing import LambdaContext

    from src.constants import PAGE_LINK
    from src.db import database
    from src.utils import monitor
    # Assumed location of the report-generation helper used at the end of the handler.
    from src.reports import generate_profiles_report

    logger = Logger(service="decodingml/scheduler")

    _client = boto3.client("lambda")


    def lambda_handler(event, context: LambdaContext):
        correlation_ids = []

        for link in PAGE_LINK:
            response = _client.invoke(
                FunctionName="lambda",
                InvocationType="Event",
                Payload=json.dumps({"link": link}),
            )
            logger.info(f"Triggered crawler for: {link}")

            correlation_ids.append(response["ResponseMetadata"]["RequestId"])

        logger.info(f"Monitoring: {len(correlation_ids)} crawler processes")

        while True:
            time.sleep(15)
            completed = monitor(correlation_ids)

            correlation_ids = [c for c in correlation_ids if c not in completed]

            if not correlation_ids:
                break

            logger.info(f"Still waiting for {len(correlation_ids)} crawlers to complete")

        now = datetime.now()
        posts = list(
            database.profiles.find(
                {
                    "date": {"$gte": (now - timedelta(days=7)), "$lte": now},
                }
            )
        )

        logger.info(f"Gathered {len(posts)} posts")

        if not posts:
            logger.info("Cannot generate report, no new posts available")
            return

        reports = generate_profiles_report(posts)

        logger.info("Generated new report!")

The scheduler acts as a scatterer, iterating over a list of page links and invoking a crawler asynchronously with the InvocationType parameter set to Event, ensuring the scheduler won't block for a single page. It stores each lambda's correlation ID in a list and waits for all lambdas to finish, with a 15-second wait time, adjustable based on your crawler's average completion time. Finally, it finds all crawled posts and sends them to the report generation phase.

#### 2.2. Crawler

Here I'll break down the actual crawling process:

    import abc
    import os
    from datetime import datetime, timedelta
    from itertools import takewhile, dropwhile
    from typing import List, Dict, Any
    from urllib.parse import urlparse

    import instaloader


    class BaseAbstractCrawler(abc.ABC):

        @abc.abstractmethod
        def extract(self, link: str, **kwargs) -> None: ...


    class InstagramCrawler(BaseAbstractCrawler):

        def __init__(self, link: str, proxy=None):
            self.link = link
            self.loader = instaloader.Instaloader()
            self._until = datetime.now()
            self._since = self._until - timedelta(days=7)
            self._proxy = proxy

        def extract(self, **kwargs) -> List[Dict[str, str | Any]]:
            parsed_url = urlparse(self.link)

            if self._proxy:
                os.environ['https_proxy'] = self._proxy.__dict__().get('http')
            profile = instaloader.Profile.from_username(self.loader.context, parsed_url.path.strip('/').split('/')[0])
            posts = takewhile(lambda p: p.date > self._since, dropwhile(lambda p: p.date > self._until, profile.get_posts()))

            return [
                {'content': post.caption, 'date': post.date, 'link': self.link}
                for post in posts
            ]

I've defined a main abstraction point for all crawlers, establishing a common interface that all derived crawlers must implement. 
Each subclass must provide\nits implementation for the `extract()` method, ensuring reusability and\nuniformity.\n\n \n \n import re\n \n from src.crawlers.base import BaseAbstractCrawler\n from src.crawlers.instagram import InstagramCrawler\n \n \n class CrawlerDispatcher:\n \n def __init__(self) -> None:\n self._crawlers = {}\n \n def register(self, domain: str, crawler: type[BaseAbstractCrawler]) -> None:\n self._crawlers[r\"https://(www\\.)?{}.com/*\".format(re.escape(domain))] = crawler\n \n def get_crawler(self, url: str) -> BaseAbstractCrawler:\n for pattern, crawler in self._crawlers.items():\n if re.match(pattern, url):\n return crawler()\n else:\n raise ValueError(\"No crawler found for the provided link\")\n \n \n dispatcher = CrawlerDispatcher()\n dispatcher.register('instagram', InstagramCrawler)\n\nTo promote and call each crawler automatically, I\u2019ve built a dispatcher that\nselects and instantiates the correct crawler class based on the provided link.\nThis acts as a registry and factory for the crawlers, managed under a unified\ninterface and structure.\n\nAdvantages:\n\n\u2022 **Flexibility & Scalability:** Allows easy addition of new domains and\nspecialized crawlers without modifying the existing codebase.\n\n\u2022 **Encapsulation & Modularity:** The dispatcher encapsulates the logic for\ndetermining which crawler to use, making the system modular and allowing each\ncrawler to focus on its core business logic.\n\n \n \n from datetime import datetime, timedelta\n \n from aws_lambda_powertools import Logger\n from aws_lambda_powertools.utilities.typing import LambdaContext\n \n from src.crawlers import dispatcher\n from src.db import database\n \n logger = Logger(service=\"decodingml/crawler\")\n \n \n def lambda_handler(event, context: LambdaContext):\n \n link = event.get('link')\n \n logger.info(f\"Start extracting posts for {link}\")\n \n crawler = dispatcher.get_crawler(event.get('link'))\n \n posts = [{**page, 'correlation_id': context.aws_request_id} for page in crawler.extract()]\n \n now = datetime.now()\n existing_posts = database.profiles.find({\n \"date\": {\"$gte\": (now - timedelta(days=7)), \"$lte\": now},\n \"name\": link\n }, projection={'date': 1})\n \n existing_posts = [post.get('date') for post in list(existing_posts)]\n \n posts = [post for post in posts if post.get('date') not in existing_posts]\n \n if not posts:\n logger.info(\"No new posts on page\")\n return\n \n logger.info(f\"Successfully extracted {len(posts)} posts\")\n database.profiles.insert_many(posts)\n logger.info(f\"Successfully inserted data in db\")\n\nThe main entry point assembles the link from the event body, selects the\ncorrect crawler, and starts extraction jobs. After extraction, it checks for\nexisting posts to avoid duplicates and adds new posts to the database.\n\n### 3\\. Challenges & Pitfalls\n\n#### 3.1. Running headless browser instance with selenium in lambda runtime\nenvironment\n\nThis caused the most headaches. The Lambda execution environment is read-only,\nso writing to disk requires using a temporary file, complicating automatic\nbinary driver installation. 
Therefore, you need to install the driver directly in the Docker image and reference it manually in Selenium's driver options. The only usable driver for this setup was the Google binary driver in my case.

    FROM public.ecr.aws/lambda/python:3.11 as build

    # Download chrome driver and browser and manually unpack them in their folders
    RUN yum install -y unzip && \
        curl -Lo "/tmp/chromedriver-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chromedriver-linux64.zip" && \
        curl -Lo "/tmp/chrome-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chrome-linux64.zip" && \
        unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
        unzip /tmp/chrome-linux64.zip -d /opt/


    FROM public.ecr.aws/lambda/python:3.11

    # Install the function's OS dependencies using yum
    RUN yum install -y \
        atk \
        cups-libs \
        gtk3 \
        libXcomposite \
        alsa-lib \
        libXcursor \
        libXdamage \
        libXext \
        libXi \
        libXrandr \
        libXScrnSaver \
        libXtst \
        pango \
        at-spi2-atk \
        libXt \
        xorg-x11-server-Xvfb \
        xorg-x11-xauth \
        dbus-glib \
        dbus-glib-devel \
        nss \
        mesa-libgbm \
        ffmpeg \
        libxext6 \
        libssl-dev \
        libcurl4-openssl-dev \
        libpq-dev

    COPY --from=build /opt/chrome-linux64 /opt/chrome
    COPY --from=build /opt/chromedriver-linux64 /opt/

    COPY ./pyproject.toml ./poetry.lock ./

    # Install Poetry, export dependencies to requirements.txt, and install dependencies
    # in the Lambda task directory, finally cleanup manifest files.
    RUN python3 -m pip install --upgrade pip && pip install poetry
    RUN poetry export -f requirements.txt > requirements.txt && \
        pip3 install --no-cache-dir -r requirements.txt --target "${LAMBDA_TASK_ROOT}" && \
        rm requirements.txt pyproject.toml poetry.lock

    # Copy function code
    COPY ./src ${LAMBDA_TASK_ROOT}/src

The main idea in this Dockerfile is that I manually downloaded the Chrome driver and browser and unpacked them in a location where they can be accessed by Selenium, which usually would've done this automatically.

This is a mandatory step for the Lambda environment. Since everything is read-only, the next code sample shows how to point Selenium to the correct driver and browser locations:

    from tempfile import mkdtemp

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service


    def init_driver(self):
        options = Options()
        # Set up the driver binary location manually
        options.binary_location = '/opt/chrome/chrome'
        # Run browser in headless mode
        options.add_argument('--headless=new')
        options.add_argument('--no-sandbox')
        options.add_argument('--single-process')
        options.add_argument('--window-size=1420,1080')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--disable-popup-blocking')
        options.add_argument('--disable-notifications')
        options.add_argument('--disable-dev-tools')
        options.add_argument('--log-level=3')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument("--no-zygote")
        options.add_argument(f"--user-data-dir={mkdtemp()}")
        options.add_argument(f"--data-path={mkdtemp()}")
        options.add_argument(f"--disk-cache-dir={mkdtemp()}")
        options.add_argument('--remote-debugging-port=9222')

        self._driver = webdriver.Chrome(
            service=Service("/opt/chromedriver"),
            options=options,
        )

I hardcoded the driver and browser locations in the Dockerfile. 
Additionally, I pointed several folders (e.g., user-data-dir, disk-cache-dir) to temporary directories to prevent Selenium from creating them automatically, which would cause errors due to Lambda's disk limitations.

#### 3.2. Aggregate Empty Pages

My initial monitoring algorithm was basic, looping over lambda invocation correlation IDs and checking the database for generated posts. However, it encountered an infinite loop when no new posts were created for some pages.

    import datetime
    import re
    from typing import List

    import boto3

    _client = boto3.client('logs')


    def monitor(correlation_ids: List[str]):
        finished = []

        # Look back over the last day of log events.
        now = int((datetime.datetime.now() - datetime.timedelta(days=1)).timestamp() * 1000)

        response = _client.filter_log_events(
            logGroupName='/aws/lambda/crawler',
            startTime=now,
            filterPattern="REPORT RequestId"
        )

        for event in response['events']:
            match = re.search(r'REPORT RequestId: ([^\s]+)', event.get('message'))
            if match:
                correlation_id = match.group(1)
                if correlation_id in correlation_ids:
                    finished.append(correlation_id)

        return finished

Here, I search through all log streams for each lambda generated in the current day and look for the message, which usually has the format _**REPORT RequestId: [the request ID]**_. This indicates that the lambda has reached the end of its execution, and I can mark which correlation IDs have finished.

#### 3.3. Avoid being blocked by social media platforms

This was a pesky error — the kind you would've spent days on — and the solution was to look at it from a different perspective. Popular social media platforms implement many anti-bot protection mechanisms to prevent crawling, from request header analysis to rate limiting to IP blocking.

And because we run our browser in headless mode to mimic realistic user-browser interaction, and all our crawlers repeatedly send requests to multiple pages at the same time under the same IP address, this screams "please block me."

To address this, I've used a proxy to mask my IP address and location:

    import os


    class ProxyConnection:

        def __init__(
            self,
            host: str = None,
            port: str = None,
            username: str = None,
            password: str = None,
            verify_ssl: bool = False
        ):
            self.host = host or os.getenv('PROXY_HOST')
            self.port = port or os.getenv('PROXY_PORT')
            self.username = username or os.getenv('PROXY_USERNAME')
            self.password = password or os.getenv('PROXY_PASSWORD')
            self.verify_ssl = verify_ssl
            self._url = f"{self.username}:{self.password}@{self.host}:{self.port}"

        def __dict__(self):
            return {
                'https': 'https://{}'.format(self._url.replace(" ", "")),
                'http': 'http://{}'.format(self._url.replace(" ", "")),
                'no_proxy': 'localhost, 127.0.0.1',
                'verify_ssl': self.verify_ssl
            }

Paid proxies like SmartProxy offer a pool of rotating IPs, assigning a different IP to each crawler and mimicking regular user behavior. Additionally, using a proxy allows finding a country without access restrictions to public pages, ensuring smooth crawling.

### 4. Local Testings

To prove this works, I wrote a makefile containing some simple commands for the crawler and the lambda. The problem is that I've only managed to test the crawler locally. 
Since the scheduler spins up crawlers, they should already be deployed on AWS.

    local-test-crawler: # Send test command on local to test the lambda
    	curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
    	  -d '{"link": "https://www.instagram.com/mcdonalds"}'

    local-test-scheduler: # Send test command on local to test the lambda
    	curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

Now, most people, when testing lambda functions on a local environment, use the AWS Lambda **RIE (Runtime Interface Emulator)**, which allows you to test your lambda function packages in a container. Basically, it emulates a lambda execution environment on your local machine. As you can see, I've managed to do this without using the emulator, which slightly simplified my environment.

You can use these commands to test each component. For example, if you would like to test the crawler, go into your terminal and use this command:

    > make local-test-crawler

As you can see, the crawling process has started, and for this page, we've found three new posts in the last seven days.

### 5. Deployment

The deployment process is defined in **our GitHub** repository under the **ops** folder, where you can explore the whole solution written in Pulumi.

You can play with the Makefile. It contains all the necessary commands to get your infrastructure up and running.

* * *

### Conclusion

In this article, we've explored a complete end-to-end robust solution for building a Highly Scalable Data Ingestion pipeline that can leverage existing data from multiple crawlable sources for various processes like ML training, data analysis, etc.

We've gone through specific challenges you might face and how to overcome them in this process.

| _🔗 **Check out** the code on GitHub [1] and support us with a_ ⭐️

* * *

Within our newsletter, we keep things short and sweet. If you enjoyed reading this article, consider checking out the full version on Medium. It's still free ↓

Full article on Medium

* * *

#### Images

If not otherwise stated, all images are created by the author.

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/highly-scalable-data-ingestion-architecture?r=1ttoeh" }, { "id": "9c6f5239-fc76-4fe9-a8e2-77f662d0c69f", "content": { "Title": "2 Key LLMOps Concepts - by Alex Razvant", "Subtitle": "How to monitor LLM & RAG applications. Evaluate your RAG like a pro. 
Learn about memory/compute requirements on LLMs.", "Content": "# 2 Key LLMOps Concepts

### How to monitor LLM & RAG applications. Evaluate your RAG like a pro. Learn about memory/compute requirements on LLMs.

Alex Razvant

Jun 22, 2024

 _Decoding ML Notes_

### **This week's topics:**

 * A powerful framework to evaluate RAG pipelines

 * Why do LLMs require so much VRAM?

 * LLMOps Chain Monitoring

* * *

### One framework to evaluate your RAG - RAGAs

Building a RAG pipeline is fairly simple. You just need a Vector-DB knowledge base, an LLM to process your prompts, plus additional logic for interactions between these modules.

Lesson 10: Evaluating the RAG pipeline. (Image by Author)

However, reaching a satisfying performance level imposes its challenges due to the “separate” components:

 1. **Retriever** — which takes care of querying the Knowledge DB and retrieves additional context that matches the user’s query.

 2. **Generator** — which encompasses the LLM module, generating an answer based on the context-augmented prompt. When evaluating a RAG pipeline, we must evaluate both components separately and together.

🔸 **What is RAGAs?**

A framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. One of the core concepts of RAGAs is Metric-Driven Development (MDD), a product development approach that relies on data to make well-informed decisions.

🔸 **What metrics do RAGAs expose?**

🔽 For the Retrieval Stage:

↳ Context Precision: evaluates the precision of the context used to generate an answer, ensuring relevant information is selected from the context.
↳ Context Relevancy: measures how relevant the selected context is to the question.
↳ Context Recall: measures if all the relevant information required to answer the question was retrieved. 
↳ Context Entities Recall: evaluates the recall of entities within the context, ensuring that no important entities are overlooked.

🔽 For the Generation Stage:

↳ Faithfulness: measures how accurately the generated answer reflects the source content, ensuring the generated content is truthful and reliable.
↳ Answer Relevance: validates that the response directly addresses the user's query.
↳ Answer Semantic Similarity: shows that the generated content is semantically aligned with expected responses.
↳ Answer Correctness: focuses on fact-checking, assessing the factual accuracy of the generated answer.

🔸 **How to evaluate using RAGAs?**

1\. Prepare your questions, answers, contexts and ground_truths
2\. Compose a Dataset object
3\. Select metrics
4\. Evaluate
5\. Monitor scores or log the entire evaluation chain to a platform like CometML.

For a full end-to-end workflow of RAGAs evaluation in practice, I've described it in this LLM-Twin Course Article 👇:

How to Evaluate RAGs Medium Article

* * *

### Why are LLMs so Memory-hungry?

LLMs require lots of GPU memory, but let's see why that's the case. 👇

🔸 What is an LLM parameter?

LLMs, like Mistral 7B or Llama3-8B, have billions of parameters. Each parameter is a weight stored and accessed during computation.

🔸 How much GPU VRAM is required? 
There are three popular precision formats that LLMs are trained in:

→ FP32 - 32-bit floating point
→ FP16 / BF16 - 16-bit floating point

Most use mixed precision, e.g., matmul in BF16 and accumulations in FP32.

For this example, we'll use half-precision BF16.

Here's a deeper dive on this topic:
🔗 Google BFloat16
🔗 LLMs Precision Benchmark

🔹 Let's calculate the VRAM required:

\\(\begin{align*} \text{VRAM} &= \text{Size}(\text{params}) + \text{Size}(\text{activations}) \\ \text{Size}(\text{params}) &= \text{Params} \times \text{Precision}(\text{bytes}) \end{align*}\\)

As 1 byte = 8 bits, we've got:
→ FP32 = 32 bits = 4 bytes
→ FP16/BF16 = 16 bits = 2 bytes

Now, for a 7B model, we would require:
→ VRAM = 7 * 10^9 (billion) * 2 bytes = 14 * 10^9 bytes

Knowing that 1 GB = 10^9 bytes, we get 14GB as the required VRAM to load a 7B model for inference in half BF16 precision.

This is purely for loading the parameters.

Ever encountered the CUDA OOM error, e.g. "Tried to allocate +56MB ...", when inferencing? Here's the most plausible cause for that:

⭕ No GPU VRAM left for the activations. 
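Before moving on to activations, here is the parameter-memory arithmetic above as a tiny sanity-check script (pure back-of-the-envelope math, ignoring activations, KV cache and framework overhead):

    BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


    def param_vram_gb(n_params: float, precision: str) -> float:
        # Size(params) = Params x Precision(bytes); 1 GB = 10^9 bytes.
        return n_params * BYTES_PER_PARAM[precision] / 1e9


    print(param_vram_gb(7e9, "bf16"))  # ~14.0 GB for a 7B model in BF16
    print(param_vram_gb(7e9, "fp32"))  # ~28.0 GB for the same model in FP32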
Let's figure out the activation size required by using Llama2-7B as an example.

🔸 Activations are a combination of the following model parameters:
\- Context Length (N)
\- Hidden Size (H)
\- Precision (P)

After a quick look at the Llama2-7B model configuration, we get these values:
\- Context Length (N) = 4096 tokens
\- Hidden Size (H) = 4096 dims
\- Precision (P) = BF16 = 2 bytes
🔗 Llama2-7B Model Params: shorturl.at/CWOJ9

Consult this interactive LLM-VRAM calculator to check on the different memory segments reserved when inferencing/training LLMs.

🟢 Inference/Training VRAM Calculator

🟡 For training, things are a little different, as more factors come into play and memory is allocated for:
↳ Full Activations considering N(Heads) and N(Layers)
↳ Optimizer States, which differ based on the optimizer type
↳ Gradients

Here's a tutorial on PEFT, QLoRA fine-tuning in action 👇:

LLM Fine Tuning Medium Article

Other Resources:
📔 Model Anatomy: shorturl.at/nJeu0
📔 VRAM for Serving: shorturl.at/9UPBE
📔 LLM VRAM Explorer: shorturl.at/yAcTU

* * *

### One key LLMOps concept - Chain Monitoring

In traditional ML systems, it is easier to backtrack to a problem compared to Generative AI ones based on LLMs. When working with LLMs, their generative nature can lead to complex and sometimes unpredictable behavior.

🔹 A solution for that?

"Log prompts or entire chains with representative metadata when testing/evaluating your LLM." One platform that I like and have been using for this task is CometML LLM.

🔸 Here are a few cases where it proves beneficial:

→ For Summarisation Tasks
\ud835\udde7\ud835\uddee\ud835\ude00\ud835\uddf8\ud835\ude00\n\nHere you might have a query that represents the larger text, the LLMs response\nwhich is the summary, and you could calculate the ROUGE score inline between\nquery & response and add it to the metadata field. Then you can compose a JSON\nwith query, response, and rouge_score and log it to comet.\n\n\u2192 \ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\udde4&\ud835\uddd4 \ud835\udde7\ud835\uddee\ud835\ude00\ud835\uddf8\ud835\ude00 Here, you could log the Q&A pairs separately, or even add an\nevaluation step using a larger model to evaluate the response. Each pair would\nbe composed of Q, A, GT, and True/False to mark the evaluation.\n\n\u21b3 \ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\uddda\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\udde7\ud835\uddee\ud835\ude00\ud835\uddf8\ud835\ude00 You could log the query and response, and append in the\nmetadata a few qualitative metrics (e.g. relevance, cohesiveness).\n\n\u21b3\ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\udde5\ud835\uddd4\ud835\uddda If you have complex chains within your RAG application, you could log\nprompt structures (sys_prompt, query), and LLM responses and track the chain\nexecution step by step.\n\n\u21b3 \ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\udde1\ud835\uddd8\ud835\udde5 You could define the entity fields and log the query, response,\nentities_list, and extracted_entities in the same prompt payload.\n\n\u21b3\ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\udde9\ud835\uddf6\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddfb\ud835\ude00\ud835\uddf3\ud835\uddfc\ud835\uddff\ud835\uddfa\ud835\uddf2\ud835\uddff\ud835\ude00 CometML LLM also allows you to log images associated\nwith a prompt or a chain. If you\u2019re working with GPT4-Vision for example, you\ncould log the query and the generated image in the same payload.\n\nAlso, besides the actual prompt payload, you could inspect the processing time\nper each step of a chain.\n\nFor example, a 3-step chain in an RAG application might query the Vector DB,\ncompose the prompt, and pass it to the LLM, and when logging the chain to\nCometML, you could see the processing time/chain step.\n\n\ud83d\udd39 \ud835\udde7\ud835\uddfc \ud835\ude00\ud835\uddf2\ud835\ude01 \ud835\uddf6\ud835\ude01 \ud835\ude02\ud835\uddfd, \ud835\ude06\ud835\uddfc\ud835\ude02'\ud835\uddf9\ud835\uddf9 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1:\n\n\\- CometML pip package \n\\- CometML API key - Workspace name and Project Name\n\nI've used this approach when evaluating a fine-tuned LLM on a custom\ninstruction dataset. For a detailed walkthrough \ud83d\udc47\n\nEvaluating LLMs Medium Article\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n10\n\nShare this post\n\n#### 2 Key LLMOps Concepts\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/2-key-llmops-concepts?r=1ttoeh" }, { "id": "87f34471-9a5b-4641-8272-15b6a18a9be7", "content": { "Title": "The LLM-Twin Free Course on Production-Ready RAG applications.", "Subtitle": "Learn how to build a full end-to-end LLM & RAG production-ready system, follow and code along each component by yourself.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### The LLM-Twin Free Course on Production-Ready RAG applications.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# The LLM-Twin Free Course on Production-Ready RAG applications.\n\n### Learn how to build a full end-to-end LLM & RAG production-ready system,\nfollow and code along each component by yourself.\n\nAlex Razvant\n\nJun 20, 2024\n\n13\n\nShare this post\n\n#### The LLM-Twin Free Course on Production-Ready RAG applications.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n\u2192 the **last lesson** of the LLM Twin free course\n\n**What is your LLM Twin?** It is an AI character that writes like yourself by\nincorporating your style, personality, and voice into an LLM.\n\n**Decoding ML Newsletter** is a reader-supported publication. If you enjoy our\nwork, please consider becoming a paid subscriber.\n\nSubscribe\n\nImage by DALL-E\n\n### **Why is this course different?**\n\n_By finishing the \u201c**LLM Twin: Building Your Production-Ready AI\nReplica\u201d**_****_free course, you will learn how to design, train, and deploy a\nproduction-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps\ngood practices_.\n\n_**Why should you care? \ud83e\udef5**_\n\n _**\u2192 No more isolated scripts or Notebooks!** Learn production ML by building\nand deploying an end-to-end production-grade LLM system._\n\n> _More**details** on what you will **learn** within the **LLM Twin**\n> **course** , **here** \ud83d\udc48_\n\n# **The LLM-Twin Free Course**\n\nThis course teaches you how to design, build, and deploy a production-ready\nLLM-RAG system. It covers all the components, system design, data ingestion,\nstreaming pipeline, fine-tuning pipeline, inference pipeline alongside\nproduction monitoring, and more.\n\n## **What is the course about?**\n\nWe\u2019re building a production-ready RAG system, able to write content based on\nyour unique style, by scrapping previous posts/articles and code snippets\nwritten by you to construct a fresh and continuously updated knowledge base,\ngenerate a dataset to fine-tune a capable and efficient open-source LLM, and\nthen interconnect all components for a full end-to-end deployment while\nintegrating evaluation and post-deployment monitoring.\n\nThis course follows best MLOps & LLMOps practices, focusing on the 3-pipeline-\ndesign pattern for building ML-centered applications.\n\n## **Lesson 1: Presenting the Architecture**\n\nPresenting and describing each component, the tooling used, and the intended\nworkflow of implementation. 
The first lesson will prepare the ground by\noffering a wide overview of each component and consideration.\n\n**We recommend you start here.**\n\n\ud83d\udd17 **Lesson 1:** An End-to-End Framework for Production-Ready LLM Systems by\nBuilding Your LLM Twin\n\nLLM twin system architecture [Image by the Author]\n\n## **Lesson 2: Data Pipelines**\n\nIn this lesson, we\u2019ll start by explaining what a data pipeline is, and the key\nconcepts of data processing and streaming, and then dive into the data\nscrapping and processing logic.\n\n\ud83d\udd17 **Lesson 2:** The Importance of Data Pipelines in the Era of Generative AI\n\nLesson 2: The Data Collection Pipeline [Image by author]\n\n## **Lesson 3: Change Data Capture and Data Processing**\n\nIn this lesson, we\u2019re showcasing the CDC(Change Data Capture) integration\nwithin the LLM-Twin data pipeline. We\u2019re showing how to set up MongoDB, the\nCDC approach for event-driven processing, RabbitMQ for message queuing, and\nefficient low-latency database querying using the MongoDB Oplog.\n\n\ud83d\udd17 **Lesson 3:** CDC Enabling Event-Driven Architectures\n\nLesson 3: Event-Driven Processing using RabbitMQ, CDC, and MongoDB (Image by\nAuthor)\n\n## **Lesson 4: Efficient Data Streaming Pipelines**\n\nIn this lesson, we\u2019ll focus on the feature pipeline. Here, we\u2019re showcasing\nhow we ingest data that we\u2019ve gathered in the previous lesson, and how we\u2019ve\nbuilt a stream-processing workflow with **Bytewax **that fetches raw samples,\nstructures them using Pydantic Models, cleans, chunks, encodes, and stores\nthem in our **Qdrant** Vector Database.\n\n\ud83d\udd17 **Lesson 4:** SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014\nin Real-Time!\n\nLesson 4: Efficient Data Streaming Pipelines using Bytewax and Qdrant Vector\nDB. (Image by Author)\n\n## **Lesson 5: Advanced RAG Optimization Techniques**\n\nIn this lesson, we\u2019ll showcase a few advanced techniques to increase the\nsimilarity and accuracy of the embedded data samples from our **Qdrant**\nVector Database. The contents of this lesson could make a significant\ndifference between a naive RAG application and a production-ready one.\n\n\ud83d\udd17 **Lesson 5:** The 4 Advanced RAG Algorithms You Must Know to Implement\n\nLesson 5: Advanced RAG Optimization Techniques. (Image by Author)\n\n## **Lesson 6: Dataset preparation for LLM fine-tuning**\n\nIn this lesson, we\u2019ll discuss the core concepts to consider when creating\ntask-specific custom datasets to fine-tune LLMs. We\u2019ll use our cleaned data\nfrom our Vector Database, and engineer specific Prompt Templates alongside\nusing GPT3.5-Turbo API to generate our custom dataset and version it on\n**Comet ML**.\n\n\ud83d\udd17 **Lesson 6:** The Role of Feature Stores in Fine-Tuning LLMs\n\nLesson 6: Generate custom datasets using Knowledge Distillation.\n\n## **Lesson 7: Fine-tuning LLMs on custom datasets**\n\nWe\u2019ll show how to implement a fine-tuning workflow for a Mistral7B-Instruct\nmodel while using the custom dataset we\u2019ve versioned previously. We\u2019ll present\nin-depth the key concepts including LoRA Adapters, PEFT, Quantisation, and how\nto deploy on Qwak.\n\n\ud83d\udd17 **Lesson 7:**How to fine-tune LLMs on custom datasets at Scale using Qwak\nand CometML\n\nLesson 7: Fine-tuning LLMs on custom datasets using Qwak and CometML. 
(Image\nby Author)\n\n## **Lesson 8: Evaluating the fine-tuned LLM**\n\nIn this lesson, we\u2019re discussing one core concept of ML - **Evaluation**. \nWe\u2019ll present the evaluation workflow we\u2019ll showcase the full process of\nassessing the model\u2019s performance using the GPT3.5-Turbo model and custom-\nengineered evaluation templates.\n\n\ud83d\udd17 **Lesson 8:**Best Practices When Evaluating Fine-Tuned LLMs\n\nLesson 8: Evaluating the quality of our custom fine-tuned LLM. (Image by\nAuthor)\n\n## **Lesson 9: Deploying the Inference Pipeline Stack**\n\nIn this lesson, we\u2019ll showcase how to design and implement the LLM & RAG\ninference pipeline based on a set of detached Python microservices. We\u2019ll\nsplit the ML and business logic into two components, describe each one in\npart, and show how to wrap up and deploy the inference pipeline on **Qwak** as\na scalable and reproducible system.\n\n\ud83d\udd17 **Lesson 9:**Architect scalable and cost-effective LLM & RAG inference\npipelines\n\nLesson 9: Architecturing LLM & RAG inference pipeline. (Image by Author)\n\n## **Lesson 10: RAG Pipeline Evaluation**\n\nIn this lesson, we\u2019re covering RAG evaluation \u2014 which is one of great\nimportance. If no proper evaluation metrics are monitored or techniques are\nused, the RAG systems might underperform and hallucinate badly.\n\nHere, we\u2019ll describe the workflow of evaluating RAG pipelines using the\npowerful RAGAs framework, compose the expected RAGAs evaluation format, and\ncapture eval scores which will be included in full LLM execution chains and\nlogged on **Comet ML LLM**.\n\n\ud83d\udd17 **Lesson 10:**Evaluating RAG Systems using the RAGAs Framework\n\nLesson 10: Evaluating the RAG pipeline. (Image by Author)\n\n### Next Steps\n\n#### Step 1\n\n**Check out** the **full versions** of all **Lessons 1-11** on our **Medium\npublication** , under the LLM-Twin Course group tag. _It\u2019s still FREE:_\n\nThe LLM-Twin Course\n\n#### Step 2\n\n\u2192 **Check out theLLM Twin GitHub repository and try it yourself \ud83e\udef5**\n\n _Nothing compares with getting your hands dirty and building it yourself!_\n\nLLM Twin Course - GitHub\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n13\n\nShare this post\n\n#### The LLM-Twin Free Course on Production-Ready RAG applications.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-llm-twin-free-course-on-production?r=1ttoeh" }, { "id": "d3cb26a9-45fe-42e0-9a79-7a2f358fc875", "content": { "Title": "A blueprint for designing production LLM systems: From Notebooks to production ", "Subtitle": "How to get a GitHub Copilot subscription for FREE (to 5x writing code). 
Learn to build production ML systems by building an LLM application.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### A blueprint for designing production LLM systems: From Notebooks to\nproduction\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# A blueprint for designing production LLM systems: From Notebooks to\nproduction\n\n### How to get a GitHub Copilot subscription for FREE (to 5x writing code).\nLearn to build production ML systems by building an LLM application.\n\nPaul Iusztin\n\nJun 15, 2024\n\n13\n\nShare this post\n\n#### A blueprint for designing production LLM systems: From Notebooks to\nproduction\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * How to get a GitHub Copilot subscription for FREE (to 5x writing code)\n\n * A blueprint for designing production LLM systems: From Notebooks to production\n\n * Learn to build production ML systems by building an LLM application\n\n* * *\n\n### How to get a GitHub Copilot subscription for FREE (to 5x writing code)\n\n\ud835\udddb\ud835\uddfc\ud835\ude04 to get a \ud835\uddda\ud835\uddf6\ud835\ude01\ud835\udddb\ud835\ude02\ud835\uddef \ud835\uddd6\ud835\uddfc\ud835\uddfd\ud835\uddf6\ud835\uddf9\ud835\uddfc\ud835\ude01 \ud835\ude00\ud835\ude02\ud835\uddef\ud835\ude00\ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb for \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8 (to 5x writing code) \u2193 \n \nThere are other alternatives, but GitHub Copilot is still the leading solution\ndue to 2 factors: performance & convenience. \n \nIf you can get it for free, there are 0 reasons not to use it (sneaky move\nMicrosoft) \u2193 \n \n\ud835\udde6\ud835\uddfc \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf6\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude00\ud835\uddfc\ud835\uddf9\ud835\ude02\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb? \n \nThere is no secret. \n \nAs stated in their docs: \"Verified students, teachers, and maintainers of\npopular open source projects on GitHub are eligible to use Copilot Individual\nfor free. \" \n \n\ud83d\udd17 Docs \n \nTo become a student or teacher when you are not is not a solution. \n \nBut... \n \nTo become a maintainer of a popular open-source project is!\n\n\ud835\udde6\ud835\uddfc \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddee \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddef\ud835\uddf2\ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddee \"\ud835\uddfa\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddff \ud835\uddfc\ud835\uddf3 \ud835\uddee \ud835\uddfd\ud835\uddfc\ud835\uddfd\ud835\ude02\ud835\uddf9\ud835\uddee\ud835\uddff \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2\n\ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf7\ud835\uddf2\ud835\uddf0\ud835\ude01\"? \n \nI don't know the exact formula, but here are some examples. 
\n \nI am eligible for it because I am the owner of a GitHub repository with ~2.2k\nstars & 350 forks: \ud83d\udd17 Hands-on LLMs Course \n \nAfter digging into some Reddit threads, a dude said that for a repo with ~520\nstars & 299 forks, you got the free subscription. \n \nThe idea is that you don't have to be a maintainer of Pandas or PyTorch to\nbecome eligible. \n \n. \n \n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\uddf0\ud835\uddf9\ud835\ude02\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddf6\ud835\ude00 \ud835\ude01\ud835\uddfc... \n \n\u2192 start contributing to open-source or creating your cool project, which will\ncomplete the job! \n \n. \n \n\ud835\ude10\ud835\ude27 \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude23\ud835\ude26\ud835\ude35\ud835\ude35\ud835\ude26\ud835\ude33 \ud835\ude2c\ud835\ude2f\ud835\ude30\ud835\ude38 \ud835\ude35\ud835\ude29\ud835\ude26 \"\ud835\ude34\ud835\ude26\ud835\ude24\ud835\ude33\ud835\ude26\ud835\ude35 \ud835\ude27\ud835\ude30\ud835\ude33\ud835\ude2e\ud835\ude36\ud835\ude2d\ud835\ude22/\ud835\ude24\ud835\ude33\ud835\ude2a\ud835\ude35\ud835\ude26\ud835\ude33\ud835\ude2a\ud835\ude22,\" \ud835\ude31\ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude34\ud835\ude26 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude37\ud835\ude26 \ud835\ude2a\ud835\ude35 \ud835\ude2a\ud835\ude2f \ud835\ude35\ud835\ude29\ud835\ude26\n\ud835\ude24\ud835\ude30\ud835\ude2e\ud835\ude2e\ud835\ude26\ud835\ude2f\ud835\ude35\ud835\ude34 \ud835\ude27\ud835\ude30\ud835\ude33 \ud835\ude30\ud835\ude35\ud835\ude29\ud835\ude26\ud835\ude33\ud835\ude34 \ud835\ude35\ud835\ude30 \ud835\ude2c\ud835\ude2f\ud835\ude30\ud835\ude38. \n \nAlso, let me know if you know that when contributing to open-source, you must\ncontribute by \"how much\" until you become eligible.\n\n* * *\n\n### A blueprint for designing production LLM systems: From Notebooks to\nproduction\n\nI am \ud835\uddfe\ud835\ude02\ud835\uddf6\ud835\ude01\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf0\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\ude01... \ud835\udddd\ud835\uddfc\ud835\uddf8\ud835\uddf6\ud835\uddfb\ud835\uddf4, but here is \ud835\uddf5\ud835\uddfc\ud835\ude04 to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 your \ud835\udddf\ud835\udddf\ud835\udde0\n\ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb for \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 posts or articles \ud835\ude02\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\ude03\ud835\uddfc\ud835\uddf6\ud835\uddf0\ud835\uddf2 \u2193 \n \n\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf6\ud835\ude00 \ud835\uddee\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb? \n \nIt's an AI character who writes like you, using your writing style and\npersonality. \n \n\ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\uddfb\ud835\uddfc\ud835\ude01 \ud835\uddf1\ud835\uddf6\ud835\uddff\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\ude02\ud835\ude00\ud835\uddf2 \ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\ude01\ud835\uddda\ud835\udde3\ud835\udde7? \ud835\uddec\ud835\uddfc\ud835\ude02 \ud835\uddfa\ud835\uddee\ud835\ude06 \ud835\uddee\ud835\ude00\ud835\uddf8... 
\n \nWhen generating content using an LLM, the results tend to: \n \n\\- be very generic and unarticulated, \n\\- contain misinformation (due to hallucination), \n\\- require tedious prompting to achieve the desired result. \n \n\ud835\udde7\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf6\ud835\ude00 \ud835\ude04\ud835\uddf5\ud835\ude06, \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\ude01, \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1 \ud835\uddee \ud835\ude00\ud835\uddfd\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude07\ud835\uddf2\ud835\uddf1 \ud835\ude01\ud835\uddfc\ud835\uddfc\ud835\uddf9 \ud835\ude01\ud835\uddf5\ud835\uddee\ud835\ude01: \n \n\u2192 is fine-tuned on your digital content to replicate your persona \n \n\u2192 has access to a vector DB (with relevant data) to avoid hallucinating and\nwrite only about concrete facts\n\n\ud835\udddb\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfa\ud835\uddee\ud835\uddf6\ud835\uddfb \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd\ud835\ude00 \ud835\uddff\ud835\uddf2\ud835\uddfe\ud835\ude02\ud835\uddf6\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\ude01\ud835\uddfc \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb: \n \n1\\. A data collection pipeline will gather your digital data from Medium,\nSubstack, LinkedIn and GitHub. It will be normalized and saved to a Mongo DB. \n \n2\\. Using CDC, you listen to any changes made to the Mongo DB and add them as\nevents to a RabbitMQ queue. \n \n3\\. A Bytewax streaming ingestion pipeline will listen to the queue to clean,\nchunk, and embed the data in real time. \n \n4\\. The cleaned and embedded data is loaded to a Qdrant vector DB. \n \n5\\. On the training pipeline side, you use a vector DB retrieval client to\nbuild your training dataset, which consists of the cleaned data (augmented\nusing RAG). \n \n6\\. You fine-tune an open-source Mistral LLM using QLoRA and push all the\nexperiment artifacts to a Comet experiment tracker. \n \n7\\. Based on the best experiment, you push the LLM candidate to Comet's model\nregistry. You carefully evaluate the LLM candidate using Comet's prompt\nmonitoring dashboard. If the evaluation passes, you tag it as accepted. \n \n8\\. On the inference pipeline side, you deploy the new LLM model by pulling it\nfrom the model registry, loading it, and quantizing it. \n \n9\\. 
The inference pipeline is wrapped by a REST API, which allows users to\nmake ChatGPT-like requests.\n\n* * *\n\n### Learn to build production ML systems by building an LLM application\n\nTaking in mind the _blueprint for designing production LLM systems presented\nabove_ , we want to let you know that:\n\n_\u2192 We are close to wrapping our LLM twin course lessons and code._\n\nTo give more context for newcomers, in the past weeks we started \ud835\uddff\ud835\uddf2\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 an\n\ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 on \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 by teaching you how to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 an \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb:\n\ud835\ude20\ud835\ude30\ud835\ude36\ud835\ude33 \ud835\ude17\ud835\ude33\ud835\ude30\ud835\ude25\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f-\ud835\ude19\ud835\ude26\ud835\ude22\ud835\ude25\ud835\ude3a \ud835\ude08\ud835\ude10 \ud835\ude19\ud835\ude26\ud835\ude31\ud835\ude2d\ud835\ude2a\ud835\ude24\ud835\ude22\n\nSo\u2026\n\nIf you are looking for an \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 on \ud835\uddf5\ud835\uddfc\ud835\ude04 to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\n\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00, consider checking the course's **first** FREE **lesson**. \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude24\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26 \ud835\ude38\ud835\ude2a\ud835\ude2d\ud835\ude2d \ud835\ude38\ud835\ude22\ud835\ude2d\ud835\ude2c \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude35\ud835\ude29\ud835\ude33\ud835\ude30\ud835\ude36\ud835\ude28\ud835\ude29 \ud835\ude22 \ud835\ude27\ud835\ude36\ud835\ude2d\ud835\ude2d-\ud835\ude34\ud835\ude35\ud835\ude22\ud835\ude24\ud835\ude2c \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude24\ud835\ude26\ud835\ude34\ud835\ude34: \n \n\u2192 from data gathering... \n \n...until deploying and monitoring your LLM twin using LLMOps \u2190 \n \n. \n \nWith that in mind... \n \nThe \ud835\udfed\ud835\ude00\ud835\ude01 \ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddfc\ud835\uddfb will walk you through: \n \n\\- the issues of generating content using ChatGPT (or other similar solutions) \n\\- the 3-pipeline design \n\\- the system design and architecture of the LLM twin \n \n. 
\n \nWithin the \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\ude00\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb, we will present all the \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddee\ud835\uddf9\n\ud835\uddf1\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 on \ud835\uddf5\ud835\uddfc\ud835\ude04 to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1: \n \n\\- a data collection pipeline \n\\- a real-time feature pipeline using a streaming engine \n\\- hook the data and feature pipelines using the CDC pattern \n\\- a continuous fine-tuning pipeline \n\\- an inference pipeline deployed as a REST API \n \n \nA \ud835\uddfd\ud835\uddee\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddf0\ud835\ude02\ud835\uddf9\ud835\uddee\ud835\uddff \ud835\uddf3\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\ude00 will be on \ud835\uddf6\ud835\uddfb\ud835\ude01\ud835\uddf2\ud835\uddf4\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 & \ud835\udddf\ud835\udddf\ud835\udde0\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddf4\ud835\uddfc\ud835\uddfc\ud835\uddf1 \ud835\uddfd\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddf0\ud835\uddf2\ud835\ude00: \n \n\\- prompt versioning \n\\- model registries \n\\- experiment tracker \n\\- prompt monitoring \n\\- CI/CD \n\\- IaC \n\\- Docker \n \n. \n \n\ud835\ude52\ud835\ude56\ud835\ude63\ud835\ude69 \ud835\ude69\ud835\ude64 \ud835\ude59\ud835\ude5e\ud835\ude5c \ud835\ude5e\ud835\ude63\ud835\ude69\ud835\ude64 \ud835\ude69\ud835\ude5d\ud835\ude5a 1\ud835\ude68\ud835\ude69 \ud835\ude61\ud835\ude5a\ud835\ude68\ud835\ude68\ud835\ude64\ud835\ude63? \n \n\ud835\uddd6\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 \ud835\uddf6\ud835\ude01 \ud835\uddfc\ud835\ude02\ud835\ude01. 
It's FREE, and no registration is required \n \n\u2193\u2193\u2193 \n \n\ud83d\udd17 \ud835\ude13\ud835\ude26\ud835\ude34\ud835\ude34\ud835\ude30\ud835\ude2f 1 - \ud835\ude08\ud835\ude2f \ud835\ude0c\ud835\ude2f\ud835\ude25-\ud835\ude35\ud835\ude30-\ud835\ude0c\ud835\ude2f\ud835\ude25 \ud835\ude0d\ud835\ude33\ud835\ude22\ud835\ude2e\ud835\ude26\ud835\ude38\ud835\ude30\ud835\ude33\ud835\ude2c \ud835\ude27\ud835\ude30\ud835\ude33 \ud835\ude17\ud835\ude33\ud835\ude30\ud835\ude25\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f-\ud835\ude19\ud835\ude26\ud835\ude22\ud835\ude25\ud835\ude3a \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude1a\ud835\ude3a\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude2e\ud835\ude34 \ud835\ude23\ud835\ude3a\n\ud835\ude09\ud835\ude36\ud835\ude2a\ud835\ude2d\ud835\ude25\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude20\ud835\ude30\ud835\ude36\ud835\ude33 \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude1b\ud835\ude38\ud835\ude2a\ud835\ude2f\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n13\n\nShare this post\n\n#### A blueprint for designing production LLM systems: From Notebooks to\nproduction\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/a-blueprint-for-designing-production?r=1ttoeh" }, { "id": "9d858911-52d4-4240-8d6e-91f6b426baa0", "content": { "Title": "The difference between development and continuous training ML environments", "Subtitle": "Looking to become a PRO in LangChain? How to write a streaming retrieval system for RAG on social media data.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### The difference between development and continuous training ML\nenvironments\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# The difference between development and continuous training ML environments\n\n### Looking to become a PRO in LangChain? 
How to write a streaming retrieval\nsystem for RAG on social media data.\n\nPaul Iusztin\n\nJun 08, 2024\n\n7\n\nShare this post\n\n#### The difference between development and continuous training ML\nenvironments\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * Looking to become a PRO in LangChain?\n\n * The difference between development and continuous training ML environments\n\n * How to write a streaming retrieval system for RAG on social media data\n\n* * *\n\n _**First** , I want to thank everyone who supported our Hands-on LLMs course\nrepo_ \ud83d\ude4f\ud83c\udffb\n\nThe \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 FREE \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 passed 2.1k+ \u2b50\ufe0f on GitHub - the place to \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\nthe \ud835\uddf3\ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf9\ud835\ude00 of \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 & \ud835\udddf\ud835\udddf\ud835\udde0\ud835\udde2\ud835\uddfd\ud835\ude00 \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude24\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26 \ud835\ude2a\ud835\ude34 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude28\ud835\ude30-\ud835\ude35\ud835\ude30 \ud835\ude29\ud835\ude36\ud835\ude23 \ud835\ude27\ud835\ude30\ud835\ude33 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude27\ud835\ude36\ud835\ude2f\ud835\ude25\ud835\ude22\ud835\ude2e\ud835\ude26\ud835\ude2f\ud835\ude35\ud835\ude22\ud835\ude2d\ud835\ude34 \ud835\ude30\ud835\ude27 \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude25\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f-\ud835\ude33\ud835\ude26\ud835\ude22\ud835\ude25\ud835\ude3a\n\ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude34 & \ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude16\ud835\ude31\ud835\ude34 \n \nIt will walk you through an \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf0\ud835\uddf2\ud835\ude00\ud835\ude00... 
\n \n...from data preparation to deployment & monitoring: \n \n\\- the 3-pipeline design \n\\- building your custom financial dataset using GPT-4 \n\\- a streaming pipeline to ingest financial news in real-time \n\\- fine-tuning an LLM using QLoRA \n\\- building a custom RAG pipeline \n\\- deploying the streaming pipeline to AWS \n\\- deploying the training & inference pipelines to Beam \n\\- using MLOps components: model registries, experiment trackers, prompt\nmonitoring \n \n\n\ud835\uddd6\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 \ud835\uddf6\ud835\ude01 \ud835\uddfc\ud835\ude02\ud835\ude01 \n \n\u2193\u2193\u2193 \n \n\ud83d\udd17 \ud835\ude0f\ud835\ude22\ud835\ude2f\ud835\ude25\ud835\ude34-\ud835\ude30\ud835\ude2f \ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude34 \ud835\ude0a\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26 - \ud835\ude13\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude35\ud835\ude30 \ud835\ude1b\ud835\ude33\ud835\ude22\ud835\ude2a\ud835\ude2f \ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude0b\ud835\ude26\ud835\ude31\ud835\ude2d\ud835\ude30\ud835\ude3a \ud835\ude22 \ud835\ude19\ud835\ude26\ud835\ude22\ud835\ude2d-\ud835\ude1b\ud835\ude2a\ud835\ude2e\ud835\ude26 \ud835\ude0d\ud835\ude2a\ud835\ude2f\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude2a\ud835\ude22\ud835\ude2d\n\ud835\ude08\ud835\ude25\ud835\ude37\ud835\ude2a\ud835\ude34\ud835\ude30\ud835\ude33\n\n* * *\n\n### Looking to become a PRO in LangChain?\n\nThen \ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 \ud835\uddfc\ud835\ude02\ud835\ude01 this \ud835\uddef\ud835\uddfc\ud835\uddfc\ud835\uddf8 on \ud835\uddf5\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\uddee\ud835\uddfb\ud835\uddf4\ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\uddf6\ud835\uddfb: from \ud835\uddef\ud835\uddf2\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddfb\ud835\uddf2\ud835\uddff to \ud835\uddee\ud835\uddf1\ud835\ude03\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf2\ud835\uddf1 \u2193 \n \n\u2192 It's called: \ud835\ude0e\ud835\ude26\ud835\ude2f\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude37\ud835\ude26 \ud835\ude08\ud835\ude10 \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude13\ud835\ude22\ud835\ude2f\ud835\ude28\ud835\ude0a\ud835\ude29\ud835\ude22\ud835\ude2a\ud835\ude2f: \ud835\ude09\ud835\ude36\ud835\ude2a\ud835\ude2d\ud835\ude25 \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude22\ud835\ude31\ud835\ude31\ud835\ude34 \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude17\ud835\ude3a\ud835\ude35\ud835\ude29\ud835\ude30\ud835\ude2f,\n\ud835\ude0a\ud835\ude29\ud835\ude22\ud835\ude35\ud835\ude0e\ud835\ude17\ud835\ude1b, \ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude30\ud835\ude35\ud835\ude29\ud835\ude26\ud835\ude33 \ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude34 by Ben Auffarth , published by Packt \n \n\ud835\ude0f\ud835\ude26\ud835\ude33\ud835\ude26 \ud835\ude2a\ud835\ude34 \ud835\ude22 \ud835\ude34\ud835\ude29\ud835\ude30\ud835\ude33\ud835\ude35 \ud835\ude23\ud835\ude33\ud835\ude26\ud835\ude22\ud835\ude2c\ud835\ude25\ud835\ude30\ud835\ude38\ud835\ude2f: \n \n\\- It begins with some theoretical chapters on LLMs & LangChain \n \n\\- It explores the critical components of LangChain: chains, agents, memory,\ntools \n \n\ud835\udde7\ud835\uddf5\ud835\uddf2\ud835\uddfb, \ud835\uddfa\ud835\ude06 \ud835\uddf3\ud835\uddee\ud835\ude03\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\uddf2 
\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\ude01... \n \n\ud835\udddc\ud835\ude01 \ud835\uddf7\ud835\ude02\ud835\uddfa\ud835\uddfd\ud835\ude00 \ud835\uddf1\ud835\uddf6\ud835\uddff\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\uddf6\ud835\uddfb\ud835\ude01\ud835\uddfc \ud835\uddf5\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 - \ud835\uddea\ud835\udddc\ud835\udde7\ud835\udddb \ud835\udde3\ud835\uddec\ud835\udde7\ud835\udddb\ud835\udde2\ud835\udde1 \ud835\uddd6\ud835\udde2\ud835\uddd7\ud835\uddd8 \u2193 \n \n\\- takes off with beginner-friendly examples of using LangChain with agents,\nHuggingFace, GCP/VertexAI, Azure, Anthropic, etc. \n \n\\- shows an end-to-end example of building a customer services application\nwith LangChain & VertexAI \n \n\\- how to mitigate hallucinations using the \ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude0a\ud835\ude29\ud835\ude26\ud835\ude24\ud835\ude2c\ud835\ude26\ud835\ude33\ud835\ude0a\ud835\ude29\ud835\ude22\ud835\ude2a\ud835\ude2f class \n \n\\- how to implement map-reduce pipelines \n \n\\- how to monitor token usage & costs \n \n\\- how to extract information from documents such as PDFs \n \n\\- building a Streamlit interface \n \n\\- how reasoning works in agent \n \n\\- building a chatbot like ChatGPT from SCRATCH \n \n. \n \nI haven't finished it yet, but I love it so far \u2014I plan to finish it soon. \n \n. \n \n\ud835\uddea\ud835\uddf5\ud835\uddfc \ud835\uddf6\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\ude00 \ud835\uddf3\ud835\uddfc\ud835\uddff? \n \nIf you are \ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfc\ud835\ude02\ud835\ude01 in the LLM world, this is a great book to \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1 \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\n\ud835\uddf2\ud835\uddfb\ud835\uddf1. \n \nEven if you are \ud835\uddf2\ud835\ude05\ud835\uddfd\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2\ud835\uddf1, I think it is \ud835\uddf2\ud835\ude05\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddf9\ud835\ude06 \ud835\ude02\ud835\ude00\ud835\uddf2\ud835\uddf3\ud835\ude02\ud835\uddf9 to \ud835\ude00\ud835\uddf8\ud835\uddf6\ud835\uddfa \ud835\uddf6\ud835\ude01 to\nrefresh the fundamentals, learn new details, and see how everything is\nimplemented in LangChain.\n\nGenerative AI with LangChain [By Ben Auffarth]\n\n\ud835\udddc\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\ude00 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\ude06\ud835\uddfc\ud835\ude02? 
\ud83e\udef5 \n \n\ud83d\udd17 \ud835\uddd6\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 \ud835\uddf6\ud835\ude01 \ud835\uddfc\ud835\ude02\ud835\ude01: Generative AI with LangChain [By Ben Auffarth]\n\n* * *\n\n### The difference between development and continuous training ML environments\n\nThey might do the same thing, but their design is entirely different \u2193 \n \n\ud835\udde0\ud835\udddf \ud835\uddd7\ud835\uddf2\ud835\ude03\ud835\uddf2\ud835\uddf9\ud835\uddfc\ud835\uddfd\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \nAt this point, your main goal is to ingest the raw and preprocessed data\nthrough versioned artifacts (or a feature store), analyze it & generate as\nmany experiments as possible to find the best: \n\\- model \n\\- hyperparameters \n\\- augmentations \n \nBased on your business requirements, you must maximize some specific metrics,\nfind the best latency-accuracy trade-offs, etc. \n \nYou will use an experiment tracker to compare all these experiments. \n \nAfter you settle on the best one, the output of your ML development\nenvironment will be: \n\\- a new version of the code \n\\- a new version of the configuration artifact \n \nHere is where the research happens. Thus, you need flexibility. \n \nThat is why we decouple it from the rest of the ML systems through artifacts\n(data, config, & code artifacts).\n\nThe difference between ML development & continuous training environments\n\n\ud835\uddd6\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\ude02\ud835\uddfc\ud835\ude02\ud835\ude00 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \nHere is where you want to take the data, code, and config artifacts and: \n \n\\- train the model on all the required data \n\\- output a staging versioned model artifact \n\\- test the staging model artifact \n\\- if the test passes, label it as the new production model artifact \n\\- deploy it to the inference services \n \nA common strategy is to build a CI/CD pipeline that (e.g., using GitHub\nActions): \n \n\\- builds a docker image from the code artifact (e.g., triggered manually or\nwhen a new artifact version is created) \n\\- start the training pipeline inside the docker container that pulls the\nfeature and config artifacts and outputs the staging model artifact \n\\- manually look over the training report -> If everything went fine, manually\ntrigger the testing pipeline \n\\- manually look over the testing report -> if everything worked fine (e.g.,\nthe model is better than the previous one), manually trigger the CD pipeline\nthat deploys the new model to your inference services \n \nNote how the model registry quickly helps you to decouple all the components. \n \nAlso, because training and testing metrics are not always black and white, it\nis challenging to automate the CI/CD pipeline 100%. \n \nThus, you need a human in the loop when deploying ML models. \n \nTo conclude... \n \nThe ML development environment is where you do your research to find better\nmodels. 
\n \nThe continuous training environment is used to train & test the production\nmodel at scale.\n\n* * *\n\n### How to write a streaming retrieval system for RAG on social media data\n\n\ud835\uddd5\ud835\uddee\ud835\ude01\ud835\uddf0\ud835\uddf5 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 are the \ud835\uddfd\ud835\uddee\ud835\ude00\ud835\ude01. Here is how to \ud835\ude04\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\uddf2 a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\nfor \ud835\udde5\ud835\uddd4\ud835\uddda on \ud835\ude00\ud835\uddfc\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddfa\ud835\uddf2\ud835\uddf1\ud835\uddf6\ud835\uddee \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \u2193 \n \n\ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff \ud835\uddef\ud835\uddee\ud835\ude01\ud835\uddf0\ud835\uddf5? \n \nIn environments where data evolves quickly (e.g., social media platforms), the\nsystem's response time is critical for your application's user experience. \n \nThat is why TikTok is so addicting. Its recommender system adapts in real-time\nbased on your interaction with the app. \n \nHow would it be if the recommendations were updated daily or hourly? \n \nWell, it would work, but you would probably get bored of the app much faster. \n \nThe same applies to RAG for highly intensive data sources... \n \n\u2192 where you must sync your source and vector DB in real time for up-to-date\nretrievals. \n \n\ud835\ude13\ud835\ude26\ud835\ude35'\ud835\ude34 \ud835\ude34\ud835\ude26\ud835\ude26 \ud835\ude29\ud835\ude30\ud835\ude38 \ud835\ude2a\ud835\ude35 \ud835\ude38\ud835\ude30\ud835\ude33\ud835\ude2c\ud835\ude34. \n \n\u2193\u2193\u2193 \n \nI wrote an \ud835\uddee\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddf0\ud835\uddf9\ud835\uddf2 on how to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 a \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9-\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2 \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa for \ud835\udde5\ud835\uddd4\ud835\uddda on\n\ud835\udddf\ud835\uddf6\ud835\uddfb\ud835\uddf8\ud835\uddf2\ud835\uddf1\ud835\udddc\ud835\uddfb \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee in collaboration with Superlinked . 
\n \nThe \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa is based on \ud835\udfee \ud835\uddf1\ud835\uddf2\ud835\ude01\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddf1 \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\ude00: \n\\- the streaming ingestion pipeline \n\\- the retrieval client \n \nThe \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf6\ud835\uddfb\ud835\uddf4\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 runs 24/7 to keep the vector DB synced with\nthe current raw LinkedIn posts data source. \n \nThe \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\uddf0\ud835\uddf9\ud835\uddf6\ud835\uddf2\ud835\uddfb\ud835\ude01 is used in RAG applications to query the vector DB. \n \n\u2192 These 2 components are completely decoupled and communicate with each other\nthrough the vector DB. \n \n#\ud835\udfed. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf6\ud835\uddfb\ud835\uddf4\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \n\u2192 Implemented in Bytewax \\- a streaming engine built in Rust (speed&\nreliability) that exposes a Python interface \n \n\ud835\ude14\ud835\ude22\ud835\ude2a\ud835\ude2f \ud835\ude27\ud835\ude2d\ud835\ude30\ud835\ude38: \n \n\\- uses CDC to add changes from the source DB to a queue \n\\- listens to the queue for new events \n\\- cleans, chunks, and embeds the LI posts \n\\- loads them to a Qdrant vector DB \n \nand... everything in real-time!\n\nAdvanced RAG architecture [source from Superlinked Vectorhub]\n\n#\ud835\udfee. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\uddf0\ud835\uddf9\ud835\uddf6\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \n\u2192 A standard Python module. \n \nThe goal is to retrieve similar posts using various query types, such as\nposts, questions, and sentences. \n \n\ud835\ude14\ud835\ude22\ud835\ude2a\ud835\ude2f \ud835\ude27\ud835\ude2d\ud835\ude30\ud835\ude38: \n \n\\- preprocess user queries (the same way as they were ingested) \n\\- search the Qdrant vector DB for the most similar results \n\\- use rerank to improve the retrieval system's accuracy \n\\- visualize the results on a 2D plot using UMAP \n \n. \n \nYou don't believe me? 
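As a rough illustration of the retrieval-client flow above, a sketch could look like the following, assuming sentence-transformers and qdrant-client; the collection name, payload key, and model choices are placeholders, the UMAP visualization step is omitted, and this is not the article's actual implementation.

```python
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder, SentenceTransformer

# Placeholder models and connection details (not the article's exact setup).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = QdrantClient(url="http://localhost:6333")

def retrieve(query: str, top_k: int = 10, rerank_k: int = 3) -> list[str]:
    # 1. Preprocess/embed the query the same way the posts were ingested.
    query_vector = embedder.encode(query).tolist()
    # 2. Search the Qdrant vector DB for the most similar posts.
    hits = client.search(
        collection_name="linkedin_posts", query_vector=query_vector, limit=top_k
    )
    candidates = [hit.payload["text"] for hit in hits]  # "text" is a placeholder payload key
    # 3. Rerank the candidates with a cross-encoder to improve retrieval accuracy.
    scores = reranker.predict([(query, post) for post in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [post for _, post in ranked[:rerank_k]]
```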
\ud83e\udef5 \n \n\ud835\uddd6\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 \ud835\uddfc\ud835\ude02\ud835\ude01 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\ude02\ud835\uddf9\ud835\uddf9 \ud835\uddee\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddf0\ud835\uddf9\ud835\uddf2 & \ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2 \ud835\uddfc\ud835\uddfb \ud835\uddd7\ud835\uddf2\ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde0\ud835\udddf \u2193 \n \n\ud83d\udd17 \ud835\ude08 \ud835\ude19\ud835\ude26\ud835\ude22\ud835\ude2d-\ud835\ude35\ud835\ude2a\ud835\ude2e\ud835\ude26 \ud835\ude19\ud835\ude26\ud835\ude35\ud835\ude33\ud835\ude2a\ud835\ude26\ud835\ude37\ud835\ude22\ud835\ude2d \ud835\ude1a\ud835\ude3a\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude2e \ud835\ude27\ud835\ude30\ud835\ude33 \ud835\ude19\ud835\ude08\ud835\ude0e \ud835\ude30\ud835\ude2f \ud835\ude1a\ud835\ude30\ud835\ude24\ud835\ude2a\ud835\ude22\ud835\ude2d \ud835\ude14\ud835\ude26\ud835\ude25\ud835\ude2a\ud835\ude22 \ud835\ude0b\ud835\ude22\ud835\ude35\ud835\ude22\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n7\n\nShare this post\n\n#### The difference between development and continuous training ML\nenvironments\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-difference-between-development?r=1ttoeh" }, { "id": "20beb560-6063-4158-b7b5-c2083b299ec5", "content": { "Title": "Architect LLM & RAG inference pipelines - by Paul Iusztin", "Subtitle": "Design, build, deploy and monitor LLM and RAG inference pipelines using LLMOps best practices. Integrate it with a model registry and vector DB.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Architect scalable and cost-effective LLM & RAG inference pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Architect scalable and cost-effective LLM & RAG inference pipelines\n\n### Design, build and deploy RAG inference pipeline using LLMOps best\npractices.\n\nPaul Iusztin\n\nJun 06, 2024\n\n13\n\nShare this post\n\n#### Architect scalable and cost-effective LLM & RAG inference pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n\u2192 the **9th** out of **11 lessons** of the **LLM Twin free course**\n\n### **Why is this course different?**\n\n_By finishing the \u201c**LLM Twin: Building Your Production-Ready AI\nReplica\u201d**_****_free course, you will learn how to design, train, and deploy a\nproduction-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps\ngood practices_.\n\n_**Why should you care? 
\ud83e\udef5**_\n\n _**\u2192 No more isolated scripts or Notebooks!** Learn production ML by building\nand deploying an end-to-end production-grade LLM system._\n\n> _More**details** on what you will **learn** within the **LLM Twin**\n> **course** , **here** \ud83d\udc48_\n\n### Latest Lessons of the LLM Twin Course\n\n**Lesson 6:** The Role of Feature Stores in Fine-Tuning LLMs\n\n\u2192 Custom Dataset Generation, Artifact Versioning, GPT3.5-Turbo Distillation,\nQdrant\n\n**Lesson 7:** How to fine-tune LLMs on custom datasets at Scale using Qwak and\nCometML\n\n\u2192QLoRA, PEFT, Fine-tuning Mistral-7b-Instruct on custom dataset, Qwak, Comet\nML\n\n**Lesson 8:** Best practices when evaluating fine-tuned LLM models\n\n\u2192 LLM Evaluation techniques: Does and don\u2019ts, Quantitive and manual LLM\nevaluation techniques\n\n* * *\n\n## **Lesson 9: Architect scalable and cost-effective LLM & RAG inference\npipelines**\n\nIn **Lesson 9,** we will focus on implementing and deploying the inference\npipeline of the LLM twin system.\n\n**First** , we will design and implement a scalable LLM & RAG inference\npipeline based on microservices, separating the ML and business logic into two\nlayers.\n\n**Secondly** , we will use Comet ML to integrate a prompt monitoring service\nto capture all input prompts and LLM answers for further debugging and\nanalysis.\n\n**Ultimately** , we will deploy the inference pipeline to Qwak and make the\nLLM twin service available worldwide.\n\n#### **\u2192 Context from previous lessons. What you must know.**\n\nThis lesson is part of a more extensive series in which we learn to build an\nend-to-end LLM system using LLMOps best practices.\n\n_If you haven\u2019t read the whole series, for this one to make sense, you have to\nknow that we have a:_\n\n * Qdrant vector DB populated with digital data (posts, articles, and code snippets)\n\n * vector DB retrieval module to do advanced RAG\n\n * fine-tuned open-source LLM available in a model registry from Comet ML\n\n> _\u2192 In this lesson, we will focus on gluing everything together into a\n> scalable inference pipeline and deploying it to the cloud._\n\n* * *\n\n### **Table of Contents**\n\n 1. The architecture of the inference pipeline\n\n 2. The training vs. the inference pipeline\n\n 3. The RAG business module\n\n 4. The LLM microservice\n\n 5. Prompt monitoring\n\n 6. Deploying and running the inference pipeline\n\n 7. Conclusion\n\n* * *\n\n## 1\\. The architecture of the inference pipeline\n\nOur inference pipeline contains the following core elements:\n\n * a fine-tuned LLM\n\n * a RAG module\n\n * a monitoring service\n\nLet\u2019s see how to hook these into a scalable and modular system.\n\n### **The interface of the inference pipeline**\n\nAs we follow the feature/training/inference (FTI) pipeline architecture, the\ncommunication between the 3 core components is clear.\n\nOur LLM inference pipeline needs 2 things:\n\n * a fine-tuned LLM: pulled from the model registry\n\n * features for RAG: pulled from a vector DB (which we modeled as a logical feature store)\n\nThis perfectly aligns with the FTI architecture.\n\n> _\u2192 If you are unfamiliar with the FTI pipeline architecture, we recommend\n> you reviewLesson 1\u2019s section on the 3-pipeline architecture._\n\n### **Monolithic vs. 
microservice inference pipelines**\n\nUsually, the inference steps can be split into 2 big layers:\n\n * t**he LLM service:** where the actual inference is being done\n\n * **the business service:** domain-specific logic\n\nWe can design our inference pipeline in 2 ways.\n\n#### **Option 1: Monolithic LLM & business service**\n\nIn a monolithic scenario, we implement everything into a single service.\n\n_Pros:_\n\n * easy to implement\n\n * easy to maintain\n\n _Cons:_\n\n * harder to scale horizontally based on the specific requirements of each component\n\n * harder to split the work between multiple teams\n\n * not being able to use different tech stacks for the two services\n\nMonolithic vs. microservice inference pipelines\n\n#### **Option 2: Different LLM & business microservices**\n\nThe LLM and business services are implemented as two different components that\ncommunicate with each other through the network, using protocols such as REST\nor gRPC.\n\n_Pros:_\n\n * each component can scale horizontally individually\n\n * each component can use the best tech stack at hand\n\n _Cons:_\n\n * harder to deploy\n\n * harder to maintain\n\nLet\u2019s focus on the \u201ceach component can scale individually\u201d part, as this is\nthe most significant benefit of the pattern. Usually, LLM and business\nservices require different types of computing. For example, an LLM service\ndepends heavily on GPUs, while the business layer can do the job only with a\nCPU.\n\n### **Microservice architecture of the LLM twin inference pipeline**\n\nLet\u2019s understand how we applied the microservice pattern to our concrete LLM\ntwin inference pipeline.\n\nAs explained in the sections above, we have the following components:\n\n 1. A business microservice\n\n 2. An LLM microservice\n\n 3. A prompt monitoring microservice\n\n**The business microservice** is implemented as a Python module that:\n\n * contains the advanced RAG logic, which calls the vector DB and GPT-4 API for advanced RAG operations;\n\n * calls the LLM microservice through a REST API using the prompt computed utilizing the user\u2019s query and retrieved context\n\n * sends the prompt and the answer generated by the LLM to the prompt monitoring microservice.\n\nAs you can see, the business microservice is light. It glues all the domain\nsteps together and delegates the computation to other services.\n\nThe end goal of the business layer is to act as an interface for the end\nclient. In our case, as we will ship the business layer as a Python module,\nthe client will be a Streamlit application.\n\nHowever, you can quickly wrap the Python module with FastAPI and expose it as\na REST API to make it accessible from the cloud.\n\nMicroservice architecture of the LLM twin inference pipeline\n\n**The LLM microservice** is deployed on Qwak. This component is wholly niched\non hosting and calling the LLM. 
It runs on powerful GPU-enabled machines.\n\nHow does the LLM microservice work?\n\n * It loads the fine-tuned LLM twin model from Comet\u2019s model registry [2].\n\n * It exposes a REST API that takes in prompts and outputs the generated answer.\n\n * When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer.\n\nThat\u2019s it!\n\n**The prompt monitoring microservice** is based on Comet ML\u2019s LLM dashboard.\nHere, we log all the prompts and generated answers into a centralized\ndashboard that allows us to evaluate, debug, and analyze the accuracy of the\nLLM.\n\n## **2\\. The training vs. the inference pipeline**\n\nAlong with the obvious reason that the training pipeline takes care of\ntraining while the inference pipeline takes care of inference (Duh!), there\nare some critical differences you have to understand.\n\n### **The input of the pipeline & How the data is accessed**\n\nDo you remember our logical feature store based on the Qdrant vector DB and\nComet ML artifacts? If not, consider checking out Lesson 6 for a refresher.\n\nThe core idea is that **during training** , the data is accessed from an\noffline data storage in batch mode, optimized for throughput and data lineage.\n\nOur LLM twin architecture uses Comet ML artifacts to access, version, and\ntrack all our data.\n\nThe data is accessed in batches and fed to the training loop.\n\n**During inference** , you need an online database optimized for low latency.\nAs we directly query the Qdrant vector DB for RAG, that fits like a glove.\n\nDuring inference, you don\u2019t care about data versioning and lineage. You just\nwant to access your features quickly for a good user experience.\n\nThe data comes directly from the user and is sent to the inference logic.\n\nThe training vs. the inference pipeline\n\n### **The output of the pipeline**\n\nThe **training pipeline\u2019s** final output is the trained weights stored in\nComet\u2019s model registry.\n\nThe **inference pipeline\u2019s** final output is the predictions served directly\nto the user.\n\n### **The infrastructure**\n\nThe training pipeline requires more powerful machines with as many GPUs as\npossible.\n\n_Why?_ During training, you batch your data and have to hold in memory all the\ngradients required for the optimization steps. Because of the optimization\nalgorithm, the training is more compute-hungry than the inference.\n\nThus, more computing and VRAM result in bigger batches, which means less\ntraining time and more experiments.\n\nIf you run a batch pipeline, you will still pass batches to the model but\ndon\u2019t perform any optimization steps.\n\nIf you run a real-time pipeline, as we do in the LLM twin architecture, you\npass a single sample to the model or do some dynamic batching to optimize your\ninference step.\n\n### **Are there any overlaps?**\n\nYes! This is where the training-serving skew comes in.\n\nTo avoid the training-serving skew, you must carefully apply the same\npreprocessing and postprocessing steps during training and inference.\n\n## **3\\. The RAG business module**\n\nWe will define the RAG business module under the _LLMTwin_ class. 
The LLM twin\nlogic is directly correlated with our business logic.\n\nWe don\u2019t have to introduce the word \u201cbusiness\u201d in the naming convention of the\nclasses.\n\nLet\u2019s dig into the _generate()_ method of the _LLMTwin_ class, where we:\n\n * call the RAG module;\n\n * create the prompt using the prompt template, query and context;\n\n * call the LLM microservice;\n\n * log the prompt, prompt template, and answer to Comet ML\u2019s prompt monitoring service.\n\nInference pipeline business module: generate() method \u2192 GitHub \u2190\n\nLet\u2019s look at how our LLM microservice is implemented using Qwak.\n\n## **4\\. The LLM microservice**\n\nAs the LLM microservice is deployed on Qwak, we must first inherit from the\n_QwakModel_ class and implement some specific functions.\n\n * _initialize_model()_ : where we load the fine-tuned model from the model registry at serving time\n\n * _schema():_ where we define the input and output schema\n\n * _predict()_ : where we implement the actual inference logic\n\n**Note:** The _build()_ function contains all the training logic, such as\nloading the dataset, training the LLM, and pushing it to a Comet experiment.\nTo see the full implementation, consider checking out Lesson 7, where we\ndetailed the training pipeline.\n\nLLM microservice \u2192 GitHub \u2190\n\nLet\u2019s zoom into the implementation and the life cycle of the Qwak model.\n\nThe _schema()_ method is used to define how the input and output of the\n_predict()_ method look like. This will automatically validate the structure\nand type of the _predict()_ method. For example, the LLM microservice will\nthrow an error if the variable instruction is a JSON instead of a string.\n\nThe other Qwak-specific methods are called in the following order:\n\n 1. ___init__()_ \u2192 when deploying the model\n\n 2. _initialize_model()_ \u2192 when deploying the model\n\n 3. _predict()_ \u2192 on every request to the LLM microservice\n\n**> >>** Note that these methods are called only during serving time (and not\nduring training).\n\nQwak exposes your model as a RESTful API, where the _predict()_ method is\ncalled on each request.\n\nInside the prediction method, we perform the following steps:\n\n * map the input text to token IDs using the LLM-specific tokenizer\n\n * move the token IDs to the provided device (GPU or CPU)\n\n * pass the token IDs to the LLM and generate the answer\n\n * extract only the generated tokens from the _generated_ids_ variable by slicing it using the shape of the _input_ids_\n\n * decode the _generated_ids_ back to text\n\n * return the generated text\n\nThe final step is to look at Comet\u2019s prompt monitoring service. \u2193\n\n## **5\\. Prompt monitoring**\n\nComet makes prompt monitoring straightforward. 
There is just one API call
where you connect to your project and workspace and send the following to a
single function:

 * the prompt and LLM output

 * the prompt template and variables that created the final output

 * your custom metadata specific to your use case — here, you add information about the model, prompt token count, token generation costs, latency, etc.

    class PromptMonitoringManager:
        @classmethod
        def log(
            cls,
            prompt: str,
            output: str,
            prompt_template: str | None = None,
            prompt_template_variables: dict | None = None,
            metadata: dict | None = None,
        ) -> None:
            # Always tag the model type and merge in any user-provided metadata.
            metadata = {
                "model": settings.MODEL_TYPE,
                **(metadata or {}),
            }

            comet_llm.log_prompt(
                workspace=settings.COMET_WORKSPACE,
                project=f"{settings.COMET_PROJECT}-monitoring",
                api_key=settings.COMET_API_KEY,
                prompt=prompt,
                prompt_template=prompt_template,
                prompt_template_variables=prompt_template_variables,
                output=output,
                metadata=metadata,
            )

This is how Comet ML's prompt monitoring dashboard looks. Here, you can scroll
through all the prompts that were ever sent to the LLM. ↓

You can click on any prompt and see everything we logged programmatically
using the _PromptMonitoringManager_ class.

Screenshot from Comet ML's dashboard

Besides what we logged, adding various tags and the inference duration can be
valuable.

## **6\. Deploying and running the inference pipeline**

We can deploy the LLM microservice using the following Qwak command:

    qwak models deploy realtime \
        --model-id "llm_twin" \
        --instance "gpu.a10.2xl" \
        --timeout 50000 \
        --replicas 2 \
        --server-workers 2

We deployed two replicas of the LLM twin. Each replica has access to a machine
with one A10 GPU. Also, each replica has two workers running on it.

🔗 More on Qwak instance types ←

Two replicas and two workers result in 4 microservices that run in parallel
and can serve our users.

You can scale the deployment to more replicas if you need to serve more
clients. Qwak provides autoscaling mechanisms triggered by listening to the
consumption of GPU, CPU or RAM.

To conclude, you build the Qwak model once, and based on it, you can make
multiple deployments with various strategies.

* * *

## **Conclusion**

 _Congratulations! You are close to the end of the LLM twin series._

In **Lesson 9** of the LLM twin course, you learned to **build** a scalable
inference pipeline for serving LLMs and RAG systems.

**First**, you learned how to architect an inference pipeline by
understanding the difference between monolithic and microservice
architectures. We also highlighted the differences in designing the training
and inference pipelines.

**Secondly**, we walked you through implementing the RAG business module and
the LLM twin microservice. We also showed you how to log all the prompts,
answers, and metadata to Comet's prompt monitoring service.

**Ultimately**, we showed you how to deploy and run the LLM twin inference
pipeline on the Qwak AI platform.

In **Lesson 10**, we will show you how to evaluate the whole system by
building an advanced RAG evaluation pipeline that analyzes the accuracy of the
LLMs' answers relative to the query and context.

See you there!
\ud83e\udd17\n\n> _\ud83d\udd17**Check out** the code on GitHub [1] and support us with a \u2b50\ufe0f_\n\n* * *\n\n### Next Steps\n\n#### Step 1\n\nThis is just the **short version** of **Lesson 9** on **architecting scalable\nand cost-effective LLM & RAG inference pipelines.**\n\n\u2192 For\u2026\n\n * The full implementation.\n\n * Full deep dive into the code.\n\n * More on the RAG, LLM and monitoring services.\n\n**Check out** the **full version** of **Lesson 9** on our **Medium\npublication**. It\u2019s still FREE:\n\nLesson 9 on Medium\n\n#### Step 2\n\n\u2192 **Consider checking out theLLM Twin GitHub repository and try it yourself\n\ud83e\udef5**\n\n _Nothing compares with getting your hands dirty and doing it yourself!_\n\nLLM Twin Course - GitHub\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n13\n\nShare this post\n\n#### Architect scalable and cost-effective LLM & RAG inference pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/architect-scalable-and-cost-effective?r=1ttoeh" }, { "id": "95d64d1d-83f2-47e9-8eda-9a687b98e6eb", "content": { "Title": "7 tips to reduce your VRAM when training LLMs ", "Subtitle": "3 techniques you must know to evaluate your LLMs. Introduction to deploying private LLMs with AWS SageMaker.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### 7 tips to reduce your VRAM when training LLMs\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# 7 tips to reduce your VRAM when training LLMs\n\n### 3 techniques you must know to evaluate your LLMs. Introduction to\ndeploying private LLMs with AWS SageMaker.\n\nPaul Iusztin\n\nMay 18, 2024\n\n4\n\nShare this post\n\n#### 7 tips to reduce your VRAM when training LLMs\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * 3 techniques you must know to evaluate your LLMs\n\n * 7 tips you must know to reduce your VRAM consumption of your LLMs during training\n\n * Introduction to deploying private LLMs with AWS SageMaker\n\n* * *\n\nOn the 3rd of May, I \ud835\uddf5\ud835\uddfc\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddf1 a \ud835\uddf3\ud835\uddff\ud835\uddf2\ud835\uddf2 \ud835\ude00\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb on Maven for \ud835\udff5\ud835\udff0 \ud835\uddfd\ud835\uddf2\ud835\uddfc\ud835\uddfd\ud835\uddf9\ud835\uddf2 on how to\n\ud835\uddd4\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb. 
If you missed it, here is \ud835\uddf5\ud835\uddfc\ud835\ude04 you can \ud835\uddee\ud835\uddf0\ud835\uddf0\ud835\uddf2\ud835\ude00\ud835\ude00 \ud835\uddf6\ud835\ude01 for\n\ud835\uddf3\ud835\uddff\ud835\uddf2\ud835\uddf2 \u2193 \n \n. \n \n\ud835\ude12\ud835\ude26\ud835\ude3a \ud835\ude35\ud835\ude22\ud835\ude2c\ud835\ude26\ud835\ude22\ud835\ude38\ud835\ude22\ud835\ude3a\ud835\ude34 \ud835\ude38\ud835\ude26\ud835\ude33\ud835\ude26: \n \n\u2192 Why I started building my LLM Twin \n \n\u2192 The 3 pipeline design / The FTI pipeline architecture \n \n\u2192 System design of the LLM Twin Architecture \n \n\u2192 Break down the RAG system of the LLM Twin Architecture \n \n\u2192 Live Demo \n \n. \n \nIf you want the recording, you can watch it for free here:\nhttps://bit.ly/3PZGV0S \n \n\ud835\ude08\ud835\ude2d\ud835\ude34\ud835\ude30, \ud835\ude29\ud835\ude26\ud835\ude33\ud835\ude26 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude30\ud835\ude35\ud835\ude29\ud835\ude26\ud835\ude33 \ud835\ude36\ud835\ude34\ud835\ude26\ud835\ude27\ud835\ude36\ud835\ude2d \ud835\ude2d\ud835\ude2a\ud835\ude2f\ud835\ude2c\ud835\ude34: \n \n\\- \ud835\ude34\ud835\ude2d\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude34: \ud83d\udd17 https://lnkd.in/d_MdqGwS \n \n\\- \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude1b\ud835\ude38\ud835\ude2a\ud835\ude2f \ud835\ude24\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26 \ud835\ude0e\ud835\ude2a\ud835\ude35\ud835\ude0f\ud835\ude36\ud835\ude23: \ud83d\udd17 https://lnkd.in/dzat6PB6 \n \n\\- \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude1b\ud835\ude38\ud835\ude2a\ud835\ude2f \ud835\ude0d\ud835\ude19\ud835\ude0c\ud835\ude0c \ud835\ude2d\ud835\ude26\ud835\ude34\ud835\ude34\ud835\ude30\ud835\ude2f\ud835\ude34: \ud83d\udd17 https://lnkd.in/dX__4mhX\n\n* * *\n\n### 3 techniques you must know to evaluate your LLMs\n\nHere are 3 techniques you must know to evaluate your LLMs quickly. \n \nManually testing the output of your LLMs is a tedious and painful process \u2192\nyou need to automate it. \n \nIn generative AI, most of the time, you cannot leverage standard metrics. \n \nThus, the real question is, how do you evaluate the outputs of an LLM? \n \n#\ud835\udfed. \ud835\udde6\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff\ud835\ude00 - \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf8\ud835\uddfb\ud835\uddfc\ud835\ude04 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude04\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\ude01\ud835\uddfc \ud835\uddf4\ud835\uddf2\ud835\ude01 \n \nEven if you use an LLM to generate text, you can ask it to generate a response\nin a structured format (e.g., JSON) that can be parsed. \n \nYou know exactly what you want (e.g., a list of products extracted from the\nuser's question). \n \nThus, you can easily compare the generated and ideal answers using classic\napproaches. \n \nFor example, when extracting the list of products from the user's input, you\ncan do the following: \n\\- check if the LLM outputs a valid JSON structure \n\\- use a classic method to compare the generated and real answers \n \n#\ud835\udfee. 
\ud835\udde1\ud835\uddfc \"\ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01\" \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff (\ud835\uddf2.\ud835\uddf4., \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude00, \ud835\uddf2\ud835\ude01\ud835\uddf0.) \n \nWhen generating sentences, the LLM can use different styles, words, etc. Thus,\ntraditional metrics (e.g., BLUE score) are too rigid to be useful. \n \nYou can leverage another LLM to test the output of our initial LLM. The trick\nis in what questions to ask. \n \nHere, we have another 2 sub scenarios: \n \n\u21b3 \ud835\udfee.\ud835\udfed \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb'\ud835\ude01 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc (\ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb'\ud835\ude01\n\ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5) \n \nYou don't have access to an expert to write an ideal answer for a given\nquestion to compare it to. \n \nBased on the initial prompt and generated answer, you can compile a set of\nquestions and pass them to an LLM. Usually, these are Y/N questions that you\ncan easily quantify and check the validity of the generated answer. \n \nThis is known as \"Rubric Evaluation\" \n \nFor example: \n\"\"\" \n\\- Is there any disagreement between the response and the context? (Y or N) \n\\- Count how many questions the user asked. (output a number) \n... \n\"\"\" \n \nThis strategy is intuitive, as you can ask the LLM any question you are\ninterested in as long it can output a quantifiable answer (Y/N or a number). \n \n\u21b3 \ud835\udfee.\ud835\udfee. \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddfc (\ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2\n\ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5) \n \nWhen you have access to an answer manually created by a group of experts,\nthings are easier. 
\n \nYou will use an LLM to compare the generated and ideal answers based on\nsemantics, not structure. \n \nFor example: \n\"\"\" \n(A) The submitted answer is a subset of the expert answer and entirely\nconsistent. \n... \n(E) The answers differ, but these differences don't matter. \n\"\"\"\n\n* * *\n\n### 7 tips you must know to reduce your VRAM consumption of your LLMs during\ntraining\n\nHere are \ud835\udff3 \ud835\ude01\ud835\uddf6\ud835\uddfd\ud835\ude00 you must know to \ud835\uddff\ud835\uddf2\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\uddf2 your \ud835\udde9\ud835\udde5\ud835\uddd4\ud835\udde0 \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb of your \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00\nduring \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 so you can \ud835\uddf3\ud835\uddf6\ud835\ude01 it on \ud835\ude05\ud835\udfed \ud835\uddda\ud835\udde3\ud835\udde8. \n \n\ud835\udfed\\. \ud835\udde0\ud835\uddf6\ud835\ude05\ud835\uddf2\ud835\uddf1-\ud835\uddfd\ud835\uddff\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb: During training you use both FP32 and FP16 in the\nfollowing way: \"FP32 weights\" -> \"FP16 weights\" -> \"FP16 gradients\" -> \"FP32\ngradients\" -> \"Update weights\" -> \"FP32 weights\" (and repeat). As you can see,\nthe forward & backward passes are done in FP16, and only the optimization step\nis done in FP32, which reduces both the VRAM and runtime. \n \n\ud835\udfee\\. \ud835\udddf\ud835\uddfc\ud835\ude04\ud835\uddf2\ud835\uddff-\ud835\uddfd\ud835\uddff\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb: All your computations are done in FP16 instead of FP32.\nBut the key is using bfloat16 (\"Brain Floating Point\"), a numerical\nrepresentation Google developed for deep learning. It allows you to represent\nvery large and small numbers, avoiding overflowing or underflowing scenarios. \n \n\ud835\udfef\\. \ud835\udde5\ud835\uddf2\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddef\ud835\uddee\ud835\ude01\ud835\uddf0\ud835\uddf5 \ud835\ude00\ud835\uddf6\ud835\ude07\ud835\uddf2: This one is straightforward. Fewer samples per\ntraining iteration result in smaller VRAM requirements. The downside of this\nmethod is that you can't go too low with your batch size without impacting\nyour model's performance. \n \n\ud835\udff0\\. \ud835\uddda\ud835\uddff\ud835\uddee\ud835\uddf1\ud835\uddf6\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddee\ud835\uddf0\ud835\uddf0\ud835\ude02\ud835\uddfa\ud835\ude02\ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb: It is a simple & powerful trick to increase your\nbatch size virtually. You compute the gradients for \"micro\" batches (forward +\nbackward passes). Once the accumulated gradients reach the given \"virtual\"\ntarget, the model weights are updated with the accumulated gradients. For\nexample, you have a batch size of 4 and a micro-batch size of 1. Then, the\nforward & backward passes will be done using only x1 sample, and the\noptimization step will be done using the aggregated gradient of the 4 samples. \n \n\ud835\udff1\\. 
\ud835\udde8\ud835\ude00\ud835\uddf2 \ud835\uddee \ud835\ude00\ud835\ude01\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00 \ud835\uddfc\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf6\ud835\ude07\ud835\uddf2\ud835\uddff: Adam is the most popular optimizer. It is one\nof the most stable optimizers, but the downside is that it has 2 additional\nparameters (a mean & variance) for every model parameter. If you use a\nstateless optimizer, such as SGD, you can reduce the number of parameters by\n2/3, which is significant for LLMs. \n \n\ud835\udff2\\. \ud835\uddda\ud835\uddff\ud835\uddee\ud835\uddf1\ud835\uddf6\ud835\uddf2\ud835\uddfb\ud835\ude01 (\ud835\uddfc\ud835\uddff \ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\ude03\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb) \ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8\ud835\uddfd\ud835\uddfc\ud835\uddf6\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4: It drops specific activations\nduring the forward pass and recomputes them during the backward pass. Thus, it\neliminates the need to hold all activations simultaneously in VRAM. This\ntechnique reduces VRAM consumption but makes the training slower. \n \n\ud835\udff3\\. \ud835\uddd6\ud835\udde3\ud835\udde8 \ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\ude01\ud835\uddf2\ud835\uddff \ud835\uddfc\ud835\uddf3\ud835\uddf3\ud835\uddf9\ud835\uddfc\ud835\uddee\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4: The parameters that do not fit on your GPU's\nVRAM are loaded on the CPU. Intuitively, you can see it as a model parallelism\nbetween your GPU & CPU.\n\nImage by DALL-E\n\nMost of these methods are orthogonal, so you can combine them and drastically\nreduce your VRAM requirements during training.\n\n* * *\n\n### Introduction to deploying private LLMs with AWS SageMaker\n\nEver wondered \ud835\uddf5\ud835\uddfc\ud835\ude04 to \ud835\uddf1\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 in <\ud835\udfef\ud835\udfec \ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\ude02\ud835\ude01\ud835\uddf2\ud835\ude00 \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00, such as \ud835\udddf\ud835\uddf9\ud835\uddee\ud835\uddfa\ud835\uddee\ud835\udfee,\non \ud835\uddd4\ud835\uddea\ud835\udde6 \ud835\udde6\ud835\uddee\ud835\uddf4\ud835\uddf2\ud835\udde0\ud835\uddee\ud835\uddf8\ud835\uddf2\ud835\uddff? Then wonder no more \u2193\n\n#### Step 1: Deploy the LLM to AWS SageMaker\n\nThe sweet thing about SageMaker is that it accelerates the development\nprocess, enabling a more efficient and rapid transition to the production\nstage. 
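
For a rough sense of what that deployment step involves, here is a minimal sketch using the SageMaker Python SDK's Hugging Face integration. The model ID, container versions, instance type, and IAM role are placeholders and may need adjusting; the article referenced below covers the production-grade version:

    from sagemaker.huggingface import HuggingFaceModel

    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder IAM role

    # Pull a chat model straight from the Hugging Face Hub (model ID is illustrative).
    huggingface_model = HuggingFaceModel(
        env={
            "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",
            "HF_TASK": "text-generation",
        },
        role=role,
        transformers_version="4.37",
        pytorch_version="2.1",
        py_version="py310",
    )

    # Deploy it behind a real-time HTTPS endpoint.
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
    )

    print(predictor.predict({"inputs": "Summarize this document: ..."}))
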
\n \n\nVesa Alexandru\n\nsmashed with his first article on DML about showing step-by-step how to deploy\nan LLM from HuggingFace to AWS SageMaker using good practices, such as: \n \n\\- designing a config class for the deployment of the LLM \n\\- set up AWS and deploy the LLM to SageMaker \n\\- implement an inference class to call the deployed LLM in real-time through\na web endpoint \n\\- define a prompt template function to ensure reproducibility & consistency \n \n...and, ultimately, how to play yourself with your freshly deployed LLM.\n\n_Here is the full article explaining how to deploy the LLM to AWS SageMaker_ \u2193\n\n#### DML: Introduction to Deploying Private LLMs with AWS SageMaker: Focus on\nLlama2-7b-chat\n\nVesa Alexandru\n\n\u00b7\n\nJan 18\n\nRead full story\n\n#### Step 2: Call the SageMaker inference endpoint\n\nYou've just deployed your Mistral LLM to SageMaker. \n \n\ud835\ude15\ud835\ude30\ud835\ude38 \ud835\ude38\ud835\ude29\ud835\ude22\ud835\ude35? \n \nUnfortunately, you are not done. \n \nThat was just the beginning of the journey. \n \n\u2192 Now, you have to write a Python client that calls the LLM. \n \n\ud835\udddf\ud835\uddf2\ud835\ude01'\ud835\ude00 \ud835\ude02\ud835\ude00\ud835\uddf2 \ud835\uddee \ud835\uddf1\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\ude06 \ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf8 \ud835\uddee\ud835\ude00 \ud835\uddee\ud835\uddfb \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2. \n \n\u2193\u2193\u2193 \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfed: Define a Settings object using \ud835\ude31\ud835\ude3a\ud835\ude25\ud835\ude22\ud835\ude2f\ud835\ude35\ud835\ude2a\ud835\ude24. \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee: Create an inference interface that inherits from \ud835\ude08\ud835\ude09\ud835\ude0a \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfef: Implement an \ud835\ude08\ud835\ude1e\ud835\ude1a \ud835\ude1a\ud835\ude22\ud835\ude28\ud835\ude26\ud835\ude14\ud835\ude22\ud835\ude2c\ud835\ude26\ud835\ude33 version of the inference interface by\nspecifying how to construct the HTTP payload and call the SageMaker endpoint.\nWe want to keep this class independent from the summarization prompt! \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff0: Create the summarization prompt. \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff1: Encapsulate the summarization prompt and Python SageMaker client into\na \ud835\ude1a\ud835\ude36\ud835\ude2e\ud835\ude2e\ud835\ude22\ud835\ude33\ud835\ude2a\ud835\ude3b\ud835\ude26\ud835\ude1a\ud835\ude29\ud835\ude30\ud835\ude33\ud835\ude35\ud835\ude0b\ud835\ude30\ud835\ude24\ud835\ude36\ud835\ude2e\ud835\ude26\ud835\ude2f\ud835\ude35 task. \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff2: Wrap the \ud835\ude1a\ud835\ude36\ud835\ude2e\ud835\ude2e\ud835\ude22\ud835\ude33\ud835\ude2a\ud835\ude3b\ud835\ude26\ud835\ude1a\ud835\ude29\ud835\ude30\ud835\ude33\ud835\ude35\ud835\ude0b\ud835\ude30\ud835\ude24\ud835\ude36\ud835\ude2e\ud835\ude26\ud835\ude2f\ud835\ude35 task with a FastAPI endpoint. \n \n...and bam! \n \nYou have an LLM for summarizing any document. \n \n. 
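
To make those steps more tangible, here is a condensed, hypothetical sketch of that design, assuming boto3 and pydantic-settings are installed. The class names, settings, and response parsing are assumptions (the response format shown matches a typical Hugging Face text-generation container); the linked article contains the full implementation:

    import json
    from abc import ABC, abstractmethod

    import boto3
    from pydantic_settings import BaseSettings


    class Settings(BaseSettings):
        # Step 1: central config (endpoint name and region are placeholders).
        SAGEMAKER_ENDPOINT: str = "summarization-llm-endpoint"
        AWS_REGION: str = "eu-central-1"


    class Inference(ABC):
        # Step 2: the interface the rest of the code depends on.
        @abstractmethod
        def generate(self, prompt: str) -> str: ...


    class SageMakerInference(Inference):
        # Step 3: AWS-specific implementation, unaware of any particular prompt.
        def __init__(self, settings: Settings) -> None:
            self._settings = settings
            self._client = boto3.client(
                "sagemaker-runtime", region_name=settings.AWS_REGION
            )

        def generate(self, prompt: str) -> str:
            response = self._client.invoke_endpoint(
                EndpointName=self._settings.SAGEMAKER_ENDPOINT,
                ContentType="application/json",
                Body=json.dumps({"inputs": prompt}),
            )
            # Assumes the Hugging Face text-generation response format.
            return json.loads(response["Body"].read())[0]["generated_text"]


    # Step 4: the prompt template lives outside the inference client.
    SUMMARY_PROMPT = "Summarize the following document in 3 bullet points:\n\n{document}"


    class SummarizeShortDocument:
        # Step 5: glue the prompt and the client into a reusable task.
        def __init__(self, llm: Inference) -> None:
            self._llm = llm

        def run(self, document: str) -> str:
            return self._llm.generate(SUMMARY_PROMPT.format(document=document))

A FastAPI endpoint (step 6) would simply instantiate SummarizeShortDocument and expose its run() method; it is omitted here for brevity.
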
\n \n\ud835\udddb\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude00\ud835\uddfc\ud835\uddfa\ud835\uddf2 \ud835\uddee\ud835\uddf1\ud835\ude03\ud835\uddee\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf4\ud835\uddf2\ud835\ude00 \ud835\uddfc\ud835\uddf3 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\uddef\ud835\uddf2\ud835\uddf1 \ud835\uddee\ud835\uddef\ud835\uddfc\ud835\ude03\ud835\uddf2: \n \n\\- by using an inference interface, you can quickly swap the LLM\nimplementation \n \n\\- by decoupling the prompt construction logic from the inference class, you\ncan reuse the inference client with any prompt \n \n\\- by wrapping everything with a \ud835\ude1a\ud835\ude36\ud835\ude2e\ud835\ude2e\ud835\ude22\ud835\ude33\ud835\ude2a\ud835\ude3b\ud835\ude26\ud835\ude1a\ud835\ude29\ud835\ude30\ud835\ude33\ud835\ude35\ud835\ude0b\ud835\ude30\ud835\ude24\ud835\ude36\ud835\ude2e\ud835\ude26\ud835\ude2f\ud835\ude35 task you can quickly\ndefine & configure multiple types of tasks and leverage polymorphism to run\nthem \n \n_Here is the full article explaining how to design the inference module_ \u2193\n\n#### Steal my code to solve real-world problems\n\nVesa Alexandru\n\n\u00b7\n\nFeb 29\n\nRead full story\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n4\n\nShare this post\n\n#### 7 tips to reduce your VRAM when training LLMs\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/7-tips-to-reduce-your-vram-when-training?r=1ttoeh" }, { "id": "d0c592eb-82bc-46c4-9632-388f9dd144ce", "content": { "Title": "Using this Python package, you can x10 your text preprocessing pipelines", "Subtitle": "End-to-end framework for production-ready LLMs. Top 6 ML platform features you must know and use in your ML system.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Using this Python package, you can x10 your text preprocessing pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Using this Python package, you can x10 your text preprocessing pipelines\n\n### End-to-end framework for production-ready LLMs. 
Top 6 ML platform features\nyou must know and use in your ML system.\n\nPaul Iusztin\n\nMay 11, 2024\n\n9\n\nShare this post\n\n#### Using this Python package, you can x10 your text preprocessing pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * Top 6 ML platform features you must know and use in your ML system.\n\n * Using this Python package, you can x10 your text preprocessing pipelines\n\n * End-to-end framework for production-ready LLMs\n\n* * *\n\n### Top 6 ML platform features you must know and use in your ML system\n\nHere they are \u2193 \n \n#\ud835\udfed. \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\uddf8\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nIn your ML development phase, you generate lots of experiments. \n \nTracking and comparing the metrics between them is crucial in finding the\noptimal model. \n \n#\ud835\udfee. \ud835\udde0\ud835\uddf2\ud835\ude01\ud835\uddee\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\udde6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf2 \n \nIts primary purpose is reproducibility. \n \nTo know how a model was generated, you need to know: \n\\- the version of the code \n\\- the version of the packages \n\\- hyperparameters/config \n\\- total compute \n\\- version of the dataset \n... and more \n \n#\ud835\udfef. \ud835\udde9\ud835\uddf6\ud835\ude00\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude00\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 \n \nMost of the time, along with the metrics, you must log a set of visualizations\nfor your experiment. \n \nSuch as: \n\\- images \n\\- videos \n\\- prompts \n\\- t-SNE graphs \n\\- 3D point clouds \n... and more \n \n#\ud835\udff0. \ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddfc\ud835\uddff\ud835\ude01\ud835\ude00 \n \nYou don't work in a vacuum. \n \nYou have to present your work to other colleges or clients. \n \nA report lets you take the metadata and visualizations from your experiment... \n \n...and create, deliver and share a targeted presentation for your clients or\npeers. \n \n#\ud835\udff1. \ud835\uddd4\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddf3\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\ude00 \n \nThe most powerful feature out of them all. \n \nAn artifact is a versioned object that is an input or output for your task. \n \nEverything can be an artifact, but the most common cases are: \n\\- data \n\\- model \n\\- code \n \nWrapping your assets around an artifact ensures reproducibility. \n \nFor example, you wrap your features into an artifact (e.g., features:3.1.2),\nwhich you can consume into your ML development step. \n \nThe ML development step will generate config (e.g., config:1.2.4) and code\n(e.g., code:1.0.2) artifacts used in the continuous training pipeline. \n \nDoing so lets you quickly respond to questions such as \"What I used to\ngenerate the model?\" and \"What Version?\" \n \n#\ud835\udff2. \ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\udde5\ud835\uddf2\ud835\uddf4\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude06 \n \nThe model registry is the ultimate way to make your model accessible to your\nproduction ecosystem. 
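
As a minimal sketch of that hand-off using Comet ML's SDK (names are illustrative; stage labels such as "staging" and "production" are typically managed afterwards in the registry UI or through its API):

    from comet_ml import Experiment

    # Assumes COMET_API_KEY is set in the environment.
    experiment = Experiment(project_name="llm-twin-training")

    # Upload the trained weights to the experiment...
    experiment.log_model("llm-twin", "./output/model")

    # ...and push them to the model registry as a new version.
    experiment.register_model("llm-twin")
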
\n \nFor example, in your continuous training pipeline, after the model is trained,\nyou load the weights as an artifact into the model registry (e.g.,\nmodel:1.2.4). \n \nYou label this model as \"staging\" under a new version and prepare it for\ntesting. If the tests pass, mark it as \"production\" under a new version and\nprepare it for deployment (e.g., model:2.1.5).\n\nAll of these features are used in a mature ML system. What is your favorite\none?\n\n* * *\n\n### Using this Python package, you can x10 your text preprocessing pipelines\n\nAny text preprocessing pipeline has to clean, partition, extract, or chunk\ntext data to feed it into your LLMs. \n \n\ud835\ude02\ud835\uddfb\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 offers a \ud835\uddff\ud835\uddf6\ud835\uddf0\ud835\uddf5 and \ud835\uddf0\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddfb \ud835\uddd4\ud835\udde3\ud835\udddc that allows you to quickly: \n \n\\- \ud835\ude31\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude2a\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f your data into smaller segments from various data sources (e.g.,\nHTML, CSV, PDFs, even images, etc.) \n\\- \ud835\ude24\ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 the text of anomalies (e.g., wrong ASCII characters), any\nirrelevant information (e.g., white spaces, bullets, etc.), and filling\nmissing values \n\\- \ud835\ude26\ud835\ude39\ud835\ude35\ud835\ude33\ud835\ude22\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude2f\ud835\ude28 information from pieces of text (e.g., datetimes, addresses, IP\naddresses, etc.) \n\\- \ud835\ude24\ud835\ude29\ud835\ude36\ud835\ude2f\ud835\ude2c\ud835\ude2a\ud835\ude2f\ud835\ude28 your text segments into pieces of text that can be inserted into\nyour embedding model \n\\- \ud835\ude26\ud835\ude2e\ud835\ude23\ud835\ude26\ud835\ude25\ud835\ude25\ud835\ude2a\ud835\ude2f\ud835\ude28 data (e.g., wrapper over OpenAIEmbeddingEncoder,\nHuggingFaceEmbeddingEncoders, etc.) \n\\- \ud835\ude34\ud835\ude35\ud835\ude22\ud835\ude28\ud835\ude26 your data to be fed into various tools (e.g., Label Studio, Label\nBox, etc.) \n \n\ud835\uddd4\ud835\uddf9\ud835\uddf9 \ud835\ude01\ud835\uddf5\ud835\uddf2\ud835\ude00\ud835\uddf2 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd\ud835\ude00 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddf3\ud835\uddfc\ud835\uddff: \n \n\\- feeding your data into your LLMs \n\\- embedding the data and ingesting it into a vector DB \n\\- doing RAG \n\\- labeling \n\\- recommender systems \n \n... basically for any LLM or multimodal applications \n \n. \n \nImplementing all these steps from scratch will take a lot of time. 
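
To give a flavor of the API described above, here is a minimal sketch of a partition → clean → chunk flow with unstructured (module and function names are taken from its public API, but treat the exact signatures as assumptions that may differ between versions):

    from transformers import AutoTokenizer
    from unstructured.cleaners.core import clean, clean_non_ascii_chars
    from unstructured.partition.html import partition_html
    from unstructured.staging.huggingface import chunk_by_attention_window

    # 1. Partition: split the raw page into elements (titles, paragraphs, lists, ...).
    elements = partition_html(url="https://example.com/some-article")  # placeholder URL

    # 2. Clean: strip bullets, extra whitespace, and non-ASCII noise from each element.
    text = " ".join(
        clean_non_ascii_chars(clean(element.text, bullets=True, extra_whitespace=True))
        for element in elements
    )

    # 3. Chunk: cut the text into pieces that fit the embedding model's context window.
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    chunks = chunk_by_attention_window(text, tokenizer)
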
\n \nI know some Python packages already do this, but the functionality is\nscattered across multiple packages.\n\n\ud835\ude02\ud835\uddfb\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 packages everything together under a nice, clean API.\n\n* * *\n\n### End-to-end framework for production-ready LLMs\n\nWant to \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 in a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\ude04\ud835\uddee\ud835\ude06? For \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8? Then \ud835\ude06\ud835\uddfc\ud835\ude02\n\ud835\ude00\ud835\uddf5\ud835\uddfc\ud835\ude02\ud835\uddf9\ud835\uddf1 \ud835\ude01\ud835\uddee\ud835\uddf8\ud835\uddf2 our \ud835\udde1\ud835\uddd8\ud835\uddea \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 on how to \ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 an \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\uddf3\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 for\n\ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 \u2193 \n \n\ud83e\udde0 Decoding ML and I are \ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 a \ud835\uddfb\ud835\uddf2\ud835\ude04 \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 on \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 how to\n\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01 and \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 a \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9-\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf9\ud835\uddf1 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa by \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 an \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \n \n\u2192 from start to finish - from \n\u2192 from data collection to deployment \n\u2192 production-ready \n\u2192 from NO MLOps to experiment trackers, model registries, prompt monitoring,\nand versioning\n\nThe course is called: \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \ud835\uddd5\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\udde5\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\uddd4\ud835\udddc 
\ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee \n \n...and here is what you will learn to build \n \n\u2193\u2193\u2193 \n \n\ud83d\udc0d 4 \ud835\ude17\ud835\ude3a\ud835\ude35\ud835\ude29\ud835\ude30\ud835\ude2f \ud835\ude2e\ud835\ude2a\ud835\ude24\ud835\ude33\ud835\ude30\ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude2a\ud835\ude24\ud835\ude26\ud835\ude34: \n \n\u2192 \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\uddf0\ud835\uddfc\ud835\uddf9\ud835\uddf9\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \n\\- Crawl your digital data from various social media platforms. \n\\- Clean, normalize and load the data to a NoSQL DB through a series of ETL\npipelines. \n\\- Send database changes to a queue using the CDC pattern. \n \n\u2601 Deployed on AWS.\n\n \n \n\u2192 \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \n\\- Consume messages from a queue through a Bytewax streaming pipeline. \n\\- Every message will be cleaned, chunked, embedded and loaded into a Qdrant\nvector DB in real-time. \n \n\u2601 Deployed on AWS. \n \n \n\u2192 \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \n\\- Create a custom dataset based on your digital data. \n\\- Fine-tune an LLM using QLoRA. \n\\- Use Comet ML's experiment tracker to monitor the experiments. \n\\- Evaluate and save the best model to Comet's model registry. \n \n\u2601 Deployed on Qwak. \n \n \n\u2192 \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \n\\- Load and quantize the fine-tuned LLM from Comet's model registry. \n\\- Deploy it as a REST API \n\\- Enhance the prompts using RAG \n\\- Generate content using your LLM twin \n\\- Monitor the LLM using Comet's prompt monitoring dashboard \n \n\u2601 Deployed on Qwak. \n \n. \n \n\ud835\ude08\ud835\ude2d\ud835\ude30\ud835\ude2f\ud835\ude28 \ud835\ude35\ud835\ude29\ud835\ude26 4 \ud835\ude2e\ud835\ude2a\ud835\ude24\ud835\ude33\ud835\ude30\ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude2a\ud835\ude24\ud835\ude26\ud835\ude34, \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude38\ud835\ude2a\ud835\ude2d\ud835\ude2d \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude35\ud835\ude30 \ud835\ude2a\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude28\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 3 \ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude26\ud835\ude33\ud835\ude2d\ud835\ude26\ud835\ude34\ud835\ude34 \ud835\ude35\ud835\ude30\ud835\ude30\ud835\ude2d\ud835\ude34: \n \n\\- Comet as your ML Platform \n\\- Qdrant as your vector DB \n\\- Qwak as your ML infrastructure \n \n. 
\n \nTo stay updated on \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \ud835\uddd5\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\udde5\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\uddd4\ud835\udddc \ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee\ncourse... \n \n\ud835\ude3e\ud835\ude5d\ud835\ude5a\ud835\ude58\ud835\ude60 \ud835\ude5e\ud835\ude69 \ud835\ude64\ud835\ude6a\ud835\ude69 \ud835\ude42\ud835\ude5e\ud835\ude69\ud835\ude43\ud835\ude6a\ud835\ude57 \ud835\ude56\ud835\ude63\ud835\ude59 \ud835\ude68\ud835\ude6a\ud835\ude65\ud835\ude65\ud835\ude64\ud835\ude67\ud835\ude69 \ud835\ude6a\ud835\ude68 \ud835\ude6c\ud835\ude5e\ud835\ude69\ud835\ude5d \ud835\ude56 \u2b50\ufe0f \n \n\u2193\u2193\u2193 \n \n\ud83d\udd17 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \ud835\uddd5\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\udde5\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\uddd4\ud835\udddc \ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n9\n\nShare this post\n\n#### Using this Python package, you can x10 your text preprocessing pipelines\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/using-this-python-package-you-can?r=1ttoeh" }, { "id": "46f9a4cc-cf3b-43c6-9026-6c9cddf8674a", "content": { "Title": "4 Advanced RAG Algorithms You Must Know - by Paul Iusztin", "Subtitle": "Implement 4 advanced RAG retrieval techniques to optimize your vector DB searches. 
Integrate the RAG retrieval module into a production LLM system.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### The 4 Advanced RAG Algorithms You Must Know to Implement\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# The 4 Advanced RAG Algorithms You Must Know to Implement\n\n### Implement from scratch 4 advanced RAG methods to optimize your retrieval\nand post-retrieval algorithm\n\nPaul Iusztin\n\nMay 09, 2024\n\n17\n\nShare this post\n\n#### The 4 Advanced RAG Algorithms You Must Know to Implement\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n1\n\nShare\n\n _\u2192 the 5th out of 11 lessons of the LLM Twin free course_\n\n### **Why is this course different?**\n\n_By finishing the \u201c**LLM Twin: Building Your Production-Ready AI\nReplica\u201d**_****_free course, you will learn how to design, train, and deploy a\nproduction-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps\ngood practices_.\n\n_**Why should you care? \ud83e\udef5**_\n\n _**\u2192 No more isolated scripts or Notebooks!** Learn production ML by building\nand deploying an end-to-end production-grade LLM system._\n\n> More **details** on what you will **learn** within the **LLM Twin**\n> **course** , **here** \ud83d\udc48\n\n* * *\n\n### Latest Lessons of the LLM Twin Course\n\n**Lesson 2** : The importance of Data Pipeline in the era of Generative AI\n\n\u2192 Data crawling, ETL pipelines, ODM, NoSQL Database\n\n**Lesson 3:** CDC: Enabling Event-Driven Architectures\n\n\u2192 Change Data Capture (CDC), MongoDB Watcher, RabbitMQ queue\n\n**Lesson 4:** Python Streaming Pipelines for Fine-tuning LLMs and RAG - in\nReal-Time!\n\n\u2192 Feature pipeline, Bytewax streaming engine, Pydantic models, The dispatcher\nlayer\n\n* * *\n\n### Lesson 5: **The 4 Advanced RAG Algorithms You Must Know to Implement**\n\nIn **Lesson 5** , we will focus on building an advanced retrieval module used\nfor RAG.\n\nWe will show you how to implement 4 **retrieval** and **post-retrieval\nadvanced optimization techniques** to **improve** the **accuracy** of your\n**RAG retrieval step**.\n\nIn this lesson, we will focus only on the retrieval part of the RAG system.\n\nIn **Lesson 4** , we showed you how to clean, chunk, embed, and load social\nmedia data to a Qdrant vector DB (the ingestion part of RAG).\n\nIn future lessons, we will integrate this retrieval module into the inference\npipeline for a full-fledged RAG system.\n\nRetrieval Python Module Architecture\n\n* * *\n\n### 1\\. 
Overview of advanced RAG optimization techniques\n\nA production RAG system is split into **3 main components** :\n\n * **ingestion:** clean, chunk, embed, and load your data to a vector DB\n\n * **retrieval:** query your vector DB for context\n\n * **generation:** attach the retrieved context to your prompt and pass it to an LLM\n\nThe **ingestion component** sits in the _feature pipeline_ , while the\n**retrieval** and **generation** **components** are implemented inside the\n_inference pipeline_.\n\nYou can **also** **use** the **retrieval** and **generation** **components**\nin your _training pipeline_ to fine-tune your LLM further on domain-specific\nprompts.\n\nYou can apply advanced techniques to optimize your RAG system for ingestion,\nretrieval and generation.\n\n_That being said, there are 3 main types of advanced RAG techniques:_\n\n * **Pre-retrieval optimization**[ingestion]: tweak how you create the chunks\n\n * **Retrieval optimization**[retrieval]:**** improve the queries to your vector DB\n\n * **Post-retrieval optimization**[retrieval]**:** process the retrieved chunks to filter out the noise\n\n> The **generation step** can be **improved** through fine-tuning or prompt\n> engineering, which will be explained in future lessons.\n\nThe **pre-retrieval optimization techniques** are explained in Lesson 4.\n\nIn this lesson, we will show you some **popular** **retrieval** and **post-\nretrieval** **optimization techniques**.\n\n* * *\n\n### 2\\. Advanced RAG techniques applied to the LLM twin\n\n#### **Retrieval optimization**\n\n _We will combine 3 techniques:_\n\n * Query Expansion\n\n * Self Query\n\n * Filtered vector search\n\n#### **Post-retrieval optimization**\n\nWe will **use** the **rerank** pattern **using** **GPT-4** and **prompt\nengineering** instead of Cohere or an open-source re-ranker cross-encoder [4].\n\nI don\u2019t want to spend too much time on the theoretical aspects. There are\nplenty of articles on that.\n\n_So, we will**jump** straight to **implementing** and **integrating** these\ntechniques in our LLM twin system._\n\nBut first, let\u2019s clarify why we picked Qdrant as our vector DB \u2193\n\n#### 2.1. Why Qdrant?\n\nThere are many vector DBs out there, too many\u2026\n\nBut since we discovered Qdrant, we loved it.\n\n**Why?**\n\n * It is built in Rust.\n\n * Apache-2.0 license \u2014 open-source \ud83d\udd25\n\n * It has a great and intuitive Python SDK.\n\n * It has a freemium self-hosted version to build PoCs for free.\n\n * It supports unlimited document sizes, and vector dims of up to 645536.\n\n * It is production-ready. Companies such as Disney, Mozilla, and Microsoft already use it.\n\n * It is one of the most popular vector DBs out there.\n\n_**To** **put that in perspective,**_ Pinecone, one of its biggest\ncompetitors, supports only documents with up to 40k tokens and vectors with up\nto 20k dimensions\u2026. and a proprietary license.\n\nI could go on and on\u2026\n\n\u2026but if you are **curious to find out more** , _check out Qdrant _\u2190\n\n* * *\n\n### 3\\. 
Retrieval optimization (1): Query expansion\n\nQuery expansion is quite intuitive.\n\nYou use an LLM to generate multiple queries based on your initial query.\n\nThese queries should contain multiple perspectives of the initial query.\n\nThus, when embedded, they hit different areas of your embedding space that are\nstill relevant to our initial question.\n\nYou can do query expansion with a detailed zero-shot prompt.\n\nQuery expansion template \u2192 GitHub Code \u2190\n\n### 4\\. Retrieval optimization (2): Self query\n\nWhat if you could extract the tags within the query and use them along the\nembedded query?\n\nThat is what self-query is all about!\n\nYou use an LLM to extract various metadata fields that are critical for your\nbusiness use case (e.g., tags, author ID, number of comments, likes, shares,\netc.)\n\nIn our custom solution, we are extracting just the author ID. Thus, a zero-\nshot prompt engineering technique will do the job.\n\n_Self-queries work hand-in-hand with vector filter searches, which we will\nexplain in the next section._\n\nTo define the _**SelfQueryTemplate**_ , we have to:\n\n * Subclass the base abstract class\n\n * Define the self-query prompt\n\n * Create the LangChain PromptTemplate wrapper\n\n \n \n class **SelfQueryTemplate**(BasePromptTemplate):\n prompt: str = \"\"\"\n You are an AI language model assistant. \n Your task is to extract information from a user question.\n The required information that needs to be extracted is the user id. \n Your response should consists of only the extracted id (e.g. 1345256), nothing else.\n User question: {question}\n \"\"\"\n \n def create_template(self) -> PromptTemplate:\n return PromptTemplate(\n template=self.prompt, input_variables=[\"question\"], verbose=True\n )\n\n### 5\\. 
Retrieval optimization (3): Hybrid & filtered vector search\n\nCombine the vector search technique with one (or more) complementary search\nstrategy, which works great for finding exact words.\n\nIt is not defined which algorithms are combined, but the most standard\nstrategy for hybrid search is to combine the traditional keyword-based search\nand modern vector search.\n\n_How are these combined?_\n\n_The**first method** is to merge the similarity scores of the 2 techniques as\nfollows:_\n\n \n \n hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score\n\nWhere **alpha** takes a value between [0, 1], with:\n\n * **alpha = 1** : Vector Search\n\n * **alpha = 0** : Keyword search\n\nAlso, the similarity scores are defined as follows:\n\n * **sparse_score:** is the result of the _keyword search_ that, behind the scenes, uses a BM25 algorithm [7] that sits on top of TF-IDF.\n\n * **dense_score:** is the result of the _vector search_ that most commonly uses a similarity metric such as cosine distance\n\n _The**second method** uses the vector search technique as usual and applies a\nfilter based on your keywords on top of the metadata of retrieved results._\n\n> \u2192 This is also known as**filtered vector search**.\n\nIn this use case, the **similar score** is **not changed based** on the\n**provided** **keywords**.\n\nIt is just a fancy word for a simple filter applied to the metadata of your\nvectors.\n\nBut it is **essential** to **understand** the **difference** **between** the\n**first** and **second** **methods** :\n\n * the**first method** combines the similarity score between the keywords and vectors using the alpha parameter;\n\n * the **second method** is a simple filter on top of your vector search.\n\n#### How does this fit into our architecture?\n\nRemember that during the self-query step, we extracted the **author_id** as an\nexact field that we have to match.\n\nThus, we will search for the **author_id** using the keyword search algorithm\nand attach it to the 5 queries generated by the query expansion step.\n\n_As we want the**most relevant chunks** from a **given author,** it makes the\nmost sense to use a **filter** **using** the **author_id** as follows\n(**filtered vector search**)_ \u2193\n\n \n \n self._qdrant_client.search(\n collection_name=\"vector_posts\",\n query_filter=models.Filter(\n must=[\n models.FieldCondition(\n key=\"author_id\",\n match=models.MatchValue(\n value=metadata_filter_value,\n ),\n )\n ]\n ),\n query_vector=self._embedder.encode(generated_query).tolist(),\n limit=k,\n\nNote that we can easily extend this with multiple keywords (e.g., tags),\nmaking the combination of self-query and hybrid search a powerful retrieval\nduo.\n\nThe only **question** you have to **ask yourself** is whether we want to\n**use** a simple **vector search filter** or the more complex **hybrid\nsearch** strategy.\n\n### 6\\. 
### 6\. Implement the advanced retrieval Python class

_Now that you've understood the **advanced retrieval optimization techniques** we're using, let's **combine** them into a **Python retrieval class**._

Query expansion chains wrapper → GitHub ←

The final step is to call Qdrant for each query generated by the query expansion step ↓

VectorRetriever: main search function → GitHub ←

_Note that we have **3 types of data**: posts, articles, and code repositories._

Thus, we have to make a query for each collection and combine the results in the end.

We queried each collection individually and kept the best-retrieved results using rerank, which is the final step of the article.

### 7\. Post-retrieval optimization: Rerank using GPT-4

We ran a **separate search** in the Qdrant vector DB for each of the **N queries generated** by the **query expansion step**.

**Each search** returns **K results**.

Thus, we **end up with N x K chunks**.

In our particular case, **N = 5** & **K = 3**, so we end up with 15 chunks.

Post-retrieval optimization: rerank

We will use **rerank** to order all the **N x K** chunks based on their relevance relative to the initial question, where the first chunk is the most relevant and the last is the least relevant.

Ultimately, we will pick the TOP K most relevant chunks.

Rerank works really well when combined with query expansion.

_A natural flow when using rerank is as follows:_

    Search for >K chunks >>> Reorder using rerank >>> Take top K

Thus, when combined with query expansion, we gather potentially useful context from multiple points in the embedding space rather than just looking for more than K samples in a single location.

_Now the flow looks like:_

    Search for N x K chunks >>> Reorder using rerank >>> Take top K

A typical solution for reranking is to use open-source cross-encoders from sentence transformers [4].

These models take both the question and a passage as input and return a relevance score from 0 to 1.

In this article, we take a different approach and use GPT-4 + prompt engineering as our reranker.

If you want to see how to apply rerank using open-source algorithms, check out this hands-on article from Decoding ML: _A Real-time Retrieval System for RAG on Social Media Data_ (Paul Iusztin, Mar 7).

Now let's see our implementation using GPT-4 & prompt engineering.

Similar to what we did for the expansion and self-query chains, we define a template and a chain builder ↓

    class RerankingTemplate(BasePromptTemplate):\n prompt: str = \"\"\"\n You are an AI language model assistant. \n Your task is to rerank passages related to a query\n based on their relevance. The most relevant passages \n should be put at the beginning. 
\n You should only pick at max {k} passages.\n The following are passages related to this query: {question}.\n Passages: {passages}\n \"\"\"\n \n def create_template(self) -> PromptTemplate:\n return PromptTemplate(\n template=self.prompt, \n input_variables=[\"question\", \"passages\"])\n\n\u2026and that\u2019s it!\n\n* * *\n\n### Conclusion\n\n _Congratulations!_\n\nIn **Lesson 5** , you learned to **build** an **advanced RAG retrieval\nmodule** optimized for searching posts, articles, and code repositories from a\nQdrant vector DB.\n\n**First** , you learned about where the RAG pipeline can be optimized:\n\n * pre-retrieval\n\n * retrieval\n\n * post-retrieval\n\n**After** you learn how to build from scratch (without using LangChain\u2019s\nutilities) the following advanced RAG retrieval & post-retrieval optimization\ntechniques:\n\n * query expansion\n\n * self query\n\n * hybrid search\n\n * rerank\n\n**Ultimately** , you understood where the retrieval component sits in an RAG\nproduction LLM system, where the code is shared between multiple microservices\nand doesn\u2019t sit in a single Notebook.\n\n_**Next week** , in **Lesson 6** , we will move to the training pipeline and\nshow you how to automatically transform the data crawled from LinkedIn,\nSubstack, Medium, and GitHub into an instruction dataset using GPT-4 to fine-\ntune your LLM Twin._\n\nSee you there! \ud83e\udd17\n\n* * *\n\n### Next Steps\n\n#### Step 1\n\nThis is just the **short version** of **Lesson 5** on the **advanced RAG\nretrieval module**.\n\n\u2192 For\u2026\n\n * The full implementation.\n\n * Discussion on our custom implementation vs. LangChain.\n\n * More on the problems these 4 advanced RAG techniques solve.\n\n * How to use the retrieval module.\n\n**Check out** the **full version** of **Lesson 5** on our **Medium\npublication**. It\u2019s still FREE:\n\nLesson 5 - FREE Medium Article\n\n#### Step 2\n\n\u2192 **Check out theLLM Twin GitHub repository and try it yourself \ud83e\udef5**\n\n _Nothing compares with getting your hands dirty and building it yourself!_\n\nLLM Twin Course - GitHub\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n17\n\nShare this post\n\n#### The 4 Advanced RAG Algorithms You Must Know to Implement\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n1\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| Meng LiAI Disruption May 17Great, thanks for sharing!Expand full\ncommentReplyShare \n---|--- \n \nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-4-advanced-rag-algorithms-you?r=1ttoeh" }, { "id": "037e6362-8be7-4860-992f-1f075921a669", "content": { "Title": "Problems deploying your ML models? Here is your solution!", "Subtitle": "PyTorch + CUDA ultimate guide. Synthetic data generation. Serverless infrastructure.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Problems deploying your ML models? 
Here is your solution!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Problems deploying your ML models? Here is your solution!\n\n### PyTorch + CUDA ultimate guide. Synthetic data generation. Serverless\ninfrastructure.\n\nPaul Iusztin\n\nApr 27, 2024\n\n10\n\nShare this post\n\n#### Problems deploying your ML models? Here is your solution!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * The ultimate guide on installing PyTorch with CUDA support in all possible ways\n\n * Generate a synthetic domain-specific Q&A dataset in <30 minutes\n\n * The power of serverless in the world of ML\n\n* * *\n\nExciting news \ud83d\udd25 I was invited by Maven to speak in their Lighting Lesson\nseries about how to \ud835\uddd4\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb.\n\nRegister here (it\u2019s free) \u2190\n\nThis 30-min session is for ML & MLOps engineers who want to learn:\n\n\ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde6\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddfc\ud835\uddf3 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb\n\n\u2192 Using the 3-pipeline architecture & MLOps good practices\n\n\ud835\uddd7\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddee \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\uddf0\ud835\uddfc\ud835\uddf9\ud835\uddf9\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\u2192 data crawling, ETLs, CDC, AWS\n\n\ud835\uddd7\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddee \ud835\uddf3\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\u2192 streaming engine in Python, data ingestion for fine-tuning & RAG, vector DBs\n\n\ud835\uddd7\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddee \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\u2192 create a custom dataset, fine-tuning, model registries, experiment trackers,\nLLM evaluation\n\n\ud835\uddd7\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\u2192 real-time deployment, REST API, RAG, LLM monitoring\n\n\u2193\u2193\u2193\n\n> Join LIVE on \ud835\ude0d\ud835\ude33\ud835\ude2a, \ud835\ude14\ud835\ude22\ud835\ude3a 3!\n>\n> Register here (it\u2019s free) \u2190\n\n* * *\n\n### The ultimate guide on installing PyTorch with CUDA support in all possible\nways\n\nEver wanted to quit ML while wrestling with \ud835\uddd6\ud835\udde8\ud835\uddd7\ud835\uddd4 
\ud835\uddf2\ud835\uddff\ud835\uddff\ud835\uddfc\ud835\uddff\ud835\ude00? I know I did. \u2192\nDiscover \ud835\uddf5\ud835\uddfc\ud835\ude04 to install \ud835\uddd6\ud835\udde8\ud835\uddd7\ud835\uddd4 & \ud835\udde3\ud835\ude06\ud835\udde7\ud835\uddfc\ud835\uddff\ud835\uddf0\ud835\uddf5 \ud835\uddfd\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf9\ud835\ude06 in all possible ways. \n \nHere is the story of most ML people: \n \n1\\. You just got excited about a new model that came out. \n \n2\\. You want to try it out. \n \n3\\. You install everything. \n \n4\\. You run the model. \n \n5\\. Bam... CUDA error. \n \n6\\. You fix the error. \n \n7\\. Bam... Another CUDA error \n \n7\\. You fix the error. \n \n8\\. ...Yet another CUDA error. \n \nYou get the idea. \n \n\u2192 Now it is 3:00 am, and you finally solved all your CUDA errors and ran your\nmodel. \n \nNow, it's time to do your actual work. \n \nDo you relate? \n \nIf so... \n \nI started a Medium article where I documented good practices and step-by-step\ninstructions on how to install CUDA & PyTorch with: \n \n\\- Pip \n\\- Conda (or Mamba) \n\\- Poetry \n\\- Docker\n\nDocker entry point - bash template\n\n> **Check it out** \u2193 \n> \n> \ud83d\udd17 _**The ultimate guide on installing PyTorch with CUDA support in all\n> possible ways**_\n\n\ud835\udde1\ud835\uddfc\ud835\ude01\ud835\uddf2: Feel free to comment with any improvements on how to install CUDA +\nPyTorch. Let's make the ultimate tutorial on installing these 2 beasts \ud83d\udd25\n\n* * *\n\n### Generate a synthetic domain-specific Q&A dataset in <30 minutes\n\nHow do you \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf2 a \ud835\ude00\ud835\ude06\ud835\uddfb\ud835\ude01\ud835\uddf5\ud835\uddf2\ud835\ude01\ud835\uddf6\ud835\uddf0 \ud835\uddf1\ud835\uddfc\ud835\uddfa\ud835\uddee\ud835\uddf6\ud835\uddfb-\ud835\ude00\ud835\uddfd\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\uddf3\ud835\uddf6\ud835\uddf0 \ud835\udde4&\ud835\uddd4 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01 in <\ud835\udfef\ud835\udfec \ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\ude02\ud835\ude01\ud835\uddf2\ud835\ude00 to\n\ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 your \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0? \n \nThis method is also known as \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddf9\ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb. 
Here are its 3 \ud835\ude2e\ud835\ude22\ud835\ude2a\ud835\ude2f\n\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31\ud835\ude34 \u2193 \n \n\ud835\ude0d\ud835\ude30\ud835\ude33 \ud835\ude26\ud835\ude39\ud835\ude22\ud835\ude2e\ud835\ude31\ud835\ude2d\ud835\ude26, \ud835\ude2d\ud835\ude26\ud835\ude35'\ud835\ude34 \ud835\ude28\ud835\ude26\ud835\ude2f\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 \ud835\ude22 \ud835\ude18&\ud835\ude08 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude25\ud835\ude22\ud835\ude35\ud835\ude22\ud835\ude34\ud835\ude26\ud835\ude35 \ud835\ude36\ud835\ude34\ud835\ude26\ud835\ude25 \ud835\ude35\ud835\ude30 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude26 \ud835\ude22\n\ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude2a\ud835\ude22\ud835\ude2d \ud835\ude22\ud835\ude25\ud835\ude37\ud835\ude2a\ud835\ude34\ud835\ude30\ud835\ude33 \ud835\ude13\ud835\ude13\ud835\ude14. \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfed: \ud835\udde0\ud835\uddee\ud835\uddfb\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf9\ud835\ude06 \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddee \ud835\uddf3\ud835\uddf2\ud835\ude04 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \n \nGenerate a few input samples (~3) that have the following structure: \n\\- \ud835\ude36\ud835\ude34\ud835\ude26\ud835\ude33_\ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude39\ud835\ude35: describe the type of investor (e.g., \"I am a 28-year-old\nmarketing professional\") \n\\- \ud835\ude32\ud835\ude36\ud835\ude26\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f: describe the user's intention (e.g., \"Is Bitcoin a good\ninvestment option?\") \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee: \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf5\ud835\uddf2\ud835\uddf9\ud835\uddfd \ud835\uddfc\ud835\uddf3 \ud835\uddee \ud835\ude01\ud835\uddf2\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \n \nUse a powerful LLM as a teacher (e.g., GPT4, Falcon 180B, etc.) to generate up\nto +N similar input examples. \n \nWe generated 100 input examples in our use case, but you can generate more. \n \nYou will use the manually filled input examples to do few-shot prompting. \n \nThis will guide the LLM to give you domain-specific samples. \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude2e\ud835\ude31\ud835\ude35 \ud835\ude38\ud835\ude2a\ud835\ude2d\ud835\ude2d \ud835\ude2d\ud835\ude30\ud835\ude30\ud835\ude2c \ud835\ude2d\ud835\ude2a\ud835\ude2c\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude2a\ud835\ude34: \n\"\"\" \n... \nGenerate 100 more examples with the following pattern: \n \n# USER CONTEXT 1 \n... \n \n# QUESTION 1 \n... \n \n# USER CONTEXT 2 \n... 
\n\"\"\" \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfef: \ud835\udde8\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddf2\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\uddfc \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddfc\ud835\ude02\ud835\ude01\ud835\uddfd\ud835\ude02\ud835\ude01\ud835\ude00 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddee\ud835\uddf9\ud835\uddf9 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \n \nNow, you will have the same powerful LLM as a teacher, but this time, it will\nanswer all your N input examples. \n \nBut first, to introduce more variance, we will use RAG to enrich the input\nexamples with news context. \n \nAfterward, we will use the teacher LLM to answer all N input examples. \n \n...and bam! You generated a domain-specific Q&A dataset with almost 0 manual\nwork. \n \n. \n \nNow, you will use this data to train a smaller LLM (e.g., Falcon 7B) on a\nniched task, such as financial advising. \n \nThis technique is known as finetuning with distillation because you use a\npowerful LLM as the teacher (e.g., GPT4, Falcon 180B) to generate the data,\nwhich will be used to fine-tune a smaller LLM (e.g., Falcon 7B), which acts as\nthe student.\n\nGenerate a Q&A dataset in <30 minutes\n\n \n\u2712\ufe0f \ud835\ude15\ud835\ude30\ud835\ude35\ud835\ude26: To ensure that the generated data is of high quality, you can hire a\ndomain expert to check & refine it.\n\n* * *\n\n### The power of serverless in the world of ML\n\n\ud835\uddd7\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06\ud835\uddf6\ud835\uddfb\ud835\uddf4 & \ud835\uddfa\ud835\uddee\ud835\uddfb\ud835\uddee\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf4 ML models is \ud835\uddf5\ud835\uddee\ud835\uddff\ud835\uddf1, especially when running your models on\nGPUs. \n \nBut \ud835\ude00\ud835\uddf2\ud835\uddff\ud835\ude03\ud835\uddf2\ud835\uddff\ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00 makes things \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06. 
\n \nUsing Beam as your serverless provider, deploying & managing ML models can be\nas easy as \u2193 \n \n\ud835\uddd7\ud835\uddf2\ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddf6\ud835\uddfb\ud835\uddf3\ud835\uddff\ud835\uddee\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 & \ud835\uddf1\ud835\uddf2\ud835\uddfd\ud835\uddf2\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddf2\ud835\ude00 \n \nIn a few lines of code, you define the application that contains: \n \n\\- the requirements of your infrastructure, such as the CPU, RAM, and GPU \n\\- the dependencies of your application \n\\- the volumes from where you can load your data and store your artifacts \n \n\ud835\uddd7\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddf7\ud835\uddfc\ud835\uddef\ud835\ude00 \n \nUsing the Beam application, you can quickly decorate your Python functions to: \n \n\\- run them once on the given serverless application \n\\- put your task/job in a queue to be processed or even schedule it using a\nCRON-based syntax \n\\- even deploy it as a RESTful API endpoint \n \n. \n \nAs you can see in the image below, you can have one central function for\ntraining or inference, and with minimal effort, you can switch from all these\ndeployment methods. \n \nAlso, you don't have to bother at all with managing the infrastructure on\nwhich your jobs run. You specify what you need, and Beam takes care of the\nrest. \n \nBy doing so, you can directly start to focus on your application and stop\ncarrying about the infrastructure. \n \nThis is the power of serverless!\n\nBeam example\n\n> \u21b3\ud83d\udd17 \ud835\ude0a\ud835\ude29\ud835\ude26\ud835\ude24\ud835\ude2c \ud835\ude30\ud835\ude36\ud835\ude35 \ud835\ude09\ud835\ude26\ud835\ude22\ud835\ude2e \ud835\ude35\ud835\ude30 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude2e\ud835\ude30\ud835\ude33\ud835\ude26\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n10\n\nShare this post\n\n#### Problems deploying your ML models? Here is your solution!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/problems-deploying-your-ml-models?r=1ttoeh" }, { "id": "c91e76e3-774c-43e7-91db-01c0c6bff57a", "content": { "Title": "Streaming Pipelines for LLMs and RAG - by Paul Iusztin", "Subtitle": "SOTA streaming pipeline in Python to clean, chunk, embed and load data to a vector DB (feature store) in real time: for fine-tuning LLMs and RAG (on AWS).", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG - in Real-\nTime!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG - in Real-Time!\n\n### Use a Python streaming engine to populate a feature store from 4+ data\nsources\n\nPaul Iusztin\n\nApr 25, 2024\n\n11\n\nShare this post\n\n#### SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG - in Real-\nTime!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n\u2192 the 4th out of 11 lessons of the LLM Twin free course\n\n**What is your LLM Twin?** It is an AI character that writes like yourself by\nincorporating your style, personality, and voice into an LLM.\n\nImage by DALL-E\n\n### **Why is this course different?**\n\n_By finishing the \u201c**LLM Twin: Building Your Production-Ready AI\nReplica\u201d**_****_free course, you will learn how to design, train, and deploy a\nproduction-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps\ngood practices_.\n\n_**Why should you care? \ud83e\udef5**_\n\n _**\u2192 No more isolated scripts or Notebooks!** Learn production ML by building\nand deploying an end-to-end production-grade LLM system._\n\n> More **details** on what you will **learn** within the **LLM Twin**\n> **course** , **here** \ud83d\udc48\n\n* * *\n\n### Latest Lessons of the LLM Twin Course\n\n**Lesson 1:**` `An End-to-End Framework for Production-Ready LLM Systems by\nBuilding Your LLM Twin\n\n\u2192 LLM Twin Concept, 3-Pipeline Architecture, System Design for LLM Twin\n\n**Lesson 2** : The importance of Data Pipeline in the era of Generative AI\n\n\u2192 Data crawling, ETL pipelines, ODM, NoSQL Database\n\n**Lesson 3:** CDC: Enabling Event-Driven Architectures\n\n\u2192 Change Data Capture (CDC), MongoDB Watcher, RabbitMQ queue\n\n* * *\n\n## Lesson 4: **Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-\nTime!**\n\nIn the **4th lesson** , we will focus on the **feature pipeline.**\n\nThe **feature pipeline** is the **first** **pipeline** presented in the **3\npipeline architecture** : feature, training and inference pipelines.\n\nA **feature pipeline** takes raw data as input, processes it into features,\nand stores it in a feature store, from which the training & inference\npipelines will use it.\n\nThe component is completely isolated from the training and inference code. 
All\nthe communication is done through the feature store.\n\nBy the **end of this** **article** , you will **learn** to **design** and\n**build** a **production-ready feature pipeline** that:\n\n * uses Bytewax as a stream engine to process data in real-time;\n\n * ingests data from a RabbitMQ queue;\n\n * uses SWE practices to process multiple data types: posts, articles, code;\n\n * cleans, chunks, and embeds data for LLM fine-tuning and RAG;\n\n * loads the features to a Qdrant vector DB.\n\n> Note that we will only cover the **vector DB retrieval client** and\n> **advanced retrieval techniques** in the **5th lesson**!\n\n_Excited? Let\u2019s get started!_\n\n* * *\n\n### Table of Contents:\n\n 1. Why are we doing this?\n\n 2. System design of the feature pipeline\n\n 3. The Bytewax streaming flow\n\n 4. Pydantic data models\n\n 5. Load data to Qdrant (our feature store)\n\n 6. The dispatcher layer\n\n> \ud83d\udd17 **Check out** the code on GitHub [1] and support us with a \u2b50\ufe0f\n\n* * *\n\n### 1\\. Why are we doing this?\n\n#### A quick reminder from previous lessons\n\nTo give you some context, in Lesson 2, we crawl data from LinkedIn, Medium,\nand GitHub, normalize it, and load it to MongoDB.\n\nIn Lesson 3, we are using CDC to listen to changes to the MongoDB database and\nemit events in a RabbitMQ queue based on any CRUD operation done on MongoDB.\n\n#### The problem we are solving\n\nIn our LLM Twin use case, the **feature pipeline** constantly syncs the\nMongoDB warehouse with the Qdrant vector DB (our feature store) while\nprocessing the raw data into features.\n\n#### Why we are solving it\n\nThe **feature store** will be the **central point of access** for all the\nfeatures used within the training and inference pipelines.\n\n\u2192 The **training pipeline** will use the feature store to create **fine-\ntunin** g datasets for your **LLM** **twin**.\n\n\u2192 The **inference pipeline** will use the feature store for **RAG**.\n\n### 2\\. System design of the feature pipeline: our solution\n\n _Our**solution** is based on **CDC** , a **queue,** a **streaming engine,**\nand a **vector DB:**_\n\n\u2192 CDC adds any change made to the Mongo DB to the queue (read more in Lesson\n3).\n\n\u2192 the RabbitMQ queue stores all the events until they are processed.\n\n\u2192 The Bytewax streaming engine cleans, chunks, and embeds the data.\n\n\u2192 A streaming engine works naturally with a queue-based system.\n\n\u2192 The data is uploaded to a Qdrant vector DB on the fly\n\n#### **Why is this powerful?**\n\nHere are 4 core reasons:\n\n 1. The **data** is **processed** in **real-time**.\n\n 2. **Out-of-the-box recovery system:** If the streaming pipeline fails to process a message will be added back to the queue \n\n 3. **Lightweight:** No need for any diffs between databases or batching too many records\n\n 4. **No I/O bottlenecks** on the source database\n\n\u2192 **It solves all our problems!**\n\nStreaming ingestion pipeline architecture and integration with the rest of the\ncomponents\n\n#### How do we process multiple data types?\n\nHow do you **process multiple types** **of** **data** in a **single streaming\npipeline** **without** **writing** **spaghetti code**?\n\nYes, that is for you, data scientists! 
**Joking\u2026** am I**?**\n\nWe have **3 data types** : posts, articles, and code.\n\n**Each data type** (and its state) will be **modeled** using **Pydantic**\n**models**.\n\nTo **process** them, we will write a **dispatcher layer** , which will use a\n**creational** **factory** **pattern **to **instantiate** a **handler**\nimplemented for that **specific data type** (post, article, code) and\n**operation** (cleaning, chunking, embedding).\n\nThe **handler** follows the **strategy behavioral pattern.**\n\n#### Streaming over batch\n\nNowadays, using tools such as Bytewax makes implementing streaming pipelines a\nlot more frictionless than using their JVM alternatives.\n\nThe key aspect of choosing a streaming vs. a batch design is real-time\nsynchronization between your source and destination DBs.\n\nIn our particular case, we will process social media data, which changes fast\nand irregularly.\n\nAlso, for our digital twin, it is important to do RAG on up-to-date data. We\ndon\u2019t want to have any delay between what happens in the real world and what\nyour LLM twin sees.\n\nThat being said, choosing a streaming architecture seemed natural in our use\ncase.\n\n* * *\n\n### 3\\. The Bytewax streaming flow\n\nThe **Bytewax flow** is the **central point** of the **streaming pipeline**.\nIt defines all the required steps, following the next simplified pattern:\n_\u201cinput - > processing -> output\u201d._\n\nAs I come from the AI world, I like to see it as the **\u201cgraph\u201d** of the\n**streaming pipeline** , where you use the _input()_ , _map()_ , and\n_output()_ Bytewax functions to define your graph, which in the **Bytewax\nworld** is **called** a _**\u201cflow\u201d**_.\n\nAs you can see in the code snippet below, we ingest posts, articles or code\nmessages from a RabbitMQ queue. 
After we clean, chunk and embed them.\nUltimately, we load the cleaned and embedded data to a Qdrant vector DB, which\nin our LLM twin use case will represent the feature store of our system.\n\nTo structure and validate the data, between each Bytewax step, we map and pass\na different Pydantic model based on its current state: raw, cleaned, chunked,\nor embedded.\n\nBytewax flow \u2192 GitHub Code \u2190\n\nWe have a single streaming pipeline that processes everything.\n\nAs we ingest multiple data types (posts, articles, or code snapshots), we have\nto process them differently.\n\nTo do this the right way, we implemented a dispatcher layer that knows how to\napply data-specific operations based on the type of message.\n\nMore on this in the next sections \u2193\n\n#### Why Bytewax?\n\n_Bytewax is an open-source streaming processing framework that:_ \n\\- is built in **Rust** \u2699\ufe0f for **performance** \n\\- has **Python** \ud83d\udc0d bindings for leveraging its powerful ML ecosystem\n\n\u2026 so, for all the Python fanatics out there, no more JVM headaches for you.\n\nJokes aside, here is why Bytewax is so powerful \u2193\n\n\\- Bytewax local setup is plug-and-play \n\\- can quickly be integrated into any Python project (you can go wild \u2014 even\nuse it in Notebooks) \n\\- can easily be integrated with other Python packages (NumPy, PyTorch,\nHuggingFace, OpenCV, SkLearn, you name it) \n\\- out-of-the-box connectors for Kafka and local files, or you can quickly\nimplement your own\n\nWe used Bytewax to build the streaming pipeline for the LLM Twin course and\nloved it.\n\n> To **learn more** about **Bytewax** , check out their **Substack** , where\n> you have the chance to **dive deeper** into **streaming engines**. In\n> Python. For FREE:\n>\n> \u2192 Bytewax Newsletter \u2190\n\n* * *\n\n### 4\\. Pydantic data models\n\nLet\u2019s take a look at what our Pydantic models look like.\n\nWe defined a hierarchy of Pydantic models for:\n\n * all our data types: posts, articles, or code\n\n * all our states: raw, cleaned, chunked, and embedded\n\nThis is how the set of classes for the posts will look like \u2193\n\nPydantic posts model structure \u2192 GitHub Code \u2190\n\nWe **repeated** the s**ame process** for the **articles** and **code** model\n**hierarchy**.\n\n### 5\\. Load data to Qdrant (our feature store)\n\nThe first step is to implement our custom Bytewax _DynamicSink_ class \u2193\n\nQdrant DynamicSink \u2192 GitHub Code \u2190\n\nNext, for every type of operation we need (output cleaned or embedded data ),\nwe have to subclass the _StatelessSinkPartition_ Bytewax class (they also\nprovide a stateful option \u2192 more in their docs)\n\nAn instance of the class will run on every partition defined within the\nBytewax deployment.\n\nIn the course, we are using a single partition per worker. But, by adding more\npartitions (and workers), you can quickly scale your Bytewax pipeline\nhorizontally.\n\n**Remember** **why** we upload the **data** to Qdrant in **two stages** , as\nthe **Qdrant vector DB** will act as our **feature store** :\n\n 1. The _cleaned data_ will be used for _LLM fine-tuning_(used by the training pipeline)\n\n 2. The _chunked & embedded_ data will be used for _RAG (used by the inference pipeline)_\n\nQdrant worker partitions \u2192 GitHub Code \u2190\n\nNote that we used**Qdrant\u2019s** **Batch** method to upload all the available\npoints simultaneously. 
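As a minimal, hypothetical sketch of what such a batched upload looks like with the Qdrant client (the collection name, payload fields, and helper signature are assumptions; the actual sink implementation is in the GitHub code linked above):

    from qdrant_client import QdrantClient
    from qdrant_client.models import Batch

    def upsert_embedded_chunks(client: QdrantClient, collection_name: str, chunks: list[dict]) -> None:
        # Upload all points in a single request instead of one call per chunk.
        client.upsert(
            collection_name=collection_name,
            points=Batch(
                ids=[chunk["id"] for chunk in chunks],
                vectors=[chunk["embedding"] for chunk in chunks],
                payloads=[{"text": chunk["text"], "author_id": chunk["author_id"]} for chunk in chunks],
            ),
        )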
By doing so, we **reduce** the **latency** on the\n**network I/O** side: more on that here\n\n### 6\\. The dispatcher layer\n\nNow that we have the Bytewax flow and all our data models.\n\n**How do we map a raw data model to a cleaned data model?**\n\n> All our domain logic is modeled by a set of _Handler()_ classes:\n>\n> \u2192 CleaningDataHandler\n>\n> \u2192 ChunkingDataHandler\n>\n> \u2192 EmbeddingDataHandler\n\n**Now, to build our dispatcher, we need 2 last components:**\n\n * **a factory class:** instantiates the right handler based on the type of the event\n\n * **a dispatcher class:** the glue code that calls the factory class and handler\n\n**Here is what the cleaning dispatcher and factory look like** \u2193\n\nThe dispatcher and factory classes \u2192 GitHub Code \u2190\n\nNote that we will have a different **Handler()** for every (data_type, state)\npair \u2014 resulting in 3 x 3 = 9 different handlers.\n\nFor Example, we will have 3 handlers based on their data type for the cleaned\npost state: PostCleaningHandler, ArticleCleaningHandler, and\nRepositoryCleaningHandler.\n\n**By repeating the same logic, we will end up with the following set of\ndispatchers:**\n\n * _RawDispatcher_ (no factory class required as the data is not processed)\n\n * _CleaningDispatcher_ (with a _ChunkingHandlerFactory_ class)\n\n * _ChunkingDispatcher_ (with a _ChunkingHandlerFactory_ class)\n\n * _EmbeddingDispatcher_ (with an _EmbeddingHandlerFactory_ class)\n\n* * *\n\n### To Summarize\n\nIn **Lesson 4** of the LLM Twin course, we learned how to:\n\n * Design a streaming pipeline in Python using Bytewax\n\n * Load data to a Qdrant vector DB\n\n * Use Pydantic models to add types and validation to the data points\n\n * Implement a dispatcher layer to process multiple data types in a modular way\n\n _\u2192 In**Lesson 5, which will be held in two weeks,** we will focus on the\nvector DB retrieval client and advanced retrieval techniques._\n\n* * *\n\n### Next Steps\n\nTo **dig** **into** the **details** of the **streaming pipeline** and **how**\nto:\n\n * **implement** **cleaning** , **chunking** , and **embedding** **strategies** for digital data\n\n * **design** the **AWS infrastructure** for the streaming pipeline\n\n * understand how to **run the component**\n\n**Check out** the **full-fledged version** of the **article** on our **Medium\npublication**.\n\n\u2193\u2193\u2193\n\nLesson 4 - FREE Medium Article\n\n* * *\n\n#### Images\n\nIf not otherwise stated, all images are created by the author.\n\n11\n\nShare this post\n\n#### SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG - in Real-\nTime!\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/sota-python-streaming-pipelines-for?r=1ttoeh" }, { "id": "53bc94d1-8cfd-4e65-b55c-9b3582f6ed64", "content": { "Title": "Ready for production ML? 
Here are the 4 pillars to build production ML systems", "Subtitle": "ML Platforms & MLOps Components. RAG:RAG: What problems does it solve, and how is it integrated into LLM-powered applications", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Ready for production ML? Here are the 4 pillars to build production ML\nsystems\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Ready for production ML? Here are the 4 pillars to build production ML\nsystems\n\n### ML Platforms & MLOps Components. RAG:RAG: What problems does it solve, and\nhow is it integrated into LLM-powered applications\n\nPaul Iusztin\n\nApr 13, 2024\n\n8\n\nShare this post\n\n#### Ready for production ML? Here are the 4 pillars to build production ML\nsystems\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\n _Decoding ML Notes_\n\n### **This week\u2019s topics:**\n\n * Using an ML Platform is critical to integrating MLOps into your project\n\n * The 4 pillars to build production ML systems\n\n * RAG: What problems does it solve, and how is it integrated into LLM-powered applications?\n\n* * *\n\n### Using an ML Platform is critical to integrating MLOps into your project\n\nHere are 6 ML platform features you must know & use \u2193 \n \n...and let's use Comet ML as a concrete example. \n \n#\ud835\udfed. \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\uddf8\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nIn your ML development phase, you generate lots of experiments. \n \nTracking and comparing the metrics between them is crucial in finding the\noptimal model & hyperparameters. \n \n#\ud835\udfee. \ud835\udde0\ud835\uddf2\ud835\ude01\ud835\uddee\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\udde6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf2 \n \nIts primary purpose is reproducibility. \n \nTo know how a model from a specific experiment was generated, you must know: \n\\- the version of the code \n\\- version of the dataset \n\\- hyperparameters/config \n\\- total compute \n... and more \n \n#\ud835\udfef. \ud835\udde9\ud835\uddf6\ud835\ude00\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude00\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 \n \nMost of the time, along with the scalar metrics, you must log visual results,\nsuch as: \n\\- images \n\\- videos \n\\- prompts \n\\- t-SNE graphs \n\\- 3D point clouds \n... and more \n \n#4. \ud835\udc00\ud835\udc2b\ud835\udc2d\ud835\udc22\ud835\udc1f\ud835\udc1a\ud835\udc1c\ud835\udc2d\ud835\udc2c \n \nThe most powerful feature out of them all. \n \nAn artifact is a versioned object that acts as an input or output for your\njob. \n \nEverything can be an artifact (data, model, code), but the most common case is\nfor your data. \n \nWrapping your assets around an artifact ensures reproducibility and\nshareability. \n \nFor example, you wrap your features into an artifact (e.g., features:3.1.2),\nwhich you can consume and share across multiple ML environments (development\nor continuous training). \n \nUsing an artifact to wrap your data allows you to quickly respond to questions\nsuch as \"What data have I used to generate the model?\" and \"What Version?\" \n \n#5. 
\ud835\udc0c\ud835\udc28\ud835\udc1d\ud835\udc1e\ud835\udc25 \ud835\udc11\ud835\udc1e\ud835\udc20\ud835\udc22\ud835\udc2c\ud835\udc2d\ud835\udc2b\ud835\udc32 \n \nThe model registry is the ultimate way to version your models and make them\naccessible to all your services. \n \nFor example, your continuous training pipeline will log the weights as an\nartifact into the model registry after it trains the model. \n \nYou label this model as \"v:1.1.5:staging\" and prepare it for testing. If the\ntests pass, mark it as \"v:1.1.0:production\" and trigger the CI/CD pipeline to\ndeploy it to production. \n \n#6. \ud835\udc16\ud835\udc1e\ud835\udc1b\ud835\udc21\ud835\udc28\ud835\udc28\ud835\udc24\ud835\udc2c \n \nWebhooks lets you integrate the Comet model registry with your CI/CD pipeline. \n \nFor example, when the model status changes from \"Staging\" to \"Production,\" a\nPOST request triggers a GitHub Actions workflow to deploy your new model.\n\nImage by the Author\n\n\u21b3\ud83d\udd17 Check out **Comet** to learn more\n\n* * *\n\n### The 4 pillars to build production ML systems\n\nBefore building a production-ready system, it is critical to consider a set of\nquestions that will later determine the nature of your ML system architecture. \n \n\ud835\ude0f\ud835\ude26\ud835\ude33\ud835\ude26 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude26 4 \ud835\ude31\ud835\ude2a\ud835\ude2d\ud835\ude2d\ud835\ude22\ud835\ude33\ud835\ude34 \ud835\ude35\ud835\ude29\ud835\ude22\ud835\ude35 \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude22\ud835\ude2d\ud835\ude38\ud835\ude22\ud835\ude3a\ud835\ude34 \ud835\ude29\ud835\ude22\ud835\ude37\ud835\ude26 \ud835\ude35\ud835\ude30 \ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude34\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude33 \ud835\ude23\ud835\ude26\ud835\ude27\ud835\ude30\ud835\ude33\ud835\ude26 \ud835\ude25\ud835\ude26\ud835\ude34\ud835\ude2a\ud835\ude28\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude22\ud835\ude2f\ud835\ude3a\n\ud835\ude34\ud835\ude3a\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude2e \u2193 \n \n\u2794 \ud835\uddd7\ud835\uddee\ud835\ude01\ud835\uddee \n \n\\- What data types do you have? (e.g., tabular data, images, text, etc.) \n\\- What does the data look like? (e.g., for text data, is it in a single\nlanguage or multiple?) \n\\- How do you collect the data? \n\\- At what frequency do you have to collect the data? \n\\- How do you collect labels for the data? (crucial for how you plan to\nevaluate and monitor the model in production) \n \n\u2794 \ud835\udde7\ud835\uddf5\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddf4\ud835\uddf5\ud835\uddfd\ud835\ude02\ud835\ude01 \n \n\\- What are the throughput requirements? You must know at least the\nthroughput's minimum, average, and maximum statistics. \n\\- How many requests the system must handle simultaneously? (1, 10, 1k, 1\nmillion, etc.) \n \n\u2794 \ud835\udddf\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\ude06 \n \n\\- What are the latency requirements? (1 millisecond, 10 milliseconds, 1\nsecond, etc.) \n\\- Throughput vs. latency trade-off \n\\- Accuracy vs. speed trade-off \n \n\u2794 \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddff\ud835\uddee\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \n \n\\- Batch vs. real-time architecture (closely related to the throughput vs.\nlatency trade-off) \n\\- How should the system scale? 
(e.g., based on CPU workload, # of requests,\nqueue size, data size, etc.) \n\\- Cost requirements \n \n. \n \nDo you see how we shifted the focus from model performance towards how it is\nintegrated into a more extensive system? \n \nWhen building production-ready ML, the model's accuracy is no longer the holy\ngrail but a bullet point in a grander scheme. \n \n. \n \n\ud835\udde7\ud835\uddfc \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\ude07\ud835\uddf2, the 4 pillars to keep in mind before designing an ML\narchitecture are: \n\\- Data \n\\- Throughput \n\\- Latency \n\\- Infrastructure\n\nImage by the Author\n\n* * *\n\n### RAG: What problems does it solve, and how is it integrated into LLM-\npowered applications?\n\nLet's find out \u2193 \n \nRAG is a popular strategy when building LLMs to add external data to your\nprompt. \n \n=== \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddef\ud835\uddf9\ud835\uddf2\ud835\uddfa === \n \nWorking with LLMs has 3 main issues: \n \n1\\. The world moves fast \n \nLLMs learn an internal knowledge base. However, the issue is that its\nknowledge is limited to its training dataset. \n \nThe world moves fast. New data flows on the internet every second. Thus, the\nmodel's knowledge base can quickly become obsolete. \n \nOne solution is to fine-tune the model every minute or day... \n \nIf you have some billions to spend around, go for it. \n \n2\\. Hallucinations \n \nAn LLM is full of testosterone and likes to be blindly confident. \n \nEven if the answer looks 100% legit, you can never fully trust it. \n \n3\\. Lack of reference links \n \nIt is hard to trust the response of the LLM if we can't see the source of its\ndecisions. \n \nEspecially for important decisions (e.g., health, financials) \n \n=== \ud835\udde6\ud835\uddfc\ud835\uddf9\ud835\ude02\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb === \n \n\u2192 Surprize! It is RAG. \n \n1\\. Avoid fine-tuning \n \nUsing RAG, you use the LLM as a reasoning engine and the external knowledge\nbase as the main memory (e.g., vector DB). \n \nThe memory is volatile, so you can quickly introduce or remove data. \n \n2\\. Avoid hallucinations \n \nBy forcing the LLM to answer solely based on the given context, the LLM will\nprovide an answer as follows: \n \n\\- use the external data to respond to the user's question if it contains the\nnecessary insights \n\\- \"I don't know\" if not \n \n3\\. Add reference links \n \nUsing RAG, you can easily track the source of the data and highlight it to the\nuser. \n \n=== \ud835\udddb\ud835\uddfc\ud835\ude04 \ud835\uddf1\ud835\uddfc\ud835\uddf2\ud835\ude00 \ud835\udde5\ud835\uddd4\ud835\uddda \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8? === \n \nLet's say we want to use RAG to build a financial assistant. \n \n\ud835\ude1e\ud835\ude29\ud835\ude22\ud835\ude35 \ud835\ude25\ud835\ude30 \ud835\ude38\ud835\ude26 \ud835\ude2f\ud835\ude26\ud835\ude26\ud835\ude25? \n \n\\- a data source with historical and real-time financial news (e.g. Alpaca) \n\\- a stream processing engine (eg. Bytewax) \n\\- an encoder-only model for embedding the docs (e.g., pick one from\n`sentence-transformers`) \n\\- a vector DB (e.g., Qdrant) \n \n\ud835\ude0f\ud835\ude30\ud835\ude38 \ud835\ude25\ud835\ude30\ud835\ude26\ud835\ude34 \ud835\ude2a\ud835\ude35 \ud835\ude38\ud835\ude30\ud835\ude33\ud835\ude2c? \n \n\u21b3 On the feature pipeline side: \n \n1\\. using Bytewax, you ingest the financial news and clean them \n2\\. 
you chunk the news documents and embed them \n3\\. you insert the embedding of the docs along with their metadata (e.g., the\ninitial text, source_url, etc.) to Qdrant \n \n\u21b3 On the inference pipeline side: \n \n4\\. the user question is embedded (using the same embedding model) \n5\\. using this embedding, you extract the top K most similar news documents\nfrom Qdrant \n6\\. along with the user question, you inject the necessary metadata from the\nextracted top K documents into the prompt template (e.g., the text of\ndocuments & its source_url) \n7\\. you pass the whole prompt to the LLM for the final answer\n\nImage by the Author\n\n8\n\nShare this post\n\n#### Ready for production ML? Here are the 4 pillars to build production ML\nsystems\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| Dr. Jody-Ann S. JonesThe Data Sensei Apr 13Liked by Paul IusztinExcellent\narticle Paul! Thank you so much for sharing \ud83d\ude4fExpand full commentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n1 more comment...\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/ready-for-production-ml-here-are?r=1ttoeh" }, { "id": "20a85606-a880-4894-bfb7-6b0cad8b3f1f", "content": { "Title": "My monthly recommendations for leveling up in ML", "Subtitle": "In Vector DBs, RAG, MLOps, and LLMs", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### My monthly recommendations for leveling up in ML\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# My monthly recommendations for leveling up in ML\n\n### In Vector DBs, RAG, MLOps, and LLMs\n\nPaul Iusztin\n\nApr 06, 2024\n\n12\n\nShare this post\n\n#### My monthly recommendations for leveling up in ML\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\n**Today is about learning.**\n\nHere is a list of learning resources I used and filtered in the past months.\n\nIt is one of the most helpful content on Vector DBs, RAG, MLOps and LLMs out\nthere.\n\n* * *\n\n### **This week\u2019s topics:**\n\n * Pick the right vector DB for your exact use case\n\n * 4 video lectures on hands-on LLMs\n\n * 7 steps you have to achieve 100% MLOps maturity\n\n * Advanced RAG\n\n* * *\n\n### Pick the right vector DB for your exact use case\n\nThis is the \ud835\uddfc\ud835\uddfb\ud835\uddf9\ud835\ude06 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1 to \ud835\uddfd\ud835\uddf6\ud835\uddf0\ud835\uddf8 the \ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01 \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 for your exact\n\ud835\ude02\ud835\ude00\ud835\uddf2 \ud835\uddf0\ud835\uddee\ud835\ude00\ud835\uddf2. 
\n \nSince ChatGPT made AI cool, besides the millions of ChatGPT posts you got\ntired of and blocked, you realized that a new type of tool started to hit the\nscene: Vector DBs. \n \nAs vector DBs play a crucial role in most LLM applications, they popped out\neverywhere. \n \nOn this day, there are 37 vector DB solutions that are constantly changing and\nadding features. \n \n\ud835\ude15\ud835\ude30\ud835\ude38, \ud835\ude29\ud835\ude30\ud835\ude38 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude29**\ud835\ude2d \ud835\ude34\ud835\ude29\ud835\ude30\ud835\ude36\ud835\ude2d\ud835\ude25 \ud835\ude10 \ud835\ude31\ud835\ude2a\ud835\ude24\ud835\ude2c \ud835\ude30\ud835\ude2f\ud835\ude26?\n\nSS from Superlinked\n\n\ud835\ude43\ud835\ude5a\ud835\ude67\ud835\ude5a \ud835\ude5e\ud835\ude68 \ud835\ude6c\ud835\ude5d\ud835\ude5a\ud835\ude67\ud835\ude5a \ud835\ude69\ud835\ude5d\ud835\ude5a \"\ud835\ude51\ud835\ude5a\ud835\ude58\ud835\ude69\ud835\ude64\ud835\ude67 \ud835\ude3f\ud835\ude3d \ud835\ude3e\ud835\ude64\ud835\ude62\ud835\ude65\ud835\ude56\ud835\ude67\ud835\ude5e\ud835\ude68\ud835\ude64\ud835\ude63\" \ud835\ude60\ud835\ude5e\ud835\ude58\ud835\ude60\ud835\ude68 \ud835\ude5e\ud835\ude63. \n \nIt is an effort managed by Superlinked, where they carefully compared all\nthese 37 vector DBs across 29 features, such as: \n \n\\- License \n\\- GitHub \u2b50 \n\\- support for text, image or struct models \n\\- RAG, RecSys, LangChain or LllamaIndex APIs \n\\- pricing \n\\- sharding \n\\- document size \n\\- vector dims \n \n...and more! \n \nI won't list all 29 features. \n \nYou have to check it out to see them for yourself \u2193\n\nVector DB Comparison\n\n\ud835\udde1\ud835\uddfc\ud835\ude01\ud835\uddf2: To keep the table updated or add more features, you can contribute to it\nyourself.\n\n* * *\n\n### 4 video lectures on hands-on LLMs\n\nWant to build your first \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf7\ud835\uddf2\ud835\uddf0\ud835\ude01 but don't know where to start? \n \nHere are \ud835\udff0 \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8 \ud835\uddf9\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\ude00, made by\n\nPau Labarta Bajo\n\nfrom\n\nReal-World Machine Learning\n\n, to put you on the right track \u2193 \n \n#1. \ud835\udc05\ud835\udc22\ud835\udc27\ud835\udc1e-\ud835\udc2d\ud835\udc2e\ud835\udc27\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e \ud835\udc1f\ud835\udc28\ud835\udc2b \ud835\udc28\ud835\udc29\ud835\udc1e\ud835\udc27-\ud835\udc2c\ud835\udc28\ud835\udc2e\ud835\udc2b\ud835\udc1c\ud835\udc1e \ud835\udc0b\ud835\udc0b\ud835\udc0c\ud835\udc2c \n \nYou will learn: \n\\- What is model fine-tuning? \n\\- Why is it useful? \n\\- When to use it? \n\\- Why to fine-tune an LLM using QLoRA \n\\- How to architect a fine-tuning pipeline in a real-world project\n\n#2. \ud835\udc07\ud835\udc1a\ud835\udc27\ud835\udc1d\ud835\udc2c-\ud835\udc28\ud835\udc27 \ud835\udc1f\ud835\udc22\ud835\udc27\ud835\udc1e-\ud835\udc2d\ud835\udc2e\ud835\udc27\ud835\udc22\ud835\udc27\ud835\udc20 \n \nLet's apply what we learned in lesson 1 to build our first fine-tuning\npipeline.\n\n#3. 
\ud835\udc01\ud835\udc2e\ud835\udc22\ud835\udc25\ud835\udc1d & \ud835\udc1d\ud835\udc1e\ud835\udc29\ud835\udc25\ud835\udc28\ud835\udc32 \ud835\udc1a \ud835\udc2b\ud835\udc1e\ud835\udc1a\ud835\udc25-\ud835\udc2d\ud835\udc22\ud835\udc26\ud835\udc1e \ud835\udc2c\ud835\udc2d\ud835\udc2b\ud835\udc1e\ud835\udc1a\ud835\udc26\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e \n \nYou will learn: \n\\- How to transform HTML docs into vector embeddings. \n\\- How to process data in real-time \n\\- How to store & retrieve embeddings from a vector DB \n\\- How to deploy it to AWS.\n\n#4. \ud835\udc08\ud835\udc27\ud835\udc1f\ud835\udc1e\ud835\udc2b\ud835\udc1e\ud835\udc27\ud835\udc1c\ud835\udc1e \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e \n \nFinally, you will learn how to use LangChain to glue together your fine-tuned\nLLM and your financial news stored as embeddings in a vector DB to serve\npredictions behind a RESTful API.\n\n* * *\n\n### 7 steps you have to achieve 100% MLOps maturity\n\nOne of the most \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\uddf3\ud835\ude02\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf1\ud835\ude00 in the \ud835\udde0\ud835\udddf \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf9\ud835\uddf1 is \"\ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00\", a new &\ninterdisciplinary process that isn't fully defined yet. \n \nThe good news is that there is a strong movement in \ud835\uddf1\ud835\uddf2\ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 a \ud835\uddf0\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff \ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\nin \ud835\ude00\ud835\uddf0\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 the \ud835\uddf9\ud835\uddf2\ud835\ude03\ud835\uddf2\ud835\uddf9 of \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddfa\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\ude06 within your \ud835\uddfc\ud835\uddff\ud835\uddf4\ud835\uddee\ud835\uddfb\ud835\uddf6\ud835\ude07\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb or \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf7\ud835\uddf2\ud835\uddf0\ud835\ude01. \n \n\u21b3 Here are \ud835\udff3 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd\ud835\ude00 you have to \ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddf0\ud835\uddf8 to \ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddf2 \ud835\udfed\ud835\udfec\ud835\udfec% \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddfa\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\ude06 \u2193 \n \nNo one other than\n\nMaria Vechtomova\n\nfrom\n\nMarvelousMLOps\n\nhas proposed it. \n \n\ud835\udddb\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2\ud835\ude06 \ud835\uddee\ud835\uddff\ud835\uddf2 \u2193 \n \n=== \ud835\ude14\ud835\ude36\ud835\ude34\ud835\ude35 \ud835\ude29\ud835\ude22\ud835\ude37\ud835\ude26\ud835\ude34 === \n \n\ud835\udfed\\. \ud835\uddd7\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb: project, ML model, and technical documentation \n \n\ud835\udfee\\. 
\ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\uddf2\ud835\uddee\ud835\uddef\ud835\uddf6\ud835\uddf9\ud835\uddf6\ud835\ude01\ud835\ude06 \ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\uddff\ud835\uddf2\ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\uddf6\ud835\uddef\ud835\uddf6\ud835\uddf9\ud835\uddf6\ud835\ude01\ud835\ude06: Infrastructure traceability and\nreproducibility (versioned IaC under CI/CD) and ML code traceability and\nreproducibility (versioned code, data, and models along with metadata &\nlineage attached to the data & model) \n \n\ud835\udfef\\. \ud835\uddd6\ud835\uddfc\ud835\uddf1\ud835\uddf2 \ud835\uddfe\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude01\ud835\ude06: infrastructure code & ML model code quality requirements\n(tests ran on PRs under the CI pipeline, PR reviews, formatting checks) \n \n\ud835\udff0\\. \ud835\udde0\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 & \ud835\ude00\ud835\ude02\ud835\uddfd\ud835\uddfd\ud835\uddfc\ud835\uddff\ud835\ude01: infrastructure, application, model performance,\nbusiness KPIs, data drift and outliers monitoring \n \n=== \ud835\ude09\ud835\ude26\ud835\ude3a\ud835\ude30\ud835\ude2f\ud835\ude25 \ud835\ude23\ud835\ude22\ud835\ude34\ud835\ude2a\ud835\ude24 \ud835\ude14\ud835\ude13\ud835\ude16\ud835\ude31\ud835\ude34 === \n \n\ud835\udff1\\. \ud835\uddd7\ud835\uddee\ud835\ude01\ud835\uddee \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddfb\ud835\ude00\ud835\uddf3\ud835\uddfc\ud835\uddff\ud835\uddfa\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\ude00 & \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\ude00\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf2: all the features are shared\n& versioned from a central feature store \n \n\ud835\udff2\\. \ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddf9\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddef\ud835\uddf6\ud835\uddf9\ud835\uddf6\ud835\ude01\ud835\ude06: a human can understand the reasoning of the model\nand not treat it as a black box \n \n\ud835\udff3\\. \ud835\uddd4/\ud835\uddd5 \ud835\ude01\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 & \ud835\uddf3\ud835\uddf2\ud835\uddf2\ud835\uddf1\ud835\uddef\ud835\uddee\ud835\uddf0\ud835\uddf8 \ud835\uddf9\ud835\uddfc\ud835\uddfc\ud835\uddfd: inputs & outputs of the model are stored\nautomatically and A/B testing is performed regularly \n \n. \n \n\u21b3 Check out the entire questionnaire on the\n\nMarvelousMLOps\n\nblog: \ud83d\udd17 MLOps maturity assessment\n\n**MLOps Maturity Assessment by Marvelous MLOps**\n\nWhat level of MLOps maturity is your organization at? For now, you will rarely\nsee 100%.\n\n* * *\n\n### Advanced RAG\n\nRAG systems are far from perfect \u2192 This free course teaches you how to improve\nyour RAG system. 
\n \nI recently finished the \ud835\uddd4\ud835\uddf1\ud835\ude03\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf2\ud835\uddf1 \ud835\udde5\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude03\ud835\uddee\ud835\uddf9 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddd4\ud835\udddc \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\uddd6\ud835\uddf5\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddee free course from\nDeepLearning.AI\n\nSS from the Advanced Retrieval for AI with Chroma course\n\nIf you are into RAG, I find it among the most valuable learning sources. \n \nThe course already assumes you know what RAG is. \n \nIts primary focus is to show you all the current issues of RAG and why it is\nfar from perfect. \n \nAfterward, it shows you the latest SoTA techniques to improve your RAG system,\nsuch as: \n\\- query expansion \n\\- cross-encoder re-ranking \n\\- embedding adaptors \n \nI am not affiliated with DeepLearning.AI (I wouldn't mind though). \n \nThis is a great course you should take if you are into RAG systems. \n \nThe good news is that it is free and takes only 1 hour. \n \nCheck it out \u2193\n\nAdvanced Retrieval for AI with Chroma\n\n12\n\nShare this post\n\n#### My monthly recommendations for leveling up in ML\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/my-ml-monthly-learning-resource-recommendations?r=1ttoeh" }, { "id": "ab66f3dc-2957-4ab9-9ed7-ece653d3f725", "content": { "Title": "End-to-End Framework for Production-Ready LLMs", "Subtitle": "FREE course on designing, training, deploying, and monitoring a production-ready LLM system powered by LLMs, vector DBs & LLMOps by building your LLM twin.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### An End-to-End Framework for Production-Ready LLM Systems by Building Your\nLLM Twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# An End-to-End Framework for Production-Ready LLM Systems by Building Your\nLLM Twin\n\n### From data gathering to productionizing LLMs using LLMOps good practices.\n\nPaul Iusztin\n\nMar 28, 2024\n\n35\n\nShare this post\n\n#### An End-to-End Framework for Production-Ready LLM Systems by Building Your\nLLM Twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _\u2192 the 1st out of 11 lessons of**the LLM Twin** free course_\n\n**What is your LLM Twin?** It is an AI character that writes like yourself by\nincorporating your style, personality and voice into an LLM.\n\nImage by DALL-E\n\n### **Why is this course different?**\n\n_By finishing the \u201c**LLM Twin: Building Your Production-Ready AI\nReplica\u201d**_****_free course, you will learn how to design, train, and deploy a\nproduction-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps\ngood practices_.\n\n_**Why should you care? 
\ud83e\udef5**_\n\n _**\u2192 No more isolated scripts or Notebooks!** Learn production ML by building\nand deploying an end-to-end production-grade LLM system._\n\n> More **details** on what you will **learn** within the **LLM Twin**\n> **course** , **here** \ud83d\udc48\n\nAre you ready to build your AI replica? \ud83e\udee2\n\n**Let\u2019s start** with **Lesson 1** \u2193\u2193\u2193\n\n* * *\n\n### **Lesson 1: End-to-end framework for production-ready LLM systems**\n\nIn the **first lesson** , we will present**** the **project** you will\n**build** **during** **the** **course** : _your production-ready LLM Twin/AI\nreplica._\n\n**Afterward** , we will **dig into** the **LLM project system design**.\n\nWe will **present** all our **architectural decisions** regarding the design\nof the _data collection pipeline_ for social media data and how we applied\n_the 3-pipeline architecture_ to our LLM microservices.\n\nIn the **following lessons** , we will **examine** **each component\u2019s code**\nand learn **how** to **implement** and **deploy** **it** to AWS and Qwak.\n\nLLM twin system architecture [Image by the Author] \u2192 What you will learn to\nbuild during this course.\n\n### **Table of Contents**\n\n 1. What are you going to build? The LLM twin concept\n\n 2. LLM twin system design\n\n* * *\n\n### **1\\. What are you going to build? The LLM twin concept**\n\nThe **outcome** of this **course** is to learn to **build** your **own AI\nreplica**. We will use an LLM to do that, hence the name of the course: _**LLM\nTwin: Building Your Production-Ready AI Replica.**_\n\n**But what is an LLM twin?**\n\nShortly, your LLM twin will be an AI character who writes like you, using your\nwriting style and personality.\n\nIt will not be you. It will be your writing copycat.\n\nMore concretely, you will build an AI replica that writes social media posts\nor technical articles (like this one) using your own voice.\n\n**Why not directly use ChatGPT? You may ask\u2026**\n\nWhen trying to generate an article or post using an LLM, the results tend to\nbe:\n\n * very generic and unarticulated,\n\n * contain misinformation (due to hallucination),\n\n * require tedious prompting to achieve the desired result.\n\n_**But here is what we are going to do to fix that** _\u2193\u2193\u2193\n\n**First** , we will fine-tune an LLM on your digital data gathered from\nLinkedIn, Medium, Substack and GitHub.\n\nBy doing so, the LLM will align with your writing style and online\npersonality. It will teach the LLM to talk like the online version of\nyourself.\n\nOur use case will focus on an LLM twin who writes social media posts or\narticles that reflect and articulate your voice.\n\n**Secondly** , we will give the LLM access to a vector DB to access external\ninformation to avoid hallucinating.\n\n**Ultimately** , in addition to accessing the vector DB for information, you\ncan provide external links that will act as the building block of the\ngeneration process.\n\nExcited? Let\u2019s get started \ud83d\udd25\n\n* * *\n\n### **2\\. LLM Twin System design**\n\nLet\u2019s understand how to **apply the 3-pipeline architecture** to **our LLM\nsystem**.\n\nThe **architecture** of the **LLM twin** is split into **4 Python\nmicroservices** :\n\n 1. The data collection pipeline\n\n 2. The feature pipeline\n\n 3. The training pipeline\n\n 4. 
The inference pipeline\n\nLLM twin system architecture [Image by the Author]\n\n_Now,**let\u2019s zoom in** on **each component** to understand how they work\nindividually and interact with each other. \u2193\u2193\u2193_\n\n### **2.1. The data collection pipeline**\n\nIts scope is to **crawl data** for **a given user** from:\n\n * Medium (articles)\n\n * Substack (articles)\n\n * LinkedIn (posts)\n\n * GitHub (code)\n\nAs every platform is unique, we implemented a different Extract Transform Load\n(ETL) pipeline for each website.\n\nHowever, the **baseline steps** are the **same** for **each platform**.\n\n_Thus, for each ETL pipeline, we can abstract away the following baseline\nsteps:_\n\n * log in using your credentials\n\n * use _selenium_ to crawl your profile\n\n * use _BeatifulSoup_ to parse the HTML\n\n * clean & normalize the extracted HTML\n\n * save the normalized (but still raw) data to Mongo DB\n\n> **Important note:** We are crawling only our data, as most platforms do not\n> allow us to access other people\u2019s data due to privacy issues. But this is\n> perfect for us, as to build our LLM twin, we need only our own digital data.\n\n**Why Mongo DB?**\n\nWe wanted a NoSQL database that quickly allows us to store unstructured data\n(aka text).\n\n**How will the data pipeline communicate with the feature pipeline?**\n\nWe will use the **Change Data Capture (CDC) pattern** to inform the feature\npipeline of any change on our Mongo DB.\n\nTo **explain** the **CDC** briefly, a watcher listens 24/7 for any CRUD\noperation that happens to the Mongo DB.\n\nThe watcher will issue an event informing us what has been modified. We will\nadd that event to a RabbitMQ queue.\n\nThe feature pipeline will constantly listen to the queue, process the\nmessages, and add them to the Qdrant vector DB.\n\nFor example, when we write a new document to the Mongo DB, the watcher creates\na new event. The event is added to the RabbitMQ queue; ultimately, the feature\npipeline consumes and processes it.\n\n**Where will the data pipeline be deployed?**\n\nThe data collection pipeline and RabbitMQ service will be deployed to AWS. We\nwill also use the freemium serverless version of Mongo DB.\n\n### **2.2. The feature pipeline**\n\nThe feature pipeline is **implemented usingBytewax** (a Rust streaming engine\nwith a Python interface). 
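To make the CDC mechanism from the previous section more concrete, here is a minimal sketch of such a watcher, written with pymongo change streams and pika. It is only an illustration, not the course's actual code; the database, collection, and queue names are assumptions.

```python
import json

import pika
from pymongo import MongoClient

# Connect to the source database (MongoDB change streams require a replica set).
mongo = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = mongo["llm_twin"]["raw_documents"]

# Connect to the queue that feeds the streaming pipeline.
rabbit = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = rabbit.channel()
channel.queue_declare(queue="cdc_events", durable=True)

# Listen 24/7 for CRUD operations and forward each one as an event.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        event = {
            "operation": change["operationType"],    # e.g., insert, update, delete
            "document": change.get("fullDocument"),  # the raw crawled document, if any
        }
        channel.basic_publish(
            exchange="",
            routing_key="cdc_events",
            body=json.dumps(event, default=str),
        )
```

The feature pipeline sits on the consuming end of this queue.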
Thus, in **our** specific **use case** , we will\nalso **refer to it** as a **streaming ingestion pipeline**.\n\nIt is an **entirely different service** than the data collection pipeline.\n\n**How does it communicate with the data pipeline?**\n\nAs explained above, the **feature pipeline communicates** with the **data**\n**pipeline** through a RabbitMQ **queue**.\n\nCurrently, the streaming pipeline doesn\u2019t care how the data is generated or\nwhere it comes from.\n\nIt knows it has to listen to a given queue, consume messages from there and\nprocess them.\n\nBy doing so, we **decouple** **the two components** entirely.\n\n**What is the scope of the feature pipeline?**\n\nIt represents the **ingestion component** of the **RAG system**.\n\nIt will **take** the **raw data** passed through the queue and:\n\n * clean the data;\n\n * chunk it;\n\n * embed it using the embedding models from Superlinked;\n\n * load it to the Qdrant vector DB.\n\n**What data will be stored?**\n\nThe **training pipeline** will have **access** **only** to the **feature\nstore** , which, in our case, is represented by the Qdrant vector DB.\n\n_With this in mind, we will**store** in Qdrant **2 snapshots of our data:**_\n\n1\\. The **cleaned data** (without using vectors as indexes \u2014 store them in a\nNoSQL fashion).\n\n2\\. The **cleaned, chunked, and embedded data** (leveraging the vector indexes\nof Qdrant)\n\nThe **training pipeline** needs **access** to the **data** in**both formats**\nas we want to fine-tune the LLM on standard and augmented prompts.\n\n**Why implement a streaming pipeline instead of a batch pipeline?**\n\nThere are **2 main reasons.**\n\nThe first one is that, coupled with the **CDC pattern** , it is the most\n**efficient** way to **sync two DBs** between each other.\n\nUsing CDC + a streaming pipeline, you process only the changes to the source\nDB without any overhead.\n\nThe second reason is that by doing so, your **source** and **vector DB** will\n**always be in sync**. Thus, you will always have access to the latest data\nwhen doing RAG.\n\n**Why Bytewax?**\n\n**Bytewax** is a streaming engine built in Rust that exposes a Python\ninterface. We use Bytewax because it combines Rust\u2019s impressive speed and\nreliability with the ease of use and ecosystem of Python. It is incredibly\nlight, powerful, and easy for a Python developer.\n\n**Where will the feature pipeline be deployed?**\n\nThe feature pipeline will be deployed to AWS. We will also use the freemium\nserverless version of Qdrant.\n\n### **2.3. The training pipeline**\n\n**How do we have access to the training features?**\n\nAs section 2.2 highlights, all the **training data** will be **accessed** from\nthe **feature store**. 
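Before looking at what exactly is stored, here is a minimal sketch of how one message could be cleaned, chunked, embedded, and written as the two snapshots described above. The collection names, the naive chunking, and the sentence-transformers model (a stand-in for the Superlinked embedding models mentioned earlier) are all assumptions for illustration.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim stand-in model

# Snapshot 1: cleaned data, stored NoSQL-style (dummy 1-dim vector, payload only).
client.recreate_collection("cleaned_posts", vectors_config=VectorParams(size=1, distance=Distance.COSINE))
# Snapshot 2: cleaned, chunked, and embedded data used for RAG.
client.recreate_collection("embedded_chunks", vectors_config=VectorParams(size=384, distance=Distance.COSINE))


def handle_message(doc_id: int, raw_text: str) -> None:
    cleaned = " ".join(raw_text.split())  # placeholder cleaning step
    chunks = [cleaned[i : i + 500] for i in range(0, len(cleaned), 500)]  # naive fixed-size chunking
    vectors = embedder.encode(chunks)

    client.upsert(
        "cleaned_posts",
        points=[PointStruct(id=doc_id, vector=[0.0], payload={"text": cleaned})],
    )
    client.upsert(
        "embedded_chunks",
        points=[
            PointStruct(id=doc_id * 1_000 + i, vector=vec.tolist(), payload={"doc_id": doc_id, "text": chunk})
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ],
    )
```

In the course's setup, this per-message logic would live inside the Bytewax streaming dataflow rather than a plain function.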
In our case, the feature store is the **Qdrant vector\nDB** that contains:\n\n * the cleaned digital data from which we will create prompts & answers;\n\n * we will use the chunked & embedded data for RAG to augment the cleaned data.\n\n_We will implement a different vector DB retrieval client for each of our main\ntypes of data (posts, articles, code)._\n\n**What will the training pipeline do?**\n\nThe training pipeline contains a **data-to-prompt layer** that will preprocess\nthe data retrieved from the vector DB into prompts.\n\nIt will also contain an **LLM fine-tuning module** that inputs a HuggingFace\ndataset and uses QLoRA to fine-tune a given LLM (e.g., Mistral).\n\nAll the experiments will be logged into Comet ML\u2019s **experiment tracker**.\n\nWe will use a bigger LLM (e.g., GPT4) to **evaluate** the results of our fine-\ntuned LLM. These results will be logged into Comet\u2019s experiment tracker.\n\n**Where will the production candidate LLM be stored?**\n\nWe will compare multiple experiments, pick the best one, and issue an LLM\nproduction candidate for the model registry.\n\nAfter, we will inspect the LLM production candidate manually using Comet\u2019s\nprompt monitoring dashboard.\n\n**Where will the training pipeline be deployed?**\n\nThe training pipeline will be deployed to Qwak.\n\nQwak is a serverless solution for training and deploying ML models. It makes\nscaling your operation easy while you can focus on building.\n\nAlso, we will use the freemium version of Comet ML for the following:\n\n * experiment tracker;\n\n * model registry;\n\n * prompt monitoring.\n\n### **2.4. The inference pipeline**\n\nThe inference pipeline is the **final component** of the **LLM system**. It is\nthe one the **clients** will **interact with**.\n\nIt will be **wrapped** under a **REST API**. The clients can call it through\nHTTP requests, similar to your experience with ChatGPT or similar tools.\n\n**How do we access the features?**\n\nWe will grab the features solely from the feature store. 
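As a rough illustration of what grabbing features for RAG could look like (reusing the hypothetical collection and embedding model from the earlier sketch, not the course's actual retrieval client):

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient("localhost", port=6333)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def retrieve_context(query: str, top_k: int = 3) -> list[str]:
    """Embed the user query and return the most similar chunks to inject into the prompt."""
    query_vector = embedder.encode(query).tolist()
    hits = client.search(collection_name="embedded_chunks", query_vector=query_vector, limit=top_k)
    return [hit.payload["text"] for hit in hits]


context = retrieve_context("How is the streaming pipeline deployed?")
```

The returned chunks are then used to augment the prompt before calling the LLM.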
We will use the same\nQdrant vector DB retrieval clients as in the training pipeline to use the\nfeatures we need for RAG.\n\n**How do we access the fine-tuned LLM?**\n\nThe fine-tuned LLM will always be downloaded from the model registry based on\nits tag (e.g., accepted) and version (e.g., v1.0.2, latest, etc.).\n\n**What are the components of the inference pipeline?**\n\nThe first one is the **retrieval client** used to access the vector DB to do\nRAG.\n\nAfter we have a **query to prompt the layer,** that will map the prompt and\nretrieved documents from Qdrant into a prompt.\n\nAfter the LLM generates its answer, we will log it to Comet\u2019s **prompt\nmonitoring dashboard** and return it to the clients.\n\nFor example, the client will request the inference pipeline to:\n\n\u201cWrite a 1000-word LinkedIn post about LLMs,\u201d and the inference pipeline will\ngo through all the steps above to return the generated post.\n\n**Where will the inference pipeline be deployed?**\n\nThe inference pipeline will be deployed to Qwak.\n\nAs for the training pipeline, we will use a serverless freemium version of\nComet for its prompt monitoring dashboard.\n\n* * *\n\n### **Conclusion**\n\nThis is the 1st article of the****_**LLM Twin: Building Your Production-Ready\nAI Replica**_**** free**** course.\n\nIn this lesson, we presented what **you will build** during the course.\n\nUltimately, we went through the **system design** of the course and presented\nthe **architecture** of **each microservice** and how they **interact with\neach other** :\n\n 1. The data collection pipeline\n\n 2. The feature pipeline\n\n 3. The training pipeline\n\n 4. The inference pipeline\n\nIn **Lesson 2** , we will dive deeper into the **data collection pipeline** ,\nlearn how to implement crawlers for various social media platforms, clean the\ngathered data, store it in a Mongo DB, and finally, show you how to deploy it\nto AWS.\n\n> _\ud83d\udd17**Check out** the code on GitHub [1] and support us with a \u2b50\ufe0f_\n\n* * *\n\n#### This is how we can further help you \ud83e\udef5\n\nIn the **Decoding ML newsletter** , we want to keep things **short & sweet**.\n\nTo **dive deeper** into all the **concepts** presented in this article\u2026\n\n**Check out** the **full-fledged version** of the **article** on our **Medium\npublication**.\n\n**It\u2019s FREE** \u2193\u2193\u2193\n\n> \ud83d\udd17 Detailed Lesson 1 [on Medium]\n\n35\n\nShare this post\n\n#### An End-to-End Framework for Production-Ready LLM Systems by Building Your\nLLM Twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/an-end-to-end-framework-for-production?r=1ttoeh" }, { "id": "c4ad61cb-4875-41f6-a9d9-f0da74303586", "content": { "Title": "Upskill your LLM knowledge base with these tools.", "Subtitle": "Speed-up your LLM inference and dissect the Attention Mechanism with step-by-step animation.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Upskill your LLM knowledge base with these tools.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Upskill your LLM knowledge base with these tools.\n\n### Speed-up your LLM inference and dissect the Attention Mechanism with step-\nby-step animation.\n\nAlex Razvant\n\nMar 23, 2024\n\n10\n\nShare this post\n\n#### Upskill your LLM knowledge base with these tools.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\nThe **LLM-Twin Course** development has taken off! \ud83d\ude80\n\nJoin aboard and learn how to design, build, and implement an end-to-end LLM\nreplica, by following along in a step-by-step hands-on manner with the\ndevelopment of data pipelines, ingestion, LLM fine-tuning, serving,\nmonitoring, and more.\n\nDecoding ML Newsletter is a reader-supported publication. To receive new posts\nand support my work, consider becoming a free or paid subscriber.\n\nSubscribe\n\nThe first 2/11 lessons are out, make sure to check them out here:\n\n * Lesson 1: **An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin**\n\n * Lesson 2: **The Importance of Data Pipelines in the Era of Generative AI**\n\n* * *\n\n* * *\n\n### **This week\u2019s topics:**\n\n * **Fast inference on LLMs**\n\n * **Visualize attention mechanism**\n\n * **A commonly misunderstood CUDA issue!**\n\n* * *\n\n### Fast inference LLMs\n\nFor the last few years, LLMs have been a hot topic - new models, RAGs, new\npapers, the rise of OpenSource models, etc. \nThe attention mechanism is easy to understand, but \u201chungry\u201d to compute - thus\nmultiple methods aim to fill the performance gap in model-serving.\n\nHere are the top 4 LLM inference solutions:\n\n 1. \ud835\ude03\ud835\udddf\ud835\udddf\ud835\udde0 \nA fast and easy-to-use library for LLM inference and serving.\n\n\ud835\ude46\ud835\ude5a\ud835\ude6e \ud835\ude56\ud835\ude68\ud835\ude65\ud835\ude5a\ud835\ude58\ud835\ude69\ud835\ude68 \ud835\ude56\ud835\ude67\ud835\ude5a:\n\n * \u279d is open-source \n\n * \u279d state-of-the-art serving throughput \n\n * \u279d fast model execution with optimized CUDA kernels/graph. \n\n * \u279d efficient memory management using PagedAttention \n\n * \u279d support for AMD GPUs (ROCm) \u279d deploy support with NVIDIA Triton, KServe, Docker\n\n\ud83d\udd17 \ud835\ude0e\ud835\ude26\ud835\ude35 \ud835\ude1a\ud835\ude35\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude26\ud835\ude25: shorturl.at/nAFPW\n\n 2. 
\ud835\udde7\ud835\uddf2\ud835\uddfb\ud835\ude00\ud835\uddfc\ud835\uddff\ud835\udde5\ud835\udde7-\ud835\udddf\ud835\udddf\ud835\udde0 \nA library that accelerates and optimizes inference performance of the latest\nLLMs.\n\n\ud835\ude46\ud835\ude5a\ud835\ude6e \ud835\ude56\ud835\ude68\ud835\ude65\ud835\ude5a\ud835\ude58\ud835\ude69\ud835\ude68 \ud835\ude56\ud835\ude67\ud835\ude5a:\n\n * \u279d is open-source \n\n * \u279d built on a strong TensorRT foundation \n\n * \u279d leverages custom-optimized CUDA kernels for transformers \u279d enhances customization \n\n * \u279d supports various optimization (quant, tensor parallelism) \n\n * \u279d takes advantage of the NVIDIA Toolkit (perf-analyzer, Triton)\n\n\ud83d\udd17 \ud835\ude0e\ud835\ude26\ud835\ude35 \ud835\ude1a\ud835\ude35\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude26\ud835\ude25: shorturl.at/dluMX\n\n 3. \ud835\udde2\ud835\uddf9\ud835\uddf9\ud835\uddee\ud835\uddfa\ud835\uddee \nA tool that allows you to run open-source language models locally.\n\n\ud835\uddde\ud835\uddf2\ud835\ude06 \ud835\uddee\ud835\ude00\ud835\uddfd\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude00 \ud835\uddee\ud835\uddff\ud835\uddf2:\n\n * \u279d multi-modal model support \n\n * \u279d optimizes setup and configuration details, including GPU usage \n\n * \u279d bundles weights, configuration, and data into a single Modelfile package\n\n\ud83d\udd17 \ud835\ude0e\ud835\ude26\ud835\ude35 \ud835\ude1a\ud835\ude35\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude26\ud835\ude25: shorturl.at/dGZ46\n\n 4. \ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\udde5\ud835\udde7\ud835\uddeb\n\nA solution from NVIDIA that allows users to build their own personalized\nchatbot experience.\n\n\ud835\ude46\ud835\ude5a\ud835\ude6e \ud835\ude56\ud835\ude68\ud835\ude65\ud835\ude5a\ud835\ude58\ud835\ude69\ud835\ude68 \ud835\ude56\ud835\ude67\ud835\ude5a:\n\n * \u279d emphasizes no-code, ChatGPT-like interface \n\n * \u279d one can connect custom documents, videos, notes, and PDFs \u279d easy to set up RAG (Retrieval Augmented Generation) \n\n * \u279d support for the latest LLMs \n\n * \u279d leverages TensorRT-LLM and RTX acceleration \n\n * \u279d downloadable installer (35GB), out-of-the-box Mistral & LLaMA 7b versions\n\n\ud83d\udd17 \ud835\ude0e\ud835\ude26\ud835\ude35 \ud835\ude1a\ud835\ude35\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude26\ud835\ude25: shorturl.at/ekuK6\n\n* * *\n\n### Visualize attention mechanism\n\n\ud835\udddf\ud835\udddf\ud835\udde0 models are complex - the key to understanding the process is the \ud835\uddee\ud835\ude01\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\n\ud835\uddfa\ud835\uddf2\ud835\uddf0\ud835\uddf5\ud835\uddee\ud835\uddfb\ud835\uddf6\ud835\ude00\ud835\uddfa.\n\nHere are \ud835\udfef \ud835\ude01\ud835\uddfc\ud835\uddfc\ud835\uddf9\ud835\ude00 to help you interactively visualize attention:\n\n 1. \ud835\uddd4\ud835\ude01\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\udde9\ud835\uddf6\ud835\ude07 : shorturl.at/DSY58\n\n 1. \ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude27\ud835\ude2a\ud835\ude28\ud835\ude36\ud835\ude33\ud835\ude22\ud835\ude23\ud835\ude2d\ud835\ude26 \ud835\ude2f\ud835\ude36\ud835\ude2e \ud835\ude29\ud835\ude26\ud835\ude22\ud835\ude25\ud835\ude34.\n\n 2. 
\ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude27\ud835\ude2a\ud835\ude28\ud835\ude36\ud835\ude33\ud835\ude22\ud835\ude23\ud835\ude2d\ud835\ude26 \ud835\ude2f\ud835\ude36\ud835\ude2e \ud835\ude2d\ud835\ude22\ud835\ude3a\ud835\ude26\ud835\ude33\ud835\ude34.\n\n 3. \ud835\ude29\ud835\ude22\ud835\ude34 \ud835\ude1d\ud835\ude2a\ud835\ude1b, \ud835\ude09\ud835\ude0c\ud835\ude19\ud835\ude1b, \ud835\ude0e\ud835\ude17\ud835\ude1b2 \ud835\ude2a\ud835\ude2f\ud835\ude24\ud835\ude2d\ud835\ude36\ud835\ude25\ud835\ude26\ud835\ude25.\n\n 4. \ud835\udfee\ud835\uddd7 visualization + \ud835\udfef\ud835\uddd7 \ud835\ude3b\ud835\ude30\ud835\ude30\ud835\ude2e-\ud835\ude2a\ud835\ude2f\ud835\ude34 \ud835\ude30\ud835\ude2f \ud835\ude34\ud835\ude26\ud835\ude2d\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude26\ud835\ude25 \ud835\ude2d\ud835\ude22\ud835\ude3a\ud835\ude26\ud835\ude33\ud835\ude34.\n\n 2. \ud835\udde3\ud835\ude06\ud835\udde7\ud835\uddfc\ud835\uddff\ud835\uddf0\ud835\uddf5 \ud835\udde0\ud835\udde0: shorturl.at/lqJQY\n\n * \ud835\ude24\ud835\ude36\ud835\ude34\ud835\ude35\ud835\ude30\ud835\ude2e \ud835\ude30\ud835\ude31\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\ud835\ude34.\n\n * \ud835\ude26\ud835\ude39\ud835\ude35\ud835\ude26\ud835\ude2f\ud835\ude34\ud835\ude2a\ud835\ude23\ud835\ude2d\ud835\ude26 \ud835\ude2a\ud835\ude2f \ud835\ude28\ud835\ude33\ud835\ude22\ud835\ude31\ud835\ude29-\ud835\ude2d\ud835\ude2a\ud835\ude2c\ud835\ude26 \ud835\ude27\ud835\ude22\ud835\ude34\ud835\ude29\ud835\ude2a\ud835\ude30\ud835\ude2f.\n\n * \ud835\ude29\ud835\ude22\ud835\ude34 \ud835\ude0e\ud835\ude17\ud835\ude1b2-\ud835\ude2f\ud835\ude22\ud835\ude2f\ud835\ude30, \ud835\ude13\ud835\ude30\ud835\ude19\ud835\ude08 \ud835\ude1b\ud835\ude26\ud835\ude24\ud835\ude29\ud835\ude2f\ud835\ude2a\ud835\ude32\ud835\ude36\ud835\ude26 \ud835\ude2a\ud835\ude2f\ud835\ude24\ud835\ude2d\ud835\ude36\ud835\ude25\ud835\ude26\ud835\ude25.\n\n * 3D\n\n 3. \ud835\uddd5\ud835\uddd5\ud835\ude06\ud835\uddd6\ud835\uddff\ud835\uddfc\ud835\uddf3\ud835\ude01: shorturl.at/ivCR1\n\n * \ud835\ude2a\ud835\ude2f\ud835\ude34\ud835\ude31\ud835\ude26\ud835\ude24\ud835\ude35 \ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31-\ud835\ude23\ud835\ude3a-\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31 1 \ud835\ude35\ud835\ude30\ud835\ude2c\ud835\ude26\ud835\ude2f \ud835\ude31\ud835\ude33\ud835\ude26\ud835\ude25\ud835\ude2a\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f.\n\n * \ud835\ude29\ud835\ude22\ud835\ude34 \ud835\ude0e\ud835\ude17\ud835\ude1b2-\ud835\ude34\ud835\ude2e\ud835\ude22\ud835\ude2d\ud835\ude2d, \ud835\ude0e\ud835\ude17\ud835\ude1b3, \ud835\ude0e\ud835\ude17\ud835\ude1b-\ud835\ude2f\ud835\ude22\ud835\ude2f\ud835\ude30, \ud835\ude0e\ud835\ude17\ud835\ude1b2-\ud835\ude1f\ud835\ude13 \ud835\ude2a\ud835\ude2f\ud835\ude24\ud835\ude2d\ud835\ude36\ud835\ude25\ud835\ude26\ud835\ude25.\n\n * straight-forward\n\n* * *\n\n### A commonly misunderstood CUDA issue!\n\nThe problem was that \ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddf1\ud835\uddf6\ud835\uddee-\ud835\ude00\ud835\uddfa\ud835\uddf6 was showing a \ud835\uddf1\ud835\uddf6\ud835\uddf3\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddda\ud835\udde3\ud835\udde8 \ud835\uddf1\ud835\uddf2\ud835\ude03\ud835\uddf6\ud835\uddf0\ud835\uddf2 \ud835\uddfc\ud835\uddff\ud835\uddf1\ud835\uddf2\ud835\uddff\ncompared to docker or Python. 
Thus, errors regarding disjoint memory regions appeared.

Here's the trick:

 * **System layer**

 * nvidia-smi works at the system level and orders GPUs following the top-down order in which the physical video cards are inserted into the PCI Express slots on the motherboard.

 * **Software layer**

 * At this layer, Python, Docker, or any other program sees the GPUs in "FASTEST_FIRST" order by default, meaning the GPU with the highest CC (CUDA compute capability) gets the first index.

The solution is to make the applications at the software layer respect the system layer ordering by setting the environment variable:

    CUDA_DEVICE_ORDER = "PCI_BUS_ID"
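A minimal way to apply this from Python, before any CUDA context is created (shown with PyTorch, but the same applies to any CUDA program):

```python
import os

# Must be set before CUDA is first used by the process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # optional: these indices now match nvidia-smi

import torch

print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```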
To receive new posts\nand support my work, consider becoming a free or paid subscriber.\n\nSubscribe\n\n10\n\nShare this post\n\n#### Upskill your LLM knowledge base with these tools.\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/upskill-your-llm-knowledge-base-with?r=1ttoeh" }, { "id": "4d1d7d1c-ebd2-445e-a8d7-bdfc1c90cfc6", "content": { "Title": "An end-to-end framework for production-ready LLM systems", "Subtitle": "Learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Learn an end-to-end framework for production-ready LLM systems by\nbuilding your LLM twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Learn an end-to-end framework for production-ready LLM systems by building\nyour LLM twin\n\n### Why you should take our new production-ready LLMs course\n\nPaul Iusztin\n\nMar 16, 2024\n\n18\n\nShare this post\n\n#### Learn an end-to-end framework for production-ready LLM systems by\nbuilding your LLM twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\nWant to \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb an \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\uddf3\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 for \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 by\n\ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 your \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb?\n\nThen you are in luck.\n\n\u2193\u2193\u2193\n\nThe Decoding ML team and I will \ud835\uddff\ud835\uddf2\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\ude00\ud835\uddf2 (in a few days) a \ud835\uddd9\ud835\udde5\ud835\uddd8\ud835\uddd8 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 called\nthe \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \ud835\uddd5\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\udde5\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\uddd4\ud835\udddc 
\ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee.\n\n\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf6\ud835\ude00 \ud835\uddee\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb? It is an AI character that learns to write like somebody\nby incorporating its style and personality into an LLM.\n\n> **Within** the**course,** you**** will**learn how** to**:**\n>\n> * architect\n>\n> * train\n>\n> * deploy\n>\n>\n\n>\n> ...a \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\ude04\ud835\uddf6\ud835\uddfb of yourself powered by LLMs, vector DBs, and\n> LLMOps good practices, such as:\n>\n> * experiment trackers\n>\n> * model registries\n>\n> * prompt monitoring\n>\n> * versioning\n>\n> * deploying LLMs\n>\n>\n\n>\n> ...and more!\n\nIt is an \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 where you will \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 a \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9-\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf9\ud835\uddf1 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa:\n\n\u2192 from start to finish\n\n\u2192 from data collection to deployment\n\n\u2192 production-ready\n\n\u2192 from NO MLOps to experiment trackers, model registries, prompt monitoring,\nand versioning\n\nImage by DALL-E\n\n* * *\n\n### Who is this for?\n\n**Audience:** MLE, DE, DS, or SWE who want to learn to engineer production-\nready LLM systems using LLMOps good principles.\n\n**Level:** intermediate\n\n**Prerequisites:** basic knowledge of Python, ML, and the cloud\n\n### **How will you learn?**\n\nThe course contains **11 hands-on written lessons** and the **open-source\ncode** you can access on GitHub (WIP).\n\nYou can read everything at your own pace.\n\n### Costs?\n\nThe **articles** and **code** are **completely free**. They will always remain\nfree.\n\nThis time, the Medium articles won't be under any paid wall. 
I want to make\nthem entirely available to everyone.\n\n### **Meet your teachers!**\n\nThe course is created under the Decoding ML umbrella by:\n\n * Paul Iusztin | Senior ML & MLOps Engineer\n\n * Alex Vesa | Senior AI Engineer\n\n * Alex Razvant | Senior ML & MLOps Engineer\n\n* * *\n\n## What will you learn to build?\n\nLM twin system architecture [Image by the Author]\n\n\ud83d\udc0d \ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude13\ud835\ude13\ud835\ude14 \ud835\ude22\ud835\ude33\ud835\ude24\ud835\ude29\ud835\ude2a\ud835\ude35\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude36\ud835\ude33\ud835\ude26 \ud835\ude30\ud835\ude27 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude24\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26 \ud835\ude2a\ud835\ude34 \ud835\ude34\ud835\ude31\ud835\ude2d\ud835\ude2a\ud835\ude35 \ud835\ude2a\ud835\ude2f\ud835\ude35\ud835\ude30 4 \ud835\ude17\ud835\ude3a\ud835\ude35\ud835\ude29\ud835\ude30\ud835\ude2f \ud835\ude2e\ud835\ude2a\ud835\ude24\ud835\ude33\ud835\ude30\ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude2a\ud835\ude24\ud835\ude26\ud835\ude34:\n\n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\uddf0\ud835\uddfc\ud835\uddf9\ud835\uddf9\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\\- Crawl your digital data from various social media platforms.\n\n\\- Clean, normalize and load the data to a NoSQL DB through a series of ETL\npipelines.\n\n\\- Send database changes to a queue using the CDC pattern.\n\n\u2601 Deployed on AWS.\n\n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\\- Consume messages from a queue through a Bytewax streaming pipeline.\n\n\\- Every message will be cleaned, chunked, embedded (using Superlinked), and\nloaded into a Qdrant vector DB in real-time.\n\n\u2601 Deployed on AWS.\n\n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\\- Create a custom dataset based on your digital data.\n\n\\- Fine-tune an LLM using QLoRA.\n\n\\- Use Comet ML's experiment tracker to monitor the experiments.\n\n\\- Evaluate and save the best model to Comet's model registry.\n\n\u2601 Deployed on Qwak.\n\n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\n\\- Load and quantize the fine-tuned LLM from Comet's model registry.\n\n\\- Deploy it as a REST API.\n\n\\- Enhance the prompts using RAG.\n\n\\- Generate content using your LLM twin.\n\n\\- Monitor the LLM using Comet's prompt monitoring dashboard .\n\n\u2601 Deployed on Qwak.\n\n.\n\n\ud835\ude08\ud835\ude2d\ud835\ude30\ud835\ude2f\ud835\ude28 \ud835\ude35\ud835\ude29\ud835\ude26 4 \ud835\ude2e\ud835\ude2a\ud835\ude24\ud835\ude33\ud835\ude30\ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude2a\ud835\ude24\ud835\ude26\ud835\ude34, \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude38\ud835\ude2a\ud835\ude2d\ud835\ude2d 
\ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude35\ud835\ude30 \ud835\ude2a\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude28\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 3 \ud835\ude34\ud835\ude26\ud835\ude33\ud835\ude37\ud835\ude26\ud835\ude33\ud835\ude2d\ud835\ude26\ud835\ude34\ud835\ude34 \ud835\ude35\ud835\ude30\ud835\ude30\ud835\ude2d\ud835\ude34:\n\n\\- Comet ML as your ML Platform\n\n\\- Qdrant as your vector DB\n\n\\- Qwak as your ML infrastructure\n\n* * *\n\nSoon, we will release the first lesson from the \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\udde7\ud835\ude04\ud835\uddf6\ud835\uddfb: \ud835\uddd5\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddec\ud835\uddfc\ud835\ude02\ud835\uddff\n\ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb-\ud835\udde5\ud835\uddf2\ud835\uddee\ud835\uddf1\ud835\ude06 \ud835\uddd4\ud835\udddc \ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee\n\nTo stay updated...\n\n\ud835\ude3e\ud835\ude5d\ud835\ude5a\ud835\ude58\ud835\ude60 \ud835\ude5e\ud835\ude69 \ud835\ude64\ud835\ude6a\ud835\ude69 \ud835\ude42\ud835\ude5e\ud835\ude69\ud835\ude43\ud835\ude6a\ud835\ude57 \ud835\ude56\ud835\ude63\ud835\ude59 \ud835\ude68\ud835\ude6a\ud835\ude65\ud835\ude65\ud835\ude64\ud835\ude67\ud835\ude69 \ud835\ude6a\ud835\ude68 \ud835\ude6c\ud835\ude5e\ud835\ude69\ud835\ude5d \ud835\ude56 \u2b50\ufe0f\n\n\u2193\u2193\u2193\n\n\ud83d\udd17 _**LLM Twin: Building Your Production-Ready AI Replica** Course GitHub\nRepository_\n\n18\n\nShare this post\n\n#### Learn an end-to-end framework for production-ready LLM systems by\nbuilding your LLM twin\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/want-to-learn-an-end-to-end-framework?r=1ttoeh" }, { "id": "1dbefe69-acbf-4b86-8b52-0670b28dbab4", "content": { "Title": "Fix your messy ML configs in your Python projects", "Subtitle": "2024 MLOps learning roadmap. Python syntax sugar that will help you write cleaner code.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### Fix your messy ML configs in your Python projects\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# Fix your messy ML configs in your Python projects\n\n### 2024 MLOps learning roadmap. Python syntax sugar that will help you write\ncleaner code.\n\nPaul Iusztin\n\nMar 09, 2024\n\n13\n\nShare this post\n\n#### Fix your messy ML configs in your Python projects\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Decoding ML Notes_\n\nThis week our main focus will be a classic.\n\n> We will discuss Python.\n>\n> More concretely how to write cleaner code and applications in Python. \ud83d\udd25\n\nIs that even possible? 
\ud83d\udc80\n\n* * *\n\n### **This week\u2019s topics:**\n\n * My favorite way to implement a configuration layer in Python\n\n * Some Python syntax sugar that will help you write cleaner code\n\n * 2024 MLOps learning roadmap\n\n* * *\n\nSince creating content, I learned one crucial thing: \"\ud835\ude0c\ud835\ude37\ud835\ude26\ud835\ude33\ud835\ude3a\ud835\ude23\ud835\ude30\ud835\ude25\ud835\ude3a \ud835\ude2d\ud835\ude2a\ud835\ude2c\ud835\ude26\ud835\ude34 \ud835\ude35\ud835\ude30 \ud835\ude33\ud835\ude26\ud835\ude22\ud835\ude25\n\ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude25\ud835\ude2a\ud835\ude27\ud835\ude27\ud835\ude26\ud835\ude33\ud835\ude26\ud835\ude2f\ud835\ude35\ud835\ude2d\ud835\ude3a.\"\n\n> Do you prefer to read content on Medium?\n\nThen, you are in luck.\n\nDecoding ML is also on Medium.\n\n**Substack vs. Medium?**\n\nOn Medium, we plan to post more extended and detailed content, while on\nSubstack, we will write on the same topics but in a shorter and more\nconcentrated manner.\n\nIf you want more code and less talking\u2026\n\n _Check out our Medium publication_ \ud83d\udc40\n\n\u2193\u2193\u2193\n\n\u2794 \ud83d\udd17 Decoding ML Medium publication\n\n\ud83d\udd17 Decoding ML Medium publication\n\n* * *\n\n### My favorite way to implement a configuration layer in Python\n\nThis is my favorite way to \ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 a \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\uddf3\ud835\uddf6\ud835\uddf4\ud835\ude02\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb/\ud835\ude00\ud835\uddf2\ud835\ude01\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4\ud835\ude00 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa in \ud835\udde3\ud835\ude06\ud835\ude01\ud835\uddf5\ud835\uddfc\ud835\uddfb\nfor all my apps \u2193\n\nThe core is based on \ud835\ude31\ud835\ude3a\ud835\ude25\ud835\ude22\ud835\ude2f\ud835\ude35\ud835\ude2a\ud835\ude24, a data validation library for Python.\n\nMore precisely, on their \ud835\ude09\ud835\ude22\ud835\ude34\ud835\ude26\ud835\ude1a\ud835\ude26\ud835\ude35\ud835\ude35\ud835\ude2a\ud835\ude2f\ud835\ude28\ud835\ude34 class.\n\n\ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\ude02\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfd\ud835\ude06\ud835\uddf1\ud835\uddee\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddf0 \ud835\uddd5\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\udde6\ud835\uddf2\ud835\ude01\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4\ud835\ude00 \ud835\uddf0\ud835\uddf9\ud835\uddee\ud835\ude00\ud835\ude00?\n\n\\- you can quickly load values from .\ud835\ude26\ud835\ude2f\ud835\ude37 files (or even \ud835\ude11\ud835\ude1a\ud835\ude16\ud835\ude15 or \ud835\ude20\ud835\ude08\ud835\ude14\ud835\ude13)\n\n\\- add default values for the configuration of your application\n\n\\- the MOST IMPORTANT one \u2192 It validates the type of the loaded variables.\nThus, you will always be ensured you use the correct variables to configure\nyour system.\n\n\ud835\udddb\ud835\uddfc\ud835\ude04 \ud835\uddf1\ud835\uddfc \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddf6\ud835\ude01?\n\nIt is pretty straightforward.\n\nYou subclass the 
BaseSettings class and define all your settings at the class level.

It is similar to a Python dataclass but with an extra layer of data validation and factory methods.

If you assign a value to a variable, it becomes optional.

If you leave it empty, providing it in your .env file is mandatory.

**How do you integrate it with your ML code?**

You often keep a training (or inference) configuration in a JSON or YAML file (I prefer YAML files as they are easier to read).

You shouldn't pollute your pydantic settings class with all the hyperparameters related to the module (as there are a lot, A LOT).

Also, to isolate the application & ML settings, the easiest way is to add a training_config_path field to your settings and use a TrainingConfig class to load the file independently.

Doing so lets you keep your favorite way of loading a config file for the ML configuration (probably the one you already have in your ML code): plain YAML or JSON files, hydra, or other fancier methods.

Another plus is that you no longer hardcode paths anywhere on your system, which becomes a nightmare once multiple people work on the same git repository.

pydantic BaseSettings example [Image by the Author]
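Here is a minimal sketch of the pattern. The field names, the .env file, and the TrainingConfig schema are illustrative, not the exact code from the image above; it uses the pydantic-settings package, where BaseSettings lives in pydantic v2, and it assumes a .env file and a configs/training.yaml exist.

```python
from pathlib import Path

import yaml
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class TrainingConfig(BaseModel):
    """ML hyperparameters, kept out of the app settings and loaded from a YAML file."""

    learning_rate: float = 3e-4
    max_seq_length: int = 512

    @classmethod
    def from_yaml(cls, path: Path) -> "TrainingConfig":
        return cls(**yaml.safe_load(path.read_text()))


class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    qdrant_url: str                                             # no default -> must be provided in .env
    environment: str = "dev"                                    # has a default -> optional
    training_config_path: Path = Path("configs/training.yaml")  # only the path, not the hyperparameters


settings = AppSettings()
training_config = TrainingConfig.from_yaml(settings.training_config_path)
```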
Would you start using the pydantic BaseSettings class in your ML applications?

* * *

### Some Python syntax sugar that will help you write cleaner code

Here is some **Python syntax sugar** that will help you **write cleaner code** ↓

I am talking about the walrus operator, denoted by the `:=` symbol.

It was introduced in Python 3.8, but I rarely see it used.

Thus, as a "clean code" freak, I wanted to dedicate a post to it.

**What does the walrus operator do?**

It's an assignment expression that allows you to assign and return a value in the same expression.

**Why should you use it?**

_Conciseness_: It reduces the number of lines needed for variable assignment and checking, making code more concise.

_Readability_: It can enhance readability by keeping related logic close, although this depends on the context and the reader's familiarity with exotic Python syntax.

Here are some examples ↓↓↓

1\. Using the walrus operator, you can directly assign the result of the len() function inside an if statement.

2\. Avoid calling the same function twice in a while loop. The benefit is less code that is more readable.

3\. Another use case arises in list comprehensions where a value computed in a filtering condition is also needed in the expression body (all three patterns are sketched below).
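A compact, self-contained sketch of the three patterns (the values are made up for illustration):

```python
import io

# 1. Assign the result of len() directly inside the if statement.
data = list(range(15))
if (n := len(data)) > 10:
    print(f"The list is quite long: {n} elements")

# 2. Avoid calling the same function twice in a while loop:
#    without the walrus operator, read() would appear once before the loop and once inside it.
stream = io.StringIO("some text that we consume in fixed-size chunks")
while chunk := stream.read(8):
    print(f"Processing chunk: {chunk!r}")

# 3. In a list comprehension, reuse the value computed in the filtering condition.
numbers = [1, 2, 3, 4, 5]
large_squares = [sq for x in numbers if (sq := x * x) > 10]  # x * x is computed only once per item
print(large_squares)  # [16, 25]
```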
Before the \ud835\ude38\ud835\ude22\ud835\ude2d\ud835\ude33\ud835\ude36\ud835\ude34\n\ud835\ude30\ud835\ude31\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude30\ud835\ude33, if you had to apply a function to an item from a list and filter it\nbased on some criteria, you had to refactor it to a standard for loop.\n\n.\n\nWhen writing clean code, the detail matters.\n\nThe details make the difference between a codebase that can be read like a\nbook or one with 10 WTFs / seconds.\n\nThe walrus operator examples [Image by the Author]\n\nWhat do you think? Does the walrus operator make the Python code more readable\nand concise?\n\n* * *\n\n### 2024 MLOps learning roadmap\n\n\ud835\uddea\ud835\uddee\ud835\uddfb\ud835\ude01 to \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 but got stuck at the 100th tool you think you must know?\nHere is the \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddff\ud835\uddfc\ud835\uddee\ud835\uddf1\ud835\uddfa\ud835\uddee\ud835\uddfd \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\udfee\ud835\udfec\ud835\udfee\ud835\udff0 \u2193 \n \n\ud835\ude14\ud835\ude13\ud835\ude16\ud835\ude31\ud835\ude34 \ud835\ude37\ud835\ude34. \ud835\ude14\ud835\ude13 \ud835\ude26\ud835\ude2f\ud835\ude28\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude26\ud835\ude33 \n \nIn theory, MLEs focus on deploying models to production while MLOps engineers\nbuild the platform used by MLEs. \n \nI think this is heavily dependent on the scale of the company. As the company\ngets smaller, these 2 roles start to overlap more. \n \nThis roadmap will teach you how to build such a platform, from programming\nskills to MLOps components and infrastructure as code. \n \n. \n \nHere is the MLOps roadmap for 2024 suggested by\n\nMaria Vechtomova\n\nfrom\n\nMarvelousMLOps\n\n: \n \n\ud835\udfed\\. \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddf4\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n\\- Python & IDEs \n\\- Bash basics & command line editors \n \n\ud835\udfee\\. \ud835\uddd6\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\ude07\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\uddde\ud835\ude02\ud835\uddef\ud835\uddf2\ud835\uddff\ud835\uddfb\ud835\uddf2\ud835\ude01\ud835\uddf2\ud835\ude00 \n\\- Docker \n\\- Kubernetes \n \n\ud835\udfef\\. \ud835\udde0\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf3\ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf9\ud835\ude00 \n \n...until now we laid down the fundamentals. Now let's get into MLOps \ud83d\udd25 \n \n\ud835\udff0\\. \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddfd\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \n\\- reproducible, \n\\- testable, and \n\\- evolvable ML-powered software \n \n\ud835\udff1\\. 
\ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\ude00 \n\\- Version control & CI/CD pipelines \n\\- Orchestration \n\\- Experiment tracking and model registries \n\\- Data lineage and feature stores \n\\- Model training & serving \n\\- Monitoring & observability \n \n\ud835\udff2\\. \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddff\ud835\uddee\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddee\ud835\ude00 \ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2 \n\\- Terraform\n\n2024 MLOps Learning Roadmap [Image by the Author]\n\nAs a self-learner, I wish I had access to this step-by-step plan when I\nstarted learning MLOps. \n \nRemember, you should pick up and tailor this roadmap at the level you are\ncurrently at. \n \nFind more details about the roadmap in\n\nMaria Vechtomova\n\narticle \u2193 \n \n\u2794 \ud83d\udd17 MLOps roadmap 2024\n\n13\n\nShare this post\n\n#### Fix your messy ML configs in your Python projects\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/my-favorite-way-to-implement-a-configuration?r=1ttoeh" }, { "id": "ba6ba94f-b2d0-4ad8-9dbc-638f5eb1a081", "content": { "Title": "A Real-time Retrieval System for RAG on Social Media Data", "Subtitle": "Use a Bytewax streaming engine to build a real-time ingestion pipeline to populate a Qdrant vector DB. Implement a RAG retrieval client using rerank.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### A Real-time Retrieval System for RAG on Social Media Data\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# A Real-time Retrieval System for RAG on Social Media Data\n\n### Use a streaming engine to populate a vector DB in real time. Use rerank &\nUMAP to improve the accuracy of your retrieved documents.\n\nPaul Iusztin\n\nMar 07, 2024\n\n31\n\nShare this post\n\n#### A Real-time Retrieval System for RAG on Social Media Data\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n4\n\nShare\n\n> We are putting in a lot of time to create high-quality content. Thus, we\n> want to make it as convenient as possible for you to read our content.\n>\n> That is why we will experiment with the **posting time** and **move** it to\n> **Thursday** at **3:00 PM CET**.\n\nIn this article, you will learn how to build a real-time retrieval system for\nsocial media data. 
In our example, we will use only my LinkedIn posts, but our\nimplementation can easily be extended to other platforms supporting written\ncontent, such as X, Instagram, or Medium.\n\n**In this article, you will learn how to:**\n\n * build a streaming pipeline that ingests LinkedIn posts into a vector DB in real-time\n\n * clean, chunk, and embed LinkedIn posts\n\n * build a retrieval client to query LinkedIn posts\n\n * use a rerank pattern to improve retrieval accuracy\n\n * visualize content retrieved for a given query in a 2D plot using UMAP\n\nOur implementation focuses on just the retrieval part of an RAG system. But\nyou can quickly hook the retrieved LinkedIn posts to an LLM for post analysis\nor personalized content generation.\n\n* * *\n\n## Table of Contents:\n\n 1. System Design\n\n 2. Data\n\n 3. Streaming ingestion pipeline\n\n 4. Retrieval client\n\n 5. Conclusion\n\n* * *\n\n### 1\\. System Design\n\nThe architecture of the retrieval system [Image by the Author - in\ncollaboration with VectorHub].\n\nThe retrieval system is based on 2 detached components:\n\n 1. the streaming ingestion pipeline\n\n 2. the retrieval client\n\nThe **streaming ingestion pipeline** runs 24/7 to keep the vector DB synced up\nwith current raw LinkedIn posts data source, while the **retrieval client** is\nused in RAG applications to query the vector DB. These 2 components\n**communicate with each other only through the vector DB**.\n\n#### **1.1. The streaming ingestion pipeline**\n\nThe streaming ingestion pipeline implements the Change Data Capture (CDC)\npattern between a data source containing the raw LinkedIn posts and the vector\nDB used for retrieval.\n\nIn a real-world scenario, the streaming pipeline listens to a queue populated\nby all the changes made to the source database. But because we are focusing\nprimarily on the retrieval system, we simulate the data within the queue with\na couple of JSON files.\n\nThe streaming pipeline is built in Python using Bytewax, and cleans, chunks,\nand embeds the LinkedIn posts before loading them into a Qdrant vector DB.\n\n**Why do we need a stream engine?**\n\nBecause LinkedIn posts (or any other social media data) evolve frequently,\nyour vector DB can quickly get out of sync. To handle this, you can build a\nbatch pipeline that runs every minute. But to really minimize data lag, to\n**make sure your vector DB stays current with new social media posts** , you\nneed to use a streaming pipeline that **immediately** takes every new item the\nmoment it's posted, preprocesses it, and loads it into the vector DB.\n\n**Why Bytewax?**\n\nBytewax is a streaming engine built in Rust that exposes a Python interface.\nWe use Bytewax because it combines the impressive speed and reliability of\nRust with the ease of use and ecosystem of Python.\n\n#### 1.2. The retrieval client\n\nOur retrieval client is a standard Python module that preprocesses user\nqueries and searches the vector DB for most similar results. Qdrant vector DB\nlets us decouple the retrieval client from the streaming ingestion pipeline.\n\nUsing a semantic-based retrieval system lets us query our LinkedIn post\ncollection very flexibly. For example, we can retrieve similar posts using a\nvariety of query types - e.g., posts, questions, sentences.\n\nAlso, to improve the retrieval system's accuracy, we use a rerank pattern.\n\nLastly, to better understand and explain the retrieval process for particular\nqueries, we visualize our results on a 2D plot using UMAP.\n\n### 2\\. 
Data\n\nWe will ingest 215 LinkedIn posts from my Linked profile - Paul Iusztin.\nThough we simulate the post ingestion step using JSON files, the posts\nthemselves are authentic.\n\nBefore diving into the code, let's take a look at an example LinkedIn post to\nfamiliarize ourselves with the challenges it will introduce \u2193\n\n \n \n [\n {\n \"text\": \"\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 do you need to \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 an open-source \ud835\udddf\ud835\udddf\ud835\udde0 to create your own \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddf1\ud835\ude03\ud835\uddf6\ud835\ude00\ud835\uddfc\ud835\uddff?\\nThis is the \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf8\ud835\uddf6\ud835\ude01 you must know \u2193\\n\ud835\uddd7\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01\\nThe key component of any successful ML project is the data.\\nYou need a 100 - 1000 sample Q&A (questions & answers) dataset with financial scenarios.\\nThe best approach is to hire a bunch of experts to create it manually.\\nBut, for a PoC, that might get expensive & slow.\\nThe good news is that a method called \\\"\ud835\ude0d\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude25\ud835\ude2a\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude2d\ud835\ude2d\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\\\" exists.\\n \n ...\n Along with ease of deployment, you can easily add your training code to your CI/CD to add the final piece of the MLOps puzzle, called CT (continuous training).\\n\u21b3 Beam: \ud83d\udd17\\nhttps://lnkd.in/dedCaMDh\\n.\\n\u21b3 To see all these components in action, check out my FREE \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 & give it a \u2b50: \ud83d\udd17\\nhttps://lnkd.in/dZgqtf8f\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\",\n \"image\": \"https://media.licdn.com/dms/image/D4D10AQHWQzZcToQQ1Q/image-shrink_800/0/1698388219549?e=1705082400&v=beta&t=9mrDC_NooJgD7u7Qk0PmrTGGaZtuwDIFKh3bEqeBsm0\"\n }\n ]\n\nThe following features of the above post are not compatible with embedding\nmodels. We'll need to find some way of handling them in our preprocessing\nstep:\n\n * emojis\n\n * bold, italic text\n\n * other non-ASCII characters\n\n * URLs\n\n * content that exceeds the context window limit of the embedding model\n\nEmojis and bolded and italic text are represented by Unicode characters that\nare not available in the vocabulary of the embedding model. Thus, these items\ncannot be tokenized and passed to the model; we have to remove them or\nnormalize them to something that can be parsed by the tokenizer. The same\nholds true for all other non-ASCII characters.\n\nURLs take up space in the context window without providing much semantic\nvalue. Still, knowing that there's a URL in the sentence may add context. For\nthis reason, we replace all URLs with a [URL] token. 
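To make this concrete, here is a minimal sketch of such a cleaning step (a hypothetical helper, not the exact implementation from the repository; the context-window limit is handled later, in the chunking step):

    import re
    import unicodedata

    def clean_post_text(text: str) -> str:
        """Normalize a raw post so the embedding model's tokenizer can handle it."""
        # Replace URLs with a placeholder token; the link itself carries little semantic value.
        text = re.sub(r"https?://\S+", "[URL]", text)
        # Decompose styled Unicode (bold/italic letter variants) into their base characters,
        # then drop emojis and any other non-ASCII characters that survive the normalization.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", errors="ignore").decode("ascii")
        # Collapse the extra whitespace left behind by the removals.
        return re.sub(r"\s+", " ", text).strip()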
This lets us ingest\nwhatever value the URL's presence conveys without it taking up valuable space.\n\n### 3\\. Streaming ingestion pipeline\n\nLet's dive into the streaming pipeline, starting from the top and working our\nway to the bottom \u2193\n\n#### 3.1. The Bytewax flow\n\n**The Bytewax flow** transparently conveys all the steps of the streaming\npipeline.\n\nThe first step is ingesting every LinkedIn post from our JSON files. In the\nnext steps, every map operation has a single responsibility:\n\n * validate the ingested data using a _RawPost pydantic model_\n\n * clean the posts\n\n * chunk the posts; because chunking will output a list of ChunkedPost objects, we use a flat_map operation to flatten them out\n\n * embed the posts\n\n * load the posts to a Qdrant vector DB\n\n \n \n def build_flow():\n embedding_model = EmbeddingModelSingleton()\n \n flow = Dataflow(\"flow\")\n \n stream = op.input(\"input\", flow, JSONSource([\"data/paul.json\"]))\n stream = op.map(\"raw_post\", stream, RawPost.from_source)\n stream = op.map(\"cleaned_post\", stream, CleanedPost.from_raw_post)\n stream = op.flat_map(\n \"chunked_post\",\n stream,\n lambda cleaned_post: ChunkedPost.from_cleaned_post(\n cleaned_post, embedding_model=embedding_model\n ),\n )\n stream = op.map(\n \"embedded_chunked_post\",\n stream,\n lambda chunked_post: EmbeddedChunkedPost.from_chunked_post(\n chunked_post, embedding_model=embedding_model\n ),\n )\n op.inspect(\"inspect\", stream, print)\n op.output(\n \"output\", stream, QdrantVectorOutput(vector_size=model.embedding_size)\n )\n \n return flow\n\n#### 3.2. The processing steps\n\nEvery processing step is incorporated into a _pydantic model_. This way, we\ncan easily validate the data at each step and reuse the code in the retrieval\nmodule.\n\nWe isolate every step of an ingestion pipeline into its own class:\n\n * cleaning\n\n * chunking\n\n * embedding \n\nDoing so, we follow the separation of concerns good SWE practice. Thus, every\nclass has its own responsibility.\n\nNow the code is easy to read and understand. Also, it\u2019s future-proof, as it\u2019s\nextremely easy to change or extend either of the 3 steps: cleaning, chunking\nand embedding.\n\nHere is the interface of the _pydantic models_ :\n\n \n \n class RawPost(BaseModel):\n post_id: str\n text: str\n image: Optional[str]\n \n @classmethod\n def from_source(cls, k_v: Tuple[str, dict]) -> \"RawPost\":\n ... # Mapping a dictionary to a RawPost validated pydantic model.\n \n return cls(...)\n \n class CleanedPost(BaseModel):\n post_id: str\n raw_text: str\n text: str\n image: Optional[str]\n \n @classmethod\n def from_raw_post(cls, raw_post: RawPost) -> \"CleanedPost\":\n ... # Cleaning the raw post\n \n return cls(...)\n \n \n class ChunkedPost(BaseModel):\n post_id: str\n chunk_id: str\n full_raw_text: str\n text: str\n image: Optional[str]\n \n @classmethod\n def from_cleaned_post(\n cls, cleaned_post: CleanedPost, embedding_model: EmbeddingModelSingleton\n ) -> list[\"ChunkedPost\"]:\n chunks = ... # Compute chunks\n \n return [cls(...) for chunk in chunks]\n \n \n class EmbeddedChunkedPost(BaseModel):\n post_id: str\n chunk_id: str\n full_raw_text: str\n text: str\n text_embedding: list\n image: Optional[str] = None\n score: Optional[float] = None\n rerank_score: Optional[float] = None\n \n @classmethod\n def from_chunked_post(\n cls, chunked_post: ChunkedPost, embedding_model: EmbeddingModelSingleton\n ) -> \"EmbeddedChunkedPost\":\n ... 
# Compute embedding.\n \n return cls(...)\n \n\nNow, the data at each step is validated and has a clear structure.\n\n**Note:** Providing different types when instantiating a _pydantic_ model will\nthrow a validation error. For example, if the _post_id_ is defined as a\n_string_ , and we try to instantiate an _EmbeddedChunkedPost_ with a _None_\nor _int_ _post_id_ , it will throw an error.\n\n> Check out the full implementation on our \ud83d\udd17 GitHub Articles Hub repository.\n\n#### 3.3. Load to Qdrant\n\nTo load the LinkedIn posts to Qdrant, you have to override Bytewax's\n_StatelessSinkPartition_ class (which acts as an **output** in a Bytewax\nflow):\n\n \n \n class QdrantVectorSink(StatelessSinkPartition):\n def __init__(\n self,\n client: QdrantClient,\n collection_name: str\n ):\n self._client = client\n self._collection_name = collection_name\n \n def write_batch(self, chunks: list[EmbeddedChunkedPost]):\n ... # Map chunks to ids, embeddings, and metadata.\n \n self._client.upsert(\n collection_name=self._collection_name,\n points=Batch(\n ids=ids,\n vectors=embeddings,\n payloads=metadata,\n ),\n )\n\nWithin this class, you must overwrite the _write_batch()_ method, where we\nwill serialize every _EmbeddedChunkedPost_ to a format expected by Qdrant and\nload it to the vector DB.\n\n### 4\\. Retrieval client\n\nHere, we focus on preprocessing a user's query, searching the vector DB, and\npostprocessing the retrieved posts for maximum results.\n\nTo design the retrieval step, we implement a _QdrantVectorDBRetriever_ class\nto expose all the necessary features for our retrieval client.\n\n \n \n class QdrantVectorDBRetriever:\n def __init__(\n self,\n embedding_model: EmbeddingModelSingleton,\n vector_db_client: QdrantClient,\n cross_encoder_model: CrossEncoderModelSingleton\n vector_db_collection: str\n ):\n self._embedding_model = embedding_model\n self._vector_db_client = vector_db_client\n self._cross_encoder_model = cross_encoder_model\n self._vector_db_collection = vector_db_collection\n \n def search(\n self, query: str, limit: int = 3, return_all: bool = False\n ) -> Union[list[EmbeddedChunkedPost], dict[str, list]]:\n ... # Search the Qdrant vector DB based on the given query.\n \n def embed_query(self, query: str) -> list[list[float]]:\n ... # Embed the given query.\n \n def rerank(self, query: str, posts: list[EmbeddedChunkedPost]) -> list[EmbeddedChunkedPost]:\n ... # Rerank the posts relative to the given query.\n \n def render_as_html(self, post: EmbeddedChunkedPost) -> None:\n ... # Map the embedded post to HTML to display it.\n\n#### 4.1. Embed query\n\nWe must embed the query in precisely the same way we ingested our posts into\nthe vector DB. Because the streaming pipeline is written in Python (thanks to\nBytewax), and every preprocessing operation is modular, we can quickly\nreplicate all the steps necessary to embed the query.\n\n \n \n class QdrantVectorDBRetriever:\n \n ...\n \n def embed_query(self, query: str) -> list[list[float]]:\n cleaned_query = CleanedPost.clean(query)\n chunks = ChunkedPost.chunk(cleaned_query, self._embedding_model)\n embdedded_queries = [\n self._embedding_model(chunk, to_list=True) for chunk in chunks\n ]\n \n return embdedded_queries\n\n> Check out the full implementation on our \ud83d\udd17 GitHub repository.\n\n#### 4.2. 
Plain retrieval\n\nLet\u2019s try to retrieve a set of posts without using the rerank algorithm.\n\n \n \n vector_db_retriever = QdrantVectorDBRetriever(\n embedding_model=EmbeddingModelSingleton(),\n vector_db_client=build_qdrant_client()\n )\n \n query = \"Posts about Qdrant\"\n retrieved_results = vector_db_retriever.search(query=query)\n for post in retrieved_results[\"posts\"]:\n vector_db_retriever.render_as_html(post)\n\nHere are the **top 2 retrieved results** sorted using the cosine similarity\nscore \u2193\n\n**Result 1:**\n\nResult 1 for the \"Posts about Qdrant\" query (without using reranking) [Image\nby the Author - in collaboration with VectorHub]\n\n**Result 2:**\n\nResult 2 for the \"Posts about Qdrant\" query (without using reranking) [Image\nby the Author - in collaboration with VectorHub]\n\nYou can see from the results above, that starting from the second post the\nresults are irrelevant. Even though it has a cosine similarly score of ~0.69\nthe posts doesn\u2019t contain any information about Qdrant or vector DBs.\n\n**Note:** We looked over the top 5 retrieved results. Nothing after the first\npost was relevant. We haven\u2019t added them here as the article is already too\nlong.\n\n#### 4.3. Visualize retrieval\n\nTo visualize our retrieval, we implement a dedicated class that uses the UMAP\ndimensionality reduction algorithm. We have picked UMAP as it preserves the\ngeometric properties between points (e.g., the distance) in higher dimensions\nwhen they are projected onto lower dimensions better than its peers (e.g.,\nPCA, t-SNE).\n\nThe _RetrievalVisualizer_ computes the projected embeddings for the entire\nvector space once. Afterwards, it uses the render() method to project only the\ngiven query and retrieved posts, and plot them to a 2D graph.\n\n \n \n class RetrievalVisualizer:\n def __init__(self, posts: list[EmbeddedChunkedPost]):\n self._posts = posts\n \n self._umap_transform = self._fit_model(self._posts)\n self._projected_post_embeddings = self.project_posts(self._posts)\n \n def _fit_model(self, posts: list[EmbeddedChunkedPost]) -> umap.UMAP:\n umap_transform = ... # Fit a UMAP model on the given posts.\n \n return umap_transform\n \n def project_posts(self, posts: list[EmbeddedChunkedPost]) -> np.ndarray:\n embeddings = np.array([post.text_embedding for post in posts])\n \n return self._project(embeddings=embeddings)\n \n def _project(self, embeddings: np.ndarray) -> np.ndarray:\n ... # Project the embeddings to 2D using UMAP.\n \n return umap_embeddings\n \n def render(\n self,\n embedded_queries: list[list[float]],\n retrieved_posts: list[EmbeddedChunkedPost],\n ) -> None:\n ... # Render the given queries & retrieved posts using matplotlib.\n\nLet's take a look at the result to see how the _\" Posts about Qdrant\"_ query\nlooks \u2193\n\nVisualization of the \u201cPosts about Qdrant\u201d query using UMAP (without reranking)\n[Image by the Author - in collaboration with VectorHub].\n\nOur results are not great. You can see how far the retrieved posts are from\nour query in the vector space.\n\nCan we improve the quality of our retrieval system using the **rerank**\nalgorithm?\n\n#### 4.4. Rerank\n\nWe use the _reranking_ algorithm to refine our retrieval for the initial\nquery. Our initial retrieval step - because it used cosine similarity (or\nsimilar distance metrics) to compute the distance between a query and post\nembeddings - may have missed more complex (but essential) relationships\nbetween the query and the documents in the vector space. 
Reranking leverages\nthe power of transformer models that are capable of understanding more nuanced\nsemantic relationships.\n\nWe use a **cross-encoder** model to implement the reranking step, so we can\nscore the query relative to all retrieved posts individually. These scores\ntake into consideration more complex relationships than cosine similarity can.\nUnder the hood is a BERT classifier that outputs a number between 0 and 1\naccording to how similar the 2 given sentences are. The BERT classifier\noutputs 0 if they are entirely different and 1 if they are a perfect match.\n\nBi-Encoder vs. Cross-Encoder [Image by the Author - in collaboration with\nVectorHub]\n\nBut, you might ask, \"_Why not use the**cross-encoder** model from the start if\nit is that much better?\"_\n\nThe answer, in a word, is speed. Using a cross-encoder model to search your\nwhole collection is much slower than using cosine similarity. To optimize your\nretrieval, therefore, your reranking process should involve 2 steps:\n\n 1. an initial rough retrieval step using cosine similarity, which retrieves the top N items as potential candidates\n\n 2. filtering the rough search using the rerank strategy, which retrieves the top K items as your final results\n\nThe implementation is relatively straightforward. For each retrieved post, we\ncreate a pair consisting of the (cleaned) query and the text of the post. We\ndo this for all retrieved posts, resulting in a list of pairs.\n\nNext, we call a _cross-encoder/ms-marco-MiniLM-L-6-v2_ model (from sentence-\ntransformers) to give the retrieved posts their rerank score. We then sort the\nposts in descending order based on their rerank score.\n\n> Check out the rerank algorithm implementation on our \ud83d\udd17 GitHub repository.\n\n#### 4.5. Visualize retrieval with rerank\n\nNow that we've added the rerank pattern to our retrieval system, let's see if\nit improves the results of our _\" Posts about Qdrant\"_ query \u2193\n\n**Result 1**\n\nResult 1 for the \"Posts about Qdrant\" query (using reranking) [Image by the\nAuthor - in collaboration with VectorHub]\n\n**Result 2:**\n\nResult 2 for the \"Posts about Qdrant\" query (using reranking) [Image by the\nAuthor - in collaboration with VectorHub]\n\nThe improvement is remarkable! All our results are about Qdrant and vector\nDBs.\n\n**Note:** We looked over the top 5 retrieved results. The top 4 out of 5 posts\nare relevant to our query, which is incredible.\n\nNow, let's look at the UMAP visualization:\n\nVisualization of the \u201cPosts about Qdrant\u201d query using UMAP (with reranking)\n[Image by the Author - in collaboration with VectorHub].\n\nWhile the returned posts aren't very close to the query, they are **a lot\ncloser to the query compared to when we weren't reranking the retrieved\nposts**.\n\n* * *\n\n### 5\\. Conclusion\n\nIn this article, we learned how to adapt a RAG retrieval pattern to improve\nLinkedIn post retrieval. To keep our database up to date with rapidly changing\nsocial media data, we implemented a real-time streaming pipeline that uses CDC\nto sync the raw LinkedIn posts data source with a vector DB. You also saw how\nto use Bytewax to write - using only Python - a streaming pipeline that\ncleans, chunks, and embeds LinkedIn posts.\n\nFinally, you learned how to implement a standard retrieval client for RAG and\nsaw how to improve it using the rerank pattern. 
As retrieval is complex to\nevaluate, you saw how to visualize the retrieval for a given query by\nrendering all the posts, the query, and the retrieved posts in a 2D space\nusing UMAP.\n\n> This **article** is a **summary** __ of **my contribution** from\n> **VectorHub**. Check out the full article here to **dig** **into** the\n> **details,** the**code** and **more experiments**.\n\n31\n\nShare this post\n\n#### A Real-time Retrieval System for RAG on Social Media Data\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n4\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| OlaMar 8Liked by Paul IusztinNice read, full of insights.Expand full\ncommentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n| VenkataMar 23Liked by Paul IusztinExcellent article. Thanks a lot for\nposting this.Expand full commentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n2 more comments...\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/a-real-time-retrieval-system-for?r=1ttoeh" }, { "id": "cb6e689e-e718-42c8-80b1-44db7d568c3b", "content": { "Title": "4 key decoding strategies for LLMs that you must know", "Subtitle": "The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### 4 key decoding strategies for LLMs that you must know\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# 4 key decoding strategies for LLMs that you must know\n\n### The only 6 prompt engineering techniques you need to know. One thing that\nI do that sets me apart from the crowd.\n\nPaul Iusztin\n\nFeb 15, 2024\n\n9\n\nShare this post\n\n#### 4 key decoding strategies for LLMs that you must know\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nHello everyone,\n\nI hope you enjoyed what Alex R. & Alex V. have prepared for you in their\nprevious articles.\n\nI promised that the 3 of us would dig deeper into more exciting topics about\nproduction-ready LLM and CV models.\n\n_\u2192 But this is just the beginning. 
Stay tuned for more production ML_ \ud83d\udd25\n\n* * *\n\n### **This week\u2019s topics:**\n\n * 4 key decoding strategies for LLMs that you must know\n\n * The only 6 prompt engineering techniques you need to know\n\n * One thing that I do that sets me apart from the crowd\n\n* * *\n\n> Want to build your first \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf7\ud835\uddf2\ud835\uddf0\ud835\ude01 but don't know where to start?\n\nIf you want to **learn** in a **structured** **way** to **build** hands-on\n**LLM systems** using good **LLMOps** principles\u2026\n\nWe want to **announce** that we just **released** **8 Medium lessons** for the\n**Hands-on LLMs** **course** that will put you on the right track \u2193\n\nWithin the **8 Medium lessons** , you will go step-by-step through the\n**theory** , **system** **design** , and **code** to learn how to build a:\n\n * **real-time streaming pipeline** (deployed on AWS) that uses Bytewax as the stream engine to listen to financial news, cleans & embeds the documents, and loads them to a vector DB\n\n * **fine-tuning pipeline** (deployed as a serverless continuous training) that fine-tunes an LLM on financial data using QLoRA, monitors the experiments using an experiment tracker and saves the best model to a model registry\n\n * **inference pipeline** built in LangChain (deployed as a serverless RESTful API) that loads the fine-tuned LLM from the model registry and answers financial questions using RAG (leveraging the vector DB populated with financial news)\n\nWe will also show you how to **integrate** various **serverless tools** , such\nas: \n \n\u2022 Comet ML as your ML Platform; \n\u2022 Qdrant as your vector DB; \n\u2022 Beam as your infrastructure.\n\nThe architecture of the system you will learn to build during the **Hands-on\nLLMs** course [Image by the Author].\n\n**Who is this for?**\n\nThe series targets MLE, DE, DS, or SWE who want to learn to engineer LLM\nsystems using LLMOps good principles.\n\n**How will you learn?**\n\nThe series contains 4 hands-on video lessons and the open-source code you can\naccess on GitHub.\n\n**Curious?** \u2193\n\nCheck out the 8 Medium lessons of the Hands-on LLMs course and start building\nyour own LLMs system:\n\n\ud83d\udd17 The Hands-on LLMs Medium Series\n\n* * *\n\n### 4 key decoding strategies for LLMs that you must know\n\nYou see, LLMs don't just spit out text. \n \nThey calculate \"logits\", which are mapped to probabilities for every possible\ntoken in their vocabulary. \n \nIt uses previous token IDs to predict the next most likely token (the auto-\nregressive nature of decoder models). \n \nThe real magic happens in the decoding strategy you pick \u2193 \n \n\\- Greedy Search \n\\- Beam Search \n\\- Top-K Sampling \n\\- Nucleus Sampling \n \n. \n \n\ud835\uddda\ud835\uddff\ud835\uddf2\ud835\uddf2\ud835\uddf1\ud835\ude06 \ud835\udde6\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5 \n \nIt only holds onto the most likely token at each stage. It's fast and\nefficient, but it is short-sighted. \n \n\ud835\uddd5\ud835\uddf2\ud835\uddee\ud835\uddfa \ud835\udde6\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5 \n \nThis time, you are not looking at just the token with the highest probability.\nBut you are considering the N most likely tokens. \n \nThis will create a tree-like structure, where each node will have N children. \n \nThe procedure repeats until you hit a maximum length or an end-of-sequence\ntoken. 
\n \nUltimately, you pick the leaf with the biggest score and recursively pick its\nparent until you hit the root node. \n \nFor example, in the graph below, we have \"\ud835\ude23\ud835\ude26\ud835\ude22\ud835\ude2e\ud835\ude34 = 2\" and \"\ud835\ude2d\ud835\ude26\ud835\ude2f\ud835\ude28\ud835\ude35\ud835\ude29 = 3\". \n \n\ud835\udde7\ud835\uddfc\ud835\uddfd-\ud835\uddde \ud835\udde6\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nThis technique extends the Beam search strategy and adds a dash of randomness\nto the generation process. \n \nInstead of just picking the most likely tokens, it's selecting a token\nrandomly from the top k most likely choices. \n \nThus, the tokens with the highest probability will appear more often, but\nother tokens will be generated occasionally to add some randomness\n(\"creativity\"). \n \n\ud835\udde1\ud835\ude02\ud835\uddf0\ud835\uddf9\ud835\uddf2\ud835\ude02\ud835\ude00 \ud835\udde6\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nIn this case, you're not just picking the top k most probable tokens here.\nYou're picking a cutoff value _p_ and forming a \"nucleus\" of tokens. \n \nIn other words, rather than selecting the top k most probable tokens, nucleus\nsampling chooses a cutoff value p such that the sum of the probabilities of\nthe selected tokens exceeds p. \n \nThus, at every step, you will have a various number of possible tokens\nincluded in the \"nucleus\" from which you sample. This introduces even more\ndiversity and creativity into your output. \n \n. \n \n\ud835\udde1\ud835\uddfc\ud835\ude01\ud835\uddf2: For \ud835\ude35\ud835\ude30\ud835\ude31-\ud835\ude2c and \ud835\ude2f\ud835\ude36\ud835\ude24\ud835\ude2d\ud835\ude26\ud835\ude36\ud835\ude34 \ud835\ude34\ud835\ude22\ud835\ude2e\ud835\ude31\ud835\ude2d\ud835\ude2a\ud835\ude2f\ud835\ude28, you can also use the \"\ud835\ude35\ud835\ude26\ud835\ude2e\ud835\ude31\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26\"\nhyperparameter to tweak the output probabilities. It is a parameter that\nranges from 0 to 1. A low temperature (e.g., 0.1) will decrease the entropy\n(randomness), making the generation more stable.\n\n4 key decoding strategies for LLMs that you must know [Image by the Author].\n\nTo summarize... \n \nThere are 2 main decoding strategies for LLMs: \n\\- greedy search \n\\- beam search \n \nTo add more variability and creativity to beam search, you can use: \n\\- top-k sampling \n\\- nucleus sampling\n\n* * *\n\n### The only 6 prompt engineering techniques you need to know\n\nThe whole field of prompt engineering can be reduced to these 6 techniques I\nuse almost daily when using ChatGPT (or other LLMs). \n \nHere they are \u2193 \n \n#1. \ud835\udc05\ud835\udc1e\ud835\udc30 \ud835\udc2c\ud835\udc21\ud835\udc28\ud835\udc2d \ud835\udc29\ud835\udc2b\ud835\udc28\ud835\udc26\ud835\udc29\ud835\udc2d\ud835\udc22\ud835\udc27\ud835\udc20 \n \nAdd in your prompt 2 or 3 high-quality demonstrations, each consisting of both\ninput and desired output, on the target task. \n \nThe LLM will better understand your intention and what kind of answers you\nexpect based on concrete examples. \n \n#2. 
\ud835\udc12\ud835\udc1e\ud835\udc25\ud835\udc1f-\ud835\udc1c\ud835\udc28\ud835\udc27\ud835\udc2c\ud835\udc22\ud835\udc2c\ud835\udc2d\ud835\udc1e\ud835\udc27\ud835\udc1c\ud835\udc32 \ud835\udc2c\ud835\udc1a\ud835\udc26\ud835\udc29\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc20 \n \nSample multiple outputs with \"temperature > 0\" and select the best one out of\nthese candidates. \n \nHow to pick the best candidate? \n \nIt will vary from task to task, but here are 2 primary scenarios \u2193 \n \n1\\. Some tasks are easy to validate, such as programming questions. In this\ncase, you can write unit tests to verify the correctness of the generated\ncode. \n \n2\\. For more complicated tasks, you can manually inspect them or use another\nLLM (or another specialized model) to rank them. \n \n#3. \ud835\udc02\ud835\udc21\ud835\udc1a\ud835\udc22\ud835\udc27-\ud835\udc28\ud835\udc1f-\ud835\udc13\ud835\udc21\ud835\udc28\ud835\udc2e\ud835\udc20\ud835\udc21\ud835\udc2d (\ud835\udc02\ud835\udc28\ud835\udc13) \n \nYou want to force the LLM to explain its thought process, which eventually\nleads to the final answer, step by step. \n \nThis will help the LLM to reason complex tasks better. \n \nYou want to use CoT for complicated reasoning tasks + large models (e.g., with\nmore than 50B parameters). Simple tasks only benefit slightly from CoT\nprompting. \n \nHere are a few methods to achieve CoT: \n\\- provide a list of bullet points with all the steps you expect the LLM to\ntake \n\\- use \"Few shot prompt\" to teach the LLM to think in steps \n \n... or my favorite: use sentences such as \"Let's think step by step.\" \n \n#4. \ud835\udc00\ud835\udc2e\ud835\udc20\ud835\udc26\ud835\udc1e\ud835\udc27\ud835\udc2d\ud835\udc1e\ud835\udc1d \ud835\udc0f\ud835\udc2b\ud835\udc28\ud835\udc26\ud835\udc29\ud835\udc2d\ud835\udc2c \n \nThe LLM's internal knowledge is limited to the data it was trained on. Also,\noften, it forgets specific details of older training datasets. \n \nThe most common use case is Retrieval-Augmented Generation (RAG). \n \nThat is why using the LLM as a reasoning engine is beneficial to parse and\nextract information from a reliable source of information given as context in\nthe prompt. \n \n\ud835\ude1e\ud835\ude29\ud835\ude3a? \n\\- avoid retraining the model on new data \n\\- avoid hallucinating \n\\- access to references on the source \n \n#5. \ud835\udc00 \ud835\udc2c\ud835\udc22\ud835\udc27\ud835\udc20\ud835\udc25\ud835\udc1e \ud835\udc2b\ud835\udc1e\ud835\udc2c\ud835\udc29\ud835\udc28\ud835\udc27\ud835\udc2c\ud835\udc22\ud835\udc1b\ud835\udc22\ud835\udc25\ud835\udc22\ud835\udc2d\ud835\udc32 \ud835\udc29\ud835\udc1e\ud835\udc2b \ud835\udc29\ud835\udc2b\ud835\udc28\ud835\udc26\ud835\udc29\ud835\udc2d \n \nQuite self-explanatory. It is similar to the DRY principle in SWE. \n \nHaving only x1 task/prompt is good practice to avoid confusing the LLM. \n \nIf you have more complex tasks, split them into granular ones and merge the\nresults later in a different prompt. \n \n#6. \ud835\udc01\ud835\udc1e \ud835\udc1a\ud835\udc2c \ud835\udc1e\ud835\udc31\ud835\udc29\ud835\udc25\ud835\udc22\ud835\udc1c\ud835\udc22\ud835\udc2d \ud835\udc1a\ud835\udc2c \ud835\udc29\ud835\udc28\ud835\udc2c\ud835\udc2c\ud835\udc22\ud835\udc1b\ud835\udc25\ud835\udc1e \n \nThe LLM cannot read your mind. To maximize the probability of getting\nprecisely what you want, you can imagine the LLM as a 7-year-old to whom you\nmust explain everything step-by-step to be sure he understood. 
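To make a couple of these techniques concrete, here is a small, hypothetical sketch that combines few-shot prompting (#1) with chain-of-thought (#3); the chat-message structure and the prompt content are illustrative only:

    few_shot_cot_messages = [
        # System prompt: sets the role and nudges the model to reason step by step (CoT).
        {"role": "system", "content": "You are a careful financial assistant. Think step by step."},
        # One demonstration (few-shot): an input plus the desired style of answer.
        {"role": "user", "content": "Is a 20% annual return a realistic long-term expectation for an index fund?"},
        {
            "role": "assistant",
            "content": (
                "Step 1: Broad-market index funds have historically returned far less than 20% per year on average. "
                "Step 2: Expecting 20% every single year also ignores volatility and drawdowns. "
                "Answer: No, 20% per year is not a realistic long-term expectation."
            ),
        },
        # The actual question, which the model should answer in the same step-by-step style.
        {"role": "user", "content": "Is keeping all of my savings in a single stock a good idea?"},
    ]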
\n \n\ud835\ude15\ud835\ude30\ud835\ude35\ud835\ude26: The level of detail in the prompt is inversely proportional to the size\n& complexity of the model.\n\n[Image generated by DALL-E]\n\nThe truth is that prompt engineering is quite intuitive, and we don't have to\noverthink it too much. \n \nWhat would you add to this list?\n\n* * *\n\n### One thing that I do that sets me apart from the crowd\n\nHere is one thing that I do that sets me apart from the crowd: \n \n\"\ud835\ude10 \ud835\ude22\ud835\ude2e \ud835\ude30\ud835\ude2c\ud835\ude22\ud835\ude3a \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude23\ud835\ude26\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude25\ud835\ude36\ud835\ude2e\ud835\ude31 \ud835\ude30\ud835\ude2f\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude22\ud835\ude35 \ud835\ude22\ud835\ude34\ud835\ude2c\ud835\ude34 \ud835\ude2e\ud835\ude22\ud835\ude2f\ud835\ude3a \ud835\ude32\ud835\ude36\ud835\ude26\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\ud835\ude34.\" \n \n\ud835\udc07\ud835\udc26\ud835\udc26... \ud835\udc16\ud835\udc21\ud835\udc32? \n \nThe reality is that even the brightest minds cannot understand everything from\nthe first shot. \n \nIt is not necessarily that you cannot understand the concepts. \n \nThere are other factors, such as: \n\\- you are tired \n\\- you haven't paid enough attention \n\\- the concept wasn't explained at your level \n\\- the presenter wasn't clear enough, etc. \n \nAlso, the truth is that many of us don't understand everything from the first\nshot when presented with a new concept. \n \nBut because of our ego, we are afraid to come out and ask something because we\nare worried that we will sound stupid. \n \nThe jokes are on you. \n \nMost people will be grateful you broke the ice and asked to explain the\nconcept again. \n \n\ud835\udc16\ud835\udc21\ud835\udc32? \n \nIt will help the team to learn the new concepts better. \n \nIt will start a discussion to dig deeper into the subject. \n \nIt will piss off or annoy the people you don't like. \n \nIt will help other people ask questions next time. \n \nIt will open up new perspectives on the problem.\n\nTo conclude... \n \nIgnore your ego and what people think of you. Own your curiosity and ask\nquestions when you feel like it. \n \nIt is ok not to know everything. \n \nIt is better to be stupid for 5 minutes than your entire life.\n\n* * *\n\nCongrats on learning something new today!\n\n**Don\u2019t hesitate to share your thoughts - we would love to hear them.**\n\n_**\u2192** Remember, when ML looks **encoded - we\u2019ll help you decode it.**_\n\nSee you next Thursday at 9:00 am CET.\n\nHave a fantastic weekend!\n\n9\n\nShare this post\n\n#### 4 key decoding strategies for LLMs that you must know\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/4-key-decoding-strategies-for-llms?r=1ttoeh" }, { "id": "50a5a621-5799-4214-990d-3387ecc704e1", "content": { "Title": "DML: New year, the new & improved Decoding ML - What to expect?", "Subtitle": "How we plan to grow, provide more qualitative & hands-on content, and real-world ML projects to expand your professional skills", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: New year, the new & improved Decoding ML - What to expect?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: New year, the new & improved Decoding ML - What to expect?\n\n### How we plan to grow, provide more qualitative & hands-on content, and\nreal-world ML projects to expand your professional skills\n\nPaul Iusztin\n\n,\n\nAlex Razvant\n\n, and\n\nVesa Alexandru\n\nJan 11, 2024\n\n10\n\nShare this post\n\n#### DML: New year, the new & improved Decoding ML - What to expect?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\nThis newsletter will differ from the others as I want to share my plans for\nthe Decoding ML newsletter with you.\n\n> From now on, it will cost $1000/month. **Joking.** It will still be free.\n> It\u2019s not about the money but about growth, better quality & added value.\n\nTo be 100% transparent with you, I started this newsletter as an experiment,\nbut when I saw people who actually read it, the perfectionist in me screamed\nthat I should improve it and move to the next step.\n\nThis is the next step. And I\u2019m taking you with me.\n\nThe big news is that I will go all in, pouring more time and resources into\ngrowing the Decoding ML newsletter. My main goals are to:\n\n * push better-quality content every week\n\n * bring more real-world projects to increase your hands-on skills\n\n * increases the number of articles with code examples to make it practical so you can benefit from it even more at your job \n\n> As the world constantly changes, especially AI, MLE & MLOps, you cannot\n> stagnate. Decoding ML\u2019s growth is about providing you with all the MLE &\n> MLOps necessary resources to grow with it and smash it at your projects and\n> job.\n\n* * *\n \n \n _So.. How do I plan to grow the Decoding ML newsletter?_\n\n## Well, there are 3 main steps \u2193\n\n## #1. Rebranding\n\nFrom now on, my face will no longer be the \u201clogo\u201d of Decoding ML.\n\nThis will be the new logo of Decoding ML \u2193\n\nSo you don\u2019t have to see my annoying face every Thursday morning in your email\n\ud83e\udd23\n\n* * *\n\n## #2. 
Bringing in talent\n\nAs I wanted to push more content of higher quality, I had to bring in more\ntalented people to write beside me.\n\nI was lucky enough to know Alex Razvant and Alex Vesa, who are 2 fantastic MLE\n& MLOps engineers with 10 years of hands-on experience in the AI industry.\n\nFrom now on, they will start contributing to the Decoding ML newsletter and\nteam along with me.\n\n> Maybe you know this famous saying: \u201c**If you want to go fast, go alone; if\n> you want to go far, go together**.\u201d \u2026and I want Decoding ML to go far.\n\nOur primary goal is to help you level up in MLE & MLOps by offering hands-on\nexamples that you can use at your job.\n\nI plan to improve the quality of the articles by including more code and\nconcrete examples besides the system design talks we have discussed so far.\n\n\u2026and here enters the scene \u201cThe Alex\u2019s\u201d\n\nI have worked with them, and I know they are talented experts with fantastic\nhands-on MLE & MLOps skills and insights to share with you.\n\nStarting from now on, Decoding ML will no longer be a one-person brand but a\nbrand by itself, hosted by the new Decoding ML team:\n\n * myself\n\n * Alex Vesa\n\n * Alex Razvant\n\n### #2.1. Now, let the team introduce itself \u2193\n\n#### _**Alex Vesa**_\n\n _Main niche: \u201cDeep Learning/Computer Vision | ML System Infrastructure | Startups | Business\u201d_\n\n\u21b3 \ud83d\udd17 LinkedIn \n\nHello everyone,\n\n \nI\u2019m very grateful for this opportunity. I consider creativity and inspiration\nto flourish when there's a merger of minds from various individuals.\n\nMy professional journey began in 2015, initially focusing on software\nengineering with a keen interest in Python and AI technologies. I quickly\nprogressed, taking on challenging roles and AI projects. My experience in\nvarious startups as a CTO focused on leading teams in developing innovative\nsoftware solutions. I worked in multiple sectors, notably healthcare and\nautomotive, where I've implemented AI-driven systems to enhance operational\nefficiency.\n\nMy technical skills are broad, encompassing Python, Django, and AWS. I'm\ndedicated to leveraging my AI and software development expertise to drive\norganizational success in this dynamic field.\n\nI value knowledge-sharing among our community, and my objective is to bring\nsolid expertise in practical, real-world AI/ML systems to help you in your\nday-to-day work and enhance your creativity and vision in product development.\n\nUltimately, I want to share with you the endless capabilities you can possess\nto evolve.\n\n#### _Alex Razvant_\n\n _Main niche: \u201cML/CV Systems in Production | MLOps_ /_Edge ML Deployments\u201d_\n\n\u21b3 \ud83d\udd17 LinkedIn\n\nHey everyone,\n\nI\u2019m really happy about this merger, as you\u2019ll get 3X more quality content in a\nconcise, valuable, and actionable manner directly to your inbox!\n\nHere are a few words about who I am:\n\nI started my journey as a SWE in 2015, diving into full-stack web development. 
\nAfter a few internships, hackathons, and a few failed projects, the ML field\ncaught my eye, and I haven\u2019t looked back ever since.\n\nMy journey includes over **15+** successful freelance projects, earning a\n**Top-Rated** ML Engineer badge on **UpWork** , collaborating with **BMW** on\nAI for self-driving cars, authoring a paper for IEEE RAL 2020, and developing\nscalable Computer Vision systems to analyze 1000+ hours of CCTV footage.\n\nI aim to bring solid expertise via **code tutorials, diagrams, and system\ndesigns** to help you overcome challenges in building and deploying ML & CV\nsystems in cloud or edge environments, following the best practices I\u2019ve\nlearned in SWE, ML, and MLOps.\n\n> _Follow them & check them out on LinkedIn to see their incredible experience\n> in AI._\n\n### #2.2. Will we start approaching different topics?\n\n_TL/DR: No!_\n\nI was meticulous in bringing in more people with the same vision.\n\nThus, Decoding ML will approach the same niche as it has done: _\u201cproduction-\nready MLE & MLOps topics.\u201d_\n\nSo\u2026 you don\u2019t have to unsubscribe. We will keep talking about the same topics\nyou chose to follow in our newsletter: _\u201chands-on MLE & MLOps topics\u201d_\n\nHowever, the advantage of having more people with different backgrounds on the\nteam is that we all come with different perspectives and domain knowledge.\n\nFor example:\n\n * Alex Razvant worked a lot with Computer Vision, Deep Learning, and MLOps technologies in the world of retail\n\n * Alex Vesa has a lot of experience with Deep Learning and infrastructure projects in the medical field\n\n * I am passioned about generative AI, MLOps, and SWE\n\n\u2026combining our knowledge will result in exciting production-ready MLE & MLOps\narticles that will significantly benefit you.\n\n* * *\n\n## #3. Expanding to new distribution channels\n\nEvery person consumes content differently.\n\nSo, we'd like to give you the best fit to enjoy our content.\n\nWe already started a Decoding ML Medium publication, where we will start this\nmonth to push a deep dive into the code of the Hands-on LLMs Course.\n\n\u2026and slowly, we will expand to video format content on:\n\n * Youtube\n\n * Instagram\n\n * TikTok\n\nAlso, we started planning a set of eBooks about MLE, MLOps and LLMOps and a\nnew course about LLMs and LLMOps.\n\n* * *\n\n### So\u2026 What happens next?\n\nI hope you are excited about the news. For sure, I am \ud83d\udd25\n\n> _Next Thursday at 9:00 a.m. CET_ , **Alex Vesa** will make his **grand\n> opening** by writing a step-by-step article on **how** you can **deploy an\n> LLaMA2-7b LLM** using **Amazon SageMaker** and **HuggingFace**.\n\nTo conclude, you don\u2019t have to do anything on your side.\n\n_Decoding ML follows its natural course by bringing in more people and\nexpanding to other platforms to give you more value for your time and a more\npersonalized way to enjoy our content._\n\nSee you next Thursday!\n\nHave a fantastic weekend! \u270c\ud83c\udffb\n\nPaul\n\n10\n\nShare this post\n\n#### DML: New year, the new & improved Decoding ML - What to expect?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| Ahmed BesbesThe Tech Buffet Jan 11Liked by Paul IusztinGreat things coming\nahead Paul! 
Looking forward to it!Expand full commentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n1 more comment...\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-new-year-the-new-and-improved?r=1ttoeh" }, { "id": "e85a60a3-6667-45fe-81fd-9384322b7cea", "content": { "Title": "DML: 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer", "Subtitle": "How to successfully present MLOps ideas to upper management. How I generated PyDocs for 100 Python functions in <1 hour", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: 8 types of MLOps tools that must be in your toolbelt to be a\nsuccessful MLOps engineer\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: 8 types of MLOps tools that must be in your toolbelt to be a successful\nMLOps engineer\n\n### How to successfully present MLOps ideas to upper management. How I\ngenerated PyDocs for 100 Python functions in <1 hour\n\nPaul Iusztin\n\nJan 04, 2024\n\n18\n\nShare this post\n\n#### DML: 8 types of MLOps tools that must be in your toolbelt to be a\nsuccessful MLOps engineer\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\nThe last Hands-on LLM series finished last week. In case you are curious, here\nare the top 3 out of 9 lessons of the series:\n\n 1. Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?\n\n 2. Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?\n\n 3. Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?\n\n* * *\n\n#### **This week\u2019s topics:**\n\n 1. 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer\n\n 2. How to successfully present MLOps ideas to upper management\n\n 3. How I generated PyDocs for 100 Python functions in <1 hour\n\n* * *\n\n\u2192 Before diving into the topics, I have one important thing to share with you.\n\n> We finally finished the code & video lessons for the**Hands-on LLMs** course\n> \ud83d\udd25\n\nBy finishing the **Hands-On LLMs** free course, you will learn how to use the\n3-pipeline architecture & LLMOps good practices to design, build, and deploy a\nreal-time financial advisor powered by LLMs & vector DBs. \n \nWe will primarily focus on the engineering & MLOps aspects. \n \nThus, by the end of this series, you will know how to build & deploy a real ML\nsystem, not some isolated code in Notebooks. 
\n \n\ud835\udc0c\ud835\udc28\ud835\udc2b\ud835\udc1e \ud835\udc29\ud835\udc2b\ud835\udc1e\ud835\udc1c\ud835\udc22\ud835\udc2c\ud835\udc1e\ud835\udc25\ud835\udc32, \ud835\udc2d\ud835\udc21\ud835\udc1e\ud835\udc2c\ud835\udc1e \ud835\udc1a\ud835\udc2b\ud835\udc1e \ud835\udc2d\ud835\udc21\ud835\udc1e 3 \ud835\udc1c\ud835\udc28\ud835\udc26\ud835\udc29\ud835\udc28\ud835\udc27\ud835\udc1e\ud835\udc27\ud835\udc2d\ud835\udc2c \ud835\udc32\ud835\udc28\ud835\udc2e \ud835\udc30\ud835\udc22\ud835\udc25\ud835\udc25 \ud835\udc25\ud835\udc1e\ud835\udc1a\ud835\udc2b\ud835\udc27 \ud835\udc2d\ud835\udc28 \ud835\udc1b\ud835\udc2e\ud835\udc22\ud835\udc25\ud835\udc1d: \n \n1\\. a \ud835\udc2b\ud835\udc1e\ud835\udc1a\ud835\udc25-\ud835\udc2d\ud835\udc22\ud835\udc26\ud835\udc1e \ud835\udc2c\ud835\udc2d\ud835\udc2b\ud835\udc1e\ud835\udc1a\ud835\udc26\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e (deployed on AWS) that listens to financial\nnews, cleans & embeds the documents, and loads them to a vector DB \n \n2\\. a \ud835\udc1f\ud835\udc22\ud835\udc27\ud835\udc1e-\ud835\udc2d\ud835\udc2e\ud835\udc27\ud835\udc22\ud835\udc27\ud835\udc20 \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e (deployed as a serverless continuous training) that\nfine-tunes an LLM on financial data using QLoRA, monitors the experiments\nusing an experiment tracker and saves the best model to a model registry \n \n3\\. an \ud835\udc22\ud835\udc27\ud835\udc1f\ud835\udc1e\ud835\udc2b\ud835\udc1e\ud835\udc27\ud835\udc1c\ud835\udc1e \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e built in LangChain (deployed as a serverless RESTful\nAPI) that loads the fine-tuned LLM from the model registry and answers\nfinancial questions using RAG (leveraging the vector DB populated with\nfinancial news in real-time) \n \nWe will also show you how to integrate various serverless tools, such as: \n \n\u2022 Comet ML as your ML Platform; \n\u2022 Qdrant as your vector DB; \n\u2022 Beam as your infrastructure. \n \n\ud835\udc16\ud835\udc21\ud835\udc28 \ud835\udc22\ud835\udc2c \ud835\udc2d\ud835\udc21\ud835\udc22\ud835\udc2c \ud835\udc1f\ud835\udc28\ud835\udc2b? \n \nThe series targets MLE, DE, DS, or SWE who want to learn to engineer LLM\nsystems using LLMOps good principles. \n \n\ud835\udc07\ud835\udc28\ud835\udc30 \ud835\udc30\ud835\udc22\ud835\udc25\ud835\udc25 \ud835\udc32\ud835\udc28\ud835\udc2e \ud835\udc25\ud835\udc1e\ud835\udc1a\ud835\udc2b\ud835\udc27? \n \nThe series contains 4 hands-on video lessons and the open-source code you can\naccess on GitHub. \n \n\ud835\udc02\ud835\udc2e\ud835\udc2b\ud835\udc22\ud835\udc28\ud835\udc2e\ud835\udc2c? \n \n\u21b3 \ud83d\udd17 Check it out and support us with a \u2b50\n\nThe architecture of a financial bot powered by LLMs, vector DBs and MLOps\n[Image by the Authors]\n\n* * *\n\n### #1. 
8 types of MLOps tools that must be in your toolbelt to be a\nsuccessful MLOps engineer\n\nThese are the \ud835\udff4 \ud835\ude01\ud835\ude06\ud835\uddfd\ud835\uddf2\ud835\ude00 of \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\ude01\ud835\uddfc\ud835\uddfc\ud835\uddf9\ud835\ude00 that must be in your toolbelt to be a\n\ud835\ude00\ud835\ude02\ud835\uddf0\ud835\uddf0\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf3\ud835\ude02\ud835\uddf9 \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddf2\ud835\uddfb\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddff \u2193 \n \nIf you are into MLOps, you are aware of the 1000+ tools in the space and think\nyou have to know. \n \nThe reality is that all of these tools can be boiled down to 8 main\ncategories. \n \nIf you learn the fundamentals and master one tool from each category, you will\nbe fine. \n \n.\n\nBa\u015fak Tu\u011f\u00e7e Eskili\n\nand\n\nMaria Vechtomova\n\nfrom\n\nMarvelousMLOps\n\nwrote an excellent summary highlighting these 8 categories: \n \n1\\. \ud835\ude51\ud835\ude5a\ud835\ude67\ud835\ude68\ud835\ude5e\ud835\ude64\ud835\ude63 \ud835\ude58\ud835\ude64\ud835\ude63\ud835\ude69\ud835\ude67\ud835\ude64\ud835\ude61: crucial for the traceability and reproducibility of an ML\nmodel deployment or run. Without a version control system, it is difficult to\nfind out what exact code version was responsible for specific runs or errors\nyou might have in production. (\ud83d\udd27 GitHub, GitLab, etc.) \n \n2\\. \ud835\ude3e\ud835\ude44/\ud835\ude3e\ud835\ude3f: automated tests are triggered upon pull request creation &\ndeployment to production should only occur through the CD pipeline (\ud83d\udd27 GitHub\nActions, GitLab CI/CD, Jenkins, etc.) \n \n3\\. \ud835\ude52\ud835\ude64\ud835\ude67\ud835\ude60\ud835\ude5b\ud835\ude61\ud835\ude64\ud835\ude6c \ud835\ude64\ud835\ude67\ud835\ude58\ud835\ude5d\ud835\ude5a\ud835\ude68\ud835\ude69\ud835\ude67\ud835\ude56\ud835\ude69\ud835\ude5e\ud835\ude64\ud835\ude63: manage complex dependencies between different\ntasks, such as data preprocessing, feature engineering, ML model training (\ud83d\udd27\nAirflow, ZenML, AWS Step Functions, etc.) \n \n4\\. \ud835\ude48\ud835\ude64\ud835\ude59\ud835\ude5a\ud835\ude61 \ud835\ude67\ud835\ude5a\ud835\ude5c\ud835\ude5e\ud835\ude68\ud835\ude69\ud835\ude67\ud835\ude6e: store, version, and share trained ML model artifacts,\ntogether with additional metadata (\ud83d\udd27 Comet ML, W&B, MLFlow, etc.) \n \n5\\. \ud835\ude3f\ud835\ude64\ud835\ude58\ud835\ude60\ud835\ude5a\ud835\ude67 \ud835\ude67\ud835\ude5a\ud835\ude5c\ud835\ude5e\ud835\ude68\ud835\ude69\ud835\ude67\ud835\ude6e: store, version, and share Docker images. Basically, all\nyour code will be wrapped up in Docker images and shared through this registry\n(\ud83d\udd27 Docker Hub, ECR, etc.) \n \n6 & 7\\. \ud835\ude48\ud835\ude64\ud835\ude59\ud835\ude5a\ud835\ude61 \ud835\ude69\ud835\ude67\ud835\ude56\ud835\ude5e\ud835\ude63\ud835\ude5e\ud835\ude63\ud835\ude5c & \ud835\ude68\ud835\ude5a\ud835\ude67\ud835\ude6b\ud835\ude5e\ud835\ude63\ud835\ude5c \ud835\ude5e\ud835\ude63\ud835\ude5b\ud835\ude67\ud835\ude56\ud835\ude68\ud835\ude69\ud835\ude67\ud835\ude6a\ud835\ude58\ud835\ude69\ud835\ude6a\ud835\ude67\ud835\ude5a: if on-premise, you will\nlikely have to go with Kubernetes. There are multiple choices if you are on a\ncloud provider: Azure ML on Azure, Sagemaker on AWS, and Vertex AI on GCP. \n \n8\\. 
\ud835\ude48\ud835\ude64\ud835\ude63\ud835\ude5e\ud835\ude69\ud835\ude64\ud835\ude67\ud835\ude5e\ud835\ude63\ud835\ude5c: Monitoring in ML systems goes beyond what is needed for\nmonitoring regular software applications. The distinction lies in that the\nmodel predictions can fail even if all typical health metrics appear in good\ncondition. (\ud83d\udd27 SageMaker, NannyML, Arize, etc.) \n \nThe secret sauce in MLOps is knowing how to glue all these pieces together\nwhile keeping things simple. \n\n[Image from Marvelous MLOps]\n\n\u21b3\ud83d\udd17 To read more about these components, check out the article on\n\nMarvelousMLOps\n\n.\n\n* * *\n\n### #2. How to successfully present MLOps ideas to upper management\n\nHave you ever presented your MLOps ideas to upper management just to get\nghosted? \n \nIn that case... \n \n\nRapha\u00ebl Hoogvliets\n\n,\n\nBa\u015fak Tu\u011f\u00e7e Eskili\n\n, and\n\nMaria Vechtomova\n\nfrom\n\nMarvelousMLOps\n\npresented a great step-by-step strategy for pitching your MLOps ideas to your\nupper management and getting attention and resources to implement them. \n \nHere are the 6 steps you have to know \u2193 \n \n1\\. \ud835\udc02\ud835\udc28\ud835\udc25\ud835\udc25\ud835\udc1e\ud835\udc1c\ud835\udc2d \ud835\udc1a\ud835\udc25\ud835\udc25 \ud835\udc2d\ud835\udc21\ud835\udc1e \ud835\udc29\ud835\udc1a\ud835\udc22\ud835\udc27 \ud835\udc29\ud835\udc28\ud835\udc22\ud835\udc27\ud835\udc2d\ud835\udc2c \nTalk to data scientists, product owners, and stakeholders in your organization\nto gather issues such as: \n\\- time to deployment \n\\- poor quality deployment \n\\- non-existing monitoring \n\\- lack of collaboration \n\\- external parties \n \n2\\. \ud835\udc04\ud835\udc1d\ud835\udc2e\ud835\udc1c\ud835\udc1a\ud835\udc2d\ud835\udc1e \ud835\udc29\ud835\udc1e\ud835\udc28\ud835\udc29\ud835\udc25\ud835\udc1e \nOrganize workshops, meetings, etc., to present what MLOps is and how it can\nhelp. \n \nI think it's critical to present it to your target audience. For example, an\nengineer looks at the problem differently than the business stakeholders. \n \n3\\. \ud835\udc0f\ud835\udc2b\ud835\udc1e\ud835\udc2c\ud835\udc1e\ud835\udc27\ud835\udc2d \ud835\udc1b\ud835\udc1e\ud835\udc1f\ud835\udc28\ud835\udc2b\ud835\udc1e \ud835\udc1a\ud835\udc27\ud835\udc1d \ud835\udc1a\ud835\udc1f\ud835\udc2d\ud835\udc1e\ud835\udc2b \ud835\udc2c\ud835\udc1c\ud835\udc1e\ud835\udc27\ud835\udc1a\ud835\udc2b\ud835\udc22\ud835\udc28\ud835\udc2c \nShow how MLOps can solve the company's challenges and deliver tangible\nbenefits to the organization, such as: \n\\- less cost \n\\- fast deployment \n\\- better collaboration \n\\- less risk \n \n4\\. \ud835\udc0f\ud835\udc2b\ud835\udc28\ud835\udc2f\ud835\udc1e \ud835\udc22\ud835\udc2d \nUse concrete examples to support your ideas, such as: \n\\- how a competitor or an organization in the same or related field benefited\nfrom introducing MLOps \n\\- build a PoC within your organization \n \n5\\. \ud835\udc12\ud835\udc1e\ud835\udc2d \ud835\udc2e\ud835\udc29 \ud835\udc32\ud835\udc28\ud835\udc2e\ud835\udc2b \ud835\udc2d\ud835\udc1e\ud835\udc1a\ud835\udc26 \nChoose 2-3 experienced individuals (not juniors) to set up the foundations in\nyour team/organization. \n \nWith an emphasis on starting with experienced engineers and only later\nbringing more juniors to the party. \n \n6\\. 
\ud835\udc0a\ud835\udc1e\ud835\udc1e\ud835\udc29 \ud835\udc28\ud835\udc27 \ud835\udc24\ud835\udc1e\ud835\udc1e\ud835\udc29\ud835\udc22\ud835\udc27' \ud835\udc28\ud835\udc27 \nOnce you successfully apply MLOps to one use case, you can bring in more\nresponsibility by growing your team and taking on more projects. \n \n. \n \nAll of these are great tips for integrating MLOps in your organization. \n \nI love their \"Present before and after scenarios\" approach. \n \nYou can extrapolate this strategy for any other new processes (not only\nMLOps). \n \n. \n \n\u21b3\ud83d\udd17 To learn the details, check out the full article on\n\nMarvelousMLOps\n\n.\n\n* * *\n\n### #3. How I generated PyDocs for 100 Python functions in <1 hour\n\nThe most boring programming part is to write PyDocs, so I usually write clean\ncode and let it speak for itself. \n \nBut, for open-source projects where you have to generate robust documentation,\nPyDocs are a must. \n \nThe good news is that now you can automate this process using Copilot. \n \nYou can see in the video below an example of how easy it is. \n \nI tested it on more complex functions/classes, and it works well. I chose this\nexample because it fits nicely on one screen. \n \nOnce I tested Copilot's experience, I will never go back. \n \nIt is true that, in some cases, you have to make some minor adjustments. But\nthat is still 10000% more efficient than writing it from scratch. \n\nIf you want more examples, check out our **Hands-on LLMs** course, where all\nthe PyDocs are generated 99% using Copilot in <1 hour.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.). \n\n18\n\nShare this post\n\n#### DML: 8 types of MLOps tools that must be in your toolbelt to be a\nsuccessful MLOps engineer\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-8-types-of-mlops-tools-that-must?r=1ttoeh" }, { "id": "8ff6064c-9c09-494f-a42d-a60b0e80387c", "content": { "Title": "DML: This is what you need to build an inference pipeline for a financial assistant powered by LLMs, vector DBs and LLMOps", "Subtitle": "Lesson 9 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: This is what you need to build an inference pipeline for a financial\nassistant powered by LLMs, vector DBs and LLMOps\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: This is what you need to build an inference pipeline for a financial\nassistant powered by LLMs, vector DBs and LLMOps\n\n### Lesson 9 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nDec 28, 2023\n\n15\n\nShare this post\n\n#### DML: This is what you need to build an inference pipeline for a financial\nassistant powered by LLMs, vector DBs and LLMOps\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 9 | The Hands-on LLMs Series**\n\n> This is the **last lesson** within the **Hands-on LLMs** series... _But\n> certainly not the last MLE & MLOps series. We are cooking some exciting\n> stuff._ But I hope you had fun and learned much during this series.\n\nNow, let's see how to glue everything we have done so far under the inference\npipeline. Enjoy! \ud83e\uddc1\n\n#### **Table of Contents:**\n\n 1. Inference pipeline video lesson\n\n 2. What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs?\n\n 3. How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?\n\n#### Previous Lessons:\n\n * Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?\n\n * Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?\n\n * Lesson 8: 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. 
Inference pipeline video lesson\n\nWe \ud835\udc2b\ud835\udc1e\ud835\udc25\ud835\udc1e\ud835\udc1a\ud835\udc2c\ud835\udc1e\ud835\udc1d the \ud835\udc1f\ud835\udc22\ud835\udc27\ud835\udc1a\ud835\udc25 video \ud835\udc25\ud835\udc1e\ud835\udc2c\ud835\udc2c\ud835\udc28\ud835\udc27 of the \ud835\udc07\ud835\udc1a\ud835\udc27\ud835\udc1d\ud835\udc2c-\ud835\udc28\ud835\udc27 \ud835\udc0b\ud835\udc0b\ud835\udc0c\ud835\udc2c FREE course that will\nteach you how to \ud835\udc1b\ud835\udc2e\ud835\udc22\ud835\udc25\ud835\udc1d & \ud835\udc1d\ud835\udc1e\ud835\udc29\ud835\udc25\ud835\udc28\ud835\udc32 an \ud835\udc22\ud835\udc27\ud835\udc1f\ud835\udc1e\ud835\udc2b\ud835\udc1e\ud835\udc27\ud835\udc1c\ud835\udc1e \ud835\udc29\ud835\udc22\ud835\udc29\ud835\udc1e\ud835\udc25\ud835\udc22\ud835\udc27\ud835\udc1e for a financial advisor\nusing \ud835\udc0b\ud835\udc1a\ud835\udc27\ud835\udc20\ud835\udc02\ud835\udc21\ud835\udc1a\ud835\udc22\ud835\udc27, \ud835\udc0b\ud835\udc0b\ud835\udc0c\ud835\udc0e\ud835\udc29\ud835\udc2c, and \ud835\udc2f\ud835\udc1e\ud835\udc1c\ud835\udc2d\ud835\udc28\ud835\udc2b \ud835\udc03\ud835\udc01\ud835\udc2c. \n \n\ud835\ude0f\ud835\ude26\ud835\ude33\ud835\ude26 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude2c\ud835\ude26\ud835\ude3a \ud835\ude35\ud835\ude30\ud835\ude31\ud835\ude2a\ud835\ude24\ud835\ude34 \ud835\ude24\ud835\ude30\ud835\ude37\ud835\ude26\ud835\ude33\ud835\ude26\ud835\ude25 \ud835\ude2a\ud835\ude2f \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude37\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude30 \ud835\ude2d\ud835\ude26\ud835\ude34\ud835\ude34\ud835\ude30\ud835\ude2f made by Pau Labarta \ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude10\n\u2193 \n \n1\\. Overview of the architecture of the inference pipeline and how to apply\nLLMOps good practices \n \n2\\. How to build from scratch a RAG agent using LangChain:\nContextExtractorChain + FinancialBotQAChain \n \n3\\. How to attach a callback class to log input prompts and LLM answers to\nComet LLMOps \n \n4\\. Setting up and running the code locally \n \n5\\. Deploying the inference pipeline to Beam as a RESTful API \n \n. \n \n\ud835\ude0a\ud835\ude36\ud835\ude33\ud835\ude2a\ud835\ude30\ud835\ude36\ud835\ude34?\n\nCheck out the video lesson\n\nPau Labarta Bajo\n\nand I did \u2193\n\n* * *\n\n### #2. What do you need to build an inference pipeline for a financial\nassistant powered by LLMs and vector DBs?\n\nHere are its \ud835\udff3 \ud835\uddf8\ud835\uddf2\ud835\ude06 \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\ude00 \u2193 \n \n1\\. \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 \ud835\uddfd\ud835\uddfc\ud835\uddfd\ud835\ude02\ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddfb\ud835\uddf2\ud835\ude04\ud835\ude00: This is the output of the feature\npipeline. More concretely, a Qdrant vector DB populated with chunks of\nfinancial news from Alpaca. During the inference pipeline, we will use it to\nquery valuable chunks of information and do RAG. \n \n2\\. 
\ud835\uddf2\ud835\uddfa\ud835\uddef\ud835\uddf2\ud835\uddf1\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf9\ud835\uddee\ud835\uddfb\ud835\uddf4\ud835\ude02\ud835\uddee\ud835\uddf4\ud835\uddf2 \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9: To embed the user question and query the vector\nDB, you need the same embedding model used in the feature pipeline, more\nconcretely `\ud835\ude22\ud835\ude2d\ud835\ude2d-\ud835\ude14\ud835\ude2a\ud835\ude2f\ud835\ude2a\ud835\ude13\ud835\ude14-\ud835\ude136-\ud835\ude372` from `\ud835\ude34\ud835\ude26\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude2f\ud835\ude24\ud835\ude26-\ud835\ude35\ud835\ude33\ud835\ude22\ud835\ude2f\ud835\ude34\ud835\ude27\ud835\ude30\ud835\ude33\ud835\ude2e\ud835\ude26\ud835\ude33\ud835\ude34`. Using the same\nencoder-only model is crucial, as the query vector and vector DB index vectors\nhave to be in the same space. \n \n3\\. \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2\ud835\uddf1 \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0: The output of the training pipeline will be a\nfine-tuned Falcon 7B on financial tasks. \n \n4\\. \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\uddff\ud835\uddf2\ud835\uddf4\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude06: The fine-tuned model will be shared between the training &\ninference pipeline through Comet\u2019s model registry. By doing so, you decouple\nentirely the 2 components, and the model can easily be shared under specific\nenvironments (e.g., staging, prod) and versions (e.g., v1.0.1). \n \n5\\. \ud835\uddee \ud835\uddf3\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddee\ud835\uddfd\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00: You need LangChain, as your LLM\nframework, to glue all the steps together, such as querying the vector DB,\nstoring the history of the conversation, creating the prompt, and calling the\nLLM. LangChain provides out-of-the-box solutions to chain all these steps\ntogether quickly. \n \n6\\. \ud835\uddf1\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddee\ud835\uddfd\ud835\uddfd \ud835\uddee\ud835\ude00 \ud835\uddee \ud835\udde5\ud835\uddd8\ud835\udde6\ud835\udde7\ud835\uddf3\ud835\ude02\ud835\uddf9 \ud835\uddd4\ud835\udde3\ud835\udddc: One of the final steps is to deploy\nyour awesome LLM financial assistant under a RESTful API. You can quickly do\nthis using Beam as your serverless infrastructure provider. Beam specializes\nin DL. Thus, it offers quick ways to load your LLM application on GPU machines\nand expose it under a RESTful API. \n \n7\\. \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01 \ud835\uddfa\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4: The last step is to add eyes on top of your system. 
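As an illustration only (not the exact course code), logging each prompt/answer pair to an LLMOps tool can be as simple as the sketch below. It assumes the `comet_llm` package and its `log_prompt` helper, plus a hypothetical `log_interaction` wrapper, so double-check the current Comet docs before reusing it:

    import comet_llm

    def log_interaction(prompt: str, answer: str) -> None:
        # Log every prompt/response pair so you can inspect and debug the system later.
        comet_llm.log_prompt(
            prompt=prompt,
            output=answer,
            metadata={"model": "falcon-7b-lora", "pipeline": "inference"},  # hypothetical metadata
        )
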
You can do this using Comet's LLMOps features, which let you track and monitor all the prompts and responses of the system.

> ↳🔗 Check out how these components work together in our Hands-on LLMs free course.

* * *

### #3. How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?

How can you **build and deploy an inference pipeline** for a real-time financial advisor with **LangChain**, powered by **LLMs and vector DBs**, while following **good LLMOps practices**?

As a quick reminder from previous posts, here is what we already have:
- a Qdrant vector DB populated with financial news (the output of the feature pipeline)
- fine-tuned Falcon-7B LoRA weights stored in Comet's model registry (the output of the training pipeline)

The Qdrant vector DB is accessed through a Python client.

A specific version of the Falcon-7B LoRA weights is downloaded from Comet's model registry and loaded in memory using QLoRA.

The goal of the inference pipeline is to use LangChain to glue the two components into a single `**FinancialAssistant**` entity.

The `**FinancialAssistant**` entity is deployed in a request-response fashion under a RESTful API. We used Beam to deploy it quickly under a serverless web endpoint.

Deploying any model as a RESTful API with Beam is as easy as writing the following Python decorator:

    @financial_bot.rest_api(keep_warm_seconds=300, loader=load_bot)
    def run(**inputs):
        ...

Now let's understand the flow of the `**FinancialAssistant**` chain ↓

1\. Clean the user's input prompt and use the pre-trained "**all-MiniLM-L6-v2**" encoder-only model to embed it (the same LM used to populate the vector DB).

2\. Using the embedded user input, query the Qdrant vector DB and extract the top 3 most similar financial news chunks based on cosine similarity.

→ These 2 steps are what make up the RAG part. If you don't know how RAG works, check out Lesson 3.
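For orientation, here is a minimal, illustrative sketch of these first two steps (not the exact course code); the collection name `financial_news` and the payload key `text` are assumptions:

    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    # Same encoder-only model used by the feature pipeline to populate the vector DB.
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    qdrant = QdrantClient(url="http://localhost:6333")  # assumed local instance

    def retrieve_context(user_question: str, top_k: int = 3) -> list[str]:
        # Step 1: clean & embed the user question with the same model used at ingestion time.
        query_vector = embedder.encode(user_question.strip()).tolist()

        # Step 2: query the vector DB for the most similar financial news chunks (cosine similarity).
        hits = qdrant.search(
            collection_name="financial_news",  # assumed collection name
            query_vector=query_vector,
            limit=top_k,
        )
        return [hit.payload["text"] for hit in hits]  # assumes chunks are stored under a "text" key

The important design choice is reusing the exact same embedding model at query time as at ingestion time, so the query vector and the index vectors live in the same space.
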
3\. Build the final prompt using a "**PromptTemplate**" class (the same one used for training) that formats the following components:
- a system prompt
- the user's input prompt
- the financial news context
- the chat history

4\. Now that our prompt contains all the necessary data, we pass it to the fine-tuned Falcon-7B LLM for the final answer.

The input prompt and LLM answer will be logged and monitored by Comet LLMOps.

5\. You can get the answer in one shot or use the `TextIteratorStreamer` class (from HuggingFace) to stream it token by token.

6\. Store the user's input prompt and the LLM answer in the chat history.

7\. Pass the final answer to the client.

**Note:** You can use the `**TextIteratorStreamer**` class and wrap the `**FinancialAssistant**` under a WebSocket (instead of the RESTful API) to stream the bot's answer token by token, similar to what you see in the ChatGPT interface.

How | Inference pipeline: Build & deploy an inference pipeline using LangChain powered by LLMs & vector DBs [Image by the Author].

> ↳🔗 Check out the **Hands-on LLMs** course and support it with a ⭐.

* * *

That's it for today 👾

With this, we concluded the **Hands-On LLMs** series. I hope you enjoyed it 🔥

See you next Thursday at 9:00 a.m. CET.

Have a fantastic weekend!

Paul

* * *

#### Whenever you're ready, here is how I can help you:

 1. **The Full Stack 7-Steps MLOps Framework:** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.

 2. **Machine Learning & MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.

 3. **Machine Learning & MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-this-is-what-you-need-to-build?r=1ttoeh" }, { "id": "ceacd8d8-91dc-42a7-ad33-97964bf91387", "content": { "Title": "DML: 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor", "Subtitle": "Lesson 8 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: 7-steps on how to fine-tune an open-source LLM to create your real-\ntime financial advisor\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: 7-steps on how to fine-tune an open-source LLM to create your real-time\nfinancial advisor\n\n### Lesson 8 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nDec 21, 2023\n\n6\n\nShare this post\n\n#### DML: 7-steps on how to fine-tune an open-source LLM to create your real-\ntime financial advisor\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 8 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. What is Beam? How does serverless make deploying ML models easy?\n\n 2. 7 tips you must know to reduce your VRAM consumption of your LLMs during training\n\n 3. 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor\n\n#### Previous Lessons:\n\n * Lesson 5: Why & when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?\n\n * Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?\n\n * Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. What is Beam? How does serverless make deploying ML models easy?\n\n\ud835\uddd7\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06\ud835\uddf6\ud835\uddfb\ud835\uddf4 & \ud835\uddfa\ud835\uddee\ud835\uddfb\ud835\uddee\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf4 ML models is \ud835\uddf5\ud835\uddee\ud835\uddff\ud835\uddf1, especially when running your models on\nGPUs. \n \nBut \ud835\ude00\ud835\uddf2\ud835\uddff\ud835\ude03\ud835\uddf2\ud835\uddff\ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00 makes things \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06. 
\n \nUsing Beam as your serverless provider, deploying & managing ML models can be\nas easy as \u2193 \n \n\ud835\uddd7\ud835\uddf2\ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddf6\ud835\uddfb\ud835\uddf3\ud835\uddff\ud835\uddee\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 & \ud835\uddf1\ud835\uddf2\ud835\uddfd\ud835\uddf2\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddf2\ud835\ude00 \n \nIn a few lines of code, you define the application that contains: \n \n\\- the requirements of your infrastructure, such as the CPU, RAM, and GPU \n\\- the dependencies of your application \n\\- the volumes from where you can load your data and store your artifacts \n \n\ud835\uddd7\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddf7\ud835\uddfc\ud835\uddef\ud835\ude00 \n \nUsing the Beam application, you can quickly decore your Python functions to: \n \n\\- run them once on the given serverless application \n\\- put your task/job in a queue to be processed or even schedule it using a\nCRON-based syntax \n\\- even deploy it as a RESTful API endpoint\n\nHow do you use Beam as your serverless provider? [Image by the Author]\n\nAs you can see in the image below, you can have one central function for\ntraining or inference, and with minimal effort, you can switch from all these\ndeployment methods. \n \nAlso, you don't have to bother at all with managing the infrastructure on\nwhich your jobs run. You specify what you need, and Beam takes care of the\nrest. \n \nBy doing so, you can directly start to focus on your application and stop\ncarrying about the infrastructure. \n \nThis is the power of serverless! \n \n\u21b3\ud83d\udd17 Check out Beam to learn more\n\n* * *\n\n### #2. 7 tips you must know to reduce your VRAM consumption of your LLMs\nduring training\n\nHere are \ud835\udff3 \ud835\ude01\ud835\uddf6\ud835\uddfd\ud835\ude00 you must know to \ud835\uddff\ud835\uddf2\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\uddf2 your \ud835\udde9\ud835\udde5\ud835\uddd4\ud835\udde0 \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb of your \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00\nduring \ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 so you can \ud835\uddf3\ud835\uddf6\ud835\ude01 it on \ud835\ude05\ud835\udfed \ud835\uddda\ud835\udde3\ud835\udde8. \n \nWhen training LLMs, one of the pain points is to have enough VRAM on your\nsystem. \n \nThe good news is that the gods of DL are with us, and there are methods to\nlower your VRAM consumption without a significant impact on your performance \u2193 \n \n\ud835\udfed\\. \ud835\udde0\ud835\uddf6\ud835\ude05\ud835\uddf2\ud835\uddf1-\ud835\uddfd\ud835\uddff\ud835\uddf2\ud835\uddf0\ud835\uddf6\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb: During training you use both FP32 and FP16 in the\nfollowing way: \"FP32 weights\" -> \"FP16 weights\" -> \"FP16 gradients\" -> \"FP32\ngradients\" -> \"Update weights\" -> \"FP32 weights\" (and repeat). As you can see,\nthe forward & backward passes are done in FP16, and only the optimization step\nis done in FP32, which reduces both the VRAM and runtime. \n \n\ud835\udfee\\. 
**Lower precision**: All your computations are done in FP16 instead of FP32. The key is to use bfloat16 ("Brain Floating Point"), a numerical representation Google developed for deep learning. It can represent very large and very small numbers, avoiding overflow or underflow scenarios.

3\. **Reducing the batch size**: This one is straightforward. Fewer samples per training iteration result in smaller VRAM requirements. The downside is that you can't go too low with your batch size without impacting your model's performance.

4\. **Gradient accumulation**: A simple & powerful trick to increase your batch size virtually. You compute the gradients for "micro" batches (forward + backward passes). Once the accumulated gradients reach the given "virtual" target, the model weights are updated with the accumulated gradients. For example, with a virtual batch size of 4 and a micro-batch size of 1, each forward & backward pass uses only 1 sample, and the optimization step uses the aggregated gradient of the 4 samples.

5\. **Use a stateless optimizer**: Adam is the most popular optimizer. It is one of the most stable optimizers, but the downside is that it stores 2 additional values (a mean & a variance) for every model parameter. If you use a stateless optimizer, such as SGD, you drop those 2 extra values per parameter, which is significant for LLMs.

6\. **Gradient (or activation) checkpointing**: It drops specific activations during the forward pass and recomputes them during the backward pass. Thus, it eliminates the need to hold all activations in VRAM simultaneously. This technique reduces VRAM consumption but makes training slower.

7\. **CPU parameter offloading**: As the name suggests, the parameters that do not fit in your GPU's VRAM are kept on the CPU. Intuitively, you can see it as a form of model parallelism between your GPU & CPU.
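As a rough illustration (not the course's exact configuration), here is how several of these knobs map onto Hugging Face `TrainingArguments`; the specific values are placeholders:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./falcon-7b-financial",  # placeholder path
        per_device_train_batch_size=1,       # tip 3: small micro-batch
        gradient_accumulation_steps=4,       # tip 4: virtual batch size of 4
        bf16=True,                           # tips 1 & 2: mixed / lower precision with bfloat16
        gradient_checkpointing=True,         # tip 6: recompute activations in the backward pass
        optim="paged_adamw_8bit",            # or "sgd" for a stateless optimizer (tip 5)
        learning_rate=2e-4,
        num_train_epochs=3,
    )

CPU offloading (tip 7) is usually handled at model-loading time (e.g., `device_map="auto"` with accelerate) rather than by the trainer itself.
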
A happy dude going for a walk with his GPU [Image by DALL-E]

Most of these methods are orthogonal, so you can combine them and drastically reduce your VRAM requirements during training.

* * *

### #3. 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor

In the past weeks, we covered **why** you have to fine-tune an LLM and **what** resources & tools you need:
- Q&A dataset
- pre-trained LLM (Falcon 7B) & QLoRA
- MLOps: experiment tracker, model registry, prompt monitoring (Comet ML)
- compute platform (Beam)

Now, let's see how you can hook all of these pieces together into a single fine-tuning module ↓

1\. **Load the Q&A dataset**

Our Q&A samples have the following keys: "about_me," "user_context," "question," and "answer."

For task-specific fine-tuning, you need only 100-1000 samples. Thus, you can directly load the whole JSON in memory.

Afterward, you map every sample to a Python dataclass to validate the structure and types of the ingested instances.

2\. **Preprocess the Q&A dataset into prompts**

The first step is to use `unstructured` to clean every sample by removing redundant characters.

After that, as every sample consists of multiple fields, you must map it to a single piece of text, also known as the prompt.

To do so, you define a `PromptTemplate` class to manage all your prompts. You will use it to map all the sample keys to a prompt using a Python f-string.

The last step is to convert the list of Python dataclasses into a HuggingFace dataset and map every sample to a prompt, as discussed above.

3\. **Load the LLM using QLoRA**

Load the pretrained Falcon 7B LLM by passing a `bitsandbytes` quantization configuration that loads all the weights on 4 bits.

Then, using LoRA, you freeze the weights of the original Falcon LLM and attach a set of trainable adapters to it.
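For orientation only, here is a minimal sketch of this loading step with `transformers`, `bitsandbytes`, and `peft`; the rank, alpha, and dropout values are illustrative defaults, not the course's exact settings:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load Falcon 7B with all weights quantized to 4 bits (QLoRA-style loading).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

    # Freeze the base weights and attach small trainable LoRA adapters.
    lora_config = LoraConfig(
        r=16,                               # illustrative rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # Falcon's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # sanity check: only a small % is trainable
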
\n \n\ud835\udff0\\. \ud835\uddd9\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nThe \ud835\ude35\ud835\ude33\ud835\ude2d Python package makes this step extremely simple. \n \nYou pass to the \ud835\ude1a\ud835\ude0d\ud835\ude1b\ud835\ude1b\ud835\ude33\ud835\ude22\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude33 class the training arguments, the dataset and the\nmodel and call the \ud835\ude35\ud835\ude33\ud835\ude22\ud835\ude2a\ud835\ude2f() method. \n \nOne crucial aspect is configuring an experiment tracker, such as Comet ML, to\nlog the loss and other vital metrics & artifacts. \n \n\ud835\udff1\\. \ud835\udde3\ud835\ude02\ud835\ude00\ud835\uddf5 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddef\ud835\uddf2\ud835\ude00\ud835\ude01 \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\ude01\ud835\uddfc \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\uddff\ud835\uddf2\ud835\uddf4\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude06 \n \nOne of the final steps is to attach a callback to the \ud835\ude1a\ud835\ude0d\ud835\ude1b\ud835\ude1b\ud835\ude33\ud835\ude22\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude33 class that\nruns when the training ends to push the model with the lowest loss to the\nmodel registry as the new production candidate. \n \n\ud835\udff2\\. \ud835\uddd8\ud835\ude03\ud835\uddee\ud835\uddf9\ud835\ude02\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfb\ud835\uddf2\ud835\ude04 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\uddf0\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\uddf6\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2 \n \nEvaluating generative AI models can be pretty tricky. \n \nYou can run the LLM on the test set and log the prompts & answers to Comet\nML's monitoring system to check them manually. \n \nIf the provided answers are valid, using the model registry dashboard, you\nwill manually release it to replace the old LLM. \n \n\ud835\udff3\\. \ud835\uddd7\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 \ud835\ude01\ud835\uddfc \ud835\uddd5\ud835\uddf2\ud835\uddee\ud835\uddfa \n \nIt is as easy as wrapping the training & inference functions (or classes) with\na Python \"@\ud835\ude22\ud835\ude31\ud835\ude31.\ud835\ude33\ud835\ude36\ud835\ude2f()\" decorator.\n\nA step-by-step guide on fine-tuning an LLM to create a real-time financial\nadvisor [Image by the Author].\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 9** ,**** the last lesson of the **Hands-\nOn LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. 
**Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n6\n\nShare this post\n\n#### DML: 7-steps on how to fine-tune an open-source LLM to create your real-\ntime financial advisor\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-7-steps-on-how-to-fine-tune-an?r=1ttoeh" }, { "id": "dffed5e0-c824-40db-9388-a26fa09f7b49", "content": { "Title": "DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?", "Subtitle": "Lesson 7 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your\nLLMs?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your\nLLMs?\n\n### Lesson 7 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nDec 14, 2023\n\n5\n\nShare this post\n\n#### DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your\nLLMs?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 7 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. Real-time feature pipeline video lesson\n\n 2. How do you generate a synthetic domain-specific Q&A dataset in <30 minutes to fine-tune your open-source LLM?\n\n 3. My personal list of filtered resources about LLMs & vector DBs\n\n#### Previous Lessons:\n\n * Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?\n\n * Lesson 5: Why & when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?\n\n * Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. 
Real-time feature pipeline video lesson\n\nI know we are currently talking about the training pipeline and Q&A dataset\ngeneration, but sometimes, mixing the information to remember and make new\nconnections is healthy.\n\n\u2026or maybe that is only an excuse to share the video lesson about the feature\npipeline that wasn\u2019t ready when I started this series.\n\nIt will teach you how to \ud835\uddf6\ud835\uddfb\ud835\uddf4\ud835\uddf2\ud835\ude00\ud835\ude01 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddfb\ud835\uddf2\ud835\ude04\ud835\ude00 in \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9-\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2 from Alpaca, \ud835\uddf0\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddfb\n& \ud835\uddf2\ud835\uddfa\ud835\uddef\ud835\uddf2\ud835\uddf1 the \ud835\uddf1\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\ude00, and \ud835\uddf9\ud835\uddfc\ud835\uddee\ud835\uddf1 them in a \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5.\n\n\ud835\udddb\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\uddf6\ud835\ude00 \ud835\uddee\ud835\uddfb \ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff\ud835\ude03\ud835\uddf6\ud835\uddf2\ud835\ude04 \ud835\uddfc\ud835\uddf3 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude03\ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddfc \u2193 \n \n1\\. Step-by-step instructions on how to set up the streaming pipeline code & a\nQdrant vector DB serverless cluster \n2\\. Why we used Bytewax to build the streaming pipeline \n3\\. How we used Bytewax to ingest financial news in real-time leveraging a\nWebSocket, clean the documents, chunk them, embed them and ingest them in the\nQdrant vector DB \n4\\. How we adapted the Bytewax streaming pipeline to also work in batch mode\nto populate the vector DB with historical data \n5\\. How to run the code \n6\\. How to deploy the code to AWS\n\nHere it is \u2193 Enjoy \ud83d\udc40\n\n* * *\n\n## #2. How do you generate a synthetic domain-specific Q&A dataset in <30\nminutes to fine-tune your open-source LLM?\n\nThis method is also known as \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddf9\ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb. 
Here are its 3 \ud835\ude2e\ud835\ude22\ud835\ude2a\ud835\ude2f\n\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31\ud835\ude34 \u2193 \n \n\ud835\ude0d\ud835\ude30\ud835\ude33 \ud835\ude26\ud835\ude39\ud835\ude22\ud835\ude2e\ud835\ude31\ud835\ude2d\ud835\ude26, \ud835\ude2d\ud835\ude26\ud835\ude35'\ud835\ude34 \ud835\ude28\ud835\ude26\ud835\ude2f\ud835\ude26\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 \ud835\ude22 \ud835\ude18&\ud835\ude08 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude25\ud835\ude22\ud835\ude35\ud835\ude22\ud835\ude34\ud835\ude26\ud835\ude35 \ud835\ude36\ud835\ude34\ud835\ude26\ud835\ude25 \ud835\ude35\ud835\ude30 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude26 \ud835\ude22\n\ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude2a\ud835\ude22\ud835\ude2d \ud835\ude22\ud835\ude25\ud835\ude37\ud835\ude2a\ud835\ude34\ud835\ude30\ud835\ude33 \ud835\ude13\ud835\ude13\ud835\ude14. \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfed: \ud835\udde0\ud835\uddee\ud835\uddfb\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf9\ud835\ude06 \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddee \ud835\uddf3\ud835\uddf2\ud835\ude04 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \n \nGenerate a few input samples (~3) that have the following structure: \n\\- \ud835\ude36\ud835\ude34\ud835\ude26\ud835\ude33_\ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude39\ud835\ude35: describe the type of investor (e.g., \"I am a 28-year-old\nmarketing professional\") \n\\- \ud835\ude32\ud835\ude36\ud835\ude26\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f: describe the user's intention (e.g., \"Is Bitcoin a good\ninvestment option?\") \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee: \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf5\ud835\uddf2\ud835\uddf9\ud835\uddfd \ud835\uddfc\ud835\uddf3 \ud835\uddee \ud835\ude01\ud835\uddf2\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \n \nUse a powerful LLM as a teacher (e.g., GPT4, Falcon 180B, etc.) to generate up\nto +N similar input examples. \n \nWe generated 100 input examples in our use case, but you can generate more. \n \nYou will use the manually filled input examples to do few-shot prompting. \n \nThis will guide the LLM to give you domain-specific samples. \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude2e\ud835\ude31\ud835\ude35 \ud835\ude38\ud835\ude2a\ud835\ude2d\ud835\ude2d \ud835\ude2d\ud835\ude30\ud835\ude30\ud835\ude2c \ud835\ude2d\ud835\ude2a\ud835\ude2c\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude2a\ud835\ude34: \n\"\"\" \n... \nGenerate 100 more examples with the following pattern: \n \n# USER CONTEXT 1 \n... \n \n# QUESTION 1 \n... \n \n# USER CONTEXT 2 \n... 
\n\"\"\" \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfef: \ud835\udde8\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddf2\ud835\uddee\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\ude01\ud835\uddfc \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddfc\ud835\ude02\ud835\ude01\ud835\uddfd\ud835\ude02\ud835\ude01\ud835\ude00 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddee\ud835\uddf9\ud835\uddf9 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf6\ud835\uddfb\ud835\uddfd\ud835\ude02\ud835\ude01 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00 \n \nNow, you will have the same powerful LLM as a teacher, but this time, it will\nanswer all your N input examples. \n \nBut first, to introduce more variance, we will use RAG to enrich the input\nexamples with news context. \n \nAfterward, we will use the teacher LLM to answer all N input examples. \n \n...and bam! You generated a domain-specific Q&A dataset with almost 0 manual\nwork. \n \n. \n \nNow, you will use this data to train a smaller LLM (e.g., Falcon 7B) on a\nniched task, such as financial advising. \n \nThis technique is known as finetuning with distillation because you use a\npowerful LLM as the teacher (e.g., GPT4, Falcon 180B) to generate the data,\nwhich will be used to fine-tune a smaller LLM (e.g., Falcon 7B), which acts as\nthe student. \n \n\u2712\ufe0f \ud835\ude15\ud835\ude30\ud835\ude35\ud835\ude26: To ensure that the generated data is of high quality, you can hire a\ndomain expert to check & refine it.\n\nHow do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?\n[Image by the Author].\n\n\u21b3 To learn more about this technique, check out \u201cHow to generate a Q&A dataset\nin less than 30 minutes\u201d Pau Labarta's article from\n\nReal-World Machine Learning\n\n.\n\n* * *\n\n### #3. My personal list of filtered resources about LLMs & vector DBs\n\nThe internet is full of \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2\ud835\ude00 about \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 & \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5\ud835\ude00. But \ud835\uddfa\ud835\uddfc\ud835\ude00\ud835\ude01\n\ud835\uddfc\ud835\uddf3 \ud835\uddf6\ud835\ude01 is \ud835\ude01\ud835\uddff\ud835\uddee\ud835\ude00\ud835\uddf5. 
\n \nAfter \ud835\udff2 \ud835\uddfa\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf5\ud835\ude00 of \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 & \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5\ud835\ude00, here is a \ud835\uddf9\ud835\uddf6\ud835\ude00\ud835\ude01 \ud835\uddfc\ud835\uddf3 \ud835\uddf3\ud835\uddf6\ud835\uddf9\ud835\ude01\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddf1\n\ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2\ud835\ude00 that I \ud835\uddfd\ud835\uddf2\ud835\uddff\ud835\ude00\ud835\uddfc\ud835\uddfb\ud835\uddee\ud835\uddf9\ud835\uddf9\ud835\ude06 \ud835\ude02\ud835\ude00\ud835\uddf2 \u2193 \n \n\ud835\ude09\ud835\ude2d\ud835\ude30\ud835\ude28\ud835\ude34: \n \n\\- philschmid \n\\- Chip Huyen \n\\- eugeneyan \n\\- LLM Learning Lab \n\\- Lil'Log \n\\- VectorHub by SuperLinked \n\\- Qdrant Blog \n \n\ud835\ude08\ud835\ude33\ud835\ude35\ud835\ude2a\ud835\ude24\ud835\ude2d\ud835\ude26\ud835\ude34: \n \n\\- Patterns for Building LLM-based Systems & Products \n\\- RLHF: Reinforcement Learning from Human Feedback \n\\- Illustrating Reinforcement Learning from Human Feedback (RLHF) \n\\- Understanding Encoder And Decoder LLMs \n\\- Building LLM applications for production \n\\- Prompt Engineering \n\\- Transformers \n\\- Bidirectional Encoder Representations from Transformers (BERT) \n\\- Multimodality and Large Multimodal Models (LMMs) by Chip Huyen \n \n\ud835\ude1d\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude30\ud835\ude34: \n \n\\- Word Embedding and Word2Vec, Clearly Explained!!! \n\\- Let's build GPT: from scratch, in code, spelled out \n\\- Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!! \n\\- Large Language Models with Semantic Search \n\\- Decoder-Only Transformers, ChatGPTs specific Transformer, Clearly\nExplained!!! \n \n\ud835\ude0a\ud835\ude30\ud835\ude25\ud835\ude26 \ud835\ude19\ud835\ude26\ud835\ude31\ud835\ude30\ud835\ude34\ud835\ude2a\ud835\ude35\ud835\ude30\ud835\ude33\ud835\ude2a\ud835\ude26\ud835\ude34: \n \n\\- OpenAI Cookbook \n\\- generative-ai-for-beginners \n \n\ud835\ude0a\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34\ud835\ude26\ud835\ude34: \n \n\\- LangChain for LLM Application Development \n\\- Building Systems with the ChatGPT API \n\\- ChatGPT Prompt Engineering for Developers \n \n. \n \n...and hopefully, my \ud83d\udd17 Hands-on LLMs course will soon appear along them.\n\nImage by DALL-E\n\nLet me know what you think of this list and have fun learning \ud83d\udd25\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 8** of the **Hands-On LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. 
**Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n5\n\nShare this post\n\n#### DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your\nLLMs?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-do-you-generate-a-q-and-a?r=1ttoeh" }, { "id": "15c3831b-67fd-4279-970a-a720aafefa67", "content": { "Title": "DML: What do you need to fine-tune an open-source LLM to create your financial advisor?", "Subtitle": "Lesson 6 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: What do you need to fine-tune an open-source LLM to create your\nfinancial advisor?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: What do you need to fine-tune an open-source LLM to create your\nfinancial advisor?\n\n### Lesson 6 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nDec 07, 2023\n\n4\n\nShare this post\n\n#### DML: What do you need to fine-tune an open-source LLM to create your\nfinancial advisor?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 6 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. The difference between encoders, decoders, and encoder-decoder LLMs.\n\n 2. You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data.\n\n 3. What do you need to fine-tune an open-source LLM to create your own financial advisor?\n\n#### Previous Lessons:\n\n * Lesson 3: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?\n\n * Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?\n\n * Lesson 5: Why & when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. The difference between encoders, decoders, and encoder-decoder LLMs\n\nLet's see when to use each architecture \u2193 \n \nAs embeddings are everywhere, both encoders and decoders use self-attention\nlayers to encode word tokens into embeddings. \n \nThe devil is in the details. Let's clarify it \u2193 \n \n\ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\udde2\ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddf9 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddfb\ud835\ude00\ud835\uddf3\ud835\uddfc\ud835\uddff\ud835\uddfa\ud835\uddf2\ud835\uddff \n \nIt is an encoder-decoder setup. 
The encoder processes the input text and hands\noff its understanding as embeddings to the decoder, which will generate the\nfinal output. \n \nThe key difference between an encoder & decoder is in how it processes its\ninputs & outputs. \n \n=== \ud835\uddd8\ud835\uddfb\ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\ude00 === \n \nThe role of an encoder is to extract relevant information from the whole input\nand encode it into an embedding (e.g., BERT, RoBERTa). \n \nWithin the \"Multi-head attention\" of the transformer, all the tokens are\nallowed to speak to each other. \n \nA token at position t can talk to all other previous tokens [0, t-1] and\nfuture tokens [t+1, T]. This means that the attention mask is computed along\nthe whole vector. \n \nThus, because the encoder processes the whole input, it is helpful for\nclassification tasks (e.g., sentiment analysis) and creates embeddings for\nclustering, recommender systems, vector DB indexes, etc. \n \n=== \ud835\uddd7\ud835\uddf2\ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\ude00 === \n \nOn the flip side, if you want to generate text, use decoder-only models (e.g.,\nGPT family). \n \nOnly the current and previous tokens (not the whole input) are used to predict\nthe next token. \n \nWithin the \"Masked Multi-head attention,\" the future positions are masked to\nmaintain the autoregressive property of the decoding process. \n \nFor example, within the \"Masked Multi-head attention,\" instead of all the\ntokens talking to each other, a token at position t will have access only to\nprevious tokens at positions t-1, t-2, t-3, ..., 0. \n \n=== \ud835\uddd8\ud835\uddfb\ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddff-\ud835\uddf1\ud835\uddf2\ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddff === \n \nThis technique is used when you have to understand the entire input sequence\n(encoder) and the previously generated sequence (decoder -> autoregressive). \n \nTypical use cases are text translation & summarization (the original\ntransformer was built for text translation), where the output heavily relies\non the input. \n \nWhy? Because the decoding step always has to be conditioned by the encoded\ninformation. Also known as cross-attention, the decoder queries the encoded\ninformation for information to guide the decoding process. \n \nFor example, when translating English to Spanish, every Spanish token\npredicted is conditioned by the previously predicted Spanish tokens & the\nentire English sentence.\n\nEncoder vs. Decoder vs. Encoder-Decoder LLMs [Image by the Author].\n\nTo conclude... \n \n\\- a decoder takes as input previous tokens and predicts the next one (in an\nautoregressive way) \n\\- by dropping the \"Masked\" logic from the \"Masked Multi-head attention,\" you\nprocess the whole input, transforming the decoder into an encoder \n\\- if you hook the encoder to the decoder through a cross-attention layer, you\nhave an encoder-decoder architecture\n\n* * *\n\n### #2. 
You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data

You must know these **3 main stages** of **training an LLM** to train your own LLM on your **proprietary data**.

**Stage 1: Pretraining for completion**

You start with a barebones, randomly initialized LLM.

This stage aims to teach the model to spit out tokens. More concretely, based on the previous tokens, the model learns to predict the next token with the highest probability.

For example, your input to the model is "The best programming language is ___", and it will answer, "The best programming language is Rust."

Intuitively, at this stage, the LLM learns to speak.

*Data*: >1 trillion tokens (~= 15 million books). The data quality doesn't have to be great; hence, you can scrape it from the internet.

**Stage 2: Supervised fine-tuning (SFT) for dialogue**

You start with the pretrained model from stage 1.

This stage aims to teach the model to respond to the user's questions.

For example, without this step, when prompted with "What is the best programming language?", the model has a high probability of continuing with a series of similar questions, such as "What is MLOps? What is MLE?" and so on.

As the model mimics the training data, you must fine-tune it on Q&A (questions & answers) data to align the model to respond to questions instead of merely predicting the following tokens.

After the fine-tuning step, when prompted, "What is the best programming language?", it will respond, "Rust."

*Data*: 10K - 100K Q&A examples

*Note*: After aligning the model to respond to questions, you can further single-task fine-tune it on Q&A data for a specific use case to specialize the LLM.
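To make stage 2 concrete, here is a minimal sketch of how a single Q&A sample can be mapped to one training prompt for supervised fine-tuning. The template and field names are illustrative assumptions, not the exact format used in the course:

```python
# Minimal SFT formatting sketch: turn one Q&A record into a training prompt.
# The template and the sample below are illustrative, not the course's exact schema.

PROMPT_TEMPLATE = """### Instruction:
{question}

### Response:
{answer}"""


def format_sample(sample: dict) -> str:
    """Map a raw Q&A dict onto the prompt template used for fine-tuning."""
    return PROMPT_TEMPLATE.format(
        question=sample["question"].strip(),
        answer=sample["answer"].strip(),
    )


sample = {
    "question": "What is the best programming language?",
    "answer": "Rust.",
}
print(format_sample(sample))
```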
**Stage 3: Reinforcement learning from human feedback (RLHF)**

Demonstration data tells the model what kind of responses to give but doesn't tell the model how good or bad a response is.

The goal is to align your model with user feedback (what users liked or didn't like) to increase the probability of generating answers that users find helpful.

*RLHF is split into 2 steps:*

1\. Using the LLM from stage 2, train a reward model to act as a scoring function using (prompt, winning_response, losing_response) samples (= comparison data). The model learns to maximize the difference between the winning and losing responses. After training, this model outputs a reward for any (prompt, response) tuple.

*Data*: 100K - 1M comparisons

2\. Use an RL algorithm (e.g., PPO) to fine-tune the LLM from stage 2. Here, you use the reward model trained above to score every (prompt, response) pair. The RL algorithm nudges the LLM toward generating responses with higher rewards, increasing the probability of producing answers that users like.

*Data*: 10K - 100K prompts

The 3 main stages of training an LLM that you must know [Image by the Author].

**Note:** Post inspired by Chip Huyen's 🔗 "RLHF: Reinforcement Learning from Human Feedback" article.

* * *

### #3. What do you need to fine-tune an open-source LLM to create your own financial advisor?

This is the **LLM fine-tuning kit** you must know ↓

**Dataset**

The key component of any successful ML project is the data.

You need a 100 - 1000 sample Q&A (questions & answers) dataset with financial scenarios.

The best approach is to hire a bunch of experts to create it manually.

But, for a PoC, that might get expensive & slow.

The good news is that a method called "finetuning with distillation" exists.

In a nutshell, this is how it works: use a big & powerful LLM (e.g., GPT-4) to generate your fine-tuning data, then use this data to fine-tune a smaller model (e.g., Falcon 7B).

For specializing smaller LLMs on specific use cases (e.g., financial advisors), this is an excellent method to kick off your project.
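Here is a minimal sketch of "finetuning with distillation" in practice: ask a powerful teacher LLM to draft Q&A pairs that you later use to fine-tune a smaller model. It assumes the `openai` Python client is installed and configured; the prompt, model name, and the naive JSON parsing are illustrative only:

```python
# Sketch of generating synthetic fine-tuning data with a teacher LLM (distillation).
# Assumes the `openai` client and OPENAI_API_KEY are set up; the prompt, model name,
# and the naive parsing below are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

SEED_PROMPT = (
    "Generate 5 question/answer pairs about personal investing for beginners. "
    "Return them as a JSON list of objects with 'question' and 'answer' keys."
)

response = client.chat.completions.create(
    model="gpt-4",  # any strong teacher model
    messages=[{"role": "user", "content": SEED_PROMPT}],
)

# Naive parsing: real model outputs may need stricter validation.
qa_pairs = json.loads(response.choices[0].message.content)

# Persist the synthetic dataset so it can later fine-tune a smaller model (e.g., Falcon 7B).
with open("financial_qa_dataset.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)
```

In a real project, you would also deduplicate the generated pairs and manually review a sample of them before training.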
**Pre-trained open-source LLM**

You never (or rarely) want to start training your LLM from scratch.

Why? Because you need trillions of tokens & millions of $$$ in compute power.

You want to fine-tune your LLM on your specific task.

The good news is that you can find a plethora of open-source LLMs on HuggingFace (e.g., Falcon, LLaMA, etc.)

**Parameter-efficient fine-tuning**

As LLMs are big... duh...

... they don't fit on a single GPU.

As you only want to fine-tune the LLM, the community invented clever techniques that quantize the LLM (to fit on a single GPU) and fine-tune only a small set of adapters.

One popular approach is QLoRA, which can be implemented using HF's `peft` Python package.

**MLOps**

As you want your project to get to production, you have to integrate the following MLOps components:

\- experiment tracker to monitor & compare your experiments
\- model registry to version & share your models between the FTI pipelines
\- prompts monitoring to debug & track complex chains

↳🔗 All of them are available on ML platforms, such as Comet ML

**Compute platform**

The most common approach is to train your LLM on your on-prem Nvidia GPU cluster or to rent GPUs from cloud providers such as AWS, Paperspace, etc.

But what if I told you that there is an easier way?

There is! It is called serverless.

For example, Beam is a GPU serverless provider that makes deploying your training pipeline as easy as decorating your Python function with `@app.run()`.

Along with ease of deployment, you can easily add your training code to your CI/CD to add the final piece of the MLOps puzzle, called CT (continuous training).

↳🔗 Beam

What | Training Pipeline [Image by the Author].

> ↳🔗 To see all these components in action, check out our FREE **Hands-on LLMs course** & give it a ⭐

* * *

That's it for today 👾

See you next Thursday at 9:00 a.m. CET.

Have a fantastic weekend!

…and see you next week for **Lesson 7** of the **Hands-On LLMs series** 🔥

Paul

* * *

#### Whenever you're ready, here is how I can help you:

 1. **The Full Stack 7-Steps MLOps Framework:** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices.
It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n4\n\nShare this post\n\n#### DML: What do you need to fine-tune an open-source LLM to create your\nfinancial advisor?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-what-do-you-need-to-fine-tune?r=1ttoeh" }, { "id": "174d6f07-42f4-4190-9150-bb4ad35f8413", "content": { "Title": "DML: Why & when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?", "Subtitle": "Lesson 5 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Why & when do you need to fine-tune open-source LLMs? What about\nfine-tuning vs. prompt engineering?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Why & when do you need to fine-tune open-source LLMs? What about fine-\ntuning vs. prompt engineering?\n\n### Lesson 5 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nNov 30, 2023\n\n6\n\nShare this post\n\n#### DML: Why & when do you need to fine-tune open-source LLMs? What about\nfine-tuning vs. prompt engineering?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 5 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. Using this Python package, you can x10 your text preprocessing pipeline development.\n\n 2. Why & when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?\n\n 3. Fine-tuning video lessons\n\n#### Previous Lessons:\n\n * Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\n * Lesson 3: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?\n\n * Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. Using this Python package, you can x10 your text preprocessing\npipeline development\n\nAny text preprocessing pipeline has to clean, partition, extract, or chunk\ntext data to feed it into your LLMs. 
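As a quick taste, here is a minimal sketch of parsing and cleaning an HTML document with the `unstructured` package discussed below; the HTML snippet and the specific cleaning options are illustrative:

```python
# Minimal sketch of parsing and cleaning an HTML document with `unstructured`.
# Function choices are illustrative; check the library's docs for the full API.
from unstructured.partition.html import partition_html
from unstructured.cleaners.core import clean, clean_non_ascii_chars

html = "<html><body><h1>Fed raises rates</h1><p>  Markets reacted ...  </p></body></html>"

# Partition the raw HTML into elements (titles, paragraphs, etc.).
elements = partition_html(text=html)

# Clean each element's text: drop extra whitespace, dashes, bullets, and non-ASCII chars.
cleaned = [
    clean_non_ascii_chars(
        clean(el.text, extra_whitespace=True, dashes=True, bullets=True)
    )
    for el in elements
]
print(cleaned)
```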
\n \n\ud835\ude02\ud835\uddfb\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 offers a \ud835\uddff\ud835\uddf6\ud835\uddf0\ud835\uddf5 and \ud835\uddf0\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddfb \ud835\uddd4\ud835\udde3\ud835\udddc that allows you to quickly: \n \n\\- \ud835\ude31\ud835\ude22\ud835\ude33\ud835\ude35\ud835\ude2a\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f your data into smaller segments from various data sources (e.g.,\nHTML, CSV, PDFs, even images, etc.) \n\\- \ud835\ude24\ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 the text of anomalies (e.g., wrong ASCII characters), any\nirrelevant information (e.g., white spaces, bullets, etc.), and filling\nmissing values \n\\- \ud835\ude26\ud835\ude39\ud835\ude35\ud835\ude33\ud835\ude22\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude2f\ud835\ude28 information from pieces of text (e.g., datetimes, addresses, IP\naddresses, etc.) \n\\- \ud835\ude24\ud835\ude29\ud835\ude36\ud835\ude2f\ud835\ude2c\ud835\ude2a\ud835\ude2f\ud835\ude28 your text segments into pieces of text that can be inserted into\nyour embedding model \n\\- \ud835\ude26\ud835\ude2e\ud835\ude23\ud835\ude26\ud835\ude25\ud835\ude25\ud835\ude2a\ud835\ude2f\ud835\ude28 data (e.g., wrapper over OpenAIEmbeddingEncoder,\nHuggingFaceEmbeddingEncoders, etc.) \n\\- \ud835\ude34\ud835\ude35\ud835\ude22\ud835\ude28\ud835\ude26 your data to be fed into various tools (e.g., Label Studio, Label\nBox, etc.)\n\nUnstructured [Image by the Author].\n\n\ud835\uddd4\ud835\uddf9\ud835\uddf9 \ud835\ude01\ud835\uddf5\ud835\uddf2\ud835\ude00\ud835\uddf2 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd\ud835\ude00 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddf3\ud835\uddfc\ud835\uddff: \n \n\\- feeding your data into your LLMs \n\\- embedding the data and ingesting it into a vector DB \n\\- doing RAG \n\\- labeling \n\\- recommender systems \n \n... basically for any LLM or multimodal applications \n \n. \n \nImplementing all these steps from scratch will take a lot of time. \n \nI know some Python packages already do this, but the functionality is\nscattered across multiple packages. \n \n\ud835\ude02\ud835\uddfb\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 packages everything together under a nice, clean API. \n \n\u21b3 Check it out.\n\n* * *\n\n### #2. Why & when do you need to fine-tune open-source LLMs? What about fine-\ntuning vs. prompt engineering?\n\nFine-tuning is the process of taking a pre-trained model and further refining\nit on a specific task. 
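As a minimal sketch of that refinement loop with Hugging Face `transformers` (the model name, dataset file, and hyperparameters are placeholders, not the course's setup):

```python
# Minimal sketch of refining a pre-trained causal LM on task-specific text.
# Model name, dataset file, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSON file with records containing a "text" field.
dataset = load_dataset("json", data_files="task_dataset.json")["train"]
tokenized = dataset.map(
    lambda s: tokenizer(s["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```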
\n \n\ud835\uddd9\ud835\uddf6\ud835\uddff\ud835\ude00\ud835\ude01, \ud835\uddf9\ud835\uddf2\ud835\ude01'\ud835\ude00 \ud835\uddf0\ud835\uddf9\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\uddf3\ud835\ude06 \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddfa\ud835\uddf2\ud835\ude01\ud835\uddf5\ud835\uddfc\ud835\uddf1\ud835\ude00 \ud835\uddfc\ud835\uddf3 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddee\ud835\uddfb \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddf2\ud835\ude05\ud835\uddf6\ud835\ude00t \u2193 \n \n\\- \ud835\ude0a\ud835\ude30\ud835\ude2f\ud835\ude35\ud835\ude2a\ud835\ude2f\ud835\ude36\ud835\ude26\ud835\ude25 \ud835\ude31\ud835\ude33\ud835\ude26-\ud835\ude35\ud835\ude33\ud835\ude22\ud835\ude2a\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28: utilize domain-specific data to apply the same pre-\ntraining process (next token prediction) on the pre-trained (base) model \n\\- \ud835\ude10\ud835\ude2f\ud835\ude34\ud835\ude35\ud835\ude33\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28: the pre-trained (base) model is fine-tuned on a\nQ&A dataset to learn to answer questions \n\\- \ud835\ude1a\ud835\ude2a\ud835\ude2f\ud835\ude28\ud835\ude2d\ud835\ude26-\ud835\ude35\ud835\ude22\ud835\ude34\ud835\ude2c \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28: the pre-trained model is refined for a specific\ntask, such as toxicity detection, coding, medicine advice, etc. \n\\- \ud835\ude19\ud835\ude13\ud835\ude0f\ud835\ude0d: It requires collecting human preferences (e.g., pairwise\ncomparisons), which are then used to train a reward model. The reward model is\nused to fine-tune the LLM via RL techniques such as PPO. \n \nCommon approaches are to take a pre-trained LLM (next-word prediction) and\napply instruction & single-task fine-tuning. \n \n\ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\uddf1\ud835\uddfc \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1 \ud835\ude01\ud835\uddfc \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0? \n \nYou do instruction fine-tuning to make the LLM learn to answer your questions. \n \nThe exciting part is when you want to fine-tune your LLM on a single task. \n \nHere is why \u2193 \n \n\ud835\ude31\ud835\ude26\ud835\ude33\ud835\ude27\ud835\ude30\ud835\ude33\ud835\ude2e\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude26: it will improve your LLM performance on given use cases (e.g.,\ncoding, extracting text, etc.). 
Mainly, the LLM will specialize in a given\ntask (a specialist will always beat a generalist in its domain) \n \n\ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude35\ud835\ude33\ud835\ude30\ud835\ude2d: you can refine how your model should behave on specific inputs and\noutputs, resulting in a more robust product \n \n\ud835\ude2e\ud835\ude30\ud835\ude25\ud835\ude36\ud835\ude2d\ud835\ude22\ud835\ude33\ud835\ude2a\ud835\ude3b\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f: you can create an army of smaller models, where each is\nspecialized on a particular task, increasing the overall system's performance.\nUsually, when you fine-tune one task, it reduces the performance of the other\ntasks (known as the \nalignment tax). Thus, having an expert system of multiple smaller models can\nimprove the overall performance. \n \n\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddee\ud835\uddef\ud835\uddfc\ud835\ude02\ud835\ude01 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01 \ud835\uddf2\ud835\uddfb\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude03\ud835\ude00 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4? \n \n\ud835\ude25\ud835\ude22\ud835\ude35\ud835\ude22: use prompting when you don't have data available (~2 examples are\nenough). Fine-tuning needs at least >=100 examples to work. \n \n\ud835\ude24\ud835\ude30\ud835\ude34\ud835\ude35: prompting forces you to write long & detailed prompts to achieve your\nlevel of performance. You pay per token (API or compute-wise). Thus, when a\nprompt gets bigger, your costs increase. But, when fine-tuning an LLM, you\nincorporate all that knowledge inside the model. Hence, you can use smaller\nprompts with similar performance.\n\nFine-tuning LLMs [Image by the Author].\n\nWhen you start a project, a good strategy is to write a wrapper over an API\n(e.g., OpenAI's GPT-4, Anyscale, etc.) that defines a desired interface that\ncan easily be swapped with your open-source implementation in future\niterations.\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course to see this in action.\n\n* * *\n\n### #3. Fine-tuning video lessons \n\nAs you might know,\n\nPau Labarta Bajo\n\nfrom\n\nReal-World Machine Learning\n\nand I are also working on a free Hands-on LLMs course that contains the open-\nsource code + a set of video lessons.\n\nHere are the 2 video lessons about fine-tuning \u2193\n\n#### 01 Hands-on LLMS | Theoretical Part\n\nHere is a \ud835\ude34\ud835\ude36\ud835\ude2e\ud835\ude2e\ud835\ude22\ud835\ude33\ud835\ude3a of the 1\ud835\ude34\ud835\ude35 \ud835\ude37\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude30 \ud835\ude2d\ud835\ude26\ud835\ude34\ud835\ude34\ud835\ude30\ud835\ude2f \u2193\n\n\ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 \ud835\uddf9\ud835\uddee\ud835\uddff\ud835\uddf4\ud835\uddf2 \ud835\uddf9\ud835\uddee\ud835\uddfb\ud835\uddf4\ud835\ude02\ud835\uddee\ud835\uddf4\ud835\uddf2 \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9\ud835\ude00? \n \n1\\. \ud835\ude17\ud835\ude26\ud835\ude33\ud835\ude27\ud835\ude30\ud835\ude33\ud835\ude2e\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude26: Fine-tuning a large language model (LLM) can improve\nperformance, especially for specialized tasks. \n \n2\\. 
\ud835\ude0c\ud835\ude24\ud835\ude30\ud835\ude2f\ud835\ude30\ud835\ude2e\ud835\ude2a\ud835\ude24\ud835\ude34: Fine-tuned models are smaller and thus cheaper to run. This is\ncrucial, given that LLMs can have billions of parameters. \n \n\ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf1\ud835\uddfc \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1 \ud835\ude01\ud835\uddfc \ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddee \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2? \n \n1\\. \ud835\ude0b\ud835\ude22\ud835\ude35\ud835\ude22\ud835\ude34\ud835\ude26\ud835\ude35: You need a dataset of input-output examples. This dataset can be\ncreated manually or semi-automatically using existing LLMs like GPT-3.5. \n \n2\\. \ud835\ude09\ud835\ude22\ud835\ude34\ud835\ude26 \ud835\ude13\ud835\ude13\ud835\ude14: Choose an open-source LLM from repositories like Hugging Face's\nModel Hub (e.g., Falcon 7B) \n \n3\\. \ud835\ude0d\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude34\ud835\ude24\ud835\ude33\ud835\ude2a\ud835\ude31\ud835\ude35: Data loader + Trainer \n \n4\\. \ud835\ude08\ud835\ude25\ud835\ude37\ud835\ude22\ud835\ude2f\ud835\ude24\ud835\ude26\ud835\ude25 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude35\ud835\ude26\ud835\ude24\ud835\ude29\ud835\ude2f\ud835\ude2a\ud835\ude32\ud835\ude36\ud835\ude26\ud835\ude34 \ud835\ude35\ud835\ude30 \ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude26-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude2e\ud835\ude30\ud835\ude25\ud835\ude26\ud835\ude2d \ud835\ude30\ud835\ude2f \ud835\ude24\ud835\ude29\ud835\ude26\ud835\ude22\ud835\ude31 \ud835\ude29\ud835\ude22\ud835\ude33\ud835\ude25\ud835\ude38\ud835\ude22\ud835\ude33\ud835\ude26:\nQLoRA \n \n5\\. \ud835\ude14\ud835\ude13\ud835\ude16\ud835\ude31\ud835\ude34: Experiment Tracker + Model Registry \n \n6\\. \ud835\ude10\ud835\ude2f\ud835\ude27\ud835\ude33\ud835\ude22\ud835\ude34\ud835\ude35\ud835\ude33\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude36\ud835\ude33\ud835\ude26: Comet \\+ Beam\n\n#### 02 Hands-on LLMS | Diving into the code\n\n\ud835\udddb\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\uddf6\ud835\ude00 \ud835\uddee \ud835\ude00\ud835\uddf5\ud835\uddfc\ud835\uddff\ud835\ude01 \ud835\ude04\ud835\uddee\ud835\uddf9\ud835\uddf8\ud835\ude01\ud835\uddf5\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddf4\ud835\uddf5 \ud835\uddfc\ud835\uddf3 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf9\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddfc\ud835\uddfb \u2193 \n \n1\\. How to set up the code and environment using Poetry \n2\\. How to configure Comet & Beam \n3\\. How to start the training pipeline locally (if you have a CUDA-enabled\nGPU) or on Beam (for running your training pipeline on a serverless\ninfrastructure -> doesn't matter what hardware you have). \n4\\. An overview of the code \n5\\. Clarifying why we integrated Poetry, a model registry and linting within\nthe training pipeline. \n \n\u2757This video is critical for everyone who wants to replicate the training\npipeline of our course on their system. 
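To ground the QLoRA step mentioned in the fine-tuning pipeline list above, here is a minimal QLoRA-style sketch using HF's `peft` with a 4-bit quantized base model; the target modules and hyperparameters are illustrative, not the course's exact configuration:

```python
# Minimal QLoRA-style sketch: load a 4-bit quantized base model and attach LoRA adapters.
# Target modules and hyperparameters are illustrative, not the course's exact settings.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```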
The previous lesson focused on the\ntheoretical parts of the training pipeline.\n\n> \u21b3\ud83d\udd17 To find out the code & all the videos, check out the **Hands-on LLMs**\n> GitHub repository.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 6** of the **Hands-On LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n6\n\nShare this post\n\n#### DML: Why & when do you need to fine-tune open-source LLMs? What about\nfine-tuning vs. prompt engineering?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-why-and-when-do-you-need-to-fine?r=1ttoeh" }, { "id": "b6d86294-1bcc-4226-8218-3a63cab813a2", "content": { "Title": "DML: How to implement a streaming pipeline to populate a vector DB for real-time RAG?", "Subtitle": "Lesson 4 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: How to implement a streaming pipeline to populate a vector DB for\nreal-time RAG?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: How to implement a streaming pipeline to populate a vector DB for real-\ntime RAG?\n\n### Lesson 4 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nNov 23, 2023\n\n3\n\nShare this post\n\n#### DML: How to implement a streaming pipeline to populate a vector DB for\nreal-time RAG?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 4 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. What is Bytewax?\n\n 2. Why have vector DBs become so popular? Why are they so crucial for most ML applications?\n\n 3. How to implement a streaming pipeline to populate a vector DB for real-time RAG?\n\n#### Previous Lessons:\n\n * Lesson 1: How to design an LLM system for a financial assistant using the 3-pipeline design\n\n * Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. 
MLOps\n\n * Lesson 3: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. What is Bytewax?\n\nAre you afraid of writing \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\ude00? Or do you think they are hard\nto implement? \n \nI did until I discovered Bytewax \ud83d\udc1d. Let me show you \u2193 \n \nBytewax \ud83d\udc1d is an \ud835\uddfc\ud835\uddfd\ud835\uddf2\ud835\uddfb-\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2 \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf0\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf3\ud835\uddff\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 that: \n\\- is built in Rust \u2699\ufe0f for performance \n\\- has Python \ud83d\udc0d binding for ease of use \n \n... so for all the Python fanatics out there, no more JVM headaches for you. \n \nJokes aside, here is why Bytewax \ud83d\udc1d is so powerful \u2193 \n \n\\- Bytewax local setup is plug-and-play \n\\- can quickly be integrated into any Python project (you can go wild -- even\nuse it in Notebooks) \n\\- can easily be integrated with other Python packages (NumPy, PyTorch,\nHuggingFace, OpenCV, SkLearn, you name it) \n\\- out-of-the-box connectors for Kafka, local files, or you can quickly\nimplement your own \n\\- CLI tool to easily deploy it to K8s, AWS, or GCP. \n \n\ud835\ude0d\ud835\ude30\ud835\ude33 \ud835\ude26\ud835\ude39\ud835\ude22\ud835\ude2e\ud835\ude31\ud835\ude2d\ud835\ude26 (\ud835\ude2d\ud835\ude30\ud835\ude30\ud835\ude2c \ud835\ude22\ud835\ude35 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude2a\ud835\ude2e\ud835\ude22\ud835\ude28\ud835\ude26 \ud835\ude23\ud835\ude26\ud835\ude2d\ud835\ude30\ud835\ude38): \n1\\. We defined a streaming app in a few lines of code. \n2\\. We run the streaming app with one command. \n \n. \n \nThe thing is that I worked in Kafka Streams (in Kotlin) for one year. \n \nI loved & understood the power of building streaming applications. The only\nthing that stood in my way was, well... Java. \n \nI don't have something with Java; it is a powerful language. However, building\nan ML application in Java + Python takes much time due to a more significant\nresistance to integrating the two. \n \n...and that's where Bytewax \ud83d\udc1d kicks in. \n \nWe used Bytewax \ud83d\udc1d for building the streaming pipeline for the \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00\ncourse and loved it.\n\nWhat is Bytewax? [Iamge by the Author].\n\n* * *\n\n### #2. Why have vector DBs become so popular? Why are they so crucial for\nmost ML applications?\n\nIn the world of ML, everything can be represented as an embedding. \n \nA vector DB is an intelligent way to use your data embeddings as an index and\nperform fast and scalable searches between unstructured data points. \n \nSimply put, a vector DB allows you to find matches between anything and\nanything (e.g., use an image as a query to find similar pieces of text, video,\nother images, etc.). \n \n. 
\n \n\ud835\ude10\ud835\ude2f \ud835\ude22 \ud835\ude2f\ud835\ude36\ud835\ude35\ud835\ude34\ud835\ude29\ud835\ude26\ud835\ude2d\ud835\ude2d, \ud835\ude35\ud835\ude29\ud835\ude2a\ud835\ude34 \ud835\ude2a\ud835\ude34 \ud835\ude29\ud835\ude30\ud835\ude38 \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude24\ud835\ude22\ud835\ude2f \ud835\ude2a\ud835\ude2f\ud835\ude35\ud835\ude26\ud835\ude28\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 \ud835\ude22 \ud835\ude37\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude30\ud835\ude33 \ud835\ude0b\ud835\ude09 \ud835\ude2a\ud835\ude2f \ud835\ude33\ud835\ude26\ud835\ude22\ud835\ude2d-\ud835\ude38\ud835\ude30\ud835\ude33\ud835\ude2d\ud835\ude25\n\ud835\ude34\ud835\ude24\ud835\ude26\ud835\ude2f\ud835\ude22\ud835\ude33\ud835\ude2a\ud835\ude30\ud835\ude34 \u2193 \n \nUsing various DL techniques, you can project your data points (images, videos,\ntext, audio, user interactions) into the same vector space (aka the embeddings\nof the data). \n \nYou will load the embeddings along a payload (e.g., a URL to the image, date\nof creation, image description, properties, etc.) into the vector DB, where\nthe data will be indexed along the: \n\\- vector \n\\- payload \n\\- text within the payload \n \nNow that the embedding indexes your data, you can query the vector DB by\nembedding any data point. \n \nFor example, you can query the vector DB with an image of your cat and use a\nfilter to retrieve only \"black\" cats. \n \nTo do so, you must embed the image using the same model you used to embed the\ndata within your vector DB. After you query the database using a given\ndistance (e.g., cosine distance between 2 vectors) to find similar embeddings. \n \nThese similar embeddings have attached to them their payload that contains\nvaluable information such as the URL to an image, a URL to a site, an ID of a\nuser, a chapter from a book about the cat of a witch, etc. \n \n. \n \nUsing this technique, I used Qdrant to implement RAG for a financial assistant\npowered by LLMs. \n \nBut vector DBs go beyond LLMs & RAG. 
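Here is a minimal sketch of the query flow described above: embed the question with the same model used at ingestion time, then search the vector DB with a metadata filter. The collection name, payload fields, and host are illustrative:

```python
# Minimal sketch of querying a vector DB: embed the query with the same model
# used at ingestion time, then search with a payload filter.
# Collection name, payload fields, and host are illustrative placeholders.
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # same model used to embed the data
client = QdrantClient(host="localhost", port=6333)

query_vector = encoder.encode("Latest news about interest rate hikes").tolist()

hits = client.search(
    collection_name="financial_news",
    query_vector=query_vector,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="source", match=models.MatchValue(value="alpaca"))]
    ),
    limit=3,
)
for hit in hits:
    # Each hit carries its similarity score and the payload stored at ingestion time.
    print(hit.score, hit.payload.get("source_url"))
```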
\n \n\ud835\ude0f\ud835\ude26\ud835\ude33\ud835\ude26 \ud835\ude2a\ud835\ude34 \ud835\ude22 \ud835\ude2d\ud835\ude2a\ud835\ude34\ud835\ude35 \ud835\ude30\ud835\ude27 \ud835\ude38\ud835\ude29\ud835\ude22\ud835\ude35 \ud835\ude3a\ud835\ude30\ud835\ude36 \ud835\ude24\ud835\ude22\ud835\ude2f \ud835\ude23\ud835\ude36\ud835\ude2a\ud835\ude2d\ud835\ude25 \ud835\ude36\ud835\ude34\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude37\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude30\ud835\ude33 \ud835\ude0b\ud835\ude09\ud835\ude34 (e.g., Qdrant ): \n \n\\- similar image search \n\\- semantic text search (instead of plain text search) \n\\- recommender systems \n\\- RAG for chatbots \n\\- anomalies detection \n \n\u21b3\ud83d\udd17 \ud835\ude0a\ud835\ude29\ud835\ude26\ud835\ude24\ud835\ude2c \ud835\ude30\ud835\ude36\ud835\ude35 \ud835\ude18\ud835\ude25\ud835\ude33\ud835\ude22\ud835\ude2f\ud835\ude35'\ud835\ude34 \ud835\ude28\ud835\ude36\ud835\ude2a\ud835\ude25\ud835\ude26\ud835\ude34 \ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude35\ud835\ude36\ud835\ude35\ud835\ude30\ud835\ude33\ud835\ude2a\ud835\ude22\ud835\ude2d\ud835\ude34 \ud835\ude35\ud835\ude30 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f \ud835\ude2e\ud835\ude30\ud835\ude33\ud835\ude26 \ud835\ude22\ud835\ude23\ud835\ude30\ud835\ude36\ud835\ude35 \ud835\ude37\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude30\ud835\ude33 \ud835\ude0b\ud835\ude09\ud835\ude34.\n\nQdrant\u2019s Architecture [Image from Qdrant docs].\n\n* * *\n\n### #3. How to implement a streaming pipeline to populate a vector DB for\nreal-time RAG?\n\nThis is \ud835\uddf5\ud835\uddfc\ud835\ude04 you can \ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 to populate a \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 to\ndo \ud835\udde5\ud835\uddd4\ud835\uddda for a \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\ude01 powered by \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00. \n \nIn a previous post, I covered \ud835\ude04\ud835\uddf5\ud835\ude06 you need a streaming pipeline over a batch\npipeline when implementing RAG. \n \nNow, we will focus on the \ud835\uddf5\ud835\uddfc\ud835\ude04, aka implementation details \u2193 \n \n\ud83d\udc1d All the following steps are wrapped in Bytewax functions and connected in a\nsingle streaming pipeline. \n \n\ud835\uddd8\ud835\ude05\ud835\ude01\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\ude01 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddfb\ud835\uddf2\ud835\ude04\ud835\ude00 \ud835\uddf3\ud835\uddff\ud835\uddfc\ud835\uddfa \ud835\uddd4\ud835\uddf9\ud835\uddfd\ud835\uddee\ud835\uddf0\ud835\uddee \n \nYou need 2 types of inputs: \n \n1\\. A WebSocket API to listen to financial news in real-time. This will be\nused to listen 24/7 for new data and ingest it as soon as it is available. \n \n2\\. A RESTful API to ingest historical data in batch mode. When you deploy a\nfresh vector DB, you must populate it with data between a given range\n[date_start; date_end]. 
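Before walking through the processing steps, here is a minimal sketch of the kind of `pydantic` validation described next; the field names are illustrative, not the course's exact schema:

```python
# Minimal sketch of validating an ingested news event with pydantic.
# Field names are illustrative, not the course's exact schema.
from datetime import datetime
from pydantic import BaseModel


class NewsArticle(BaseModel):
    article_id: str
    headline: str
    summary: str
    content: str          # raw HTML at ingestion time
    source_url: str
    published_at: datetime


raw_event = {
    "article_id": "abc-123",
    "headline": "Fed raises rates",
    "summary": "The Fed raised rates by 25 bps.",
    "content": "<p>The Federal Reserve ...</p>",
    "source_url": "https://example.com/fed-raises-rates",
    "published_at": "2023-11-20T14:30:00Z",
}

article = NewsArticle(**raw_event)  # raises a ValidationError if the schema is violated
print(article.headline)
```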
\n \nYou wrap the ingested HTML document and its metadata in a `pydantic`\nNewsArticle model to validate its schema. \n \nRegardless of the input type, the ingested data is the same. Thus, the\nfollowing steps are the same for both data inputs \u2193 \n \n\ud835\udde3\ud835\uddee\ud835\uddff\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udddb\ud835\udde7\ud835\udde0\ud835\udddf \ud835\uddf0\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \nAs the ingested financial news is in HTML, you must extract the text from\nparticular HTML tags. \n \n`unstructured` makes it as easy as calling `partition_html(document)`, which\nwill recursively return the text within all essential HTML tags. \n \nThe parsed NewsArticle model is mapped into another `pydantic` model to\nvalidate its new schema. \n \nThe elements of the news article are the headline, summary and full content. \n \n\ud835\uddd6\ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddfb \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddf2\ud835\ude05\ud835\ude01 \n \nNow we have a bunch of text that has to be cleaned. Again, `unstructured`\nmakes things easy. Calling a few functions we clean: \n\\- the dashes & bullets \n\\- extra whitespace & trailing punctuation \n\\- non ascii chars \n\\- invalid quotes \n \nFinally, we standardize everything to lowercase. \n \n\ud835\uddd6\ud835\uddf5\ud835\ude02\ud835\uddfb\ud835\uddf8 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\ude01\ud835\uddf2\ud835\ude05\ud835\ude01 \n \nAs the text can exceed the context window of the embedding model, we have to\nchunk it. \n \nYet again, `unstructured` provides a valuable function that splits the text\nbased on the tokenized text and expected input length of the embedding model. \n \nThis strategy is naive, as it doesn't consider the text's structure, such as\nchapters, paragraphs, etc. As the news is short, this is not an issue, but\nLangChain provides a `RecursiveCharacterTextSplitter` class that does that if\nrequired. \n \n\ud835\uddd8\ud835\uddfa\ud835\uddef\ud835\uddf2\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf0\ud835\uddf5\ud835\ude02\ud835\uddfb\ud835\uddf8\ud835\ude00 \n \nYou pass all the chunks through an encoder-only model. \n \nWe have used `all-MiniLM-L6-v2` from `sentence-transformers`, a small model\nthat can run on a CPU and outputs a 384 embedding. \n \nBut based on the size and complexity of your data, you might need more complex\nand bigger models. \n \n\ud835\udddf\ud835\uddfc\ud835\uddee\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\uddf6\ud835\uddfb \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udde4\ud835\uddf1\ud835\uddff\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 \n \nFinally, you insert the embedded chunks and their metadata into the Qdrant\nvector DB. \n \nThe metadata contains the embedded text, the source_url and the publish date.\n\nHow to implement a streaming pipeline to populate a vector DB for real-time\nRAG [Image by the Author].\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course to see this in action.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 5** of the **Hands-On LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. 
**The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n3\n\nShare this post\n\n#### DML: How to implement a streaming pipeline to populate a vector DB for\nreal-time RAG?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-implement-a-streaming?r=1ttoeh" }, { "id": "b2296169-eed0-4b28-864a-08b061f5ee45", "content": { "Title": "DML: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?", "Subtitle": "Lesson 3 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Why & what do you need a streaming pipeline when implementing RAG in\nyour LLM applications?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Why & what do you need a streaming pipeline when implementing RAG in\nyour LLM applications?\n\n### Lesson 3 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nNov 16, 2023\n\n3\n\nShare this post\n\n#### DML: Why & what do you need a streaming pipeline when implementing RAG in\nyour LLM applications?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 3 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. RAG: What problems does it solve, and how it's integrated into LLM-powered applications?\n\n 2. Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?\n\n 3. What do you need to implement a streaming pipeline for a financial assistant?\n\n#### Previous Lessons:\n\n * Lesson 1: How to design an LLM system for a financial assistant using the 3-pipeline design\n\n * Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. RAG: What problems does it solve, and how it's integrated into LLM-\npowered applications?\n\nLet's find out \u2193 \n \nRAG is a popular strategy when building LLMs to add external data to your\nprompt. 
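Before looking at the problems RAG solves, here is a minimal sketch of what "adding external data to your prompt" boils down to; the prompt template and the `retrieve_top_k` helper are hypothetical placeholders:

```python
# Minimal sketch of RAG prompt construction: inject retrieved context into the prompt.
# `retrieve_top_k` is a hypothetical helper standing in for a real vector DB query.

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""


def retrieve_top_k(question: str, k: int = 3) -> list[str]:
    # Placeholder: a real system embeds the question and queries a vector DB.
    return ["Fed raised rates by 25 bps on ...", "Markets reacted by ..."][:k]


def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {chunk}" for chunk in retrieve_top_k(question))
    return PROMPT_TEMPLATE.format(context=context, question=question)


print(build_rag_prompt("What did the Fed announce this week?"))
```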
\n \n=== \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddef\ud835\uddf9\ud835\uddf2\ud835\uddfa === \n \nWorking with LLMs has 3 main issues: \n \n1\\. The world moves fast \n \nAn LLM learns an internal knowledge base. However, the issue is that its\nknowledge is limited to its training dataset. \n \nThe world moves fast. New data flows on the internet every second. Thus, the\nmodel's knowledge base can quickly become obsolete. \n \nOne solution is to fine-tune the model every minute or day... \n \nIf you have some billions to spend around, go for it. \n \n2\\. Hallucinations \n \nAn LLM is full of testosterone and likes to be blindly confident. \n \nEven if the answer looks 100% legit, you can never fully trust it. \n \n3\\. Lack of reference links \n \nIt is hard to trust the response of the LLM if we can't see the source of its\ndecisions. \n \nEspecially for important decisions (e.g., health, financials) \n \n=== \ud835\udde6\ud835\uddfc\ud835\uddf9\ud835\ude02\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb === \n \n\u2192 Surprize! It is RAG. \n \n1\\. Avoid fine-tuning \n \nUsing RAG, you use the LLM as a reasoning engine and the external knowledge\nbase as the main memory (e.g., vector DB). \n \nThe memory is volatile, so you can quickly introduce or remove data. \n \n2\\. Avoid hallucinations \n \nBy forcing the LLM to answer solely based on the given context, the LLM will\nprovide an answer as follows: \n\\- use the external data to respond to the user's question if it contains the\nnecessary insights \n\\- \"I don't know\" if not \n \n3\\. Add reference links \n \nUsing RAG, you can easily track the source of the data and highlight it to the\nuser. \n \n=== \ud835\udddb\ud835\uddfc\ud835\ude04 \ud835\uddf1\ud835\uddfc\ud835\uddf2\ud835\ude00 \ud835\udde5\ud835\uddd4\ud835\uddda \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8? === \n \nLet's say we want to use RAG to build a financial assistant. \n \n\ud835\ude1e\ud835\ude29\ud835\ude22\ud835\ude35 \ud835\ude25\ud835\ude30 \ud835\ude38\ud835\ude26 \ud835\ude2f\ud835\ude26\ud835\ude26\ud835\ude25? \n \n\\- a data source with historical and real-time financial news (e.g. Alpaca) \n\\- a stream processing engine (e.g., Bytewax) \n\\- an encoder-only model for embedding the documents (e.g., pick one from\n`sentence-transformers`) \n\\- a vector DB (e.g., Qdrant) \n \n\ud835\ude0f\ud835\ude30\ud835\ude38 \ud835\ude25\ud835\ude30\ud835\ude26\ud835\ude34 \ud835\ude2a\ud835\ude35 \ud835\ude38\ud835\ude30\ud835\ude33\ud835\ude2c? \n \n\u21b3 On the feature pipeline side: \n \n1\\. using Bytewax, you ingest the financial news and clean them \n2\\. you chunk the news documents and embed them \n3\\. you insert the embedding of the docs along with their metadata (e.g., the\ninitial text, source_url, etc.) to Qdrant \n \n\u21b3 On the inference pipeline side: \n \n4\\. the user question is embedded (using the same embedding model) \n5\\. using this embedding, you extract the top K most similar news documents\nfrom Qdrant \n6\\. along with the user question, you inject the necessary metadata from the\nextracted top K documents into the prompt template (e.g., the text of\ndocuments & its source_url) \n7\\. you pass the whole prompt to the LLM for the final answer\n\nWhat is Retrieval Augmented Generation (RAG)? [Image by the Author].\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course to see this in action.\n\n* * *\n\n### #2. 
Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?

The quality of your RAG implementation is only as good as the quality & freshness of your data.

Thus, depending on your use case, you have to ask: "How fresh does my data in the vector DB have to be to provide accurate answers?"

But for the best user experience, the data has to be as fresh as possible, aka real-time data.

For example, when implementing a financial assistant, being aware of the latest financial news is critical. A new piece of information can completely change the course of your strategy.

Hence, when implementing RAG, one critical aspect is to keep your vector DB synced with all your external data sources in real time.

A batch pipeline will work if your use case accepts a particular delay (e.g., one hour, one day, etc.).

But with tools like Bytewax 🐝, building streaming applications becomes much more accessible. So why not aim for the best?

Streaming vs. batch pipelines when doing RAG [Image by the Author]

* * *

### #3. What do you need to implement a streaming pipeline for a financial assistant?

\- A financial news data source exposed through a web socket (e.g., Alpaca)

\- A Python stream processing framework. For example, Bytewax 🐝 is built in Rust for efficiency and exposes a Python interface for ease of use, so you don't need the Java ecosystem to implement real-time pipelines anymore.

\- A Python package to process, clean, and chunk documents. `unstructured` offers a rich set of features that makes parsing HTML documents extremely convenient.

\- An encoder-only language model that maps your chunked documents into embeddings. `sentence-transformers` is well integrated with HuggingFace and has a huge list of models of various sizes.

\- A vector DB, where you insert your embeddings and their metadata (e.g., the embedded text, the source_url, the creation date, etc.). For example, Qdrant provides a rich set of features and a seamless experience.

\- A way to deploy your streaming pipeline. Docker + AWS will never disappoint you.

\- A CI/CD pipeline for continuous tests & deployments. GitHub Actions is a great serverless option with a rich ecosystem.

This is what you need to build & deploy a streaming pipeline solely in Python 🔥

> ↳🔗 Check out the **Hands-on LLMs** course to see this in action.

* * *

That's it for today 👾

See you next Thursday at 9:00 a.m. CET.

Have a fantastic weekend!

…and see you next week for **Lesson 4** of the **Hands-On LLMs series** 🔥

Paul

* * *

#### Whenever you're ready, here is how I can help you:

 1. **The Full Stack 7-Steps MLOps Framework:** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.

 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.

 3.
**Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n3\n\nShare this post\n\n#### DML: Why & what do you need a streaming pipeline when implementing RAG in\nyour LLM applications?\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-why-and-what-do-you-need-a-streaming?r=1ttoeh" }, { "id": "032f3296-b891-484d-9e00-c2872bbb9bbe", "content": { "Title": "DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps", "Subtitle": "Lesson 2 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\n### Lesson 2 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nNov 09, 2023\n\n6\n\nShare this post\n\n#### DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n### **Lesson 2 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. Introduction video lessons \n\n 2. What is LLMOps? MLOps vs. LLMOps\n\n 3. Unwrapping step-by-step the 3-pipeline design of a financial assistant powered by LLMs\n\n#### Previous Lessons:\n\n * Lesson 1: How to design an LLM system for a financial assistant using the 3-pipeline design\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #1. Introduction video lessons\n\nWe started releasing the first video lessons of the course.\n\nThis is a recording of me, where I presented at a webinar hosted by Gathers, a\n1.5-hour overview of the \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 course.\n\nCheck it out to get a gut feeling of the LLM system \u2193\n\nThis is the **1st official lesson** of the **Hands-on LLMs** course presented\nby no other but\n\nPau Labarta Bajo\n\nfrom the **Real-World Machine Learning** newsletter (if you wonder, the course\nis the result of our collaboration).\n\nPau is one of the best teachers I know. If you have some spare time, it is\nworth it \u2193\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course and support it with a \u2b50.\n\n* * *\n\n### #2. What is LLMOps? MLOps vs. LLMOps\n\nLLMOps here, LLMOps there, but did you take the time to see how it differs\nfrom MLOps? 
\n \nIf not, here is a 2-minute LLMOps vs. MLOps summary \u2193 \n \n**What is LLMOps?** \n \nWell, everything revolves around the idea that "size matters." \n \nLLMOps is about best practices for efficient deployment, monitoring, and maintenance, but this time for large language models. \n \nLLMOps is a subset of MLOps, focusing on training & deploying large models trained on big data. \n \nIntuitive, right? \n \n**But here are 5 LLMOps-specific factors that set it apart from MLOps \u2193** \n \n1\\. **Computational resources**: training your models on CUDA-enabled GPUs is more critical than ever, along with knowing how to run your jobs on a cluster of GPUs, leveraging data & model parallelism with techniques such as ZeRO from DeepSpeed. Also, the high cost of inference makes model compression techniques essential for deployment. \n \n2\\. **Transfer learning**: training models from scratch is a thing of the past. In most use cases, you will fine-tune the model on specific tasks, leveraging techniques such as LLaMA-Adapters or QLoRA. \n \n3\\. **Human feedback**: reinforcement learning from human feedback (RLHF) has shown great potential for improving the quality of generated outputs. But to do RLHF, you have to introduce a feedback loop into your ML system that lets you evaluate the generated results based on human feedback, which is then used to fine-tune your LLMs. \n \n4\\. **Guardrails**: to create safe systems, you must protect your systems against harmful or violent inputs and outputs. Also, when designing your prompt templates, you must account for hallucinations and prompt hacking. \n \n5\\. 
**Monitoring & analyzing prompts**: most ML platforms (e.g., Comet ML) have introduced specialized logging tools to debug and monitor your LLMs, helping you find better prompt templates and protect against hallucinations and prompt hacking.\n\nWhat is LLMOps? LLMOps vs. MLOps [Image by the Author]\n\nTo conclude... \n \nLLMOps isn't anything new for those familiar with MLOps and deep learning. \n \nFor example, training deep learning models on clusters of GPUs or fine-tuning them isn't new, but now it is more important than ever to master these skills as models get bigger and bigger. \n \nBut LLMOps did introduce novel techniques for fine-tuning models (e.g., QLoRA), it merged the fields of RL and DL, and it brought a plethora of tools for prompt manipulation & storage, such as: \n\\- vector DBs (e.g., Qdrant) \n\\- prompt chaining (e.g., LangChain) \n\\- prompt logging & analytics (e.g., Comet LLMOps) \n \nBut with the new trend of multi-modal large models, these tips & tricks will converge towards all deep learning models (e.g., computer vision), and soon we may well rename LLMOps to DLOps or LMOps. \n \nWhat do you think? Is the term LLMOps going to stick around?\n\n* * *\n\n### #3. Unwrapping step-by-step the 3-pipeline design of a financial assistant powered by LLMs\n\nHere is a step-by-step guide to designing the architecture of a financial assistant powered by LLMs, vector DBs, and MLOps. \n \nThe 3-pipeline design, also known as the FTI architecture, keeps things simple \u2193 \n \n=== **Feature Pipeline** === \n \nWe want to build a streaming pipeline that listens to real-time financial news, embeds the news, and loads everything into a vector DB. The goal is to add up-to-date news to the user's questions using RAG and avoid retraining. \n \n1\\. We listen 24/7 to financial news from Alpaca through a WebSocket wrapped over a Bytewax connector. \n2\\. Once any financial news is received, it is passed to the Bytewax flow, which: \n\\- extracts & cleans the necessary information from the news HTML document \n\\- chunks the text based on the LLM's max context window \n\\- embeds all the chunks using the "all-MiniLM-L6-v2" encoder-only model from sentence-transformers \n\\- inserts all the embeddings, along with their metadata, into Qdrant \n3\\. The streaming pipeline is deployed to an EC2 machine that runs multiple Bytewax processes. It can be deployed to K8s in a multi-node setup to scale up. \n \n=== **Training Pipeline** === \n \nWe want to fine-tune a pretrained LLM to specialize it in answering financial questions. \n \n1\\. Manually write ~100 financial questions. \n2\\. Use RAG to enrich the questions with financial news from the Qdrant vector DB. \n3\\. Use a powerful model such as GPT-4 to answer them, or hire an expert if you have more time and resources. \n4\\. Load Falcon from HuggingFace using QLoRA to fit on a single GPU. \n5\\. 
Preprocess the Q&A dataset into prompts. \n6\\. Fine-tune the LLM and log all the artifacts to Comet's experiment tracker\n(loss, model weights, etc.) \n7\\. For every epoch, run the LLM on your test set, log the prompts to Comet's\nprompt logging feature and compute the metrics. \n8\\. Send the best LoRA weights to the model registry as the next production\ncandidate. \n9\\. Deploy steps 4-8 to Beam to run the training on an A10G or A100 Nvidia\nGPU. \n \n=== \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 === \n \nWe want to hook the financial news stored in the Qdrant Vector DB and the\nFalcon fine-tuned model into a single entity exposed under a RESTful API. \n \nSteps 1-7 are all chained together using LangChain. \n \n1\\. Use the \"all-MiniLM-L6-v2\" encoder-only model to embed the user's\nquestion. \n2\\. Using the question embedding, query the Qdrant vector DB to find the top 3\nrelated financial news. \n3\\. Attach the text (stored as metadata along the embeddings) of the news to\nthe prompt (aka RAG). \n4\\. Download Falcon's pretrained weights from HF & LoRA's fine-tuned weights\nfrom Comet's model registry. \n5\\. Load the LLM and pass the prompt (= the user's question, financial news,\nhistory) to it. \n6\\. Store the conversation in LangChain's memory. \n7\\. Deploy steps 1-7 under a RESTful API using Beam.\n\n3-pipeline architecture [Image by the Author]\n\n> \u21b3\ud83d\udd17 Check out the **Hands-on LLMs** course to see this in action.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 3** of the **Hands-On LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n6\n\nShare this post\n\n#### DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs | LLMOps vs. MLOps\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-unwrapping-the-3-pipeline-design?r=1ttoeh" }, { "id": "21c92489-204c-4791-b4dd-f0c2487f7e82", "content": { "Title": "DML: How to design an LLM system for a financial assistant using the 3-pipeline design", "Subtitle": "Lesson 1 | The Hands-on LLMs Series", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: How to design an LLM system for a financial assistant using the\n3-pipeline design\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: How to design an LLM system for a financial assistant using the\n3-pipeline design\n\n### Lesson 1 | The Hands-on LLMs Series\n\nPaul Iusztin\n\nNov 02, 2023\n\n5\n\nShare this post\n\n#### DML: How to design an LLM system for a financial assistant using the\n3-pipeline design\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n> As promised, starting this week, we will **begin** the **series** based on\n> the **Hands-on LLMs FREE course**.\n\nNote that this is not the course itself. It is an overview for all the busy\npeople who will focus on the key aspects.\n\nThe entire course will soon be available on \ud83d\udd17 GitHub.\n\n* * *\n\n### **Lesson 1 | The Hands-on LLMs Series**\n\n#### **Table of Contents:**\n\n 1. What is the 3-pipeline design\n\n 2. How to apply the 3-pipeline design in architecting a financial assistant powered by LLMs\n\n 3. The tech stack used to build an end-to-end LLM system for a financial assistant \n\n* * *\n\nAs the Hands-on LLMs course is still a \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 \ud835\uddf6\ud835\uddfb \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf4\ud835\uddff\ud835\uddf2\ud835\ude00\ud835\ude00, we want to \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02\n\ud835\ude02\ud835\uddfd\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 on our progress \u2193 \n\n> \u21b3 Thus, we opened up the \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\uddf0\ud835\ude02\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\ude01\ud835\uddee\ud835\uddef under the course's GitHub\n> Repository, where we will \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude02\ud835\uddfd\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 with everything is happening.\n\n \nAlso, if you have any \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\ude00, \ud835\ude00\ud835\ude02\ud835\uddf4\ud835\uddf4\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, \ud835\uddfe\ud835\ude02\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 or want to \ud835\uddf0\ud835\uddf5\ud835\uddee\ud835\ude01, we\nencourage you to \ud835\uddf0\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddee \"\ud835\uddfb\ud835\uddf2\ud835\ude04 \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\uddf0\ud835\ude02\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb\". 
\n \n\u2193 We want the course to fill your real needs \u2193 \n \n\u21b3 Hence, if your suggestion fits well with our hands-on course direction, we\nwill consider implementing it.\n\nHands-on LLMs course discussions section [Image by the Author].\n\nCheck it out and leave a \u2b50 if you like what you see: \n\u21b3\ud83d\udd17 Hands-on LLMs course\n\n* * *\n\n### #1. What is the 3-pipeline design\n\nWe all know how \ud835\uddfa\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\ude06 \ud835\udde0\ud835\udddf \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 can get. That is where the \ud835\udfef-\ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2\n\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddf8\ud835\uddf6\ud835\uddf0\ud835\uddf8\ud835\ude00 \ud835\uddf6\ud835\uddfb. \n \nThe 3-pipeline design is a way to bring structure & modularity to your ML\nsystem and improve your MLOps processes. \n \nThis is how \u2193 \n \n=== \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddef\ud835\uddf9\ud835\uddf2\ud835\uddfa === \n \nDespite advances in MLOps tooling, transitioning from prototype to production\nremains challenging. \n \nIn 2022, only 54% of the models get into production. Auch. \n \nSo what happens? \n \nSometimes the model is not mature enough, sometimes there are some security\nrisks, but most of the time... \n \n...the architecture of the ML system is built with research in mind, or the ML\nsystem becomes a massive monolith that is extremely hard to refactor from\noffline to online. \n \nSo, good processes and a well-defined architecture are as crucial as good\ntools and models. \n \n \n=== \ud835\udde6\ud835\uddfc\ud835\uddf9\ud835\ude02\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb === \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 3-\ud835\ude31\ud835\ude2a\ud835\ude31\ud835\ude26\ud835\ude2d\ud835\ude2a\ud835\ude2f\ud835\ude26 \ud835\ude22\ud835\ude33\ud835\ude24\ud835\ude29\ud835\ude2a\ud835\ude35\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude36\ud835\ude33\ud835\ude26. \n \nFirst, let's understand what the 3-pipeline design is. \n \nIt is a mental map that helps you simplify the development process and split\nyour monolithic ML pipeline into 3 components: \n1\\. the feature pipeline \n2\\. the training pipeline \n3\\. the inference pipeline \n \n...also known as the Feature/Training/Inference (FTI) architecture. \n \n. \n \n#\ud835\udfed. The feature pipeline transforms your data into features & labels, which\nare stored and versioned in a feature store. \n \n#\ud835\udfee. The training pipeline ingests a specific version of the features & labels\nfrom the feature store and outputs the trained models, which are stored and\nversioned inside a model registry. \n \n#\ud835\udfef. The inference pipeline takes a given version of the features and trained\nmodels and outputs the predictions to a client. \n \n. 
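To make the FTI interface concrete, here is a minimal, illustrative sketch (the names and toy logic are mine, not taken from the course code): the three pipelines only communicate through a feature store and a model registry, here stood in for by plain dicts.

```python
# Illustrative FTI sketch: three pipelines that only share state through
# a feature store and a model registry (toy in-memory stand-ins).

feature_store: dict = {}   # stands in for a real feature store / vector DB
model_registry: dict = {}  # stands in for a real model registry


def feature_pipeline(raw_docs: list) -> None:
    """Transform raw data into features and version them in the feature store."""
    features = [doc.lower().strip() for doc in raw_docs]  # toy "feature engineering"
    feature_store["v1"] = features


def training_pipeline(feature_version: str) -> None:
    """Ingest a specific feature version and output a versioned model artifact."""
    features = feature_store[feature_version]
    model = {"vocab": sorted(set(" ".join(features).split()))}  # toy "model"
    model_registry["v1"] = model


def inference_pipeline(feature_version: str, model_version: str, query: str) -> dict:
    """Take given feature and model versions and serve a prediction to a client."""
    features = feature_store[feature_version]
    model = model_registry[model_version]
    context = [doc for doc in features if query.lower() in doc]  # toy retrieval
    return {"context": context, "known_term": query.lower() in model["vocab"]}


if __name__ == "__main__":
    feature_pipeline(["Fed raises rates", "Tech stocks rally"])
    training_pipeline("v1")
    print(inference_pipeline("v1", "v1", "rates"))
```

The point is the interface, not the toy logic: each function could be swapped for a real batch, streaming, or training job without the other two noticing.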
\n \nThis is why the 3-pipeline design is so beautiful: \n \n\\- it is intuitive \n\\- it brings structure, as on a higher level, all ML systems can be reduced to\nthese 3 components \n\\- it defines a transparent interface between the 3 components, making it\neasier for multiple teams to collaborate \n\\- the ML system has been built with modularity in mind since the beginning \n\\- the 3 components can easily be divided between multiple teams (if\nnecessary) \n\\- every component can use the best stack of technologies available for the\njob \n\\- every component can be deployed, scaled, and monitored independently \n\\- the feature pipeline can easily be either batch, streaming or both \n \nBut the most important benefit is that... \n \n...by following this pattern, you know 100% that your ML model will move out\nof your Notebooks into production.\n\nWhat is the 3-pipeline design & Why should you adopt it in your ML systems?\n[Image by the Author].\n\nWhat do you think about the 3-pipeline architecture? Have you used it? \n \nIf you want to know more about the 3-pipeline design, I recommend this awesome\narticle from Hopsworks \u2193 \n\u21b3\ud83d\udd17 From MLOps to ML Systems with Feature/Training/Inference Pipelines\n\n* * *\n\n### #2. How to apply the 3-pipeline design in architecting a financial\nassistant powered by LLMs\n\nBuilding ML systems is hard, right? Wrong. \n \nHere is how the \ud835\udfef-\ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb can make \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 the \ud835\udde0\ud835\udddf \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa for a\n\ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06 \u2193 \n \n. \n \nI already covered the concepts of the 3-pipeline design in my previous post,\nbut here is a quick recap: \n \n\"\"\" \nIt is a mental map that helps you simplify the development process and split\nyour monolithic ML pipeline into 3 components: \n1\\. the feature pipeline \n2\\. the training pipeline \n3\\. the inference pipeline \n...also known as the Feature/Training/Inference (FTI) architecture. \n\"\"\" \n \n. \n \nNow, let's see how you can use the FTI architecture to build a financial\nassistant powered by LLMs \u2193 \n \n#\ud835\udfed. \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \nThe feature pipeline is designed as a streaming pipeline that extracts real-\ntime financial news from Alpaca and: \n \n\\- cleans and chunks the news documents \n\\- embeds the chunks using an encoder-only LM \n\\- loads the embeddings + their metadata in a vector DB \n\\- deploys it to AWS \n \nIn this architecture, the vector DB acts as the feature store. \n \nThe vector DB will stay in sync with the latest news to attach real-time\ncontext to the LLM using RAG. \n \n#\ud835\udfee. 
\ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \nThe training pipeline is split into 2 main steps: \n \n\u21b3 \ud835\udde4&\ud835\uddd4 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01 \ud835\ude00\ud835\uddf2\ud835\uddfa\ud835\uddf6-\ud835\uddee\ud835\ude02\ud835\ude01\ud835\uddfc\ud835\uddfa\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd \n \nIt takes the vector DB (feature store) and a set of predefined questions\n(manually written) as input. \n \nAfter, you: \n \n\\- use RAG to inject the context along the predefined questions \n\\- use a large & powerful model, such as GPT-4, to generate the answers \n\\- save the generated dataset under a new version \n \n\u21b3 \ud835\uddd9\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd \n \n\\- download a pre-trained LLM from Huggingface \n\\- load the LLM using QLoRA \n\\- preprocesses the generated Q&A dataset into a format expected by the LLM \n\\- fine-tune the LLM \n\\- push the best QLoRA weights (model) to a model registry \n\\- deploy it using a serverless solution as a continuous training pipeline \n \n#\ud835\udfef. \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \n \nThe inference pipeline is the financial assistant that the clients actively\nuse. \n \nIt uses the vector DB (feature store) and QLoRA weights (model) from the model\nregistry in the following way: \n \n\\- download the pre-trained LLM from Huggingface \n\\- load the LLM using the pretrained QLoRA weights \n\\- connect the LLM and vector DB into a chain \n\\- use RAG to add relevant financial news from the vector DB \n\\- deploy it using a serverless solution under a RESTful API\n\nThe architecture of a financial assistant using the 3 pipeline design [Image\nby the Author].\n\nHere are the main benefits of using the FTI architecture: \n\\- it defines a transparent interface between the 3 modules \n\\- every component can use different technologies to implement and deploy the\npipeline \n\\- the 3 pipelines are loosely coupled through the feature store & model\nregistry \n\\- every component can be scaled independently\n\n> See this architecture in action in my \ud83d\udd17 \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 FREE course.\n\n* * *\n\n### #3. 
The tech stack used to build an end-to-end LLM system for a financial\nassistant\n\nThe tools are divided based on the \ud835\udfef-\ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 (aka \ud835\uddd9\ud835\udde7\ud835\udddc) \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2: \n \n=== \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 === \n \nWhat do you need to build a streaming pipeline? \n \n\u2192 streaming processing framework: Bytewax (brings the speed of Rust into our\nbeloved Python ecosystem) \n \n\u2192 parse, clean, and chunk documents: unstructured \n \n\u2192 validate document structure: pydantic \n \n\u2192 encoder-only language model: HuggingFace sentence-transformers, PyTorch \n \n\u2192 vector DB: Qdrant \n \n\u2192deploy: Docker, AWS \n \n\u2192 CI/CD: GitHub Actions \n \n \n=== \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 === \n \nWhat do you need to build a fine-tuning pipeline? \n \n\u2192 pretrained LLM: HuggingFace Hub \n \n\u2192 parameter efficient tuning method: peft (= LoRA) \n \n\u2192 quantization: bitsandbytes (= QLoRA) \n \n\u2192 training: HuggingFace transformers, PyTorch, trl \n \n\u2192 distributed training: accelerate \n \n\u2192 experiment tracking: Comet ML \n \n\u2192 model registry: Comet ML \n \n\u2192 prompt monitoring: Comet ML \n \n\u2192 continuous training serverless deployment: Beam \n \n \n=== \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 === \n \nWhat do you need to build a financial assistant? \n \n\u2192 framework for developing applications powered by language models: LangChain \n \n\u2192 model registry: Comet ML \n \n\u2192 inference: HuggingFace transformers, PyTorch, peft (to load the LoRA\nweights) \n \n\u2192 quantization: bitsandbytes \n \n\u2192 distributed inference: accelerate \n \n\u2192 encoder-only language model: HuggingFace sentence-transformers \n \n\u2192 vector DB: Qdrant \n \n\u2192 prompt monitoring: Comet ML \n \n\u2192 RESTful API serverless service: Beam \n \n. \n \nAs you can see, some tools overlap between the FTI pipelines, but not all. \n \nThis is the beauty of the 3-pipeline design, as every component represents a\ndifferent entity for which you can pick the best stack to build, deploy, and\nmonitor. \n \nYou can go wild and use Tensorflow in one of the components if you want your\ncolleges to hate you \ud83d\ude02\n\n> See the tools in action in my \ud83d\udd17 \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 FREE course.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for **Lesson 2** of the **Hands-On LLMs series** \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. 
**The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n5\n\nShare this post\n\n#### DML: How to design an LLM system for a financial assistant using the\n3-pipeline design\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-design-an-llm-system-for?r=1ttoeh" }, { "id": "007833f1-fb36-470f-adad-78143f817fee", "content": { "Title": "DML: Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications", "Subtitle": "Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG\nin Your LLM Applications\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG in\nYour LLM Applications\n\nPaul Iusztin\n\nOct 26, 2023\n\n4\n\nShare this post\n\n#### DML: Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG\nin Your LLM Applications\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n**This week\u2019s ML & MLOps topics:**\n\n 1. Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time Rag in Your LLM Applications\n\n> **Story:** If anyone told you that ML or MLOps is easy, they were right. A\n> simple trick I learned the hard way.\n\n* * *\n\nThis week\u2019s newsletter is shorter than usual, but I have some great news \ud83d\udd25\n\n> Next week, within the Decoding ML newsletter, I will start a step-by-step\n> series based on the Hands-On LLMs course I am developing.\n>\n> By the end of this series, you will know how to design, build, and deploy a\n> financial assistant powered by LLMs.\n>\n> \u2026all of this for FREE inside the Decoding ML newsletter\n\n\u21b3\ud83d\udd17 Check out the Hands-On LLMs course GitHub page and give it a star to stay\nupdated with our progress.\n\n* * *\n\n### #1. 
Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time Rag\nin Your LLM Applications\n\nTo successfully use \ud835\udde5\ud835\uddd4\ud835\uddda in your \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddee\ud835\uddfd\ud835\uddfd\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, your \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 must\nconstantly be updated with the latest data. \n \nHere is how you can implement a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 to keep your vector DB in\nsync with your datasets \u2193 \n \n. \n \n\ud835\udde5\ud835\uddd4\ud835\uddda is a popular strategy when building LLMs to add context to your prompt\nabout your private datasets. \n \nLeveraging your domain data using RAG provides 2 significant benefits: \n\\- you don't need to fine-tune your model as often (or at all) \n\\- avoid hallucinations \n \n. \n \nOn the \ud835\uddef\ud835\uddfc\ud835\ude01 \ud835\ude00\ud835\uddf6\ud835\uddf1\ud835\uddf2, to implement RAG, you have to: \n \n3\\. Embed the user's question using an embedding model (e.g., BERT). Use the\nembedding to query your vector DB and find the most similar vectors using a\ndistance function (e.g., cos similarity). \n4\\. Get the top N closest vectors and their metadata. \n5\\. Attach the extracted top N vectors metadata + the chat history to the\ninput prompt. \n6\\. Pass the prompt to the LLM. \n7\\. Insert the user question + assistant answer to the chat history. \n \n. \n \nBut the question is, \ud835\uddf5\ud835\uddfc\ud835\ude04 do you \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 \ud835\ude02\ud835\uddfd \ud835\ude01\ud835\uddfc \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\ude00\ud835\ude01\n\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee? \n \n\u21b3 You need a real-time streaming pipeline. \n \nHow do you implement it? \n \nYou need 2 components: \n \n\u21b3 A streaming processing framework. For example, Bytewax is built in Rust for\nefficiency and exposes a Python interface for ease of use - you don't need\nJava to implement real-time pipelines anymore. \n \n\ud83d\udd17 Bytewax \n \n\u21b3 A vector DB. For example, Qdrant provides a rich set of features and a\nseamless experience. \n \n\ud83d\udd17 Qdrant \n \n. \n \nHere is an example of how to implement a streaming pipeline for financial news\n\u2193 \n \n#\ud835\udfed. Financial news data source (e.g., Alpaca): \n \nTo populate your vector DB, you need a historical API (e.g., RESTful API) to\nadd data to your vector DB in batch mode between a desired [start_date,\nend_date] range. You can tweak the number of workers to parallelize this step\nas much as possible. \n\u2192 You run this once in the beginning. \n \nYou need the data exposed under a web socket to ingest news in real time. So,\nyou'll be able to listen to the news and ingest it in your vector DB as soon\nas they are available. \n\u2192 Listens 24/7 for financial news. \n \n#\ud835\udfee. 
Build the streaming pipeline using Bytewax: \n \nImplement 2 input connectors for the 2 different types of APIs: RESTful API &\nweb socket. \n \nThe rest of the steps can be shared between both connectors \u2193 \n \n\\- Clean financial news documents. \n\\- Chunk the documents. \n\\- Embed the documents (e.g., using Bert). \n\\- Insert the embedded documents + their metadata to the vector DB (e.g.,\nQdrant). \n \n#\ud835\udfef-\ud835\udff3. When the users ask a financial question, you can leverage RAG with an\nup-to-date vector DB to search for the latest news in the industry.\n\nSynced Vector DBs - A Guide to Streaming Pipelines for Real-Time Rag in Your\nLLM Applications [Image by the Author]\n\n* * *\n\n### #Story. If anyone told you that ML or MLOps is easy, they were right. A\nsimple trick I learned the hard way.\n\nIf anyone told you that \ud835\udde0\ud835\udddf or \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 is \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06, they were \ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01. \n \nHere is a simple trick that I learned the hard way \u2193 \n \nIf you are in this domain, you already know that everything changes fast: \n \n\\- a new tool every month \n\\- a new model every week \n\\- a new project every day \n \nYou know what I did? I stopped caring about all these changes and switched my\nattention to the real gold. \n \nWhich is \u2192 \"\ud835\uddd9\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\ude00 \ud835\uddfc\ud835\uddfb \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf9\ud835\ude00.\" \n \n. \n \nLet me explain \u2193 \n \nWhen you constantly chase the latest models (aka FOMO), you will only have a\nshallow understanding of that new information (except if you are a genius or\nalready deep into that niche). \n \nBut the joke's on you. In reality, most of what you think you need to know,\nyou don't. \n \nSo you won't use what you learned and forget most of it after 1-2 months. \n \nWhat a waste of time, right? \n \n. \n \nBut... \n \nIf you master the fundamentals of the topic, you want to learn. \n \nFor example, for deep learning, you have to know: \n \n\\- how models are built \n\\- how they are trained \n\\- groundbreaking architectures (Resnet, UNet, Transformers, etc.) \n\\- parallel training \n\\- deploying a model, etc. \n \n...when in need (e.g., you just moved on to a new project), you can easily\npick up the latest research. \n \nThus, after you have laid the foundation, it is straightforward to learn SoTA\napproaches when needed (if needed). \n \nMost importantly, what you learn will stick with you, and you will have the\nflexibility to jump from one project to another quickly. \n \n. \n \nI am also guilty. I used to FOMO into all kinds of topics until I was honest\nwith myself and admitted I am no Leonardo Da Vinci. \n \nBut here is what I did and worked well: \n \n\\- building projects \n\\- replicating the implementations of famous papers \n\\- teaching the subject I want to learn \n... and most importantly, take my time to relax and internalize the\ninformation.\n\nTo conclude: \n \n\\- learn ahead only the fundamentals \n\\- learn the latest trend only when needed\n\n[Image by the Author]\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. 
CET.\n\nHave a fantastic weekend!\n\n\u2026and see you next week for the beginning of the Hands-On LLMs series \ud83d\udd25\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-synced-vector-dbs-a-guide-to?r=1ttoeh" }, { "id": "e9353901-9ba9-483c-8c59-2de649c9743a", "content": { "Title": "DML: What is the difference between your ML development and continuous training environments?", "Subtitle": "3 techniques you must know to evaluate your LLMs quickly. Experimentation vs. continuous training environments.", "Content": "# DML: What is the difference between your ML development and continuous\ntraining environments?\n\n### 3 techniques you must know to evaluate your LLMs quickly. Experimentation\nvs. continuous training environments.\n\nPaul Iusztin\n\nOct 19, 2023\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n**This week\u2019s ML & MLOps topics:**\n\n 1. 3 techniques you must know to evaluate your LLMs quickly\n\n 2. What is the difference between your ML development and continuous training environments?\n\n> **Story:** Job roles tell you there is just one type of MLE, but there are\n> actually 3.\n\n* * *\n\n> But first, I want to let you know that after 1 year of making content, I\n> finally decided to share my content on **Twitter/X**.\n\nI took this decision because everybody has a different way of reading and\ninteracting with their socials. 
\n \n...and I want everyone to enjoy my content on their favorite platform.\n\nI even bought that stu*** blue ticker to see that I am serious about this \ud83d\ude02\n\nSo... \n\n> If **you like my content** and you are a **Twitter/X** **person** \u2193\n>\n> \u21b3\ud83d\udd17 **follow** at @\ud835\udc22\ud835\udc2e\ud835\udc2c\ud835\udc33\ud835\udc2d\ud835\udc22\ud835\udc27\ud835\udc29\ud835\udc1a\ud835\udc2e\ud835\udc25\n\n* * *\n\n### #1. 3 techniques you must know to evaluate your LLMs quickly\n\nManually testing the output of your LLMs is a tedious and painful process \u2192\nyou need to automate it. \n \nIn generative AI, most of the time, you cannot leverage standard metrics. \n \nThus, the real question is, how do you evaluate the outputs of an LLM? \n \nDepending on your problem, here is what you can do \u2193 \n \n#\ud835\udfed. \ud835\udde6\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff\ud835\ude00 - \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf8\ud835\uddfb\ud835\uddfc\ud835\ude04 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude04\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\ude01\ud835\uddfc \ud835\uddf4\ud835\uddf2\ud835\ude01 \n \nEven if you use an LLM to generate text, you can ask it to generate a response\nin a structured format (e.g., JSON) that can be parsed. \n \nYou know exactly what you want (e.g., a list of products extracted from the\nuser's question). \n \nThus, you can easily compare the generated and ideal answers using classic\napproaches. \n \nFor example, when extracting the list of products from the user's input, you\ncan do the following: \n\\- check if the LLM outputs a valid JSON structure \n\\- use a classic method to compare the generated and real answers \n \n#\ud835\udfee. \ud835\udde1\ud835\uddfc \"\ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01\" \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff (\ud835\uddf2.\ud835\uddf4., \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude00, \ud835\uddf2\ud835\ude01\ud835\uddf0.) \n \nWhen generating sentences, the LLM can use different styles, words, etc. Thus,\ntraditional metrics (e.g., BLUE score) are too rigid to be useful. \n \nYou can leverage another LLM to test the output of our initial LLM. The trick\nis in what questions to ask. \n \nWhen testing LLMs, you won't have a big testing split size as you are used to.\nA set of 10-100 tricky examples usually do the job (it won't be costly). 
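Before the two sub-scenarios of technique #2 below, here is a minimal sketch of technique #1 (structured answers). The expected schema and field names are made up for illustration; the check is just JSON parsing plus a classic set-based comparison.

```python
import json


def evaluate_structured_answer(generated: str, ideal_products: set) -> dict:
    """Technique #1: validate the JSON structure, then compare with classic metrics."""
    try:
        payload = json.loads(generated)        # 1) is the output valid JSON?
        products = set(payload["products"])    # 2) does it contain the expected field?
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"valid_json": False, "precision": 0.0, "recall": 0.0}

    true_positives = products & ideal_products  # 3) classic set-based comparison
    precision = len(true_positives) / len(products) if products else 0.0
    recall = len(true_positives) / len(ideal_products) if ideal_products else 0.0
    return {"valid_json": True, "precision": precision, "recall": recall}


# Example: the LLM was asked to extract products mentioned in the user's question.
print(evaluate_structured_answer('{"products": ["iPhone 15", "AirPods"]}', {"iPhone 15"}))
```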
\n \nHere, we have another 2 sub scenarios: \n \n\u21b3 \ud835\udfee.\ud835\udfed \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb'\ud835\ude01 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc (\ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb'\ud835\ude01\n\ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5) \n \nYou don't have access to an expert to write an ideal answer for a given\nquestion to compare it to. \n \nBased on the initial prompt and generated answer, you can compile a set of\nquestions and pass them to an LLM. Usually, these are Y/N questions that you\ncan easily quantify and check the validity of the generated answer. \n \nThis is known as \"Rubric Evaluation\" \n \nFor example: \n\"\"\" \n\\- Is there any disagreement between the response and the context? (Y or N) \n\\- Count how many questions the user asked. (output a number) \n... \n\"\"\" \n \nThis strategy is intuitive, as you can ask the LLM any question you are\ninterested in as long it can output a quantifiable answer (Y/N or a number). \n \n\u21b3 \ud835\udfee.\ud835\udfee. \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddfc (\ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2\n\ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5) \n \nWhen you can access an answer manually created by a group of experts, things\nare easier. \n \nYou will use an LLM to compare the generated and ideal answers based on\nsemantics, not structure. \n \nFor example: \n\"\"\" \n(A) The submitted answer is a subset of the expert answer and entirely\nconsistent. \n... \n(E) The answers differ, but these differences don't matter. \n\"\"\"\n\n3 techniques you must know to evaluate your LLMs quickly [Image by the\nAuthor].\n\n* * *\n\n### #2. 
What is the difference between your ML development and continuous\ntraining environments?\n\nThey might do the same thing, but their design is entirely different \u2193 \n \n\ud835\udde0\ud835\udddf \ud835\uddd7\ud835\uddf2\ud835\ude03\ud835\uddf2\ud835\uddf9\ud835\uddfc\ud835\uddfd\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \nAt this point, your main goal is to ingest the raw and preprocessed data\nthrough versioned artifacts (or a feature store), analyze it & generate as\nmany experiments as possible to find the best: \n\\- model \n\\- hyperparameters \n\\- augmentations \n \nBased on your business requirements, you must maximize some specific metrics,\nfind the best latency-accuracy trade-offs, etc. \n \nYou will use an experiment tracker to compare all these experiments. \n \nAfter you settle on the best one, the output of your ML development\nenvironment will be: \n\\- a new version of the code \n\\- a new version of the configuration artifact \n \nHere is where the research happens. Thus, you need flexibility. \n \nThat is why we decouple it from the rest of the ML systems through artifacts\n(data, config, & code artifacts). \n \n\ud835\uddd6\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\ude02\ud835\uddfc\ud835\ude02\ud835\ude00 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \n \nHere is where you want to take the data, code, and config artifacts and: \n \n\\- train the model on all the required data \n\\- output a staging versioned model artifact \n\\- test the staging model artifact \n\\- if the test passes, label it as the new production model artifact \n\\- deploy it to the inference services \n \nA common strategy is to build a CI/CD pipeline that (e.g., using GitHub\nActions): \n \n\\- builds a docker image from the code artifact (e.g., triggered manually or\nwhen a new artifact version is created) \n\\- start the training pipeline inside the docker container that pulls the\nfeature and config artifacts and outputs the staging model artifact \n\\- manually look over the training report -> If everything went fine, manually\ntrigger the testing pipeline \n\\- manually look over the testing report -> if everything worked fine (e.g.,\nthe model is better than the previous one), manually trigger the CD pipeline\nthat deploys the new model to your inference services \n \nNote how the model registry quickly helps you to decouple all the components. \n \nAlso, because training and testing metrics are not always black & white, it is\ntough to 100% automate the CI/CD pipeline. \n \nThus, you need a human in the loop when deploying ML models.\n\n. What is the difference between your ML development and continuous training\nenvironments [Image by the Author]\n\nTo conclude... 
\n \nThe ML development environment is where you do your research to find better\nmodels: \n\\- \ud835\ude2a\ud835\ude2f\ud835\ude31\ud835\ude36\ud835\ude35: data artifact \n\\- \ud835\ude30\ud835\ude36\ud835\ude35\ud835\ude31\ud835\ude36\ud835\ude35: code & config artifacts \n \nThe continuous training environment is used to train & test the production\nmodel at scale: \n\\- \ud835\ude2a\ud835\ude2f\ud835\ude31\ud835\ude36\ud835\ude35: data, code, config artifacts \n\\- \ud835\ude30\ud835\ude36\ud835\ude35\ud835\ude31\ud835\ude36\ud835\ude35: model artifact\n\n> This is not a fixed solution, as ML systems are still an open question.\n>\n> But if you want to see this strategy in action \u2193 \n> \n> \u21b3\ud83d\udd17 Check out my **The Full Stack 7-Steps MLOps Framework** FREE Course.\n\n* * *\n\n### Story: Job roles tell you there is just one type of MLE, but there are\nactually 3\n\nHere they are \u2193 \n \nThese are the 3 ML engineering personas I found while working with different\nteams in the industry: \n \n#\ud835\udfed. \ud835\udde5\ud835\uddf2\ud835\ude00\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff\ud835\ude00 \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\uddf0\ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff \n \nThey like to stay in touch with the latest papers, understand the architecture\nof models, optimize them, run experiments, etc. \n \nThey are great at picking the best models but not that great at writing clean\ncode and scaling the solution. \n \n#\ud835\udfee. \ud835\udde6\ud835\uddea\ud835\uddd8 \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\uddf0\ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff \n \nThey pretend they read papers but don't (maybe only when they have to). They\nare more concerned with writing modular code and data quality than the latest\nhot models. Usually, these are the \"data-centric\" people. \n \nThey are great at writing clean code & processing data at scale but lack deep\nmathematical skills to develop complex DL solutions. \n \n#\ud835\udfef. \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddf3\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf8\ud835\ude00 \n \nThey ultimately don't care about the latest research & hot models. They are\nmore into the latest MLOps tools and building ML systems. They love to\nautomate everything and use as many tools as possible. \n \nGreat at scaling the solution and building ML pipelines, but not great at\nrunning experiments & tweaking ML models. They love to treat the ML model as a\nblack box.\n\nImage by the Author.\n\nI started as #1. , until I realized I hated it - now I am a mix of: \n \n\u2192 #\ud835\udfed. 20% \n\u2192 #\ud835\udfee. 40% \n\u2192 #\ud835\udfef. 40% \n \nBut that doesn't mean one is better - these types are complementary. \n \nA great ML team should have at least one of each persona. \n \nWhat do you think? Did I get it right?\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. 
**Machine Learning & MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning & MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-what-is-the-difference-between?r=1ttoeh" }, { "id": "aa199018-9dcc-4768-9e99-1b2356af2c21", "content": { "Title": "DML: 7-steps to build a production-ready financial assistant using LLMs", "Subtitle": "How to fine-tune any LLM at scale in under 5 minutes. 7 steps to build a production-ready financial assistant using LLMs.", "Content": "# DML: 7-steps to build a production-ready financial assistant using LLMs\n\n### How to fine-tune any LLM at scale in under 5 minutes. 7 steps to build a\nproduction-ready financial assistant using LLMs.\n\nPaul Iusztin\n\nOct 12, 2023\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n**This week\u2019s ML & MLOps topics:**\n\n 1. Writing your own ML models is history. How to fine-tune any LLM at scale in under 5 minutes.\n\n 2. 7 steps to chain your prompts to build a production-ready financial assistant using LLMs.\n\n> **Extra:** 3 key resources on how to monitor your ML models\n\n* * *\n\n### #1. Writing your own ML models is history. How to fine-tune any LLM at\nscale in under 5 minutes.\n\nWriting your own ML models is history. \n \nThe true value is in your data, how you prepare it, and your compute power. \n \nTo demonstrate my point, here is how you can write a Python script to train your LLM at scale in under 5 minutes \u2193 \n \n#1. Load your data in JSON format and convert it into a Hugging Face Dataset. \n \n#2. Use Hugging Face to load the LLM and pass it to the SFTTrainer, along with the tokenizer and the training & evaluation datasets. \n \n#3. Wrap your training script with a serverless solution, such as Beam, which quickly lets you access a cluster of GPUs to train large models. 
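Here is a hedged sketch of steps #1 and #2 only (the Beam wrapping from step #3 is omitted). The dataset path, text field, and base checkpoint are placeholders, and SFTTrainer keyword arguments differ between trl versions, so treat this as the shape of the script rather than a drop-in implementation.

```python
# Sketch of steps #1-#2; assumes an older trl API where SFTTrainer accepts
# tokenizer/dataset_text_field/max_seq_length directly.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# 1. Load your JSON data and convert it into a Hugging Face Dataset
dataset = load_dataset("json", data_files="finance_qa.json", split="train")  # placeholder path

# 2. Load the LLM + tokenizer and pass them to the SFTTrainer
model_id = "tiiuae/falcon-7b"  # example checkpoint; the article fine-tunes Falcon
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # assumes each sample exposes a "text" field
    max_seq_length=1024,
    args=TrainingArguments(output_dir="./output", num_train_epochs=1,
                           per_device_train_batch_size=1),
)
trainer.train()
```

In practice, you would also pass a PEFT/QLoRA config to fit the model on a single GPU and run the whole function on a GPU cluster through the serverless provider, as step #3 describes.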
\n \n\ud83d\udea8 As you can see, the secret ingredients are not the LLM but: \n\\- the amount of data \n\\- the quality of data \n\\- how you process the data \n\\- $$$ for compute power \n\\- the ability to scale the system\n\n3-steps to write a Python script to train your LLMs at scale [Image by the\nAuthor].\n\n\ud83d\udca1 My advice \n \n\u21b3 If you don't plan to become an ML researcher, shift your focus from the\nlatest models to your data and infrastructure. \n \n. \n \n\ud835\udde1\ud835\uddfc\ud835\ude01\ud835\uddf2: Integrating serverless services, such as Beam, makes the deployment of\nyour training pipeline fast & seamless, leaving you to focus only on the last\npiece of the puzzle: your data.\n\n \n\u21b3\ud83d\udd17 Check out Beam's docs to find out more.\n\n* * *\n\n### #2. 7 steps to chain your prompts to build a production-ready financial\nassistant using LLMs.\n\n\ud835\udff3 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd\ud835\ude00 on how to \ud835\uddf0\ud835\uddf5\ud835\uddee\ud835\uddf6\ud835\uddfb your \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01\ud835\ude00 to build a production-ready \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9\n\ud835\uddee\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\ude01 using \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 \u2193 \n \nWhen building LLM applications, you frequently have to divide your application\ninto multiple steps & prompts, which are known as \"chaining prompts\". \n \nHere are 7 standard steps when building a financial assistant using LLMs (or\nany other assistant) \u2193 \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfed: Check if the user's question is safe using OpenAI's Moderation API \n \nIf the user's query is safe, move to \ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee \u2193 \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee: Query your proprietary data (e.g., financial news) to enrich the\nprompt with fresh data & additional context. \n \nTo do so, you have to: \n\\- use an LM to embed the user's input \n\\- use the embedding to query your proprietary data stored in a vector DB \n \n\ud835\ude15\ud835\ude30\ud835\ude35\ud835\ude26: You must use the same LM model to embed: \n\\- the data that will be stored in the vector DB \n\\- the user's question used to query the vector DB \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfef: Build the prompt using: \n\\- a predefined template \n\\- the user's question \n\\- extracted financial news as context \n\\- your conversation history as context \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff0: Call the LLM \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff1: Check if the assistant's answer is safe using the OpenAI's Moderation\nAPI. \n \nIf the assistant's answer is safe, move to \ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff1 \u2193 \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff2: Use an LLM to check if the final answer is satisfactory. \n \nTo do so, you build a prompt using the following: \n\\- a validation predefined template \n\\- the user's initial question \n\\- the assistants answer \n \nThe LLM has to give a \"yes\" or \"no\" answer. \n \nThus, if it answers \"yes,\" we show the final answer to the user. 
Otherwise, we\nwill return a predefined response, such as: \n\"Sorry, we couldn't answer your question because we don't have enough\ninformation.\" \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff3: Add the user's question and assistant's answer to a history cache.\nWhich will be used to enrich the following prompts with the current\nconversation. \n \nJust to remind you, the assistant should support a conversation. Thus, it\nneeds to know what happened in the previous questions. \n \n\u2192 In practice, you usually keep only the latest N (question, answer) tuples or\na conversation summary to keep your context length under control.\n\n7 Steps to Build a Production-Ready Financial Assistant Using LLMs [Image by\nthe Author]\n\n\u21b3 If you want to see this strategy in action, check out our new FREE Hands-on\nLLMs course (work in progress) & give it a \u2b50 on GitHub to stay updated with\nits latest progress.\n\n* * *\n\n### Extra: 3 key resources on how to monitor your ML models\n\nIn the last month, I read 100+ ML monitoring articles. \n \nI trimmed them for you to 3 key resources: \n \n1\\. A series of excellent articles made by Arize AI that will make you\nunderstand what ML monitoring is all about. \n \n\u21b3\ud83d\udd17 Arize Articles \n \n2\\. The Evidently AI Blog, where you can find answers to all your questions\nregarding ML monitoring. \n \n\u21b3\ud83d\udd17 Evidently Blog \n \n3\\. The monitoring hands-on examples hosted by DataTalksClub will teach you\nhow to implement an ML monitoring system. \n \n\u21b3\ud83d\udd17 DataTalks Course \n \nAfter wasting a lot of time reading other resources... \n \nUsing these 3 resources is a solid start for learning about monitoring ML\nsystems.\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n5\n\nShare this post\n\n#### DML: 7-steps to build a production-ready financial assistant using LLMs\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-7-steps-to-build-a-production?r=1ttoeh" }, { "id": "de3f1dc2-70e9-4621-825b-56dd9a8f99be", "content": { "Title": "DML: Chain of Thought Reasoning: Write robust & explainable prompts for your LLM", "Subtitle": "Everything you need to know about chaining prompts: increase your LLMs accuracy & debug and explain your LLM.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Chain of Thought Reasoning: Write robust & explainable prompts for\nyour LLM\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Chain of Thought Reasoning: Write robust & explainable prompts for your\nLLM\n\n### Everything you need to know about chaining prompts: increase your LLMs\naccuracy & debug and explain your LLM.\n\nPaul Iusztin\n\nOct 05, 2023\n\n1\n\nShare this post\n\n#### DML: Chain of Thought Reasoning: Write robust & explainable prompts for\nyour LLM\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n**This week\u2019s ML & MLOps topics:**\n\n 1. Chaining Prompts to Reduce Costs, Increase Accuracy & Easily Debug Your LLMs\n\n 2. Chain of Thought Reasoning: Write robust & explainable prompts for your LLM\n\n> **Extra:** Why**** any ML system should use an ML platform as its central\n> nervous system\n\n* * *\n\nBut first, I want to share with you this quick 7-minute guide teaching you how\nstable diffusion models are trained and generate new images. \n \nDiffusion models are the cornerstone of most modern computer vision generative\nAI applications. \n \nThus, if you are into generative AI, it is essential to have an intuition of\nhow a diffusion model works. \n \nCheck out my article to quickly understand: \n\\- the general picture of how diffusion models work \n\\- how diffusion models generate new images \n\\- how they are trained \n\\- how they are controlled by a given context (e.g., text) \n \n\u21b3\ud83d\udd17 Busy? This Is Your Quick Guide to Opening the Diffusion Models Black Box\n\n* * *\n\n### #1. Chaining Prompts to Reduce Costs, Increase Accuracy & Easily Debug\nYour LLMs\n\n> Here it is \u2193\n\n\ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01\ud835\ude00 is an intuitive technique that states that you must split\nyour prompts into multiple calls. \n \n\ud835\uddea\ud835\uddf5\ud835\ude06? \ud835\udddf\ud835\uddf2\ud835\ude01'\ud835\ude00 \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\ude00 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\ude00\ud835\uddfc\ud835\uddfa\ud835\uddf2 \ud835\uddee\ud835\uddfb\ud835\uddee\ud835\uddf9\ud835\uddfc\ud835\uddf4\ud835\uddf6\ud835\uddf2\ud835\ude00. \n \nWhen cooking, you are following a recipe split into multiple steps. You want\nto move to the next step only when you know what you have done so far is\ncorrect. \n \n\u21b3 You want every prompt to be simple & focused. 
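To make the analogy concrete before moving on, here is a minimal sketch of splitting one god prompt into two focused, chained calls, in the spirit of the customer-service example discussed below. The function names, prompt wording, and model choice are illustrative assumptions, not a definitive implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    """One focused LLM call: the basic building block of a prompt chain."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def answer_customer(query: str, product_catalog: dict[str, list[str]]) -> str:
    # Prompt 1: extract only the product category the user asks about.
    category = ask(
        f"Extract the single product category mentioned in: '{query}'. "
        "Reply with the category name only."
    )
    # Between the two prompts we control the state: enrich the context
    # with only the relevant products instead of the whole catalog.
    relevant_products = product_catalog.get(category, [])
    # Prompt 2: answer the inquiry using the small, focused context.
    return ask(
        f"Available products: {relevant_products}\n"
        f"Customer question: {query}\n"
        "Answer politely and concisely."
    )
```

Because each call does one thing, you can log, test, or swap any step without touching the rest of the chain.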
\n \nAnother analogy is between reading all the code in one monolith/god class and\nusing DRY to separate the logic between multiple modules. \n \n\u21b3 You want to understand & debug every prompt easily. \n \n. \n \nChaining prompts is a \ud835\uddfd\ud835\uddfc\ud835\ude04\ud835\uddf2\ud835\uddff\ud835\uddf3\ud835\ude02\ud835\uddf9 \ud835\ude01\ud835\uddfc\ud835\uddfc\ud835\uddf9 \ud835\uddf3\ud835\uddfc\ud835\uddff \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddee \ud835\ude00\ud835\ude01\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf3\ud835\ude02\ud835\uddf9 \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa where you\nmust take different actions depending on the current state. \n \nIn other words, you control what happens between 2 chained prompts. \n \n\ud835\ude09\ud835\ude3a\ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude25\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude34 \ud835\ude30\ud835\ude27 \ud835\ude24\ud835\ude29\ud835\ude22\ud835\ude2a\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude2e\ud835\ude31\ud835\ude35\ud835\ude34: \n \n\\- increase in accuracy \n\\- reduce the number of tokens -> lower costs (skips steps of the workflow\nwhen not needed) \n\\- avoid context limitations \n\\- easier to include a human-in-the-loop -> easier to control, moderate, test\n& debug \n\\- use external tools/plugins (web search, API, databases, calculator, etc.) \n \n. \n \n\ud835\uddd8\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2 \n \nYou want to build a virtual assistant to respond to customer service queries. \n \nInstead of adding in one single prompt the system message, all the available\nproducts, and the user inquiry, you can split it into the following: \n1\\. Use a prompt to extract the products and categories of interest. \n2\\. Enrich the context only with the products of interest. \n3\\. Call the LLM for the final answer. \n \nYou can evolve this example by adding another prompt that classifies the\nnature of the user inquiry. Based on that, redirect it to billing, technical\nsupport, account management, or a general LLM (similar to the complex system\nof GPT-4).\n\nChaining Prompts to Reduce Costs, Increase Accuracy & Easily Debug Your LLMs\n[Image by the Author].\n\n\ud835\udde7\ud835\uddfc \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\ude07\ud835\uddf2: \n \nInstead of writing a giant prompt that includes multiple steps: \n \nSplit the god prompt into multiple modular prompts that let you keep track of\nthe state externally and orchestrate the program. \n \nIn other words, you want modular prompts that you can combine easily (same as\nin writing standard functions/classes) \n \n. \n \nTo \ud835\uddee\ud835\ude03\ud835\uddfc\ud835\uddf6\ud835\uddf1 \ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4, use this technique when your prompt contains >=\ninstruction. \n \nYou can leverage the DRY principle from software -> one prompt = one\ninstruction. \n \n\u21b3\ud83d\udd17 Tools to chain prompts: LangChain \n\u21b3\ud83d\udd17 Tools to monitor and debug prompts: Comet LLMOps Tools\n\n* * *\n\n### #2. 
Chain of Thought Reasoning: Write robust & explainable prompts for\nyour LLM\n\n\ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\uddf6\ud835\uddfb \ud835\uddfc\ud835\uddf3 \ud835\udde7\ud835\uddf5\ud835\uddfc\ud835\ude02\ud835\uddf4\ud835\uddf5\ud835\ude01 \ud835\udde5\ud835\uddf2\ud835\uddee\ud835\ude00\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 is a \ud835\uddfd\ud835\uddfc\ud835\ude04\ud835\uddf2\ud835\uddff\ud835\uddf3\ud835\ude02\ud835\uddf9 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01 \ud835\uddf2\ud835\uddfb\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\uddf5\ud835\uddfb\ud835\uddf6\ud835\uddfe\ud835\ude02\ud835\uddf2 to\n\ud835\uddf6\ud835\uddfa\ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\ude03\ud835\uddf2 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\udddf\ud835\udddf\ud835\udde0'\ud835\ude00 \ud835\uddee\ud835\uddf0\ud835\uddf0\ud835\ude02\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\ude06 \ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\uddf2\ud835\ude05\ud835\uddfd\ud835\uddf9\ud835\uddee\ud835\uddf6\ud835\uddfb \ud835\uddf6\ud835\ude01\ud835\ude00 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff. \n\n> Let me explain \u2193\n\n \nIt is a method to force the LLM to follow a set of predefined steps. \n \n\ud83e\udde0 \ud835\uddea\ud835\uddf5\ud835\ude06 \ud835\uddf1\ud835\uddfc \ud835\ude04\ud835\uddf2 \ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddf1 \ud835\uddd6\ud835\uddf5\ud835\uddee\ud835\uddf6\ud835\uddfb \ud835\uddfc\ud835\uddf3 \ud835\udde7\ud835\uddf5\ud835\uddfc\ud835\ude02\ud835\uddf4\ud835\uddf5\ud835\ude01 \ud835\udde5\ud835\uddf2\ud835\uddee\ud835\ude00\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4? \n \nIn complex scenarios, the LLM must thoroughly reason about a problem before\nresponding to the question. \n \nOtherwise, the LLM might rush to an incorrect conclusion. \n \nBy forcing the model to follow a set of steps, we can guide the model to\n\"think\" more methodically about the problem. \n \nAlso, it helps us explain and debug how the model reached a specific answer. \n \n. \n \n\ud83d\udca1 \ud835\udddc\ud835\uddfb\ud835\uddfb\ud835\uddf2\ud835\uddff \ud835\udde0\ud835\uddfc\ud835\uddfb\ud835\uddfc\ud835\uddf9\ud835\uddfc\ud835\uddf4\ud835\ude02\ud835\uddf2 \n \nThe inner monologue is all the steps needed to reach the final answer. \n \nOften, we want to hide all the reasoning steps from the end user. \n \nIn fancy words, we want to mimic an \"inner monologue\" and output only the\nfinal answer. \n \nEach reasoning step is structured into a parsable format. \n \nThus, we can quickly load it into a data structure and output only the desired\nsteps to the user. \n \n. \n \n\u21b3 \ud835\udddf\ud835\uddf2\ud835\ude01'\ud835\ude00 \ud835\uddef\ud835\uddf2\ud835\ude01\ud835\ude01\ud835\uddf2\ud835\uddff \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\ude00 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\uddee\ud835\uddfb \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2: \n \nThe input prompt to the LLM consists of a system message + the user's\nquestion. \n \nThe secret is in defining the system message as follows: \n \n\"\"\" \nYou are a virtual assistant helping clients... 
\n \nFollow the next steps to answer the customer queries. \n \nStep 1: Decide if it is a question about a product ... \nStep 2: Retrieve the product ... \nStep 3: Extract user assumptions ... \nStep 4: Validate user assumptions ... \nStep 5: Answer politely ... \n \nMake sure to answer in the following format: \nStep 1: <\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31_1_\ud835\ude22\ud835\ude2f\ud835\ude34\ud835\ude38\ud835\ude26\ud835\ude33> \nStep 2: <\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31_2_\ud835\ude22\ud835\ude2f\ud835\ude34\ud835\ude38\ud835\ude26\ud835\ude33> \nStep 3: <\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31_3_\ud835\ude22\ud835\ude2f\ud835\ude34\ud835\ude38\ud835\ude26\ud835\ude33> \nStep 4: <\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude31_4_\ud835\ude22\ud835\ude2f\ud835\ude34\ud835\ude38\ud835\ude26\ud835\ude33> \n \nResponse to the user: <\ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude22\ud835\ude2d_\ud835\ude33\ud835\ude26\ud835\ude34\ud835\ude31\ud835\ude30\ud835\ude2f\ud835\ude34\ud835\ude26> \n\"\"\" \n \nEnforcing the LLM to follow a set of steps, we ensured it would answer the\nright questions. \n \nUltimately, we will show the user only the <\ud835\ude27\ud835\ude2a\ud835\ude2f\ud835\ude22\ud835\ude2d_\ud835\ude33\ud835\ude26\ud835\ude34\ud835\ude31\ud835\ude30\ud835\ude2f\ud835\ude34\ud835\ude26> subset of the\nanswer. \n \nThe other steps (aka \"inner monologue\") help: \n\\- the model to reason \n\\- the developer to debug \n \nHave you used this technique when writing prompts?\n\nChain of Thought Reasoning: Write robust & explainable prompts for your LLM\n[Image by the Author].\n\n* * *\n\n### Extra: Why**** any ML system should use an ML platform as its central\nnervous system\n\nAny ML system should use an ML platform as its central nervous system. \n \nHere is why \u2193 \n \nThe primary role of an ML Platform is to bring structure to your: \n\\- experiments \n\\- visualizations \n\\- models \n\\- datasets \n\\- documentation \n \nAlso, its role is to decouple your data preprocessing, experiment, training,\nand inference pipelines. \n \n. \n \nAn ML platform helps you automate everything mentioned above using these 6\nfeatures: \n \n1\\. experiment tracking: log & compare experiments \n2\\. metadata store: know how a model (aka experiment) was generated \n3\\. visualisations: a central hub for your visualizations \n4\\. reports: create documents out of your experiments \n5\\. artifacts: version & share your datasets \n6\\. model registry: version & share your models\n\nWhy**** any ML system should use an ML platform as its central nervous system\n[GIF by the Author].\n\nI have used many ML Platforms before, but lately, I started using Comet, and I\nlove it.\n\n\u21b3\ud83d\udd17 Comet ML \n \nWhat is your favorite ML Platform?\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. 
**Machine Learning & MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).

", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-chain-of-thought-reasoning-write?r=1ttoeh" }, { "id": "3d7e4ad6-60d2-4e20-bf42-e158930d168c", "content": { "Title": "DML: Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs", "Subtitle": "Stop Manually Creating Your ML AWS Infrastructure - use Terraform! Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs.", "Content": "

# DML: Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs

### Stop Manually Creating Your ML AWS Infrastructure - use Terraform! Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs.

Paul Iusztin

Sep 21, 2023

_Hello there, I am Paul Iusztin 👋🏼_

_Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥_

**This week's ML & MLOps topics:**

 1. Stop Manually Creating Your ML AWS Infrastructure. Use Terraform!

 2.
Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs.\n\n* * *\n\n> Before going into our subject of the day, I have some news to share with you\n> \ud83d\udc40\n\nIf you want to \ud835\uddfe\ud835\ude02\ud835\uddf6\ud835\uddf0\ud835\uddf8\ud835\uddf9\ud835\ude06 \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb in a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\ude04\ud835\uddee\ud835\ude06 how to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\uddf2\ud835\uddfb\ud835\uddf1-\ud835\ude01\ud835\uddfc-\ud835\uddf2\ud835\uddfb\ud835\uddf1 \ud835\udde0\ud835\udddf\n\ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 \ud835\ude02\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00, emphasizing \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9-\ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf9\ud835\uddf1 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2\ud835\ude00?\n\nI want to let you know that \u2193\n\nI am invited on \ud835\udde6\ud835\uddf2\ud835\uddfd\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\uddef\ud835\uddf2\ud835\uddff \ud835\udfee\ud835\udff4\ud835\ude01\ud835\uddf5 to a \ud835\ude04\ud835\uddf2\ud835\uddef\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddff to present an overview of the\n\ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 course I am creating.\n\nI will show you a \ud835\uddf5\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00-\ud835\uddfc\ud835\uddfb \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddfa\ud835\uddfd\ud835\uddf9\ud835\uddf2 of how to \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\uddee \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddef\ud835\uddfc\ud835\ude01 \ud835\ude02\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00.\nHere is what I will cover \u2193\n\n * creating your Q&A dataset in a semi-automated way (OpenAI GPT) \n\n * fine-tuning an LLM on your new dataset using QLoRA (HuggingFace, Peft, Comet ML, Beam)\n\n * build a streaming pipeline to ingest news in real time into a vector DB (Bytewax, Qdrant, AWS)\n\n * build a financial bot based on the fine-tuned model and real-time financial news (LangChain, Comet ML, Beam) \n\n * build a simple UI to interact with the financial bot \n\n\u2757No Notebooks or fragmented examples.\n\n\u2705 I want to show you how to build a real product.\n\n\u2192 More precisely, I will focus on the engineering and system design, showing\nyou how the components described above work together.\n\n.\n\nIf this is something you want to learn, be sure to register using the link\nbelow \u2193\n\n\u21b3\ud83d\udd17 Engineering an End-to-End ML System for a Financial Assistant Using LLMs\n(September 28th).\n\nSee you there \ud83d\udc40\n\n> Now back to business \ud83d\udd25\n\n* * *\n\n### #1. Stop Manually Creating Your ML AWS Infrastructure. Use Terraform!\n\nI was uselessly spending 1000$ dollars every month on cloud machines until I\nstarted using this tool \ud83d\udc47 \n \nTerraform! \n \n. 
\n \n\ud835\udc05\ud835\udc22\ud835\udc2b\ud835\udc2c\ud835\udc2d, \ud835\udc25\ud835\udc1e\ud835\udc2d'\ud835\udc2c \ud835\udc2e\ud835\udc27\ud835\udc1d\ud835\udc1e\ud835\udc2b\ud835\udc2c\ud835\udc2d\ud835\udc1a\ud835\udc27\ud835\udc1d \ud835\udc30\ud835\udc21\ud835\udc32 \ud835\udc30\ud835\udc1e \ud835\udc27\ud835\udc1e\ud835\udc1e\ud835\udc1d \ud835\udc13\ud835\udc1e\ud835\udc2b\ud835\udc2b\ud835\udc1a\ud835\udc1f\ud835\udc28\ud835\udc2b\ud835\udc26. \n \nWhen you want to deploy a software application, there are two main steps: \n1\\. Provisioning infrastructure \n2\\. Deploying applications \n \nA regular workflow would be that before deploying your applications or\nbuilding your CI/CD pipelines, you manually go and spin up your, let's say,\nAWS machines. \n \nInitially, this workflow should be just fine, but there are two scenarios when\nit could get problematic. \n \n#1. Your infrastructure gets too big and complicated. Thus, it is cumbersome\nand might yield bugs in manually replicating it. \n \n#2. In the world of AI, there are many cases when you want to spin up a GPU\nmachine to train your models, and afterward, you don't need it anymore. Thus,\nif you forget to close it, you will end up uselessly paying a lot of $$$. \n \nWith Terraform, you can solve both of these issues. \n \n. \n \nSo... \n \n\ud835\udc16\ud835\udc21\ud835\udc1a\ud835\udc2d \ud835\udc22\ud835\udc2c \ud835\udc13\ud835\udc1e\ud835\udc2b\ud835\udc2b\ud835\udc1a\ud835\udc1f\ud835\udc28\ud835\udc2b\ud835\udc26? \n \nIt sits on the provisioning infrastructure layer as a: \"infrastructure as\ncode\" tool that: \n \n\\- is declarative (you focus on the WHAT, not on the HOW) \n\\- automates and manages your infrastructure \n\\- is open source \n \nYeah... yeah... that sounds fancy. But \ud835\udc30\ud835\udc21\ud835\udc1a\ud835\udc2d \ud835\udc1c\ud835\udc1a\ud835\udc27 \ud835\udc08 \ud835\udc1d\ud835\udc28 \ud835\udc30\ud835\udc22\ud835\udc2d\ud835\udc21 \ud835\udc22\ud835\udc2d? \n \nLet's take AWS as an example, where you have to: \n\\- create a VPC \n\\- create AWS users and permissions \n\\- spin up EC2 machines \n\\- install programs (e.g., Docker) \n\\- create a K8s cluster \n \nUsing Terraform... \n \nYou can do all that just by providing a configuration file that reflects the\nstate of your infrastructure. \n \nBasically, it helps you create all the infrastructure you need\nprogrammatically. Isn't that awesome?\n\nTerraform [Image by the Author].\n\nIf you want to quickly understand Terraform enough to start using it in your\nown projects: \n \n\u21b3 check out my 7-minute read article: \ud83d\udd17 Stop Manually Creating Your AWS\nInfrastructure. Use Terraform!\n\n* * *\n\n### #2. Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs\n\n\ud835\ude13\ud835\ude13\ud835\ude14\ud835\ude34 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude22 \ud835\ude2d\ud835\ude30\ud835\ude35 \ud835\ude2e\ud835\ude30\ud835\ude33\ud835\ude26 \ud835\ude35\ud835\ude29\ud835\ude22\ud835\ude2f \ud835\ude24\ud835\ude29\ud835\ude22\ud835\ude35\ud835\ude23\ud835\ude30\ud835\ude35\ud835\ude34. 
\ud835\ude1b\ud835\ude29\ud835\ude26\ud835\ude34\ud835\ude26 \ud835\ude2e\ud835\ude30\ud835\ude25\ud835\ude26\ud835\ude2d\ud835\ude34 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude33\ud835\ude26\ud835\ude37\ud835\ude30\ud835\ude2d\ud835\ude36\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\ud835\ude2a\ud835\ude3b\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude29\ud835\ude30\ud835\ude38 \ud835\ude14\ud835\ude13\n\ud835\ude34\ud835\ude3a\ud835\ude34\ud835\ude35\ud835\ude26\ud835\ude2e\ud835\ude34 \ud835\ude22\ud835\ude33\ud835\ude26 \ud835\ude23\ud835\ude36\ud835\ude2a\ud835\ude2d\ud835\ude35. \n \n. \n \nUsing the standard approach when building an end-to-end ML application, you\nhad to: \n\\- get labeled data: 1 month \n\\- train the model: 2 months \n\\- serve de model: 3 months \n \nThese 3 steps might take ~6 months to implement. \n \nSo far, it worked great. \n \nBut here is the catch \u2193 \n \n. \n \n\ud835\ude20\ud835\ude30\ud835\ude36 \ud835\ude24\ud835\ude22\ud835\ude2f \ud835\ude33\ud835\ude26\ud835\ude22\ud835\ude24\ud835\ude29 \ud835\ude22\ud835\ude2d\ud835\ude2e\ud835\ude30\ud835\ude34\ud835\ude35 \ud835\ude35\ud835\ude29\ud835\ude26 \ud835\ude34\ud835\ude22\ud835\ude2e\ud835\ude26 \ud835\ude33\ud835\ude26\ud835\ude34\ud835\ude36\ud835\ude2d\ud835\ude35 \ud835\ude2a\ud835\ude2f \ud835\ude22 \ud835\ude27\ud835\ude26\ud835\ude38 \ud835\ude29\ud835\ude30\ud835\ude36\ud835\ude33\ud835\ude34 \ud835\ude30\ud835\ude33 \ud835\ude25\ud835\ude22\ud835\ude3a\ud835\ude34 \ud835\ude36\ud835\ude34\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude22 \ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude2e\ud835\ude31\ud835\ude35-\n\ud835\ude23\ud835\ude22\ud835\ude34\ud835\ude26\ud835\ude25 \ud835\ude2d\ud835\ude26\ud835\ude22\ud835\ude33\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude22\ud835\ude31\ud835\ude31\ud835\ude33\ud835\ude30\ud835\ude22\ud835\ude24\ud835\ude29. \n \nLet's take a classification task as an example \u2193 \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfed: You write a system prompt explaining the model and what types of\ninputs and outputs it will get. \n \n\" \nYou will be provided with customer service queries. \n \nClassify each query into the following categories: \n\\- Billing \n\\- Account Management \n\\- General Inquiry \n\" \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfee: You can give the model an example to make sure it understands the task\n(known as one-shot learning): \n \n\" \nUser: I want to know the price of the pro subscription plan. \nAssistant: Billing \n\" \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udfef: Attach the user prompt and create the input prompt, which now consists\nof the following: \n\\- system \n\\- example \n\\- user \n...prompts \n \n\ud835\udde6\ud835\ude01\ud835\uddf2\ud835\uddfd \ud835\udff0: Call the LLM's API... and boom, you built a classifier in under one\nhour. \n \nCool, right? \ud83d\udd25 \n \nUsing this approach, the only time-consuming step is to tweak the prompt until\nit reaches the desired result.\n\nHow to quickly build a classifier using LLMs [GIF by the Author].\n\nTo conclude... \n \nIn today's LLMs world, to build a classifier, you have to write: \n\\- a system prompt \n\\- an example \n\\- attach the user prompt \n\\- pass the input prompt to the LLM API\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. 
CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n6\n\nShare this post\n\n#### DML: Build & Serve a Production-Ready Classifier in 1 Hour Using LLMs\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-build-and-serve-a-production?r=1ttoeh" }, { "id": "49e2912f-313d-439d-8de6-522dc8379cb2", "content": { "Title": "DML: 4 key ideas you must know to train an LLM successfully", "Subtitle": "My time series forecasting Python code was a disaster until I started using this package. 4 key ideas you must know to train an LLM successfully.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: 4 key ideas you must know to train an LLM successfully\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: 4 key ideas you must know to train an LLM successfully\n\n### My time series forecasting Python code was a disaster until I started\nusing this package. 4 key ideas you must know to train an LLM successfully.\n\nPaul Iusztin\n\nSep 14, 2023\n\n3\n\nShare this post\n\n#### DML: 4 key ideas you must know to train an LLM successfully\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n**This week\u2019s ML & MLOps topics:**\n\n 1. My time series forecasting Python code was a disaster until I started using this package\n\n 2. 4 key ideas you must know to train an LLM successfully\n\n> **Extra** : My favorite ML & MLOps newsletter\n\n* * *\n\n### #1. My time series forecasting Python code was a disaster until I started\nusing this package\n\nDoes building time series models sound more complicated than modeling standard\ntabular datasets? \n \nWell... maybe it is... but that is precisely why you need to learn more about\n\ud835\ude00\ud835\uddf8\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2! \n \nWhen I first built forecasting models, I manually coded the required\npreprocessing and postprocessing steps. What a newbie I was... 
\n \nHow easy would my life have been if I had started from the beginning to use\n\ud835\ude00\ud835\uddf8\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2? \n \n. \n \n\ud835\udc16\ud835\udc21\ud835\udc1a\ud835\udc2d \ud835\udc22\ud835\udc2c \ud835\udc2c\ud835\udc24\ud835\udc2d\ud835\udc22\ud835\udc26\ud835\udc1e? \n \n\ud835\ude00\ud835\uddf8\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2 is a Python package that adds time-series functionality over well-known\npackages such as statsmodels, fbprophet, scikit-learn, autoarima, xgboost,\netc. \n \nThus, all of a sudden, all your beloved packages will support time series\nfeatures such as: \n\\- easily swap between different models (e.g., xgboost, lightgbm, decision\ntrees, etc.) \n\\- out-of-the-box windowing transformations & aggregations \n\\- functionality for multivariate, panel, and hierarchical learning \n\\- cross-validation adapted to time-series \n\\- cool visualizations \nand more...\n\nSktime example [Image by the Author].\n\n\u21b3 If you want to see \ud835\ude00\ud835\uddf8\ud835\ude01\ud835\uddf6\ud835\uddfa\ud835\uddf2 in action, check out my article: \ud83d\udd17 A Guide to\nBuilding Effective Training Pipelines for Maximum Results\n\n* * *\n\n### #2. 4 key ideas you must know to train an LLM successfully\n\nThese are 4 key ideas you must know to train an LLM successfully \n \n\ud83d\udcd6 \ud835\udddb\ud835\uddfc\ud835\ude04 \ud835\uddf6\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfa\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4? \n \nLLMs still leverage supervised learning. \n \nA standard NLP task is to build a classifier. \nFor example, you have a sequence of tokens as inputs and, as output, a set of\nclasses (e.g., negative and positive). \n \nWhen training an LLM for text generation, you have as input a sequence of\ntokens, and its task is to predict the next token: \n\\- Input: JavaScript is all you [...] \n\\- Output: Need \n \nThis is known as an autoregressive process. \n \n\u2694\ufe0f \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf1\ud835\ude00 != \ud835\ude01\ud835\uddfc\ud835\uddf8\ud835\uddf2\ud835\uddfb\ud835\ude00 \n \nTokens are created based on the frequency of sequences of characters. \n \nFor example: \n\\- In the sentence: \"Learning new things is fun!\" every work is a different\ntoken as each is frequently used. \n\\- In the sentence: \"Prompting is a ...\" the word 'prompting' is divided into\n3 tokens: 'prom', 'pt', and 'ing' \n \nThis is important because different LLMs have different limits for the input\nnumber of tokens.\n\nHow to train an LLM cheatsheet [Image by the Author].\n\n\ud83e\udde0 \ud835\udde7\ud835\ude06\ud835\uddfd\ud835\uddf2\ud835\ude00 \ud835\uddfc\ud835\uddf3 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 \n \nThere are 3 primary types of LLMs: \n\\- base LLM \n\\- instruction tuned LLM \n\\- RLHF tuned LLM \n \n\ud835\ude1a\ud835\ude35\ud835\ude26\ud835\ude31\ud835\ude34 \ud835\ude35\ud835\ude30 \ud835\ude28\ud835\ude26\ud835\ude35 \ud835\ude27\ud835\ude33\ud835\ude30\ud835\ude2e \ud835\ude22 \ud835\ude23\ud835\ude22\ud835\ude34\ud835\ude26 \ud835\ude35\ud835\ude30 \ud835\ude22\ud835\ude2f \ud835\ude2a\ud835\ude2f\ud835\ude34\ud835\ude35\ud835\ude33\ud835\ude36\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f-\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude26\ud835\ude25 \ud835\ude13\ud835\ude13\ud835\ude14: \n \n1\\. 
Train the Base LLM on a lot of data (trillions of tokens) - trained for\nmonths on massive GPU clusters \n \n2\\. Fine-tune the Base LLM on a Q&A dataset (millions of tokens) - trained for\nhours or days on modest-size computational resources \n \n3\\. [Optional] Fine-tune the LLM further on human ratings reflecting the\nquality of different LLM outputs, on criteria such as if the answer is\nhelpful, honest and harmless using RLHF. This will increase the probability of\ngenerating a more highly rated output. \n \n\ud83c\udfd7\ufe0f \ud835\udddb\ud835\uddfc\ud835\ude04 \ud835\ude01\ud835\uddfc \ud835\uddef\ud835\ude02\ud835\uddf6\ud835\uddf9\ud835\uddf1 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01 \ud835\ude01\ud835\uddfc \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddfc\ud835\uddfb \ud835\uddee \ud835\udde4&\ud835\uddd4 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01 \n \nThe most common approach consists of 4 steps: \n1\\. A system message that sets the general tone & behavior. \n2\\. The context that adds more information to help the model to answer\n(Optional). \n3\\. The user's question. \n4\\. The answer to the question. \n \nNote that you need to know the answer to the question during training. You can\nintuitively see it as your label.\n\n* * *\n\n### Extra: My favorite ML & MLOps newsletter\n\nDo you want to learn ML & MLOps from real-world experience? \n \nThen I suggest you join Pau Labarta Bajo's Real-World Machine Learning \nweekly newsletter, along with another 8k+ ML developers. \n \nPau Labarta Bajo inspired me to start my weekly newsletter and is a great\nteacher who makes learning seamless \u270c\n\n> \ud83d\udd17 **Real-World Machine Learning -**Every Saturday Morning\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 a.m. CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).\n\n3\n\nShare this post\n\n#### DML: 4 key ideas you must know to train an LLM successfully\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| Pau Labarta BajoReal-World Machine Learning Sep 14, 2023Liked by Paul\nIusztinThanks for the shout out Paul. I love the content you shareExpand full\ncommentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n1 more comment...\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. 
Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-4-key-ideas-you-must-know-to?r=1ttoeh" }, { "id": "0b152bfd-0a90-4220-a1b8-77709ecb06d0", "content": { "Title": "DML: How to add real-time monitoring & metrics to your ML System", "Subtitle": "How to easily add retry policies to your Python code. How to add real-time monitoring & metrics to your ML System.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: How to add real-time monitoring & metrics to your ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: How to add real-time monitoring & metrics to your ML System\n\n### How to easily add retry policies to your Python code. How to add real-time\nmonitoring & metrics to your ML System.\n\nPaul Iusztin\n\nSep 07, 2023\n\n6\n\nShare this post\n\n#### DML: How to add real-time monitoring & metrics to your ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\n _This week\u2019s ML & MLOps topics:_\n\n 1. How to add real-time monitoring & metrics to your ML System\n\n 2. How to easily add retry policies to your Python code\n\n _Storytime:_ How am I writing code in 2023? \ud835\udddc \ud835\uddf1\ud835\uddfc\ud835\uddfb'\ud835\ude01.\n\n* * *\n\n> But first, I have some big news to share with you \ud83c\udf89\n\n\u2014> Want to learn how to \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2-\ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0, build a \ud835\ude00\ud835\ude01\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddfa\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2, use a\n\ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5, build a \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddef\ud835\uddfc\ud835\ude01 and \ud835\uddf1\ud835\uddf2\ud835\uddfd\ud835\uddf9\ud835\uddfc\ud835\ude06 \ud835\uddf2\ud835\ude03\ud835\uddf2\ud835\uddff\ud835\ude06\ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\uddfb\ud835\uddf4 using a serverless\nsolution?\n\nThen you will enjoy looking at this new free course that me and\n\nPau Labarta Bajo\n\n(from the RWML newsletter) are cooking.\n\n \n\u21b3 The course will teach you how to build an end-to-end LLM solution. \n \nIt is structured into 4 modules \u2193 \n \n\ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf9\ud835\uddf2 \ud835\udfed: Learn how to generate a financial Q&A dataset in a semi-automated\nway using the OpenAI API. \n \n\ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf9\ud835\uddf2 \ud835\udfee: Fine-tune the LLM (e.g., Falcon, Llama2, etc.) using HuggingFace &\nPeft. Also, we will show you how to integrate an experiment tracker, model\nregistry, and monitor the prompts using Comet. 
\n \n\ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf9\ud835\uddf2 \ud835\udfef: Build a streaming pipeline using Bytewax that listens to financial\nnews through a web socket, cleans it, embeds it, and loads it to a vector\ndatabase using Qdrant. \n \n\ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf9\ud835\uddf2 \ud835\udff0: Wrap the fine-tuned model and vector DB into a financial bot using\nLangChain and deploy it under a RESTful API. \n \n\u2757\ufe0f But all of this is useless if it isn't deployed. \n \n\u2192 We will use Beam to deploy everything quickly - Beam is a serverless\nsolution that lets you focus on your problem and quickly serve all your ML\ncomponents. Say bye-bye to access policies and network configuration. \n \n\ud835\udde1\ud835\uddfc\ud835\ude01\ud835\uddf2: This is still a work in progress, but the first 3 modules are almost\ndone.\n\nArchitecture built during the **Hands-On LLMs Course** [GIF by the Author].\n\nCurious?\n\nThen, check out the repository and give it a \u2b50 \u2193\n\n\u21b3 \ud83d\udd17 Course GitHub Repository\n\n* * *\n\n### #1. How to add real-time monitoring & metrics to your ML System\n\nYour model is exposed to performance degradation after it is deployed to\nproduction. \n \nThat is why you need to monitor it constantly. \n \nThe most common way to monitor an ML model is to compute its metrics. \n \nBut for that, you need the ground truth. \n \n\ud835\udddc\ud835\uddfb \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf1\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb, \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf0\ud835\uddee\ud835\uddfb \ud835\uddee\ud835\ude02\ud835\ude01\ud835\uddfc\ud835\uddfa\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddf0\ud835\uddee\ud835\uddf9\ud835\uddf9\ud835\ude06 \ud835\uddee\ud835\uddf0\ud835\uddf0\ud835\uddf2\ud835\ude00\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5 \ud835\uddf6\ud835\uddfb \ud835\udfef \ud835\uddfa\ud835\uddee\ud835\uddf6\ud835\uddfb\n\ud835\ude00\ud835\uddf0\ud835\uddf2\ud835\uddfb\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\uddfc\ud835\ude00: \n1\\. near real-time: you can access it quite quickly \n2\\. delayed: you can access it after a considerable amount of time (e.g., one\nmonth) \n3\\. never: you have to label the data manually \n \n. \n \n\ud835\uddd9\ud835\uddfc\ud835\uddff \ud835\ude02\ud835\ude00\ud835\uddf2 \ud835\uddf0\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude00 \ud835\udfee. \ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\udfef. 
\ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf0\ud835\uddee\ud835\uddfb \ud835\uddfe\ud835\ude02\ud835\uddf6\ud835\uddf0\ud835\uddf8\ud835\uddf9\ud835\ude06 \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude02\ud835\ude01\ud835\uddf2 \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\uddfa\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\uddf6\ud835\uddfb\n\ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\uddfc\ud835\uddf9\ud835\uddf9\ud835\uddfc\ud835\ude04\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude04\ud835\uddee\ud835\ude06: \n \n\\- store the model predictions and GT as soon as they are available (these 2\nwill be out of sync -> you can't compute the metrics right away) \n \n\\- build a DAG (e.g., using Airflow) that extracts the predictions & GT\ncomputes the metrics in batch mode and loads them into another storage (e.g.,\nGCS) \n \n\\- use an orchestration tool to run the DAG in the following scenarios: \n1\\. scheduled: if the GT is available in near real-time (e.g., hourly), then\nit makes sense to run your monitoring pipeline based on the known frequency \n2\\. triggered: if the GT is delayed and you don't know when it may come up,\nthen you can implement a webhook to trigger your monitoring pipeline \n \n\\- attach a consumer to your storage to use and display the metrics (e.g.,\ntrigger alarms and display them in a dashboard)\n\nHow to add real-time monitoring & metrics to your ML system [Image by the\nAuthor].\n\nIf you want to see how to implement a near real-time monitoring pipeline using\nAirflow and GCS, check out my article \u2193\n\n\u21b3 \ud83d\udd17 Ensuring Trustworthy ML Systems With Data Validation and Real-Time\nMonitoring\n\n* * *\n\n### #2. How to easily add retry policies to your Python code\n\nOne strategy that makes the \ud835\uddf1\ud835\uddf6\ud835\uddf3\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\uddef\ud835\uddf2\ud835\ude01\ud835\ude04\ud835\uddf2\ud835\uddf2\ud835\uddfb \ud835\uddf4\ud835\uddfc\ud835\uddfc\ud835\uddf1 \ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2 \ud835\uddee\ud835\uddfb\ud835\uddf1 \ud835\uddf4\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\ude01 \ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2 is\nadding \ud835\uddff\ud835\uddf2\ud835\ude01\ud835\uddff\ud835\ude06 \ud835\uddfd\ud835\uddfc\ud835\uddf9\ud835\uddf6\ud835\uddf0\ud835\uddf6\ud835\uddf2\ud835\ude00. \n \nTo manually implement them can get tedious and complicated. \n \nRetry policies are a must when you: \n\\- make calls to an external API \n\\- read from a queue, etc. \n \n. \n \n\ud835\udde8\ud835\ude00\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\udde7\ud835\uddf2\ud835\uddfb\ud835\uddee\ud835\uddf0\ud835\uddf6\ud835\ude01\ud835\ude06 \ud835\udde3\ud835\ude06\ud835\ude01\ud835\uddf5\ud835\uddfc\ud835\uddfb \ud835\uddfd\ud835\uddee\ud835\uddf0\ud835\uddf8\ud835\uddee\ud835\uddf4\ud835\uddf2... 
\n \n\ud835\ude20\ud835\ude30\ud835\ude36 \ud835\ude24\ud835\ude22\ud835\ude2f \ud835\ude32\ud835\ude36\ud835\ude2a\ud835\ude24\ud835\ude2c\ud835\ude2d\ud835\ude3a \ud835\ude25\ud835\ude26\ud835\ude24\ud835\ude30\ud835\ude33\ud835\ude22\ud835\ude35\ud835\ude26 \ud835\ude3a\ud835\ude30\ud835\ude36\ud835\ude33 \ud835\ude27\ud835\ude36\ud835\ude2f\ud835\ude24\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f\ud835\ude34 \ud835\ude22\ud835\ude2f\ud835\ude25 \ud835\ude22\ud835\ude25\ud835\ude25 \ud835\ude24\ud835\ude36\ud835\ude34\ud835\ude35\ud835\ude30\ud835\ude2e\ud835\ude2a\ud835\ude3b\ud835\ude22\ud835\ude23\ud835\ude2d\ud835\ude26 \ud835\ude33\ud835\ude26\ud835\ude35\ud835\ude33\ud835\ude3a \ud835\ude31\ud835\ude30\ud835\ude2d\ud835\ude2a\ud835\ude24\ud835\ude2a\ud835\ude26\ud835\ude34,\n\ud835\ude34\ud835\ude36\ud835\ude24\ud835\ude29 \ud835\ude22\ud835\ude34: \n \n1\\. Add fixed and random wait times between multiple retries. \n \n2\\. Add a maximum number of attempts or computation time. \n \n3\\. Retry only when specific errors are thrown (or not thrown). \n \n... as you can see, you easily compose these policies between them. \n \nThe cherry on top is that you can access the statistics of the retries of a\nspecific function: \n\" \nprint(raise_my_exception.retry.statistics) \n\"\n\nExamples of the retry policies using tenacity [Image by the Author].\n\n\u21b3 \ud83d\udd17 tenacity repository\n\n* * *\n\n### _Storytime:_ How am I writing code in 2023? I don\u2019t\n\nAs an engineer, you are paid to think and solve problems. How you do that, it\ndoesn't matter. Let me explain \u2193 \n \n. \n \nThe truth is that I am lazy. \n \nThat is why I am a good engineer. \n \nWith the rise of LLMs, my laziness hit all times highs. \n \n. \n \n\ud835\udde7\ud835\uddf5\ud835\ude02\ud835\ude00, \ud835\ude01\ud835\uddf5\ud835\uddf6\ud835\ude00 \ud835\uddf6\ud835\ude00 \ud835\uddf5\ud835\uddfc\ud835\ude04 \ud835\udddc \ud835\ude04\ud835\uddff\ud835\uddf6\ud835\ude01\ud835\uddf2 \ud835\uddfa\ud835\ude06 \ud835\uddf0\ud835\uddfc\ud835\uddf1\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2\ud835\ude00\ud835\uddf2 \ud835\uddf1\ud835\uddee\ud835\ude06\ud835\ude00 \u2193 \n \n\\- 50% Copilot (tab is the new CTRL-C + CTRL-V) \n\\- 30% ChatGPT/Bard \n\\- 10% Stackoverflow (call me insane, but I still use StackOverflow from time\nto time) \n\\- 10% Writing my own code \n \nThe thing is that I am more productive than ever. \n \n... and that 10% of \"writing my own code\" is the final step that connects all\nthe dots and brings real value to the table. \n \n. \n \n\ud835\udddc\ud835\uddfb \ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude01\ud835\ude06, \ud835\uddee\ud835\ude00 \ud835\uddee\ud835\uddfb \ud835\uddf2\ud835\uddfb\ud835\uddf4\ud835\uddf6\ud835\uddfb\ud835\uddf2\ud835\uddf2\ud835\uddff, \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddfa\ud835\uddfc\ud835\ude00\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\ude01\ud835\uddfc: \n \n\\- ask the right questions \n\\- understand & improve the architecture of the system \n\\- debug code \n\\- understand business requirements \n\\- communicate with other teams \n \n...not to write code.\n\n[Image by the Author]\n\nWriting code as we know it most probably will disappear with the rise of AI\n(it kind of already did). \n \n. \n \nWhat do you think? 
How do you write code these days?\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 am CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: here, I approach in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where I will constantly aggregate all my work (courses, articles, webinars, podcasts, etc.).\n\n6\n\nShare this post\n\n#### DML: How to add real-time monitoring & metrics to your ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-add-real-time-monitoring?r=1ttoeh" }, { "id": "a520fdac-65b4-4340-9ee2-d16a1390b838", "content": { "Title": "DML: Top 6 ML Platform Features You Must Know to Build an ML System", "Subtitle": "Why serving an ML model using a batch architecture is so powerful? Top 6 ML platform features you must know.", "Content": "#\n\nSubscribeSign in\n\nShare this post\n\n#### DML: Top 6 ML Platform Features You Must Know to Build an ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n# DML: Top 6 ML Platform Features You Must Know to Build an ML System\n\n### Why serving an ML model using a batch architecture is so powerful? Top 6\nML platform features you must know.\n\nPaul Iusztin\n\nAug 31, 2023\n\n3\n\nShare this post\n\n#### DML: Top 6 ML Platform Features You Must Know to Build an ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\n _Hello there, I am Paul Iusztin \ud83d\udc4b\ud83c\udffc_\n\n _Within this newsletter, I will help you decode complex topics about ML &\nMLOps one week at a time \ud83d\udd25_\n\nThis week we will cover:\n\n 1. Top 6 ML platform features you must know to build an ML system\n\n 2. Why serving an ML model using a batch architecture is so powerful?\n\n_Story:_ \u201cI never forget anything\u201d - said no one but your second brain.\n\n* * *\n\nThis week, no shameless promotion \ud83d\udc40\n\n* * *\n\n### #1. Top 6 ML platform features you must know to build an ML system\n\nHere they are \u2193 \n \n#\ud835\udfed. \ud835\uddd8\ud835\ude05\ud835\uddfd\ud835\uddf2\ud835\uddff\ud835\uddf6\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf0\ud835\uddf8\ud835\uddf6\ud835\uddfb\ud835\uddf4 \n \nIn your ML development phase, you generate lots of experiments. 
\n \nTracking and comparing the metrics between them is crucial in finding the\noptimal model. \n \n#\ud835\udfee. \ud835\udde0\ud835\uddf2\ud835\ude01\ud835\uddee\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee \ud835\udde6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf2 \n \nIts primary purpose is reproducibility. \n \nTo know how a model was generated, you need to know: \n\\- the version of the code \n\\- the version of the packages \n\\- hyperparameters/config \n\\- total compute \n\\- version of the dataset \n... and more \n \n#\ud835\udfef. \ud835\udde9\ud835\uddf6\ud835\ude00\ud835\ude02\ud835\uddee\ud835\uddf9\ud835\uddf6\ud835\ude00\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 \n \nMost of the time, along with the metrics, you must log a set of visualizations\nfor your experiment. \n \nSuch as: \n\\- images \n\\- videos \n\\- prompts \n\\- t-SNE graphs \n\\- 3D point clouds \n... and more \n \n#\ud835\udff0. \ud835\udde5\ud835\uddf2\ud835\uddfd\ud835\uddfc\ud835\uddff\ud835\ude01\ud835\ude00 \n \nYou don't work in a vacuum. \n \nYou have to present your work to other colleges or clients. \n \nA report lets you take the metadata and visualizations from your experiment... \n \n...and create, deliver and share a targeted presentation for your clients or\npeers. \n \n#\ud835\udff1. \ud835\uddd4\ud835\uddff\ud835\ude01\ud835\uddf6\ud835\uddf3\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\ude00 \n \nThe most powerful feature out of them all. \n \nAn artifact is a versioned object that is an input or output for your task. \n \nEverything can be an artifact, but the most common cases are: \n\\- data \n\\- model \n\\- code \n \nWrapping your assets around an artifact ensures reproducibility. \n \nFor example, you wrap your features into an artifact (e.g., features:3.1.2),\nwhich you can consume into your ML development step. \n \nThe ML development step will generate config (e.g., config:1.2.4) and code\n(e.g., code:1.0.2) artifacts used in the continuous training pipeline. \n \nDoing so lets you quickly respond to questions such as \"What I used to\ngenerate the model?\" and \"What Version?\" \n \n#\ud835\udff2. \ud835\udde0\ud835\uddfc\ud835\uddf1\ud835\uddf2\ud835\uddf9 \ud835\udde5\ud835\uddf2\ud835\uddf4\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddff\ud835\ude06 \n \nThe model registry is the ultimate way to make your model accessible to your\nproduction ecosystem. \n \nFor example, in your continuous training pipeline, after the model is trained,\nyou load the weights as an artifact into the model registry (e.g.,\nmodel:1.2.4). \n \nYou label this model as \"staging\" under a new version and prepare it for\ntesting. If the tests pass, mark it as \"production\" under a new version and\nprepare it for deployment (e.g., model:2.1.5).\n\nTop 6 ML platform features you must know [Image by the Author].\n\n. \n \nAll of these features are used in a mature ML system. What is your favorite\none? \n \n\u21b3 You can see all these features in action in my: \ud83d\udd17 **The Full Stack 7-Steps\nMLOps Framework** FREE course.\n\n* * *\n\n### #2. Why serving an ML model using a batch architecture is so powerful?\n\nWhen you first start deploying your ML model, you want an initial end-to-end\nflow as fast as possible. \n \nDoing so lets you quickly provide value, get feedback, and even collect data. \n \n. \n \nBut here is the catch... 
\n \nSuccessfully serving an ML model is tricky as you need many iterations to\noptimize your model to work in real-time: \n\\- low latency \n\\- high throughput \n \nInitially, serving your model in batch mode is like a hack. \n \nBy storing the model's predictions in dedicated storage, you automatically\nmove your model from offline mode to a real-time online model. \n \nThus, you no longer have to care for your model's latency and throughput. The\nconsumer will directly load the predictions from the given storage. \n \n\ud835\udc13\ud835\udc21\ud835\udc1e\ud835\udc2c\ud835\udc1e \ud835\udc1a\ud835\udc2b\ud835\udc1e \ud835\udc2d\ud835\udc21\ud835\udc1e \ud835\udc26\ud835\udc1a\ud835\udc22\ud835\udc27 \ud835\udc2c\ud835\udc2d\ud835\udc1e\ud835\udc29\ud835\udc2c \ud835\udc28\ud835\udc1f \ud835\udc1a \ud835\udc1b\ud835\udc1a\ud835\udc2d\ud835\udc1c\ud835\udc21 \ud835\udc1a\ud835\udc2b\ud835\udc1c\ud835\udc21\ud835\udc22\ud835\udc2d\ud835\udc1e\ud835\udc1c\ud835\udc2d\ud835\udc2e\ud835\udc2b\ud835\udc1e: \n\\- extracts raw data from a real data source \n\\- clean, validate, and aggregate the raw data within a feature pipeline \n\\- load the cleaned data into a feature store \n\\- experiment to find the best model + transformations using the data from the\nfeature store \n\\- upload the best model from the training pipeline into the model registry \n\\- inside a batch prediction pipeline, use the best model from the model\nregistry to compute the predictions \n\\- store the predictions in some storage \n\\- the consumer will download the predictions from the storage \n\\- repeat the whole process hourly, daily, weekly, etc. (it depends on your\ncontext) \n. \n \n\ud835\ude1b\ud835\ude29\ud835\ude26 \ud835\ude2e\ud835\ude22\ud835\ude2a\ud835\ude2f \ud835\ude25\ud835\ude30\ud835\ude38\ud835\ude2f\ud835\ude34\ud835\ude2a\ud835\ude25\ud835\ude26 of deploying your model in batch mode is that the\npredictions will have a level of lag. \n \nFor example, in a recommender system, if you make your predictions daily, it\nwon't capture a user's behavior in real-time, and it will update the\npredictions only at the end of the day. \n \nMoving to other architectures, such as request-response or streaming, will be\nnatural after your system matures in batch mode.\n\nML Batch Architecture Design [Image by the Author].\n\nSo remember, when you initially deploy your model, using a batch mode\narchitecture will be your best shot for a good user experience.\n\n* * *\n\n### _Story:_ \u201cI never forget anything\u201d - said no one but your second brain.\n\nAfter 6+ months of refinement, this is my second brain strategy \ud83d\udc47 \n \nTiago's Forte book inspired me, but I adapted his system to my needs. \n \n. \n \n#\ud835\udfec. \ud835\uddd6\ud835\uddfc\ud835\uddf9\ud835\uddf9\ud835\uddf2\ud835\uddf0\ud835\ude01 \n \nThis is where you are bombarded with information from all over the place. \n \n#\ud835\udfed. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddda\ud835\uddff\ud835\uddee\ud835\ude03\ud835\uddf2\ud835\ude06\ud835\uddee\ud835\uddff\ud835\uddf1 \n \nThis is where I save everything that looks interesting. \n \nI won't use 90% of what is here, but it satisfied my urge to save that \"cool\narticle\" I saw on LinkedIn. \n \nTools: Mostly Browser Bookmarks, but I rarely use GitHub stars, Medium lists,\netc. \n \n#\ud835\udfee. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddd5\ud835\uddfc\ud835\uddee\ud835\uddff\ud835\uddf1 \n \nHere, I start converging the information and planning what to do next. 
\n \nTools: Notion \n \n#\ud835\udfef. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddd9\ud835\uddf6\ud835\uddf2\ud835\uddf9\ud835\uddf1 \n \nHere is where I express myself through learning, coding, writing, etc. \n \nTools: whatever you need to express yourself. \n \n2 & 3 are iterative processes. Thus I often bounce between them until the\ninformation is distilled. \n \n#\ud835\udff0. \ud835\udde7\ud835\uddf5\ud835\uddf2 \ud835\uddea\ud835\uddee\ud835\uddff\ud835\uddf2\ud835\uddf5\ud835\uddfc\ud835\ude02\ud835\ude00\ud835\uddf2 \n \nHere is where I take the distilled information and write it down for cold\nstorage. \n \nTools: Notion, Google Drive \n \n. \n \nWhen I want to search for a piece of information, I start from the Warehouse\nand go backward until I find what I need. \n \nAs a minimalist, I kept my tools to a minimum. I primarily use only: Brave,\nNotion, and Google Drive. \n \nYou don't need 100+ tools to be productive. They just want to take your money\nfrom you.\n\nMy second brain strategy [Image by the Author].\n\nSo remember... \n \nYou have to: \n\\- collect \n\\- link \n\\- plan \n\\- distill \n\\- store\n\n* * *\n\nThat\u2019s it for today \ud83d\udc7e\n\nSee you next Thursday at 9:00 am CET.\n\nHave a fantastic weekend!\n\nPaul\n\n* * *\n\n#### Whenever you\u2019re ready, here is how I can help you:\n\n 1. **The Full Stack 7-Steps MLOps Framework :** a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.\n\n 2. **Machine Learning& MLOps Blog**: here, I approach in-depth topics about designing and productionizing ML systems using MLOps.\n\n 3. **Machine Learning& MLOps Hub**: a place where I will constantly aggregate all my work (courses, articles, webinars, podcasts, etc.),\n\n3\n\nShare this post\n\n#### DML: Top 6 ML Platform Features You Must Know to Build an ML System\n\ndecodingml.substack.com\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\n2\n\nShare\n\nPreviousNext\n\n#### Discussion about this post\n\nComments\n\nRestacks\n\n| Ahmed BesbesThe Tech Buffet Aug 31, 2023Liked by Paul IusztinHello Paul!\nGreat newsletter. It'd be even more useful to suggest tools for each of these\nfeatures (e.g. the model registry, the feature store, etc)Expand full\ncommentReplyShare \n---|--- \n \n1 reply by Paul Iusztin\n\n1 more comment...\n\nTop\n\nLatest\n\nDiscussions\n\nNo posts\n\nReady for more?\n\nSubscribe\n\n\u00a9 2024 Paul Iusztin\n\nPrivacy \u2219 Terms \u2219 Collection notice\n\nStart WritingGet the app\n\nSubstack is the home for great culture\n\nShare\n\nCopy link\n\nFacebook\n\nEmail\n\nNote\n\nOther\n\nThis site requires JavaScript to run correctly. Please turn on JavaScript or\nunblock scripts\n\n", "language": "en" }, "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-top-6-ml-platform-features-you?r=1ttoeh" } ] }