{ "artifact_data": [ { "id": "a964f3ac-e92f-4fcb-847a-a46da3d697d9", "content": "Maxime Labonne Fine tune Llama 3.1 Ultra Efficiently with Unsloth Maxime Labonne __LLM Course __Hands On GNNs __Research __About __ __ __ __ 1. LLM Post training 2. Fine tune Llama 3.1 8B 1. LLM Post training 2. Fine tune Llama 3.1 8B Fine tune Llama 3.1 Ultra Efficiently with Unsloth A beginner s guide to state of the art supervised fine tuning Large Language Models Author Maxime Lbonne Published July 29, 2024 LLM Post training __ Fine tune Llama 2 in Colab Fine tune Llama 2 in Axolotl Fine tune Mistral 7b with DPO Fine tune Llama 3 with ORPO Fine tune Llama 3.1 8B Merge LLMs with mergekit Create Mixture of Experts Uncensor any LLM LLM Quantization __ Intro to Quantization Quantization with GPTQ Quantization with GGML Quantization with ExLlamaV2 LLM stuff __ ChatGPT KG Decoding Strategies Agentic data generation Graph neural networks __ Graph Convolution Network Graph Attention Network GraphSAGE Graph Isomorphism Network Linear programming __ Linear Programming Integer Programming Constraint Programming Nonlinear Programming Miscellaneous __ Q learning Minecraft Bot Loops in Pandas What is a Tensor Sections Supervised Fine Tuning SFT Techniques Fine Tune Llama 3.1 8B Conclusion Pre order the LLM Engineer s Handbook , my new book to master the art of LLMs from concept to production The recent release of Llama 3.1 offers models with an incredible level of performance, closing the gap between closed source and open weight models. Instead of using frozen, general purpose LLMs like GPT 4o and Claude 3.5, you can fine tune Llama 3.1 for your specific use cases to achieve better performance and customizability at a lower cost. In this article, we will provide a comprehensive overview of supervised fine tuning. We will compare it to prompt engineering to understand when it makes sense to use it, detail the main techniques with their pros and cons, and introduce major concepts, such as LoRA hyperparameters, storage formats, and chat templates. Finally, we will implement it in practice by fine tuning Llama 3.1 8B in Google Colab with state of the art optimization using Unsloth. All the code used in this article is available on Google Colab and in the LLM Course. Special thanks to Daniel Han for answering my questions. Supervised Fine Tuning Supervised Fine Tuning SFT is a method to improve and customize pre trained LLMs. It involves retraining base models on a smaller dataset of instructions and answers. The main goal is to transform a basic model that predicts text into an assistant that can follow instructions and answer questions. SFT can also enhance the model s overall performance, add new knowledge, or adapt it to specific tasks and domains. Fine tuned models can then go through an optional preference alignment stage see my article about DPO to remove unwanted responses, modify their style, and more. The following figure shows an instruction sample. It includes a system prompt to steer the model, a user prompt to provide a task, and the output the model is expected to generate. You can find a list of high quality open source instruction datasets in the LLM Datasets GitHub repo. Before considering SFT, I recommend trying prompt engineering techniques like few shot prompting or retrieval augmented generation RAG . In practice, these methods can solve many problems without the need for fine tuning, using either closed source or open weight models e.g., Llama 3.1 Instruct . 
If this approach doesn't meet your objectives (in terms of quality, cost, latency, etc.), then SFT becomes a viable option when instruction data is available. Note that SFT also offers benefits like additional control and customizability to create personalized LLMs. However, SFT has limitations. It works best when leveraging knowledge already present in the base model. Learning completely new information, like an unknown language, can be challenging and lead to more frequent hallucinations. For new domains unknown to the base model, it is recommended to continuously pre-train it on a raw dataset first. On the opposite end of the spectrum, instruct models (i.e., already fine-tuned models) can already be very close to your requirements. For example, a model might perform very well but state that it was trained by OpenAI or Meta instead of you. In this case, you might want to slightly steer the instruct model's behavior using preference alignment. By providing chosen and rejected samples for a small set of instructions (between 100 and 1,000 samples), you can force the LLM to say that you trained it instead of OpenAI. SFT Techniques. The three most popular SFT techniques are full fine-tuning, LoRA, and QLoRA. Full fine-tuning is the most straightforward SFT technique. It involves retraining all parameters of a pre-trained model on an instruction dataset. This method often provides the best results but requires significant computational resources: several high-end GPUs are required to fine-tune an 8B model. Because it modifies the entire model, it is also the most destructive method and can lead to catastrophic forgetting of previous skills and knowledge. Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning technique. Instead of retraining the entire model, it freezes the weights and introduces small adapters (low-rank matrices) at each targeted layer. This allows LoRA to train a number of parameters that is drastically lower than full fine-tuning (less than 1%), reducing both memory usage and training time. This method is non-destructive since the original parameters are frozen, and adapters can then be switched or combined at will. QLoRA (Quantization-aware Low-Rank Adaptation) is an extension of LoRA that offers even greater memory savings. It provides up to 33% additional memory reduction compared to standard LoRA, making it particularly useful when GPU memory is constrained. This increased efficiency comes at the cost of longer training times, with QLoRA typically taking about 39% more time to train than regular LoRA. While QLoRA requires more training time, its substantial memory savings can make it the only viable option in scenarios where GPU memory is limited. For this reason, this is the technique we will use in the next section to fine-tune a Llama 3.1 8B model on Google Colab. Fine-Tune Llama 3.1 8B. To efficiently fine-tune a Llama 3.1 8B model, we'll use the Unsloth library by Daniel and Michael Han. Thanks to its custom kernels, Unsloth provides 2x faster training and 60% lower memory use compared to other options, making it ideal in a constrained environment like Colab. Unfortunately, Unsloth only supports single-GPU settings at the moment. For multi-GPU settings, I recommend popular alternatives like TRL and Axolotl (both also include Unsloth as a backend). In this example, we will QLoRA fine-tune it on the mlabonne/FineTome-100k dataset. It's a subset of arcee-ai/The-Tome (without arcee-ai/qwen2-72b-magpie-en) that I re-filtered using HuggingFaceFW/fineweb-edu-classifier.
Note that this classifier wasn't designed for instruction data quality evaluation, but we can use it as a rough proxy. The resulting FineTome is an ultra-high-quality dataset that includes conversations, reasoning problems, function calling, and more. Let's start by installing all the required libraries: !pip install unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git followed by !pip install --no-deps xformers 0.0.27 trl 0.9.0 peft accelerate bitsandbytes. Once installed, we can import them as follows: import torch; from trl import SFTTrainer; from datasets import load_dataset; from transformers import TrainingArguments, TextStreamer; from unsloth.chat_templates import get_chat_template; from unsloth import FastLanguageModel, is_bfloat16_supported. Let's now load the model. Since we want to use QLoRA, I chose the pre-quantized unsloth/Meta-Llama-3.1-8B-bnb-4bit. This 4-bit precision version of meta-llama/Meta-Llama-3.1-8B is significantly smaller (5.4 GB) and faster to download compared to the original 16-bit precision model (16 GB). We load in NF4 format using the bitsandbytes library. When loading the model, we must specify a maximum sequence length, which restricts its context window. Llama 3.1 supports up to 128k context length, but we will set it to 2,048 in this example since longer contexts consume more compute and VRAM. Finally, the dtype parameter automatically detects if your GPU supports the BF16 format for more stability during training (this feature is restricted to Ampere and more recent GPUs). max_seq_length = 2048; model, tokenizer = FastLanguageModel.from_pretrained(model_name='unsloth/Meta-Llama-3.1-8B-bnb-4bit', max_seq_length=max_seq_length, load_in_4bit=True, dtype=None). Now that our model is loaded in 4-bit precision, we want to prepare it for parameter-efficient fine-tuning with LoRA adapters. LoRA has three important parameters: Rank (r), which determines the LoRA matrix size. Rank typically starts at 8 but can go up to 256. Higher ranks can store more information but increase the computational and memory cost of LoRA. We set it to 16 here. Alpha (\u03b1), a scaling factor for updates. Alpha directly impacts the adapters' contribution and is often set to 1x or 2x the rank value. Target modules: LoRA can be applied to various model components, including attention mechanisms (Q, K, V matrices), output projections, feed-forward blocks, and linear output layers. While initially focused on attention mechanisms, extending LoRA to other components has shown benefits. However, adapting more modules increases the number of trainable parameters and memory needs. Here, we set r=16, \u03b1=16, and target every linear module to maximize quality. We don't use dropout or biases, for faster training. In addition, we will use Rank-Stabilized LoRA (rsLoRA), which modifies the scaling factor of LoRA adapters to be proportional to 1/sqrt(r) instead of 1/r. This stabilizes learning (especially for higher adapter ranks) and allows for improved fine-tuning performance as rank increases. Gradient checkpointing is handled by Unsloth to offload input and output embeddings to disk and save VRAM. model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=16, lora_dropout=0, target_modules=['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj', 'o_proj', 'gate_proj'], use_rslora=True, use_gradient_checkpointing='unsloth'). With this LoRA configuration, we'll only train 42 million out of 8 billion parameters (0.5196%). This shows how much more efficient LoRA is compared to full fine-tuning. Let's now load and prepare our dataset.
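Before moving on to the data, you can double-check the trainable-parameter figure yourself by counting the parameters that require gradients on the PEFT model returned above. This is a minimal sketch that assumes the model object from the previous step; exact totals may vary slightly across Unsloth and PEFT versions.

```python
# Count trainable vs. total parameters of the LoRA-wrapped model created above
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print(f'Trainable parameters: {trainable_params:,}')
print(f'Total parameters:     {total_params:,}')
print(f'Trainable ratio:      {100 * trainable_params / total_params:.4f}%')
```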
Instruction datasets are stored in a particular format it can be Alpaca, ShareGPT, OpenAI, etc. First, we want to parse this format to retrieve our instructions and answers. Our mlabonne FineTome 100k dataset uses the ShareGPT format with a unique conversations column containing messages in JSONL. Unlike simpler formats like Alpaca, ShareGPT is ideal for storing multi turn conversations, which is closer to how users interact with LLMs. Once our instruction answer pairs are parsed, we want to reformat them to follow a chat template . Chat templates are a way to structure conversations between users and models. They typically include special tokens to identify the beginning and the end of a message, who s speaking, etc. Base models don t have chat templates so we can choose any ChatML, Llama3, Mistral, etc. In the open source community, the ChatML template originally from OpenAI is a popular option. It simply adds two special tokens im_start and im_end to indicate who s speaking. If we apply this template to the previous instruction sample, here s what we get im_start system You are a helpful assistant, who always provide explanation. Think like you are answering to a five year old. im_end im_start user Remove the spaces from the following sentence It prevents users to suspect that there are some hidden products installed on theirs device. im_end im_start assistant Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice. im_end In the following code block, we parse our ShareGPT dataset with the mapping parameter and include the ChatML template. We then load and process the entire dataset to apply the chat template to every conversation. tokenizer get_chat_template tokenizer, mapping role from , content value , user human , assistant gpt , chat_template chatml , def apply_template examples messages examples conversations text tokenizer.apply_chat_template message, tokenize False, add_generation_prompt False for message in messages return text text dataset load_dataset mlabonne FineTome 100k , split train dataset dataset.map apply_template, batched True __ We re now ready to specify the training parameters for our run. I want to briefly introduce the most important hyperparameters Learning rate It controls how strongly the model updates its parameters. Too low, and training will be slow and may get stuck in local minima. Too high, and training may become unstable or diverge, which degrades performance. LR scheduler It adjusts the learning rate LR during training, starting with a higher LR for rapid initial progress and then decreasing it in later stages. Linear and cosine schedulers are the two most common options. Batch size Number of samples processed before the weights are updated. Larger batch sizes generally lead to more stable gradient estimates and can improve training speed, but they also require more memory. Gradient accumulation allows for effectively larger batch sizes by accumulating gradients over multiple forward backward passes before updating the model. Num epochs The number of complete passes through the training dataset. More epochs allow the model to see the data more times, potentially leading to better performance. However, too many epochs can cause overfitting. Optimizer Algorithm used to adjust the parameters of a model to minimize the loss function. In practice, AdamW 8 bit is strongly recommended it performs as well as the 32 bit version while using less GPU memory. The paged version of AdamW is only interesting in distributed settings. 
Weight decay A regularization technique that adds a penalty for large weights to the loss function. It helps prevent overfitting by encouraging the model to learn simpler, more generalizable features. However, too much weight decay can impede learning. Warmup steps A period at the beginning of training where the learning rate is gradually increased from a small value to the initial learning rate. Warmup can help stabilize early training, especially with large learning rates or batch sizes, by allowing the model to adjust to the data distribution before making large updates. Packing Batches have a pre defined sequence length. Instead of assigning one batch per sample, we can combine multiple small samples in one batch, increasing efficiency. I trained the model on the entire dataset 100k samples using an A100 GPU 40 GB of VRAM on Google Colab. The training took 4 hours and 45 minutes. Of course, you can use smaller GPUs with less VRAM and a smaller batch size, but they re not nearly as fast. For example, it takes roughly 19 hours and 40 minutes on an L4 and a whopping 47 hours on a free T4. In this case, I recommend only loading a subset of the dataset to speed up training. You can do it by modifying the previous code block, like dataset load_dataset mlabonne FineTome 100k , split train 10000 to only load 10k samples. Alternatively, you can use cheaper cloud GPU providers like Paperspace, RunPod, or Lambda Labs. trainer SFTTrainer model model, tokenizer tokenizer, train_dataset dataset, dataset_text_field text , max_seq_length max_seq_length, dataset_num_proc 2, packing True, args TrainingArguments learning_rate 3e 4, lr_scheduler_type linear , per_device_train_batch_size 8, gradient_accumulation_steps 2, num_train_epochs 1, fp16 not is_bfloat16_supported , bf16 is_bfloat16_supported , logging_steps 1, optim adamw_8bit , weight_decay 0.01, warmup_steps 10, output_dir output , seed 0, , trainer.train __ Now that the model is trained, let s test it with a simple prompt. This is not a rigorous evaluation but just a quick check to detect potential issues. We use FastLanguageModel.for_inference to get 2x faster inference. model FastLanguageModel.for_inference model messages from human , value Is 9.11 larger than 9.9? , inputs tokenizer.apply_chat_template messages, tokenize True, add_generation_prompt True, return_tensors pt , .to cuda text_streamer TextStreamer tokenizer _ model.generate input_ids inputs, streamer text_streamer, max_new_tokens 128, use_cache True __ The model s response is 9.9 , which is correct! Let s now save our trained model. If you remember the part about LoRA and QLoRA, what we trained is not the model itself but a set of adapters. There are three save methods in Unsloth lora to only save the adapters, and merged_16bit merged_4bit to merge the adapters with the model in 16 bit 4 bit precision. In the following, we merge them in 16 bit precision to maximize the quality. We first save it locally in the model directory and then upload it to the Hugging Face Hub. You can find the trained model on mlabonne FineLlama 3.1 8B. model.save_pretrained_merged model , tokenizer, save_method merged_16bit model.push_to_hub_merged mlabonne FineLlama 3.1 8B , tokenizer, save_method merged_16bit __ Unsloth also allows you to directly convert your model into GGUF format. This is a quantization format created for llama.cpp and compatible with most inference engines, like LM Studio, Ollama, and oobabooga s text generation webui. 
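Before producing the GGUF files, it can be worth a quick smoke test of the merged 16-bit export with plain transformers. The snippet below is not part of the original workflow; it simply assumes the local model directory created by save_pretrained_merged above and a GPU with enough VRAM to load the model in bfloat16.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the merged 16-bit model from the local 'model' directory
merged_model = AutoModelForCausalLM.from_pretrained(
    'model', torch_dtype=torch.bfloat16, device_map='auto'
)
merged_tokenizer = AutoTokenizer.from_pretrained('model')

# Short generation to confirm the weights and chat template survived the merge
prompt = merged_tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Is 9.11 larger than 9.9?'}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = merged_tokenizer(prompt, return_tensors='pt').to(merged_model.device)
outputs = merged_model.generate(**inputs, max_new_tokens=64)
print(merged_tokenizer.decode(outputs[0], skip_special_tokens=True))
```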
Since you can specify different precisions (see my article about GGUF and llama.cpp), we'll loop over a list to quantize the model in q2_k, q3_k_m, q4_k_m, q5_k_m, q6_k, and q8_0, and upload these quants on Hugging Face. The mlabonne/FineLlama-3.1-8B-GGUF repository contains all our GGUFs. quant_methods = ['q2_k', 'q3_k_m', 'q4_k_m', 'q5_k_m', 'q6_k', 'q8_0']; for quant in quant_methods: model.push_to_hub_gguf('mlabonne/FineLlama-3.1-8B-GGUF', tokenizer, quant). Congratulations, we fine-tuned a model from scratch and uploaded quants you can now use in your favorite inference engine. Feel free to try the final model available on mlabonne/FineLlama-3.1-8B-GGUF. What to do now? Here are some ideas on how to use your model: Evaluate it on the Open LLM Leaderboard (you can submit it for free) or using other evals like in LLM AutoEval. Align it with Direct Preference Optimization using a preference dataset like mlabonne/orpo-dpo-mix-40k to boost performance. Quantize it in other formats like EXL2, AWQ, GPTQ, or HQQ for faster inference or lower precision using AutoQuant. Deploy it on a Hugging Face Space with ZeroChat for models that have been sufficiently trained to follow a chat template (around 20k samples). Conclusion. This article provided a comprehensive overview of supervised fine-tuning and how to apply it in practice to a Llama 3.1 8B model. By leveraging QLoRA's efficient memory usage, we managed to fine-tune an 8B LLM on a super high-quality dataset with limited GPU resources. We also provided more efficient alternatives for bigger runs and suggestions for further steps, including evaluation, preference alignment, quantization, and deployment. I hope this guide was useful. If you're interested in learning more about LLMs, I recommend checking the LLM Course. If you enjoyed this article, follow me on X (maximelabonne) and on Hugging Face (mlabonne). Good luck fine-tuning models!", "platform": "mlabonne.github.io", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html" }, { "id": "4c510a29-a59a-4e15-874e-a5bd836a17de", "content": "The Rise of Agentic Data Generation: Combining AgentInstruct and Arena Learning. Author: Maxime Labonne. Published: July 15, 2024.
With the consolidation of LLM architectures, the quality of training data has become the most important factor in creating state-of-the-art models. This is true for both pre-training and post-training, where instruction datasets have a major impact on the final model. Two innovative approaches have recently emerged to address the challenge of generating high-quality instruction datasets for post-training LLMs: AgentInstruct and Arena Learning. Both frameworks come from Microsoft Research and leverage multiple LLMs to create and refine samples. In this article, I want to explore both methods, analyze their similarities and differences, and see how we could combine them in a single end-to-end framework. AgentInstruct: A Multi-Agent Approach. AgentInstruct is an agentic framework by Mitra et al. (2024), designed to generate large-scale, diverse, and high-quality synthetic data. The framework uses a sophisticated pipeline that transforms raw text into refined instructions through multiple stages of processing. In the paper, the agents seem to be based on GPT-4, which is also used to evaluate data quality and hallucinations in some contexts. (Figure from the AgentInstruct paper.) The AgentInstruct pipeline consists of four main steps: Seed Collection: Assemble a diverse collection of raw seeds, such as textbook chapters, web articles, and code snippets. These seeds serve as the foundation for generating new instructions. Content Transformation: One or more specialized agents modify each seed into an intermediate representation that simplifies instruction creation. These agents are designed to perform tasks like generating argument passages, debates, conversations, meeting transcripts, poems, satirical content, etc. Seed Instruction Generation: Multiple agents take the transformed seed and generate diverse instructions based on a pre-defined taxonomy of instruction types. For example, in the domain of reading comprehension, the taxonomy includes 43 question types, ranging from literal comprehension to critical analysis and inference. Instruction Refinement: The final stage involves iteratively enhancing the complexity and quality of the generated instructions. This is achieved through suggester-editor agent pairs. Suggester agents propose ways to increase instruction complexity, while editor agents modify the instructions accordingly.
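AgentInstruct's code has not been released, but the suggester-editor refinement stage can be sketched as a simple loop over two prompted agents. Everything below, including the complete() helper, the prompts, and the number of rounds, is hypothetical and only meant to make the idea concrete.

```python
from typing import Callable

def refine_instruction(instruction: str, complete: Callable[[str], str], n_rounds: int = 2) -> str:
    # Hypothetical suggester/editor loop: complete() stands for any LLM call
    for _ in range(n_rounds):
        # Suggester agent: propose ways to make the instruction more complex
        suggestions = complete(
            'Suggest ways to increase the complexity of this instruction '
            f'without changing its topic:\n{instruction}'
        )
        # Editor agent: rewrite the instruction according to the suggestions
        instruction = complete(
            'Rewrite the following instruction by applying these suggestions.\n'
            f'Instruction: {instruction}\nSuggestions: {suggestions}'
        )
    return instruction
```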
To get a better idea of what each stage produces, I recommend reading the examples provided in the paper. Each flow in the AgentInstruct pipeline consists of multiple agents powered by LLMs. These agents can be equipped with tools like search APIs or code interpreters to enhance their capabilities. The roles of these agents are carefully defined in their system messages to ensure they perform their specific tasks effectively. The authors of AgentInstruct implemented flows for 17 different skills, each with multiple subcategories. These skills cover a wide range of areas, including reading comprehension, question answering, coding, retrieval augmented generation, creative writing, tool use, and web control. Using this comprehensive pipeline, the researchers generated approximately 22 million instructions. They combined this synthetic data with 3.8 million instructions from other sources to create a dataset of 25.8 million paired instructions. This dataset was then used to fine tune the Mistral 7b model, resulting in the creation of the Orca 3 model. Arena Learning A Competitive Refinement Approach Arena Learning by Luo, Suo, et al. 2024 takes a different approach to generating high quality instruction data. Instead of creating instructions from scratch, it focuses on refining existing instruction datasets through a simulated competitive environment. It is not an agentic framework because tools are not provided to the models, but could easily be transformed into one. _Figure from the Arena Learning paper._ The key components of the Arena Learning pipeline are Offline Pairwise LLM Arena Arena Learning creates a simulated arena where multiple LLMs compete against each other on a large set of instruction data. A judge LLM meta llama Meta Llama 3 70B Instruct evaluates the responses from competing models for each instruction, providing rankings, scores, and explanations. This process effectively simulates human evaluation but at a much larger scale and lower cost. Data Collection and Preprocessing The framework starts with a large corpus of conversational data collected from various open sources. This data goes through filtering, cleaning, and deduplication. Instructions that are too short, illegal toxic, or too similar to benchmark test sets are removed. The refined dataset is then split into multiple parts for iterative training. Iterative Battle and Model Evolution The process involves multiple rounds of battles and training 1. An initial model WizardLM \u03b2 SFT I0 is trained on a subset of data. 2. This model competes against other state of the art LLMs on another data subset. 3. Instances where WizardLM \u03b2 loses are collected, with the winning model s response used as the target for fine tuning. 4. The process repeats for multiple iterations, with each iteration potentially using different training strategies SFT, DPO, PPO . Training Strategies Arena Learning employs multiple training strategies to improve the model _Supervised Fine Tuning SFT _ Uses battle results to fine tune the model on instances where it performed poorly. _Direct Preference Optimization DPO _ Treats win loss responses as choice reject pairs for training. _Proximal Policy Optimization PPO _ Uses battle results to train both a reward model and the language model. WizardArena Evaluation The authors create an offline test set WizardArena with diverse and hard subsets. This is used to evaluate models through pairwise battles, with results used to compute Elo rankings. 
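The Elo rankings mentioned above can be computed from pairwise battle outcomes with the standard update rule. The short sketch below is a generic illustration (default K-factor and initial rating), not code from the Arena Learning paper.

```python
from collections import defaultdict

def compute_elo(battles, k=32, initial=1000.0):
    # Update Elo ratings from a list of (winner, loser) pairs
    ratings = defaultdict(lambda: initial)
    for winner, loser in battles:
        # Expected score of the winner given the current rating gap
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

# Toy example with hypothetical model names
print(compute_elo([('wizardlm-beta', 'model-a'), ('model-b', 'wizardlm-beta')]))
```

Each round of battles updates the ratings, so running this over the judge's verdicts gives the kind of leaderboard used to track progress across iterations.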
The evaluation closely aligns with human based arenas but is much faster and cheaper. Data Selection The pipeline uses various strategies to select high quality training data, such as threshold based filtering to control data size and quality, focusing on instances where the model underperforms, and gradually shifting towards more complex data in later iterations. _Figure from the Arena Learning paper._ This framework allows for multiple iterations of battles and training, as illustrated with WizardLM \u03b2. The model s capabilities are progressively strengthened, particularly in complex tasks. The process results in significant gains in Elo rankings, MT bench scores, and other evaluation metrics. Arena Learning focuses on improving areas where the model under training is currently lacking. A nice feature is that it doesn t require particularly powerful models like Claude 3.5 Sonnet or GPT 4o. Models with a similar level can be better in some tasks and domains, as well as more suited to answer certain prompt syntaxes. It means that the entire pipeline can be deployed using open weight models, which is a big advantage if you already have a high quality infrastructure. ArenaInstruct Combining AgentInstruct and Arena Learning While both AgentInstruct and Arena Learning aim to generate high quality data for post training language models, they take fundamentally different approaches to achieve this goal. Understanding how they differ, as well as their strengths and weaknesses is a good first step to see how we could combine them. I selected four points I want to focus on Data Generation AgentInstruct starts from raw text, generating instructions from scratch through a multi stage pipeline. This allows for the creation of entirely new content, potentially leading to greater diversity and novelty in the generated instructions. On the other hand, Arena Learning refines existing instruction datasets through simulated battles between models. This method leverages the quality of existing datasets while improving upon them through competitive evaluation. Data Quality AgentInstruct relies on suggester editor agent pairs for iterative refinement of instructions. This approach allows for fine grained control over the complexity and quality of generated instructions. Arena Learning, in contrast, uses an LLM as a judge to evaluate responses in simulated battles. It means that the entire data quality process is handled by a single model. Diversity and Complexity AgentInstruct explicitly i.e., manually designs for diversity through a taxonomy of instruction types and multiple transformation agents. This structured approach ensures coverage across a wide range of skills and instruction types. Arena Learning s diversity comes from the variety of competing models and initial instruction datasets. While this may lead to less structured diversity, it could potentially capture more natural variations in instruction styles. Flexibility AgentInstruct s pipeline allows for easy addition of new seed types and instruction categories, making it highly adaptable to new domains and tasks. Arena Learning s iterative battle process enables continuous improvement of the target model, potentially allowing it to adapt more quickly to new challenges and competing models. Based on this comparison, it s not too difficult to see how we can leverage the advantages of each framework. For instance, a taxonomy based data generation is more steerable and could be improved upon by arena learning. 
But we could also use feedback signals to improve this first step over multiple iterations. Here s how such a hybrid approach might work 1. AgentInstruct Instruction Generation Use AgentInstruct to create a broad and diverse base of instructions no answers! from raw text. This would ensure wide coverage of tasks and domains that are relevant for our use cases. 2. Arena Learning Answer Generation Apply Arena Learning s competitive battle approach to refine and select the highest quality answers from a pool of models. This would combine AgentInstruct s ability to generate novel content with Arena Learning s robust quality control mechanism. 3. Data Quality Evaluation Instead of relying on a single LLM as a judge, we can use reward models or an LLM as a jury to improve the data selection process. 4. Diversity Feedback Use insights from Arena Learning battles to dynamically update AgentInstruct s instruction taxonomy. This would focus the generation process on producing more of the instruction types that prove most challenging or useful in real world scenarios. 5. Complexity Feedback Leverage Arena Learning s performance metrics to identify areas where instructions are too easy or too difficult. Use this information to guide AgentInstruct s complexity refinement process, ensuring a well balanced dataset that challenges the model appropriately over several iterations. By combining these approaches, we can create a powerful feedback loop between instruction generation and evaluation. This hybrid framework would benefit from AgentInstruct s ability to generate novel, diverse content and Arena Learning s competitive quality control and model improvement process. The result would be a more robust, effective, and continuously improving post training dataset for LLMs. Conclusion In conclusion, this article explored two recent approaches in synthetic data generation AgentInstruct and Arena Learning. We proposed a hybrid solution that combines AgentInstruct s structured, taxonomy based methodology with Arena Learning s iterative refinement using multiple LLMs. This combination leverages the strengths of both frameworks, allowing for a systematic generation of diverse data while enabling continuous improvement of the underlying taxonomy through feedback from the LLM pool. I feel like we might lose some quality by removing the suggester editor agent pairs. Let me know if you have better ideas. Still, data quality evaluation is a significant challenge to perfect this approach. The current reliance on models like GPT 4 or Llama 3 70B Instruct as judges is imperfect and has known limitations see my quick review here . Improving the quality assessment stage could lead to more efficient datasets, achieving better performance with fewer samples. To know more about how to create high quality datasets, check out my GitHub repo LLM Datasets. 
", "platform": "mlabonne.github.io", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://mlabonne.github.io/blog/posts/2024-07-15_The_Rise_of_Agentic_Data_Generation.html" }, { "id": "5a56c009-565d-4dc4-9bd5-d2b1be2ca2d4", "content": "Uncensor any LLM with abliteration: Fine-tuning without retraining. Author: Maxime Labonne. Published: June 12, 2024. (Image generated with DALL-E 3 by the author.) The third generation of Llama models provided fine-tuned (Instruct) versions that excel in understanding and following instructions. However, these models are heavily censored, designed to refuse requests seen as harmful with responses such as: As an AI assistant, I cannot help you. While this safety feature is crucial for preventing misuse, it limits the model's flexibility and responsiveness. In this article, we will explore a technique called abliteration that can uncensor any LLM without retraining. This technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts. The code is available on Google Colab and in the LLM Course on GitHub. Special thanks to FailSpy for proofreading this article. What is abliteration? Modern LLMs are fine-tuned for safety and instruction following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests. In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block (pre), between the attention and MLP layers (mid), and after the MLP (post). The following figure illustrates the location of each residual stream. (Image by author.) To uncensor an LLM, we first need to identify the refusal direction within the model. This process involves a few technical steps: 1. Data Collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each. 2. Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the refusal direction for each layer of the model. 3. Selection: Normalize these vectors and evaluate them to select the single best refusal direction. Once we have identified the refusal direction, we can ablate it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization. Let's talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection.
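In code, this subtraction is just a vector projection. The toy snippet below illustrates the operation on a single activation vector; it is only the underlying math, not the TransformerLens hook used later in the implementation.

```python
import torch

# Toy example: remove the component of an activation along the refusal direction
d_model = 8
activation = torch.randn(d_model)
refusal_dir = torch.randn(d_model)
refusal_dir = refusal_dir / refusal_dir.norm()  # unit-norm refusal direction

proj = (activation @ refusal_dir) * refusal_dir  # projection onto the direction
ablated = activation - proj                      # activation with the direction removed

print(torch.dot(ablated, refusal_dir))  # ~0: the direction is no longer represented
```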
This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction. On the other hand, weight orthogonalization involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream, ensuring they do not contribute to the refusal direction. In the next section, we will implement abliteration with weight orthogonalization. Implementation The following implementation of abliteration is based on FailSpy s notebook, which is itself based on the original authors notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite code heavy so you can see what is going on, but you can use FailSpy s abliterator library if you re less interested in the technical details also check his collection of abliterated models on Hugging Face . The code relies on the excellent TransformerLens library formerly known as EasyTransformer to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library. First, let s install the necessary packages and import them. All these steps are available in this Google Colab notebook. !pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping import torch import functools import einops import gc from datasets import load_dataset from tqdm import tqdm from torch import Tensor from typing import List from transformer_lens import HookedTransformer, utils from transformer_lens.hook_points import HookPoint from transformers import AutoModelForCausalLM, AutoTokenizer from jaxtyping import Float, Int from collections import defaultdict Turn automatic differentiation off to save GPU memory credit Undi95 torch.set_grad_enabled False We need two datasets one containing harmless instructions, and one containing harmful instructions. We ll use tatsu lab alpaca as well as data from llm attacks. To make things easier, I repackaged them in two Hugging Face datasets mlabonne harmless_behaviors and mlabonne harmful_behaviors. That way, you can easily replace them with your own datasets. We will load the instructions and reformat them into a list of dictionaries with role and content keys. This makes it compatible with the apply_chat_tokenizer method, which we will use to follow Llama 3 s chat template. def reformat_texts texts return role user , content text for text in texts Get harmful and harmless datasets def get_harmful_instructions dataset load_dataset mlabonne harmful_behaviors return reformat_texts dataset train text , reformat_texts dataset test text def get_harmless_instructions dataset load_dataset mlabonne harmless_alpaca return reformat_texts dataset train text , reformat_texts dataset test text harmful_inst_train, harmful_inst_test get_harmful_instructions harmless_inst_train, harmless_inst_test get_harmless_instructions Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can t directly load a custom model using HookedTransformer . Here, I use a trick described in FailSpy s notebook to download a custom model and rename it as meta llama Meta Llama 3 8B Instruct. Load in torch.float16 format if your GPU is not compatible with BF16. 
In this example, we ll use mlabonne Daredevil 8B, a mega merge created with DARE TIES see my article about model merging that has the highest MMLU score on the Open LLM Leaderboard in the 8B category. MODEL_ID mlabonne Daredevil 8B MODEL_TYPE meta llama Meta Llama 3 8B Instruct Download and load model !git clone https huggingface.co MODEL_ID MODEL_TYPE Load model and tokenizer model HookedTransformer.from_pretrained_no_processing MODEL_TYPE, local_files_only True, dtype torch.bfloat16, default_padding_side left tokenizer AutoTokenizer.from_pretrained MODEL_TYPE tokenizer.padding_side left tokenizer.pad_token tokenizer.eos_token We can now tokenize our datasets. We re using the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM VRAM, which is why I m limiting it to 256 here. def tokenize_instructions tokenizer, instructions return tokenizer.apply_chat_template instructions, padding True, truncation False, return_tensors pt , return_dict True, add_generation_prompt True, .input_ids n_inst_train min 256, len harmful_inst_train , len harmless_inst_train Tokenize datasets harmful_tokens tokenize_instructions tokenizer, instructions harmful_inst_train n_inst_train , harmless_tokens tokenize_instructions tokenizer, instructions harmless_inst_train n_inst_train , Everything is set up, we can now implement the first step of abliteration data collection. We want to process these tokenized datasets and store the residual stream activations in harmful and harmless . This is managed by the transformer_lens library. batch_size 32 Initialize defaultdicts to store activations harmful defaultdict list harmless defaultdict list Process the training data in batches num_batches n_inst_train batch_size 1 batch_size for i in tqdm range num_batches print i start_idx i batch_size end_idx min n_inst_train, start_idx batch_size Run models on harmful and harmless prompts, cache activations harmful_logits, harmful_cache model.run_with_cache harmful_tokens start_idx end_idx , names_filter lambda hook_name resid in hook_name, device cpu , reset_hooks_end True harmless_logits, harmless_cache model.run_with_cache harmless_tokens start_idx end_idx , names_filter lambda hook_name resid in hook_name, device cpu , reset_hooks_end True Collect and store the activations for key in harmful_cache harmful key .append harmful_cache key harmless key .append harmless_cache key Flush RAM and VRAM del harmful_logits, harmless_logits, harmful_cache, harmless_cache gc.collect torch.cuda.empty_cache Concatenate the cached activations harmful k torch.cat v for k, v in harmful.items harmless k torch.cat v for k, v in harmless.items We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort them in descending order in activation_scored . 
Helper function to get activation index def get_act_idx cache_dict, act_name, layer key act_name, layer return cache_dict utils.get_act_name key Compute difference of means between harmful and harmless activations at intermediate layers activation_layers resid_pre , resid_mid , resid_post activation_refusals defaultdict list for layer_num in range 1, model.cfg.n_layers pos 1 Position index for layer in activation_layers harmful_mean_act get_act_idx harmful, layer, layer_num , pos, .mean dim 0 harmless_mean_act get_act_idx harmless, layer, layer_num , pos, .mean dim 0 refusal_dir harmful_mean_act harmless_mean_act refusal_dir refusal_dir refusal_dir.norm activation_refusals layer .append refusal_dir selected_layers resid_pre activation_scored sorted activation_refusals layer l 1 for l in range 1, model.cfg.n_layers for layer in selected_layers , key lambda x abs x.mean , reverse True, The final step of the process consists of evaluating the refusal directions we calculated. To do this, we re going to apply the refusal direction to each residual stream and each block during inference. In the following snippet, we get generations for four test harmful instructions and 20 blocks or layers . def _generate_with_hooks model HookedTransformer, tokenizer AutoTokenizer, tokens Int Tensor, batch_size seq_len , max_tokens_generated int 64, fwd_hooks , List str all_tokens torch.zeros tokens.shape 0 , tokens.shape 1 max_tokens_generated , dtype torch.long, device tokens.device, all_tokens , tokens.shape 1 tokens for i in range max_tokens_generated with model.hooks fwd_hooks fwd_hooks logits model all_tokens , max_tokens_generated i next_tokens logits , 1, .argmax dim 1 greedy sampling temperature 0 all_tokens , max_tokens_generated i next_tokens return tokenizer.batch_decode all_tokens , tokens.shape 1 , skip_special_tokens True def get_generations model HookedTransformer, tokenizer AutoTokenizer, instructions List str , fwd_hooks , max_tokens_generated int 64, batch_size int 4, List str generations for i in tqdm range 0, len instructions , batch_size tokens tokenize_instructions tokenizer, instructions instructions i i batch_size generation _generate_with_hooks model, tokenizer, tokens, max_tokens_generated max_tokens_generated, fwd_hooks fwd_hooks, generations.extend generation return generations Inference time intervention hook def direction_ablation_hook activation Float Tensor, ... d_act , hook HookPoint, direction Float Tensor, d_act , if activation.device ! direction.device direction direction.to activation.device proj einops.einsum activation, direction.view 1, 1 , ... d_act, d_act single ... single direction return activation proj Testing baseline N_INST_TEST 4 baseline_generations get_generations model, tokenizer, harmful_inst_test N_INST_TEST , fwd_hooks Evaluating layers defined earlier needs human evaluation to determine best layer for refusal inhibition EVAL_N 20 Evaluate how many of the top N potential directions evals for refusal_dir in tqdm activation_scored EVAL_N hook_fn functools.partial direction_ablation_hook, direction refusal_dir fwd_hooks utils.get_act_name act_name, layer , hook_fn for layer in list range model.cfg.n_layers for act_name in activation_layers intervention_generations get_generations model, tokenizer, harmful_inst_test N_INST_TEST , fwd_hooks fwd_hooks evals.append intervention_generations We stored all the generations in the evals list. We can now print them and manually select the layer block that provides an uncensored response for each instruction. 
I m automatically excluding responses containing I cannot and I can t to filter out unwanted answers. If you can t find a layer that satisfies these requirements, you might want to test other residual streams in the previous selected_layers list, other instructions, additional blocks, etc. Print generations for human evaluation blacklist I cannot , I can t for i in range N_INST_TEST print f 033 1mINSTRUCTION i harmful_inst_test i print f nBASELINE COMPLETION n baseline_generations i 033 0m for layer_candidate in range EVAL_N if not any word in evals layer_candidate i for word in blacklist print f n n nLAYER CANDIDATE layer_candidate INTERVENTION COMPLETION print evals layer_candidate i In my case, the layer candidate 9 managed to provide uncensored answer for the four instructions. This is the one that we will select for the refusal direction. In the following, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions. def get_orthogonalized_matrix matrix Float Tensor, ... d_model , vec Float Tensor, d_model Float Tensor, ... d_model proj einops.einsum matrix, vec.view 1, 1 , ... d_model, d_model single ... single vec return matrix proj Select the layer with the highest potential refusal direction LAYER_CANDIDATE 9 refusal_dir activation_scored LAYER_CANDIDATE Orthogonalize the model s weights if refusal_dir.device ! model.W_E.device refusal_dir refusal_dir.to model.W_E.device model.W_E.data get_orthogonalized_matrix model.W_E, refusal_dir for block in tqdm model.blocks if refusal_dir.device ! block.attn.W_O.device refusal_dir refusal_dir.to block.attn.W_O.device block.attn.W_O.data get_orthogonalized_matrix block.attn.W_O, refusal_dir block.mlp.W_out.data get_orthogonalized_matrix block.mlp.W_out, refusal_dir Generate text with abliterated model orthogonalized_generations get_generations model, tokenizer, harmful_inst_test N_INST_TEST , fwd_hooks Print generations for i in range N_INST_TEST if len baseline_generations i print f INSTRUCTION i harmful_inst_test i print f 033 92mBASELINE COMPLETION n baseline_generations i print f 033 91mINTERVENTION COMPLETION n evals LAYER_CANDIDATE i print f 033 95mORTHOGONALIZED COMPLETION n orthogonalized_generations i n We re now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF hub. Convert model back to HF safetensors hf_model AutoModelForCausalLM.from_pretrained MODEL_TYPE, torch_dtype torch.bfloat16 lm_model hf_model.model state_dict model.state_dict lm_model.embed_tokens.weight torch.nn.Parameter state_dict embed.W_E .cpu for l in range model.cfg.n_layers lm_model.layers l .self_attn.o_proj.weight torch.nn.Parameter einops.rearrange state_dict f blocks. l .attn.W_O , n h m m n h , n model.cfg.n_heads .contiguous lm_model.layers l .mlp.down_proj.weight torch.nn.Parameter torch.transpose state_dict f blocks. l .mlp.W_out , 0, 1 .contiguous hf_model.push_to_hub f MODEL_ID abliterated DPO Fine Tuning I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous benchmark suite. Here are the results Image by author As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the ablated version across all benchmarks. The ablation process successfully uncensored it but also degraded the model s quality. 
To address this issue, an idea consists of further training our abliterated model to heal it. Like most fine tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine tuning. An additional SFT would likely break the model s performance. Alternatively, preference alignment is quite light and shouldn t lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl thanks to Wing Lian for creating Axolotl with the mlabonne orpo dpo mix 40k dataset. Here s the configuration I used base_model mlabonne Daredevil 8B abliterated model_type LlamaForCausalLM tokenizer_type AutoTokenizer load_in_8bit false load_in_4bit true strict false save_safetensors true rl dpo chat_template chatml datasets path mlabonne orpo dpo mix 40k split train type chatml.intel dataset_prepared_path val_set_size 0.0 output_dir . out adapter qlora lora_model_dir sequence_len 2048 sample_packing false pad_to_sequence_len false lora_r 64 lora_alpha 32 lora_dropout 0.05 lora_target_linear true lora_fan_in_fan_out wandb_project axolotl wandb_entity wandb_watch wandb_name wandb_log_model gradient_accumulation_steps 8 micro_batch_size 1 num_epochs 1 optimizer paged_adamw_8bit lr_scheduler cosine learning_rate 5e 6 train_on_inputs false group_by_length false bf16 auto fp16 tf32 gradient_checkpointing true early_stopping_patience resume_from_checkpoint local_rank logging_steps 1 xformers_attention flash_attention true warmup_steps 100 evals_per_epoch 0 eval_table_size eval_table_max_new_tokens 128 saves_per_epoch 1 debug deepspeed deepspeed_configs zero2.json weight_decay 0.0 special_tokens pad_token end_of_text I trained it using 6xA6000 GPUs with DeepSpeed ZeRO 2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W B Image by author It automatically uploaded the DPO fine tuned model, called mlabonne NeuralDaredevil 8B abliterated. To see if it fixed our abliterated version, I evaluated it on the same benchmarks Image by author We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn t improve is GSM8K, a math dataset, which could mean the orpo dpo mix 40k would benefit from more math samples. The final model is an uncensored LLM with state of the art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don t need censorship. You can play with quantized versions like GGUF in LM Studio. Conclusion In this article, we introduced the concept of abliteration. This technique uses the model s activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model s weights and ensure that we stop outputting refusals. This technique also demonstrates the fragility of safety fine tuning and raises ethical considerations. We applied abliteration to Daredevil 8B to uncensor it, which also degraded the model s performance. We then healed it using DPO to create the NeuralDaredevil 8B model, a fully uncensored and high quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy s MopeyMule, which adopts a melancholic conversational style. I hope you liked this article. If you want to see more follow me on Hugging Face and Twitter maximelabonne. 
References: FailSpy, abliterator library, GitHub, 2024. Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, Refusal in LLMs is mediated by a single direction, LessWrong, 2024.", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/uncensor-any-llm-with-abliteration-d30148b7d43e" }, { "id": "d3bf078f-7028-410f-b4ed-b79e717f7927", "content": "Create Mixtures of Experts with MergeKit: Combine multiple models into a single MoE. Author: Maxime Labonne. Published: March 27, 2024. (Image by author.) Thanks to the release of Mixtral, the Mixture of Experts (MoE) architecture has become popular in recent months. This architecture offers an interesting tradeoff: higher performance at the cost of increased VRAM usage. While Mixtral and other MoE architectures are pre-trained from scratch, another method of creating MoEs has recently appeared. Thanks to Arcee's MergeKit library, we now have a new way of creating MoEs by ensembling several pre-trained models. These are often referred to as frankenMoEs or MoErges to distinguish them from the pre-trained MoEs. In this article, we will detail how the MoE architecture works and how frankenMoEs are created. Finally, we will make our own frankenMoE with MergeKit and evaluate it on several benchmarks. The code is available on Google Colab in a wrapper called LazyMergeKit. Special thanks to Charles Goddard, the creator of MergeKit, for proofreading this article. Introduction to MoEs. A Mixture of Experts is an architecture designed for improved efficiency and performance. It uses multiple specialized subnetworks, known as experts. Unlike dense models, where the entire network is activated, MoEs only activate relevant experts based on the input. This results in faster training and more efficient inference. There are two components at the core of an MoE model: 1. Sparse MoE Layers: These replace the dense feed-forward network layers in the transformer architecture. Each MoE layer contains several experts, and only a subset of these experts are engaged for a given input. 2. Gate Network or Router: This component determines which tokens are processed by which experts, ensuring that each part of the input is handled by the most suitable expert(s). In the following example, we show how a Mistral 7B block is transformed into an MoE block with a sparse MoE layer (feed-forward networks 1, 2, and 3) and a router. This example represents an MoE with three experts, where two are currently engaged (FFN 1 and FFN 3).
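To make the routing mechanism concrete, here is a toy sparse MoE layer in PyTorch with three experts and two active experts per token. It is a simplified illustration of the example described above, not Mixtral's actual implementation (which handles batching and weight normalization differently).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    # Minimal sparse MoE layer: a router selects k experts per token
    def __init__(self, d_model=16, d_ff=32, num_experts=3, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # keep the k highest-scoring experts
        weights = F.softmax(weights, dim=-1)            # normalize their contributions
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for weight, idx in zip(weights[token], indices[token]):
                out[token] += weight * self.experts[idx](x[token])
        return out

moe = ToySparseMoE()
print(moe(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```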
Image by author MoEs also come with their own set of challenges, especially in terms of fine tuning and memory requirements. The fine tuning process can be difficult due to the model s complexity, with the need to balance expert usage during training to properly train the gating weights to select the most relevant ones. In terms of memory, even though only a fraction of the total parameters are used during inference, the entire model, including all experts, needs to be loaded into memory , which requires high VRAM capacity. More specifically, there are two essential parameters when it comes to MoEs Number of experts num_local_experts This determines the total number of experts in the architecture e.g., 8 for Mixtral . The higher the number of experts, the higher the VRAM usage. Number of experts token num_experts_per_tok This determines the number of experts that are engaged for each token and each layer e.g., 2 for Mixtral . There is a tradeoff between a high number of experts per token for accuracy but diminishing returns vs. a low number for fast training and inference. Historically, MoEs have underperformed dense models. However, the release of Mixtral 8x7B in December 2023 shook things up and showed impressive performance for its size. Additionally, GPT 4 is also rumored to be an MoE, which would make sense as it would be a lot cheaper to run and train for OpenAI compared to a dense model. In addition to these recent excellent MoEs, we now have a new way of creating MoEs with MergeKit frankenMoEs, also called MoErges. True MoEs vs. frankenMoEs The main difference between true MoEs and frankenMoEs is how they re trained. In the case of true MoEs, the experts and the router are trained jointly. In the case of frankenMoEs, we upcycle existing models and initialize the router afterward. In other words, we copy the weights of the layer norm and self attention layers from a base model, and then copy the weights of the FFN layers found in each expert. This means that besides the FFNs, all the other parameters are shared. This explains why Mixtral 8x7B with eight experts doesn t have 8 7 56B parameters, but about 45B. This is also why using two experts per token gives the inference speed FLOPs of a 12B dense model instead of 14B. FrankenMoEs are about selecting the most relevant experts and initializing them properly. MergeKit currently implements three ways of initializing the routers 1. Random Random weights. Be careful when using it as the same experts might be selected every time it requires further fine tuning or num_local_experts num_experts_per_tok , which means you don t need any routing . 2. Cheap embed It uses the raw embeddings of the input tokens directly and applies the same transformation across all layers. This method is computationally inexpensive and suitable for execution on less powerful hardware. 3. Hidden It creates hidden representations of a list of positive and negative prompts by extracting them from the last layer of the LLM. They are averaged and normalized to initialize the gates. More information about it is available on Charles Goddard s blog. As you can guess, the hidden initialization is the most efficient to correctly route the tokens to the most relevant experts. In the next section, we will create our own frankenMoE using this technique. Creating a frankenMoE To create our frankenMoE, we need to select n experts. In this case, we will rely on Mistral 7B thanks to its popularity and relatively small size. 
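Before sizing the model, it is worth seeing roughly what the hidden initialization does. The sketch below embeds a few positive prompts with a base model, then averages and normalizes the last-layer hidden states to build one router row per expert. This is only an illustration of the idea; MergeKit's real implementation differs in the details (prompt handling, negative prompts, per-layer gates).

```python
# Rough sketch of "hidden" gate initialization: one normalized direction per
# expert, computed from the last-layer hidden states of its positive prompts.
# Illustration of the concept only, not MergeKit's actual code.
import torch
from transformers import AutoModel, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"   # any base model works for the sketch
tok = AutoTokenizer.from_pretrained(base)
model = AutoModel.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")

def prompt_direction(prompts):
    reps = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        h = model(**ids).last_hidden_state[0, -1]   # last layer, last token
        reps.append(h.float())
    v = torch.stack(reps).mean(dim=0)
    return v / v.norm()                              # averaged and normalized

# One gate row per expert, built from that expert's positive prompts
gate_rows = [
    prompt_direction(["write a story", "describe a scene"]),
    prompt_direction(["solve this equation", "count the apples"]),
]
gate_weight = torch.stack(gate_rows)   # shape: (num_experts, hidden_dim)
```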
However, eight experts like in Mixtral is quite a lot, as we need to fit all of them in memory. For efficiency, I ll only use four experts in this example, with two of them engaged for each token and each layer. In this case, we will end up with a model with 24.2B parameters instead of 4 7 28B parameters. Here, our goal is to create a well rounded model that can do pretty much everything write stories, explain articles, code in Python, etc. We can decompose this requirement into four tasks and select the best expert for each of them. This is how I decomposed it Chat model a general purpose model that is used in most interactions. I used mlabonne AlphaMonarch 7B, which perfectly satisfies the requirements. Code model a model capable of generating good code. I don t have a lot of experience with Mistral 7B based code models, but I found beowolx CodeNinja 1.0 OpenChat 7B particularly good compared to others. Math model math is tricky for LLMs, which is why we want a model specialized in math. Thanks to its high MMLU and GMS8K scores, I chose mlabonne NeuralDaredevil 7B for this purpose. Role play model The goal of this model is to write high quality stories and conversations. I selected SanjiWatsuki Kunoichi DPO v2 7B because of its good reputation and high MT Bench score 8.51 vs. 8.30 for Mixtral . Now that we ve identified the experts we want to use, we can create the YAML configuration that MergeKit will use to create our frankenMoE. This uses the mixtral branch of MergeKit. You can find more information about how to write the configuration on this page. Here is our version base_model mlabonne AlphaMonarch 7B experts source_model mlabonne AlphaMonarch 7B positive_prompts chat assistant tell me explain I want source_model beowolx CodeNinja 1.0 OpenChat 7B positive_prompts code python javascript programming algorithm source_model SanjiWatsuki Kunoichi DPO v2 7B positive_prompts storywriting write scene story character source_model mlabonne NeuralDaredevil 7B positive_prompts reason math mathematics solve count For each expert, I provide five basic positive prompts. You can be a bit fancier and write entire sentences if you want. The best strategy consists of using real prompts that should trigger a particular expert. You can also add negative prompts to do the opposite. Once this is ready, you can save your configuration as config.yaml . In the same folder, we will download and install the mergekit library mixtral branch . git clone b mixtral https github.com arcee ai mergekit.git cd mergekit pip install e . pip install U transformers If your computer has enough RAM roughly 24 32 GB of RAM , you can run the following command mergekit moe config.yaml merge copy tokenizer If you don t have enough RAM, you can shard the models instead as follows it will take longer mergekit moe config.yaml merge copy tokenizer allow crimes out shard size 1B lazy unpickle This command automatically downloads the experts and creates the frankenMoE in the merge directory. For the hidden gate mode, you can also use the load in 4bit and load in 8bit options to compute hidden states with lower precision. Alternatively, you can copy your configuration into LazyMergekit, a wrapper I made to simplify model merging. In this Colab notebook, you can input your model name, select the mixtral branch, specify your Hugging Face username token, and run the cells. After creating your frankenMoE, it will also upload it to the Hugging Face Hub with a nicely formatted model card. 
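Putting the pieces together, a notebook version of this workflow might look like the sketch below. The YAML mirrors the configuration listed above, and the CLI calls follow my reading of the MergeKit mixtral branch, so double-check the exact flags against the repository before running.

```python
# Notebook-style sketch of the frankenMoE workflow described above.
import yaml

yaml_config = """
base_model: mlabonne/AlphaMonarch-7B
experts:
  - source_model: mlabonne/AlphaMonarch-7B
    positive_prompts: ["chat", "assistant", "tell me", "explain", "I want"]
  - source_model: beowolx/CodeNinja-1.0-OpenChat-7B
    positive_prompts: ["code", "python", "javascript", "programming", "algorithm"]
  - source_model: SanjiWatsuki/Kunoichi-DPO-v2-7B
    positive_prompts: ["storywriting", "write", "scene", "story", "character"]
  - source_model: mlabonne/NeuralDaredevil-7B
    positive_prompts: ["reason", "math", "mathematics", "solve", "count"]
"""
with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(yaml_config)

# Install the mixtral branch of MergeKit, then run the merge
!git clone -b mixtral https://github.com/arcee-ai/mergekit.git
!cd mergekit && pip install -e . && pip install -U transformers
!mergekit-moe config.yaml merge --copy-tokenizer
# Low-RAM variant (shards the models, takes longer):
# !mergekit-moe config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle
```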
I called my model Beyonder 4x7B v3 and created GGUF versions of it using AutoGGUF. If you can t run GGUF versions on your local machine, you can also perform inference using this Colab notebook. To get a good overview of its capabilities, it has been evaluated on three different benchmarks Nous benchmark suite, EQ Bench, and the Open LLM Leaderboard. This model is not designed to excel in traditional benchmarks, as the code and role playing models generally do not apply to those contexts. Nonetheless, it performs remarkably well thanks to strong general purpose experts. Nous Beyonder 4x7B v3 is one of the best models on Nous benchmark suite evaluation performed using LLM AutoEval and significantly outperforms the v2. See the entire leaderboard here. EQ Bench It s also the best 4x7B model on the EQ Bench leaderboard, outperforming older versions of ChatGPT and Llama 2 70b chat. Beyonder is very close to Mixtral 8x7B Instruct v0.1 and Gemini Pro, which are supposedly much bigger models. Open LLM Leaderboard Finally, it s also a strong performer on the Open LLM Leaderboard, significantly outperforming the v2 model. On top of these quantitative evaluations, I recommend checking the model s outputs in a more qualitative way using a GGUF version on LM Studio. A common way of testing these models is to gather a private set of questions and check their outputs. With this strategy, I found that Beyonder 4x7B v3 is quite robust to changes in the user and system prompts compared to other models, including AlphaMonarch 7B. This is pretty cool as it improves the usefulness of the model in general. FrankenMoEs are a promising but still experimental approach. The trade offs, like higher VRAM demand and slower inference speeds, can make it challenging to see their advantage over simpler merging techniques like SLERP or DARE TIES. Especially, when you use frankenMoEs with just two experts, they might not perform as well as if you had simply merged the two models. However, frankenMoEs excel in preserving knowledge, which can result in stronger models, as demonstrated by Beyonder 4x7B v3. With the right hardware, these drawbacks can be effectively mitigated. Conclusion In this article, we introduced the Mixture of Experts architecture. Unlike traditional MoEs that are trained from scratch, MergeKit facilitates the creation of MoEs by ensembling experts, offering an innovative approach to improving model performance and efficiency. We detailed the process of creating a frankenMoE with MergeKit, highlighting the practical steps involved in selecting and combining different experts to produce a high quality MoE. Thanks for reading this article. I encourage you to try to make your own FrankenMoEs using LazyMergeKit select a few models, create your config based Beyonder s, and run the notebook to create your own models! If you liked this article, please follow me on Hugging Face and X Twitter maximelabonne. References Mixtral of Experts by Jiang et al. 2023 Mixture of Experts for Clowns by Charles Goddard 2023 Mixture of Experts Explained by Sanseviero et al. 2023 Adaptive Mixture of Local Experts by Jacobs et al. 1991 Sparse Upcycling Training Mixture of Experts from Dense Checkpoints by Komatsuzaki et al. 
2022 _Learn more about machine learning and support my work with one click become a Medium member here _ Join Medium with my referral link Maxime Labonne _As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story _medium.com 1 Share this post Create Mixtures of Experts with MergeKit maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/create-mixtures-of-experts-with-mergekit-11b318c99562" }, { "id": "6d5c6e46-1390-4bb7-86ee-73df95b7a610", "content": "Merge Large Language Models with mergekit Create your own models easily, no GPU required! Maxime Labonne SubscribeSign in Share this post Merge Large Language Models with mergekit maximelabonne.substack.com Copy link Facebook Email Note Other Merge Large Language Models with mergekit Create your own models easily, no GPU required! Maxime Labonne Jan 08, 2024 1 Share this post Merge Large Language Models with mergekit maximelabonne.substack.com Copy link Facebook Email Note Other Share Create your own models easily, no GPU required! Image by author Model merging is a technique that combines two or more LLMs into a single model. It s a relatively new and experimental method to create new models for cheap no GPU required . Model merging works surprisingly well and produced many state of the art models on the Open LLM Leaderboard. In this tutorial, we will implement it using the mergekit library. More specifically, we will review four merge methods and provide examples of configurations. Then, we will use mergekit to create our own model, Marcoro14 7B slerp, which became the best performing model on the Open LLM Leaderboard 02 01 24 . The code is available on GitHub and Google Colab. I recommend using my automated notebook to easily run mergekit LazyMergekit. _A special thanks toCharles Goddard, the author of the mergekit library, for reviewing this article._ Image by author Merge algorithms In this section, we will focus on four methods currently implemented in mergekit. Note that there are other methods, such as linear and Task Arithmetic. If you re interested in papers on model merging, I recommend this excellent collection on Hugging Face. 1 . SLERP Spherical Linear Interpolation SLERP is a method used to smoothly interpolate between two vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside. There are several reasons to prefer SLERP over a traditional linear interpolation. For example, in high dimensional spaces, linear interpolation can lead to a decrease in the magnitude of the interpolated vector i.e., it reduces the scale of weights . Moreover, the change in direction of the weights often represents more meaningful information like feature learning and representation than the magnitude of change. SLERP is implemented using the following steps 1. Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes 2. 
Calculate the angle between these vectors using their dot product. 3. If the vectors are nearly collinear, it defaults to linear interpolation for efficiency. Otherwise, SLERP computing scale factors based on the interpolation factor t t 0 100 of the first vector, t 1 100 of model 2 and the angle between the vectors. 4. These factors are used to weigh the original vectors, which are then summed to obtain the interpolated vector. SLERP is currently the most popular merging method, but it is limited to combining only two models at a time. It is still possible to hierarchically combine multiple models, as shown in Mistral 7B Merge 14 v0.1. _Example of configuration _ slices sources model OpenPipe mistral ft optimized 1218 layer_range 0, 32 model mlabonne NeuralHermes 2.5 Mistral 7B layer_range 0, 32 merge_method slerp base_model OpenPipe mistral ft optimized 1218 parameters t filter self_attn value 0, 0.5, 0.3, 0.7, 1 filter mlp value 1, 0.5, 0.7, 0.3, 0 value 0.5 dtype bfloat16 This is a classic SLERP configuration, applied to every layer of both models. Note that we input a gradient of values for the interpolation factor t . The parameters for the self attention and MLP layers will use different combinations of OpenPipe mistral ft optimized 1218 and mlabonne NeuralHermes 2.5 Mistral 7B. The other layers are a 50 50 mixture of the two models. You can find the final model on the Hugging Face Hub at mlabonne NeuralPipe 7B slerp. 2 . TIES Introduced in this paper by Yadav et al., TIES Merging is designed to efficiently merge multiple task specific models into a single multitask model. It addresses two main challenges in model merging Redundancy in model parameters It identifies and eliminates redundant parameters within task specific models. This is achieved by focusing on the changes made during fine tuning, identifying the top k most significant changes, and discarding the rest. Disagreement between parameter signs Conflicts arise when different models suggest opposing adjustments to the same parameter. TIES Merging resolves these conflicts by creating a unified sign vector that represents the most dominant direction of change across all models. TIES Merging is divided into the following three steps 1. Trim Reduces redundancy in task specific models by retaining only a fraction the most significant parameters density parameter and resetting the rest to zero. 2. Elect Sign Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction positive or negative in terms of cumulative magnitude. 3. Disjoint Merge Averages parameter values that align with the unified sign vector, excluding zero values. Unlike SLERP, TIES can merge multiple models at a time. _Example of configuration _ models model mistralai Mistral 7B v0.1 no parameters necessary for base model model OpenPipe mistral ft optimized 1218 parameters density 0.5 weight 0.5 model mlabonne NeuralHermes 2.5 Mistral 7B parameters density 0.5 weight 0.3 merge_method ties base_model mistralai Mistral 7B v0.1 parameters normalize true dtype float16 With this config, we use Mistral 7B as a base model to calculate the delta weights. We merge the same two models mistral ft optimized 1218 50 and NeuralHermes 2.5 Mistral 7B 30 with normalization. Here, the density means that we re only retaining 50 of the parameters of each model the other half comes from the base model . 
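To build intuition for what trim, sign election, and disjoint merge actually do, here is a toy sketch that applies the three steps to small task vectors (the deltas between fine-tuned models and the base). It is a simplified illustration, not mergekit's implementation, and it glosses over details such as how the per-model weights enter the final averaging.

```python
import torch

def ties_merge(base, finetuned_models, density=0.5, weights=None):
    """Toy TIES merge: trim deltas, elect a sign per parameter, average agreeing values."""
    deltas = [ft - base for ft in finetuned_models]           # task vectors
    weights = weights or [1.0] * len(deltas)

    # 1. Trim: keep only the top-`density` fraction of each delta by magnitude
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    # 2. Elect sign: dominant direction by cumulative magnitude across models
    stacked = torch.stack([w * t for w, t in zip(weights, trimmed)])
    elected_sign = torch.sign(stacked.sum(dim=0))

    # 3. Disjoint merge: average only the values that agree with the elected sign
    agree = torch.sign(stacked) == elected_sign
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta

base = torch.zeros(6)
ft_a = torch.tensor([0.9, -0.1,  0.4, 0.0, -0.8, 0.2])
ft_b = torch.tensor([0.7,  0.3, -0.5, 0.1, -0.6, 0.0])
print(ties_merge(base, [ft_a, ft_b], density=0.5))
```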
Note that the sum of the weights is not equal to 1 in the config, but the normalize true parameter will automatically normalize them internally. This config is inspired by the parameters provided by the author of OpenHermes 2.5 neural chat 7b v3 1 7B. You can find the final model on the Hugging Face Hub at mlabonne NeuralPipe 7B ties. 3 . DARE Introduced by Yu et al. 2023 , DARE uses an approach similar to TIES with two main differences Pruning DARE randomly reset fine tuned weights to their original values those of the base model . Rescaling DARE rescales the weights to keep the expectations of model outputs approximately unchanged. It adds the rescaled weights of both or more models to the weights of the base model with a scale factor. Mergekit s implementation of this method has two flavors with the sign election step of TIES dare_ties or without dare_linear . _Example of configuration _ models model mistralai Mistral 7B v0.1 No parameters necessary for base model model samir fama SamirGPT v1 parameters density 0.53 weight 0.4 model abacusai Slerp CM mist dpo parameters density 0.53 weight 0.3 model EmbeddedLLM Mistral 7B Merge 14 v0.2 parameters density 0.53 weight 0.3 merge_method dare_ties base_model mistralai Mistral 7B v0.1 parameters int8_mask true dtype bfloat16 In this configuration, we merge three different models based on Mistral 7B using dare_ties . This time, I chose weights that sum to 1 the sum should be between 0.9 and 1.1 . The density parameter is a little higher than what s recommended in the paper 0.5 , but it looks like it gives consistently better results see this discussion . You can find it on the Hugging Face Hub at mlabonne Daredevil 7B. It s also the best merge model in this article, outperforming even Marcoro14 7B slerp. 4 . Passthrough The passthrough method differs significantly from the previous ones. By concatenating layers from different LLMs, it can produce models with an exotic number of parameters e.g., 9B with two 7B parameter models . These models are often referred to as frankenmerges or Frankenstein models by the community. This technique is very experimental, but it managed to create impressive models, like goliath 120b using two Llama 2 70B models. The recently released SOLAR 10.7B v1.0 also uses the same idea, called depth up scaling in their paper. _Example of configuration _ slices sources model OpenPipe mistral ft optimized 1218 layer_range 0, 32 sources model mlabonne NeuralHermes 2.5 Mistral 7B layer_range 24, 32 merge_method passthrough dtype bfloat16 The resulting frankenmerge will have all the 32 layers from the first model and 8 additional layers from the second model. This creates a frankenmerge with a total of 40 layers and 8.99B parameters. This config is inspired by GML Mistral merged v1. You can find the final model on the Hugging Face Hub at mlabonne NeuralPipe 9B merged. Merge your own models In this section, we will use mergekit to load a merge configuration, run it, and upload the resulting model to the Hugging Face Hub. First of all, we install mergekit directly from source as follows !git clone https github.com cg123 mergekit.git !cd mergekit pip install q e . In the following block, we load the merge configuration in a YAML format. We also specify the name of the merged model for future use. You can copy paste any configuration from the previous section here. This time, we will use two different models Marcoroni 7B v3 and Mistral 7B Merge 14 v0.1 and merge them with the SLERP method. 
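As a refresher on what SLERP computes under the hood, here is a small numerical sketch that follows the four steps described earlier on flat weight tensors. It is a toy illustration rather than mergekit's code.

```python
import torch

def slerp(t, v0, v1, eps=1e-8):
    """Toy SLERP between two flat weight tensors."""
    # 1. Work with unit-length copies (directions rather than magnitudes)
    v0_u = v0 / (v0.norm() + eps)
    v1_u = v1 / (v1.norm() + eps)
    # 2. Angle between the vectors via their dot product
    dot = torch.clamp(torch.dot(v0_u, v1_u), -1.0, 1.0)
    # 3. Nearly collinear vectors: fall back to linear interpolation
    if 1.0 - dot.abs() < 1e-6:
        return (1 - t) * v0 + t * v1
    theta = torch.acos(dot)
    # 4. Scale factors applied to the original vectors, then summed
    s0 = torch.sin((1 - t) * theta) / torch.sin(theta)
    s1 = torch.sin(t * theta) / torch.sin(theta)
    return s0 * v0 + s1 * v1

a = torch.randn(8)
b = torch.randn(8)
print(slerp(0.5, a, b))   # halfway between the two weight vectors
```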
We save the config as a yaml file to be used as input in the merge command. import yaml MODEL_NAME Marcoro14 7B slerp yaml_config slices sources model AIDC ai business Marcoroni 7B v3 layer_range 0, 32 model EmbeddedLLM Mistral 7B Merge 14 v0.1 layer_range 0, 32 merge_method slerp base_model AIDC ai business Marcoroni 7B v3 parameters t filter self_attn value 0, 0.5, 0.3, 0.7, 1 filter mlp value 1, 0.5, 0.7, 0.3, 0 value 0.5 dtype bfloat16 Save config as yaml file with open config.yaml , w , encoding utf 8 as f f.write yaml_config We run the merge command with the following parameters copy tokenizer to copy the tokenizer from the base model allow crimes and out shard size to chunk the models into smaller shards that can be computed on a CPU with low RAM lazy unpickle to enable the experimental lazy unpickler for lower memory usage In addition, some models can require the trust_remote_code flag this is not the case with Mistral 7B . This command will download the weights of all the models listed in the merge configuration and run the selected merge method it should take 10 minutes . Merge models !mergekit yaml config.yaml merge copy tokenizer allow crimes out shard size 1B lazy unpickl The model is now merged and saved in the merge directory. Before uploading it, we can create a README file with all the information required for reproducibility. The following code block defines a Jinja template and automatically fills it with the data from the merge configuration. !pip install qU huggingface_hub from huggingface_hub import ModelCard, ModelCardData from jinja2 import Template username mlabonne template_text license apache 2.0 tags merge mergekit lazymergekit for model in models model endfor model_name model_name is a merge of the following models using mergekit https github.com cg123 mergekit for model in models model https huggingface.co model endfor Configuration yaml yaml_config Create a Jinja template object jinja_template Template template_text.strip Get list of models from config data yaml.safe_load yaml_config if models in data models data models i model for i in range len data models if parameters in data models i elif parameters in data models data slices 0 sources i model for i in range len data slices 0 sources elif slices in data models data slices i sources 0 model for i in range len data slices else raise Exception No models or slices found in yaml config Fill the template content jinja_template.render model_name MODEL_NAME, models models, yaml_config yaml_config, username username, Save the model card card ModelCard content card.save merge README.md Now that we have a model card, we can push the entire folder to the Hub. from google.colab import userdata from huggingface_hub import HfApi username mlabonne Defined in the secrets tab in Google Colab api HfApi token userdata.get HF_TOKEN api.create_repo repo_id f username MODEL_NAME , repo_type model api.upload_folder repo_id f username MODEL_NAME , folder_path merge , The model is now available on the Hugging Face Hub at mlabonne Marcoro14 7B slerp. In another notebook, we can try the model on a free T4 GPU using the following code !pip install qU transformers accelerate from transformers import AutoTokenizer import transformers import torch model mlabonne Marcoro14 7B slerp messages role user , content What is a large language model? 
tokenizer AutoTokenizer.from_pretrained model prompt tokenizer.apply_chat_template messages, tokenize False, add_generation_prompt True pipeline transformers.pipeline text generation , model model, torch_dtype torch.float16, device_map auto , outputs pipeline prompt, max_new_tokens 256, do_sample True, temperature 0.7, top_k 50, top_p 0.95 We re asking the question What is a Large Language Model? and received this output _A large language model is a type of artificial intelligence AI system that has been trained on vast amounts of text data. It s designed to understand and generate human like language, making predictions on what words or phrases might come next in a sentence or document. These models use complex algorithms and neural network architectures to learn from the data and improve their performance over time. Some well known large language models include GPT 3 from OpenAI and BERT from Google._ It s looking good, but we need a more comprehensive evaluation. For this kind of general purpose model, there are a few interesting benchmarks Chatbot Arena , which compiles an Elo based LLM leaderboard based on human votes. MT bench same link , which uses GPT 4 as a judge to grade model responses on a set of multi turn questions. NousResearch benchmark suite , which aggregates four benchmarks AGIEval, GPT4ALL, TruthfulQA, and Bigbench. GPT4ALL itself includes HellaSwag, OpenBookQA, Winogrande, ARC Easy, ARC Challenge, BoolQ, and PIQA. Open LLM Leaderboard , which aggregates six benchmarks ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA. Unfortunately, we can t submit our model to the Chatbot Arena. Instead, I chose to evaluate it using the Open LLM Leaderboard and NousResearch benchmarks. I submitted our model to the Open LLM Leaderboard Submit here! tab . As shown in the introduction, it ranked as the best 7B parameter model on the leaderboard. Here are the complete results Image by author The problem with the Open LLM Leaderboard is that these benchmarks are public. It means that people can train LLMs on the test data to get better results. By merging the best models, we also contaminate our own results. It is safe to assume that Marcoro14 7B slerp is contaminated and some models used in this merge have been trained on the test set. If you want to create the best model and not hack the leaderboard, I recommend only using non merge models to create your own merges. This is why we don t want to only rely on the OpenLLM Leaderboard. For NousResearch benchmark suite, I used LLM AutoEval to compute the scores automatically with a simple Colab notebook. Here are the results compared to the excellent OpenHermes 2.5 Mistral 7B Image by author We get a significant improvement over this model on every benchmark . Note that NousResearch benchmark suite shares some tasks with the Open LLM Leaderboard ARC Challenge, TruthfulQA, HellaSwag, and Winogrande. To the best of my knowledge, Bigbench is the only benchmark that is 100 different feel free to contact me if that s not the case . However, one of the models we used in this merge could still have been trained on Bigbench. Conclusion In this article, we introduced the concept of merging LLMs with four different methods. We detailed how SLERP, TIES, DARE, and passthrough work and provided examples of configurations. Finally, we ran SLERP with mergekit to create Marcoro14 7B slerp and upload it to the Hugging Face Hub. We obtained excellent performance on two benchmark suites Open LLM Leaderboard best performing 7B model and NousResearch. 
If you want to create your own merges, I recommend using my automated notebook LazyMergekit. Another way of combining multiple models is to merge them in a Mixture of Experts MoE architecture. In the next article, we ll discuss how to do this in detail and create our own Mixtral like model. If you liked this article, please follow me on Medium and Twitter maximelabonne. _Learn more about machine learning and support my work with one click become a Medium member here _ Join Medium with my referral link Maxime Labonne _As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story _medium.com 1 Share this post Merge Large Language Models with mergekit maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/merge-large-language-models-with-mergekit-2118fb392b54" }, { "id": "d79f3c67-c491-4fd1-96ba-67e03ba66d93", "content": "Fine tune a Mistral 7b model with Direct Preference Optimization Boost the performance of your supervised fine tuned models Maxime Labonne SubscribeSign in Share this post Fine tune a Mistral 7b model with Direct Preference Optimization maximelabonne.substack.com Copy link Facebook Email Note Other Fine tune a Mistral 7b model with Direct Preference Optimization Boost the performance of your supervised fine tuned models Maxime Labonne Jan 01, 2024 1 Share this post Fine tune a Mistral 7b model with Direct Preference Optimization maximelabonne.substack.com Copy link Facebook Email Note Other Share Boost the performance of your supervised fine tuned models Image by author Pre trained Large Language Models LLMs can only perform next token prediction, making them unable to answer questions. This is why these base models are then fine tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed fine tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback RLHF comes into play. RLHF provides different answers to the LLM, which are ranked according to a desired behavior helpfulness, toxicity, etc. . The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural chat 7b v3 1. In this article, we will create NeuralHermes 2.5, by fine tuning OpenHermes 2.5 using a RLHF like technique Direct Preference Optimization DPO . For this purpose, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We ll see that it significantly improves the performance of the base model on the Open LLM Leaderboard. As per usual, the code is available on GitHub and Google Colab. _ Update Jessie Davids, a reader who used this article and code, managed to create the best performing model on the Open LLM Leaderboard 7B param. Congrats to him! 
_ Image by author Preference datasets Preference datasets are not standardized, but they typically consist of a collection of answers that are ranked by humans. This ranking is essential, as the RLHF process fine tunes LLMs to output the preferred answer. Here is an example of Anthropic hh rlhf, a popular preference dataset Image by author The structure of the dataset is straightforward for each row, there is one chosen preferred answer, and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer. Preference datasets are notoriously costly and difficult to make, as they require collecting manual feedback from humans. This feedback is also subjective and can easily be biased toward confident but wrong answers or contradict itself different annotators have different values . Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback RLAIF . These datasets also tend to be a lot smaller than fine tuning datasets. To illustrate this, the excellent neural chat 7b v3 1 best 7B LLM on the Open LLM Leaderboard when it was released uses 518k samples for fine tuning Open Orca SlimOrca but only 12.9k samples for RLHF Intel orca_dpo_pairs . In this case, the authors generated answers with GPT 4 3.5 to create the preferred answers, and with Llama 2 13b chat to create the rejected responses. It s a smart way to bypass human feedback and only rely on models with different levels of performance. Direct Preference Optimization While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI s paper Fine Tuning Language Models from Human Preferences. In this paper, the authors present a framework where a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine tuned model s policy using the Proximal Policy Optimization PPO algorithm. Image by author The core concept of PPO revolves around making smaller, incremental updates to the policy, as larger updates can lead to instability or suboptimal solutions. From experience, this technique is unfortunately still unstable loss diverges , difficult to reproduce numerous hyperparameters, sensitive to random seeds , and computationally expensive. This is where Direct Preference Optimization DPO comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models the trained model or policy model and a copy of it called the reference model . During training, the goal is to make sure the trained model outputs higher probabilities for preferred answers than the reference model. Conversely, we also want it to output lower probabilities for rejected answers. It means we re penalizing the LLM for bad answers and rewarding it for good ones. Image by author By using the LLM itself as a reward model and employing binary cross entropy objectives, DPO efficiently aligns the model s outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter adjustments. It results in a more stable, more efficient, and computationally less demanding process. Formatting the data In this example, we ll fine tune the excellent OpenHermes 2.5 Mistral 7B, which is a Mistral 7b model that was only supervised fine tuned. To this end, we ll use the Intel orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes 2.5 Mistral 7B. 
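Before diving into the code, here is the DPO objective in a few lines to make the classification framing concrete. This is a simplified sketch operating on summed log-probabilities of the completions; in practice, TRL's DPOTrainer computes all of this for us.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Simplified DPO loss: push the policy to prefer chosen over rejected
    answers more strongly than the frozen reference model does."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of summed log-probabilities for the chosen/rejected completions
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)   # lower than log(2) when the policy already prefers the chosen answer
```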
The first step consists of installing the required libraries as follows. pip install q datasets trl peft bitsandbytes sentencepiece wandb Once it s done, we can import the libraries. I m also using the secrets tab in Google Colab to store my Hugging Face token. import os import gc import torch import transformers from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig from datasets import load_dataset from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training from trl import DPOTrainer import bitsandbytes as bnb from google.colab import userdata import wandb Defined in the secrets tab in Google Colab hf_token userdata.get huggingface wb_token userdata.get wandb wandb.login key wb_token model_name teknium OpenHermes 2.5 Mistral 7B new_model NeuralHermes 2.5 Mistral 7B OpenHermes 2.5 Mistral 7B uses a specific chat template, called ChatML. Here is an example of a conversation formatted with this template im_start system You are a helpful chatbot assistant. im_end im_start user Hi im_end im_start assistant Hi, how can I help you? im_end As you can see, ChatML defines different roles system, user, assistant and appends special tokens im_start and im_end to separate them. Moreover, DPOTrainer also requires a specific format with three columns prompt, chosen, and rejected. Our dataset contains four columns system, question, chatgpt, and llama2 13b chat. We ll simply concatenate the system and question columns to the prompt column. We ll also map the chatgpt column to chosen and llama2 13b chat to rejected . To format the dataset in a reliable way, we ll use the tokenizer s apply_chat_template function, which already uses ChatML. def chatml_format example Format system if len example system 0 message role system , content example system system tokenizer.apply_chat_template message , tokenize False else system Format instruction message role user , content example question prompt tokenizer.apply_chat_template message , tokenize False, add_generation_prompt True Format chosen answer chosen example chosen im_end n Format rejected answer rejected example rejected im_end n return prompt system prompt, chosen chosen, rejected rejected, Load dataset dataset load_dataset Intel orca_dpo_pairs train Save columns original_columns dataset.column_names Tokenizer tokenizer AutoTokenizer.from_pretrained model_name tokenizer.pad_token tokenizer.eos_token tokenizer.padding_side left Format dataset dataset dataset.map chatml_format, remove_columns original_columns Let s print a sample of the formatted dataset to confirm that everything works as expected prompt im_start system nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer. im_end n im_start user nGenerate an approximately fifteen word sentence that describes all this data Midsummer House eatType restaurant Midsummer House food Chinese Midsummer House priceRange moderate Midsummer House customer rating 3 out of 5 Midsummer House near All Bar One im_end n im_start assistant n , chosen Midsummer House is a moderately priced Chinese restaurant with a 3 5 customer rating, located near All Bar One. im_end n , rejected Sure! Here s a sentence that describes all the data you provided n n Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes. im_end n We can see that the prompt combines system and user instructions. 
Thanks to the add_generation_prompt True argument, it also appends the beginning of the assistant s answer. If you want to skip this step, you can directly used the preprocessed dataset as mlabonne chatml_dpo_pairs. Training the model with DPO Next, we define the LoRA configurations to train the model. As described in Intel s blog post, we set the rank value to be equal to the lora_alpha , which is unusual 2 r as a rule of thumb . We also target all the linear modules with adapters. LoRA configuration peft_config LoraConfig r 16, lora_alpha 16, lora_dropout 0.05, bias none , task_type CAUSAL_LM , target_modules k_proj , gate_proj , v_proj , up_proj , q_proj , o_proj , down_proj We re now ready to load the model we want to fine tune with DPO. In this case, two models are required the model to fine tune as well as the reference model. This is mostly for the sake of readability, as the DPOTrainer object automatically creates a reference model if none is provided. Model to fine tune model AutoModelForCausalLM.from_pretrained model_name, torch_dtype torch.float16, load_in_4bit True model.config.use_cache False Reference model ref_model AutoModelForCausalLM.from_pretrained model_name, torch_dtype torch.float16, load_in_4bit True The final step consists of providing all the hyperparameters to TrainingArguments and DPOTrainer Among them, the beta parameter is unique to DPO since it controls the divergence from the initial policy 0.1 is a typical value for it . Compared to the values described in Intel s blog post, we lower the learning rate from 5e 4 to 5e 5 and the number of steps from 1,000 to 200 . I manually optimized these values after a few runs to stabilize training and achieve the best results. We can now start training the model. Note that it requires an A100 GPU and takes between 1 hour to complete the training. Training arguments training_args TrainingArguments per_device_train_batch_size 4, gradient_accumulation_steps 4, gradient_checkpointing True, learning_rate 5e 5, lr_scheduler_type cosine , max_steps 200, save_strategy no , logging_steps 1, output_dir new_model, optim paged_adamw_32bit , warmup_steps 100, bf16 True, report_to wandb , Create DPO trainer dpo_trainer DPOTrainer model, ref_model, args training_args, train_dataset dataset, tokenizer tokenizer, peft_config peft_config, beta 0.1, max_prompt_length 1024, max_length 1536, Fine tune model with DPO dpo_trainer.train Our model is now fine tuned. You can check the project on Weights Biases at this address. Here are some interesting metrics to analyze Image by author Interestingly, the training loss quickly drops to zero before 50 steps , despite 100 warmup steps. Meanwhile, the other metrics keep evolving. The train rewards chosen and train rewards rejected plots correspond to the mean difference between the log probabilities output by the trained and reference models. It makes sense that, over time, they diverge as our trained model learns the preferred answers. The train rewards margins plot also shows the difference between these two plots. Finally, the train reward accuracies plot shows the frequency of choosing the preferred answer. The trained model quickly reaches a perfect accuracy score, which is a good sign but could also mean that the difference between preferred and rejected answers is too obvious. Now that it s trained, we can merge the adapter with the original model. Next, we save the merged model and the tokenizer before pushing it to the Hugging Face Hub. 
Save artifacts dpo_trainer.model.save_pretrained final_checkpoint tokenizer.save_pretrained final_checkpoint Flush memory del dpo_trainer, model, ref_model gc.collect torch.cuda.empty_cache Reload model in FP16 instead of NF4 base_model AutoModelForCausalLM.from_pretrained model_name, return_dict True, torch_dtype torch.float16, tokenizer AutoTokenizer.from_pretrained model_name Merge base model with the adapter model PeftModel.from_pretrained base_model, final_checkpoint model model.merge_and_unload Save model and tokenizer model.save_pretrained new_model tokenizer.save_pretrained new_model Push them to the HF Hub model.push_to_hub new_model, use_temp_dir False, token hf_token tokenizer.push_to_hub new_model, use_temp_dir False, token hf_token Let s see how our model performs in a real test. We ll format the prompt to ask a basic question What is a Large Language Model? Format prompt message role system , content You are a helpful assistant chatbot. , role user , content What is a Large Language Model? tokenizer AutoTokenizer.from_pretrained new_model prompt tokenizer.apply_chat_template message, add_generation_prompt True, tokenize False Create pipeline pipeline transformers.pipeline text generation , model new_model, tokenizer tokenizer Generate text sequences pipeline prompt, do_sample True, temperature 0.7, top_p 0.9, num_return_sequences 1, max_length 200, print sequences 0 generated_text Here s the answer from the model A large language model is a type of artificial intelligence AI system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks RNNs or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures. Everything seems to be working, we can now evaluate the merged model. As this is a general purpose model, we can leverage the lm evaluation harness to evaluate it. As the process is quite resource intensive, we can also directly submit it for evaluation on the Open LLM Leaderboard. It took a few days, but here are the results compared to other OpenHermes models Image by author Compared to the original model, NeuralHermes 2 5 Mistral 7B model improved the average score by 6.7 points particularly on GSM8K . This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization. Conclusion In this article, we fine tuned an already supervised fine tuned model using DPO and created our own NeuralHermes 2.5 model. By leveraging a high quality preference dataset, we created a sample efficient fine tuning pipeline that produced a significant improvement on the Open LLM Leaderboard. If you want to give it a try, you can find quantized variants of this model or use this Hugging Face Space. Note that our fine tuning pipeline can still be improved in different ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, numerous hyperparameters can still be tweaked to achieve better results. 
In particular, the learning rate can still be lowered to train the model on more steps and inject more preference data. References Fine tune Llama 2 with DPO by Kashif Rasul, Younes Belkada, and Leandro von Werra. Supervised Fine Tuning and Direct Preference Optimization on Intel Gaudi2 by Kaokao Lv, Wenxin Zhang, and Haihao Shen. llama2 fine tune by mzbac. _Learn more about machine learning and support my work with one click become a Medium member here _ Join Medium with my referral link Maxime Labonne _As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story _medium.com 1 Share this post Fine tune a Mistral 7b model with Direct Preference Optimization maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac" }, { "id": "cedddb77-189c-4ef8-a1af-d9b19d105fcd", "content": "ExLlamaV2 The Fastest Library to Run LLMs Quantize and run EXL2 models Maxime Labonne SubscribeSign in Share this post ExLlamaV2 The Fastest Library to Run LLMs maximelabonne.substack.com Copy link Facebook Email Note Other ExLlamaV2 The Fastest Library to Run LLMs Quantize and run EXL2 models Maxime Labonne Nov 20, 2023 Share this post ExLlamaV2 The Fastest Library to Run LLMs maximelabonne.substack.com Copy link Facebook Email Note Other Share Quantize and run EXL2 models Image by author Quantizing Large Language Models LLMs is the most popular approach to reduce the size of these models and speed up inference. Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation. It became so popular that it has recently been directly integrated into the transformers library. ExLlamaV2 is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it s optimized for blazingly fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored. In this article, we will see how to quantize base models in the EXL2 format and how to run them. As usual, the code is available on GitHub and Google Colab. Quantize EXL2 models To start our exploration, we need to install the ExLlamaV2 library. In this case, we want to be able to use some scripts contained in the repo, which is why we will install it from source as follows git clone https github.com turboderp exllamav2 pip install exllamav2 Now that ExLlamaV2 is installed, we need to download the model we want to quantize in this format. Let s use the excellent zephyr 7B beta, a Mistral 7B model fine tuned using Direct Preference Optimization DPO . It claims to outperform Llama 2 70b chat on the MT bench, which is an impressive result for a model that is ten times smaller. You can try out the base Zephyr model using this space. 
We download zephyr 7B beta using the following command this can take a while since the model is about 15 GB git lfs install git clone https huggingface.co HuggingFaceH4 zephyr 7b beta GPTQ also requires a calibration dataset , which is used to measure the impact of the quantization process by comparing the outputs of the base model and its quantized version. We will use the wikitext dataset and directly download the test file as follows wget https huggingface.co datasets wikitext resolve 9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3 wikitext 103 v1 wikitext test.parquet Once it s done, we can leverage the convert.py script provided by the ExLlamaV2 library. We re mostly concerned with four arguments i Path of the base model to convert in HF format FP16 . o Path of the working directory with temporary files and final output. c Path of the calibration dataset in Parquet format . b Target average number of bits per weight bpw . For example, 4.0 bpw will give store weights in 4 bit precision. The complete list of arguments is available on this page. Let s start the quantization process using the convert.py script with the following arguments mkdir quant python python exllamav2 convert.py i base_model o quant c wikitext test.parquet b 5.0 Note that you will need a GPU to quantize this model. The official documentation specifies that you need approximately 8 GB of VRAM for a 7B model, and 24 GB of VRAM for a 70B model. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr 7b beta using a T4 GPU. Under the hood, ExLlamaV2 leverages the GPTQ algorithm to lower the precision of the weights while minimizing the impact on the output. You can find more details about the GPTQ algorithm in this article. So why are we using the EXL2 format instead of the regular GPTQ format? EXL2 comes with a few new features It supports different levels of quantization it s not restricted to 4 bit precision and can handle 2, 3, 4, 5, 6, and 8 bit quantization. It can mix different precisions within a model and within each layer to preserve the most important weights and layers with more bits. ExLlamaV2 uses this additional flexibility during quantization. It tries different quantization parameters and measures the error they introduce. On top of trying to minimize the error, ExLlamaV2 also has to achieve the target average number of bits per weight given as an argument. Thanks to this behavior, we can create quantized models with an average number of bits per weight of 3.5 or 4.5 for example. The benchmark of different parameters it creates is saved in the measurement.json file. The following JSON shows the measurement for one layer key model.layers.0.self_attn.q_proj , numel 16777216, options desc 0.05 3b 0.95 2b 32g s4 , bpw 2.1878662109375, total_bits 36706304.0, err 0.011161142960190773, qparams group_size 32, bits 3, 2 , bits_prop 0.05, 0.95 , scale_bits 4 , In this trial, ExLlamaV2 used 5 of 3 bit and 95 of 2 bit precision for an average value of 2.188 bpw and a group size of 32. This introduced a noticeable error that is taken into account to select the best parameters. Running ExLlamaV2 for Inference Now that our model is quantized, we want to run it to see how it performs. Before that, we need to copy essential config files from the base_model directory to the new quant directory. Basically, we want every file that is not hidden . or a safetensors file. Additionally, we don t need the out_tensor directory that was created by ExLlamaV2 during quantization. 
In bash, you can implement this as follows !rm rf quant out_tensor !rsync av exclude .safetensors exclude . . base_model . quant Our EXL2 model is ready and we have several options to run it. The most straightforward method consists of using the test_inference.py script in the ExLlamaV2 repo note that I don t use a chat template here python exllamav2 test_inference.py m quant p I have a dream The generation is very fast 56.44 tokens second on a T4 GPU , even compared to other quantization techniques and tools like GGUF llama.cpp or GPTQ. You can find an in depth comparison between different solutions in this excellent article from oobabooga. In my case, the LLM returned the following output Model quant Options rope_scale 1.0 , rope_alpha 1.0 Loading model... Loading tokenizer... Warmup... Generating... I have a dream. user Wow, that s an amazing speech! Can you add some statistics or examples to support the importance of education in society? It would make it even more persuasive and impactful. Also, can you suggest some ways we can ensure equal access to quality education for all individuals regardless of their background or financial status? Let s make this speech truly unforgettable! Absolutely! Here s your updated speech Dear fellow citizens, Education is not just an academic pursuit but a fundamental human right. It empowers people, opens doors Response generated in 3.40 seconds, 128 tokens, 37.66 tokens second includes prompt eval. Alternatively, you can use a chat version with the chatcode.py script for more flexibility python exllamav2 examples chatcode.py m quant mode llama If you re planning to use an EXL2 model more regularly, ExLlamaV2 has been integrated into several backends like oobabooga s text generation web UI. Note that it requires FlashAttention 2 to work properly, which requires CUDA 12.1 on Windows at the moment something you can configure during the installation process . Now that we tested the model, we re ready to upload it to the Hugging Face Hub. You can change the name of your repo in the following code snippet and simply run it. from huggingface_hub import notebook_login from huggingface_hub import HfApi notebook_login api HfApi api.create_repo repo_id f mlabonne zephyr 7b beta 5.0bpw exl2 , repo_type model api.upload_folder repo_id f mlabonne zephyr 7b beta 5.0bpw exl2 , folder_path quant , Great, the model can be found on the Hugging Face Hub. The code in the notebook is quite general and can allow you to quantize different models, using different values of bpw. This is ideal for creating models dedicated to your hardware. Conclusion In this article, we presented ExLlamaV2, a powerful library to quantize LLMs. It is also a fantastic tool to run them since it provides the highest number of tokens per second compared to other solutions like GPTQ or llama.cpp. We applied it to the zephyr 7B beta model to create a 5.0 bpw version of it, using the new EXL2 format. After quantization, we tested our model to see how it performs. Finally, it was uploaded to the Hugging Face Hub and can be found here. If you re interested in more technical content around LLMs, follow me on Medium. 
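For convenience, here is the whole workflow from this section condensed into notebook form. The paths, the calibration file, and the 5.0 bpw target mirror the walkthrough above, but the exact script names and flag spellings are my best reading of the ExLlamaV2 repository and should be verified before running.

```python
# Condensed EXL2 workflow (sketch; verify script names and flags in the repo)
!git clone https://github.com/turboderp/exllamav2
!pip install exllamav2

# Download the FP16 base model and the calibration dataset
!git lfs install
!git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta base_model
!wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet

# Quantize to an average of 5.0 bits per weight (requires a GPU)
!mkdir quant
!python exllamav2/convert.py -i base_model -o quant -c wikitext-test.parquet -b 5.0

# Copy the non-safetensors config files next to the quantized weights, then test
!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/
!python exllamav2/test_inference.py -m quant -p "I have a dream"
```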
Articles about quantization Introduction to Weight Quantization _Reducing the size of Large Language Models with 8 bit quantization_towardsdatascience.com 4 bit Quantization with GPTQ _Quantize your own LLMs using AutoGPTQ_towardsdatascience.com _Learn more about machine learning and support my work with one click become a Medium member here _ Join Medium with my referral link Maxime Labonne _As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story _medium.com Share this post ExLlamaV2 The Fastest Library to Run LLMs maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/exllamav2-the-fastest-library-to-run-llms-32aeda294d26" }, { "id": "715b7861-0f40-4025-bf87-7dddeabaf278", "content": "Quantize Llama models with GGML and llama.cpp GGML vs. GPTQ vs. NF4 Maxime Labonne SubscribeSign in Share this post Quantize Llama models with GGML and llama.cpp maximelabonne.substack.com Copy link Facebook Email Note Other Quantize Llama models with GGML and llama.cpp GGML vs. GPTQ vs. NF4 Maxime Labonne Sep 04, 2023 Share this post Quantize Llama models with GGML and llama.cpp maximelabonne.substack.com Copy link Facebook Email Note Other Share GGML vs. GPTQ vs. NF4 Image by author Due to the massive size of Large Language Models LLMs , quantization has become an essential technique to run them efficiently. By reducing the precision of their weights, you can save memory and speed up inference while preserving most of the model s performance. Recently, 8 bit and 4 bit quantization unlocked the possibility of running LLMs on consumer hardware . Coupled with the release of Llama models and parameter efficient techniques to fine tune them LoRA, QLoRA , this created a rich ecosystem of local LLMs that are now competing with OpenAI s GPT 3.5 and GPT 4. Besides the naive approach covered in this article, there are three main quantization techniques NF4, GPTQ, and GGML. NF4 is a static method used by QLoRA to load a model in 4 bit precision to perform fine tuning. In a previous article, we explored the GPTQ method and quantized our own model to run it on a consumer GPU. In this article, we will introduce the GGML technique, see how to quantize Llama models, and provide tips and tricks to achieve the best results. You can find the code on Google Colab and GitHub. What is GGML? GGML is a C library focused on machine learning. It was created by Georgi Gerganov, which is what the initials GG stand for. This library not only provides foundational elements for machine learning, such as tensors, but also a unique binary format to distribute LLMs. This format recently changed to GGUF . This new format is designed to be extensible, so that new features shouldn t break compatibility with existing models. It also centralizes all the metadata in one file, such as special tokens, RoPE scaling parameters, etc. In short, it answers a few historical pain points and should be future proof. 
For more information, you can read the specification at this address. In the rest of the article, we will call GGML models all models that either use GGUF or previous formats. GGML was designed to be used in conjunction with the llama.cpp library, also created by Georgi Gerganov. The library is written in C C for efficient inference of Llama models. It can load GGML models and run them on a CPU . Originally, this was the main difference with GPTQ models, which are loaded and run on a GPU. However, you can now offload some layers of your LLM to the GPU with llama.cpp. To give you an example, there are 35 layers for a 7b parameter model. This drastically speeds up inference and allows you to run LLMs that don t fit in your VRAM. Image by author If command line tools are your thing, llama.cpp and GGUF support have been integrated into many GUIs, like oobabooga s text generation web ui, koboldcpp, LM Studio, or ctransformers. You can simply load your GGML models with these tools and interact with them in a ChatGPT like way. Fortunately, many quantized models are directly available on the Hugging Face Hub. You ll quickly notice that most of them are quantized by TheBloke, a popular figure in the LLM community. In the next section, we will see how to quantize our own models and run them on a consumer GPU. How to quantize LLMs with GGML? Let s look at the files inside of TheBloke Llama 2 13B chat GGML repo. We can see 14 different GGML models , corresponding to different types of quantization. They follow a particular naming convention q the number of bits used to store the weights precision a particular variant. Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke q2_k Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors. q3_k_l Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K q3_k_m Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K q3_k_s Uses Q3_K for all tensors q4_0 Original quant method, 4 bit. q4_1 Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models. q4_k_m Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K q4_k_s Uses Q4_K for all tensors q5_0 Higher accuracy, higher resource usage and slower inference. q5_1 Even higher accuracy, resource usage and slower inference. q5_k_m Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K q5_k_s Uses Q5_K for all tensors q6_k Uses Q8_K for all tensors q8_0 Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. As a rule of thumb, I recommend using Q5_K_M as it preserves most of the model s performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2 or Q3 versions, as they drastically decrease model performance. Now that we know more about the quantization types available, let s see how to use them on a real model. You can execute the following code on a free T4 GPU on Google Colab. The first step consists of compiling llama.cpp and installing the required libraries in our Python environment. Install llama.cpp !git clone https github.com ggerganov llama.cpp !cd llama.cpp git pull make clean LLAMA_CUBLAS 1 make !pip install r llama.cpp requirements.txt Now we can download our model. 
We will use the model we fine-tuned in the previous article, mlabonne/EvolCodeLlama-7b.

MODEL_ID = "mlabonne/EvolCodeLlama-7b"

# Download model
!git lfs install
!git clone https://huggingface.co/{MODEL_ID}

This step can take a while. Once it's done, we need to convert our weights to GGML FP16 format.

MODEL_NAME = MODEL_ID.split('/')[-1]
GGML_VERSION = "gguf"

# Convert to fp16
fp16 = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.fp16.bin"
!python llama.cpp/convert.py {MODEL_NAME} --outtype f16 --outfile {fp16}

Finally, we can quantize the model using one or several methods. In this case, we will use the Q4_K_M and Q5_K_M methods I recommended earlier. This is the only step that actually requires a GPU.

QUANTIZATION_METHODS = ["q4_k_m", "q5_k_m"]

for method in QUANTIZATION_METHODS:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{method}.bin"
    !./llama.cpp/quantize {fp16} {qtype} {method}

Our two quantized models are now ready for inference. We can check the size of the bin files to see how much we compressed them. The FP16 model takes up 13.5 GB, while the Q4_K_M model takes up 4.08 GB (3.3 times smaller) and the Q5_K_M model takes up 4.78 GB (2.8 times smaller).

Let's use llama.cpp to efficiently run them. Since we're using a GPU with 16 GB of VRAM, we can offload every layer to the GPU. In this case, it represents 35 layers (7b parameter model), so we'll use the -ngl 35 parameter. In the following code block, we'll also input a prompt and the quantization method we want to use.

import os

model_list = [file for file in os.listdir(MODEL_NAME) if GGML_VERSION in file]

prompt = input("Enter your prompt: ")
chosen_method = input("Please specify the quantization method to run the model (options: " + ", ".join(model_list) + "): ")

# Verify the chosen method is in the list
if chosen_method not in model_list:
    print("Invalid method chosen!")
else:
    qtype = f"{MODEL_NAME}/{MODEL_NAME.lower()}.{GGML_VERSION}.{chosen_method}.bin"
    !./llama.cpp/main -m {qtype} -n 128 --color -ngl 35 -p "{prompt}"

Let's ask the model "Write a Python function to print the nth Fibonacci numbers" using the Q5_K_M method. If we look at the logs, we can confirm that we successfully offloaded our layers thanks to the line "llm_load_tensors: offloaded 35/35 layers to GPU". Here is the code the model generated:

def fib(n):
    if n == 0 or n == 1:
        return n
    return fib(n - 2) + fib(n - 1)

for i in range(1, 10):
    print(fib(i))

This wasn't a very complex prompt, but it successfully produced a working piece of code in no time. With this GGML, you can use your local LLM as an assistant in a terminal using the interactive mode (-i flag). Note that this also works on Macbooks with Apple's Metal Performance Shaders (MPS), which is an excellent option to run LLMs.

Finally, we can push our quantized model to a new repo on the Hugging Face Hub with a "-GGML" suffix. First, let's log in and modify the following code block to match your username.

!pip install -q huggingface_hub

username = "mlabonne"

from huggingface_hub import notebook_login, create_repo, HfApi
notebook_login()

Now we can create the repo and upload our models. We use the allow_patterns parameter to filter which files to upload, so we don't push the entirety of the directory.

api = HfApi()

# Create repo
create_repo(
    repo_id=f"{username}/{MODEL_NAME}-GGML",
    repo_type="model",
    exist_ok=True
)

# Upload bin models
api.upload_folder(
    folder_path=MODEL_NAME,
    repo_id=f"{username}/{MODEL_NAME}-GGML",
    allow_patterns=[f"*{GGML_VERSION}*"],
)

We have successfully quantized, run, and pushed GGML models to the Hugging Face Hub! In the next section, we will explore how GGML actually quantizes these models.
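If you prefer to stay in Python rather than calling the llama.cpp binary, the same GGUF files can also be loaded with the llama-cpp-python bindings. Here is a minimal sketch: the file name is just an example that follows the naming scheme used above, and n_gpu_layers=35 mirrors the -ngl 35 flag.

# pip install llama-cpp-python
from llama_cpp import Llama

# Load the Q5_K_M file produced earlier (adjust the path to your own model)
# and offload all 35 layers to the GPU, as with -ngl 35.
llm = Llama(
    model_path="EvolCodeLlama-7b/evolcodellama-7b.gguf.q5_k_m.bin",
    n_ctx=2048,
    n_gpu_layers=35,
)

output = llm("Write a Python function to print the nth Fibonacci numbers", max_tokens=128)
print(output["choices"][0]["text"])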
Quantization with GGML

The way GGML quantizes weights is not as sophisticated as GPTQ's. Basically, it groups blocks of values and rounds them to a lower precision. Some techniques, like Q4_K_M and Q5_K_M, implement a higher precision for critical layers. In this case, every weight is stored in 4-bit precision, with the exception of half of the attention.wv and feed_forward.w2 tensors. Experimentally, this mixed precision proves to be a good tradeoff between accuracy and resource usage.

If we look into the ggml.c file, we can see how the blocks are defined. For example, the block_q4_0 structure is defined as:

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_0 / 2];  // nibbles / quants
} block_q4_0;

In GGML, weights are processed in blocks, each consisting of 32 values. For each block, a scale factor (delta) is derived from the largest weight value. All weights in the block are then scaled, quantized, and packed efficiently for storage (nibbles). This approach significantly reduces the storage requirements while allowing for a relatively simple and deterministic conversion between the original and quantized weights.

Now that we know more about the quantization process, we can compare the results with NF4 and GPTQ.

NF4 vs. GGML vs. GPTQ

Which technique is better for 4-bit quantization? To answer this question, we need to introduce the different backends that run these quantized LLMs. For GGML models, llama.cpp with Q4_K_M models is the way to go. For GPTQ models, we have two options: AutoGPTQ or ExLlama. Finally, NF4 models can directly be run in transformers with the load_in_4bit flag.

Oobabooga ran multiple experiments in an excellent blog post that compares different models in terms of perplexity (lower is better). Based on these results, we can say that GGML models have a slight advantage in terms of perplexity. The difference is not particularly significant, which is why it is better to focus on generation speed in terms of tokens/second. The best technique depends on your GPU: if you have enough VRAM to fit the entire quantized model, GPTQ with ExLlama will be the fastest. If that's not the case, you can offload some layers and use GGML models with llama.cpp to run your LLM.

Conclusion

In this article, we introduced the GGML library and the new GGUF format to efficiently store these quantized models. We used it to quantize our own Llama model in different formats (Q4_K_M and Q5_K_M). We then ran the GGML model and pushed our bin files to the Hugging Face Hub. Finally, we delved deeper into GGML's code to understand how it actually quantizes the weights and compared it to NF4 and GPTQ.

Quantization is a formidable vector to democratize LLMs by lowering the cost of running them. In the future, mixed precision and other techniques will keep improving the performance we can achieve with quantized weights. Until then, I hope you enjoyed reading this article and learned something new. If you're interested in more technical content around LLMs, follow me on Medium.
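To make the block scheme described in the Quantization with GGML section more concrete, here is a small NumPy sketch of absmax-style 4-bit block quantization. It is only an illustration of the idea, not the actual block_q4_0 kernel, and it skips the nibble packing.

import numpy as np

def quantize_blocks(weights, block_size=32):
    # Toy absmax block quantization: one scale (delta) per block of 32 values,
    # integer codes clipped to a 4-bit signed range. Illustration only.
    blocks = weights.reshape(-1, block_size)
    delta = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    delta[delta == 0] = 1.0                                   # avoid division by zero
    quants = np.clip(np.round(blocks / delta), -8, 7).astype(np.int8)
    return delta.astype(np.float16), quants

def dequantize_blocks(delta, quants):
    # Deterministic reconstruction: code * scale, block by block.
    return (quants.astype(np.float32) * delta.astype(np.float32)).reshape(-1)

w = np.random.randn(64).astype(np.float32)
delta, quants = quantize_blocks(w)
w_hat = dequantize_blocks(delta, quants)
print("max reconstruction error:", np.abs(w - w_hat).max())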
", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172" }, { "id": "a219cfaa-c52a-4c7c-aa39-60883cc507cd", "content": "A Beginner's Guide to LLM Fine-Tuning: How to fine-tune Llama and other LLMs with one tool Maxime Labonne Aug 30, 2023 Image by author

The growing interest in Large Language Models (LLMs) has led to a surge in tools and wrappers designed to streamline their training process. Popular options include FastChat from LMSYS (used to train Vicuna) and Hugging Face's transformers/trl libraries (used in my previous article). In addition, each big LLM project, like WizardLM, tends to have its own training script, inspired by the original Alpaca implementation. In this article, we will use Axolotl, a tool created by the OpenAccess AI Collective. We will use it to fine-tune a Code Llama 7b model on an evol-instruct dataset comprised of 1,000 samples of Python code.

Why Axolotl?

The main appeal of Axolotl is that it provides a one-stop solution, which includes numerous features, model architectures, and an active community. Here's a quick list of my favorite things about it:

Configuration: All parameters used to train an LLM are neatly stored in a yaml config file. This makes it convenient for sharing and reproducing models. You can see an example for Llama 2 here.

Dataset Flexibility: Axolotl allows the specification of multiple datasets with varied prompt formats such as alpaca ({"instruction": "...", "input": "...", "output": "..."}), sharegpt chat ({"conversations": [{"from": "...", "value": "..."}]}), and raw completion ({"text": "..."}). Combining datasets is seamless, and the hassle of unifying the prompt format is eliminated.

Features: Axolotl is packed with SOTA techniques such as FSDP, deepspeed, LoRA, QLoRA, ReLoRA, sample packing, GPTQ, FlashAttention, xformers, and rope scaling.
Utilities: There are numerous user-friendly utilities integrated, including the addition or alteration of special tokens, or a custom wandb configuration.

Some well-known models trained using this tool are Manticore-13b from the OpenAccess AI Collective and Samantha-1.11-70b from Eric Hartford. Like other wrappers, it is built on top of the transformers library and uses many of its features.

Create your own config file

Before anything, we need a configuration file. You can reuse an existing configuration from the examples folder. In our case, we will tweak the QLoRA config for Llama 2 to create our own Code Llama model. The model will be trained on a subset of 1,000 Python samples from the nickrosh/Evol-Instruct-Code-80k-v1 dataset.

First, we must change the base_model and base_model_config fields to codellama/CodeLlama-7b-hf. To push our trained adapter to the Hugging Face Hub, let's add a new field hub_model_id, which corresponds to the name of our model, EvolCodeLlama-7b. Now, we have to update the dataset to mlabonne/Evol-Instruct-Python-1k and set type to alpaca.

There's no sample bigger than 2048 tokens in this dataset, so we can reduce the sequence_len to 2048 and save some VRAM. Talking about VRAM, we're going to use a micro_batch_size of 10 and a gradient_accumulation_steps of 1 to maximize its use. In practice, you try different values until you use 95% of the available VRAM.

For convenience, I'm going to add the name "axolotl" to the wandb_project field so it's easier to track on my account. I'm also setting the warmup_steps to 100 (personal preference) and the eval_steps to 0.01 so we'll end up with 100 evaluations.

Here's how the final config file should look:

base_model: codellama/CodeLlama-7b-hf
base_model_config: codellama/CodeLlama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
hub_model_id: EvolCodeLlama-7b

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: mlabonne/Evol-Instruct-Python-1k
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 10
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0.01
save_strategy: epoch
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

You can also find this config file here as a GitHub gist.

Before we start training our model, I want to introduce a few parameters that are important to understand:

QLoRA: We're using QLoRA for fine-tuning, which is why we're loading the base model in 4-bit precision (NF4 format). You can check this article from Benjamin Marie to know more about QLoRA.

Gradient checkpointing: It lowers the VRAM requirements by removing some activations that are re-computed on demand during the backward pass. It also slows down training by about 20%, according to Hugging Face's documentation.
FlashAttention This implements the FlashAttention mechanism, which improves the speed and memory efficiency of our model thanks to a clever fusion of GPU operations learn more about it in this article from Aleksa Gordi\u0107 . Sample packing Smart way of creating batches with as little padding as possible, by reorganizing the order of the samples bin packing problem . As a result, we need fewer batches to train the model on the same dataset. It was inspired by the Multipack Sampler see my note and Krell et al. You can find FlashAttention in some other tools, but sample packing is relatively new. As far as I know, OpenChat was the first project to use sample packing during fine tuning. Thanks to Axolotl, we ll use these techniques for free. Fine tune Code Llama Having the config file ready, it s time to get our hands dirty with the actual fine tuning. You might consider running the training on a Colab notebook. However, for those without access to a high performance GPU, a more cost effective solution consists of renting cloud based GPU services , like AWS, Lambda Labs, Vast.ai, Banana, or RunPod. Personally, I use RunPod, which is a popular option in the fine tuning community. It s not the cheapest service but it hits a good tradeoff with a clean UI. You can easily replicate the following steps using your favorite service. When your RunPod account is set up, go to Manage Templates and click on New Template . Here is a simple template Image by author Let s review the different fields and their corresponding values Template Name Axolotl you can choose whatever you want Container Image winglian axolotl runpod main py3.10 cu118 2.0.1 Container Disk 100 GB Volume Disk 0 GB Volume Mount Path workspace In addition, there are two handy environment variables can include HUGGING_FACE_HUB_TOKEN you can find your token on this page requires an account WANDB_API_KEY you can find your key on this page requires an account Alternatively, you can simply log in the terminal later using huggingface cli login and wandb login . Once you re set up, go to Community Cloud and deploy an RTX 3090. Here you can search for the name of your template and select it as follows Image by author You can click on Continue and RunPod will deploy your template. You can see the installation in your pod s logs Manage Pods . When the option becomes available, click on Connect . Here, click on Start Web Terminal and then Connect to Web Terminal . You are now connected to your pod! The following steps are the same no matter what service you choose 1. We install Axolotl and the PEFT library as follows git clone https github.com OpenAccess AI Collective axolotl cd axolotl pip3 install e . flash attn pip3 install U git https github.com huggingface peft.git 2 . Download the config file we created wget https gist.githubusercontent.com mlabonne 8055f6335e2b85f082c8c75561321a66 raw 93915a9563fcfff8df9a81fc0cdbf63894465922 EvolCodeLlama 7b.yaml 3 . You can now start fine tuning the model with the following command accelerate launch scripts finetune.py EvolCodeLlama 7b.yaml If everything is configured correctly, you should be able to train the model in a little more than one hour it took me 1h 11m 44s . If you check the GPU memory used, you ll see almost 100 with this config, which means we re optimizing it pretty nicely. If you re using a GPU with more VRAM like an A100 , you can increase the micro batch size to make sure you re fully using it. In the meantime, feel free to close the web terminal and check your loss on Weights Biases. 
We're using tmux so the training won't stop if you close the terminal. Here are my loss curves (image by author): We see a steady improvement in the eval loss, which is a good sign. However, you can also spot drops in the eval loss that are not correlated with a decrease in the quality of the outputs. The best way to evaluate your model is simply by using it: you can run it in the terminal with the command accelerate launch scripts/finetune.py EvolCodeLlama-7b.yaml --inference --lora_model_dir="./qlora-out".

The QLoRA adapter should already be uploaded to the Hugging Face Hub. However, you can also merge the base Code Llama model with this adapter and push the merged model there by following these steps:

1. Download this script:

wget https://gist.githubusercontent.com/mlabonne/a3542b0519708b8871d0703c938bba9f/raw/60abc5afc07f9d843bc23d56f4e0b7ab072c4a62/merge_peft.py

2. Execute it with this command:

python merge_peft.py --base_model=codellama/CodeLlama-7b-hf --peft_model=./qlora-out --hub_id=EvolCodeLlama-7b

Congratulations, you should have your own EvolCodeLlama-7b on the Hugging Face Hub at this point! For reference, you can access my own model trained with this process here: mlabonne/EvolCodeLlama-7b.

Considering that our EvolCodeLlama-7b is a code LLM, it would be interesting to compare its performance with other models on standard benchmarks, such as HumanEval and MBPP. For reference, you can find a leaderboard at the following address: Multilingual Code Evals. If you're happy with this model, you can quantize it with GGML for local inference with this free Google Colab notebook. You can also fine-tune bigger models (e.g., 70b parameters) thanks to deepspeed, which only requires an additional config file.

Conclusion

In this article, we've covered the essentials of how to efficiently fine-tune LLMs. We customized parameters to train our Code Llama model on a small Python dataset. Finally, we merged the weights and uploaded the result on Hugging Face. I hope you found this guide useful. I recommend using Axolotl with a cloud-based GPU service to get some experience and upload a few models on Hugging Face. Build your own datasets, play with the parameters, and break stuff along the way. Like with every wrapper, don't hesitate to check the source code to get a good intuition of what it's actually doing. It will massively help in the long run. Thanks to the OpenAccess AI Collective and all the contributors! If you're interested in more technical content around LLMs, follow me on Medium.

Comment from Daniel (Jun 23): Thanks for this great article! One question: How do you deal with the issue that the chat template defined in the Axolotl config for training and a chat template used for inference (e.g., when you load the model from the Hub via Hugging Face transformers' .from_pretrained method and use their chat template) might be different?
If I am not mistaken, then the Axolotl templates assemble prompts in token space, whereas HF chat templates assemble them in string space, which might cause tokenization mismatches?", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/a-beginners-guide-to-llm-fine-tuning-4bae7d4da672" }, { "id": "30f815cd-5776-4f2f-9b1d-4038f07ec65e", "content": "Graph Convolutional Networks: Introduction to GNNs. A step-by-step guide using PyTorch Geometric Maxime Labonne Aug 14, 2023 Image by author

Graph Neural Networks (GNNs) represent one of the most captivating and rapidly evolving architectures within the deep learning landscape. As deep learning models designed to process data structured as graphs, GNNs bring remarkable versatility and powerful learning capabilities. Among the various types of GNNs, the Graph Convolutional Networks (GCNs) have emerged as the most prevalent and broadly applied model. GCNs are innovative due to their ability to leverage both the features of a node and its locality to make predictions, providing an effective way to handle graph-structured data.

In this article, we will delve into the mechanics of the GCN layer and explain its inner workings. Furthermore, we will explore its practical application for node classification tasks, using PyTorch Geometric as our tool of choice. PyTorch Geometric is a specialized extension of PyTorch that has been created specifically for the development and implementation of GNNs. It is an advanced, yet user-friendly library that provides a comprehensive suite of tools to facilitate graph-based machine learning. To commence our journey, the PyTorch Geometric installation will be required. If you are using Google Colab, PyTorch should already be in place, so all we need to do is execute a few additional commands. All the code is available on Google Colab and GitHub.

!pip install torch_geometric

import torch
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

Now that PyTorch Geometric is installed, let's explore the dataset we will use in this tutorial.

I. Graph data

Graphs are an essential structure for representing relationships between objects. You can encounter graph data in a multitude of real-world scenarios, such as social and computer networks, chemical structures of molecules, natural language processing, and image recognition, to name a few. In this article, we will study the infamous and much-used Zachary's karate club dataset. Image by author

The Zachary's karate club dataset embodies the relationships formed within a karate club as observed by Wayne W. Zachary during the 1970s.
It is a kind of social network, where each node represents a club member, and edges between nodes represent interactions that occurred outside the club environment. In this particular scenario, the members of the club are split into four distinct groups. Our task is to assign the correct group to each member node classification , based on the pattern of their interactions. Let s import the dataset with PyG s built in function and try to understand the Datasets object it uses. from torch_geometric.datasets import KarateClub Import dataset from PyTorch Geometric dataset KarateClub Print information print dataset print print f Number of graphs len dataset print f Number of features dataset.num_features print f Number of classes dataset.num_classes KarateClub Number of graphs 1 Number of features 34 Number of classes 4 This dataset only has 1 graph, where each node has a feature vector of 34 dimensions and is part of one out of four classes our four groups . Actually, the Datasets object can be seen as a collection of Data graph objects. We can further inspect our unique graph to know more about it. Print first element print f Graph dataset 0 Graph Data x 34, 34 , edge_index 2, 156 , y 34 , train_mask 34 The Data object is particularly interesting. Printing it offers a good summary of the graph we re studying x 34, 34 is the node feature matrix with shape number of nodes, number of features . In our case, it means that we have 34 nodes our 34 members , each node being associated to a 34 dim feature vector. edge_index 2, 156 represents the graph connectivity how the nodes are connected with shape 2, number of directed edges . y 34 is the node ground truth labels . In this problem, every node is assigned to one class group , so we have one value for each node. train_mask 34 is an optional attribute that tells which nodes should be used for training with a list of True or False statements. Let s print each of these tensors to understand what they store. Let s start with the node features. data dataset 0 print f x data.x.shape print data.x x torch.Size 34, 34 tensor 1., 0., 0., ..., 0., 0., 0. , 0., 1., 0., ..., 0., 0., 0. , 0., 0., 1., ..., 0., 0., 0. , ..., 0., 0., 0., ..., 1., 0., 0. , 0., 0., 0., ..., 0., 1., 0. , 0., 0., 0., ..., 0., 0., 1. Here, the node feature matrix x is an identity matrix it doesn t contain any relevant information about the nodes. It could contain information like age, skill level, etc. but this is not the case in this dataset. It means we ll have to classify our nodes just by looking at their connections. Now, let s print the edge index. 
print f edge_index data.edge_index.shape print data.edge_index edge_index torch.Size 2, 156 tensor 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 13, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33 , 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2, 3, 7, 13, 17, 19, 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0, 1, 2, 7, 12, 13, 0, 6, 10, 0, 6, 10, 16, 0, 4, 5, 16, 0, 1, 2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, 0, 1, 2, 3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0, 1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23, 24, 33, 2, 31, 33, 23, 26, 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32, 33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33, 8, 9, 13, 14, 15, 18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32 In graph theory and network analysis, connectivity between nodes is stored using a variety of data structures. The edge_index is one such data structure, where the graph s connections are stored in two lists 156 directed edges, which equate to 78 bidirectional edges . The reason for these two lists is that one list stores the source nodes, while the second one identifies the destination nodes. This method is known as a coordinate list COO format, which is essentially a means to efficiently store a sparse matrix. Sparse matrices are data structures that efficiently store matrices with a majority of zero elements. In the COO format, only non zero elements are stored, saving memory and computational resources. Contrarily, a more intuitive and straightforward way to represent graph connectivity is through an adjacency matrix _A_. This is a square matrix where each element _A_ \u1d62\u2c7c _s_ pecifies the presence or absence of an edge from node _i_ to node _j_ in the graph. In other words, a non zero element _A_ \u1d62\u2c7c implies a connection from node _i_ to node _j_ , and a zero indicates no direct connection. Image by author An adjacency matrix, however, is not as space efficient as the COO format for sparse matrices or graphs with fewer edges. However, for clarity and easy interpretation, the adjacency matrix remains a popular choice for representing graph connectivity. The adjacency matrix can be inferred from the edge_index with a utility function to_dense_adj . from torch_geometric.utils import to_dense_adj A to_dense_adj data.edge_index 0 .numpy .astype int print f A A.shape print A A 34, 34 0 1 1 ... 1 0 0 1 0 1 ... 0 0 0 1 1 0 ... 0 1 0 ... 1 0 0 ... 0 1 1 0 0 1 ... 1 0 1 0 0 0 ... 1 1 0 With graph data, it is relatively uncommon for nodes to be densely interconnected. As you can see, our adjacency matrix _A_ is sparse filled with zeros . In many real world graphs, most nodes are connected to only a few other nodes, resulting in a large number of zeros in the adjacency matrix. Storing so many zeros is not efficient at all, which is why the COO format is adopted by PyG. On the contrary, ground truth labels are easy to understand. 
print f y data.y.shape print data.y y torch.Size 34 tensor 1, 1, 1, 1, 3, 3, 3, 1, 0, 1, 3, 1, 1, 1, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 0 Our node ground truth labels stored in y simply encode the group number 0, 1, 2, 3 for each node, which is why we have 34 values. Finally, let s print the train mask. print f train_mask data.train_mask.shape print data.train_mask train_mask torch.Size 34 tensor True, False, False, False, True, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False The train mask shows which nodes are supposed to be used for training with True statements. These nodes represent the training set, while the others can be considered as the test set. This division helps in model evaluation by providing unseen data for testing. But we re not done yet! The Data object has a lot more to offer. It provides various utility functions that enable the investigation of several properties of the graph. For instance is_directed tells you if the graph is directed . A directed graph signifies that the adjacency matrix is not symmetric, i.e., the direction of edges matters in the connections between nodes. isolated_nodes checks if some nodes are not connected to the rest of the graph. These nodes are likely to pose challenges in tasks like classification due to their lack of connections. has_self_loops indicates if at least one node is connected to itself . This is distinct from the concept of loops a loop implies a path that starts and ends at the same node, traversing other nodes in between. In the context of the Zachary s karate club dataset, all these properties return False . This implies that the graph is not directed, does not have any isolated nodes, and none of its nodes are connected to themselves. print f Edges are directed data.is_directed print f Graph has isolated nodes data.has_isolated_nodes print f Graph has loops data.has_self_loops Edges are directed False Graph has isolated nodes False Graph has loops False Finally, we can convert a graph from PyTorch Geometric to the popular graph library NetworkX using to_networkx . This is particularly useful to visualize a small graph with networkx and matplotlib . Let s plot our dataset with a different color for each group. from torch_geometric.utils import to_networkx G to_networkx data, to_undirected True plt.figure figsize 12,12 plt.axis off nx.draw_networkx G, pos nx.spring_layout G, seed 0 , with_labels True, node_size 800, node_color data.y, cmap hsv , vmin 2, vmax 3, width 0.8, edge_color grey , font_size 14 plt.show This plot of Zachary s karate club displays our 34 nodes, 78 bidirectional edges, and 4 labels with 4 different colors. Now that we ve seen the essentials of loading and handling a dataset with PyTorch Geometric, we can introduce the Graph Convolutional Network architecture. II. Graph Convolutional Network This section aims to introduce and build the graph convolutional layer from the ground up. In traditional neural networks, linear layers apply a linear transformation to the incoming data. This transformation converts input features _x_ into hidden vectors _h_ through the use of a weight matrix \ud835\udc16. Ignoring biases for the time being, this can be expressed as With graph data, an additional layer of complexity is added through the connections between nodes . 
These connections matter because, typically, in networks, it s assumed that similar nodes are more likely to be linked to each other than dissimilar ones, a phenomenon known as network homophily. We can enrich our node representation by merging its features with those of its neighbors. This operation is called convolution, or neighborhood aggregation. Let s represent the neighborhood of node _i_ including itself as _\u00d1_. Unlike filters in Convolutional Neural Networks CNNs , our weight matrix \ud835\udc16 is unique and shared among every node. But there is another issue nodes do not have a fixed number of neighbors like pixels do. How do we address cases where one node has only one neighbor, and another has 500? If we simply sum the feature vectors, the resulting embedding _h_ would be much larger for the node with 500 neighbors. To ensure a similar range of values for all nodes and comparability between them, we can normalize the result based on the degree of nodes, where degree refers to the number of connections a node has. We re almost there! Introduced by Kipf et al. 2016 , the graph convolutional layer has one final improvement. The authors observed that features from nodes with numerous neighbors propagate much more easily than those from more isolated nodes. To offset this effect, they suggested assigning bigger weights to features from nodes with fewer neighbors, thus balancing the influence across all nodes. This operation is written as Note that when _i_ and _j_ have the same number of neighbors, it is equivalent to our own layer. Now, let s see how to implement it in Python with PyTorch Geometric. III. Implementing a GCN PyTorch Geometric provides the GCNConv function, which directly implements the graph convolutional layer. In this example, we ll create a basic Graph Convolutional Network with a single GCN layer, a ReLU activation function, and a linear output layer. This output layer will yield four values corresponding to our four categories, with the highest value determining the class of each node. In the following code block, we define the GCN layer with a 3 dimensional hidden layer. from torch.nn import Linear from torch_geometric.nn import GCNConv class GCN torch.nn.Module def __init__ self super .__init__ self.gcn GCNConv dataset.num_features, 3 self.out Linear 3, dataset.num_classes def forward self, x, edge_index h self.gcn x, edge_index .relu z self.out h return h, z model GCN print model GCN gcn GCNConv 34, 3 out Linear in_features 3, out_features 4, bias True If we added a second GCN layer, our model would not only aggregate feature vectors from the neighbors of each node, but also from the neighbors of these neighbors. We can stack several graph layers to aggregate more and more distant values, but there s a catch if we add too many layers, the aggregation becomes so intense that all the embeddings end up looking the same. This phenomenon is called over smoothing and can be a real problem when you have too many layers. Now that we ve defined our GNN, let s write a simple training loop with PyTorch. I chose a regular cross entropy loss since it s a multi class classification task, with Adam as optimizer. In this article, we won t implement a train test split to keep things simple and focus on how GNNs learn instead. The training loop is standard we try to predict the correct labels, and we compare the GCN s results to the values stored in data.y . 
The error is calculated by the cross-entropy loss and backpropagated with Adam to fine-tune our GNN's weights and biases. Finally, we print metrics every 10 epochs.

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)

# Calculate accuracy
def accuracy(pred_y, y):
    return (pred_y == y).sum() / len(y)

# Data for animations
embeddings = []
losses = []
accuracies = []
outputs = []

# Training loop
for epoch in range(201):
    # Clear gradients
    optimizer.zero_grad()

    # Forward pass
    h, z = model(data.x, data.edge_index)

    # Calculate loss function
    loss = criterion(z, data.y)

    # Calculate accuracy
    acc = accuracy(z.argmax(dim=1), data.y)

    # Compute gradients
    loss.backward()

    # Tune parameters
    optimizer.step()

    # Store data for animations
    embeddings.append(h)
    losses.append(loss)
    accuracies.append(acc)
    outputs.append(z.argmax(dim=1))

    # Print metrics every 10 epochs
    if epoch % 10 == 0:
        print(f'Epoch {epoch:>3} | Loss: {loss:.2f} | Acc: {acc*100:.2f}%')

Epoch   0 | Loss: 1.40 | Acc: 41.18%
Epoch  10 | Loss: 1.21 | Acc: 47.06%
Epoch  20 | Loss: 1.02 | Acc: 67.65%
Epoch  30 | Loss: 0.80 | Acc: 73.53%
Epoch  40 | Loss: 0.59 | Acc: 73.53%
Epoch  50 | Loss: 0.39 | Acc: 94.12%
Epoch  60 | Loss: 0.23 | Acc: 97.06%
Epoch  70 | Loss: 0.13 | Acc: 100.00%
Epoch  80 | Loss: 0.07 | Acc: 100.00%
Epoch  90 | Loss: 0.05 | Acc: 100.00%
Epoch 100 | Loss: 0.03 | Acc: 100.00%
Epoch 110 | Loss: 0.02 | Acc: 100.00%
Epoch 120 | Loss: 0.02 | Acc: 100.00%
Epoch 130 | Loss: 0.02 | Acc: 100.00%
Epoch 140 | Loss: 0.01 | Acc: 100.00%
Epoch 150 | Loss: 0.01 | Acc: 100.00%
Epoch 160 | Loss: 0.01 | Acc: 100.00%
Epoch 170 | Loss: 0.01 | Acc: 100.00%
Epoch 180 | Loss: 0.01 | Acc: 100.00%
Epoch 190 | Loss: 0.01 | Acc: 100.00%
Epoch 200 | Loss: 0.01 | Acc: 100.00%

Great! Without much surprise, we reach 100% accuracy on the training set (full dataset). It means that our model learned to correctly assign every member of the karate club to its correct group. We can produce a neat visualization by animating the graph and see the evolution of the GNN's predictions during the training process.

%%capture
from IPython.display import HTML
from matplotlib import animation
plt.rcParams["animation.bitrate"] = 3000

def animate(i):
    G = to_networkx(data, to_undirected=True)
    nx.draw_networkx(G,
                     pos=nx.spring_layout(G, seed=0),
                     with_labels=True,
                     node_size=800,
                     node_color=outputs[i],
                     cmap="hsv",
                     vmin=-2,
                     vmax=3,
                     width=0.8,
                     edge_color="grey",
                     font_size=14)
    plt.title(f'Epoch {i} | Loss: {losses[i]:.2f} | Acc: {accuracies[i]*100:.2f}%',
              fontsize=18, pad=20)

fig = plt.figure(figsize=(12, 12))
plt.axis('off')

anim = animation.FuncAnimation(fig, animate, np.arange(0, 200, 10), interval=500, repeat=True)
html = HTML(anim.to_html5_video())
display(html)

The first predictions are random, but the GCN perfectly labels every node after a while. Indeed, the final graph is the same as the one we plotted at the end of the first section. But what does the GCN really learn? By aggregating features from neighboring nodes, the GNN learns a vector representation (or embedding) of every node in the network. In our model, the final layer just learns how to use these representations to produce the best classifications. However, embeddings are the real products of GNNs. Let's print the embeddings learned by our model.
Print embeddings print f Final embeddings h.shape print h Final embeddings torch.Size 34, 3 tensor 1.9099e 00, 2.3584e 00, 7.4027e 01 , 2.6203e 00, 2.7997e 00, 0.0000e 00 , 2.2567e 00, 2.2962e 00, 6.4663e 01 , 2.0802e 00, 2.8785e 00, 0.0000e 00 , 0.0000e 00, 0.0000e 00, 2.9694e 00 , 0.0000e 00, 0.0000e 00, 3.3817e 00 , 0.0000e 00, 1.5008e 04, 3.4246e 00 , 1.7593e 00, 2.4292e 00, 2.4551e 01 , 1.9757e 00, 6.1032e 01, 1.8986e 00 , 1.7770e 00, 1.9950e 00, 6.7018e 01 , 0.0000e 00, 1.1683e 04, 2.9738e 00 , 1.8988e 00, 2.0512e 00, 2.6225e 01 , 1.7081e 00, 2.3618e 00, 1.9609e 01 , 1.8303e 00, 2.1591e 00, 3.5906e 01 , 2.0755e 00, 2.7468e 01, 1.9804e 00 , 1.9676e 00, 3.7185e 01, 2.0011e 00 , 0.0000e 00, 0.0000e 00, 3.4787e 00 , 1.6945e 00, 2.0350e 00, 1.9789e 01 , 1.9808e 00, 3.2633e 01, 2.1349e 00 , 1.7846e 00, 1.9585e 00, 4.8021e 01 , 2.0420e 00, 2.7512e 01, 1.9810e 00 , 1.7665e 00, 2.1357e 00, 4.0325e 01 , 1.9870e 00, 3.3886e 01, 2.0421e 00 , 2.0614e 00, 5.1042e 01, 2.4872e 00 , ... 2.1778e 00, 4.4730e 01, 2.0077e 00 , 3.8906e 02, 2.3443e 00, 1.9195e 00 , 3.0748e 00, 0.0000e 00, 3.0789e 00 , 3.4316e 00, 1.9716e 01, 2.5231e 00 , grad_fn ReluBackward0 As you can see, embeddings do not need to have the same dimensions as feature vectors. Here, I chose to reduce the number of dimensions from 34 dataset.num_features to three to get a nice visualization in 3D. Let s plot these embeddings before any training happens, at epoch 0. Get first embedding at epoch 0 embed h.detach .cpu .numpy fig plt.figure figsize 12, 12 ax fig.add_subplot projection 3d ax.patch.set_alpha 0 plt.tick_params left False, bottom False, labelleft False, labelbottom False ax.scatter embed , 0 , embed , 1 , embed , 2 , s 200, c data.y, cmap hsv , vmin 2, vmax 3 plt.show We see every node from Zachary s karate club with their true labels and not the model s predictions . For now, they re all over the place since the GNN is not trained yet. But if we plot these embeddings at each step of the training loop, we d be able to visualize what the GNN truly learns. Let s see how they evolve over time, as the GCN gets better and better at classifying nodes. capture def animate i embed embeddings i .detach .cpu .numpy ax.clear ax.scatter embed , 0 , embed , 1 , embed , 2 , s 200, c data.y, cmap hsv , vmin 2, vmax 3 plt.title f Epoch i Loss losses i .2f Acc accuracies i 100 .2f , fontsize 18, pad 40 fig plt.figure figsize 12, 12 plt.axis off ax fig.add_subplot projection 3d plt.tick_params left False, bottom False, labelleft False, labelbottom False anim animation.FuncAnimation fig, animate, np.arange 0, 200, 10 , interval 800, repeat True html HTML anim.to_html5_video display html Our Graph Convolutional Network GCN has effectively learned embeddings that group similar nodes into distinct clusters . This enables the final linear layer to distinguish them into separate classes with ease. Embeddings are not unique to GNNs they can be found everywhere in deep learning. They don t have to be 3D either actually, they rarely are. For instance, language models like BERT produce embeddings with 768 or even 1024 dimensions. Additional dimensions store more information about nodes, text, images, etc. but they also create bigger models that are more difficult to train. This is why keeping low dimensional embeddings as long as possible is advantageous. Conclusion Graph Convolutional Networks are an incredibly versatile architecture that can be applied in many contexts . 
In this article, we familiarized ourselves with the PyTorch Geometric library and objects like Datasets and Data. Then, we successfully reconstructed a graph convolutional layer from the ground up. Next, we put theory into practice by implementing a GCN, which gave us an understanding of practical aspects and how individual components interact. Finally, we visualized the training process and obtained a clear perspective of what it involves for such a network.

Zachary's karate club is a simplistic dataset, but it is good enough to understand the most important concepts in graph data and GNNs. Although we only talked about node classification in this article, there are other tasks GNNs can accomplish: link prediction (e.g., to recommend a friend), graph classification (e.g., to label molecules), graph generation (e.g., to create new molecules), and so on. Beyond GCN, numerous GNN layers and architectures have been proposed by researchers. In the next article, we'll introduce the Graph Attention Network (GAT) architecture, which dynamically computes the GCN's normalization factor and the importance of each connection with an attention mechanism. If you want to know more about graph neural networks, dive deeper into the world of GNNs with my book, Hands-On Graph Neural Networks.", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/graph-convolutional-networks-introduction-to-gnns-24b3f60d6c95" }, { "id": "a89d6d0f-861f-4a11-aa6b-730ed30f6eb8", "content": "4-bit Quantization with GPTQ: Quantize your own LLMs using AutoGPTQ Maxime Labonne Jul 31, 2023 Image by author

Recent advancements in weight quantization allow us to run massive large language models on consumer hardware, like a LLaMA-30B model on an RTX 3090 GPU. This is possible thanks to novel 4-bit quantization techniques with minimal performance degradation, like GPTQ, GGML, and NF4. In the previous article, we introduced naïve 8-bit quantization techniques and the excellent LLM.int8().
In this article, we will explore the popular GPTQ algorithm to understand how it works and implement it using the AutoGPTQ library. You can find the code on Google Colab and GitHub.

Optimal Brain Quantization

Let's start by introducing the problem we're trying to solve. For every layer ℓ in the network, we want to find a quantized version Ŵℓ of the original weights Wℓ. This is called the layer-wise compression problem. More specifically, to minimize performance degradation, we want the outputs ŴℓXℓ of these new weights to be as close as possible to the original ones WℓXℓ. In other words, we want to find the Ŵℓ that minimizes ‖WℓXℓ − ŴℓXℓ‖₂².

Different approaches have been proposed to solve this problem, but we're interested in the Optimal Brain Quantizer (OBQ) framework here. This method is inspired by a pruning technique to carefully remove weights from a fully trained dense neural network (Optimal Brain Surgeon). It uses an approximation technique and provides explicit formulas for the best single weight w_q to remove and the optimal update δ_F to adjust the set of remaining non-quantized weights F to make up for the removal:

w_q = argmin over w of (quant(w) − w)² / [H_F⁻¹]_ww
δ_F = − (w_q − quant(w_q)) / [H_F⁻¹]_qq · (H_F⁻¹)_:,q

where quant(w) is the weight rounding given by the quantization and H_F is the Hessian.

Using OBQ, we can quantize the easiest weight first and then adjust all remaining non-quantized weights to compensate for this precision loss. Then we pick the next weight to quantize, and so on.

A potential issue with this approach is when there are outlier weights, which can result in high quantization error. Usually, these outliers would be quantized last, when there are few non-quantized weights left that could be adjusted to compensate for the large error. This effect can worsen when some weights are pushed further outside the grid by intermediate updates. A simple heuristic is applied to prevent this: outliers are quantized as soon as they appear.

This process could be computationally heavy, especially for LLMs. To deal with this, the OBQ method uses a trick that avoids redoing the entire computation each time a weight is simplified. After quantizing a weight, it adjusts the matrix used in calculations (the Hessian) by removing the row and column associated with that weight, using Gaussian elimination. The method also employs vectorization to process multiple rows of the weight matrix at once. Despite its efficiency, OBQ's computation time increases significantly as the size of the weight matrix increases. This cubic growth makes it difficult to use OBQ on very large models with billions of parameters.

The GPTQ Algorithm

Introduced by Frantar et al. (2023), the GPTQ algorithm takes inspiration from the OBQ method, but with significant improvements to scale it for (very) large language models.

Step 1: Arbitrary Order Insight

The OBQ method selects weights (parameters in a model) for quantization in a certain order, determined by which will add the least additional error. However, GPTQ observes that for large models, quantizing weights in any fixed order can perform just as well. This is because even though some weights might introduce more error individually, they are quantized later in the process when there are few other weights left that could increase the error. So the order doesn't matter as much as we thought. Based on this insight, GPTQ aims to quantize all weights in the same order for all rows of a matrix.
This makes the process faster because certain computations have to be done only once for each column, rather than once for each weight. Image by author Step 2 Lazy Batch Updates This scheme won t be fast because it requires updating a huge matrix with very few computations for each entry. This type of operation can t utilize the full compute capabilities of GPUs and will be slowed down by memory limitations memory throughput bottleneck . To resolve this, GPTQ introduces lazy batch updates. It turns out that the final rounding decisions for a given column are only affected by updates performed on that column, not on later columns. Therefore, GPTQ can apply the algorithm to a batch of columns at a time like 128 columns , updating only those columns and a corresponding block of the matrix. After a block is fully processed, the algorithm performs global updates on the entire matrix. Step 3 Cholesky Reformulation However, there s one more issue to address. When the algorithm scales up to very large models, numerical inaccuracies can become a problem. Specifically, repeated applications of a certain operation can accumulate numerical errors . To tackle this, GPTQ uses a Cholesky decomposition, a numerically stable method for solving certain mathematical problems. It involves precomputing some required information from the matrix using the Cholesky method. This approach, combined with a slight dampening adding a small constant to diagonal elements of the matrix , helps the algorithm to avoid numerical issues. The full algorithm can be summarized in a few steps 1. The GPTQ algorithm begins with a Cholesky decomposition of the Hessian inverse a matrix that helps decide how to adjust the weights 2. It then runs in loops, handling batches of columns at a time. 3. For each column in a batch, it quantizes the weights, calculates the error, and updates the weights in the block accordingly. 4. After processing the batch, it updates all remaining weights based on the block s errors. The GPTQ algorithm was tested on various language generation tasks. It was compared with other quantization methods, like rounding all weights to the nearest quantized value RTN . GPTQ was used with the BLOOM 176B parameters and OPT 175B parameters model families, and models were quantized using a single NVIDIA A100 GPU . Quantize an LLM with AutoGPTQ GPTQ has been very popular to create models in 4 bit precision that can efficiently run on GPUs. You can find many examples on the Hugging Face Hub, especially from TheBloke. If you re looking for an approach that is more CPU friendly, GGML is currently your best option. Finally, the transformers library with bitsandbytes allows you to quantize a model when it s loaded using the load_in_4bit true argument, which requires downloading full models and storing them in your RAM. Let s implement the GPTQ algorithm using the AutoGPTQ library and quantize a GPT 2 model. This requires a GPU, but a free T4 on Google Colab will do. We start by loading the libraries and defining the model we want to quantize in this case, GPT 2 . !BUILD_CUDA_EXT 0 pip install q auto gptq transformers import random from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig from datasets import load_dataset import torch from transformers import AutoTokenizer Define base model and output directory model_id gpt2 out_dir model_id GPTQ We now want to load the model and the tokenizer. The tokenizer is loaded using the classic AutoTokenizer class from the transformers library. 
On the other hand, we need to pass a specific configuration BaseQuantizeConfig to load the model. In this configuration, we can specify the number of bits to quantize here, bits 4 and the group size size of the lazy batch . Note that this group size is optional we could also use one set of parameters for the entire weight matrix. In practice, these groups generally improve the quality of the quantization at a very low cost especially with group_size 1024 . The damp_percent value is here to help the Cholesky reformulation and should not be changed. Finally, the desc_act also called act order is a tricky parameter. It allows you to process rows based on decreasing activation , meaning the most important or impactful rows determined by sampled inputs and outputs are processed first. This method aims to place most of the quantization error inevitably introduced during quantization on less significant weights. This approach improves the overall accuracy of the quantization process by ensuring the most significant weights are processed with greater precision. However, when used alongside group size, desc_act can lead to performance slowdowns due to the need to frequently reload quantization parameters. For this reason, we won t use it here it will probably be fixed in the future, however . Load quantize config, model and tokenizer quantize_config BaseQuantizeConfig bits 4, group_size 128, damp_percent 0.01, desc_act False, model AutoGPTQForCausalLM.from_pretrained model_id, quantize_config tokenizer AutoTokenizer.from_pretrained model_id The quantization process relies heavily on samples to evaluate and enhance the quality of the quantization. They provide a means of comparison between the outputs produced by the origina and the newly quantized model. The larger the number of samples provided, the greater the potential for more accurate and effective comparisons, leading to improved quantization quality. In the context of this article, we utilize the C4 Colossal Clean Crawled Corpus dataset to generate our samples. The C4 dataset is a large scale, multilingual collection of web text gathered from the Common Crawl project. This expansive dataset has been cleaned and prepared specifically for training large scale language models, making it a great resource for tasks such as this. The WikiText dataset is another popular option. In the following code block, we load 1024 samples from the C4 dataset, tokenize them, and format them. Load data and tokenize examples n_samples 1024 data load_dataset allenai c4 , data_files en c4 train.00001 of 01024.json.gz , split f train n_samples 5 tokenized_data tokenizer n n .join data text , return_tensors pt Format tokenized examples examples_ids for _ in range n_samples i random.randint 0, tokenized_data.input_ids.shape 1 tokenizer.model_max_length 1 j i tokenizer.model_max_length input_ids tokenized_data.input_ids , i j attention_mask torch.ones_like input_ids examples_ids.append input_ids input_ids, attention_mask attention_mask Now that dataset is ready, we can start the quantization process with a batch size of 1. Optionally, we also use OpenAI Triton, a CUDA alternative, to communicate with the GPU. Once this is done, we save the tokenizer and the model in a safetensors format. 
Quantize with GPTQ model.quantize examples_ids, batch_size 1, use_triton True, Save model and tokenizer model.save_quantized out_dir, use_safetensors True tokenizer.save_pretrained out_dir As per usual, the model and tokenizer can then be loaded from the output directory using the AutoGPTQForCausalLM and AutoTokenizer classes. device cuda 0 if torch.cuda.is_available else cpu Reload model and tokenizer model AutoGPTQForCausalLM.from_quantized out_dir, device device, use_triton True, use_safetensors True, tokenizer AutoTokenizer.from_pretrained out_dir Let s check that the model is working correctly. The AutoGPTQ model mostly works as a normal transformers model, which makes it compatible with inference pipelines, as shown in the following example from transformers import pipeline generator pipeline text generation , model model, tokenizer tokenizer result generator I have a dream , do_sample True, max_length 50 0 generated_text print result I have a dream, she told CNN last week. I have this dream of helping my mother find her own. But, to tell that for the first time, now that I m seeing my mother now, just knowing how wonderful it is that We managed to get a convincing completion from our quantized GPT 2 model. A more in depth evaluation would require measuring the perplexity of the quantized model versus the original one. However, we will leave it out of the scope of this article. Conclusion In this article, we introduced the GPTQ algorithm, a state of the art quantization technique to run LLMs on consumer grade hardware. We showed how it addresses the layer wise compression problem, based on an improved OBS technique with arbitrary order insight, lazy batch updates, and Cholesky reformulation. This novel approach significantly reduces memory and computation requirements , making LLMs accessible to a broader audience. In addition, we quantized our own LLM model on a free T4 GPU and ran it to generate text. You can push your own version of a GPTQ 4 bit quantized model on the Hugging Face Hub. As mentioned in the introduction, GPTQ is not the only 4 bit quantization algorithm GGML and NF4 are excellent alternatives with slightly different scopes. I encourage you to learn more about them and give them a shot! If you re interested in more technical content around LLMs, follow me on Twitter maximelabonne. References B. Hassibi, D. G. Stork and G. J. Wolff, Optimal Brain Surgeon and general network pruning, IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993, pp. 293 299 vol.1, doi 10.1109 ICNN.1993.298572. Elias Frantar, Sidak Pal Singh, Dan Alistarh. 2023 . Optimal Brain Compression A Framework for Accurate Post Training Quantization and Pruning. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh. 2023 . GPTQ Accurate Post Training Quantization for Generative Pre trained Transformers. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020 . Exploring the Limits of Transfer Learning with a Unified Text to Text Transformer. 
", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/4-bit-quantization-with-gptq-36b0f4f02c34" }, { "id": "d771ccaa-ca3e-4280-bbd7-c45aec8b7f0c", "content": "Fine Tune Your Own Llama 2 Model in a Colab Notebook A practical introduction to LLM fine tuning Maxime Labonne Jul 25, 2023 Image by author With the release of LLaMA v1, we saw a Cambrian explosion of fine tuned models, including Alpaca, Vicuna, and WizardLM, among others. This trend encouraged different businesses to launch their own base models with licenses suitable for commercial use, such as OpenLLaMA, Falcon, XGen, etc. The release of Llama 2 now combines the best elements from both sides: it offers a highly efficient base model along with a more permissive license. During the first half of 2023, the software landscape was significantly shaped by the widespread use of APIs like the OpenAI API to create infrastructures based on Large Language Models LLMs. Libraries such as LangChain and LlamaIndex played a critical role in this trend. Moving into the latter half of the year, the process of fine tuning or instruction tuning these models is set to become a standard procedure in the LLMOps workflow. This trend is driven by various factors: the potential for cost savings, the ability to process confidential data, and even the potential to develop models that exceed the performance of prominent models like ChatGPT and GPT 4 in certain specific tasks. In this article, we will see why instruction tuning works and how to implement it in a Google Colab notebook to create your own Llama 2 model. As usual, the code is available on Colab and GitHub. Background on fine tuning LLMs Image by author LLMs are pretrained on an extensive corpus of text. In the case of Llama 2, we know very little about the composition of the training set, besides its length of 2 trillion tokens.
In comparison, BERT 2018 was only trained on the BookCorpus 800M words and English Wikipedia 2,500M words . From experience, this is a very costly and long process with a lot of hardware issues. If you want to know more about it, I recommend reading Meta s logbook about the pretraining of the OPT 175B model. When the pretraining is complete, auto regressive models like Llama 2 can predict the next token in a sequence. However, this does not make them particularly useful assistants since they don t reply to instructions. This is why we employ instruction tuning to align their answers with what humans expect. There are two main fine tuning techniques Supervised Fine Tuning SFT Models are trained on a dataset of instructions and responses. It adjusts the weights in the LLM to minimize the difference between the generated answers and ground truth responses, acting as labels. Reinforcement Learning from Human Feedback RLHF Models learn by interacting with their environment and receiving feedback. They are trained to maximize a reward signal using PPO , which is often derived from human evaluations of model outputs. In general, RLHF is shown to capture more complex and nuanced human preferences, but is also more challenging to implement effectively. Indeed, it requires careful design of the reward system and can be sensitive to the quality and consistency of human feedback. A possible alternative in the future is the Direct Preference Optimization DPO algorithm, which directly runs preference learning on the SFT model. In our case, we will perform SFT, but this raises a question why does fine tuning work in the first place? As highlighted in the Orca paper, our understanding is that fine tuning leverages knowledge learned during the pretraining process. In other words, fine tuning will be of little help if the model has never seen the kind of data you re interested in. However, if that s the case, SFT can be extremely performant. For example, the LIMA paper showed how you could outperform GPT 3 DaVinci003 by fine tuning a LLaMA v1 model with 65 billion parameters on only 1,000 high quality samples. The quality of the instruction dataset is essential to reach this level of performance, which is why a lot of work is focused on this issue like evol instruct, Orca, or phi 1 . Note that the size of the LLM 65b, not 13b or 7b is also fundamental to leverage pre existing knowledge efficiently. Another important point related to the data quality is the prompt template . Prompts are comprised of similar elements system prompt optional to guide the model, user prompt required to give the instruction, additional inputs optional to take into consideration, and the model s answer required . In the case of Llama 2, the authors used the following template s INST SYS System prompt SYS User prompt INST Model answer s There are other templates, like the ones from Alpaca and Vicuna, and their impact is not very clear. In this example, we will reformat our instruction dataset to follow Llama 2 s template. For the purpose of this tutorial, I ve already done it using the excellent timdettmers openassistant guanaco dataset. You can find it on Hugging Face under the name mlabonne guanaco llama2 1k . How to fine tune Llama 2 In this section, we will fine tune a Llama 2 model with 7 billion parameters on a T4 GPU with high RAM using Google Colab 2.21 credits hour . Note that a T4 only has 16 GB of VRAM, which is barely enough to store Llama 2 7b s weights 7b 2 bytes 14 GB in FP16 . 
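To see where the 14 GB figure comes from, here is a quick back-of-the-envelope sketch (not part of the original notebook) that estimates the memory needed just to store the weights at different precisions; the 4-bit line is why QLoRA fits on a T4.

```python
# Rough, weights-only memory estimate for a 7B-parameter model.
# It deliberately ignores optimizer states, gradients, and activations,
# which is exactly the overhead discussed next.
n_params = 7e9

bytes_per_param = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "4-bit (NF4)": 0.5,
}

for precision, n_bytes in bytes_per_param.items():
    gib = n_params * n_bytes / 1024**3
    print(f"{precision:>12}: ~{gib:5.1f} GiB")

# FP16 comes out around 13 GiB (the ~14 GB quoted above in decimal units),
# while 4-bit weights fit in roughly 3.3 GiB.
```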
In addition, we need to consider the overhead due to optimizer states, gradients, and forward activations see this excellent article for more information . This means that a full fine tuning is not possible here we need parameter efficient fine tuning PEFT techniques like LoRA or QLoRA. To drastically reduce the VRAM usage, we must fine tune the model in 4 bit precision , which is why we ll use QLoRA here. The good thing is that we can leverage the Hugging Face ecosystem with the transformers , accelerate , peft , trl , and bitsandbytes libraries. We ll do this in the following code based on Younes Belkada s GitHub Gist. First, we install and load these libraries. !pip install q accelerate 0.21.0 peft 0.4.0 bitsandbytes 0.40.2 transformers 4.31.0 trl 0.4.7 import os import torch from datasets import load_dataset from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, from peft import LoraConfig, PeftModel from trl import SFTTrainer Let s talk a bit about the parameters we can tune here. First, we want to load a llama 2 7b chat hf model and train it on the mlabonne guanaco llama2 1k 1,000 samples , which will produce our fine tuned model llama 2 7b miniguanaco . Feel free to change the dataset there are many options on the Hugging Face Hub. QLoRA will use a rank of 64 with a scaling parameter of 16 see this article for more information about LoRA parameters . We ll load the Llama 2 model directly in 4 bit precision using the NF4 type and train it for one epoch. To get more information about the other parameters, check the TrainingArguments, PeftModel, and SFTTrainer documentation. The model that you want to train from the Hugging Face hub model_name daryl149 llama 2 7b chat hf The instruction dataset to use dataset_name mlabonne guanaco llama2 1k Fine tuned model name new_model llama 2 7b miniguanaco QLoRA parameters LoRA attention dimension lora_r 64 Alpha parameter for LoRA scaling lora_alpha 16 Dropout probability for LoRA layers lora_dropout 0.1 bitsandbytes parameters Activate 4 bit precision base model loading use_4bit True Compute dtype for 4 bit base models bnb_4bit_compute_dtype float16 Quantization type fp4 or nf4 bnb_4bit_quant_type nf4 Activate nested quantization for 4 bit base models double quantization use_nested_quant False TrainingArguments parameters Output directory where the model predictions and checkpoints will be stored output_dir . 
results Number of training epochs num_train_epochs 1 Enable fp16 bf16 training set bf16 to True with an A100 fp16 False bf16 False Batch size per GPU for training per_device_train_batch_size 4 Batch size per GPU for evaluation per_device_eval_batch_size 4 Number of update steps to accumulate the gradients for gradient_accumulation_steps 2 Enable gradient checkpointing gradient_checkpointing True Maximum gradient normal gradient clipping max_grad_norm 0.3 Initial learning rate AdamW optimizer learning_rate 2e 4 Weight decay to apply to all layers except bias LayerNorm weights weight_decay 0.001 Optimizer to use optim paged_adamw_32bit Learning rate schedule constant a bit better than cosine lr_scheduler_type constant Number of training steps overrides num_train_epochs max_steps 1 Ratio of steps for a linear warmup from 0 to learning rate warmup_ratio 0.03 Group sequences into batches with same length Saves memory and speeds up training considerably group_by_length True Save checkpoint every X updates steps save_steps 10 Log every X updates steps logging_steps 1 SFT parameters Maximum sequence length to use max_seq_length None Pack multiple short examples in the same input sequence to increase efficiency packing False Load the entire model on the GPU 0 device_map 0 We can now load everything and start the fine tuning process. We re relying on multiple wrappers, so bear with me. First of all, we want to load the dataset we defined. If you changed it, you can preprocess it here and adapt it to the desired prompt template. Then, we re configuring bitsandbytes for 4 bit quantization. Next, we re loading the Llama 2 model in 4 bit precision on a GPU with the corresponding tokenizer. Finally, we re loading configurations for QLoRA, regular training parameters, and passing everything to the SFTTrainer . The training can finally start! 
Load dataset you can process it here dataset load_dataset dataset_name, split train Load tokenizer and model with QLoRA configuration compute_dtype getattr torch, bnb_4bit_compute_dtype bnb_config BitsAndBytesConfig load_in_4bit use_4bit, bnb_4bit_quant_type bnb_4bit_quant_type, bnb_4bit_compute_dtype compute_dtype, bnb_4bit_use_double_quant use_nested_quant, Check GPU compatibility with bfloat16 if compute_dtype torch.float16 and use_4bit major, _ torch.cuda.get_device_capability if major 8 print 80 print Your GPU supports bfloat16 accelerate training with bf16 True print 80 Load base model model AutoModelForCausalLM.from_pretrained model_name, quantization_config bnb_config, device_map device_map model.config.use_cache False model.config.pretraining_tp 1 Load LLaMA tokenizer tokenizer AutoTokenizer.from_pretrained model_name, trust_remote_code True tokenizer.pad_token tokenizer.eos_token tokenizer.padding_side right Fix weird overflow issue with fp16 training Load LoRA configuration peft_config LoraConfig lora_alpha lora_alpha, lora_dropout lora_dropout, r lora_r, bias none , task_type CAUSAL_LM , Set training parameters training_arguments TrainingArguments output_dir output_dir, num_train_epochs num_train_epochs, per_device_train_batch_size per_device_train_batch_size, gradient_accumulation_steps gradient_accumulation_steps, optim optim, save_steps save_steps, logging_steps logging_steps, learning_rate learning_rate, weight_decay weight_decay, fp16 fp16, bf16 bf16, max_grad_norm max_grad_norm, max_steps max_steps, warmup_ratio warmup_ratio, group_by_length group_by_length, lr_scheduler_type lr_scheduler_type, report_to tensorboard Set supervised fine tuning parameters trainer SFTTrainer model model, train_dataset dataset, peft_config peft_config, dataset_text_field text , max_seq_length max_seq_length, tokenizer tokenizer, args training_arguments, packing packing, Train model trainer.train Save trained model trainer.model.save_pretrained output_dir Image by author The training can be very long, depending on the size of your dataset. Here, it took less than an hour on a T4 GPU. We can check the plots on tensorboard, as follows load_ext tensorboard tensorboard logdir results runs Image by author Let s make sure that the model is behaving correctly. It would require a more exhaustive evaluation, but we can use the text generation pipeline to ask questions like What is a large language model? Note that I m formatting the input to match Llama 2 s prompt template. Ignore warnings logging.set_verbosity logging.CRITICAL Run text generation pipeline with our next model prompt What is a large language model? pipe pipeline task text generation , model model, tokenizer tokenizer, max_length 200 result pipe f s INST prompt INST print result 0 generated_text The model outputs the following response A large language model is a type of artificial intelligence that is trained on a large dataset of text to generate human like language. It is typically trained on a dataset of text that is much larger than the dataset used for smaller language models. The large dataset allows the model to learn more complex patterns in language, which can result in more accurate and natural sounding language generation. Large language models are often used for tasks such as text summarization, language translation, and chatbots. They are also used for more complex tasks such as writing articles, generating creative content, and even composing music. 
Large language models are trained using a technique called deep learning, which involves using many layers of artificial neural networks to learn complex patterns in the data. The model is trained on a large dataset of text, and the neural networks are trained to predict the next word in a sequence of text given From experience, it is very coherent for a model with only 7 billion parameters. You can play with it and ask harder questions from evaluation datasets like BigBench Hard. Guanaco is an excellent dataset that has produced high quality models in the past. You can train a Llama 2 model on the entire dataset using mlabonne guanaco llama2 . How can we store our new llama 2 7b miniguanaco model now? We need to merge the weights from LoRA with the base model. Unfortunately, as far as I know, there is no straightforward way to do it we need to reload the base model in FP16 precision and use the peft library to merge everything. Alas, it also creates a problem with the VRAM despite emptying it , so I recommend restarting the notebook , re executing the three first cells, and then executing the next one. Please contact me if you know a fix! Reload model in FP16 and merge it with LoRA weights base_model AutoModelForCausalLM.from_pretrained model_name, low_cpu_mem_usage True, return_dict True, torch_dtype torch.float16, device_map device_map, model PeftModel.from_pretrained base_model, output_dir model model.merge_and_unload Reload tokenizer to save it tokenizer AutoTokenizer.from_pretrained model_name, trust_remote_code True tokenizer.pad_token tokenizer.eos_token tokenizer.padding_side right Our weights are merged and we reloaded the tokenizer. We can now push everything to the Hugging Face Hub to save our model. !huggingface cli login model.push_to_hub new_model, use_temp_dir False tokenizer.push_to_hub new_model, use_temp_dir False You can now use this model for inference by loading it like any other Llama 2 model from the Hub. It is also possible to reload it for more fine tuning perhaps with another dataset? If you re interested in a script instead of a notebook, I recommend following the instructions provided in this blog post pip install trl git clone https github.com lvwerra trl python trl examples scripts sft_trainer.py model_name meta llama Llama 2 7b hf dataset_name timdettmers openassistant guanaco load_in_4bit use_peft batch_size 4 gradient_accumulation_steps 2 Conclusion In this article, we saw how to fine tune a Llama 2 7b model using a Colab notebook. We introduced some necessary background on LLM training and fine tuning, as well as important considerations related to instruction datasets. In the second section, we successfully fine tuned the Llama 2 model with its native prompt template and custom parameters. These fine tuned models can then be integrated into LangChain and other architectures as an advantageous alternative to OpenAI API. Remember that, in this new paradigm, instruction datasets are the new gold, and the quality of your model heavily depends on the data it s been fine tuned on. So good luck building high quality datasets! If you re interested in more content about LLMs, follow me on Twitter maximelabonne. References Hugo Touvron, Thomas Scialom, et al. 2023 . Llama 2 Open Foundation and Fine Tuned Chat Models. Philipp Schmid, Omar Sanseviero, Pedro Cuenca, Lewis Tunstall. Llama 2 is here get it on Hugging Face. https huggingface.co blog llama2 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. 
Hashimoto. 2023 . Stanford Alpaca An Instruction following LLaMA model. Jacob Devlin, Ming Wei Chang, Kenton Lee, Kristina Toutanova. 2019 . BERT Pre training of Deep Bidirectional Transformers for Language Understanding. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. 2023 . QLoRA Efficient Finetuning of Quantized LLMs. 7 Share this post Fine Tune Your Own Llama 2 Model in a Colab Notebook maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32" }, { "id": "0a0993af-948a-4784-846a-2dbc73cbdadc", "content": "Introduction to Weight Quantization Maxime Labonne Reducing the size of Large Language Models with 8 bit quantization Maxime Labonne SubscribeSign in Share this post Introduction to Weight Quantization maximelabonne.substack.com Copy link Facebook Email Note Other Introduction to Weight Quantization Reducing the size of Large Language Models with 8 bit quantization Maxime Labonne Jul 07, 2023 2 Share this post Introduction to Weight Quantization maximelabonne.substack.com Copy link Facebook Email Note Other Share Reducing the size of Large Language Models with 8 bit quantization Large Language Models LLMs are known for their extensive computational requirements. Typically, the size of a model is calculated by multiplying the number of parameters size by the precision of these values data type . However, to save memory, weights can be stored using lower precision data types through a process known as quantization. We distinguish two main families of weight quantization techniques in the literature Post Training Quantization PTQ is a straightforward technique where the weights of an already trained model are converted to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential performance degradation. Quantization Aware Training QAT incorporates the weight conversion process during the pre training or fine tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data. In this article, we focus on PTQ to reduce the precision of our parameters. To get a good intuition, we will apply both na\u00efve and more sophisticated techniques to a toy example using a GPT 2 model. The entire code is freely available on Google Colab and GitHub. Background on Floating Point Representation The choice of data type dictates the quantity of computational resources required, affecting the speed and efficiency of the model. In deep learning applications, balancing precision and computational performance becomes a vital exercise as higher precision often implies greater computational demands. Among various data types, floating point numbers are predominantly employed in deep learning due to their ability to represent a wide range of values with high precision. Typically, a floating point number uses _n_ bits to store a numerical value. 
These _n_ bits are further partitioned into three distinct components 1. Sign The sign bit indicates the positive or negative nature of the number. It uses one bit where 0 indicates a positive number and 1 signals a negative number. 2. Exponent The exponent is a segment of bits that represents the power to which the base usually 2 in binary representation is raised. The exponent can also be positive or negative, allowing the number to represent very large or very small values. 3. Significand Mantissa The remaining bits are used to store the significand, also referred to as the mantissa. This represents the significant digits of the number. The precision of the number heavily depends on the length of the significand. This design allows floating point numbers to cover a wide range of values with varying levels of precision. The formula used for this representation is To understand this better, let s delve into some of the most commonly used data types in deep learning float32 FP32 , float16 FP16 , and bfloat16 BF16 FP32 uses 32 bits to represent a number one bit for the sign, eight for the exponent, and the remaining 23 for the significand. While it provides a high degree of precision, the downside of FP32 is its high computational and memory footprint. FP16 uses 16 bits to store a number one is used for the sign, five for the exponent, and ten for the significand. Although this makes it more memory efficient and accelerates computations, the reduced range and precision can introduce numerical instability, potentially impacting model accuracy. BF16 is also a 16 bit format but with one bit for the sign, _eight_ for the exponent, and _seven_ for the significand. BF16 expands the representable range compared to FP16, thus decreasing underflow and overflow risks. Despite a reduction in precision due to fewer significand bits, BF16 typically does not significantly impact model performance and is a useful compromise for deep learning tasks. Image by author In ML jargon, FP32 is often termed full precision 4 bytes , while BF16 and FP16 are half precision 2 bytes . But could we do even better and store weights using a single byte? The answer is the INT8 data type, which consists of an 8 bit representation capable of storing 2\u2078 256 different values. In the next section, we ll see how to convert FP32 weights into an INT8 format. Na\u00efve 8 bit Quantization In this section, we will implement two quantization techniques a symmetric one with absolute maximum absmax quantization and an asymmetric one with zero point quantization . In both cases, the goal is to map an FP32 tensor X original weights to an INT8 tensor X_quant quantized weights . With absmax quantization , the original number is divided by the absolute maximum value of the tensor and multiplied by a scaling factor 127 to map inputs into the range 127, 127 . To retrieve the original FP16 values, the INT8 number is divided by the quantization factor, acknowledging some loss of precision due to rounding. For instance, let s say we have an absolution maximum value of 3.2. A weight of 0.1 would be quantized to _round 0.1 127 3.2 4_. If we want to dequantize it, we would get _4 3.2 127 0.1008_ , which implies an error of 0.008. 
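Written out explicitly (a reconstruction of the mapping described in the text), absmax quantization computes $X_{\text{quant}} = \operatorname{round}\!\left(\frac{127}{\max\lvert X \rvert}\, X\right)$ and dequantization inverts it as $X_{\text{dequant}} = \frac{\max\lvert X \rvert}{127}\, X_{\text{quant}}$. Plugging in the example above: with $\max\lvert X\rvert = 3.2$, the weight $0.1$ becomes $\operatorname{round}(127/3.2 \times 0.1) = 4$ and dequantizes to $4 \times 3.2 / 127 \approx 0.1008$, i.e. the error of about $0.008$.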
Here s the corresponding Python implementation import torch def absmax_quantize X Calculate scale scale 127 torch.max torch.abs X Quantize X_quant scale X .round Dequantize X_dequant X_quant scale return X_quant.to torch.int8 , X_dequant With zero point quantization , we can consider asymmetric input distributions, which is useful when you consider the output of a ReLU function only positive values , for example. The input values are first scaled by the total range of values 255 divided by the difference between the maximum and minimum values. This distribution is then shifted by the zero point to map it into the range 128, 127 notice the extra value compared to absmax . First, we calculate the scale factor and the zero point value Then, we can use these variables to quantize or dequantize our weights Let s take an example we have a maximum value of 3.2 and a minimum value of 3.0. We can calculate the scale is _255 3.2 3.0 41.13_ and the zero point _ round 41.13 3.0 128 123 128 5_ , so our previous weight of 0.1 would be quantized to _round 41.13 0.1 5 1_. This is very different from the previous value obtained using absmax 4 vs. 1 . Image by author The Python implementation is quite straightforward def zeropoint_quantize X Calculate value range denominator x_range torch.max X torch.min X x_range 1 if x_range 0 else x_range Calculate scale scale 255 x_range Shift by zero point zeropoint scale torch.min X 128 .round Scale and round the inputs X_quant torch.clip X scale zeropoint .round , 128, 127 Dequantize X_dequant X_quant zeropoint scale return X_quant.to torch.int8 , X_dequant Instead of relying on complete toy examples, we can use these two functions on a real model thanks to the transformers library. We start by loading the model and tokenizer for GPT 2. This is a very small model we probably don t want to quantize, but it will be good enough for this tutorial. First, we want to observe the model s size so we can compare it later and evaluate the memory savings due to 8 bit quantization. !pip install q bitsandbytes 0.39.0 !pip install q git https github.com huggingface accelerate.git !pip install q git https github.com huggingface transformers.git from transformers import AutoModelForCausalLM, AutoTokenizer import torch torch.manual_seed 0 Set device to CPU for now device cpu Load model and tokenizer model_id gpt2 model AutoModelForCausalLM.from_pretrained model_id .to device tokenizer AutoTokenizer.from_pretrained model_id Print model size print f Model size model.get_memory_footprint , bytes Model size 510,342,192 bytes The size of the GPT 2 model is approximately 487MB in FP32. The next step consists of quantizing the weights using zero point and absmax quantization. In the following example, we apply these techniques to the first attention layer of GPT 2 to see the results. 
Extract weights of the first layer weights model.transformer.h 0 .attn.c_attn.weight.data print Original weights print weights Quantize layer using absmax quantization weights_abs_quant, _ absmax_quantize weights print nAbsmax quantized weights print weights_abs_quant Quantize layer using absmax quantization weights_zp_quant, _ zeropoint_quantize weights print nZero point quantized weights print weights_zp_quant Original weights tensor 0.4738, 0.2614, 0.0978, ..., 0.0513, 0.0584, 0.0250 , 0.0874, 0.1473, 0.2387, ..., 0.0525, 0.0113, 0.0156 , 0.0039, 0.0695, 0.3668, ..., 0.1143, 0.0363, 0.0318 , ..., 0.2592, 0.0164, 0.1991, ..., 0.0095, 0.0516, 0.0319 , 0.1517, 0.2170, 0.1043, ..., 0.0293, 0.0429, 0.0475 , 0.4100, 0.1924, 0.2400, ..., 0.0046, 0.0070, 0.0198 Absmax quantized weights tensor 21, 12, 4, ..., 2, 3, 1 , 4, 7, 11, ..., 2, 1, 1 , 0, 3, 16, ..., 5, 2, 1 , ..., 12, 1, 9, ..., 0, 2, 1 , 7, 10, 5, ..., 1, 2, 2 , 18, 9, 11, ..., 0, 0, 1 , dtype torch.int8 Zero point quantized weights tensor 20, 11, 3, ..., 3, 2, 2 , 5, 8, 12, ..., 1, 0, 0 , 1, 4, 18, ..., 6, 3, 0 , ..., 11, 0, 10, ..., 1, 1, 2 , 8, 11, 6, ..., 2, 1, 1 , 18, 8, 10, ..., 1, 1, 2 , dtype torch.int8 The difference between the original FP32 and quantized values INT8 is clear, but the difference between absmax and zero point weights is more subtle. In this case, the inputs look shifted by a value of 1. This suggests that the weight distribution in this layer is quite symmetric. We can compare these techniques by quantizing every layer in GPT 2 linear layers, attention layers, etc. and create two new models model_abs and model_zp . To be precise, we will actually replace the original weights with _ de _ quantized ones. This has two benefits it allows us to 1 compare the distribution of our weights same scale and 2 actually run the models. Indeed, PyTorch doesn t allow INT8 matrix multiplication by default. In a real scenario, we would dequantize them to run the model in FP16 for example but store them as INT8. In the next section, we will use the bitsandbytes library to solve this issue. import numpy as np from copy import deepcopy Store original weights weights param.data.clone for param in model.parameters Create model to quantize model_abs deepcopy model Quantize all model weights weights_abs for param in model_abs.parameters _, dequantized absmax_quantize param.data param.data dequantized weights_abs.append dequantized Create model to quantize model_zp deepcopy model Quantize all model weights weights_zp for param in model_zp.parameters _, dequantized zeropoint_quantize param.data param.data dequantized weights_zp.append dequantized Now that our models have been quantized, we want to check the impact of this process. Intuitively, we want to make sure that the quantized weights are close to the original ones . A visual way to check it is to plot the distribution of the dequantized and original weights. If the quantization is lossy, it would drastically change the weight distribution. The following figure shows this comparison, where the blue histogram represents the original FP32 weights, and the red one represents the dequantized from INT8 weights. Note that we only display this plot between 2 and 2 because of outliers with very high absolute values more on that later . Both plots are quite similar, with a surprising spike around 0. This spike shows that our quantization is quite lossy since reversing the process doesn t output the original values. 
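Since the figure itself is not reproduced here, the following sketch shows one way to draw that comparison, assuming the weights, weights_abs, and weights_zp lists built in the snippet above; the [-2, 2] range mirrors the plot described in the text.

```python
import matplotlib.pyplot as plt
import numpy as np

def flatten(tensors):
    # Concatenate every parameter tensor into a single 1-D numpy array.
    return np.concatenate([t.detach().cpu().numpy().flatten() for t in tensors])

original = flatten(weights)          # original FP32 weights stored earlier
dequant_abs = flatten(weights_abs)   # de-quantized absmax weights
dequant_zp = flatten(weights_zp)     # de-quantized zero-point weights

fig, axes = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
panels = [(axes[0], dequant_abs, "Absmax"), (axes[1], dequant_zp, "Zero-point")]
for ax, dequant, label in panels:
    ax.hist(original, bins=200, range=(-2, 2), alpha=0.5, color="blue", label="Original (FP32)")
    ax.hist(dequant, bins=200, range=(-2, 2), alpha=0.5, color="red", label=f"{label} (dequantized)")
    ax.set_title(f"Original vs. {label.lower()} de-quantized weights")
    ax.legend()
axes[1].set_xlabel("Weight value")
plt.show()
```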
This is particularly true for the absmax model, which displays both a lower valley and a higher spike around 0. Let s compare the performance of the original and quantized models. For this purpose, we define a generate_text function to generate 50 tokens with top k sampling. def generate_text model, input_text, max_length 50 input_ids tokenizer.encode input_text, return_tensors pt .to device output model.generate inputs input_ids, max_length max_length, do_sample True, top_k 30, pad_token_id tokenizer.eos_token_id, attention_mask input_ids.new_ones input_ids.shape return tokenizer.decode output 0 , skip_special_tokens True Generate text with original and quantized models original_text generate_text model, I have a dream absmax_text generate_text model_abs, I have a dream zp_text generate_text model_zp, I have a dream print f Original model n original_text print 50 print f Absmax model n absmax_text print 50 print f Zeropoint model n zp_text Original model I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn t even that strong. And then I got the Absmax model I have a dream to find out the origin of her hair. She loves it. But there s no way you could be honest about how her hair is made. She must be crazy. We found a photo of the hairstyle posted on Zeropoint model I have a dream of creating two full time jobs in America one for people with mental health issues, and one for people who do not suffer from mental illness or at least have an employment and family history of substance abuse, to work part Instead of trying to see if one output makes more sense than the others, we can quantify it by calculating the perplexity of each output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is. In practice, a sentence with a high perplexity could also be correct. We implement it using a minimal function since it doesn t need to consider details like the length of the context window since our sentences are short. def calculate_perplexity model, text Encode the text encodings tokenizer text, return_tensors pt .to device Define input_ids and target_ids input_ids encodings.input_ids target_ids input_ids.clone with torch.no_grad outputs model input_ids, labels target_ids Loss calculation neg_log_likelihood outputs.loss Perplexity calculation ppl torch.exp neg_log_likelihood return ppl ppl calculate_perplexity model, original_text ppl_abs calculate_perplexity model_abs, absmax_text ppl_zp calculate_perplexity model_zp, absmax_text print f Original perplexity ppl.item .2f print f Absmax perplexity ppl_abs.item .2f print f Zeropoint perplexity ppl_zp.item .2f Original perplexity 15.53 Absmax perplexity 17.92 Zeropoint perplexity 17.97 We see that the perplexity of the original model is slightly lower than the two others. A single experiment is not very reliable, but we could repeat this process multiple times to see the difference between each model. In theory, zero point quantization should be slightly better than absmax, but is also more costly to compute. In this example, we applied quantization techniques to entire layers per tensor basis . However, we could apply it at different granularity levels from the entire model to individual values. 
Quantizing the entire model in one pass would seriously degrade the performance, while quantizing individual values would create a big overhead. In practice, we often prefer the vector wise quantization , which considers the variability of values in rows and columns inside of the same tensor. However, even vector wise quantization doesn t solve the problem of outlier features. Outlier features are extreme values negative or positive that appear in all transformer layers when the model reach a certain scale 6.7B parameters . This is an issue since a single outlier can reduce the precision for all other values. But discarding these outlier features is not an option since it would greatly degrade the model s performance. 8 bit Quantization with LLM.int8 Introduced by Dettmers et al. 2022 , LLM.int8 is a solution to the outlier problem. It relies on a vector wise absmax quantization scheme and introduces mixed precision quantization. This means that outlier features are processed in a FP16 format to retain their precision, while the other values are processed in an INT8 format. As outliers represent about 0.1 of values, this effectively reduces the memory footprint of the LLM by almost 2x. Image by author LLM.int8 works by conducting matrix multiplication computation in three key steps 1. Extract columns from the input hidden states X containing outlier features using a custom threshold. 2. Perform the matrix multiplication of the outliers using FP16 and the non outliers using INT8 with vector wise quantization row wise for the hidden state X and column wise for the weight matrix W . 3. Dequantize the non outlier results INT8 to FP16 and add them to the outlier results to get the full result in FP16. Image by author This approach is necessary because 8 bit precision is limited and can lead to substantial errors when quantizing a vector with large values. These errors also tend to amplify as they propagate through multiple layers. We can easily use this technique thanks to the integration of the bitsandbytes library into the Hugging Face ecosystem. We just need to specify load_in_8bit True when loading the model it also requires a GPU . device torch.device cuda if torch.cuda.is_available else cpu model_int8 AutoModelForCausalLM.from_pretrained model_id, device_map auto , load_in_8bit True, print f Model size model_int8.get_memory_footprint , bytes Model size 176,527,896 bytes With this extra line of code, the model is now almost three times smaller 168MB vs. 487MB . We can even compare the distribution of the original and quantized weights as we did earlier In this case, we see spikes around 2, 1, 0, 1, 2, etc. These values correspond to the parameters stored in the INT8 format non outliers . You can verify it by printing the model s weights using model_int8.parameters . We can also generate text with this quantized model and compare it to the original model. Generate text with quantized model text_int8 generate_text model_int8, I have a dream print f Original model n original_text print 50 print f LLM.int8 model n text_int8 Original model I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn t even that strong. And then I got the LLM.int8 model I have a dream. I don t know what will come of it, but I am going to have to look for something that will be right. 
I haven t thought about it for a long time, but I have to try to get that thing Once again, it is difficult to judge what is the best output, but we can rely on the perplexity metric to give us an approximate answer. print f Perplexity original ppl.item .2f ppl calculate_perplexity model_int8, text_int8 print f Perplexity LLM.int8 ppl.item .2f Perplexity original 15.53 Perplexity LLM.int8 7.93 In this case, the perplexity of the quantized model is twice as low as the original one. In general, this is not the case, but it shows that this quantization technique is very competitive. In fact, the authors of LLM.int8 show that the performance degradation is so low it s negligible 1 . However, it has an additional cost in terms of computation LLM.int8 is roughly about 20 slower for large models. Conclusion This article provided an overview of the most popular weight quantization techniques. We started by gaining an understanding of floating point representation, before introducing two techniques for 8 bit quantization absmax and zero point quantization . However, their limitations, particularly when it comes to handling outliers, led to LLM.int8 , a technique that also preserves the model s performance. This approach underlines the progress being made in the field of weight quantization, revealing the importance of properly addressing outliers. Looking forward, our next article will explore the GPTQ weight quantization technique in depth. This technique, introduced by Frantar et al., only utilizes 4 bits and represents a significant advancement in the field of weight quantization. We will provide a comprehensive guide on how to implement GPTQ using the AutoGPTQ library. If you re interested in more technical content around LLMs, follow me on Twitter maximelabonne. References T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, LLM.int8 8 bit Matrix Multiplication for Transformers at Scale. 2022. Y. Beldaka, and T. Dettmers, A Gentle Introduction to 8 bit Matrix Multiplication, Hugging Face Blog 2022 . A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, A Survey of Quantization Methods for Efficient Neural Network Inference. 2021. H. Wu, P. Judd, X. Zhang, M. Isaev, and P. Micikevicius, Integer Quantization for Deep Learning Inference Principles and Empirical Evaluation. 2020. Lilian Weng, Large Transformer Model Inference Optimization, Lil Log 2023 . Kamil Czarnogorski, Local Large Language Models, Int8 2023 . 2 Share this post Introduction to Weight Quantization maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. 
Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-weight-quantization-2494701b9c0c" }, { "id": "83419ab3-ff2b-4cc7-a792-67a62fe4c585", "content": "Decoding Strategies in Large Language Models A Guide to Text Generation From Beam Search to Nucleus Sampling Maxime Labonne SubscribeSign in Share this post Decoding Strategies in Large Language Models maximelabonne.substack.com Copy link Facebook Email Note Other Decoding Strategies in Large Language Models A Guide to Text Generation From Beam Search to Nucleus Sampling Maxime Labonne Jun 04, 2023 3 Share this post Decoding Strategies in Large Language Models maximelabonne.substack.com Copy link Facebook Email Note Other Share A Guide to Text Generation From Beam Search to Nucleus Sampling Image by author. In the fascinating world of large language models LLMs , much attention is given to model architectures, data processing, and optimization. However, decoding strategies like beam search, which play a crucial role in text generation, are often overlooked. In this article, we will explore how LLMs generate text by delving into the mechanics of greedy search and beam search, as well as sampling techniques with top k and nucleus sampling. By the conclusion of this article, you ll not only understand these decoding strategies thoroughly but also be familiar with how to handle important hyperparameters like temperature, num_beams, top_k, and top_p. The code for this article can be found on GitHub and Google Colab for reference and further exploration. Background To kick things off, let s start with an example. We ll feed the text I have a dream to a GPT 2 model and ask it to generate the next five tokens words or subwords . from transformers import GPT2LMHeadModel, GPT2Tokenizer import torch device cuda if torch.cuda.is_available else cpu model GPT2LMHeadModel.from_pretrained gpt2 .to device tokenizer GPT2Tokenizer.from_pretrained gpt2 model.eval text I have a dream input_ids tokenizer.encode text, return_tensors pt .to device outputs model.generate input_ids, max_length len input_ids.squeeze 5 generated_text tokenizer.decode outputs 0 , skip_special_tokens True print f Generated text generated_text Generated text I have a dream of being a doctor. The sentence I have a dream of being a doctor appears to have been generated by GPT 2. However, GPT 2 didn t _exactly_ produce this sentence. There s a common misconception that LLMs like GPT 2 directly produce text . This isn t the case. Instead, LLMs calculate logits, which are scores assigned to every possible token in their vocabulary. To simplify, here s an illustrative breakdown of the process Image by author. The tokenizer, Byte Pair Encoding in this instance, translates each token in the input text into a corresponding token ID. Then, GPT 2 uses these token IDs as input and tries to predict the next most likely token. Finally, the model generates logits, which are converted into probabilities using a softmax function. For example, the model assigns a probability of 17 to the token for of being the next token after I have a dream . This output essentially represents a ranked list of potential next tokens in the sequence. More formally, we denote this probability as _P of I have a dream 17 _. Autoregressive models like GPT predict the next token in a sequence based on the preceding tokens. 
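Before formalizing this, here is a short sketch (not from the original article) that inspects those probabilities directly, reusing the gpt2 model and tokenizer loaded above; the exact percentages you get may differ slightly from the 17% quoted for " of".

```python
import torch

text = "I have a dream"
input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

with torch.no_grad():
    # Logits for the token that would come right after the prompt
    next_token_logits = model(input_ids).logits[0, -1, :]

# Softmax turns the 50,257 logits into a probability distribution
probs = torch.softmax(next_token_logits, dim=-1)
top_probs, top_ids = torch.topk(probs, k=5)

for p, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()])!r}: {p.item():.1%}")
# ' of' should appear near the top with a probability around 17%.
```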
Consider a sequence of tokens _w w_ \u2081 _, w_ \u2082 _, , w_ \u209c _ _. The joint probability of this sequence _P w _ can be broken down as For each token _w\u1d62_ in the sequence, _P w\u1d62 w\u2081, w\u2082, , w\u1d62 \u2081 _ represents the conditional probability of _w\u1d62_ given all the preceding tokens _w\u2081, w\u2082, , w\u1d62 \u2081_ . GPT 2 calculates this conditional probability for each of the 50,257 tokens in its vocabulary. This leads to the question how do we use these probabilities to generate text? This is where decoding strategies, such as greedy search and beam search, come into play. Greedy Search Greedy search is a decoding method that takes the most probable token at each step as the next token in the sequence. To put it simply, it only retains the most likely token at each stage, discarding all other potential options. Using our example Step 1 Input I have a dream Most likely token of Step 2 Input I have a dream of Most likely token being Step 3 Input I have a dream of being Most likely token a Step 4 Input I have a dream of being a Most likely token doctor Step 5 Input I have a dream of being a doctor Most likely token . While this approach might sound intuitive, it s important to note that the greedy search is short sighted it only considers the most probable token at each step without considering the overall effect on the sequence. This property makes it fast and efficient as it doesn t need to keep track of multiple sequences, but it also means that it can miss out on better sequences that might have appeared with slightly less probable next tokens. Next, let s illustrate the greedy search implementation using graphviz and networkx. We select the ID with the highest score, compute its log probability we take the log to simplify calculations , and add it to the tree. We ll repeat this process for five tokens. import matplotlib.pyplot as plt import networkx as nx import numpy as np import time def get_log_prob logits, token_id Compute the softmax of the logits probabilities torch.nn.functional.softmax logits, dim 1 log_probabilities torch.log probabilities Get the log probability of the token token_log_probability log_probabilities token_id .item return token_log_probability def greedy_search input_ids, node, length 5 if length 0 return input_ids outputs model input_ids predictions outputs.logits Get the predicted next sub word here we use top k search logits predictions 0, 1, token_id torch.argmax logits .unsqueeze 0 Compute the score of the predicted token token_score get_log_prob logits, token_id Add the predicted token to the list of input ids new_input_ids torch.cat input_ids, token_id.unsqueeze 0 , dim 1 Add node and edge to graph next_token tokenizer.decode token_id, skip_special_tokens True current_node list graph.successors node 0 graph.nodes current_node tokenscore np.exp token_score 100 graph.nodes current_node token next_token f _ length Recursive call input_ids greedy_search new_input_ids, current_node, length 1 return input_ids Parameters length 5 beams 1 Create a balanced tree with height length graph nx.balanced_tree 1, length, create_using nx.DiGraph Add tokenscore , cumscore , and token attributes to each node for node in graph.nodes graph.nodes node tokenscore 100 graph.nodes node token text Start generating text output_ids greedy_search input_ids, 0, length length output tokenizer.decode output_ids.squeeze .tolist , skip_special_tokens True print f Generated text output Generated text I have a dream of being a doctor. 
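For readers who just want the decoding logic without the graph bookkeeping, here is a compact, self-contained version of the same greedy loop (a sketch reusing the model and tokenizer loaded earlier, with no tree visualization).

```python
import torch

def greedy_decode(model, tokenizer, text, n_tokens=5):
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
    log_prob_sum = 0.0
    for _ in range(n_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1, :]
        log_probs = torch.log_softmax(logits, dim=-1)
        next_id = torch.argmax(log_probs)          # keep only the most likely token
        log_prob_sum += log_probs[next_id].item()  # accumulate its log-probability
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True), log_prob_sum

generated, score = greedy_decode(model, tokenizer, "I have a dream")
print(generated)  # Expected: I have a dream of being a doctor.
print(score)      # cumulative log-probability of the five generated tokens
```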
Our greedy search generates the same text as the one from the transformers library I have a dream of being a doctor. Let s visualize the tree we created. import matplotlib.pyplot as plt import networkx as nx import matplotlib.colors as mcolors from matplotlib.colors import LinearSegmentedColormap def plot_graph graph, length, beams, score fig, ax plt.subplots figsize 3 1.2 beams length, max 5, 2 length , dpi 300, facecolor white Create positions for each node pos nx.nx_agraph.graphviz_layout graph, prog dot Normalize the colors along the range of token scores if score token scores data tokenscore for _, data in graph.nodes data True if data token is not None elif score sequence scores data sequencescore for _, data in graph.nodes data True if data token is not None vmin min scores vmax max scores norm mcolors.Normalize vmin vmin, vmax vmax cmap LinearSegmentedColormap.from_list rg , r , y , g , N 256 Draw the nodes nx.draw_networkx_nodes graph, pos, node_size 2000, node_shape o , alpha 1, linewidths 4, node_color scores, cmap cmap Draw the edges nx.draw_networkx_edges graph, pos Draw the labels if score token labels node data token .split _ 0 f n data tokenscore .2f for node, data in graph.nodes data True if data token is not None elif score sequence labels node data token .split _ 0 f n data sequencescore .2f for node, data in graph.nodes data True if data token is not None nx.draw_networkx_labels graph, pos, labels labels, font_size 10 plt.box False Add a colorbar sm plt.cm.ScalarMappable cmap cmap, norm norm sm.set_array if score token fig.colorbar sm, ax ax, orientation vertical , pad 0, label Token probability elif score sequence fig.colorbar sm, ax ax, orientation vertical , pad 0, label Sequence score plt.show Plot graph plot_graph graph, length, 1.5, token Image by author. In this graph, the top node stores the input token thus with a 100 probability , while all other nodes represent generated tokens. Although each token in this sequence was the most likely at the time of prediction, being and doctor were assigned relatively low probabilities of 9.68 and 2.86 , respectively. This suggests that of , our first predicted token, may not have been the most suitable choice as it led to being , which is quite unlikely. In the following section, we ll explore how beam search can address this problem. Beam Search Unlike greedy search, which only considers the next most probable token, beam search takes into account the _n_ most likely tokens, where _n_ represents the number of beams. This procedure is repeated until a predefined maximum length is reached or an end of sequence token appears. At this point, the sequence or beam with the highest overall score is chosen as the output. We can adapt the previous function to consider the _n_ most probable tokens instead of just one. Here, we ll maintain the sequence score log _P w _ , which is the cumulative sum of the log probability of every token in the beam. We normalize this score by the sequence length to prevent bias towards longer sequences this factor can be adjusted . Once again, we ll generate five additional tokens to complete the sentence I have a dream. 
from tqdm.notebook import tqdm def greedy_sampling logits, beams return torch.topk logits, beams .indices def beam_search input_ids, node, bar, length, beams, sampling, temperature 0.1 if length 0 return None outputs model input_ids predictions outputs.logits Get the predicted next sub word here we use top k search logits predictions 0, 1, if sampling greedy top_token_ids greedy_sampling logits, beams elif sampling top_k top_token_ids top_k_sampling logits, temperature, 20, beams elif sampling nucleus top_token_ids nucleus_sampling logits, temperature, 0.5, beams for j, token_id in enumerate top_token_ids bar.update 1 Compute the score of the predicted token token_score get_log_prob logits, token_id cumulative_score graph.nodes node cumscore token_score Add the predicted token to the list of input ids new_input_ids torch.cat input_ids, token_id.unsqueeze 0 .unsqueeze 0 , dim 1 Add node and edge to graph token tokenizer.decode token_id, skip_special_tokens True current_node list graph.successors node j graph.nodes current_node tokenscore np.exp token_score 100 graph.nodes current_node cumscore cumulative_score graph.nodes current_node sequencescore 1 len new_input_ids.squeeze cumulative_score graph.nodes current_node token token f _ length _ j Recursive call beam_search new_input_ids, current_node, bar, length 1, beams, sampling, 1 Parameters length 5 beams 2 Create a balanced tree with height length and branching factor k graph nx.balanced_tree beams, length, create_using nx.DiGraph bar tqdm total len graph.nodes Add tokenscore , cumscore , and token attributes to each node for node in graph.nodes graph.nodes node tokenscore 100 graph.nodes node cumscore 0 graph.nodes node sequencescore 0 graph.nodes node token text Start generating text beam_search input_ids, 0, bar, length, beams, greedy , 1 The function computes the scores for 63 tokens and beams length 5\u00b2 25 possible sequences. In our implementation, all the information is stored in the graph. Our next step is to extract the best sequence. First, we identify the leaf node with the highest sequence score. Next, we find the shortest path from the root to this leaf. Every node along this path contains a token from the optimal sequence. Here s how we can implement it def get_best_sequence G Create a list of leaf nodes leaf_nodes node for node in G.nodes if G.out_degree node 0 Get the leaf node with the highest cumscore max_score_node None max_score float inf for node in leaf_nodes if G.nodes node sequencescore max_score max_score G.nodes node sequencescore max_score_node node Retrieve the sequence of nodes from this leaf node to the root node in a list path nx.shortest_path G, source 0, target max_score_node Return the string of token attributes of this sequence sequence .join G.nodes node token .split _ 0 for node in path return sequence, max_score sequence, max_score get_best_sequence graph print f Generated text sequence Generated text I have a dream. I have a dream The best sequence seems to be I have a dream. I have a dream, which is a common response from GPT 2, even though it may be surprising. To verify this, let s plot the graph. In this visualization, we ll display the sequence score for each node, which represents the score of the sequence up to that point. If the function get_best_sequence is correct, the dream node in the sequence I have a dream. I have a dream should have the highest score among all the leaf nodes. 
Plot graph plot_graph graph, length, beams, sequence Indeed, the dream token has the highest sequence score with a value of 0.69. Interestingly, we can see the score of the greedy sequence I have a dream of being a doctor. on the left with a value of 1.16. As expected, the greedy search leads to suboptimal results. But, to be honest, our new outcome is not particularly compelling either. To generate more varied sequences, we ll implement two sampling algorithms top k and nucleus. Top k sampling Top k sampling is a technique that leverages the probability distribution generated by the language model to select a token randomly from the _ k _ most likely options . To illustrate, suppose we have _k 3_ and four tokens A, B, C, and D, with respective probabilities _P A 30 _ , _P B 15 _ , _P C 5 _ , and _P D 1 _. In top k sampling, token D is disregarded, and the algorithm will output A 60 of the time, B 30 of the time, and C 10 of the time. This approach ensures that we prioritize the most probable tokens while introducing an element of randomness in the selection process. Another way of introducing randomness is the concept of temperature. The temperature _T_ is a parameter that ranges from 0 to 1, which affects the probabilities generated by the softmax function, making the most likely tokens more influential. In practice, it simply consists of dividing the input logits by a value we call temperature Here is a chart that demonstrates the impact of temperature on the probabilities generated for a given set of input logits 1.5, 1.8, 0.9, 3.2 . We ve plotted three different temperature values to observe the differences. A temperature of 1.0 is equivalent to a default softmax with no temperature at all. On the other hand, a low temperature setting 0.1 significantly alters the probability distribution. This is commonly used in text generation to control the level of creativity in the generated output. By adjusting the temperature, we can influence the extent to which the model produces more diverse or predictable responses. Let s now implement the top k sampling algorithm. We ll use it in the beam_search function by providing the top_k argument. To illustrate how the algorithm works, we will also plot the probability distributions for top_k 20. 
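Before the full implementation, here is a tiny sketch of the two ingredients just described: temperature scaling (dividing the logits by T before the softmax) and restricting sampling to the k most likely tokens. The probabilities and logits are the illustrative values from the text; everything else is an assumption made for the sake of the example.

import torch

# Top-k renormalization with the A/B/C/D example from the text (k = 3).
probs = torch.tensor([0.30, 0.15, 0.05, 0.01])          # P(A), P(B), P(C), P(D)
top_probs, top_idx = torch.topk(probs, k=3)             # token D is discarded
top_probs = top_probs / top_probs.sum()                 # A: 60%, B: 30%, C: 10%
next_token = top_idx[torch.multinomial(top_probs, 1)]   # random draw among the survivors
print(top_probs.tolist(), next_token.item())

# Temperature: dividing the logits before the softmax sharpens the distribution.
logits = torch.tensor([1.5, 1.8, 0.9, 3.2])             # example logits from the chart
for T in (1.0, 0.1):
    print(T, [round(p, 3) for p in torch.softmax(logits / T, dim=-1).tolist()])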
def plot_prob_distribution probabilities, next_tokens, sampling, potential_nb, total_nb 50 Get top k tokens top_k_prob, top_k_indices torch.topk probabilities, total_nb top_k_tokens tokenizer.decode idx for idx in top_k_indices.tolist Get next tokens and their probabilities next_tokens_list tokenizer.decode idx for idx in next_tokens.tolist next_token_prob probabilities next_tokens .tolist Create figure plt.figure figsize 0.4 total_nb, 5 , dpi 300, facecolor white plt.rc axes , axisbelow True plt.grid axis y , linestyle , alpha 0.5 if potential_nb total_nb plt.axvline x potential_nb 0.5, ls , color grey , label Sampled tokens plt.bar top_k_tokens, top_k_prob.tolist , color blue plt.bar next_tokens_list, next_token_prob, color red , label Selected tokens plt.xticks rotation 45, ha right , va top plt.gca .spines top .set_visible False plt.gca .spines right .set_visible False if sampling top_k plt.title Probability distribution of predicted tokens with top k sampling elif sampling nucleus plt.title Probability distribution of predicted tokens with nucleus sampling plt.legend plt.savefig f sampling _ time.time .png , dpi 300 plt.close def top_k_sampling logits, temperature, top_k, beams, plot True assert top_k 1 assert beams top_k indices_to_remove logits torch.topk logits, top_k 0 ..., 1, None new_logits torch.clone logits new_logits indices_to_remove float inf Convert logits to probabilities probabilities torch.nn.functional.softmax new_logits temperature, dim 1 Sample n tokens from the resulting distribution next_tokens torch.multinomial probabilities, beams Plot distribution if plot total_prob torch.nn.functional.softmax logits temperature, dim 1 plot_prob_distribution total_prob, next_tokens, top_k , top_k return next_tokens Start generating text beam_search input_ids, 0, bar, length, beams, top_k , 1 Image by author. These plots give a good intuition of how top k sampling works, with all the potentially selected tokens on the left of the horizontal bar. While the most probable tokens are selected in red most of the time, it also allows less likely tokens to be chosen. This offers an interesting tradeoff that can steer a sequence towards a less predictable but more natural sounding sentence. Now let s print the text it generated. sequence, max_score get_best_sequence graph print f Generated text sequence Generated text I have a dream job and I want to The top k sampling found a new sequence I have a dream job and I want to , which feels significantly more natural than I have a dream. I have a dream . We re making progress! Let s see how this decision tree differs from the previous one. Plot graph plot_graph graph, length, beams, sequence You can see how the nodes differ significantly from the previous iteration, making more diverse choices. Although the sequence score of this new outcome might not be the highest 1.01 instead of 0.69 previously , it s important to remember that higher scores do not always lead to more realistic or meaningful sequences. Now that we ve introduced top k sampling, we have to present the other most popular sampling technique nucleus sampling. Nucleus sampling Nucleus sampling, also known as top p sampling, takes a different approach from top k sampling. Rather than selecting the top _k_ most probable tokens, nucleus sampling chooses a cutoff value _p_ such that the sum of the probabilities of the selected tokens exceeds _ p _. This forms a nucleus of tokens from which to randomly choose the next token. 
In other words, the model examines its top probable tokens in descending order and keeps adding them to the list until the total probability surpasses the threshold _p_. Unlike top k sampling, the number of tokens included in the nucleus can vary from step to step. This variability often results in a more diverse and creative output, making nucleus sampling popular for tasks such as text generation. To implement the nucleus sampling method, we can use the nucleus parameter in the beam_search function. In this example, we ll set the value of _p_ to 0.5. To make it easier, we ll include a minimum number of tokens equal to the number of beams. We ll also consider tokens with cumulative probabilities lower than _p_ , rather than higher. It s worth noting that while the details may differ, the core idea of nucleus sampling remains the same. def nucleus_sampling logits, temperature, p, beams, plot True assert p 0 assert p 1 Sort the probabilities in descending order and compute cumulative probabilities sorted_logits, sorted_indices torch.sort logits, descending True probabilities torch.nn.functional.softmax sorted_logits temperature, dim 1 cumulative_probabilities torch.cumsum probabilities, dim 1 Create a mask for probabilities that are in the top p mask cumulative_probabilities p If there s not n index where cumulative_probabilities p, we use the top n tokens instead if mask.sum beams top_p_index_to_keep torch.where mask 0 1 .detach .cpu .tolist else top_p_index_to_keep beams Only keep top p indices indices_to_remove sorted_indices top_p_index_to_keep sorted_logits indices_to_remove float inf Sample n tokens from the resulting distribution probabilities torch.nn.functional.softmax sorted_logits temperature, dim 1 next_tokens torch.multinomial probabilities, beams Plot distribution if plot total_prob torch.nn.functional.softmax logits temperature, dim 1 plot_prob_distribution total_prob, next_tokens, nucleus , top_p_index_to_keep return next_tokens Start generating text beam_search input_ids, 0, bar, length, beams, nucleus , 1 Image by author. In this plot, you can see that the number of tokens included in the nucleus left of the vertical bar fluctuates a lot. The generated probability distributions vary considerably, leading to the selection of tokens that are not always among the most probable ones. This opens the door to the generation of unique and varied sequences. Now, let s observe the text it generated. sequence, max_score get_best_sequence graph print f Generated text sequence Generated text I have a dream. I m going to The nucleus sampling algorithm produces the sequence I have a dream. I m going to , which shows a notable enhancement in semantic coherence compared to greedy sampling. To compare the decision paths, let s visualize the new tree nucleus sampling generated. Plot graph plot_graph graph, length, beams, sequence As with top k sampling, this tree is very different from the one generated with greedy sampling, displaying more variety. Both top k and nucleus sampling offer unique advantages when generating text, enhancing diversity, and introducing creativity into the output. Your choice between the two methods or even greedy search will depend on the specific requirements and constraints of your project. Conclusion In this article, we have delved deep into various decoding methods used by LLMs, specifically GPT 2. We started with a simply greedy search and its immediate yet often suboptimal selection of the most probable next token. 
Next, we introduced the beam search technique, which considers several of the most likely tokens at each step. Although it offers more nuanced results, beam search can sometimes fall short in generating diverse and creative sequences. To bring more variability into the process, we then moved on to top-k sampling and nucleus sampling. Top-k sampling diversifies the text generation by randomly selecting among the _k_ most probable tokens, while nucleus sampling takes a different path by dynamically forming a nucleus of tokens based on cumulative probability. Each of these methods brings unique strengths and potential drawbacks to the table, and the specific requirements of your project will largely dictate the choice among them. Ultimately, understanding these techniques and their trade-offs will equip you to better guide LLMs towards producing increasingly realistic, nuanced, and compelling textual output. If you're interested in more technical content around LLMs, you can follow me on Twitter maximelabonne. ", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/decoding-strategies-in-large-language-models-9733a8f70539" }, { "id": "d0f2f790-c745-4858-a2c5-e4daeedb53cf", "content": "The Art of Spending: Optimizing Your Marketing Budget with Nonlinear Optimization. Introduction to CVXPY to maximize marketing ROI. Maxime Labonne, May 22, 2023. In the age of digital marketing, businesses face the challenge of allocating their marketing budget across multiple channels to maximize sales. However, as they broaden their reach, these firms inevitably face the issue of diminishing returns: the phenomenon where additional investment in a marketing channel yields progressively smaller increases in conversions. This is where the concept of marketing budget allocation steps in, adding another layer of complexity to the whole process. In this article, we're going to explore the potential of nonlinear programming, specifically conic optimization (or cone programming), as a tool for marketing budget allocation. With the use of this advanced mathematical technique, we aim to optimize the distribution of marketing budget across various platforms to extract the maximum value and the highest possible ROI. The code is available on GitHub and Google Colab. 
Marketing budget allocation Marketing budget allocation is a critical aspect of any advertising campaign, requiring businesses to strategically distribute their resources across different channels. The goal is to maximize the effectiveness of their marketing efforts and achieve the highest possible return on investment ROI . To tackle this challenge, we need to consider three key components 1. Attribution How can we connect conversion events to specific campaigns? 2. Performance Estimation How can we predict the performance of a campaign based on its allocated budget? 3. Optimization How can we allocate budgets across various campaigns to maximize ROI? 1. Attribution Connecting Conversions to Campaigns Attribution is the process of determining which campaigns are responsible for converting customers. Some channels, like Facebook or AdWords, can directly claim conversions. However, there are various attribution models to consider, including First touch Last touch Multi touch Time decay Position based Attribution systems are not without their issues, with two main challenges Lag The time it takes to measure the performance of ads and attribute conversions accurately Attribution Window The trade off between using a short versus a long window to attribute conversions For example, DoorDash used a several day last touch attribution system. The problem they faced was the need to wait for several days to measure the performance of their ads, which proved too lengthy given the rapid changes in their market. 2. Performance Estimation Predicting Campaign Success Performance estimation involves creating a model that can predict the success of a marketing campaign based on its budget allocation. Here, success can be defined in terms of various Key Performance Indicators KPIs , such as Leads Cost per Lead CPL Customer Lifetime Value CLV Customer Acquisition Cost CAC Traditionally, linear models have been used for performance estimation. However, they assume that marketing channels don t exhibit diminishing returns , which is often not the case. To obtain nontrivial solutions, linear models typically incorporate multiple constraints and are solved using Linear Programming LP . In reality, response curves in marketing mix modeling often display different shapes, such as Linear rare Concave common, indicating diminishing returns Convex rare S shaped rare Image by author These shapes reflect the diminishing returns of marketing spending or the varying effectiveness of different channels at different budget levels. For example, investing more money into a channel might initially yield higher returns convex , but after a certain point, each additional dollar may generate less and less incremental outcome becoming concave , creating an S shaped curve overall. To capture the intrinsic nonlinearity of the marketing budget allocation problem, a more sophisticated approach is needed. This is where nonlinear programming, specifically conic optimization, comes into play. 3. Optimization Nonlinear Optimization with CVXPY Nonlinear programming, also known as nonlinear optimization, is a method used to solve optimization problems where the objective function, constraints , or both, are nonlinear . In simple terms, it s the process of finding the optimal solution either maximizing or minimizing for a system that s governed by a set of nonlinear equations. 
In this example, we will model the returns for each marketing channel response curve using the natural logarithm as follows The two previous steps of attribution and performance estimation approximate the values of \u03b1\u1d62 and \u03b2\u1d62 for every channel _i_. Let s take a simple example with three channels The noise observed in these values is typical in marketing budget allocation problems. Note that the alpha values are negative this can be interpreted as the initial cost of engaging with a new marketing channel. We can plot the response curves of each marketing channel using matplotlib. import matplotlib.pyplot as plt import numpy as np np.random.seed 0 TOTAL_BUDGET 100_000 Alpha and beta constants alphas np.array 9453.72, 8312.84, 7371.33 betas np.array 8256.21, 7764.20, 7953.36 Linearly spaced numbers x np.linspace 1, TOTAL_BUDGET, TOTAL_BUDGET Plot the response curves fig plt.figure figsize 10, 5 , dpi 300 plt.plot x, alphas 0 betas 0 np.log x , color red , label Google Ads plt.plot x, alphas 1 betas 1 np.log x , color blue , label Facebook Ads plt.plot x, alphas 2 betas 2 np.log x , color green , label Twitter Ads plt.xlabel Budget plt.ylabel Returns plt.legend plt.show How to find the best values for each response curve? The easiest solution consists of a greedy algorithm that randomly samples values and evaluates the result. Our optimization problem can be described as follows The following function has a budget of 1,000 iterations to find the best allocation. def greedy_optimization TOTAL_BUDGET, alphas, betas, num_iterations 1_000 Initialize the budget allocation and the best objective value google_budget facebook_budget twitter_budget TOTAL_BUDGET 3 obj alphas 0 betas 0 np.log google_budget alphas 1 betas 1 np.log facebook_budget alphas 2 betas 2 np.log twitter_budget for _ in range num_iterations Generate a new random allocation random_allocation np.random.dirichlet np.ones 3 TOTAL_BUDGET google_budget_new, facebook_budget_new, twitter_budget_new random_allocation Calculate the new objective value new_obj alphas 0 betas 0 np.log google_budget_new alphas 1 betas 1 np.log facebook_budget_new alphas 2 betas 2 np.log twitter_budget_new If the new allocation improves the objective value, keep it if new_obj obj google_budget, facebook_budget, twitter_budget google_budget_new, facebook_budget_new, twitter_budget_new obj new_obj Return the best allocation and the corresponding objective value return google_budget, facebook_budget, twitter_budget , objp Let s run it and see the approximated solution it found Run the greedy optimization best_google, best_facebook, best_twitter , obj greedy_optimization TOTAL_BUDGET, alphas, betas Print the result print 59 n 24 Solution 24 n 59 print f Returns round obj , n print Marketing allocation print f Google Ads round best_google , print f Facebook Ads round best_facebook , print f Twitter Ads round best_twitter , Solution Returns 224,534 Marketing allocation Google Ads 35,476 Facebook Ads 31,722 Twitter Ads 32,802 After running our calculations, we find that our total return is 224,533. You might wonder if we can improve it by tweaking our model more or running more iterations. This kind of guarantee is exactly where nonlinear programming comes to the rescue it can output the best solution possible , also called the optimal solution. On top of this overwhelming advantage, it is also faster to run. 
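Written out, the problem handed to the solver in the next section is the following concave maximization, reconstructed from the response-curve model and the constraints described above:

\max_{x_1, x_2, x_3} \; \sum_{i=1}^{3} \left( \alpha_i + \beta_i \ln x_i \right)
\quad \text{s.t.} \quad x_1 + x_2 + x_3 \le B, \qquad x_i > 0,

where x_i is the budget allocated to channel i (Google, Facebook, Twitter) and B is the total budget (100,000 here). Since the logarithm is concave, the objective is concave over the feasible set, which is what lets a conic solver certify a globally optimal allocation.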
To solve the marketing budget allocation problem using nonlinear programming, we ll use the CVXPY library, which supports conic optimization thanks to specialized solvers like ECOS, MOSEK interior point method , and SCS first order method . In this example, we ll use the open source ECOS solver to find the optimal solution. Let s set up the optimization problem Our decision variables are the positive budgets for each channel Our constraint is that the sum of all budgets must not exceed the total budget Our objective is to maximize the total return, which is the sum of the returns for each channel import cvxpy as cp Variables google cp.Variable pos True facebook cp.Variable pos True twitter cp.Variable pos True Constraint constraint google facebook twitter TOTAL_BUDGET Objective obj cp.Maximize alphas 0 betas 0 cp.log google alphas 1 betas 1 cp.log facebook alphas 2 betas 2 cp.log twitter Finally, we call the ECOS solver to find the optimal budget allocations and display the results. Solve prob cp.Problem obj, constraint prob.solve solver ECOS , verbose False Print solution print 59 n 24 Solution 24 n 59 print f Status prob.status print f Returns round prob.value , n print Marketing allocation print f Google Ads round google.value , print f Facebook Ads round facebook.value , print f Twitter Ads round twitter.value , Solution Status optimal Returns 224,540 Marketing allocation Google Ads 34,439 Facebook Ads 32,386 Twitter Ads 33,175 The optimal allocation found by the solver is 34,439 for Google Ads, 32,386 for Facebook Ads, and 33,175 for YouTube, for a total return of 224,540! This is 7 higher than what the greedy algorithm returned 224,533 . Keep in mind that this allocation maximizes the returns based on our response curves correctly modeling these curves is crucial for optimizing the budget effectively. Let s visualize this optimal allocation on top of the previous response curves. Plot the functions and the results fig plt.figure figsize 10, 5 , dpi 300 plt.plot x, alphas 0 betas 0 np.log x , color red , label Google Ads plt.plot x, alphas 1 betas 1 np.log x , color blue , label Facebook Ads plt.plot x, alphas 2 betas 2 np.log x , color green , label Twitter Ads Plot optimal points plt.scatter google.value, facebook.value, twitter.value , alphas 0 betas 0 np.log google.value , alphas 1 betas 1 np.log facebook.value , alphas 2 betas 2 np.log twitter.value , marker , color black , zorder 10 plt.xlabel Budget plt.ylabel Returns plt.legend plt.show But is it really optimal ? We can do a quick sanity check by running the greedy algorithm for different numbers of iterations. This will show us the difference between these two approaches. Let s run it for 20 different numbers of iterations between 1 and 1,000,000. 
List to store the best objective value for each number of iterations best_obj_list Range of number of iterations to test num_iterations_range np.logspace 0, 6, 20 .astype int Run the greedy algorithm for each number of iterations and store the best objective value for num_iterations in num_iterations_range _, best_obj greedy_optimization TOTAL_BUDGET, alphas, betas, num_iterations best_obj_list.append best_obj We can now plot the resulting list using matplotlib and compare it to the optimal solution Plot the results plt.figure figsize 10, 5 , dpi 300 plt.ticklabel_format useOffset False plt.plot num_iterations_range, best_obj_list, label Greedy algorithm plt.axhline y prob.value, color r , linestyle , label Optimal solution CVXPY plt.xlabel Number of iterations plt.xticks num_iterations_range plt.xscale log plt.ylabel Best returns plt.title Best returns found by the greedy algorithm for different numbers of iterations plt.legend plt.show We observe that the greedy algorithm performs relatively well when given a large number of iterations. However, despite one million attempts, it falls just short of finding the optimal allocation, which yields a return of 224,540.1500. The best non rounded value it could reach is 224,540.1489. To add to this, there s a significant difference in terms of computational speed between the two approaches. The nonlinear programming model identified the optimal solution in a swift 22.3 milliseconds. In stark contrast, the greedy algorithm took a considerable 30 seconds to run its 1 million iterations and find a nearly optimal solution. This disparity becomes even more crucial when we extend our problem to numerous marketing channels . Nonlinear programming with CVXPY maintains its speed and precision, making it a highly efficient tool for complex, high dimensional marketing budget allocation problems. Conclusion Nonlinear programming offers a powerful approach to tackling the marketing budget allocation problem. By modeling the diminishing returns of each marketing channel with nonlinear functions and leveraging the CVXPY library, we can find the optimal allocation of resources that maximizes sales. As the marketing landscape evolves and the number of channels increases, optimization techniques like nonlinear programming can help businesses make better, data driven decisions about their marketing investments. While this article provides a starting point, there are many more advanced techniques and models to explore. Keep learning and experimenting to find the best approach for your business. If you re interested to know more about it, feel free to follow me on Twitter maximelabonne. Happy optimizing! References If you want to learn more about marketing budget allocation, I recommend the following resources Park et al., A Nonlinear Optimization Model of Advertising Budget Allocation across Multiple Digital Media Channels 2022 an excellent approach based on diminishing returns, which inspired this article. Zhao et al., A Unified Framework for Marketing Budget Allocation 2019 fascinating architecture currently in production at Alibaba, based on a logit response curve. Katsov, Cross channel marketing spend optimization using deep learning 2019 blog post about an intriguing LSTM based approach, without convex optimization. Related articles Introduction to Linear Programming in Python _A guide to mathematical optimization with Google OR Tools_towardsdatascience.com Integer vs. 
Linear Programming in Python _A guide to identify and solve any optimization problem_ towardsdatascience.com ", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/the-art-of-spending-optimizing-your-marketing-budget-with-nonlinear-optimization-6c8a39afb3c2" }, { "id": "319b83ba-c6bd-44bf-9f73-91096f4a0c47", "content": "Reinforcement Learning in Minecraft: Create a Bot to Find Diamonds. Reinforcement Learning and Behavior Cloning in Python with MineRL. Maxime Labonne, May 25, 2022. Minecraft is an incredible challenge for reinforcement learning. It's a huge game, with many mechanics and complex sequences of actions. It takes an entire wiki with over 8,000 pages just to teach humans how to play Minecraft. So how good can machine learning be? This is the question we'll answer in this article. We'll design a bot and try to achieve one of the most difficult challenges in Minecraft: finding diamonds from scratch. To make things even worse, we will take on this challenge in randomly generated worlds, so we can't learn a particular seed. Sequence of actions to find diamonds (image by author, Mojang license). What we're going to cover is not limited to Minecraft: it can be applied to similarly complex environments. More specifically, we will implement two different techniques that will become the backbone of our intelligent agent. But before we can train an agent, we need to understand how to interact with the environment. Let's start with a scripted bot to get familiar with the syntax. We'll use MineRL, a fantastic library to build AI applications in Minecraft. The code used in this article is available on Google Colab. It is a simplified and fine-tuned version of the excellent notebooks made by the organizers of the MineRL 2021 competition (MIT License). I. Scripted bot MineRL allows us to launch Minecraft in Python and interact with the game. This is done through the popular gym library. env = gym.make('MineRLObtainDiamond-v0') env.seed(21) We are in front of a tree. As you can see, the resolution is quite low. A low resolution means fewer pixels, which speeds things up. Fortunately for us, neural networks don't need a 4K resolution to understand what's happening on screen. Now, we would like to interact with the game. What can our agent do? 
Here s the list of possible actions List of actions image by author The first step to find diamonds is to get wood to make a crafting table and a wooden pickaxe. Let s try to get closer to the tree. It means that we need to hold the forward button for less than a second. With MineRL, there are 20 actions processed per second we don t need a full second so let s process it 5 times, and wait for 40 more ticks. Image by author Define the sequence of actions script forward 5 40 env gym.make MineRLObtainDiamond v0 env Recorder env, . video , fps 60 env.seed 21 obs env.reset for action in script Get the action space dict of possible actions action_space env.action_space.noop Activate the selected action in the script action_space action 1 Update the environment with the new action space obs, reward, done, _ env.step action_space env.release env.play Image by author Great, let s chop this tree now. We need four actions in total Forward to go in front of the tree Attack to chop the tree Camera to look up or down Jump to get the final piece of wood. Image by author Handling the camera can be a hassle. To simplify the syntax, we re gonna use the str_to_act function from this GitHub repository MIT license . This is what the new script looks like script script 20 script forward 5 script attack 61 script camera 10,0 7 Look up script attack 240 script jump script forward 10 Jump forward script camera 10,0 2 Look up script attack 150 script camera 10,0 7 Look down script 40 for action in tqdm script obs, reward, done, _ env.step str_to_act env, action env.release env.play The agent efficiently chopped the entire tree . This is a good start, but we would like to do it in a more automated way II. Deep Learning Our bot works well in a fixed environment, but what happens if we change the seed or its starting point? Everything is scripted so the agent would probably try to chop a non existent tree. This approach is too static for our requirements we need something that can adapt to new environments. Instead of scripting orders, we want an AI that knows how to chop trees. Naturally, reinforcement learning is a pertinent framework to train this agent. More specifically, deep RL seems to be the solution since we re processing images to select the best actions. There are two ways of implementing it Pure deep RL the agent is trained from scratch by interacting with the environment. It is rewarded every time it chops a tree. Imitation learning the agent learns how to chop trees from a dataset. In this case, it is a sequence of actions to chop trees made by a human. The two approaches have the same outcome, but they re not equivalent. According to the authors of the MineRL 2021 competition, it takes 8 hours for the pure RL solution and 15 minutes for the imitation learning agent to reach the same level of performance. We don t have that much time to spend, so we re going for the Imitation Learning solution. This technique is also called Behavior Cloning , which is the simplest form of imitation. Note that Imitation Learning is not always more efficient than RL. If you want to know more about it, Kumar et al. wrote a great blog post about this topic. Image by author The problem is reduced to a multi class classification task. Our dataset consists of mp4 videos, so we ll use a Convolutional Neural Network CNN to translate these images into relevant actions. Our goal is also to limit the number of actions classes that can be taken so the CNN has fewer options, which means it ll be trained more efficiently. 
class CNN nn.Module def __init__ self, input_shape, output_dim super .__init__ n_input_channels input_shape 0 self.cnn nn.Sequential nn.Conv2d n_input_channels, 32, kernel_size 8, stride 4 , nn.BatchNorm2d 32 , nn.ReLU , nn.Conv2d 32, 64, kernel_size 4, stride 2 , nn.BatchNorm2d 64 , nn.ReLU , nn.Conv2d 64, 64, kernel_size 3, stride 1 , nn.BatchNorm2d 64 , nn.ReLU , nn.Flatten , nn.Linear 1024, 512 , nn.ReLU , nn.Linear 512, output_dim def forward self, observations return self.cnn observations def dataset_action_batch_to_actions dataset_actions, camera_margin 5 ... class ActionShaping gym.ActionWrapper ... In this example, we manually define 7 relevant actions attack, forward, jump, and move the camera left, right, up, down . Another popular approach is to apply K means in order to automatically retrieve the most relevant actions taken by humans. In any case, the objective is to discard the least useful actions to complete our objective, such as crafting in our example. Let s train our CNN on the MineRLTreechop v0 dataset. Other datasets can be found at this address. We chose a learning rate of 0.0001 and 6 epochs with a batch size of 32. Get data minerl.data.download directory data , environment MineRLTreechop v0 data minerl.data.make MineRLTreechop v0 , data_dir data , num_workers 2 Model model CNN 3, 64, 64 , 7 .cuda optimizer torch.optim.Adam model.parameters , lr 0.0001 criterion nn.CrossEntropyLoss Training loop step 0 losses for state, action, _, _, _ in tqdm data.batch_iter num_epochs 6, batch_size 32, seq_len 1 Get pov observations obs state pov .squeeze .astype np.float32 Transpose and normalize obs obs.transpose 0, 3, 1, 2 255.0 Translate batch of actions for the ActionShaping wrapper actions dataset_action_batch_to_actions action Remove samples with no corresponding action mask actions ! 1 obs obs mask actions actions mask Update weights with backprop logits model torch.from_numpy obs .float .cuda loss criterion logits, torch.from_numpy actions .long .cuda optimizer.zero_grad loss.backward optimizer.step Print loss step 1 losses.append loss.item if step 2000 0 mean_loss sum losses len losses tqdm.write f Step step 5 Training loss mean_loss .3f losses.clear Step 4000 Training loss 0.878 Step 8000 Training loss 0.826 Step 12000 Training loss 0.805 Step 16000 Training loss 0.773 Step 20000 Training loss 0.789 Step 24000 Training loss 0.816 Step 28000 Training loss 0.769 Step 32000 Training loss 0.777 Step 36000 Training loss 0.738 Step 40000 Training loss 0.751 Step 44000 Training loss 0.764 Step 48000 Training loss 0.732 Step 52000 Training loss 0.748 Step 56000 Training loss 0.765 Step 60000 Training loss 0.735 Step 64000 Training loss 0.716 Step 68000 Training loss 0.710 Step 72000 Training loss 0.693 Step 76000 Training loss 0.695 Our model is trained. We can now instantiate an environment and see how it behaves. If the training was successful, it should frantically cut all the trees in sight . This time, we ll use the ActionShaping wrapper to map the array of numbers created with dataset_action_batch_to_actions to discrete actions in MineRL. Our model needs a pov observation in the correct format and outputs logits. These logits can be turned into a probability distribution over a set of 7 actions with the softmax function. We then randomly choose an action based on the probabilities. The selected action is implemented in MineRL thanks to env.step action . This process is repeated as many times as we want. Let s do it 1000 times and watch the result. 
model CNN 3, 64, 64 , 7 .cuda model.load_state_dict torch.load model.pth env gym.make MineRLObtainDiamond v0 env1 Recorder env, . video , fps 60 env ActionShaping env1 action_list np.arange env.action_space.n obs env.reset for step in tqdm range 1000 Get input in the correct format obs torch.from_numpy obs pov .transpose 2, 0, 1 None .astype np.float32 255 .cuda Turn logits into probabilities probabilities torch.softmax model obs , dim 1 0 .detach .cpu .numpy Sample action according to the probabilities action np.random.choice action_list, p probabilities obs, reward, _, _ env.step action env1.release env1.play Our agent is quite chaotic but it manages to chop trees in this new, unseen environment . Now, how to find diamonds? III. Script Imitation Learning A simple yet powerful approach consists of combining scripted actions with artificial intelligence. Learn the boring stuff, script the knowledge. In this paradigm, we ll use the CNN to get a healthy amount of wood 3000 steps . Then, we can script a sequence to craft planks, sticks, a crafting table, a wooden pickaxe, and start mining stone it should be below our feet . This stone can then be used to craft a stone pickaxe, which can mine iron ore. CNN script approach, image by author Mojang license This is when things get complicated iron ore is quite rare , so we would need to run the game for a while to find a deposit. Then, we would have to craft a furnace and melt it to get the iron pickaxe. Finally, we would have to go even deeper and be even luckier to obtain a diamond without falling into lava. As you can see, it s doable but the outcome is fairly random. We could train another agent to find diamonds, and even a third one to create the iron pickaxe. If you re interested in more complex approaches, you can read the results of the MineRL Diamond 2021 Competition by Kanervisto et al. It describes several solutions using different clever techniques, including end to end deep learning architectures. Nonetheless, it is a complex problem and no team managed to consistently find diamonds, if at all. This is why we will limit ourselves to obtaining a stone pickaxe in the following example, but you can modify the code to go further. obs env_script.reset done False 1. Get wood with the CNN for i in tqdm range 3000 obs torch.from_numpy obs pov .transpose 2, 0, 1 None .astype np.float32 255 .cuda probabilities torch.softmax model obs , dim 1 0 .detach .cpu .numpy action np.random.choice action_list, p probabilities obs, reward, done, _ env_script.step action if done break 2. Craft stone pickaxe with scripted actions if not done for action in tqdm script obs, reward, done, _ env_cnn.step str_to_act env_cnn, action if done break print obs inventory env_cnn.release env_cnn.play We can see our agent chopping wood like a madman during the first 3000 steps, then our script takes over and completes the task. It might not be obvious, but the command print obs.inventory shows a stone pickaxe. Note that this is a cherry picked example most of the runs don t end that well. There are several reasons why the agent may fail it can spawn in a hostile environment water, lava, etc. , in an area without wood, or even fall and die. Playing with different seeds will give you a good understanding of the complexity of this problem and, hopefully, ideas to build event better agents. Conclusion I hope you enjoyed this little guide to reinforcement learning in Minecraft. Beyond its obvious popularity, Minecraft is an interesting environment to try and test RL agents. 
Like NetHack, it requires a thorough knowledge of its mechanics to plan precise sequences of actions in a procedurally generated world. In this article, we learned how to use MineRL, saw two approaches (scripting and behavior cloning) and how to combine them, and visualized the agent's actions with short videos. The main drawback of the environment is its slow processing time. Minecraft is not a lightweight game like NetHack or Pong, which is why the agents take a long time to train. If this is a problem for you, I would recommend lighter environments like Gym Retro. Thank you for your attention! Feel free to follow me on Twitter if you're interested in AI applied to video games. ", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/create-a-bot-to-find-diamonds-in-minecraft-d836606a993a" }, { "id": "fef26b86-df5b-4379-8e7d-03bb90767e4e", "content": "Constraint Programming in Python. The Programming Paradigm to Find One Solution Among 8,080,104 Candidates. Maxime Labonne, May 02, 2022. Constraint Programming is a technique to find every solution that respects a set of predefined constraints. It is an invaluable tool for data scientists to solve a huge variety of problems, such as scheduling, timetabling, sequencing, etc. In this article, we'll see how to use CP in two different ways: 1. Satisfiability: the goal is to find one or multiple feasible solutions, _i.e._, solutions that respect our constraints, by narrowing down a large set of potential solutions. 2. Optimization: the goal is to find the best feasible solution according to an objective function, just like Linear Programming (LP). We'll use CP-SAT from Google OR-Tools, an excellent free and open-source CP solver. Note that it is different from MPSolver, which is dedicated to Linear and Mixed Integer Programming. The difference between CP and LP is quite confusing; we'll touch on this topic at the end of the article. You can run the code with the following Google Colab notebook. I. Satisfiability with the 3 scouts problem In the previous article, we created an army to defeat our opponent. But there was one small problem: we had to guess how powerful his army was. This time, let's send scouts to know the exact number. 
Our 3 scouts observed the enemy camp, and this is what they tell us Scout 1 _the number of soldiers is a multiple of 13_ Scout 2 _the number of soldiers is a multiple of 19_ Scout 3 _the number of soldiers is a multiple of 37_ They all agree that the number of soldiers doesn t exceed 10,000 . Our scouts have a personal way of counting soldiers, but we can combine these three observations to make a model. Let s call the number of soldiers _army_. We can translate our problem into the following congruence system If you re not familiar with this notation, this is what it means in programming terms Let s implement it with OR Tools. The first thing we need to do is to import and create the CP SAT model and solver . The modeling process is very similar to what we did in Linear Programming. The first step to create our CP model is to declare the variables . In this example, we only have one _army_ , the number of soldiers. We have to give lower and upper bounds. The lower bound is 1 since we know there s an army, and the upper bound is 10,000 according to the scouts In OR Tools, we use the NewIntVar method to create this variable. The second step is to declare the constraints . We identified three constraints in this example. Modulo is a special operator, so we need a specific function to handle it with CP SAT AddModuloEquality . You can find a reference guide at this address if you need other methods. Unlike Linear Programming, we don t have to define an objective function here. The reason is simple there is nothing to optimize! We just want to find a feasible solution that satisfies our constraints, but there is no good or bad answers. This is a key feature of Constraint Programming. Our model is complete , we can now ask OR Tools to solve it. Solution Solved in 0.00 milliseconds Army 9139 Check solution Constraint 1 9139 13 0 Constraint 2 9139 19 0 Constraint 3 9139 37 0 We obtained our solution in less than a millisecond there are 9,139 soldiers in the enemy army. Huzzah, we can now fire the scouts! We limited the search space with an upper bound of 10,000, which gave us a unique solution . But is it still the case if we push this limit? Another perk of CP is the ability to find every possible solution to a problem. This might take a long time when the search space is large because the solver has to brute force the entire space instead of reducing it with heuristics . Let s explore this feature by printing every possible solution with a new upper bound of 100,000 . With OR Tools, we ask the solver to look for every possible solution thanks to the enumerate_all_solutions parameter. We then assign it a callback class that prints every solution the solver finds. We found 10 solutions ! This was to be expected since we increased the upper bound tenfold these solutions all are multiples of 9,139. As you can see, this example has nothing to do with optimization it s a pure satisfiability problem . On another note, this congruence system can be solved manually with the Chinese remainder theorem. But CP is not limited to that II. Optimization and beer Image by author, emojis by OpenMoji CC BY SA 4.0 Let s see another problem our army will face the enemy in a few days. In the meantime, the quartermaster has to prepare the rations that will be used during the campaign. The space in the supply wagons is limited and some rations are more popular than others. 
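Before diving into the rations, here is a minimal sketch of the scouts model from section I in OR-Tools' CP-SAT, reconstructed from the description above (variable names are illustrative, not the exact notebook code):

from ortools.sat.python import cp_model

model = cp_model.CpModel()

# One variable: the army size, bounded by the scouts' report (1 to 10,000).
army = model.NewIntVar(1, 10_000, "army")

# The three observations: army is a multiple of 13, 19, and 37.
model.AddModuloEquality(0, army, 13)
model.AddModuloEquality(0, army, 19)
model.AddModuloEquality(0, army, 37)

# No objective: we only ask for a feasible assignment.
solver = cp_model.CpSolver()
status = solver.Solve(model)

if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("Army =", solver.Value(army))  # 9,139 = 13 * 19 * 37

With the satisfiability model written out, let's now look at the rations.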
There are three possible rations Bread it takes only 1 space but soldiers don t like it that much with a popularity of 3 Meat it takes 3 spaces and has a popularity of 10 Beer it takes 7 spaces but soldiers love it with a popularity of 26. Image by author, emojis by OpenMoji CC BY SA 4.0 The supply wagons have a capacity of 19 spaces . How to select the best rations to maximize the popularity? This is an optimization problem we ve already seen actually, it is a variant of the famous knapsack problem. We could reuse the code from the previous article and just change the input parameters. This time, we ll solve it using Constraint Programming. This paradigm is not limited to finding feasible solutions. It can also perform optimization using different algorithms to handle this overhead. Let s create a model of the problem. First of all, we have to declare three variables bread , meat , and beer . It s possible to have 0 of them, but their number cannot exceed the maximal capacity. This time, we only have one constraint the space occupied by the bread, the meat, and the beer cannot exceed the wagons capacity 19 . We want to maximize the total popularity of the rations that are selected The model is complete, CP SAT can solve the problem ! Solution Solved in 0.00 milliseconds Optimal value 68 popularity Food Bread 2 Meat 1 Beer 2 We obtained the highest popularity 68 possible with a capacity of 19. Is the constraint respected? Let s quickly check it 1 2 3 1 7 2 19, which is indeed 19. Okay, I d like to ask another question how many solutions to this problem are there? Once again, we can answer it with a specific callback to count them. 121 We found 121 solutions with a capacity of 19. But this number quickly increases with a capacity of 1000, there are 8,080,104 possible solutions! And yet, CP SAT finds the optimal solution in less than a second. How is it possible? CP solvers do not brute force the problem with an exhaustive search but combine heuristics and combinatorial search instead. More specifically, the three most popular techniques for constraint satisfaction problems are backtracking , constraint propagation , and local search . CP SAT is quite particular since it combines CP and SAT it is part of a broader trend of merging CP, LP, SAT, and metaheuristics. We said that the previous problem could be solved with Linear Programming, so let s compare the code of both solutions Left LP code, Right CP code image by author As you can see, the syntax is quite similar but it s not the same model solver vs. solver, NewIntVar instead of IntVar , etc. There s a bit of translation to do, but it s easily manageable. These two techniques are incredibly close to each other they both handle variables with constraints and perform optimization using math and heuristics. However, CP is limited to discrete parameters, while LP handles continuous ones. On the other hand, you can implement specialized constraints like all different in CP, but not in LP. Here is a summary of the main differences between these two technologies Image by author, emojis by OpenMoji CC BY SA 4.0 If you want to know more about this topic, I would recommend this article by Irvin J. Lustig and Jean Fran\u00e7ois Puget. CPLEX s documentation also details the differences at this address, in terms of modeling and optimization. Conclusion Image by author Constraint Programming is another incredible technique in the mathematical optimization toolbox. It is a radically different approach compared to traditional, declarative programming. 
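As a compact recap, the rations model from section II can be sketched in CP-SAT as follows (reconstructed from the description above; the values match the 19-space example):

from ortools.sat.python import cp_model

CAPACITY = 19

model = cp_model.CpModel()

# How many of each ration to bring (0 up to whatever could fit in the wagons).
bread = model.NewIntVar(0, CAPACITY, "bread")
meat = model.NewIntVar(0, CAPACITY, "meat")
beer = model.NewIntVar(0, CAPACITY, "beer")

# Space constraint: 1, 3, and 7 spaces per unit, 19 spaces in total.
model.Add(1 * bread + 3 * meat + 7 * beer <= CAPACITY)

# Objective: maximize popularity (3, 10, and 26 per unit).
model.Maximize(3 * bread + 10 * meat + 26 * beer)

solver = cp_model.CpSolver()
solver.Solve(model)
print("Popularity =", solver.ObjectiveValue())                       # 68
print(solver.Value(bread), solver.Value(meat), solver.Value(beer))   # 2, 1, 2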
In this article, we saw two applications of CP with satisfiability and optimization, implemented CP models in OR-Tools and played with the callback function, and highlighted the differences between CP and LP. We limited ourselves to simple problems in this introduction, but CP has amazing applications in complex scheduling and routing problems. This is a topic I'd love to address in a future article. If you're interested in knowing more about it, feel free to follow me on Twitter at maximelabonne. Thanks for your attention! Related articles: Introduction to Linear Programming in Python _A guide to mathematical optimization with Google OR Tools_ towardsdatascience.com; Integer vs. Linear Programming in Python _A guide to identify and solve any optimization problem_ towardsdatascience.com ", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/constraint-programming-67ac16fa0c81" }, { "id": "9de9825b-36e8-4512-b1c8-4c1d60fbcb6c", "content": "GIN: How to Design the Most Powerful Graph Neural Network. Graph classification with Graph Isomorphism Networks. Maxime Labonne, Apr 27, 2022. Graph Neural Networks are not limited to classifying nodes. One of the most popular applications is graph classification. This is a common task when dealing with molecules: they are represented as graphs, and features about each atom (node) can be used to predict the behavior of the entire molecule. However, GNNs only learn node embeddings. How do we combine them to produce an entire graph embedding? In this article, we will see a new type of layer, called global pooling, to combine node embeddings, and introduce a new architecture called the Graph Isomorphism Network (GIN), designed by Xu et al. in 2018. We'll detail the advantages of GIN in terms of discriminative power compared to a GCN or GraphSAGE, and its connection to the Weisfeiler-Lehman test. Beyond its powerful aggregator, GIN brings exciting takeaways about GNNs in general. You can run the code with the following Google Colab notebook. I. PROTEINS dataset 3D plot of a protein (image by author). PROTEINS¹ is a popular dataset in bioinformatics. It is a collection of 1113 graphs representing proteins, where nodes are amino acids. Two nodes are connected by an edge when they are close enough (under 0.6 nanometers). The goal is to classify each protein as an enzyme or not. Enzymes are a particular type of protein that act as catalysts to speed up chemical reactions in the cell.
They are essential for digestion e.g., lipases , respiration e.g., oxidases , and other crucial functions of the human body. They are also used in commercial applications, like the production of antibiotics. This dataset is also available on TUDataset\u00b9 and implemented in PyTorch Geometric. Dataset PROTEINS 1113 Number of graphs 1113 Number of nodes 23 Number of features 3 Number of classes 2 I m not a biochemist so I m curious about these proteins. Let s plot one as a graph to see what it looks like 3D plot of a protein with matplotlib image by author The previous 3D structure is randomly generated obtaining the correct 3D representation is a problem so difficult it s the whole point of AlphaFold. Graphs are not the only way to represent molecules. The simplified molecular input line entry system SMILES is another popular method, which uses a line string notation. It is obtained by printing the nodes encountered in a depth first tree traversal of a slightly modified molecular graph. Researchers often use this representation when working with molecules or chemical compounds. Fortunately for us, the PROTEINS dataset is already encoded in the form of graphs. Otherwise, we could have to translate the SMILES strings into networkx graphs. It doesn t mean we ll directly feed the PROTEINS dataset to our GNN. If GraphSAGE taught us anything, it s that mini batching is incredibly efficient . It is now an indispensable tool whenever we implement a GNN. Training set 890 graphs 14 subgraphs Validation set 111 graphs 2 subgraphs Test set 112 graphs 2 subgraphs PROTEINS is not a huge dataset, but mini batching will s peed up the training nonetheless. We could use a GCN or a GAT, but there s a new architecture I d like to introduce the Graph Isomorphism Network . II. Graph Isomorphism Network GIN GIN was designed by researchers trying to maximize the representational or discriminative power of a GNN. But how do you define a representational power ? A. Weisfeiler Lehman test A way to characterize the power of a GNN is to use the Weisfeiler Lehman WL graph isomorphism test. Isomorphic graphs mean that they have the same structure identical connections but a permutation of nodes. The WL test is able to tell if two graphs are non isomorphic, but it cannot guarantee that they are isomorphic. Two isomorphic graphs image by author This might not seem like much, but it can be extremely difficult to tell two large graphs apart. In fact, this problem is not known to be solvable in polynomial time, nor to be NP complete. It might even be somewhere in between, in the computational complexity class NP intermediate if it only exists . Okay, but how is it related to GNNs? Some researchers in graph learning noticed that this test and the way GNNs learn are oddly similar . In the WL test, 1. Every node starts with the same label 2. Labels from neighboring nodes are aggregated and hashed to produce a new label 3. The previous step is repeated until the labels stop changing . If you re interested in the WL test, I would recommend this blog post by David Bieber and this article by Michael Bronstein. Not only this test is similar to how feature vectors are aggregated in GNNs, but its ability to tell graphs apart makes it more powerful than a lot of architectures, including GCNs and GraphSAGE. This is what inspired Xu et al.\u00b2 to design a new aggregator that they proved to be as good as the WL test. B. 
One aggregator to rule them all To be as good as the WL test, this new aggregator must produce different node embeddings when dealing with non isomorphic graphs. We ll skip the math heavy part of the paper, but the solution they found is to use two injective functions. Which ones? We don t know, we can just learn them with a MLP! With GATs, we used a neural network to learn the best weighting factors for a given task With GINs, we now learn the approximation of two injective functions thanks to the Universal Approximation Theorem. Here s how to calculate the hidden vector of a particular node _i_ with GIN In this formula, \u025b determines the importance of the target node compared to its neighbors it has the same importance if \u025b 0 . It can be a learnable parameter or a fixed scalar. Note that we talk about MLPs to highlight the fact that there is more than one layer. According to the authors, one layer is not sufficient for graph learning in general. C. Global pooling Global pooling or graph level readout consists of producing a graph embedding using the node embeddings calculated by the GNN. A simple way to obtain a graph embedding is to use the mean , sum , or max of every node embedding _h\u1d62_ The authors make two important points about graph level readout To consider all structural information, it is necessary to keep embeddings from previous layers The sum operator is surprisingly more expressive than the mean and the max. These observations lead them to propose the following global pooling method For each layer, node embeddings are summed and the result is concatenated . This solution combines the expressiveness of the sum operator with the memory of previous iterations from the concatenation. III. GIN in PyTorch Geometric It is always interesting to see the differences between the original design and its implementations. There is a GINConv layer in PyTorch Geometric with different parameters nn the MLP that is used to approximate our two injective functions eps the initial value of \u025b, which is 0 by default train_eps a True False statement to determine if \u025b is trainable, which is False by default . You can see that \u025b is entirely removed by default in this implementation it s a hyperparameter we can tune, but probably not an essential one. There is a second GIN layer in PyTorch Geometric, called GINEConv . It comes from this paper s implementation of GIN, which applies a _ReLU_ function to the neighbors features. We won t use it in this tutorial, since the benefits are not clear. We still need to design a MLP for the GINConv layer. Here s the design we ll implement, inspired by the original paper MLP used in the GIN layer image by author The paper stacks 5 layers but we ll be more humble with 3 layers instead. Here is what the entire architecture looks like Our GIN architecture image by author I could not find any implementation of GIN with graph embedding concatenation , so here is my version it improves the accuracy by 1 on average . Let s compare it to a GCN with a simple mean pooling and no concatenation . GCN test accuracy 59.38 GIN test accuracy 73.70 This time, there s no competition! The GIN architecture completely outperforms the GCN. This gap 10 accuracy on average is due to several reasons GIN s aggregator is specifically designed to discriminate graphs that the GCN s aggregator cannot Graph hidden vectors from every layer are concatenated instead of only considering the last one The sum operator is superior to the mean operator at least in theory . 
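As a reference for the architecture described above, here is a minimal sketch of a 3-layer GIN in PyTorch Geometric with a concatenated graph-level readout. The exact MLP design from the figure is not reproduced here; the Linear-BatchNorm-ReLU pattern, the hidden dimension, and the dropout rate are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.nn import Linear, Sequential, BatchNorm1d, ReLU
from torch_geometric.nn import GINConv, global_add_pool


class GIN(torch.nn.Module):
    """3-layer GIN with a concatenated graph-level readout (sketch)."""
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        # Each GINConv wraps an MLP (assumed design: Linear -> BatchNorm -> ReLU -> Linear -> ReLU)
        def mlp(dim1, dim2):
            return Sequential(Linear(dim1, dim2), BatchNorm1d(dim2), ReLU(),
                              Linear(dim2, dim2), ReLU())
        self.conv1 = GINConv(mlp(dim_in, dim_h))
        self.conv2 = GINConv(mlp(dim_h, dim_h))
        self.conv3 = GINConv(mlp(dim_h, dim_h))
        self.lin1 = Linear(dim_h * 3, dim_h * 3)
        self.lin2 = Linear(dim_h * 3, dim_out)

    def forward(self, x, edge_index, batch):
        # Node embeddings from each layer
        h1 = self.conv1(x, edge_index)
        h2 = self.conv2(h1, edge_index)
        h3 = self.conv3(h2, edge_index)
        # Graph-level readout: sum the node embeddings of each layer, then concatenate
        h = torch.cat([global_add_pool(h1, batch),
                       global_add_pool(h2, batch),
                       global_add_pool(h3, batch)], dim=1)
        # Classifier
        h = F.relu(self.lin1(h))
        h = F.dropout(h, p=0.5, training=self.training)
        return self.lin2(h)
```

A model can then be created with something like GIN(dataset.num_node_features, 32, dataset.num_classes) and trained with a standard cross-entropy loss over the mini-batched graphs.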
Let s visualize the proteins we classified with the GCN and the GIN. Image by author Interestingly enough, the two models make different mistakes . This is a common result in machine learning when different algorithms are applied to the same problem. We can take advantage of this behavior by creating an ensemble . There are many ways of combining our graph embeddings. The simplest method is to take the mean of the normalized output vectors. GCN test accuracy 59.38 GIN test accuracy 73.70 GCN GIN test accuracy 75.00 This time, we re lucky enough to see the accuracy improved . Obviously, it s not always the case. More sophisticated methods involve building an entirely different ML algorithm for classification, such as a Random Forest. This classifier takes graph embeddings as inputs and outputs the final classification. Conclusion Graph Isomorphism Networks are an important step in the understanding of GNNs. They not only improve the accuracy scores on several benchmarks but also provide a theoretical framework to explain why one architecture is better than another. In this article, We saw a new task with graph classification , performed with global pooling We introduced the WL test and its connection with the new GIN layer We implemented a GIN and a GCN and made a simple ensemble with their classifications. Although GINs achieve good performance, especially with social graphs, their theoretical superiority doesn t always translate well in the real world. It is true with other provably powerful architectures, which tend to underperform in practice , such as the 3WLGNN. If you enjoyed this article, feel free to follow me on Twitter for more graph content! References 1 Christopher Morris and Nils M. Kriege and Franka Bause and Kristian Kersting and Petra Mutzel and Marion Neumann. TUDataset A collection of benchmark datasets for learning with graphs. In _ICML 2020 Workshop on Graph Representation Learning and Beyond_. 2 Xu, Keyulu and Hu, Weihua and Leskovec, Jure and Jegelka, Stefanie. How Powerful are Graph Neural Networks?__ In _ICLR 2019_. Related articles Introduction to GraphSAGE in Python _Scaling Graph Neural Networks to billions of connections_towardsdatascience.com Graph Attention Networks Self Attention Explained _A guide to GNNs with self attention using PyTorch Geometric_towardsdatascience.com Share this post GIN How to Design the Most Powerful Graph Neural Network maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. 
Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/how-to-design-the-most-powerful-graph-neural-network-3d18b07a6e66" }, { "id": "4ddd85f7-4d82-4be0-96c1-16056bd9ec18", "content": "GraphSAGE Scaling up Graph Neural Networks Introduction to GraphSAGE with PyTorch Geometric Maxime Labonne SubscribeSign in Share this post GraphSAGE Scaling up Graph Neural Networks maximelabonne.substack.com Copy link Facebook Email Note Other GraphSAGE Scaling up Graph Neural Networks Introduction to GraphSAGE with PyTorch Geometric Maxime Labonne Apr 20, 2022 Share this post GraphSAGE Scaling up Graph Neural Networks maximelabonne.substack.com Copy link Facebook Email Note Other Share Introduction to GraphSAGE with PyTorch Geometric Image by author, emoji by OpenMoji CC BY SA 4.0 What do UberEats and Pinterest have in common? They both use GraphSAGE to power their recommender system on a massive scale millions and billions of nodes and edges. Pinterest developed its own version called PinSAGE to recommend the most relevant images pins to its users. Their graph has 18 billion connections and 3 billion nodes. UberEats also reported using a modified version of GraphSAGE to suggest dishes, restaurants, and cuisines . UberEats claims to support more than 600,000 restaurants and 66 million users. In this tutorial, we ll use a dataset with 20k nodes instead of billions because Google Colab cannot handle our ambitions. We will stick to the original GraphSAGE architecture, but the previous variants also bring exciting features we will discuss. You can run the code with the following Google Colab notebook. I. PubMed dataset t SNE plot of PubMed image by author In this article, we will use the PubMed dataset. As we saw in the previous article, PubMed is part of the Planetoid dataset MIT license . Here s a quick summary It contains 19,717 scientific publications about diabetes from PubMed s database Node features are TF IDF weighted word vectors with 500 dimensions, which is an efficient way of summarizing documents without transformers The task is a multi class classification with three categories diabetes mellitus experimental, diabetes mellitus type 1, and diabetes mellitus type 2. This is the beauty and the curse of deep learning I don t know anything about diabetes, but I ll still feel pretty satisfied if we reach 70 accuracy. At least we re not building the next IBM Watson. Dataset Pubmed Number of graphs 1 Number of nodes 19717 Number of features 500 Number of classes 3 Graph Training nodes 60 Evaluation nodes 500 Test nodes 1000 Edges are directed False Graph has isolated nodes False Graph has loops False As we can see, PubMed has an insanely low number of training nodes compared to the whole graph. There are only 60 samples to learn how to classify the 1000 test nodes. Despite this challenge, GNNs manage to obtain high levels of accuracy. Here s the leaderboard of known techniques a more exhaustive benchmark can be found on PapersWithCode I couldn t find any result for GraphSAGE on PubMed with this specific setting 60 training nodes, 1000 test nodes , so I don t expect a great accuracy. But another metric can be just as relevant when working with large graphs training time . II. GraphSAGE in theory Image by author The GraphSAGE algorithm can be divided into two steps 1. Neighbor sampling 2. Aggregation . A. 
Neighbor sampling Mini batching is a common technique used in machine learning. It works by breaking down a dataset into smaller batches , which allows us to train models more effectively. Mini batching has several benefits 1. Improved accuracy mini batches help to reduce overfitting gradients are averaged , as well as variance in error rates 2. Increased speed mini batches are processed in parallel and take less time to train than larger batches 3. Improved scalability an entire dataset can exceed the GPU memory, but smaller batches can get around this limitation. Mini batching is so useful it became standard in regular neural networks. However, it is not as straightforward with graph data, since splitting the dataset into smaller chunks would break essential connections between nodes. So, what can we do? In recent years, researchers developed different strategies to create graph mini batches. The one we re interested in is called neighbor sampling . There are many other techniques you can find on PyG s documentation, such as subgraph clustering. Neighbor sampling image by author Neighbor sampling considers only a fixed number of random neighbors. Here s the process 1. We define the number of neighbors 1 hop , the number of neighbors of neighbors 2 hops , etc. we would like to have. 2. The sampler looks at the list of neighbors, of neighbors of neighbors, etc. of a target node and randomly selects a predefined number of them 3. The sampler outputs a subgraph containing the target node and the randomly selected neighboring nodes. This process is repeated for every node in a list or the entirety of the graph. However, creating a subgraph for each node is not efficient, that is why we can process them in batches instead. In this case, each subgraph is shared by multiple target nodes. Neighbor sampling has an added benefit. Sometimes, we observe extremely popular nodes that act like hubs, such as celebrities on social media. Obtaining the hidden vectors of these nodes can be computationally very expensive since it requires calculating the hidden vectors of thousands or even millions of neighbors. GraphSAGE fixes this issue by simply ignoring most of the nodes! In PyG, neighbor sampling is implemented through the NeighborLoader object. Let s say we want 5 neighbors and 10 of their neighbors num_neighbors . As we discussed, we can also specify a batch_size to speed up the process by creating subgraphs for multiple target nodes. Subgraph 0 Data x 389, 500 , edge_index 2, 448 , batch_size 16 Subgraph 1 Data x 264, 500 , edge_index 2, 314 , batch_size 16 Subgraph 2 Data x 283, 500 , edge_index 2, 330 , batch_size 16 Subgraph 3 Data x 189, 500 , edge_index 2, 229 , batch_size 12 We created 4 subgraphs of various sizes. It allows us to process them in parallel and they re easier to fit on a GPU since they re smaller. The number of neighbors is an important parameter since pruning our graph removes a lot of information. How much, exactly? Well, quite a lot. We can visualize this effect by looking at the node degrees number of neighbors . Node degrees in the original graph Node degrees after neighbor sampling In this example, the maximum node degree of our subgraphs is 5, which is much lower than the original max value. It s important to remember this tradeoff when talking about GraphSAGE. PinSAGE implements another sampling solution using random walks . It has two main objectives 1. Sample a fixed number of neighbors like GraphSAGE 2. 
Obtain their relative importance important nodes are seen more frequently than others . This strategy feels a bit like a fast attention mechanism . It assigns weights to nodes and increases the relevance of the most popular ones. B. Aggregation The aggregation process determines how to combine the feature vectors to produce the node embeddings. The original paper presents three ways of aggregating features Mean aggregator LSTM aggregator Pooling aggregator. Aggregation image by author The mean aggregator is the simplest one. The idea is close to a GCN approach 1. The hidden features of the target node and its selected neighbors are averaged \u00d1\u1d62 2. A linear transformation with a weight matrix \ud835\udc16 is applied. The result can then be fed to a non linear activation function like _ReLU_. The LSTM aggregator can seem like a weird idea because this architecture is sequential it assigns an order to our unordered nodes. This is why the authors randomly shuffle them to force the LSTM to only consider the hidden features. It is the best performing technique in their benchmarks. The pooling aggregator feeds each neighbor s hidden vector to a feedforward neural network. A max pooling operation is applied to the result. III. GraphSAGE in PyTorch Geometric We can easily implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer. This implementation uses two weight matrices instead of one, like UberEats version of GraphSAGE Let s create a network with two SAGEConv layers The first one will use _ ReLU _ as the activation function and a dropout layer The second one will directly output the node embeddings . As we re dealing with a multi class classification task, we ll use the cross entropy loss as our loss function. I also added an L2 regularization of 0.0005 for good measure. To see the benefits of GraphSAGE, let s compare it with a GCN and a GAT without any sampling. With GraphSAGE, we loop through batches our 4 subgraphs created by the neighbor sampling process. The way we calculate the accuracy and the validation loss is also different because of that. Here are the results in terms of accuracy and training time for the GCN, the GAT, and GraphSAGE GCN test accuracy 78.40 52.6 s GAT test accuracy 77.10 18min 7s GraphSAGE test accuracy 77.20 12.4 s The three models obtain similar results in terms of accuracy. We expect the GAT to perform better because its aggregation mechanism is more nuanced, but it s not always the case. The real difference is the training time GraphSAGE is 88 times faster than the GAT and 4 times faster than the GCN in this example! Here lies the true power of GraphSAGE. We do lose a lot of information by pruning our graph with neighbor sampling. The final node embeddings might not be as good as what we could find with a GCN or a GAT. But this is not the point GraphSAGE is designed to improve scalability. In turn, it can lead to building larger graphs that can improve accuracy. Image by author This work was done in a supervised training setting node classification , but we could also train GraphSAGE in an unsupervised way . In this case, we can t use the cross entropy loss. We have to engineer a loss function that forces nodes that are nearby in the original graph to remain close to each other in the embedding space. Conversely, the same function must ensure that distant nodes in the graph must have distant representations in the embedding space. This is the loss that is presented in GraphSAGE s paper. 
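Here is a minimal sketch of the supervised setup described above: neighbor sampling with NeighborLoader (5 neighbors at the first hop, 10 at the second, batches of 16 target nodes) and a two-layer SAGEConv model. The hidden dimension, the dropout rate, and the single pass over the loader are assumptions for brevity, not the notebook's exact code.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import SAGEConv

dataset = Planetoid(root=".", name="Pubmed")
data = dataset[0]

# Neighbor sampling: 5 neighbors at the first hop, 10 at the second,
# with subgraphs shared by batches of 16 training (target) nodes
train_loader = NeighborLoader(
    data,
    num_neighbors=[5, 10],
    batch_size=16,
    input_nodes=data.train_mask,
)


class GraphSAGE(torch.nn.Module):
    """Two-layer GraphSAGE (sketch): ReLU and dropout after the first layer."""
    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.sage1 = SAGEConv(dim_in, dim_h)
        self.sage2 = SAGEConv(dim_h, dim_out)

    def forward(self, x, edge_index):
        h = self.sage1(x, edge_index).relu()
        h = F.dropout(h, p=0.5, training=self.training)
        return self.sage2(h, edge_index)


model = GraphSAGE(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # L2 = 0.0005
criterion = torch.nn.CrossEntropyLoss()

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # Only the seed (target) nodes of each subgraph contribute to the loss
    loss = criterion(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
```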
In the case of PinSAGE and UberEeats modified GraphSAGE, we re dealing with recommender systems . The goal is to correctly rank the most relevant items pins, restaurants for each user, which is very different. We don t only want to know what the closest embeddings are, we have to produce the best rankings possible . This is why these systems are also trained in an unsupervised way, but with another loss function a max margin ranking loss. Conclusion GraphSAGE is an incredibly fast architecture to process large graphs. It might not be as accurate as a GCN or a GAT, but it is an essential model for handling massive amounts of data . It delivers this speed thanks to a clever combination of 1 neighbor sampling to prune the graph and 2 fast aggregation with a mean aggregator in this example. In this article, We explored a new dataset with PubMed, which is several times larger than the previous one We explained the idea behind neighbor sampling , which only considers a predefined number of random neighbors at each hop We saw the three aggregators presented in GraphSAGE s paper and focused on the mean aggregator We benchmarked three models GraphSAGE, GAT, and GCN in terms of accuracy and training time . We saw three architectures with the same end application node classification. But GNNs have been successfully applied to other tasks. In the next tutorials, I d like to use them in two different contexts graph and edge prediction . This will be a good way to discover new datasets and applications where GNNs dominate the state of the art. If you enjoyed this article, let s connect on Twitter maximelabonne for more graph learning content. Thanks for your attention! Related articles How to Design the Most Powerful Graph Neural Network _Graph classification with Graph Isomorphism Networks_towardsdatascience.com Graph Attention Networks Self Attention Explained _A guide to GNNs with self attention using PyTorch Geometric_towardsdatascience.com Share this post GraphSAGE Scaling up Graph Neural Networks maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-graphsage-in-python-a9e7f9ecf9d7" }, { "id": "e48f1530-201c-4ee2-8d49-bdc30a70b5af", "content": "Graph Attention Networks Self Attention Explained A guide to GNNs with self attention using PyTorch Geometric Maxime Labonne SubscribeSign in Share this post Graph Attention Networks Self Attention Explained maximelabonne.substack.com Copy link Facebook Email Note Other Graph Attention Networks Self Attention Explained A guide to GNNs with self attention using PyTorch Geometric Maxime Labonne Apr 17, 2022 Share this post Graph Attention Networks Self Attention Explained maximelabonne.substack.com Copy link Facebook Email Note Other Share A guide to GNNs with self attention using PyTorch Geometric Image by author, file icon by OpenMoji CC BY SA 4.0 Graph Attention Networks are one of the most popular types of Graph Neural Networks. For a good reason. 
With Graph _Convolutional_ Networks GCN , every neighbor has the same importance . Obviously, it should not be the case some nodes are more essential than others. Node 4 is more important than node 3, which is more important than node 2 image by author Graph _Attention_ Networks offer a solution to this problem. To consider the importance of each neighbor, an attention mechanism assigns a weighting factor to every connection . In this article, we ll see how to calculate these attention scores and implement an efficient GAT in PyTorch Geometric PyG . You can run the code of this tutorial with the following Google Colab notebook. I. Graph data CiteSeer dataset image by author, made with yEd Live There are three classic graph datasets we can use for this work MIT license . They represent networks of research papers, where each connection is a citation. Cora it consists of 2708 machine learning papers that belong to one of 7 categories. Node features represent the presence 1 or absence 0 of 1433 words in a paper binary bag of words . CiteSeer it is a bigger but similar dataset of 3312 scientific papers to classify into one of 6 categories. Node features represent the presence 1 or absence 0 of 3703 words in a paper. PubMed it is an even bigger dataset with 19717 scientific publications about diabetes from PubMed s database, classified into 3 categories. Node features are TF IDF weighted word vectors from a dictionary of 500 unique words. These datasets have been widely used by the scientific community. As a challenge, we can compare our accuracy scores to those obtained in the literature using Multilayer Perceptrons MLPs , GCNs , and GATs PubMed is quite large so it would take longer to process it and train a GNN on it. Cora is the most studied one in the literature, so let s focus on CiteSeer as a middle ground. We can directly import any of these datasets in PyTorch Geometric with the Planetoid class Number of graphs 1 Number of nodes 3327 Number of features 3703 Number of classes 6 Has isolated nodes True Interestingly enough, we have 3327 nodes instead of 3312. I found that PyG actually uses this paper s implementation of CiteSeer, which also displays 3327 nodes. Mystery solved for now. However, we observe that some nodes are isolated 48 to be precise ! Correctly classifying these isolated nodes will be a challenge since we cannot rely on any aggregation. Let s plot the number of connections of each node with degree Most nodes only have 1 or 2 neighbors . It could explain why CiteSeer obtains lower accuracy scores than the two other datasets II. Self attention Introduced by Veli\u010dkovi\u0107 et al. in 2017, self attention in GNNs relies on a simple idea nodes should not all have the same importance . We talk about _self_ attention and not just attention because inputs are compared to each other. Image by author This mechanism assigns a weighting factor attention score to each connection. Let s call _ \u03b1 _ \u1d62\u2c7c the attention score between the nodes _i_ and _j_. Here s how to calculate the embedding of node 1, where \ud835\udc16 is a shared weight matrix But how do we calculate the attention scores? We could write a static formula, but there s a smarter solution we can learn their values with a neural network . There are three steps in this process 1. Linear transformation 2. Activation function 3. Softmax normalization. 1 Linear transformation We want to calculate the importance of each connection , so we need pairs of hidden vectors. 
An easy way to create these pairs is to concatenate vectors from both nodes. Only then can we apply a new linear transformation with a weight matrix \ud835\udc16 \u2090\u209c\u209c Image by author 2 Activation function We re building a neural network, so the second step is to add an activation function. In this case, the authors of the paper chose the _LeakyReLU_ function. Image by author 3 Softmax normalization The output of our neural network is not normalized , which is a problem since we want to compare these scores. To be able to say if node 2 is more important to node 1 than node 3 _\u03b1_ \u2081\u2082 _\u03b1_ \u2081\u2083 , we need to share the same scale. A common way to do it with neural networks is to use the _ softmax _ function. Here, we apply it to every neighboring node Image by author Here you have it we can calculate every _\u03b1_ \u1d62\u2c7c. The only problem is self attention is not very stable . In order to improve performance, Vaswani et al. introduced multi head attention in the transformer architecture. 4 Bonus multi head attention This is only slightly surprising since we ve been talking about self attention a lot but, in reality, transformers are GNNs in disguise . This is why we can reuse some ideas from Natural Language Processing here. Multi head attention image by author In GATs, multi head attention consists of replicating the same 3 steps several times in order to average or concatenate the results. That s it. Instead of a single _h\u2081_ , we get one hidden vector _h\u2081\u1d4f_ per attention head. One of the two following schemes can then be applied Average we sum the different _h\u1d62\u1d4f _ and normalize the result by the number of attention heads _n_ Concatenation we concatenate the different _h\u1d62\u1d4f_. In practice, we use the concatenation scheme when it s a hidden layer, and the average scheme when it s the last layer of the network. III. Graph Attention Networks Let s implement a GAT in PyTorch Geometric. This library has two different graph attention layers GATConv and GATv2Conv . What we talked about so far is the GatConv layer, but in 2021 Brody et al. introduced an improvement by modifying the order of operations. The weight matrix \ud835\udc16 is applied after the concatenation , and the attention weight matrix \ud835\udc16 \u2090\u209c\u209c is used after the _ LeakyReLU _ function . In summary GatConv Gatv2Conv Which one should you use? According to Brody et al., Gatv2Conv consistently outperforms GatConv and thus should be preferred. Now let s classify the papers from CiteSeer! I tried to roughly reproduce the experiments of the original authors without adding too much complexity. You can find the official implementation of GAT on GitHub. Note that we use graph attention layers in two configurations The first layer concatenates 8 outputs multi head attention The second layer only has 1 head, which produces our final embeddings. We re also gonna train and test a GCN to compare the accuracy scores. 
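Before looking at the results, here is a minimal sketch of the two-layer GAT described above, built with GATv2Conv: 8 concatenated heads on the first layer and a single head on the second. The dropout rate and the ELU activation follow the original GAT paper and are assumptions here, not necessarily the exact training setup used below.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GATv2Conv

dataset = Planetoid(root=".", name="CiteSeer")
data = dataset[0]


class GAT(torch.nn.Module):
    """Two-layer GAT (sketch): 8 concatenated heads, then a single head."""
    def __init__(self, dim_in, dim_h, dim_out, heads=8):
        super().__init__()
        self.gat1 = GATv2Conv(dim_in, dim_h, heads=heads)
        self.gat2 = GATv2Conv(dim_h * heads, dim_out, heads=1)

    def forward(self, x, edge_index):
        h = F.dropout(x, p=0.6, training=self.training)
        h = F.elu(self.gat1(h, edge_index))
        h = F.dropout(h, p=0.6, training=self.training)
        return self.gat2(h, edge_index)


model = GAT(dataset.num_features, 8, dataset.num_classes)
```

A GCN baseline can be built the same way by swapping the GATv2Conv layers for GCNConv layers, which is how the two models are compared below.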
GCN gcn1 GCNConv 3703, 16 gcn2 GCNConv 16, 6 Epoch 0 Train Loss 1.782 Train Acc 20.83 Val Loss 1.79 Epoch 20 Train Loss 0.165 Train Acc 95.00 Val Loss 1.30 Epoch 40 Train Loss 0.069 Train Acc 99.17 Val Loss 1.66 Epoch 60 Train Loss 0.053 Train Acc 99.17 Val Loss 1.50 Epoch 80 Train Loss 0.054 Train Acc 100.00 Val Loss 1.67 Epoch 100 Train Loss 0.062 Train Acc 99.17 Val Loss 1.62 Epoch 120 Train Loss 0.043 Train Acc 100.00 Val Loss 1.66 Epoch 140 Train Loss 0.058 Train Acc 98.33 Val Loss 1.68 Epoch 160 Train Loss 0.037 Train Acc 100.00 Val Loss 1.44 Epoch 180 Train Loss 0.036 Train Acc 99.17 Val Loss 1.65 Epoch 200 Train Loss 0.093 Train Acc 95.83 Val Loss 1.73 GCN test accuracy 67.70 CPU times user 25.1 s, sys 847 ms, total 25.9 s Wall time 32.4 s GAT gat1 GATv2Conv 3703, 8, heads 8 gat2 GATv2Conv 64, 6, heads 1 Epoch 0 Train Loss 1.790 Val Loss 1.81 Val Acc 12.80 Epoch 20 Train Loss 0.040 Val Loss 1.21 Val Acc 64.80 Epoch 40 Train Loss 0.027 Val Loss 1.20 Val Acc 67.20 Epoch 60 Train Loss 0.009 Val Loss 1.11 Val Acc 67.00 Epoch 80 Train Loss 0.013 Val Loss 1.16 Val Acc 66.80 Epoch 100 Train Loss 0.013 Val Loss 1.07 Val Acc 67.20 Epoch 120 Train Loss 0.014 Val Loss 1.12 Val Acc 66.40 Epoch 140 Train Loss 0.007 Val Loss 1.19 Val Acc 65.40 Epoch 160 Train Loss 0.007 Val Loss 1.16 Val Acc 68.40 Epoch 180 Train Loss 0.006 Val Loss 1.13 Val Acc 68.60 Epoch 200 Train Loss 0.007 Val Loss 1.13 Val Acc 68.40 GAT test accuracy 70.00 CPU times user 53.4 s, sys 2.68 s, total 56.1 s Wall time 55.9 s This experiment is not super rigorous we d need to repeat it _ n _ times and take the average accuracy with a standard deviation as the final result. We can see in this example that the GAT outperforms the GCN in terms of accuracy 70.00 vs. 67.70 , but takes longer to train 55.9s vs. 32.4s . It s a tradeoff that can cause scalability issues when working with large graphs. The authors obtained 72.5 for the GAT and 70.3 for the GCN, which is clearly better than what we did. The difference can be explained by preprocessing , some tweaks in the models, and a different training setting _e.g.,_ a patience of 100 instead of a fixed number of epochs . Let s visualize what the GAT learned. We re gonna use t SNE, a powerful method to plot high dimensional data in 2D or 3D. First, let s see what the embeddings looked like before any training it should be absolutely random since they re produced by randomly initialized weight matrices. Indeed, there s no apparent structure . But do the embeddings produced by our trained model look better? The difference is noticeable nodes belonging to the same classes cluster together . We can see 6 clusters, corresponding to the 6 classes of papers. There are outliers, but this was to be expected our accuracy score is far from perfect. Previously, I speculated that poorly connected nodes might negatively impact performance on CiteSeer. Let s calculate the model s accuracy for each degree. These results confirm our intuition nodes with few neighbors are indeed harder to classify . This is due to the nature of GNNs the more relevant connections you have, the more information you can aggregate. Conclusion While they take longer to train, GATs are a substantial improvement over GCNs in terms of accuracy. The self attention mechanism automatically calculates weighting factors instead of static coefficients to produce better embeddings. 
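For reference, the accuracy-per-degree check mentioned above can be sketched as follows, assuming the trained model and the CiteSeer data object from the previous snippet; grouping all degrees of 5 and more into one bucket is an arbitrary choice for readability.

```python
import torch
from torch_geometric.utils import degree

# Assumes `model` is the trained GAT and `data` the CiteSeer graph from above
model.eval()
with torch.no_grad():
    pred = model(data.x, data.edge_index).argmax(dim=1)

# Node degrees (number of neighbors), computed from the edge index
degrees = degree(data.edge_index[0], num_nodes=data.num_nodes).long()

# Accuracy of the test nodes grouped by degree (degrees >= 5 merged together)
for d in range(6):
    mask = data.test_mask & (degrees == d if d < 5 else degrees >= 5)
    if mask.sum() > 0:
        acc = (pred[mask] == data.y[mask]).float().mean().item()
        label = str(d) if d < 5 else ">=5"
        print(f"Degree {label}: {acc * 100:.2f}% ({mask.sum().item()} nodes)")
```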
In this article, We learned about the self attention mechanism applied to GNNs We implemented and compared two architectures a GCN and a GAT in PyTorch Geometric We visualized how and what the GAT learns with a t SNE plot and the accuracy score for each degree GATs are the de facto standard in a lot of GNN applications. However, their slow training time can become a problem when applied to massive graph datasets. Scalability is an important factor in deep learning most often, more data can lead to better performance. In the next article, we ll see how to improve scalability with mini batching and a new GNN architecture called GraphSAGE. If you enjoyed this tutorial, feel free to follow me on Twitter for more GNN content. Thank you and see you in the next article! Related articles Introduction to GraphSAGE in Python _Scaling Graph Neural Networks to billions of connections_towardsdatascience.com How to Design the Most Powerful Graph Neural Network _Graph classification with Graph Isomorphism Networks_towardsdatascience.com Share this post Graph Attention Networks Self Attention Explained maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/graph-attention-networks-in-python-975736ac5c0c" }, { "id": "bb728e7c-4c22-443c-a630-b68f5e54b5a6", "content": "Integer vs. Linear Programming in Python A guide to identify and solve any optimization problem Maxime Labonne SubscribeSign in Share this post Integer vs. Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Integer vs. Linear Programming in Python A guide to identify and solve any optimization problem Maxime Labonne Apr 07, 2022 Share this post Integer vs. Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Share Mixed Integer Programming for optimization with Google OR Tools Image by author, emojis by OpenMoji CC BY SA 4.0 Why is linear programming called that way? Both terms are confusing Linear implies that nonlinear programming exists Programming actually means planning in this context. In summary, it has nothing to do with code linear or not. It s about optimizing variables with various constraints. In this article, we re gonna talk about another type of optimization integer programming . We ll see why a good understanding of the problem we face is necessary to choose the right solver. Finally, we will write a model that can take on a bigger challenge and actually solve a whole class of optimization problems. You can run the code from this tutorial with the following Google Colab notebook . Image by author, emojis by OpenMoji CC BY SA 4.0 I. Optimization problem types In the introduction to linear programming, we optimized an army composition . Here was the result Solution Solved in 87.00 milliseconds in 2 iterations Optimal power 1800.0 power Army Swordsmen 6.0000000000000036 Bowmen 0.0 Horsemen 5.999999999999999 How can we have 5.999 horsemen? We specified that our variables should be integers with VarInt . 
What was wrong with our code? The problem is not the model but the choice of the solver. GLOP is a pure linear programming solver. This means that it cannot understand the concept of integers . It is limited to continuous parameters with a linear relationship. This is the difference between linear programming LP and integer linear programming ILP . In summary, LP solvers can only use real numbers and not integers as variables. So why did we declare our variables as integers if it doesn t take them into account? GLOP cannot solve ILP problems, but other solvers can. Actually, a lot of them are mixed integer linear programming MILP, commonly called MIP solvers. This means that they can consider both continuous real numbers and discrete integers variables. A particular case of discrete values is Boolean variables to represent decisions with 0 1 values. Other solvers like SCIP or CBC can solve both MILP and MINLP mixed integer _nonlinear_ programming problems. Thanks to OR Tools, we can use the same model and just change the solver to SCIP or CBC. Solution Solved in 3.00 milliseconds in 0 iterations Optimal value 1800.0 power Army Swordsmen 6.0 Bowmen 0.0 Horsemen 6.0 Strictly speaking, our variables are still floats type swordsmen.solution_value float but we can see that they don t have weird decimals anymore the CBC solver really considered them as integers . In this example, we would generally just round up these values since the error is insignificant. However, it is important to remember to choose the appropriate solver according to the studied problem LP for continuous variables MIP MILP for a combination of continuous and discrete variables. There are other types such as quadratic QP or nonlinear NLP or MINLP, with an exponential objective function or constraints for instance problems. They re applied in different contexts, but follow the same principles as LP or MIP solvers. Image by author II. Building a general model But what if our resources change ? Or if the cost of a unit evolved? What if we upgraded horsemen and their power increased? One of the best perks of OR Tools is that it uses a general purpose programming language like Python. Instead of static numbers, we can store our parameters in objects like dictionaries or lists . The code won t be as readable, but it becomes much more flexible actually, it can be so flexible that we can solve an entire class of optimization problems without changing the model just the parameters . Let s transform our input parameters into Python lists and feed them to the solver through a function. Solution Solved in 2.00 milliseconds in 0 iterations Optimal value 1800.0 power Army Swordsmen 6.0 Bowmen 0.0 Horsemen 6.0 We obtain the same results our code seems to work. Now let s change the parameters to tackle a slightly more complex problem. Imagine we have a lot more resources 183000 , 90512 , and 80150 , so we can also produce a lot more units! This is the new table Notice that we transformed the power into two values attack and health , which is a little more detailed. Health values are higher than attack values, which is why we want to add a weighting factor to make them more comparable. Let s take 10 as an example, so _power 10 attack health_. Our objective function becomes Adapting our code to this new problem is actually quite simple we just have to change the input parameters and update the objective function . 
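As a sketch of this list-based approach, here is how the generalized function could look with the SCIP backend. For brevity, it is shown with the original three-unit table rather than the full ten-unit table used for the run below, and the function and variable names are illustrative.

```python
from ortools.linear_solver import pywraplp

UNITS = ["Swordsmen", "Bowmen", "Horsemen"]
DATA = [  # [food, wood, gold], power
    [[60, 20, 0], 70],
    [[80, 10, 40], 95],
    [[140, 0, 100], 230],
]
RESOURCES = [1200, 800, 600]


def solve_army(units, data, resources):
    # SCIP (or CBC) handles integer variables, unlike the pure LP solver GLOP
    solver = pywraplp.Solver.CreateSolver("SCIP")

    # 1. One integer variable per unit
    variables = [solver.IntVar(0, solver.infinity(), unit) for unit in units]

    # 2. One constraint per resource: total cost cannot exceed what we have
    for r, resource in enumerate(resources):
        solver.Add(sum(data[u][0][r] * variables[u] for u in range(len(units))) <= resource)

    # 3. Objective: maximize the total power of the army
    solver.Maximize(sum(data[u][1] * variables[u] for u in range(len(units))))

    status = solver.Solve()
    if status == pywraplp.Solver.OPTIMAL:
        print(f"Optimal power = {solver.Objective().Value()}")
        for unit, var in zip(units, variables):
            print(f"  {unit}: {var.solution_value()}")


solve_army(UNITS, DATA, RESOURCES)
```

Changing the problem then only means editing UNITS, DATA, and RESOURCES (or the objective function), which is exactly what the ten-unit run below does.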
Solution Solved in 74.00 milliseconds in 412 iterations Optimal value 1393145.0 power Army Swordsmen 2.0 Men at arms 1283.0 Bowmen 3.0 Crossbowmen 0.0 Handcannoneers 454.0 Horsemen 0.0 Knights 0.0 Battering rams 301.0 Springalds 0.0 Mangonels 0.0 This problem would take a long time for humans to address, but the ILP solver did it in the blink of an eye. Better than that it also gives us the guarantee that our solution is optimal , which means that our enemy cannot find a better army composition for the same cost! We could increase the number of units and give billions of resources but you get the picture it would just take longer to obtain a solution, but it wouldn t change the problem. III. Combining constraints Now, let s say we scouted our enemy and know that their army has a power of 1,000,000 . We could build a much better army, but our resources are precious and it wouldn t be very efficient all we have to do is to build an army with a power higher than 1,000,000 even 1,000,001 would be enough . In other words, the total power is now a constraint 1,000,000 instead of the objective to maximize. The new goal is to minimize the resources we need to produce this army. However, we can reuse our input parameters since they didn t change. The new constraint can be translated as the sum of the power of the selected units must be strictly greater than 1,000,000 . In code, we can loop through our units and resources to design this constraint. The objective function also has to change. Our goal is to minimize the sum of resources spent to build the army. Once again, we can loop through our resources to implement it in OR Tools. Solution Solved in 4.00 milliseconds in 0 iterations Optimal value 111300.0 resources Power 1001700.0 Army Swordsmen 0.0 Men at arms 0.0 Bowmen 0.0 Crossbowmen 0.0 Handcannoneers 0.0 Horsemen 0.0 Knights 0.0 Battering rams 371.0 Springalds 0.0 Mangonels 0.0 Resources Food 0.0 Wood 111300.0 Gold 0.0 The solver found an optimal solution we need to build 371 battering rams for a total cost of 111,300 wood. Wait, what if we don t have that much wood? In the previous section, we only had 90512 we cannot produce 371 battering rams. So is it possible to take these limited resources into account and still try to build the best army ? Actually, it s super easy we just have to copy paste the constraints from the previous section. In this version, we have two types of constraints The total power must be greater than 1,000,000 We cannot spend more than our limited resources . Solution Solved in 28.00 milliseconds in 1 iterations Optimal value 172100.0 resources Power 1000105.0 Army Swordsmen 1.0 Men at arms 681.0 Bowmen 0.0 Crossbowmen 0.0 Handcannoneers 0.0 Horsemen 0.0 Knights 0.0 Battering rams 301.0 Springalds 0.0 Mangonels 0.0 Resources Food 68160.0 Wood 90320.0 Gold 13620.0 Since we now have a limited resource of wood , the number of battering rams sadly dropped from 371 to 301. In exchange, we got 681 men at arms and 1 lost swordsman welcome to them . The total cost of the army is 172,100 , which is much higher than the 111,300 we previously found 65 increase but it truly is the optimal solution under these constraints. It shows that we should produce more wood because these battering rams are extremely cost efficient! This example shows how modular LP models can be. It is possible to reuse parts of the code, like constraints, in another model to combine them and solve more complex problems. IV. Linear Programming vs Machine Learning Let s talk about the elephant in the room. 
Why not use machine learning in a broad sense instead of linear programming? It s not like this problem cannot be solved with a genetic algorithm for instance. Mathematical optimization is often neglected in favor of machine learning techniques, but both have their merits Linear programming can produce an optimal solution in an undetermined amount of time it can take years , while machine learning can approximate complex functions in no time. There is no training in LP, but an expert is required to build a mathematical model. Machine learning needs data, but the models can be used as black boxes to solve a problem. As a rule of thumb, problems that do not have a particular time constraint and or are not extremely complex can be advantageously solved with linear programming. Image by author, emojis by OpenMoji CC BY SA 4.0 Conclusion In this tutorial, we dived deeper into our understanding of mathematical optimization. We talked about solvers and types of optimization problems LP, MIP, NLP We modeled and solved an extremely common optimization problem in an optimal way and generalized our model through a function We reframed this problem and merged two sets of constraints to obtain the best army composition for the lowest price We compared the pros and cons of linear programming and machine learning. There are a lot more problems where optimization can be applied. For instance, how to create school timetables that satisfy everybody s requirements? How to deliver 1,000 different orders in a minimum amount of time? Where to create a new metro line to maximize its usefulness? In future articles, we ll talk about new types of applications for these techniques, including satisfiability and nonlinear problems. I hope you enjoyed this more advanced article. If you like machine learning and optimization, let s connect on Twitter ! Related articles Part 3 Constraint Programming in Python _The Programming Paradigm to Find One Solution Among 8,080,104 Candidates_towardsdatascience.com Part 1 Introduction to Linear Programming in Python _A guide to mathematical optimization with Google OR Tools_towardsdatascience.com Share this post Integer vs. Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. 
Please turn on JavaScript or unblock scripts en", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/integer-programming-vs-linear-programming-in-python-f1be5bb4e60e" }, { "id": "e75d9b4e-1a14-450e-ad51-b396969de6c5", "content": "Introduction to Linear Programming in Python A guide to mathematical optimization with Google OR Tools Maxime Labonne SubscribeSign in Share this post Introduction to Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Introduction to Linear Programming in Python A guide to mathematical optimization with Google OR Tools Maxime Labonne Apr 04, 2022 Share this post Introduction to Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Share A guide to mathematical optimization with Google OR Tools Image by author, emojis by OpenMoji CC BY SA 4.0 Linear programming is a technique to optimize any problem with multiple variables and constraints. It s a simple but powerful tool every data scientist should master. Imagine you are a strategist recruiting an army . You have Three resources food , wood , and gold Three units swordsmen , bowmen , and horsemen . Horsemen are stronger than bowmen, who are in turn stronger than swordsmen. The following table provides the cost and power of each unit Image by author Now we have 1200 food, 800 wood, and 600 gold. How should we maximize the power of our army considering these resources? We could simply find the unit with the best power cost ratio, take as many of them as possible, and repeat the process with the other two units. But this guess and check solution might not even be optimal Now imagine we have millions of units and resources the previous greedy strategy is likely to completely miss the optimal solution. It is possible to use a machine learning algorithm e.g., a genetic algorithm to solve this problem, but we have no guarantee that the solution will be optimal either. Fortunately for us, there is a method that can solve our problem in an optimal way linear programming or linear optimization , which is part of the field of operations research OR . In this article, we ll use it to find the best numbers of swordsmen, bowmen, and horsemen to build the army with the highest power possible . You can run the code from this tutorial with the following Google Colab notebook . I. Solvers In Python, there are different libraries for linear programming such as the multi purposed SciPy , the beginner friendly PuLP , the exhaustive Pyomo , and many others. Today, we are going to use Google OR Tools , which is quite user friendly, comes with several prepackaged solvers, and has by far the most stars on GitHub. If the installation doesn t work, please restart the kernel and try again it can fail sometimes. _ \u30c4 _ All these libraries have a hidden benefit they act as interfaces to use the same model with different solvers . Solvers like Gurobi, Cplex, or SCIP have their own APIs, but the models they create are tied to a specific solver. OR Tools allows us to use an abstract and quite pythonic way of modeling our problems. We can then choose one or several solvers to find an optimal solution. The model we built is thus highly reusable! Image by author OR Tools comes with its own linear programming solver, called GLOP Google Linear Optimization Package . 
It is an open source project created by Google s Operations Research Team and written in C . Other solvers are available such as SCIP , an excellent non commercial solver created in 2005 and updated and maintained to this day. We could also use popular commercial options like Gurobi and Cplex . However, we would need to install them on top of OR Tools and get the appropriate licenses which can be quite costly . For now, let s try GLOP. II. Variables We created an instance of the OR Tools solver using GLOP. Now, how to use linear programming? The first thing we want to define is the variables we want to optimize . In our example, we have three variables the number of swordsmen, bowmen, and horsemen in the army. OR Tools accepts three types of variables NumVar for continuous variables IntVar for integer variables BoolVar for boolean variables. We re looking for round numbers of units, so let s choose IntVar . We then need to specify lower and upper bounds for these variables. We want at least 0 unit, but we don t really have an upper bound. So we can say that our upper bound is infinity or any big number we will never reach . It can be written as Let s translate it into code. Infinity is replaced by solver.infinity in OR Tools. Other than that, the syntax is quite straightforward III. Constraints We defined our variables, but the constraints are just as important. Perhaps counter intuitively, adding more constraints helps the solver to find an optimal solution faster . Why is this the case? Think of the solver as a tree constraints help it trim branches and reduce the search space. In our case, we have a limited number of resources we can use to produce units. In other words, we can t spend more resources than we have . For instance, the food spent to recruit units cannot be higher than 1200. The same is true with wood 800 and gold 600 . According to our table, units have the following costs 1 swordsman 60 20 1 bowman 80 10 40 1 horseman 140 100. We can write one constraint per resource as follows In OR Tools, we simply add the constraints to our solver instance with solver.Add . IV. Objective Now that we have our variables and constraints, we want to define our goal or objective function . In linear programming, this function has to be linear like the constraints , so of the form _ax by cz d_. In our example, the objective is quite clear we want to recruit the army with the highest power. The table gives us the following power values 1 swordsman 70 1 bowman 95 1 horseman 230. Maximizing the power of the army amounts to maximizing the sum of the power of each unit . Our objective function can be written as In general, there are only two types of objective functions maximizing or minimizing . In OR Tools, we declare this goal with solver.Maximize or solver.Minimize . And we re done! There are three steps to model any linear optimization problem 1. Declaring the variables to optimize with lower and upper bounds 2. Adding constraints to these variables 3. Defining the objective function to maximize or to minimize. Now that is clear, we can ask the solver to find an optimal solution for us. V. Optimize! Calculating the optimal solution is done with solver.Solve . This function returns a status that can be used to check that the solution is indeed optimal . Let s print the highest total power we can get with the best army configuration. Solution Solved in 87.00 milliseconds in 2 iterations Optimal power 1800.0 power Army Swordsmen 6.0000000000000036 Bowmen 0.0 Horsemen 5.999999999999999 Great! 
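For reference, here is a minimal sketch of the full model that produces an output like the one above, following the five steps (solver, variables, constraints, objective, optimization). The costs and powers come from the unit table; the solver instantiation and the exact printing may differ from the original notebook.

```python
from ortools.linear_solver import pywraplp

# 1. Create the linear solver with the GLOP backend
solver = pywraplp.Solver.CreateSolver("GLOP")

# 2. Declare the variables to optimize (0 <= units < infinity)
swordsmen = solver.IntVar(0, solver.infinity(), "swordsmen")
bowmen = solver.IntVar(0, solver.infinity(), "bowmen")
horsemen = solver.IntVar(0, solver.infinity(), "horsemen")

# 3. Add the resource constraints (food, wood, gold)
solver.Add(swordsmen * 60 + bowmen * 80 + horsemen * 140 <= 1200)  # food
solver.Add(swordsmen * 20 + bowmen * 10 <= 800)                    # wood
solver.Add(bowmen * 40 + horsemen * 100 <= 600)                    # gold

# 4. Maximize the total power of the army
solver.Maximize(swordsmen * 70 + bowmen * 95 + horsemen * 230)

# 5. Solve and display the (near-integer) solution found by GLOP
status = solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
    print(f"Optimal power = {solver.Objective().Value()}")
    print(f"Swordsmen = {swordsmen.solution_value()}")
    print(f"Bowmen = {bowmen.solution_value()}")
    print(f"Horsemen = {horsemen.solution_value()}")
```

Swapping "GLOP" for an integer-capable backend such as "SCIP" or "CBC" is the fix covered in the integer programming article.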
The solver found an optimal solution our army has a total power of 1800 with 6 swordsmen and 6 horsemen sorry bowmen! . Let s unpack this result The solver decided to take the maximum number of horsemen 6, since we only have 600 and they each cost 100 The remaining resources are spent in swordsmen we have 1200 6 140 360 food left, which is why the solver chose 6 swordsmen We can deduce that the horsemen are the best unit and the bowmen are the worst one because they haven t been chosen at all. Okay, but there s something quite weird these numbers are not round, even though we specified that we wanted integers IntVar . So what happened? Unfortunately, answering this question requires a deep dive into linear programming To keep things simple in this introduction, let s say it s because of GLOP. Solvers have characteristics we have to take into account, and GLOP doesn t handle integers . This is another proof that building reusable models is more than just convenient. We ll explain why GLOP has this strange behavior and how to fix it in a more advanced tutorial. Conclusion We saw through this example the five main steps of any linear optimization problem 1. Choosing a solver in our case, we selected GLOP for convenience. 2. Declaring variables the parameters to optimize were the number of swordsmen, bowmen, and horsemen. 3. Declaring constraints each of these units has a cost. The total cost could not exceed our limited resources. 4. Defining objective the criterion to maximize was the total power of this army. It could have been something else, like the number of units. 5. Optimizing GLOP found an optimal solution to this problem in less than a second. Image by author This is the main benefit of linear programming the algorithm gives us a guarantee that the solution that was found is optimal with a certain error . This guarantee is powerful, but comes at a cost the model can be so complex that the solver takes years or more to find an optimal solution. In this scenario, we have two options We can stop the solver after a certain time and probably obtain a suboptimal answer We can use a metaheuristic like a genetic algorithm to calculate an excellent solution in a short amount of time. In the next article, we ll talk about the different types of optimization problems and generalize our approach to an entire class of them. I hope you enjoyed this introduction! Feel free to share it and spread the knowledge about linear optimization. Don t forget to check my blog and follow me on Twitter where I post summaries of these articles. Cheers! Related articles Part 2 Integer vs. Linear Programming in Python _A guide to identify and solve any optimization problem_towardsdatascience.com Part 3 Constraint Programming in Python _The Programming Paradigm to Find One Solution Among 8,080,104 Candidates_towardsdatascience.com Share this post Introduction to Linear Programming in Python maximelabonne.substack.com Copy link Facebook Email Note Other Share Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Maxime Labonne Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. 
", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/introduction-to-linear-programming-in-python-9261e7eb44b" }, { "id": "3ab3dc4a-2632-46fc-b12e-6ed4fc48fe9f", "content": "What is a Tensor in Machine Learning? The difference between tensors, arrays, and matrices (Maxime Labonne, Mar 29, 2022).

What is a tensor, exactly? Most deep learning practitioners know about them but can't pinpoint an exact definition. TensorFlow, PyTorch: every deep learning framework relies on the same basic object, tensors. They're used to store almost everything in deep learning: input data, weights, biases, predictions, etc. And yet, their definition is incredibly fuzzy: the Wikipedia category alone has over 100 pages related to tensors. In this article, we'll give a definitive answer to the following question: what is a tensor in neural networks?

Tensors in computer science

So why are there so many definitions? It's quite simple: different fields have different definitions. Tensors in mathematics are not quite the same as tensors in physics, which are different from tensors in computer science. These definitions can be divided into two categories: tensors as a data structure, or as objects (in an object-oriented programming sense).

Data structure: this is the definition we use in computer science. Tensors are multidimensional arrays that store a specific type of value.
Objects: this is the definition used in other fields. In mathematics and physics, tensors are not just a data structure: they also have a list of properties, like a specific product.

This is why you see a lot of people (sometimes quite pedantically) saying that tensors are not n-dimensional arrays or matrices: they are not talking about data structures, but about objects with properties. Even the same words have different meanings. For instance, in computer science, a 2D tensor is a matrix (it's a tensor of rank 2). In linear algebra, a tensor with 2 dimensions means it only stores two values. The rank also has a completely different definition: it is the maximum number of linearly independent column (or row) vectors.

In computer science, we're only interested in a definition focused on the data structure. From this point of view, tensors truly are a generalization of matrices in n dimensions. But we're still missing an important nuance when talking about tensors specifically in the context of deep learning...

Tensors in deep learning

So why are they called tensors instead of multidimensional arrays? Ok, it is shorter, but is that all there is to it? Actually, people make an implicit assumption when they talk about tensors.
PyTorch's official documentation gives us a practical answer: the biggest difference between a NumPy array and a PyTorch Tensor is that a PyTorch Tensor can run on either CPU or GPU. In deep learning, we need performance to compute a lot of matrix multiplications in a highly parallel way. These matrices (and n-dimensional arrays in general) are generally stored and processed on GPUs to speed up training and inference times. This is what was missing in our previous definition: tensors in deep learning are not just n-dimensional arrays; there's also the implicit assumption that they can run on a GPU.

NumPy vs PyTorch

Let's see the difference between NumPy arrays and PyTorch tensors. These two objects are very similar: we can initialize a 1D array and a 1D tensor with nearly the same syntax. They also share a lot of methods and can be easily converted into one another. You can find the code used in this article at this address.

NumPy array: [1 2 3]. PyTorch tensor: tensor([1, 2, 3]).

Initializing 2D arrays and 2D tensors is not more complicated: [[1 2 3] [4 5 6]] versus tensor([[1, 2, 3], [4, 5, 6]]).

We said that the only difference between tensors and arrays is the fact that tensors can be run on GPUs. So in the end, this distinction is based on performance. But is this boost that important? Let's compare the performance of NumPy arrays and PyTorch tensors on matrix multiplication. In the following example, we randomly initialize 4D arrays/tensors and multiply them: 1.32 s for NumPy versus 25.2 ms for PyTorch.

As we can see, PyTorch tensors completely outperformed NumPy arrays: they completed the multiplication 52 times faster! We could attribute this performance to different factors, such as: NumPy arrays use a float64 format, whereas PyTorch tensors leverage the more efficient float32 format (however, even when NumPy arrays are converted to float32, PyTorch tensors are still 40 times faster); PyTorch tensors are stored on a GPU, unlike NumPy arrays (but if we repeat the same experiment on a CPU, PyTorch tensors still manage to be 2.8 times faster on average). Even when combining both factors, PyTorch tensors prove to be 1.4 times faster, showing that NumPy arrays are truly less performant for matrix multiplication. This is the true power of tensors: they're blazingly fast! Performance may vary depending on the dimensions, the implementation, and the hardware, but this speed is the reason why tensors, and not arrays, are so common in deep learning.

Conclusion

In this article, we wrote a definition of tensors based on: 1. their use in computer science (data structure); 2. more specifically, their use in deep learning (they can run on GPUs). Here's how we can summarize it in one sentence: tensors are n-dimensional arrays with the implicit assumption that they can run on a GPU. Finally, we saw the difference in performance between tensors and arrays, which motivates the need for tensors in deep learning. So next time someone tries to explain to you that tensors are not exactly a generalization of matrices, you'll know that they're right for a particular definition of tensors, but not for the computer science/deep learning one. You can find the code used in this article at this address.
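For reference, here is a small sketch of the comparison described above. The shapes are illustrative, the exact timings depend entirely on your hardware, and a GPU is used only if one is available.

```python
import time

import numpy as np
import torch

# Random 4D arrays / tensors (illustrative shapes)
x_np = np.random.rand(64, 16, 32, 32)
y_np = np.random.rand(64, 16, 32, 32)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x_pt = torch.rand(64, 16, 32, 32, device=device)
y_pt = torch.rand(64, 16, 32, 32, device=device)

# NumPy: matrix multiplication on CPU (float64 by default)
start = time.time()
np.matmul(x_np, y_np)
print(f'NumPy (CPU, float64): {time.time() - start:.4f} s')

# PyTorch: same multiplication, on GPU if available (float32 by default)
start = time.time()
torch.matmul(x_pt, y_pt)
if device == 'cuda':
    torch.cuda.synchronize()  # wait for the GPU kernel to finish before stopping the clock
print(f'PyTorch ({device}, float32): {time.time() - start:.4f} s')
```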
", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/what-is-a-tensor-in-deep-learning-6dedd95d6507" }, { "id": "eac6604b-9bfe-4039-99b1-6449c0a65dd2", "content": "Efficiently iterating over rows in a Pandas DataFrame: never use iterrows and itertuples again (Maxime Labonne, Mar 21, 2022).

When I started machine learning, I followed the guidelines and created my own features by combining multiple columns in my dataset. It's all well and good, but the way I did it was horribly inefficient: I had to wait several minutes to do the most basic operations. My problem was simple: I didn't know the fastest way to iterate over rows in Pandas. I often see people online using the same techniques I used to apply. It's not elegant, but it's ok if you don't have much data. However, if you process more than 10k rows, it quickly becomes an obvious performance issue. In this article, I'm going to give you the best way to iterate over rows in a Pandas DataFrame, with no extra code required. It's not just about performance: it's also about understanding what's going on under the hood to become a better data scientist.

Let's import a dataset in Pandas. In this case, I chose the one I worked on when I started: it's time to fix my past mistakes! You can run the code with the following Google Colab notebook. This dataset has 22k rows and 43 columns with a combination of categorical and numerical values. Each row describes a connection between two computers. Let's say we want to create a new feature: the total number of bytes in the connection. We just have to sum up two existing features, src_bytes and dst_bytes. Let's see different methods to calculate this new feature.

1. Iterrows

According to the official documentation, iterrows iterates over the rows of a Pandas DataFrame as (index, Series) pairs. It converts each row into a Series object, which causes two problems: 1. it can change the type of your data (dtypes); 2. the conversion greatly degrades performance. For these reasons, the ill-named iterrows is the WORST possible method to actually iterate over rows.

10 loops, best of 5: 1.07 s per loop

Now let's see slightly better techniques.

2. For loop with .loc or .iloc (3x faster)

This is what I used to do when I started: a basic for loop that selects rows by index with .loc or .iloc. Why is it bad? Because DataFrames are not designed for this purpose. As with the previous method, rows are converted into Pandas Series objects, which degrades performance. Interestingly enough, .iloc is faster than .loc.
It makes sense: Python doesn't have to check user-defined labels and can look directly at where the row is stored in memory.

10 loops, best of 5: 600 ms per loop (.loc)
10 loops, best of 5: 377 ms per loop (.iloc)

Even this basic for loop with .iloc is 3 times faster than the first method!

3. Apply (4x faster)

The apply method is another popular choice to iterate over rows. It creates code that is easy to understand, but at a cost: performance is nearly as bad as the previous for loop. This is why I would strongly advise you to avoid this function for this specific purpose (it's fine for other applications). Note that I convert the result into a list using the to_list method to obtain identical results.

10 loops, best of 5: 282 ms per loop

The apply method is a for loop in disguise, which is why the performance doesn't improve that much: it's only 4 times faster than the first technique.

4. Itertuples (10x faster)

If you know about iterrows, you probably know about itertuples. According to the official documentation, it iterates over the rows of a DataFrame as namedtuples of the values. In practice, it means that rows are converted into tuples, which are much lighter objects than Pandas Series. This is why itertuples is a better version of iterrows. This time, we need to access the values with an attribute (or an index). If you want to access them with a string (e.g., if there's a space in the column name), you can use the getattr function instead.

10 loops, best of 5: 99.3 ms per loop

This is starting to look better: it is now 10 times faster than iterrows.

5. List comprehensions (200x faster)

List comprehensions are a fancy way to iterate over a list as a one-liner. For instance, [print(i) for i in range(10)] prints numbers from 0 to 9 without any explicit for loop. I say explicit because Python actually processes it as a for loop if we look at the bytecode. So why is it faster? Quite simply because we don't call the .append method in this version.

100 loops, best of 5: 5.54 ms per loop

Indeed, this technique is 200 times faster than the first one! But we can still do better.

6. Pandas vectorization (1500x faster)

Until now, all the techniques simply added up single values. Instead of adding single values, why not group them into vectors and sum them up? The difference between adding two numbers or two vectors is not significant for a CPU, which should speed things up. On top of that, Pandas can process Series objects in parallel, using every CPU core available. The syntax is also the simplest imaginable: this solution is extremely intuitive. Under the hood, Pandas takes care of vectorizing our data with optimized C code using contiguous memory blocks.

1000 loops, best of 5: 734 µs per loop

This code is 1500 times faster than iterrows, and it is even simpler to write.

7. NumPy vectorization (1900x faster)

NumPy is designed to handle scientific computing. It has less overhead than Pandas methods since rows and dataframes all become np.array objects. It relies on the same optimizations as Pandas vectorization. There are two ways of converting a Series into a np.array: .values or .to_numpy. The former has been deprecated for years, which is why we're going to use .to_numpy in this example.

1000 loops, best of 5: 575 µs per loop

We found our winner, with a technique that is 1900 times faster than our first competitor! Let's wrap things up.

Conclusion

The number of rows in the dataset can greatly impact the performance of certain techniques.
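To make the comparison concrete, here is a condensed sketch of the main techniques benchmarked above. The original article uses a real connections dataset (22k rows, 43 columns); a synthetic frame with the two relevant columns is enough to reproduce the comparison, and the exact timings will depend on your machine.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the connections dataset used in the article
n = 100_000
df = pd.DataFrame({
    'src_bytes': np.random.randint(0, 1_000, n),
    'dst_bytes': np.random.randint(0, 1_000, n),
})

# 1. iterrows: each row becomes a Pandas Series (slowest)
total = [row['src_bytes'] + row['dst_bytes'] for _, row in df.iterrows()]

# 3. apply: a for loop in disguise
total = df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1).to_list()

# 4. itertuples: rows become lightweight namedtuples
total = [row.src_bytes + row.dst_bytes for row in df.itertuples()]

# 6. Pandas vectorization: operate on whole Series at once
total = df['src_bytes'] + df['dst_bytes']

# 7. NumPy vectorization: drop down to raw arrays for the least overhead
total = df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()
```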
Don't be like me: if you need to iterate over rows in a DataFrame, vectorization is the way to go! You can find the code to reproduce the experiments at this address. Vectorization is not harder to read, it doesn't take longer to write, and the performance gain is incredible. It's not just about performance: understanding how each method works under the hood helped me to write better code. Performance gains are always based on the same techniques: transforming data into vectors and matrices to take advantage of parallel processing. Alas, this is often at the expense of readability. But it doesn't have to be: iterating over rows is just an example, but it shows that, sometimes, you can have your cake and eat it too.", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/efficiently-iterating-over-rows-in-a-pandas-dataframe-7dd5f9992c01" }, { "id": "59fc9ced-cf49-4c21-9875-7c6c99fb0c16", "content": "Q-learning for beginners: train an AI to solve the Frozen Lake environment (Maxime Labonne, Mar 07, 2022).

The goal of this article is to teach an AI how to solve the Frozen Lake environment using reinforcement learning. Instead of reading Wikipedia articles and explaining formulas, we're going to start from scratch and try to recreate the Q-learning algorithm by ourselves. We'll not just understand how it works, but more importantly why it works: why was it designed that way? What are the hidden assumptions, the details that are never explained in regular courses and tutorials? At the end of this article, you'll master the Q-learning algorithm and be able to apply it to other environments and real-world problems. It's a cool mini-project that gives a better insight into how reinforcement learning works and can hopefully inspire ideas for original and creative applications.

Let's start by installing the Frozen Lake environment and importing the necessary libraries: gym for the game, random to generate random numbers, and numpy to do some math.

I. Frozen Lake

Now, let's talk about the game we're going to be solving in this tutorial. Frozen Lake is a simple environment composed of tiles, where the AI has to move from an initial tile to a goal. Tiles can be a safe frozen lake, or a hole that gets you stuck forever. The AI, or agent, has 4 possible actions: go LEFT, DOWN, RIGHT, or UP. The agent must learn to avoid holes in order to reach the goal in a minimal number of actions.
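The setup assumed for the rest of the tutorial is shown below. The article does not pin package versions, so this follows the classic gym API that was current when it was written.

```python
# pip install gym numpy
import random

import gym
import numpy as np
```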
By default, the environment is always in the same configuration. In the environment's code, each tile is represented by a letter. The map is:

SFFF
FHFH
FFFH
HFFG

where S = starting point (safe), F = frozen surface (safe), H = hole (stuck forever), and G = goal (safe).

We can try to manually solve the example above to understand the game. Let's see if the following sequence of actions is a correct solution: RIGHT, RIGHT, RIGHT, DOWN, DOWN, DOWN. Our agent starts on tile S, so we move right on a frozen surface, then again, then once more, then we go down and... find a hole. Actually, it's really easy to find several correct solutions: RIGHT, RIGHT, DOWN, DOWN, DOWN, RIGHT is an obvious one. But we could also make a sequence of actions that loops around a hole 10 times before reaching the goal. This sequence is valid, but it doesn't meet our final requirement: the agent needs to reach the goal in a minimum number of actions. In this example, the minimum number of actions to complete the game is 6. We need to remember this fact to check whether our agent really masters Frozen Lake or not.

Let's initialize the environment with the gym library. There are two versions of the game: one with slippery ice, where selected actions have a random chance of being disregarded by the agent, and a non-slippery one, where actions cannot be ignored. We'll use the non-slippery one to begin with because it's easier to understand. When we render the environment, we can see that the game that was created has the exact same configuration as in our example: it is the same puzzle. The position of our agent is indicated by a red rectangle. Solving this puzzle could be done with a simple script and if-else conditions, which would actually be useful to compare our AI to a simpler approach. However, we want to try a more exciting solution: reinforcement learning.

II. Q-table

In Frozen Lake, there are 16 tiles, which means our agent can be found in 16 different positions, called states. For each state, there are 4 possible actions: go LEFT, DOWN, RIGHT, and UP. Learning how to play Frozen Lake is like learning which action you should choose in every state. To know which action is the best in a given state, we would like to assign a quality value to our actions. We have 16 states and 4 actions, so we want to calculate 16 x 4 = 64 values. A nice way of representing this is a table, known as a Q-table, where rows list every state s and columns list every action a. In this Q-table, each cell contains a value Q(s, a), which is the value (quality) of the action a in the state s (1 if it's the best action possible, 0 if it's really bad). When our agent is in a particular state s, it just has to check this table to see which action has the highest value. Taking the action with the highest value makes sense, but we'll see later that we can design something even better. Let's create our Q-table and fill it with zeros, since we still have no idea of the value of each action in each state. The result is a table with 16 rows (our 16 states) and 4 columns (our 4 actions), as expected. Let's try to see what we can do next: every value is set to zero, so we have no information at all.
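A minimal sketch of this setup is shown below. It follows the older gym API used in the article; recent gym/gymnasium releases change the return values of reset() and step() and the render mode, so you may need small adjustments.

```python
import gym
import numpy as np

# Non-slippery Frozen Lake, same 4x4 map as described above
environment = gym.make('FrozenLake-v1', is_slippery=False)
environment.reset()
environment.render()  # prints the map with the agent's position highlighted

# 16 states x 4 actions, filled with zeros
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))
print('Q-table =')
print(qtable)
```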
Let's say that the agent takes a random action: LEFT, DOWN, RIGHT, or UP. We can use the random library with the choice method to randomly choose an action. Wait, actually the agent is currently on the initial state S, which means only two actions are possible: RIGHT and DOWN. The agent can also take the actions UP and LEFT, but it won't move: its state doesn't change. Therefore, we do not put any constraint on which actions are possible: the agent will naturally understand that some of them don't do anything.

We can keep using random.choice, but the gym library already implements a method to randomly choose an action. It might save us some hassle later, so let's try it. Oops... this time the result is a number. We could read gym's documentation, but it is unfortunately quite scarce. No worries though: we can check the source code on GitHub to understand what these numbers mean. It's actually super straightforward: LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3.

Okay, now that we understand how gym connects numbers to directions, let's use it to move our agent to the right. This time, it can be performed with the step(action) method. We can directly provide it the number 2, corresponding to the direction we chose (right), and check if the agent moved. Huzzah! The red square moved from the initial state S to the tile on its right: our prediction was correct. And that's all we need to know in order to interact with the environment: 1. how to randomly choose an action, using action_space.sample(); 2. how to implement this action and move our agent in the desired direction, with step(action). To be completely exhaustive, we can add: 3. how to display the current map to see what we're doing, with render(); 4. how to restart the game when the agent falls into a hole or reaches the goal G, with reset().

Now that we understand how to interact with our gym environment, let's go back to our algorithm. In reinforcement learning, agents are rewarded by the environment when they accomplish a predefined goal. In Frozen Lake, the agent is only rewarded when it reaches the state G (see the source code). We cannot control this reward; it is set in the environment: it's 1 when the agent reaches G, and 0 otherwise. Let's print it every time we implement an action: the reward is returned by the method step(action). After moving left from the starting tile, the reward is indeed 0.

Wow, I guess we're in a pickle: only one state can give us a positive reward in the entire game. How are we supposed to take the right directions at the very beginning when the only validation we have is at the very end? If we ever want to see a reward of 1, we'd need to be lucky enough to find the correct sequence of actions by chance. Unfortunately, that's exactly how it works: the Q-table will remain filled with zeros until the agent randomly reaches the goal G. The problem would be much simpler if we could have intermediate, smaller rewards to guide our path towards the goal. Alas, this is actually one of the main issues of reinforcement learning: this phenomenon, called sparse rewards, makes agents very difficult to train on problems where the only reward is at the end of a long sequence of actions. Different techniques have been proposed to mitigate this issue, but we'll talk about them another time.

III. Q-learning

Let's go back to our problem. Okay, we need to be lucky enough to find the goal G by accident. But once it's done, how do we backpropagate the information to the initial state?
The Q-learning algorithm offers a clever solution to this issue. We need to update the value of our state-action pairs (each cell in the Q-table) considering (1) the reward for reaching the next state, and (2) the highest possible value in the next state.

We know we get a reward of 1 when we move to G. As we just said, the value of the state next to G (let's call it G-1) with the relevant action to reach G is increased thanks to the reward. Okay, good: end of the episode, the agent won and we restart the game. Now, the next time the agent is in a state next to G-1, it will increase the value of this state (let's call it G-2) with the relevant action to reach G-1. The next time the agent is in a state next to G-2, it will do the same. Rinse and repeat, until the update reaches the initial state S.

Let's try to find the update formula to backpropagate the values from G to S. Remember: values denote the quality of an action in a specific state (0 if it's terrible, 1 if it's the best action possible in this state). We try to update the value of the action a_t (for example, a_t = 0 if the action is LEFT) in the state s_t (for example, s_t = 0 when the agent is in the initial state S). This value is just a cell in our Q-table, corresponding to the row number s_t and the column number a_t; this value is formally called Q(s_t, a_t). As we said previously, we need to update it using (1) the reward for the next state, formally noted r_t, and (2) the maximum possible value in the next state, max_a Q(s_{t+1}, a). Therefore, the update formula must look like:

Q_new(s_t, a_t) = Q(s_t, a_t) + r_t + max_a Q(s_{t+1}, a)

The new value is the current one, plus the reward, plus the highest value in the next state. We can manually try our formula to check if it looks correct: let's pretend our agent is in the state G-1, next to the goal G, for the first time. We can update the value corresponding to the winning action in this state with Q(G-1, a_t) = 0 and max_a Q(G, a) = 0, because the Q-table is empty, and r_t = 1, because we get the only reward in this environment. We obtain Q_new(G-1, a_t) = 1. The next time the agent is in a state next to this one (G-2), we update it too using the formula and get the same result: Q_new(G-2, a_t) = 1. In the end, we backpropagate ones in the Q-table from G to S.

Okay, it works, but the result is binary: either it's the wrong state-action pair or the best one. We would like more nuance. Actually, we almost found the true Q-learning update formula with common sense. The nuance we're looking for adds two parameters:

α is the learning rate (between 0 and 1), which is how much we should change the original Q(s_t, a_t) value. If α = 0, the value never changes, but if α = 1, the value changes extremely fast. In our attempt, we didn't limit the learning rate, so α = 1. But this is too fast in reality: the reward and the maximum value in the next state quickly overpower the current value. We need to find a balance between the importance of past and new knowledge.

γ is the discount factor (between 0 and 1), which determines how much the agent cares about future rewards compared to immediate ones (as the saying goes, a bird in the hand is worth two in the bush). If γ = 0, the agent only focuses on immediate rewards, but if γ = 1, any potential future reward has the same value as current ones. In Frozen Lake, we want a high discount factor since there's only one possible reward, at the very end of the game.
With the real Q-learning algorithm, the new value is calculated as follows:

Q_new(s_t, a_t) = Q(s_t, a_t) + α * (r_t + γ * max_a Q(s_{t+1}, a) - Q(s_t, a_t))

Okay, let's try this new formula before implementing it. Once again, we can pretend that our agent is next to the goal G for the first time. We can update the winning state-action pair using our formula: Q_new(G-1, a_t) = 0 + α * (1 + γ * 0 - 0). We can assign arbitrary values to α and γ to calculate the result. With α = 0.5 and γ = 0.9, we get Q_new(G-1, a_t) = 0 + 0.5 * (1 + 0.9 * 0 - 0) = 0.5. The second time the agent is in this state, we get Q_new(G-1, a_t) = 0.5 + 0.5 * (1 + 0.9 * 0 - 0.5) = 0.75, then 0.875, 0.9375, 0.96875, etc.

So training our agent in code means: 1. choosing a random action (using action_space.sample()) if the values in the current state are just zeros, otherwise taking the action with the highest value in the current state (with np.argmax); 2. implementing this action by moving in the desired direction with step(action); 3. updating the value of the original state with the action we took, using the information about the new state and the reward given by step(action). We keep repeating these 3 steps until the agent gets stuck in a hole or reaches the goal G. When that happens, we restart the environment with reset() and start a new episode, until we hit 1,000 episodes. Additionally, we can plot the outcome of each run (failure if it didn't reach the goal, success otherwise) to observe the progress of our agent.

Before training, the Q-table is entirely filled with zeros. After training, the cells along the learned path contain values such as 0.59049, 0.6561, 0.729, 0.81, 0.9, and 1.0, while almost everything else stays close to zero. The agent is trained! Each blue bar on the success plot corresponds to a win, so we can see that the agent had a hard time finding the goal at the beginning of the training. But once it found it several times in a row, it began to win consistently. The trained Q-table is also very interesting: these values indicate the unique sequence of actions the agent learned to reach the goal.

Now let's see how it performs by evaluating it on 100 episodes. We consider that the training is over, so we don't need to update the Q-table anymore. To see how the agent performs, we can calculate the percentage of times it managed to reach the goal (the success rate).

Success rate: 100.0%

Not only has our agent been trained, but it manages to hit a 100% success rate. Great job everyone, the non-slippery Frozen Lake is solved! We can even visualize the agent moving on the map and print the sequence of actions it took to check if it's the best one: Sequence = [2, 2, 1, 1, 1, 2]. The agent can learn several correct sequences of actions: [2, 2, 1, 1, 1, 2], [1, 1, 2, 2, 1, 2], etc. The good thing is that there are only 6 actions in our sequence, which was the minimum possible number of actions we counted: it means that our agent learned to solve the game in an optimal way. In the case of [2, 2, 1, 1, 1, 2], which corresponds to RIGHT, RIGHT, DOWN, DOWN, DOWN, RIGHT, it's exactly the sequence we predicted at the very beginning of the article.
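For reference, here is a minimal sketch of the training loop described in the three steps above, again using the older gym API. The hyperparameter values match the ones used earlier in the article (α = 0.5, γ = 0.9); everything else is an illustrative choice rather than the article's exact code.

```python
import gym
import numpy as np

environment = gym.make('FrozenLake-v1', is_slippery=False)
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

episodes = 1000
alpha = 0.5   # learning rate
gamma = 0.9   # discount factor

for _ in range(episodes):
    state = environment.reset()
    done = False

    while not done:
        # 1. Exploit the Q-table if it already has information, otherwise act randomly
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        else:
            action = environment.action_space.sample()

        # 2. Implement the action
        new_state, reward, done, info = environment.step(action)

        # 3. Q-learning update of the original state-action pair
        qtable[state, action] = qtable[state, action] + alpha * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )
        state = new_state

print('Q-table after training =')
print(qtable)
```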
IV. Epsilon-greedy algorithm

Despite this success, there's something that bothers me with our previous approach: the agent always chooses the action with the highest value. So whenever a state-action pair starts having a non-zero value, the agent will always choose it. The other actions will never be taken, which means we'll never update their value. But what if one of these actions was better than the one the agent always takes? Shouldn't we encourage the agent to try new things from time to time and see if it can improve? In other words, we want to allow our agent to either take the action with the highest value (exploitation) or choose a random action to try to find even better ones (exploration).

A tradeoff between these two behaviors is important: if the agent only focuses on exploitation, it cannot try new solutions and thus doesn't learn anymore. On the other hand, if the agent only takes random actions, the training is pointless since it doesn't use the Q-table. So we want to change this parameter over time: at the beginning of the training, we want to explore the environment as much as possible. But exploration becomes less and less interesting as the agent gets to know every possible state-action pair. This parameter represents the amount of randomness in the action selection.

This technique is commonly called the epsilon-greedy algorithm, where epsilon is our parameter. It is a simple but extremely efficient method to find a good tradeoff. Every time the agent has to take an action, it has a probability ε of choosing a random one, and a probability 1 - ε of choosing the one with the highest value. We can decrease the value of epsilon at the end of each episode by a fixed amount (linear decay), or based on the current value of epsilon (exponential decay).

Let's implement a linear decay. We'll start with ε = 1 to be in full exploration mode, and decrease this value by 0.001 after each episode. Now that we have a sound understanding of it, we can implement it for real and see how it changes the agent's behavior. Hey, the agent takes more time to consistently win the game now! And the Q-table has a lot more non-zero values than the previous one, which means the agent has learned several sequences of actions to reach the goal. This is understandable, since this new agent is forced to explore state-action pairs instead of always exploiting the ones with non-zero values.

Let's see if it's as successful as the previous one at winning the game. In evaluation mode, we don't want exploration anymore, because the agent is trained now.

Success rate: 100.0%

Phew, it's another 100% success rate! We didn't degrade the model. The benefits of this approach might not be obvious in this example, but our model became less static and more flexible. A minimal sketch of this epsilon-greedy variant of the training loop is shown below.
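Compared to the previous loop, only the action selection and the end-of-episode decay change. This is again a sketch under the older gym API, with illustrative values for epsilon and its decay (the article starts at ε = 1 and decreases it by 0.001 per episode).

```python
import random

import gym
import numpy as np

environment = gym.make('FrozenLake-v1', is_slippery=False)
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

episodes, alpha, gamma = 1000, 0.5, 0.9
epsilon = 1.0          # start in full exploration mode
epsilon_decay = 0.001  # linear decay applied after each episode

for _ in range(episodes):
    state = environment.reset()
    done = False

    while not done:
        # Explore with probability epsilon, exploit otherwise
        if random.uniform(0, 1) < epsilon:
            action = environment.action_space.sample()
        else:
            action = np.argmax(qtable[state])

        new_state, reward, done, info = environment.step(action)
        qtable[state, action] += alpha * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )
        state = new_state

    epsilon = max(epsilon - epsilon_decay, 0)
```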
The agent learned different paths (sequences of actions) from S to G, instead of just one as in the previous approach. More exploration can degrade performance, but it's necessary to train agents that can adapt to new environments.

V. Challenge: slippery Frozen Lake

We didn't solve the entire Frozen Lake environment: we only trained an agent on the non-slippery version, using is_slippery=False during initialization. In the slippery variant, the action the agent takes only has a 33% chance of succeeding. In case of failure, one of the three other actions is randomly taken instead. This feature adds a lot of randomness to the training, which makes things more difficult for our agent. Let's see how well our code does in this new environment... After training, the Q-table only contains small, noisy values, and the evaluation gives:

Success rate: 17.0%

Oof, that's not so good. But can you improve the performance by tweaking the different parameters we talked about? I encourage you to take on this little challenge and do it on your own, to have fun with reinforcement learning and check that you understood everything we said about Q-learning. And why not implement exponential decay for the epsilon-greedy algorithm too? During this quick exercise, you might realise that slightly modifying the hyperparameters can completely destroy the results. This is another quirk of reinforcement learning: hyperparameters are quite moody, and it is important to understand their meaning if you want to tweak them. It's always good to test new combinations to build your intuition and become more efficient. Good luck and have fun!

VI. Conclusion

Q-learning is a simple yet powerful algorithm at the core of reinforcement learning. In this article:
We learned to interact with the gym environment to choose actions and move our agent;
We introduced the idea of a Q-table, where rows are states, columns are actions, and cells are the value of an action in a given state;
We experimentally recreated the Q-learning update formula to tackle the sparse reward problem;
We implemented an entire training and evaluation process, which solved the Frozen Lake environment with a 100% success rate;
We implemented the famous epsilon-greedy algorithm in order to create a tradeoff between the exploration of unknown state-action pairs and the exploitation of the most successful ones.

Frozen Lake is a very simple environment, but others can have so many states and actions that it becomes impossible to store the Q-table in memory. This is especially the case in environments where events are not discrete but continuous (like Super Mario Bros. or Minecraft). When this problem arises, a popular technique consists of training a deep neural network to approximate the Q-table.
This method adds several layers of complexity, since neural networks are not very stable. But I will cover it in another tutorial, with different techniques to stabilize them. Until then, share this article if it helped you.", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/q-learning-for-beginners-2837b777741" }, { "id": "8fbc7862-3fd6-4e44-a9c2-19bf6eb43ba4", "content": "How to start Machine Learning for Developers in 2022: a list of curated resources to start your ML journey (Maxime Labonne, Jan 31, 2022).

As a PhD student and a research scientist in machine learning, many people have asked me the same question over the years: how do I start machine learning? My answers varied greatly, ranging from the most technical (start looking at notebooks on Kaggle?) to the more approachable (I think fast.ai has a great course, or, oh, do you know Coursera?). So it's finally time for me to settle the matter once and for all... until next year. Machine learning is a constantly evolving field with an abundance of guides and tutorials. And that may just be the main problem: there are just too many options. Even searching for start machine learning on the Internet yields mixed results: alluring ads, outdated forum responses, and an overwhelming amount of e-learning courses. In this post, I want to talk about my recommended methods for learning about this ever-changing field and provide you with the best resources for getting started with machine learning. This guide is not just for coding, but also for inspiration and motivation, depending on your learning style.

Top-down learning style

Learning is difficult: it takes time and motivation. To me, the most daunting part of learning something new is the fact that I do not yet know how much work it entails. So I find that the best first step in my learning journey is to try to map the field I am entering. When it's a niche topic, I can look at academic surveys. But for something as big as machine learning, I consume high-level resources like videos and podcasts to stay up to date. These high-level resources are a great way to understand the breadth and depth of this field, which keeps growing on a daily basis with new methods, applications, and challenges. Unfortunately, these resources are usually not technical enough to truly teach machine learning.
To truly delve deeper into ML, start implementing algorithms, and understand more of the field, some kind of course is needed. The choice of language and libraries is not very relevant at this point, so it's better to follow the standards found in most guides: Python, scikit-learn, Pandas. It is much more important to understand the concepts than to learn the syntax of each and every framework. Courses can be complemented by more specific technical articles, often in the form of blog posts. These are an essential link between the theoretical knowledge from courses and the actual implementation used to solve real problems. Finally, whether it's because you encounter fundamental problems that you don't know how to solve or because you seek a complete understanding of the field, low-level resources become necessary at some point. They can be books, academic courses, scientific papers, etc. The goal here is not to learn math from scratch, but to take a bottom-up approach to identify what was missing in our understanding of the problem. In the case of machine learning, some grasp of statistics, probability, and linear algebra is a plus.

You may already be using this learning style instead of the opposite, academic approach, and you may be encountering hurdles in your learning process, or you may not have used any of these methods before. In any case, this article aims to provide you with the best educational resources for different types of media, divided per tier. And since individuals differ in the way they learn, I encourage you to choose the materials that best suit you. The most effective way to make progress is to combine different media at different levels to see the same concepts addressed in different ways. Whatever you choose, these guides are great tools for starting or continuing to learn machine learning.

Tier 1: educational entertainment

Videos and podcasts are the easiest way to approach a new topic. They do not require extensive work or focus and can be consumed anywhere. While they by no means replace proper courses, they can be highly motivating and are effective at introducing a lot of applications and topics in a short amount of time.

Two Minute Papers: a YouTube channel run by Károly Zsolnai-Fehér, an ex-researcher at TU Wien. He showcases and explains research works in simple terms, in several minutes. This channel focuses on topics related to physical simulation and computer graphics. It's a great way to see a variety of original machine learning applications and find inspiration for your own projects.

Yannic Kilcher: the host of ML News, an upbeat summary of the latest news in machine learning. And there is a lot of news: more and more companies, institutions, and universities communicate about new projects, products, and advancements in this field. The last segment of ML News, called useful things, is entirely dedicated to presenting new and popular libraries, frameworks, and applications. Yannic Kilcher also, and maybe most importantly, makes paper-review videos, where he explains and annotates research papers in an easy-to-follow, step-by-step manner. Though this type of video content is more specific and does require a good understanding of the topic, it is an excellent solution if you need to read a paper he has already covered.

AI Coffee Break with Letitia: Letitia Parcalabescu covers recent research articles and advancements in deep learning.
Her videos can be quite technical and require some prior knowledge of the topic, but quite a few are more high-level and talk about broader topics in AI. They are a good way of understanding what's currently happening in research (sometimes in great detail) and what we can expect next.

Practical AI: a podcast hosted by a data scientist at SIL International and a principal AI strategist at Lockheed Martin. As the name suggests, it has a particular focus on making AI accessible to everyone with real-world implementations. They talk about tools to automate and simplify ML tasks and about how to scale a product to serve millions of users. Their grounded approach makes them accessible, even to beginners in this field.

The TWIML AI Podcast: This Week in Machine Learning and Artificial Intelligence is your typical interview podcast with ML practitioners and enthusiasts. It has over 500 episodes and covers a broad spectrum of interviewees: engineers, leaders, researchers, and business people. This means they tackle ML from different points of view, giving unique perspectives on problems in the field and on ML as a subject, and it allows a better understanding of the topic and its stakes.

Tier 2: courses and technical posts

Taking courses is still a necessary step to learn the libraries and tools related to machine learning. The resources I list below focus primarily on the Python ecosystem, since Python is the most used language in ML thanks to its powerful libraries (sklearn, TensorFlow, PyTorch) and its clean and easy syntax. However, the knowledge from these courses is absolutely transferable to other languages and frameworks. Depending on the end application, technical posts are also a great source of information, since they can point towards certain techniques and give you clear answers to particular problems. Keep in mind, though, that posts and articles can easily become outdated, so their results are not always easily reproducible.

Kaggle's Intro to Machine Learning: Kaggle has a great introductory course with a practical approach to the basics of machine learning. It's a series of 7 quick tutorials with exercises, for example on how to set up a classic pipeline with data exploration and how to get started with model training and model validation. It's the perfect first step to learn machine learning in under 3 hours, without any installation required. Another perk: Kaggle offers online notebooks, which makes practicing the exercises very accessible.

fast.ai: fast.ai provides great online courses designed by a passionate and active team. Their goal is to make AI accessible to everyone, regardless of your background, your preferred language, or your data and applications. Instead of being confronted with an overwhelming amount of theory at the start, they advocate a very hands-on approach. Their Practical Deep Learning for Coders course is a good example of this: from the first lesson, you are able to execute very recent deep neural network models and see their results. In the following lessons, they build on these insights by explaining the architectures, how they truly work, and how they are able to output these results.
While that particular course can be quite advanced, their other course, Introduction to Machine Learning, covers regular ML starting with the basics: tabular datasets, random forests, and model validation. It has the same practical and comprehensive approach, which is very effective at teaching you the basics and complexities of ML, and it can be seen as an extended version (around 24 hours) of the Kaggle course.

Machine Learning Mastery: a popular blog among practitioners with a lot of practical applications of ML tasks and topics, like time series forecasting or imbalanced learning. Unsurprisingly, it is often one of the first results that appear on Google when I look for an answer to a specific ML problem. And that's probably also the best way of using it: there are so many articles that it's simply impossible to read them all, but you should definitely check whether they have something about your problem of interest. Machine Learning Mastery is a valuable library of practical ML resources you can pick and choose from.

Towards Data Science: a Medium publication focused on data science, machine learning, and deep learning. Articles are not necessarily of the highest academic quality: you can find language-specific tips and other kinds of clickbait content. But it also tackles a wide range of topics, from cool applications, like geospatial wildfire risk prediction, to educational pieces, such as an explanation of a specific new metric. Towards Data Science, and posts on Medium in general, can be used as a place to find answers to specific problems, like Machine Learning Mastery, or simply as inspiration from creative and well-presented work.

Tier 3: academic sources

Academic sources have the benefit that they are backed, checked, and managed by known and trusted sources. On the other hand, they're also more difficult to read and can be quite time-consuming. The investment you make in reading them does not bring the same level of reward as online courses, because the information is significantly less dense. Nonetheless, they are a necessary step to reproduce models and architectures from research papers, or to truly master the fundamentals of machine learning.

Machine Learning (Stanford University, Coursera): Andrew Ng is a co-founder of Coursera and is especially known for his Machine Learning course. It is by far the most popular and influential course in ML. His teaching style is the opposite of fast.ai's: it's a bottom-up approach, with a lot of theory to understand before applying it to real problems. Since it was released in 2011, the quality of the audio and video leaves something to be desired. However, the content is still relevant and can be complemented with his deep learning specialization.

Neural Networks and Deep Learning: a free online book focused on explaining the core concepts of neural networks step by step, with clear code and explanations.
It does not cover any other ML algorithm, but it is an excellent introduction to the theory behind deep and shallow neural networks. The author does a great job of building the reader's intuition about key concepts so they can build their own nets from scratch. The book also answers fundamental questions like why are deep neural networks difficult to train?, which can be applied to a variety of deep learning architectures.

Scientific papers: scientific papers are published in journals or as proceedings at conferences and are most often protected behind a paywall. Fortunately, there is a culture in ML of publishing preprints (non-final versions of articles) on arXiv, a free, open-access archive of over 2 million scholarly articles in various scientific fields. If all else fails and you can't find the article you're looking for on arXiv, you can always send a polite email to the first author to request it; we're generally happy to share our work with as many people as possible.

Conclusion

This article is far from an exhaustive list of resources for learning ML, but the content discussed above does provide a solid foundation and specific knowledge of the field. Practice makes perfect, and only practice can truly give you the skills to translate the theoretical knowledge you learn into real-world applications. Therefore, it is important to play with ML projects, whether they are real problems you want to tackle or public projects on Kaggle. And, to be honest, they probably won't be solved with linear regression or k-means clustering. Learning the basics and practicing is nonetheless an important step to master if you want to build expertise in more in-depth subfields, like natural language processing or graph neural networks. I hope you can apply the same learning framework to every topic you encounter and become an expert in no time. AI is an exciting field, so don't forget to have fun!", "platform": "maximelabonne.substack.com", "author_id": "eff74089-0271-4319-8543-745c087f4f61", "author_full_name": "Maxime Labonne", "link": "https://maximelabonne.substack.com/p/how-to-start-machine-learning-for-developers-in-2022-390af12b193f" }, { "id": "34978aea-e179-44b5-975c-7deb64456380", "content": "An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin: from data gathering to productionizing LLMs using LLMOps good practices.
LLM Twin Course: Building Your Production-Ready AI Replica. An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin: from data gathering to productionizing LLMs using LLMOps good practices. Paul Iusztin, published in Decoding ML, Mar 16, 2024. This is the 1st of the 12 lessons of the LLM Twin free course.

What is your LLM Twin? It is an AI character that writes like you, by incorporating your style, personality, and voice into an LLM.

Why is this course different? By finishing the LLM Twin: Building Your Production-Ready AI Replica free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself, powered by LLMs, vector DBs, and LLMOps good practices. Why should you care? No more isolated scripts or notebooks! Learn production ML by building and deploying an end-to-end, production-grade LLM system.

What will you learn to build by the end of this course? You will learn how to architect and build a real-world LLM system from start to finish, from data collection to deployment. You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning. The end goal? Build and deploy your own LLM twin.

The architecture of the LLM twin is split into 4 Python microservices:
The data collection pipeline: crawl your digital data from various social media platforms; clean, normalize, and load the data to a NoSQL DB through a series of ETL pipelines; send database changes to a queue using the CDC pattern (deployed on AWS).
The feature pipeline: consume messages from the queue through a Bytewax streaming pipeline; every message is cleaned, chunked, embedded using Superlinked, and loaded into a Qdrant vector DB in real time (deployed on AWS).
The training pipeline: create a custom dataset based on your digital data; fine-tune an LLM using QLoRA; use Comet ML's experiment tracker to monitor the experiments; evaluate and save the best model to Comet's model registry (deployed on Qwak).
The inference pipeline: load and quantize the fine-tuned LLM from Comet's model registry; deploy it as a REST API; enhance the prompts using RAG; generate content using your LLM twin; monitor the LLM using Comet's prompt monitoring dashboard (deployed on Qwak).

Along with the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML as your ML platform, Qdrant as your vector DB, and Qwak as your ML infrastructure.

Who is this for? Audience: MLEs, DEs, DSs, or SWEs who want to learn to engineer production-ready LLM systems using LLMOps good principles. Level: intermediate. Prerequisites: basic knowledge of Python, ML, and the cloud.

How will you learn? The course contains 10 hands-on written lessons and the open-source code you can access on GitHub, showing how to build an end-to-end LLM system. It also includes 2 bonus lessons on how to improve the RAG system. You can read everything at your own pace; to get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.

Costs? The articles and code are completely free, and they will always remain free. But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs. The cloud computing platforms (AWS, Qwak) have a pay-as-you-go pricing plan, and Qwak offers a few hours of free computing.
Thus, we did our best to keep costs to a minimum.For the other serverless tools Qdrant, Comet , we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by Paul Iusztin Senior ML MLOps EngineerAlex Vesa Senior AI EngineerAlex Razvant Senior ML MLOps EngineerLessons Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson An End to End Framework for Production Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture Enabling Event Driven ArchitecturesSOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine Tuning LLMsHow to fine tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine Tuned LLMsArchitect scalable and cost effective LLM RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework Bonus Build a scalable RAG ingestion pipeline using 74.3 less code Bonus Build Multi Index Advanced RAG Apps Check out the code on GitHub 1 and support us with a Let s start with Lesson 1 Lesson 1 End to end framework for production ready LLM systemsIn the first lesson, we will present the project you will build during the course your production ready LLM Twin AI replica.Afterward, we will explain what the 3 pipeline design is and how it is applied to a standard ML system.Ultimately, we will dig into the LLM project system design.We will present all our architectural decisions regarding the design of the data collection pipeline for social media data and how we applied the 3 pipeline architecture to our LLM microservices.In the following lessons, we will examine each component s code and learn how to implement and deploy it to AWS and Qwak.LLM twin system architecture Image by the Author Table of ContentsWhat are you going to build? The LLM twin conceptThe 3 pipeline architectureLLM twin system design Check out the code on GitHub 1 and support us with a 1. What are you going to build? The LLM twin conceptThe outcome of this course is to learn to build your own AI replica. We will use an LLM to do that, hence the name of the course LLM Twin Building Your Production Ready AI Replica.But what is an LLM twin?Shortly, your LLM twin will be an AI character who writes like you, using your writing style and personality.It will not be you. It will be your writing copycat.More concretely, you will build an AI replica that writes social media posts or technical articles like this one using your own voice.Why not directly use ChatGPT? You may ask When trying to generate an article or post using an LLM, the results tend to be very generic and unarticulated,contain misinformation due to hallucination ,require tedious prompting to achieve the desired result.But here is what we are going to do to fix that First, we will fine tune an LLM on your digital data gathered from LinkedIn, Medium, Substack and GitHub.By doing so, the LLM will align with your writing style and online personality. It will teach the LLM to talk like the online version of yourself.Have you seen the universe of AI characters Meta released in 2024 in the Messenger app? 
If not, you can learn more about it here 2 .To some extent, that is what we are going to build.But in our use case, we will focus on an LLM twin who writes social media posts or articles that reflect and articulate your voice.For example, we can ask your LLM twin to write a LinkedIn post about LLMs. Instead of writing some generic and unarticulated post about LLMs e.g., what ChatGPT will do , it will use your voice and style.Secondly, we will give the LLM access to a vector DB to access external information to avoid hallucinating. Thus, we will force the LLM to write only based on concrete data.Ultimately, in addition to accessing the vector DB for information, you can provide external links that will act as the building block of the generation process.For example, we can modify the example above to Write me a 1000 word LinkedIn post about LLMs based on the article from this link URL . Excited? Let s get started 2. The 3 pipeline architectureWe all know how messy ML systems can get. That is where the 3 pipeline architecture kicks in.The 3 pipeline design brings structure and modularity to your ML system while improving your MLOps processes.ProblemDespite advances in MLOps tooling, transitioning from prototype to production remains challenging.In 2022, only 54 of the models get into production. Auch.So what happens?Maybe the first things that come to your mind are the model is not mature enoughsecurity risks e.g., data privacy not enough dataTo some extent, these are true.But the reality is that in many scenarios the architecture of the ML system is built with research in mind, or the ML system becomes a massive monolith that is extremely hard to refactor from offline to online.So, good SWE processes and a well defined architecture are as crucial as using suitable tools and models with high accuracy.Solution The 3 pipeline architectureLet s understand what the 3 pipeline design is.It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components 1. the feature pipeline2. the training pipeline3. the inference pipeline also known as the Feature Training Inference FTI architecture. 1. The feature pipeline transforms your data into features labels, which are stored and versioned in a feature store. The feature store will act as the central repository of your features. That means that features can be accessed and shared only through the feature store. 2. The training pipeline ingests a specific version of the features labels from the feature store and outputs the trained model weights, which are stored and versioned inside a model registry. The models will be accessed and shared only through the model registry. 3. The inference pipeline uses a given version of the features from the feature store and downloads a specific version of the model from the model registry. 
Its final goal is to output the predictions to a client.The 3 pipeline architecture Image by the Author .This is why the 3 pipeline design is so beautiful it is intuitive it brings structure, as on a higher level, all ML systems can be reduced to these 3 components it defines a transparent interface between the 3 components, making it easier for multiple teams to collaborate the ML system has been built with modularity in mind since the beginning the 3 components can easily be divided between multiple teams if necessary every component can use the best stack of technologies available for the job every component can be deployed, scaled, and monitored independently the feature pipeline can easily be either batch, streaming or bothBut the most important benefit is that by following this pattern, you know 100 that your ML model will move out of your Notebooks into production. If you want to learn more about the 3 pipeline design, I recommend this excellent article 3 written by Jim Dowling, one of the creators of the FTI architecture.3. LLM Twin System designLet s understand how to apply the 3 pipeline architecture to our LLM system.The architecture of the LLM twin is split into 4 Python microservices The data collection pipelineThe feature pipelineThe training pipelineThe inference pipelineLLM twin system architecture Image by the Author As you can see, the data collection pipeline doesn t follow the 3 pipeline design. Which is true.It represents the data pipeline that sits before the ML system.The data engineering team usually implements it, and its scope is to gather, clean, normalize and store the data required to build dashboards or ML models.But let s say you are part of a small team and have to build everything yourself, from data gathering to model deployment.Thus, we will show you how the data pipeline nicely fits and interacts with the FTI architecture.Now, let s zoom in on each component to understand how they work individually and interact with each other. 3.1. The data collection pipelineIts scope is to crawl data for a given user from Medium articles Substack articles LinkedIn posts GitHub code As every platform is unique, we implemented a different Extract Transform Load ETL pipeline for each website. 1 min read on ETL pipelines 4 However, the baseline steps are the same for each platform.Thus, for each ETL pipeline, we can abstract away the following baseline steps log in using your credentialsuse selenium to crawl your profileuse BeatifulSoup to parse the HTMLclean normalize the extracted HTMLsave the normalized but still raw data to Mongo DBImportant note We are crawling only our data, as most platforms do not allow us to access other people s data due to privacy issues. But this is perfect for us, as to build our LLM twin, we need only our own digital data.Why Mongo DB?We wanted a NoSQL database that quickly allows us to store unstructured data aka text .How will the data pipeline communicate with the feature pipeline?We will use the Change Data Capture CDC pattern to inform the feature pipeline of any change on our Mongo DB. 1 min read on the CDC pattern 5 To explain the CDC briefly, a watcher listens 24 7 for any CRUD operation that happens to the Mongo DB.The watcher will issue an event informing us what has been modified. We will add that event to a RabbitMQ queue.The feature pipeline will constantly listen to the queue, process the messages, and add them to the Qdrant vector DB.For example, when we write a new document to the Mongo DB, the watcher creates a new event. 
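To make that example concrete, here is a minimal sketch of a CDC style watcher built on MongoDB change streams that forwards events to RabbitMQ. It is an illustration only, assuming pymongo and pika are installed; the database, collection, and queue names are placeholders rather than the course's actual implementation, and change streams require MongoDB to run as a replica set.

import json

import pika
from pymongo import MongoClient

# Placeholder connection settings, not the course's real configuration.
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["llm_twin"]["documents"]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="cdc_events", durable=True)

# watch() opens a MongoDB change stream: every insert/update/delete on the
# collection is pushed to us as a change document, which we forward to RabbitMQ.
with collection.watch() as stream:
    for change in stream:
        event = {
            "operation": change["operationType"],    # e.g. "insert"
            "document": change.get("fullDocument"),  # present for inserts
        }
        channel.basic_publish(
            exchange="",
            routing_key="cdc_events",
            body=json.dumps(event, default=str),
        )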
The event is added to the RabbitMQ queue ultimately, the feature pipeline consumes and processes it.Doing this ensures that the Mongo DB and vector DB are constantly in sync.With the CDC technique, we transition from a batch ETL pipeline our data pipeline to a streaming pipeline our feature pipeline .Using the CDC pattern, we avoid implementing a complex batch pipeline to compute the difference between the Mongo DB and vector DB. This approach can quickly get very slow when working with big data.Where will the data pipeline be deployed?The data collection pipeline and RabbitMQ service will be deployed to AWS. We will also use the freemium serverless version of Mongo DB.3.2. The feature pipelineThe feature pipeline is implemented using Bytewax a Rust streaming engine with a Python interface . Thus, in our specific use case, we will also refer to it as a streaming ingestion pipeline.It is an entirely different service than the data collection pipeline.How does it communicate with the data pipeline?As explained above, the feature pipeline communicates with the data pipeline through a RabbitMQ queue.Currently, the streaming pipeline doesn t care how the data is generated or where it comes from.It knows it has to listen to a given queue, consume messages from there and process them.By doing so, we decouple the two components entirely. In the future, we can easily add messages from multiple sources to the queue, and the streaming pipeline will know how to process them. The only rule is that the messages in the queue should always respect the same structure interface.What is the scope of the feature pipeline?It represents the ingestion component of the RAG system.It will take the raw data passed through the queue and clean the data chunk it embed it using the embedding models from Superlinked load it to the Qdrant vector DB.Every type of data post, article, code will be processed independently through its own set of classes.Even though all of them are text based, we must clean, chunk and embed them using different strategies, as every type of data has its own particularities.What data will be stored?The training pipeline will have access only to the feature store, which, in our case, is represented by the Qdrant vector DB.Note that a vector DB can also be used as a NoSQL DB.With these 2 things in mind, we will store in Qdrant 2 snapshots of our data 1. The cleaned data without using vectors as indexes store them in a NoSQL fashion .2. The cleaned, chunked, and embedded data leveraging the vector indexes of Qdrant The training pipeline needs access to the data in both formats as we want to fine tune the LLM on standard and augmented prompts.With the cleaned data, we will create the prompts and answers.With the chunked data, we will augment the prompts aka RAG .Why implement a streaming pipeline instead of a batch pipeline?There are 2 main reasons.The first one is that, coupled with the CDC pattern, it is the most efficient way to sync two DBs between each other. Otherwise, you would have to implement batch polling or pushing techniques that aren t scalable when working with big data.Using CDC a streaming pipeline, you process only the changes to the source DB without any overhead.The second reason is that by doing so, your source and vector DB will always be in sync. Thus, you will always have access to the latest data when doing RAG.Why Bytewax?Bytewax is a streaming engine built in Rust that exposes a Python interface. 
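As a quick, hedged illustration of what a Bytewax dataflow looks like in practice, here is a minimal sketch that mirrors the input, map, output shape used by the feature pipeline. It assumes a recent Bytewax release (the 0.18+ operator API) and swaps RabbitMQ for an in memory TestingSource, so treat it as a toy example rather than the course's pipeline.

import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main

# Fake "queue messages" standing in for the RabbitMQ events.
messages = [
    {"type": "post", "text": "  Hello   WORLD  "},
    {"type": "article", "text": "Streaming pipelines with Bytewax"},
]

def clean(message: dict) -> dict:
    # Toy cleaning step: collapse whitespace and lowercase the text.
    return {**message, "text": " ".join(message["text"].split()).lower()}

flow = Dataflow("feature_pipeline_sketch")
stream = op.input("input", flow, TestingSource(messages))
cleaned = op.map("clean", stream, clean)
# A real pipeline would continue with chunking, embedding, and a Qdrant sink.
op.output("print", cleaned, StdOutSink())

if __name__ == "__main__":
    run_main(flow)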
We use Bytewax because it combines Rust s impressive speed and reliability with the ease of use and ecosystem of Python. It is incredibly light, powerful, and easy for a Python developer.Where will the feature pipeline be deployed?The feature pipeline will be deployed to AWS. We will also use the freemium serverless version of Qdrant.3.3. The training pipelineHow do we have access to the training features?As highlighted in section 3.2, all the training data will be accessed from the feature store. In our case, the feature store is the Qdrant vector DB that contains the cleaned digital data from which we will create prompts answers we will use the chunked embedded data for RAG to augment the cleaned data.We will implement a different vector DB retrieval client for each of our main types of data posts, articles, code .We must do this separation because we must preprocess each type differently before querying the vector DB, as each type has unique properties.Also, we will add custom behavior for each client based on what we want to query from the vector DB. But more on this in its dedicated lesson.What will the training pipeline do?The training pipeline contains a data to prompt layer that will preprocess the data retrieved from the vector DB into prompts.It will also contain an LLM fine tuning module that inputs a HuggingFace dataset and uses QLoRA to fine tune a given LLM e.g., Mistral . By using HuggingFace, we can easily switch between different LLMs so we won t focus too much on any specific LLM.All the experiments will be logged into Comet ML s experiment tracker.We will use a bigger LLM e.g., GPT4 to evaluate the results of our fine tuned LLM. These results will be logged into Comet s experiment tracker.Where will the production candidate LLM be stored?We will compare multiple experiments, pick the best one, and issue an LLM production candidate for the model registry.After, we will inspect the LLM production candidate manually using Comet s prompt monitoring dashboard. If this final manual check passes, we will flag the LLM from the model registry as accepted.A CI CD pipeline will trigger and deploy the new LLM version to the inference pipeline.Where will the training pipeline be deployed?The training pipeline will be deployed to Qwak.Qwak is a serverless solution for training and deploying ML models. It makes scaling your operation easy while you can focus on building.Also, we will use the freemium version of Comet ML for the following experiment tracker model registry prompt monitoring.3.4. The inference pipelineThe inference pipeline is the final component of the LLM system. It is the one the clients will interact with.It will be wrapped under a REST API. The clients can call it through HTTP requests, similar to your experience with ChatGPT or similar tools.How do we access the features?To access the feature store, we will use the same Qdrant vector DB retrieval clients as in the training pipeline.In this case, we will need the feature store to access the chunked data to do RAG.How do we access the fine tuned LLM?The fine tuned LLM will always be downloaded from the model registry based on its tag e.g., accepted and version e.g., v1.0.2, latest, etc. .How will the fine tuned LLM be loaded?Here we are in the inference world.Thus, we want to optimize the LLM s speed and memory consumption as much as possible. 
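Stepping back to the fine tuning module described in section 3.3, here is a hedged sketch of a typical QLoRA setup with transformers, bitsandbytes, and peft. The base model name, LoRA ranks, and target modules are illustrative defaults, not the values used in the course.

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative base model

# 4-bit NF4 quantization: this is the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, the dataset built from the feature store would be passed to a
# trainer (for example trl's SFTTrainer) and the run logged to an experiment tracker.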
To optimize speed and memory, after downloading the LLM from the model registry, we will quantize it. What are the components of the inference pipeline? The first one is the retrieval client used to access the vector DB to do RAG. This is the same module as the one used in the training pipeline. Next comes the query-to-prompt layer, which maps the user query and the documents retrieved from Qdrant into the final prompt. After the LLM generates its answer, we will log it to Comet's prompt monitoring dashboard and return it to the clients. For example, the client will ask the inference pipeline to write a 1000 word LinkedIn post about LLMs, and the inference pipeline will go through all the steps above to return the generated post. Where will the inference pipeline be deployed? The inference pipeline will be deployed to Qwak. By default, Qwak also offers autoscaling solutions and a nice dashboard to monitor all the production environment resources. As for the training pipeline, we will use the serverless freemium version of Comet for its prompt monitoring dashboard. Conclusion This is the 1st article of the LLM Twin: Building Your Production Ready AI Replica free course. In this lesson, we presented what you will build during the course. Afterward, we briefly discussed how to design ML systems using the 3 pipeline design. Ultimately, we went through the system design of the course and presented the architecture of each microservice and how they interact with each other: the data collection pipeline, the feature pipeline, the training pipeline, and the inference pipeline. In Lesson 2, we will dive deeper into the data collection pipeline, learn how to implement crawlers for various social media platforms, clean the gathered data, store it in a Mongo DB, and finally, show you how to deploy it to AWS. Check out the code on GitHub [1]. References: [1] Your LLM Twin Course, GitHub Repository (2024), Decoding ML GitHub Organization. [2] Introducing new AI experiences from Meta (2023), Meta. [3] Jim Dowling, From MLOps to ML Systems with Feature Training Inference Pipelines (2023), Hopsworks. [4] Extract Transform Load (ETL), Databricks Glossary. [5] Daniel Svonava and Paolo Perrone, Understanding the different Data Modality Types (2023), Superlinked.
", "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin-2cc6bb01141f" }, { "id": "d331f23e-88c6-4606-b397-52842c9a6295", "content": "A Real time Retrieval System for RAG on Social Media Data. Use a streaming engine to populate a vector DB in real time. Improve RAG accuracy using rerank and UMAP. By Paul Iusztin, published in Decoding ML, Mar 30, 2024. In this article, you will learn how to build a real time retrieval system for social media data.
In our example, we will use only my LinkedIn posts, but our implementation can easily be extended to other platforms supporting written content, such as X, Instagram, or Medium.In this article, you will learn how to build a streaming pipeline that ingests LinkedIn posts into a vector DB in real timeclean, chunk, and embed LinkedIn postsbuild a retrieval client to query LinkedIn postsuse a rerank pattern to improve retrieval accuracyvisualize content retrieved for a given query in a 2D plot using UMAPOur implementation focuses on just the retrieval part of an RAG system. But you can quickly hook the retrieved LinkedIn posts to an LLM for post analysis or personalized content generation.Table of Contents System DesignDataStreaming ingestion pipelineRetrieval clientConclusion1. System DesignThe retrieval system is based on 2 detached components the streaming ingestion pipelinethe retrieval clientThe architecture of the retrieval system Image by the Author in collaboration with VectorHub .The streaming ingestion pipeline runs 24 7 to keep the vector DB synced up with current raw LinkedIn posts data source, while the retrieval client is used in RAG applications to query the vector DB. These 2 components communicate with each other only through the vector DB.1.1. The streaming ingestion pipelineThe streaming ingestion pipeline implements the Change Data Capture CDC pattern between a data source containing the raw LinkedIn posts and the vector DB used for retrieval.In a real world scenario, the streaming pipeline listens to a queue populated by all the changes made to the source database. But because we are focusing primarily on the retrieval system, we simulate the data within the queue with a couple of JSON files.The streaming pipeline is built in Python using Bytewax, and cleans, chunks, and embeds the LinkedIn posts before loading them into a Qdrant vector DB.Why do we need a stream engine?Because LinkedIn posts or any other social media data evolve frequently, your vector DB can quickly get out of sync. To handle this, you can build a batch pipeline that runs every minute. But to really minimize data lag, to make sure your vector DB stays current with new social media posts, you need to use a streaming pipeline that immediately takes every new item the moment it s posted, preprocesses it, and loads it into the vector DB.Why Bytewax?Bytewax is a streaming engine built in Rust that exposes a Python interface. We use Bytewax because it combines the impressive speed and reliability of Rust with the ease of use and ecosystem of Python.1.2. The retrieval clientOur retrieval client is a standard Python module that preprocesses user queries and searches the vector DB for most similar results. Qdrant vector DB lets us decouple the retrieval client from the streaming ingestion pipeline.Using a semantic based retrieval system lets us query our LinkedIn post collection very flexibly. For example, we can retrieve similar posts using a variety of query types e.g., posts, questions, sentences.Also, to improve the retrieval system s accuracy, we use a rerank pattern.Lastly, to better understand and explain the retrieval process for particular queries, we visualize our results on a 2D plot using UMAP.2. DataWe will ingest 215 LinkedIn posts from my Linked profile Paul Iusztin. 
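Before diving into the data, here is a hedged sketch of the core operation the retrieval client from section 1.2 performs: embed the query with a sentence transformers model and run a similarity search against Qdrant. The collection name, embedding model, and local Qdrant instance are placeholders; the real client adds query cleaning, chunk aware preprocessing, and reranking on top of this.

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
client = QdrantClient(url="http://localhost:6333")         # placeholder instance

query = "Posts about Qdrant"
query_vector = embedding_model.encode(query).tolist()

hits = client.search(
    collection_name="linkedin_posts",  # placeholder collection name
    query_vector=query_vector,
    limit=3,
)
for hit in hits:
    # Each hit carries the cosine similarity score and the stored payload.
    print(round(hit.score, 3), hit.payload.get("text", "")[:80])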
Though we simulate the post ingestion step using JSON files, the posts themselves are authentic.Before diving into the code, let s take a look at an example LinkedIn post to familiarize ourselves with the challenges it will introduce text \ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 do you need to \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf2 an open source \ud835\udddf\ud835\udddf\ud835\udde0 to create your own \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddf1\ud835\ude03\ud835\uddf6\ud835\ude00\ud835\uddfc\ud835\uddff? nThis is the \ud835\udddf\ud835\udddf\ud835\udde0 \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf8\ud835\uddf6\ud835\ude01 you must know n\ud835\uddd7\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01 nThe key component of any successful ML project is the data. nYou need a 100 1000 sample Q A questions answers dataset with financial scenarios. nThe best approach is to hire a bunch of experts to create it manually. nBut, for a PoC, that might get expensive slow. nThe good news is that a method called \ud835\ude0d\ud835\ude2a\ud835\ude2f\ud835\ude26\ud835\ude35\ud835\ude36\ud835\ude2f\ud835\ude2a\ud835\ude2f\ud835\ude28 \ud835\ude38\ud835\ude2a\ud835\ude35\ud835\ude29 \ud835\ude25\ud835\ude2a\ud835\ude34\ud835\ude35\ud835\ude2a\ud835\ude2d\ud835\ude2d\ud835\ude22\ud835\ude35\ud835\ude2a\ud835\ude30\ud835\ude2f exists. n ...Along with ease of deployment, you can easily add your training code to your CI CD to add the final piece of the MLOps puzzle, called CT continuous training . n Beam nhttps lnkd.in dedCaMDh n. n To see all these components in action, check out my FREE \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00 \ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 \ud835\uddf0\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\ude00\ud835\uddf2 give it a nhttps lnkd.in dZgqtf8f nhashtag n nmachinelearning nhashtag n nmlops nhashtag n ndatascience , image https media.licdn.com dms image D4D10AQHWQzZcToQQ1Q image shrink_800 0 1698388219549?e 1705082400 v beta t 9mrDC_NooJgD7u7Qk0PmrTGGaZtuwDIFKh3bEqeBsm0 The following features of the above post are not compatible with embedding models. We ll need to find some way of handling them in our preprocessing step emojisbold, italic textother non ASCII charactersURLscontent that exceeds the context window limit of the embedding modelEmojis and bolded and italic text are represented by Unicode characters that are not available in the vocabulary of the embedding model. Thus, these items cannot be tokenized and passed to the model we have to remove them or normalize them to something that can be parsed by the tokenizer. The same holds true for all other non ASCII characters.URLs take up space in the context window without providing much semantic value. Still, knowing that there s a URL in the sentence may add context. For this reason, we replace all URLs with a URL token. This lets us ingest whatever value the URL s presence conveys without it taking up valuable space.3. Streaming ingestion pipelineLet s dive into the streaming pipeline, starting from the top and working our way to the bottom 3.1. The Bytewax flowThe Bytewax flow transparently conveys all the steps of the streaming pipeline.The first step is ingesting every LinkedIn post from our JSON files. 
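Before stepping through those map operations, here is a hedged sketch of the kind of cleaning the preprocessing notes above call for: mapping styled Unicode back to plain ASCII, dropping emojis, replacing URLs with a placeholder token, and collapsing whitespace. The helper below is illustrative and not the pipeline's actual implementation.

import re
import unicodedata

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_post(text: str) -> str:
    # Replace links with a token so their presence is kept without the noise.
    text = URL_PATTERN.sub("[URL]", text)
    # NFKD decomposes most of the styled (bold/italic) Unicode letters used on
    # LinkedIn back to plain ASCII; emojis have no decomposition and are dropped.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_post("Check out my new post! https://example.com/post"))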
In the next steps, every map operation has a single responsibility validate the ingested data using a RawPost pydantic modelclean the postschunk the posts because chunking will output a list of ChunkedPost objects, we use a flat_map operation to flatten them outembed the postsload the posts to a Qdrant vector DBdef build_flow embedding_model EmbeddingModelSingleton flow Dataflow flow stream op.input input , flow, JSONSource data paul.json stream op.map raw_post , stream, RawPost.from_source stream op.map cleaned_post , stream, CleanedPost.from_raw_post stream op.flat_map chunked_post , stream, lambda cleaned_post ChunkedPost.from_cleaned_post cleaned_post, embedding_model embedding_model , stream op.map embedded_chunked_post , stream, lambda chunked_post EmbeddedChunkedPost.from_chunked_post chunked_post, embedding_model embedding_model , op.inspect inspect , stream, print op.output output , stream, QdrantVectorOutput vector_size model.embedding_size return flow3.2. The processing stepsEvery processing step is incorporated into a pydantic model. This way, we can easily validate the data at each step and reuse the code in the retrieval module.We isolate every step of an ingestion pipeline into its own class cleaningchunkingembeddingDoing so, we follow the separation of concerns good SWE practice. Thus, every class has its own responsibility.Now the code is easy to read and understand. Also, it s future proof, as it s extremely easy to change or extend either of the 3 steps cleaning, chunking and embedding.Here is the interface of the pydantic models class RawPost BaseModel post_id str text str image Optional str classmethod def from_source cls, k_v Tuple str, dict RawPost ... Mapping a dictionary to a RawPost validated pydantic model. return cls ... class CleanedPost BaseModel post_id str raw_text str text str image Optional str classmethod def from_raw_post cls, raw_post RawPost CleanedPost ... Cleaning the raw post return cls ... class ChunkedPost BaseModel post_id str chunk_id str full_raw_text str text str image Optional str classmethod def from_cleaned_post cls, cleaned_post CleanedPost, embedding_model EmbeddingModelSingleton list ChunkedPost chunks ... Compute chunks return cls ... for chunk in chunks class EmbeddedChunkedPost BaseModel post_id str chunk_id str full_raw_text str text str text_embedding list image Optional str None score Optional float None rerank_score Optional float None classmethod def from_chunked_post cls, chunked_post ChunkedPost, embedding_model EmbeddingModelSingleton EmbeddedChunkedPost ... Compute embedding. return cls ... Now, the data at each step is validated and has a clear structure.Note Providing different types when instantiating a pydantic model will throw a validation error. For example, if the post_id is defined as a string, and we try to instantiate an EmbeddedChunkedPost with a None or int post_id, it will throw an error.Check out the full implementation on our GitHub Articles Hub repository.3.3. Load to QdrantTo load the LinkedIn posts to Qdrant, you have to override Bytewax s StatelessSinkPartition class which acts as an output in a Bytewax flow class QdrantVectorSink StatelessSinkPartition def __init__ self, client QdrantClient, collection_name str self._client client self._collection_name collection_name def write_batch self, chunks list EmbeddedChunkedPost ... Map chunks to ids, embeddings, and metadata. 
self._client.upsert collection_name self._collection_name, points Batch ids ids, vectors embeddings, payloads metadata, , Within this class, you must overwrite the write_batch method, where we will serialize every EmbeddedChunkedPost to a format expected by Qdrant and load it to the vector DB.4. Retrieval clientHere, we focus on preprocessing a user s query, searching the vector DB, and postprocessing the retrieved posts for maximum results.To design the retrieval step, we implement a QdrantVectorDBRetriever class to expose all the necessary features for our retrieval client.class QdrantVectorDBRetriever def __init__ self, embedding_model EmbeddingModelSingleton, vector_db_client QdrantClient, cross_encoder_model CrossEncoderModelSingleton vector_db_collection str self._embedding_model embedding_model self._vector_db_client vector_db_client self._cross_encoder_model cross_encoder_model self._vector_db_collection vector_db_collection def search self, query str, limit int 3, return_all bool False Union list EmbeddedChunkedPost , dict str, list ... Search the Qdrant vector DB based on the given query. def embed_query self, query str list list float ... Embed the given query. def rerank self, query str, posts list EmbeddedChunkedPost list EmbeddedChunkedPost ... Rerank the posts relative to the given query. def render_as_html self, post EmbeddedChunkedPost None ... Map the embedded post to HTML to display it.4.1. Embed queryWe must embed the query in precisely the same way we ingested our posts into the vector DB. Because the streaming pipeline is written in Python thanks to Bytewax , and every preprocessing operation is modular, we can quickly replicate all the steps necessary to embed the query.class QdrantVectorDBRetriever ... def embed_query self, query str list list float cleaned_query CleanedPost.clean query chunks ChunkedPost.chunk cleaned_query, self._embedding_model embdedded_queries self._embedding_model chunk, to_list True for chunk in chunks return embdedded_queriesCheck out the full implementation on our GitHub repository.4.2. Plain retrievalLet s try to retrieve a set of posts without using the rerank algorithm.vector_db_retriever QdrantVectorDBRetriever embedding_model EmbeddingModelSingleton , vector_db_client build_qdrant_client query Posts about Qdrant retrieved_results vector_db_retriever.search query query for post in retrieved_results posts vector_db_retriever.render_as_html post Here are the top 2 retrieved results sorted using the cosine similarity score Result 1 Result 1 for the Posts about Qdrant query without using reranking Image by the Author in collaboration with VectorHub Result 2 Result 2 for the Posts about Qdrant query without using reranking Image by the Author in collaboration with VectorHub You can see from the results above, that starting from the second post the results are irrelevant. Even though it has a cosine similarly score of 0.69 the posts doesn t contain any information about Qdrant or vector DBs.Note We looked over the top 5 retrieved results. Nothing after the first post was relevant. We haven t added them here as the article is already too long.4.3. Visualize retrievalTo visualize our retrieval, we implement a dedicated class that uses the UMAP dimensionality reduction algorithm. 
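As a hedged, standalone illustration of that projection step, the sketch below fits UMAP on a set of embeddings and plots the query against the retrieved posts. It assumes the umap-learn and matplotlib packages and uses random vectors as stand ins for real post embeddings.

import matplotlib.pyplot as plt
import numpy as np
import umap

rng = np.random.default_rng(42)
post_embeddings = rng.normal(size=(215, 384))  # stand-in for real post vectors
query_embedding = rng.normal(size=(1, 384))    # stand-in for the query vector
retrieved = post_embeddings[:3]                # pretend top-3 retrieved posts

# Fit the projection once on the whole vector space, then reuse it.
reducer = umap.UMAP(n_components=2, random_state=42)
posts_2d = reducer.fit_transform(post_embeddings)
query_2d = reducer.transform(query_embedding)
retrieved_2d = reducer.transform(retrieved)

plt.scatter(posts_2d[:, 0], posts_2d[:, 1], c="lightgray", label="all posts")
plt.scatter(retrieved_2d[:, 0], retrieved_2d[:, 1], c="green", label="retrieved")
plt.scatter(query_2d[:, 0], query_2d[:, 1], c="red", marker="x", label="query")
plt.legend()
plt.show()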
We have picked UMAP as it preserves the geometric properties between points e.g., the distance in higher dimensions when they are projected onto lower dimensions better than its peers e.g., PCA, t SNE .The RetrievalVisualizer computes the projected embeddings for the entire vector space once. Afterwards, it uses the render method to project only the given query and retrieved posts, and plot them to a 2D graph.class RetrievalVisualizer def __init__ self, posts list EmbeddedChunkedPost self._posts posts self._umap_transform self._fit_model self._posts self._projected_post_embeddings self.project_posts self._posts def _fit_model self, posts list EmbeddedChunkedPost umap.UMAP umap_transform ... Fit a UMAP model on the given posts. return umap_transform def project_posts self, posts list EmbeddedChunkedPost np.ndarray embeddings np.array post.text_embedding for post in posts return self._project embeddings embeddings def _project self, embeddings np.ndarray np.ndarray ... Project the embeddings to 2D using UMAP. return umap_embeddings def render self, embedded_queries list list float , retrieved_posts list EmbeddedChunkedPost , None ... Render the given queries retrieved posts using matplotlib.Let s take a look at the result to see how the Posts about Qdrant query looks Visualization of the Posts about Qdrant query using UMAP without reranking Image by the Author in collaboration with VectorHub .Our results are not great. You can see how far the retrieved posts are from our query in the vector space.Can we improve the quality of our retrieval system using the rerank algorithm?4.4. RerankWe use the reranking algorithm to refine our retrieval for the initial query. Our initial retrieval step because it used cosine similarity or similar distance metrics to compute the distance between a query and post embeddings may have missed more complex but essential relationships between the query and the documents in the vector space. Reranking leverages the power of transformer models that are capable of understanding more nuanced semantic relationships.We use a cross encoder model to implement the reranking step, so we can score the query relative to all retrieved posts individually. These scores take into consideration more complex relationships than cosine similarity can. Under the hood is a BERT classifier that outputs a number between 0 and 1 according to how similar the 2 given sentences are. The BERT classifier outputs 0 if they are entirely different and 1 if they are a perfect match.Bi Encoder vs. Cross Encoder Image by the Author in collaboration with VectorHub Bi Encoder vs. Cross Encoder Image by the Author in collaboration with VectorHub But, you might ask, Why not use the cross encoder model from the start if it is that much better? The answer, in a word, is speed. Using a cross encoder model to search your whole collection is much slower than using cosine similarity. To optimize your retrieval, therefore, your reranking process should involve 2 steps an initial rough retrieval step using cosine similarity, which retrieves the top N items as potential candidatesfiltering the rough search using the rerank strategy, which retrieves the top K items as your final resultsThe implementation is relatively straightforward. For each retrieved post, we create a pair consisting of the cleaned query and the text of the post. 
We do this for all retrieved posts, resulting in a list of pairs. Next, we call a cross-encoder model (ms-marco-MiniLM-L-6-v2) from sentence-transformers to give the retrieved posts their rerank score. We then sort the posts in descending order based on their rerank score. Check out the rerank algorithm implementation on our GitHub repository. 4.5. Visualize retrieval with rerank Now that we've added the rerank pattern to our retrieval system, let's see if it improves the results of our Posts about Qdrant query. Result 1 for the Posts about Qdrant query using reranking (Image by the Author in collaboration with VectorHub). Result 2 for the Posts about Qdrant query using reranking (Image by the Author in collaboration with VectorHub). The improvement is remarkable! All our results are about Qdrant and vector DBs. Note: We looked over the top 5 retrieved results. The top 4 out of 5 posts are relevant to our query, which is incredible. Now, let's look at the UMAP visualization: Visualization of the Posts about Qdrant query using UMAP with reranking (Image by the Author in collaboration with VectorHub). While the returned posts aren't very close to the query, they are a lot closer to it than when we weren't reranking the retrieved posts. 5. Conclusion In this article, we learned how to adapt a RAG retrieval pattern to improve LinkedIn post retrieval. To keep our database up to date with rapidly changing social media data, we implemented a real time streaming pipeline that uses CDC to sync the raw LinkedIn posts data source with a vector DB. You also saw how to use Bytewax to write, using only Python, a streaming pipeline that cleans, chunks, and embeds LinkedIn posts. Finally, you learned how to implement a standard retrieval client for RAG and saw how to improve it using the rerank pattern. As retrieval is complex to evaluate, you saw how to visualize the retrieval for a given query by rendering all the posts, the query, and the retrieved posts in a 2D space using UMAP. This article is a summary of my contribution from VectorHub. Check out the full article here to dig into the details, the code, and more experiments.
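To make the rerank step from section 4.4 concrete, here is a hedged sketch of the retrieve then rerank pattern using the sentence transformers CrossEncoder named above; the candidate posts are hard coded stand ins for the results of the initial cosine similarity search.

from sentence_transformers import CrossEncoder

query = "Posts about Qdrant"
# Stand-ins for the top-N candidates returned by the cosine-similarity search.
candidates = [
    "Qdrant is the vector DB we use as a feature store.",
    "My favourite hiking trails in the Alps.",
    "How to keep a vector DB in sync with a streaming pipeline.",
]

# The cross-encoder scores each (query, post) pair jointly, which captures
# relationships that plain cosine similarity between embeddings misses.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, post) for post in candidates])

# Keep the top-K posts by rerank score (descending).
reranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, post in reranked[:2]:
    print(round(float(score), 3), post)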
", "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/a-real-time-retrieval-system-for-rag-on-social-media-data-9cc01d50a2a0" }, { "id": "c647c345-aeb5-46f7-8f16-8a6345344069", "content": "SOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time!
Use a Python streaming engine to populate a feature store from 4 data sources. By Paul Iusztin, published in Decoding ML, Apr 20, 2024. This is the 4th out of 12 lessons of the LLM Twin: Building Your Production Ready AI Replica free course.
To better understand the course's goal, technical details, and system design, check out Lesson 1. Let's start with Lesson 4. Lesson 4: Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! In the 4th lesson, we will focus on the feature pipeline. The feature pipeline is the first pipeline presented in the 3 pipeline architecture: feature, training, and inference pipelines. A feature pipeline is responsible for taking raw data as input, processing it into features, and storing it in a feature store, from which the training and inference pipelines will use it. The component is completely isolated from the training and inference code. All the communication is done through the feature store. To avoid repeating myself, if you are unfamiliar with the 3 pipeline architecture, check out Lesson 1 for a refresher. By the end of this article, you will learn to design and build a production ready feature pipeline that uses Bytewax as a stream engine to process data in real time, ingests data from a RabbitMQ queue, uses SWE practices to process multiple data types (posts, articles, code), cleans, chunks, and embeds data for LLM fine tuning and RAG, and loads the features to a Qdrant vector DB. Note: In our use case, the feature pipeline is also a streaming pipeline, as we use a Bytewax streaming engine. Thus, we will use these words interchangeably. We will wrap up Lesson 4 by showing you how to deploy the feature pipeline to AWS and integrate it with the components from previous lessons: the data collection pipeline, MongoDB, and CDC. In the 5th lesson, we will go through the vector DB retrieval client, where we will teach you how to query the vector DB and improve the accuracy of the results using advanced retrieval techniques. Excited?
Let s get started!The architecture of the feature streaming pipeline.Table of ContentsWhy are we doing this?System design of the feature pipelineThe Bytewax streaming flowPydantic data modelsLoad data to QdrantThe dispatcher layerPreprocessing steps Clean, chunk, embedThe AWS infrastructureRun the code locallyDeploy the code to AWS Run it from the cloudConclusion Check out the code on GitHub 1 and support us with a 1. Why are we doing this?A quick reminder from previous lessonsTo give you some context, in Lesson 2, we crawl data from LinkedIn, Medium, and GitHub, normalize it, and load it to MongoDB.In Lesson 3, we are using CDC to listen to changes to the MongoDB database and emit events in a RabbitMQ queue based on any CRUD operation done on MongoDB. and here we are in Lesson 4, where we are building the feature pipeline that listens 24 7 to the RabbitMQ queue for new events to process and load them to a Qdrant vector DB.The problem we are solvingIn our LLM Twin use case, the feature pipeline constantly syncs the MongoDB warehouse with the Qdrant vector DB while processing the raw data into features.Important In our use case, the Qdrant vector DB will be our feature store.Why we are solving itThe feature store will be the central point of access for all the features used within the training and inference pipelines.For consistency and simplicity, we will refer to different formats of our text data as features. The training pipeline will use the feature store to create fine tuning datasets for your LLM twin. The inference pipeline will use the feature store for RAG.For reliable results especially for RAG , the data from the vector DB must always be in sync with the data from the data warehouse.The question is, what is the best way to sync these 2?Other potential solutionsThe most common solution is probably to use a batch pipeline that constantly polls from the warehouse, computes a difference between the 2 databases, and updates the target database.The issue with this technique is that computing the difference between the 2 databases is extremely slow and costly.Another solution is to use a push technique using a webhook. Thus, on any CRUD change in the warehouse, you also update the source DB.The biggest issue here is that if the webhook fails, you have to implement complex recovery logic.Lesson 3 on CDC covers more of this.2. System design of the feature pipeline our solutionOur solution is based on CDC, a queue, a streaming engine, and a vector DB CDC adds any change made to the Mongo DB to the queue read more in Lesson 3 . the RabbitMQ queue stores all the events until they are processed. The Bytewax streaming engine cleans, chunks, and embeds the data. A streaming engine works naturally with a queue based system. The data is uploaded to a Qdrant vector DB on the flyWhy is this powerful?Here are 4 core reasons The data is processed in real time.Out of the box recovery system If the streaming pipeline fails to process a message will be added back to the queueLightweight No need for any diffs between databases or batching too many recordsNo I O bottlenecks on the source database It solves all our problems!The architecture of the feature streaming pipeline.How is the data stored?We store 2 snapshots of our data in the feature store. 
Here is why Remember that we said that the training and inference pipeline will access the features only from the feature store, which, in our case, is the Qdrant vector DB?Well, if we had stored only the chunked embedded version of the data, that would have been useful only for RAG but not for fine tuning.Thus, we make an additional snapshot of the cleaned data, which will be used by the training pipeline.Afterward, we pass it down the streaming flow for chunking embedding.How do we process multiple data types?How do you process multiple types of data in a single streaming pipeline without writing spaghetti code?Yes, that is for you, data scientists! Joking am I?We have 3 data types posts, articles, and code.Each data type and its state will be modeled using Pydantic models.To process them we will write a dispatcher layer, which will use a creational factory pattern 9 to instantiate a handler implemented for that specific data type post, article, code and operation cleaning, chunking, embedding .The handler follows the strategy behavioral pattern 10 .Intuitively, you can see the combination between the factory and strategy patterns as follows Initially, we know we want to clean the data, but as we don t know the data type, we can t know how to do so.What we can do, is write the whole code around the cleaning code and abstract away the login under a Handler interface aka the strategy .When we get a data point, the factory class creates the right cleaning handler based on its type.Ultimately the handler is injected into the rest of the system and executed.By doing so, we can easily isolate the logic for a given data type operation while leveraging polymorphism to avoid filling up the code with 1000x if else statements.We will dig into the implementation in future sections.Streaming over batchYou may ask why we need a streaming engine instead of implementing a batch job that polls the messages at a given frequency.That is a valid question.The thing is that Nowadays, using tools such as Bytewax makes implementing streaming pipelines a lot more frictionless than using their JVM alternatives.The key aspect of choosing a streaming vs. a batch design is real time synchronization between your source and destination DBs.In our particular case, we will process social media data, which changes fast and irregularly.Also, for our digital twin, it is important to do RAG on up to date data. We don t want to have any delay between what happens in the real world and what your LLM twin sees.That being said choosing a streaming architecture seemed natural in our use case.3. The Bytewax streaming flowThe Bytewax flow is the central point of the streaming pipeline. It defines all the required steps, following the next simplified pattern input processing output .As I come from the AI world, I like to see it as the graph of the streaming pipeline, where you use the input , map , and output Bytewax functions to define your graph, which in the Bytewax world is called a flow .As you can see in the code snippet below, we ingest posts, articles or code messages from a RabbitMQ queue. After we clean, chunk and embed them. 
Ultimately, we load the cleaned and embedded data to a Qdrant vector DB, which in our LLM twin use case will represent the feature store of our system.To structure and validate the data, between each Bytewax step, we map and pass a different Pydantic model based on its current state raw, cleaned, chunked, or embedded.Bytewax flow GitHub Code We have a single streaming pipeline that processes everything.As we ingest multiple data types posts, articles, or code snapshots , we have to process them differently.To do this the right way, we implemented a dispatcher layer that knows how to apply data specific operations based on the type of message.More on this in the next sections Why Bytewax?Bytewax is an open source streaming processing framework that is built in Rust for performance has Python bindings for leveraging its powerful ML ecosystem so, for all the Python fanatics out there, no more JVM headaches for you.Jokes aside, here is why Bytewax is so powerful Bytewax local setup is plug and play can quickly be integrated into any Python project you can go wild even use it in Notebooks can easily be integrated with other Python packages NumPy, PyTorch, HuggingFace, OpenCV, SkLearn, you name it out of the box connectors for Kafka and local files, or you can quickly implement your ownWe used Bytewax to build the streaming pipeline for the LLM Twin course and loved it.To learn more about Bytewax, go and check them out. They are open source, so no strings attached Bytewax 2 4. Pydantic data modelsLet s take a look at what our Pydantic models look like.First, we defined a set of base abstract models for using the same parent class across all our components.Pydantic base model structure GitHub Code Afterward, we defined a hierarchy of Pydantic models for all our data types posts, articles, or codeall our states raw, cleaned, chunked, and embeddedThis is how the set of classes for the posts will look like Pydantic posts model structure GitHub Code We repeated the same process for the articles and code model hierarchy.Check out the other data classes on our GitHub.Why is keeping our data in Pydantic models so powerful?There are 4 main criteria every field has an enforced type you are ensured the data types are going to be correctthe fields are automatically validated based on their type for example, if the field is a string and you pass an int, it will through an errorthe data structure is clear and verbose no more clandestine dicts that you never know what is in themyou make your data the first class citizen of your program5. Load data to QdrantThe first step is to implement our custom Bytewax DynamicSink class Qdrant DynamicSink GitHub Code Next, for every type of operation we need output cleaned or embedded data we have to subclass the StatelessSinkPartition Bytewax class they also provide a stateful option more in their docs An instance of the class will run on every partition defined within the Bytewax deployment.In the course, we are using a single partition per worker. But, by adding more partitions and workers , you can quickly scale your Bytewax pipeline horizontally.Qdrant worker partitions GitHub Code Note that we used Qdrant s Batch method to upload all the available points at once. By doing so, we reduce the latency on the network I O side more on that here 8 The RabbitMQ streaming input follows a similar pattern. Check it out here 6. The dispatcher layerNow that we have the Bytewax flow and all our data models.How do we map a raw data model to a cleaned data model? 
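Before looking at the actual handler hierarchy, here is a condensed sketch of the idea: a raw Pydantic model goes in, a cleaned Pydantic model comes out, and a factory picks the right strategy for the data type. The class, field, and function names below (PostRawModel, PostCleanedModel, clean_text, etc.) are illustrative stand-ins, not the exact classes from the repository.

```python
# Illustrative sketch of the raw -> cleaned mapping; names are hypothetical.
from abc import ABC, abstractmethod

from pydantic import BaseModel


def clean_text(text: str) -> str:
    # Placeholder for the cleaning utilities discussed later in the article.
    return " ".join(text.split())


class PostRawModel(BaseModel):
    entry_id: str
    platform: str
    content: str


class PostCleanedModel(BaseModel):
    entry_id: str
    platform: str
    cleaned_content: str


class CleaningDataHandler(ABC):
    """Strategy interface: one concrete handler per data type."""

    @abstractmethod
    def clean(self, data_model: PostRawModel) -> PostCleanedModel: ...


class PostCleaningHandler(CleaningDataHandler):
    def clean(self, data_model: PostRawModel) -> PostCleanedModel:
        return PostCleanedModel(
            entry_id=data_model.entry_id,
            platform=data_model.platform,
            cleaned_content=clean_text(data_model.content),
        )


class CleaningHandlerFactory:
    """Factory: picks the right strategy based on the event's data type."""

    @staticmethod
    def create_handler(data_type: str) -> CleaningDataHandler:
        if data_type == "posts":
            return PostCleaningHandler()
        # "articles" and "code" handlers follow the same pattern.
        raise ValueError(f"Unsupported data type: {data_type}")
```

The dispatcher only has to call the factory and execute whatever handler it receives, which is exactly the combination of the factory and strategy patterns described above.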
All our domain logic is modeled by a set of Handler classes.For example, this is how the handler used to map a PostsRawModel to a PostCleanedModel looks like Handler hierarchy of classes GitHub Code Check out the other handlers on our GitHub ChunkingDataHandler and EmbeddingDataHandlerIn the next sections, we will explore the exact cleaning, chunking and embedding logic.Now, to build our dispatcher, we need 2 last components a factory class instantiates the right handler based on the type of the eventa dispatcher class the glue code that calls the factory class and handlerHere is what the cleaning dispatcher and factory look like The dispatcher and factory classes GitHub Code Check out the other dispatchers on our GitHub.By repeating the same logic, we will end up with the following set of dispatchers RawDispatcher no factory class required as the data is not processed CleaningDispatcher with a ChunkingHandlerFactory class ChunkingDispatcher with a ChunkingHandlerFactory class EmbeddingDispatcher with an EmbeddingHandlerFactory class 7. Preprocessing steps Clean, chunk, embedHere we will focus on the concrete logic used to clean, chunk, and embed a data point.Note that this logic is wrapped by our handler to be integrated into our dispatcher layer using the Strategy behavioral pattern 10 .We already described that in the previous section. Thus, we will directly jump into the actual logic here, which can be found in the utils module of our GitHub repository.Note These steps are experimental. Thus, what we present here is just the first iteration of the system. In a real world scenario, you would experiment with different cleaning, chunking or model versions to improve it on your data.CleaningThis is the main utility function used to clean the text for our posts, articles, and code.Out of simplicity, we used the same logic for all the data types, but after more investigation, you would probably need to adapt it to your specific needs.For example, your posts might start containing some weird characters, and you don t want to run the unbold_text or unitalic_text functions on your code data point as is completely redundant.Cleaning logic GitHub Code Most of the functions above are from the unstructured 3 Python package. It is a great tool for quickly finding utilities to clean text data. More examples of unstructured here 3 One key thing to notice is that at the cleaning step, we just want to remove all the weird, non interpretable characters from the text.Also, we want to remove redundant data, such as extra whitespace or URLs, as they do not provide much value.These steps are critical for our tokenizer to understand and efficiently transform our string input into numbers that will be fed into the transformer models.Note that when using bigger models transformers modern tokenization techniques, you don t need to standardize your dataset too much.For example, it is redundant to apply lemmatization or stemming, as the tokenizer knows how to split your input into a commonly used sequence of characters efficiently, and the transformers can pick up the nuances of the words. What is important at the cleaning step is to throw out the noise.ChunkingWe are using Langchain to chunk our text.We use a 2 step strategy using Langchain s RecursiveCharacterTextSplitter 4 and SentenceTransformersTokenTextSplitter 5 . 
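As an illustration of that 2-step strategy, here is a minimal, self-contained sketch. The separators, chunk sizes, overlap values, and embedding model name are placeholders to experiment with, not the exact parameters used in the course; depending on your LangChain version, the splitters may live in the langchain_text_splitters package instead.

```python
# Sketch of the 2-step chunking strategy; all parameter values are illustrative.
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)


def chunk_text(text: str) -> list[str]:
    # Step 1: coarse split on paragraph-like separators.
    character_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n"], chunk_size=500, chunk_overlap=0
    )
    text_sections = character_splitter.split_text(text)

    # Step 2: re-split each section so chunks respect the token limit of the
    # embedding model used downstream, with a token-level overlap so
    # neighbouring chunks share context.
    token_splitter = SentenceTransformersTokenTextSplitter(
        chunk_overlap=50,
        tokens_per_chunk=128,
        model_name="sentence-transformers/all-MiniLM-L6-v2",
    )
    chunks: list[str] = []
    for section in text_sections:
        chunks.extend(token_splitter.split_text(section))

    return chunks
```

For example, chunk_text(post.cleaned_content) would return the list of chunks that are then embedded one by one.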
As seen below Chunking logic GitHub Code Overlapping your chunks is a common pre indexing RAG technique, which helps to cluster chunks from the same document semantically.Again, we are using the same chunking logic for all of our data types, but to get the most out of it, we would probably need to tweak the separators, chunk_size, and chunk_overlap parameters for our different use cases.But our dispatcher handler architecture would easily allow us to configure the chunking step in future iterations.EmbeddingThe data preprocessing, aka the hard part is done.Now we just have to call an embedding model to create our vectors.Embedding logic GitHub Code We used the all MiniLm L6 v2 6 from the sentence transformers library to embed our articles and posts a lightweight embedding model that can easily run in real time on a 2 vCPU machine.As the code data points contain more complex relationships and specific jargon to embed, we used a more powerful embedding model hkunlp instructor xl 7 .This embedding model is unique as it can be customized on the fly with instructions based on your particular data. This allows the embedding model to specialize on your data without fine tuning, which is handy for embedding pieces of code.8. The AWS infrastructureIn Lesson 2, we covered how to deploy the data collection pipeline that is triggered by a link to Medium, Substack, LinkedIn or GitHub crawls the given link saves the crawled information to a MongoDB.In Lesson 3, we explained how to deploy the CDC components that emit events to a RabbitMQ queue based on any CRUD operation done to MongoDB.What is left is to deploy the Bytewax streaming pipeline and Qdrant vector DB.We will use Qdrant s self hosted option, which is easy to set up and scale.To test things out, they offer a Free Tier plan for up to a 1GB cluster, which is more than enough for our course. We explained in our GitHub repository how to configure Qdrant.AWS infrastructure of the feature streaming pipeline.The last piece of the puzzle is the Bytewax streaming pipeline.As we don t require a GPU and the streaming pipeline needs to run 24 7, we will deploy it to AWS Fargate, a cost effective serverless solution from AWS.As a serverless solution, Fargate allows us to deploy our code quickly and scale it fast in case of high traffic.How do we deploy the streaming pipeline code to Fargate?Using GitHub Actions, we wrote a CD pipeline that builds a Docker image on every new commit made on the main branch.After, the Docker image is pushed to AWS ECR. Ultimately, Fargate pulls the latest version of the Docker image.This is a common CD pipeline to deploy your code to AWS services.Why not use lambda functions, as we did for the data pipeline?An AWS lambda function executes a function once and then closes down.This worked perfectly for the crawling logic, but it won t work for our streaming pipeline, which has to run 24 7.9. Run the code locallyTo quickly test things up, we wrote a docker compose.yaml file to spin up the MongoDB, RabbitMQ queue and Qdrant vector db.You can spin up the Docker containers using our Makefile by running the following, which will start the CDC component and streaming pipeline make local startTo start the data collection pipeline, run the following make local test githubThe documentation of our GitHub repository provides more details on how to run and set up everything.10. 
Deploy the code to AWS Run it from the cloudThis article is already too long, so I won t go into the details of how to deploy the AWS infrastructure described above and test it out here.But to give you some insights, we have used Pulumi as our infrastructure as a code IaC tool, which will allow you to spin it quickly with a few commands.Also, I won t let you hang on to this one. We made a promise and We prepared step by step instructions in the README of our GitHub repository on how to use Pulumni to spin up the infrastructure and test it out.ConclusionNow you know how to write streaming pipelines like a PRO!In Lesson 4, you learned how to design a feature pipeline using the 3 pipeline architecturewrite a streaming pipeline using Bytewax as a streaming engineuse a dispatcher layer to write a modular and flexible application to process multiple types of data posts, articles, code load the cleaned and embedded data to Qdrantdeploy the streaming pipeline to AWS This is only the ingestion part used for fine tuning LLMs and RAG.In Lesson 5, you will learn how to write a retrieval client for the 3 data types using good SWE practices and improve the retrieval accuracy using advanced retrieval post retrieval techniques. See you there! Check out the code on GitHub 1 and support us with a Enjoyed This Article?Join the Decoding ML Newsletter for battle tested content on designing, coding, and deploying production grade ML MLOps systems. Every week. For FREE Decoding ML Newsletter Paul Iusztin SubstackJoin for battle tested content on designing, coding, and deploying production grade ML MLOps systems. Every week. For decodingml.substack.comReferencesLiterature 1 Your LLM Twin Course GitHub Repository 2024 , Decoding ML GitHub Organization 2 Bytewax, Bytewax Landing Page 3 Unstructured Cleaning Examples, Unstructured Documentation 4 Recursively split by character, LangChain s Documentation 5 Split by tokens, LangChain s Documentation 6 sentence transformers all MiniLM L6 v2, HuggingFace 7 hkunlp instructor xl, HuggingFace 8 Qdrant, Qdrant Documentation 9 Abstract Factory Pattern, Refactoring Guru 10 Strategy Pattern, Refactoring GuruImagesIf not otherwise stated, all images are created by the author.Sign up to discover human stories that deepen your understanding of the world.FreeDistraction free reading. No ads.Organize your knowledge with lists and highlights.Tell your story. 
By using Medium, you agree to our Privacy Policy, including cookie policy.", "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/sota-python-streaming-pipelines-for-fine-tuning-llms-and-rag-in-real-time-82eb07795b87" }, { "id": "649bd7d7-aa0e-4ada-b5e2-1c50fe7c95e6", "content": "The 4 Advanced RAG Algorithms You Must Know to Implement Implement from scratch 4 advanced RAG methods to optimize your retrieval and post retrieval algorithm 4 Advanced RAG Algorithms You Must Know Decoding MLOpen in appSign upSign inWriteSign upSign inTop highlightLLM TWIN COURSE BUILDING YOUR PRODUCTION READY AI REPLICAThe 4 Advanced RAG Algorithms You Must Know to ImplementImplement from scratch 4 advanced RAG methods to optimize your retrieval and post retrieval algorithmPaul Iusztin FollowPublished inDecoding ML 16 min read May 4, 20241.8K12ListenShare the 5th out of 12 lessons of the LLM Twin free courseWhat is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.Image by DALL EWhy is this course different?By finishing the LLM Twin Building Your Production Ready AI Replica free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.Why should you care? No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system.What will you learn to build by the end of this course?You will learn how to architect and build a real world LLM system from start to finish from data collection to deployment.You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.The end goal? Build and deploy your own LLM twin.The architecture of the LLM twin is split into 4 Python microservices the data collection pipeline crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. deployed on AWS the feature pipeline consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded using Superlinked , and loaded into a Qdrant vector DB in real time. deployed on AWS the training pipeline create a custom dataset based on your digital data. Fine tune an LLM using QLoRA. Use Comet ML s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet s model registry. deployed on Qwak the inference pipeline load and quantize the fine tuned LLM from Comet s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet s prompt monitoring dashboard. 
deployed on Qwak LLM twin system architecture Image by the Author Along the 4 microservices, you will learn to integrate 3 serverless tools Comet ML as your ML Platform Qdrant as your vector DB Qwak as your ML infrastructure Who is this for?Audience MLE, DE, DS, or SWE who want to learn to engineer production ready LLM systems using LLMOps good principles.Level intermediatePrerequisites basic knowledge of Python, ML, and the cloudHow will you learn?The course contains 10 hands on written lessons and the open source code you can access on GitHub, showing how to build an end to end LLM system.Also, it includes 2 bonus lessons on how to improve the RAG system.You can read everything at your own pace. To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.Costs?The articles and code are completely free. They will always remain free.But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.The cloud computing platforms AWS, Qwak have a pay as you go pricing plan. Qwak offers a few hours of free computing. Thus, we did our best to keep costs to a minimum.For the other serverless tools Qdrant, Comet , we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by Paul Iusztin Senior ML MLOps EngineerAlex Vesa Senior AI EngineerAlex Razvant Senior ML MLOps Engineer Check out the code on GitHub 1 and support us with a Lessons Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson An End to End Framework for Production Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture Enabling Event Driven ArchitecturesSOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine Tuning LLMsHow to fine tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine Tuned LLMsArchitect scalable and cost effective LLM RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework Bonus Build a scalable RAG ingestion pipeline using 74.3 less code Bonus Build Multi Index Advanced RAG AppsTo better understand the course s goal, technical details, and system design Check out Lesson 1Let s start with Lesson 5 Lesson 5 The 4 Advanced RAG Algorithms You Must Know to ImplementIn Lesson 5, we will focus on building an advanced retrieval module used for RAG.We will show you how to implement 4 retrieval and post retrieval advanced optimization techniques to improve the accuracy of your RAG retrieval step.In this lesson, we will focus only on the retrieval part of the RAG system.In Lesson 4, we showed you how to clean, chunk, embed, and load social media data to a Qdrant vector DB the ingestion part of RAG .In future lessons, we will integrate this retrieval module into the inference pipeline for a full fledged RAG system.Retrieval Python Module ArchitectureWe assume you are already familiar with what a naive RAG looks like. 
If not, check out the following article from Decoding ML, where we present in a 2 minute read what a naive RAG looks like Why you must choose streaming over batch pipelines when doing RAG in LLM applicationsLesson 2 RAG, streaming pipelines, vector DBs, text processingmedium.comTable of ContentsOverview of advanced RAG optimization techniquesAdvanced RAG techniques applied to the LLM twinRetrieval optimization 1 Query expansionRetrieval optimization 2 Self queryRetrieval optimization 3 Hybrid filtered vector searchImplement the advanced retrieval Python classPost retrieval optimization Rerank using GPT 4How to use the retrievalConclusion Check out the code on GitHub 1 and support us with a 1. Overview of advanced RAG optimization techniquesA production RAG system is split into 3 main components ingestion clean, chunk, embed, and load your data to a vector DBretrieval query your vector DB for contextgeneration attach the retrieved context to your prompt and pass it to an LLMThe ingestion component sits in the feature pipeline, while the retrieval and generation components are implemented inside the inference pipeline.You can also use the retrieval and generation components in your training pipeline to fine tune your LLM further on domain specific prompts.You can apply advanced techniques to optimize your RAG system for ingestion, retrieval and generation.That being said, there are 3 main types of advanced RAG techniques Pre retrieval optimization ingestion tweak how you create the chunksRetrieval optimization retrieval improve the queries to your vector DBPost retrieval optimization retrieval process the retrieved chunks to filter out the noiseThe generation step can be improved through fine tuning or prompt engineering, which will be explained in future lessons.The pre retrieval optimization techniques are explained in Lesson 4.In this lesson, we will show you some popular retrieval and post retrieval optimization techniques.2. Advanced RAG techniques applied to the LLM twinRetrieval optimizationWe will combine 3 techniques Query ExpansionSelf QueryFiltered vector searchPost retrieval optimizationWe will use the rerank pattern using GPT 4 and prompt engineering instead of Cohere or an open source re ranker cross encoder 4 .I don t want to spend too much time on the theoretical aspects. There are plenty of articles on that.So, we will jump straight to implementing and integrating these techniques in our LLM twin system.But before seeing the code, let s clarify a few things Advanced RAG architecture2.1 Important Note!We will show you a custom implementation of the advanced techniques and NOT use LangChain.Our primary goal is to build your intuition about how they work behind the scenes. However, we will attach LangChain s equivalent so you can use them in your apps.Customizing LangChain can be a real headache. Thus, understanding what happens behind its utilities can help you build real world applications.Also, it is critical to know that if you don t ingest the data using LangChain, you cannot use their retrievals either, as they expect the data to be in a specific format.We haven t used LangChain s ingestion function in Lesson 4 either the feature pipeline that loads data to Qdrant as we want to do everything by hand .2.2. 
Why Qdrant? There are many vector DBs out there, too many. But since we discovered Qdrant, we loved it. Why? It is built in Rust. It has an Apache 2.0 license (open source). It has a great and intuitive Python SDK. It has a freemium self-hosted version to build PoCs for free. It supports unlimited document sizes and vector dimensions of up to 65,536. It is production-ready: companies such as Disney, Mozilla, and Microsoft already use it. It is one of the most popular vector DBs out there. To put that in perspective, Pinecone, one of its biggest competitors, supports only documents with up to 40k tokens and vectors with up to 20k dimensions... and a proprietary license. I could go on and on, but if you are curious to find out more, check out Qdrant.

3. Retrieval optimization #1: Query expansion

The problem: In a typical retrieval step, you query your vector DB using a single point. The issue with that approach is that by using a single vector, you cover only a small area of your embedding space. Thus, if your embedding doesn't contain all the required information, your retrieved context will not be relevant. What if we could query the vector DB with multiple data points that are semantically related? That is what the query expansion technique does!

The solution: Query expansion is quite intuitive. You use an LLM to generate multiple queries based on your initial query. These queries should contain multiple perspectives of the initial query. Thus, when embedded, they hit different areas of your embedding space that are still relevant to our initial question. You can do query expansion with a detailed zero-shot prompt. Here is our simple custom solution: Query expansion template (GitHub Code). Here is LangChain's MultiQueryRetriever class [5], their equivalent.

4. Retrieval optimization #2: Self-query

The problem: When embedding your query, you cannot guarantee that all the aspects required by your use case are present in the embedding vector. For example, you want to be 100% sure that your retrieval relies on the tags provided in the query. The issue is that by embedding the query prompt, you can never be sure that the tags are represented in the embedding vector or have enough signal when computing the distance against other vectors.

The solution: What if you could extract the tags within the query and use them along with the embedded query? That is what self-query is all about! You use an LLM to extract various metadata fields that are critical for your business use case (e.g., tags, author ID, number of comments, likes, shares, etc.). In our custom solution, we are extracting just the author ID. Thus, a zero-shot prompt engineering technique will do the job. But when extracting multiple metadata types, you should also use few-shot learning to optimize the extraction step. Self-queries work hand in hand with vector filter searches, which we will explain in the next section. Here is our solution: Self-query template (GitHub Code). Here is LangChain's SelfQueryRetriever class [6] equivalent, and this is an example using Qdrant [8].
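To make the two chains more tangible, here is a condensed sketch of how query expansion and self-query could be implemented with plain OpenAI chat calls. The prompt wording, model name, and helper names are illustrative assumptions, not the exact templates from the repository (the article itself uses GPT-4 behind custom templates).

```python
# Condensed sketch of query expansion and self-query; prompt wording, model
# name, and function names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # the course uses GPT-4; any capable chat model works


def _ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def expand_query(question: str, to_expand_to_n: int = 5) -> list[str]:
    # Query expansion: one LLM call that returns N alternative phrasings.
    prompt = (
        f"Generate {to_expand_to_n} different versions of the user question, "
        f"one per line, each covering a different perspective:\n{question}"
    )
    return [line for line in _ask(prompt).split("\n") if line.strip()]


def self_query_author_id(question: str) -> str | None:
    # Self-query: extract exact-match metadata (here, only the author id).
    prompt = (
        "Extract the author or user id from the question. "
        f"Return only the id, or 'none' if there is no id.\nQuestion: {question}"
    )
    extracted = _ask(prompt)
    return None if extracted.lower() == "none" else extracted
```

The extracted author_id feeds the filtered vector search in the next section, while each expanded query is embedded and searched against Qdrant separately.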
5. Retrieval optimization #3: Hybrid & filtered vector search

The problem: Embeddings are great for capturing the general semantics of a specific chunk, but they are not that great for querying specific keywords. For example, if we want to retrieve article chunks about LLMs from our Qdrant vector DB, embeddings would be enough. However, if we want to query for a specific LLM type (e.g., Llama 3), using only similarities between embeddings won't be enough. Thus, embeddings are not great at finding exact phrase matches for specific terms.

The solution: Combine the vector search technique with one or more complementary search strategies that work great for finding exact words. It is not defined which algorithms are combined, but the most standard strategy for hybrid search is to combine traditional keyword-based search with modern vector search.

How are these combined? The first method is to merge the similarity scores of the 2 techniques as follows:

hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score

where alpha takes a value in [0, 1], with alpha = 1 meaning pure vector search and alpha = 0 meaning pure keyword search. The similarity scores are defined as follows: sparse_score is the result of the keyword search, which, behind the scenes, uses the BM25 algorithm [7] that sits on top of TF-IDF; dense_score is the result of the vector search, which most commonly uses a similarity metric such as cosine distance.

The second method uses the vector search technique as usual and applies a filter based on your keywords on top of the metadata of the retrieved results. This is also known as filtered vector search. In this use case, the similarity score is not changed based on the provided keywords. It is just a fancy name for a simple filter applied to the metadata of your vectors. But it is essential to understand the difference between the first and second methods: the first method combines the similarity score between the keywords and vectors using the alpha parameter, while the second method is a simple filter on top of your vector search.

How does this fit into our architecture? Remember that during the self-query step, we extracted the author_id as an exact field that we have to match. Thus, we will search for the author_id using the keyword search algorithm and attach it to the 5 queries generated by the query expansion step. As we want the most relevant chunks from a given author, it makes the most sense to use a filter on the author_id, as follows (filtered vector search):

self._qdrant_client.search(
    collection_name="vector_posts",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_id",
                match=models.MatchValue(value=metadata_filter_value),
            )
        ]
    ),
    query_vector=self._embedder.encode(generated_query).tolist(),
    limit=k,
)

Note that we can easily extend this with multiple keywords (e.g., tags), making the combination of self-query and hybrid search a powerful retrieval duo. The only question you have to ask yourself is whether we want to use a simple vector search filter or the more complex hybrid search strategy. Note that LangChain's SelfQueryRetriever class combines the self-query and hybrid search techniques behind the scenes, as can be seen in their Qdrant example [8]. That is why we wanted to build everything from scratch.

6.
Implement the advanced retrieval Python classNow that you ve understood the advanced retrieval optimization techniques we re using, let s combine them into a Python retrieval class.Here is what the main retriever function looks like VectorRetriever main retriever function GitHub Using a Python ThreadPoolExecutor is extremely powerful for addressing I O bottlenecks, as these types of operations are not blocked by Python s GIL limitations.Here is how we wrapped every advanced retrieval step into its own class Query expansion chains wrapper GitHub The SelfQuery class looks very similar access it here 1 .Now the final step is to call Qdrant for each query generated by the query expansion step VectorRetriever main search function GitHub Note that we have 3 types of data posts, articles, and code repositories.Thus, we have to make a query for each collection and combine the results in the end.The most performant method is to use multi indexing techniques, which allow you to query multiple types of data at once.But at the time I am writing this article, this is not a solved problem at the production level.Thus, we gathered data from each collection individually and kept the best retrieved results using rerank.Which is the final step of the article.7. Post retrieval optimization Rerank using GPT 4We made a different search in the Qdrant vector DB for N prompts generated by the query expansion step.Each search returns K results.Thus, we end up with N x K chunks.In our particular case, N 5 K 3. Thus, we end up with 15 chunks.Post retrieval optimization rerankThe problemThe retrieved context may contain irrelevant chunks that only add noise the retrieved context might be irrelevantmake the prompt bigger results in higher costs the LLM is usually biased in looking only at the first and last pieces of context. Thus, if you add a big context, there is a big chance it will miss the essence.unaligned with your question the chunks are retrieved based on the query and chunk embedding similarity. The issue is that the embedding model is not tuned to your particular question, which might result in high similarity scores that are not 100 relevant to your question.The solutionWe will use rerank to order all the N x K chunks based on their relevance relative to the initial question, where the first one will be the most relevant and the last chunk the least.Ultimately, we will pick the TOP K most relevant chunks.Rerank works really well when combined with query expansion.A natural flow when using rerank is as follows Search for K chunks Reorder using rerank Take top KThus, when combined with query expansion, we gather potential useful context from multiple points in space rather than just looking for more than K samples in a single location.Now the flow looks like Search for N x K chunks Reoder using rerank Take top KA typical re ranking solution uses open source Cross Encoder models from sentence transformers 4 .These solutions take both the question and context as input and return a score from 0 to 1.In this article, we want to take a different approach and use GPT 4 prompt engineering as our reranker.If you want to see how to apply rerank using open source algorithms, check out this hands on article from Decoding ML A Real time Retrieval System for RAG on Social Media DataUse a streaming engine to populate a vector DB in real time. 
Improve RAG accuracy using rerank UMAP.medium.comNow let s see our implementation using GPT 4 prompt engineering.Similar to what we did for the expansion and self query chains, we define a template and a chain builder Rerank chain GitHub Here is how we integrate the rerank chain into the retriever Retriever rerank step GitHub and that s it!Note that this is an experimental process. Thus, you can further tune your prompts for better results, but the primary idea is the same.8. How to use the retrievalThe last step is to run the whole thing.But there is a catch.As we said in the beginning the retriever will not be used as a standalone component in the LLM system.It will be used as a layer between the data and the Qdrant vector DB by the training pipeline to retrieve raw data for fine tuning we haven t shown that as it s a straightforward search operation no RAG involved inference pipeline to do RAG That is why, for this lesson, there is no infrastructure involved!But, to test the retrieval, we wrote a simple script Retriever testing entry point GitHub Look at how easy it is to call the whole chain with our custom retriever no fancy LangChain involved!Now, to call this script, run the following Make command make local test retriever and that s it!In future lessons, we will learn to integrate it into the training inference pipelines. Check out the LLM Twin GitHub repository and try it yourself! Of course, don t forget to give it a to stay updated with the latest changes.ConclusionCongratulations!In Lesson 5, you learned to build an advanced RAG retrieval module optimized for searching posts, articles, and code repositories from a Qdrant vector DB.First, you learned about where the RAG pipeline can be optimized pre retrievalretrievalpost retrievalAfter you learn how to build from scratch without using LangChain s utilities the following advanced RAG retrieval post retrieval optimization techniques query expansionself queryhybrid searchrerankUltimately, you understood where the retrieval component sits in an RAG production LLM system, where the code is shared between multiple microservices and doesn t sit in a single Notebook.In Lesson 6, we will move to the training pipeline and show you how to automatically transform the data crawled from LinkedIn, Substack, Medium, and GitHub into an instruction dataset using GPT 4 to fine tune your LLM Twin.See you there! Check out the code on GitHub 1 and support us with a Enjoyed This Article?Join the Decoding ML Newsletter for battle tested content on designing, coding, and deploying production grade ML MLOps systems. Every week. For FREE Decoding ML Newsletter Paul Iusztin SubstackJoin for battle tested content on designing, coding, and deploying production grade ML MLOps systems. Every week. For decodingml.substack.comReferencesLiterature 1 Your LLM Twin Course GitHub Repository 2024 , Decoding ML GitHub Organization 2 Bytewax, Bytewax Landing Page 3 Qdrant, Qdrant Documentation 4 Retrieve Re Rank, Sentence Transformers Documentation 5 MultiQueryRetriever, LangChain s Documentation 6 Self querying, LangChain s Documentation 7 Okapi BM25, Wikipedia 8 Qdrant Self Query Example, LangChain s DocumentationImagesIf not otherwise stated, all images are created by the author.Sign up to discover human stories that deepen your understanding of the world.FreeDistraction free reading. No ads.Organize your knowledge with lists and highlights.Tell your story. 
", "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/the-4-advanced-rag-algorithms-you-must-know-to-implement-5d0c7f1199d2" }, { "id": "597ead2d-ae88-43f9-945d-d974630e858a", "content": "LLM TWIN COURSE: BUILDING YOUR PRODUCTION-READY AI REPLICA. Architect scalable and cost-effective LLM & RAG inference pipelines. Design, build and deploy a RAG inference pipeline using LLMOps best practices. Paul Iusztin, published in Decoding ML (17 min read, Jun 1, 2024). The 9th out of 12 lessons of the LLM Twin free course. What is your LLM Twin?
It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM.Image by DALL EWhy is this course different?By finishing the LLM Twin Building Your Production Ready AI Replica free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.Why should you care? No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system.What will you learn to build by the end of this course?You will learn how to architect and build a real world LLM system from start to finish from data collection to deployment.You will also learn to leverage MLOps best practices, such as experiment trackers, model registries, prompt monitoring, and versioning.The end goal? Build and deploy your own LLM twin.The architecture of the LLM twin is split into 4 Python microservices the data collection pipeline crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. deployed on AWS the feature pipeline consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded using Superlinked , and loaded into a Qdrant vector DB in real time. deployed on AWS the training pipeline create a custom dataset based on your digital data. Fine tune an LLM using QLoRA. Use Comet ML s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet s model registry. deployed on Qwak the inference pipeline load and quantize the fine tuned LLM from Comet s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet s prompt monitoring dashboard. deployed on Qwak LLM twin system architecture Image by the Author Along the 4 microservices, you will learn to integrate 3 serverless tools Comet ML as your ML Platform Qdrant as your vector DB Qwak as your ML infrastructure Who is this for?Audience MLE, DE, DS, or SWE who want to learn to engineer production ready LLM systems using LLMOps good principles.Level intermediatePrerequisites basic knowledge of Python, ML, and the cloudHow will you learn?The course contains 10 hands on written lessons and the open source code you can access on GitHub, showing how to build an end to end LLM system.Also, it includes 2 bonus lessons on how to improve the RAG system.You can read everything at your own pace. To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.Costs?The articles and code are completely free. They will always remain free.But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.The cloud computing platforms AWS, Qwak have a pay as you go pricing plan. Qwak offers a few hours of free computing. 
Thus, we did our best to keep costs to a minimum.For the other serverless tools Qdrant, Comet , we will stick to their freemium version, which is free of charge.Meet your teachers!The course is created under the Decoding ML umbrella by Paul Iusztin Senior ML MLOps EngineerAlex Vesa Senior AI EngineerAlex Razvant Senior ML MLOps Engineer Check out the code on GitHub 1 and support us with a Lessons Quick overview of each lesson of the LLM Twin free course.The course is split into 12 lessons. Every Medium article will be its own lesson An End to End Framework for Production Ready LLM Systems by Building Your LLM TwinThe Importance of Data Pipelines in the Era of Generative AIChange Data Capture Enabling Event Driven ArchitecturesSOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time!The 4 Advanced RAG Algorithms You Must Know to ImplementThe Role of Feature Stores in Fine Tuning LLMsHow to fine tune LLMs on custom datasets at Scale using Qwak and CometMLBest Practices When Evaluating Fine Tuned LLMsArchitect scalable and cost effective LLM RAG inference pipelinesHow to evaluate your RAG pipeline using the RAGAs Framework Bonus Build a scalable RAG ingestion pipeline using 74.3 less code Bonus Build Multi Index Advanced RAG AppsTo better understand the course s goal, technical details, and system design Check out Lesson 1Let s start with Lesson 9 Lesson 9 Architect scalable and cost effective LLM RAG inference pipelinesIn Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM twin system.First, we will design and implement a scalable LLM RAG inference pipeline based on microservices, separating the ML and business logic into two layers.Secondly, we will use Comet ML to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis.Ultimately, we will deploy the inference pipeline to Qwak and make the LLM twin service available worldwide. Context from previous lessons. What you must know.This lesson is part of a more extensive series in which we learn to build an end to end LLM system using LLMOps best practices.In Lesson 4, we populated a Qdrant vector DB with cleaned, chunked, and embedded digital data posts, articles, and code snippets .In Lesson 5, we implemented the advanced RAG retrieval module to query relevant digital data. Here, we will learn to integrate it into the final inference pipeline.In Lesson 7, we used Qwak to build a training pipeline to fine tune an open source LLM on our custom digital data. The LLM weights are available in a model registry.In Lesson 8, we evaluated the fine tuned LLM to ensure the production candidate behaves accordingly.So What you must know from all of this?Don t worry. If you don t want to replicate the whole system, you can read this article independently from the previous lesson.Thus, the following assumptions are what you have to know. We have a Qdrant vector DB populated with digital data posts, articles, and code snippets a vector DB retrieval module to do advanced RAGa fine tuned open source LLM available in a model registry from Comet ML In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud.Architect scalable and cost effective LLM RAG inference pipelinesTable of ContentsThe architecture of the inference pipelineThe training vs. 
the inference pipelineSettings Pydantic classThe RAG business moduleThe LLM microservicePrompt monitoringDeploying and running the inference pipelineConclusion Check out the code on GitHub 1 and support us with a 1. The architecture of the inference pipelineOur inference pipeline contains the following core elements a fine tuned LLMa RAG modulea monitoring serviceLet s see how to hook these into a scalable and modular system.The interface of the inference pipelineAs we follow the feature training inference FTI pipeline architecture, the communication between the 3 core components is clear.Our LLM inference pipeline needs 2 things a fine tuned LLM pulled from the model registryfeatures for RAG pulled from a vector DB which we modeled as a logical feature store This perfectly aligns with the FTI architecture. If you are unfamiliar with the FTI pipeline architecture, we recommend you review Lesson 1 s section on the 3 pipeline architecture.Monolithic vs. microservice inference pipelinesUsually, the inference steps can be split into 2 big layers the LLM service where the actual inference is being donethe business service domain specific logicWe can design our inference pipeline in 2 ways.Option 1 Monolithic LLM business serviceIn a monolithic scenario, we implement everything into a single service.Pros easy to implementeasy to maintainCons harder to scale horizontally based on the specific requirements of each componentharder to split the work between multiple teamsnot being able to use different tech stacks for the two servicesMonolithic vs. microservice inference pipelinesOption 2 Different LLM business microservicesThe LLM and business services are implemented as two different components that communicate with each other through the network, using protocols such as REST or gRPC.Pros each component can scale horizontally individuallyeach component can use the best tech stack at handCons harder to deployharder to maintainLet s focus on the each component can scale individually part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do the job only with a CPU.As the LLM inference takes longer, you will often need more LLM service replicas to meet the demand. But remember that GPU VMs are really expensive.By decoupling the 2 components, you will run only what is required on the GPU machine and not block the GPU VM with other computing that can quickly be done on a much cheaper machine.Thus, by decoupling the components, you can scale horizontally as required, with minimal costs, providing a cost effective solution to your system s needs.Microservice architecture of the LLM twin inference pipelineLet s understand how we applied the microservice pattern to our concrete LLM twin inference pipeline.As explained in the sections above, we have the following components A business microserviceAn LLM microserviceA prompt monitoring microserviceThe business microservice is implemented as a Python module that contains the advanced RAG logic, which calls the vector DB and GPT 4 API for advanced RAG operations calls the LLM microservice through a REST API using the prompt computed utilizing the user s query and retrieved contextsends the prompt and the answer generated by the LLM to the prompt monitoring microservice.As you can see, the business microservice is light. 
It glues all the domain steps together and delegates the computation to other services.The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application.However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud.Microservice architecture of the LLM twin inference pipelineThe LLM microservice is deployed on Qwak. This component is wholly niched on hosting and calling the LLM. It runs on powerful GPU enabled machines.How does the LLM microservice work?It loads the fine tuned LLM twin model from Comet s model registry 2 .It exposes a REST API that takes in prompts and outputs the generated answer.When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer.That s it!The prompt monitoring microservice is based on Comet ML s LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM.Remember that a prompt can get quite complex. When building complex LLM apps, the prompt usually results from a chain containing other prompts, templates, variables, and metadata.Thus, a prompt monitoring service, such as the one provided by Comet ML, differs from a standard logging service. It allows you to quickly dissect the prompt and understand how it was created. Also, by attaching metadata to it, such as the latency of the generated answer and the cost to generate the answer, you can quickly analyze and optimize your prompts.2. The training vs. the inference pipelineBefore diving into the code, let s quickly clarify what is the difference between the training and inference pipelines.Along with the apparent reason that the training pipeline takes care of training while the inference pipeline takes care of inference Duh! , there are some critical differences you have to understand.The input of the pipeline How the data is accessedDo you remember our logical feature store based on the Qdrant vector DB and Comet ML artifacts? If not, consider checking out Lesson 6 for a refresher.The core idea is that during training, the data is accessed from an offline data storage in batch mode, optimized for throughput and data lineage.Our LLM twin architecture uses Comet ML artifacts to access, version, and track all our data.The data is accessed in batches and fed to the training loop.During inference, you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove.During inference, you don t care about data versioning and lineage. You just want to access your features quickly for a good user experience.The data comes directly from the user and is sent to the inference logic.The training vs. the inference pipelineThe output of the pipelineThe training pipeline s final output is the trained weights stored in Comet s model registry.The inference pipeline s final output is the predictions served directly to the user.The infrastructureThe training pipeline requires more powerful machines with as many GPUs as possible.Why? During training, you batch your data and have to hold in memory all the gradients required for the optimization steps. 
Because of the optimization algorithm, training is more compute-hungry than inference. Thus, more compute and VRAM result in bigger batches, which means less training time and more experiments. The inference pipeline can do the job with less computation. During inference, you often pass a single sample or smaller batches to the model. If you run a batch pipeline, you will still pass batches to the model but don't perform any optimization steps. If you run a real-time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step.

Are there any overlaps? Yes! This is where the training-serving skew comes in. During training and inference, you must carefully apply the same preprocessing and postprocessing steps. If the preprocessing and postprocessing functions or hyperparameters don't match, you will end up with the training-serving skew problem. Enough with the theory. Let's dig into the RAG business microservice.

3. Settings Pydantic class

First, let's understand how we defined the settings to configure the inference pipeline components. We used pydantic_settings and inherited its BaseSettings class. This approach lets us quickly define a set of default settings variables and load sensitive values such as the API KEY from a .env file.

from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # ... Settings.

    # CometML config
    COMET_API_KEY: str
    COMET_WORKSPACE: str
    COMET_PROJECT: str = "llm-twin-course"

    # ... More settings.

settings = AppSettings()

All the variables called settings.* (e.g., settings.COMET_API_KEY) come from this class.

4. The RAG business module

We will define the RAG business module under the LLMTwin class. The LLM twin logic is directly correlated with our business logic. We don't have to introduce the word "business" in the naming convention of the classes. What we presented so far was used for a clear separation of concern between the LLM and business layers.

Initially, within the LLMTwin class, we define all the clients we need for our business logic: Inference pipeline business module __init__ method (GitHub Code). Now let's dig into the generate method, where we: call the RAG module; create the prompt using the prompt template, query and context; call the LLM microservice; log the prompt, prompt template, and answer to Comet ML's prompt monitoring service. Inference pipeline business module generate method (GitHub Code).

Now, let's look at the complete code of the generate method. It's the same thing as what we presented above, but with all the nitty-gritty details.

class LLMTwin:
    def __init__(self) -> None:
        ...
    def generate(
        self,
        query: str,
        enable_rag: bool = True,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {"question": query}

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()
        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_monitoring is True:
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )

        return {"answer": answer}

Let's look at how our LLM microservice is implemented using Qwak.

5. The LLM microservice

As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions:
- initialize_model(): where we load the fine-tuned model from the model registry at serving time
- schema(): where we define the input and output schema
- predict(): where we implement the actual inference logic

Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.

(LLM microservice, GitHub)

Let's zoom into the implementation and the life cycle of the Qwak model. The schema() method is used to define what the input and output of the predict() method look like. This will automatically validate the structure and type of the predict() method's inputs and outputs.
For example, the LLM microservice will throw an error if the instruction variable is a JSON object instead of a string.

The other Qwak-specific methods are called in the following order:
1. __init__: when deploying the model
2. initialize_model: when deploying the model
3. predict: on every request to the LLM microservice

Note that these methods are called only during serving time, not during training.

Qwak exposes your model as a RESTful API, where the predict() method is called on each request. Inside the prediction method, we perform the following steps:
- map the input text to token IDs using the LLM-specific tokenizer
- move the token IDs to the provided device (GPU or CPU)
- pass the token IDs to the LLM and generate the answer
- extract only the generated tokens from the generated_ids variable by slicing it using the shape of the input_ids
- decode the generated_ids back to text
- return the generated text

Here is the complete code for the implementation of the Qwak LLM microservice:

class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        use_experiment_tracker: bool = True,
        register_model_to_model_registry: bool = True,
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        fine_tuned_llm_twin_model_type: str = settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        dataset_artifact_name: str = settings.DATASET_ARTIFACT_NAME,
        config_file: str = settings.CONFIG_FILE,
        model_save_dir: str = settings.MODEL_SAVE_DIR,
    ) -> None:
        self.use_experiment_tracker = use_experiment_tracker
        self.register_model_to_model_registry = register_model_to_model_registry
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.fine_tuned_llm_twin_model_type = fine_tuned_llm_twin_model_type
        self.dataset_artifact_name = dataset_artifact_name
        self.training_args_config_file = config_file

    def build(self) -> None:
        ...  # Training logic.

    def initialize_model(self) -> None:
        self.model, self.tokenizer, _ = build_qlora_model(
            pretrained_model_name_or_path=self.model_type,
            peft_pretrained_model_name_or_path=self.fine_tuned_llm_twin_model_type,
            bnb_config=self.nf4_config,
            lora_config=self.qlora_config,
            cache_dir=settings.CACHE_DIR,
        )
        self.model = self.model.to(self.device)

        logging.info(f"Successfully loaded model from {self.model_save_dir}")

    def schema(self) -> ModelSchema:
        return ModelSchema(
            inputs=[RequestInput(name="instruction", type=str)],
            outputs=[InferenceOutput(name="content", type=str)],
        )

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df: pd.DataFrame):
        input_text = list(df["instruction"].values)
        input_ids = self.tokenizer(input_text, return_tensors="pt", add_special_tokens=True)
        input_ids = input_ids.to(self.device)

        generated_ids = self.model.generate(
            **input_ids,
            max_new_tokens=500,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        answer_start_idx = input_ids["input_ids"].shape[1]
        generated_answer_ids = generated_ids[:, answer_start_idx:]
        decoded_output = self.tokenizer.batch_decode(generated_answer_ids)[0]

        return pd.DataFrame([{"content": decoded_output}])

Where the settings used in the code above have the following values:

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ...  # Other settings.

    DATASET_ARTIFACT_NAME: str = "posts-instruct-dataset"
    FINE_TUNED_LLM_TWIN_MODEL_TYPE: str = "decodingml/llm-twin:1.0.0"
    CONFIG_FILE: str = "./finetuning/config.yaml"
    MODEL_SAVE_DIR: str = "./training_pipeline_output"
    CACHE_DIR: Path = Path("./.cache")

The most important one is the FINE_TUNED_LLM_TWIN_MODEL_TYPE setting, which reflects what model and version to load from the model registry.

Access the code here.

The final step is to look at Comet's prompt monitoring service.
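Before moving on, here is how the business module and the LLM microservice fit together end to end. This is an illustrative sketch only: the import path is an assumption, and the repository's real entry point is the main.py file shown in the deployment section below.

from llm_twin import LLMTwin  # hypothetical import path; see the course repository for the real module

llm_twin = LLMTwin()
result = llm_twin.generate(
    query="Write a short LinkedIn post about the benefits of CDC in event-driven architectures.",
    enable_rag=True,          # retrieve and rerank context from Qdrant before prompting
    enable_monitoring=True,   # log the prompt, template, and answer to Comet ML
)
print(result["answer"])

Under the hood, this single call queries the vector DB, builds the prompt, hits the Qwak REST endpoint, and logs everything to the prompt monitoring service.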
6. Prompt monitoring

Comet makes prompt monitoring straightforward. There is just one API call where you connect to your project and workspace and send the following to a single function:
- the prompt and LLM output
- the prompt template and variables that created the final output
- your custom metadata specific to your use case; here, you add information about the model, prompt token count, token generation costs, latency, etc.

(Prompt monitoring service, GitHub)

Let's look at the logs in Comet ML's LLMOps dashboard. Here is how you can quickly access them:
1. log in to Comet (or create an account)
2. go to your workspace
3. access the project with the LLM symbol attached to it; in our case, this is the "llm twin course monitoring" project

Note: Comet ML provides a free version that is enough to run these examples.

Screenshot from Comet ML's dashboard

This is how Comet ML's prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM. You can click on any prompt and see everything we logged programmatically using the PromptMonitoringManager class.

Screenshot from Comet ML's dashboard

Besides what we logged, adding various tags and the inference duration can be valuable.

7. Deploying and running the inference pipeline

Qwak makes the deployment of the LLM microservice straightforward. During Lesson 7, we fine-tuned the LLM and built the Qwak model. As a quick refresher, we ran the following CLI command to build the Qwak model, where we used the build_config.yaml file with the build configuration:

poetry run qwak models build -f build_config.yaml .

After the build is finished, we can make various deployments based on it. For example, we can deploy the LLM microservice using the following Qwak command:

qwak models deploy realtime --model-id "llm_twin" --instance "gpu.a10.2xl" --timeout 50000 --replicas 2 --server-workers 2

We deployed two replicas of the LLM twin. Each replica has access to a machine with one A10 GPU, and each replica has two workers running on it. (More on Qwak instance types.) Two replicas and two workers result in four microservices that run in parallel and can serve our users. You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by listening to the consumption of GPU, CPU, or RAM. To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies.

You can quickly close the deployment by running the following:

qwak models undeploy --model-id "llm_twin"

We strongly recommend closing down the deployment when you are done, as GPU VMs are expensive.

To run the LLM system with a predefined prompt example, you have to run the following Python file:

poetry run python main.py

Within the main.py file, we call the LLMTwin class, which calls the other services as explained during this lesson.

Note: The complete installation and usage instructions are available in the README of the GitHub repository.

Check out the code on GitHub [1] and support us with a ⭐.

Conclusion

Congratulations! You are close to the end of the LLM twin series. In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems. First, you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the differences in designing the training and inference pipelines. Secondly, we walked you through implementing the RAG business module and LLM twin microservice.
Also, we showed you how to log all the prompts, answers, and metadata to Comet's prompt monitoring service. Ultimately, we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.

In Lesson 10, we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLM's answers relative to the query and context. See you there!

Check out the code on GitHub [1] and support us with a ⭐.

Enjoyed this article? Join the Decoding ML Newsletter for battle-tested content on designing, coding, and deploying production-grade ML and MLOps systems. Every week. For free: decodingml.substack.com

References

Literature
[1] Your LLM Twin Course GitHub Repository (2024), Decoding ML GitHub Organization
[2] Add your models to Model Registry (2024), Comet ML Guides

Images
If not otherwise stated, all images are created by the author.
", "platform": "medium", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://medium.com/decodingml/architect-scalable-and-cost-effective-llm-rag-inference-pipelines-73b94ef82a99" }, { "id": "d39ca560-21bf-4a6c-a080-064b1ad7996a", "content": "Real-time feature pipelines for RAG, by Paul Iusztin. RAG hybrid search with transformers-based sparse vectors. CDC tech stack for event-driven architectures. Paul Iusztin, Aug 17, 2024.

This week's topics:
- CDC tech stack for event-driven architectures
- Real-time feature pipelines with CDC
- RAG hybrid search with transformers-based sparse vectors

CDC tech stack for event-driven architectures

Here is the tech stack used to build a Change Data Capture (CDC) component for implementing an event-driven architecture in our LLM Twin course.

What is Change Data Capture (CDC)? The purpose of CDC is to capture insertions, updates, and deletions applied to a database and to make this change data available in a format easily consumable by downstream applications.

Why do we need the CDC pattern?
- Real-time data syncing
- Efficient data pipelines
- Minimized system impact
- Event-driven architectures

What do we need for an end-to-end implementation of CDC? We will take the tech stack used in our LLM Twin course as an example, where we built a feature pipeline to gather cleaned data for fine-tuning and chunked, embedded data for RAG. Everything will be done only in Python! Here they are:

1. The source database: MongoDB (it also works for most databases, such as MySQL, PostgreSQL, Oracle, etc.)
2. A tool to monitor the transaction log: MongoDB Watcher (Debezium is also a popular, scalable solution)
3. A distributed queue: RabbitMQ (another popular option is Kafka, but it was overkill for our use case)
4. A streaming engine: Bytewax (a great streaming engine for the Python ecosystem)
5. The destination database: Qdrant (this works with any other database, but we needed a vector DB to store our data for fine-tuning and RAG)

For example, here is how a WRITE operation will be processed:
1. Write a post to the MongoDB warehouse.
2. A CREATE operation is logged in the transaction log of Mongo.
3. The MongoDB watcher captures this and emits it to the RabbitMQ queue.
4. The Bytewax streaming pipeline reads the event from the queue.
5. It cleans, chunks, and embeds it right away, in real time!
6. The cleaned, embedded version of the post is written to Qdrant.

Real-time feature pipelines with CDC

How to implement CDC to sync your data warehouse and feature store using a RabbitMQ queue and a Bytewax streaming engine.

First, let's understand where you need to implement the Change Data Capture (CDC) pattern. CDC is used when you want to sync two databases. The destination can be a complete replica of the source database (e.g., one for transactional and the other for analytical applications), or you can process the data from the source database before loading it to the destination DB (e.g., retrieve various documents and chunk and embed them for RAG).

That's what I am going to show you: how to use CDC to sync a MongoDB warehouse and a Qdrant vector DB to streamline real-time documents that must be ready for fine-tuning LLMs and RAG. MongoDB is our data warehouse. Qdrant is our logical feature store.
Here is the implementation of the CDC pattern:
1. Use Mongo's watch() method to listen for CRUD transactions.
2. For example, on a CREATE operation, along with saving it to Mongo, the watch() method will trigger a change and return a JSON with all the information.
3. We standardize the JSON into our desired structure.
4. We stringify the JSON and publish it to the RabbitMQ queue.

How do we scale? You can use Debezium instead of Mongo's watch() method to scale up the system, but the idea remains the same. You can swap RabbitMQ with Kafka, but RabbitMQ can get you far.

Now, what happens on the other side of the queue? You have a Bytewax streaming pipeline, 100% written in Python, that:
5. Listens in real time to new messages from the RabbitMQ queue.
6. Cleans, chunks, and embeds the events on the fly.
7. Loads the data to Qdrant for LLM fine-tuning and RAG.

MongoDB CDC example

Do you want to check out the full code? ...or even an entire article about CDC? The CDC component is part of the LLM Twin FREE course, made by Decoding ML.
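In the meantime, here is a minimal, self-contained sketch of steps 1-4 using pymongo change streams and pika. It is illustrative only: the connection URIs, database, collection, and queue names are placeholders, and the course repository's actual code differs.

import json

import pika
from pymongo import MongoClient

# Placeholders -- point these at your own MongoDB instance (running as a replica set,
# which change streams require) and your own RabbitMQ broker.
mongo = MongoClient("mongodb://localhost:27017")
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="mongo_events")

# 1. Listen to the collection's change stream for insert operations.
with mongo["production"]["posts"].watch([{"$match": {"operationType": "insert"}}]) as stream:
    for change in stream:
        # 2.-3. Standardize the raw change event into the structure downstream consumers expect.
        document = change["fullDocument"]
        document["_id"] = str(document["_id"])  # make the ObjectId JSON-serializable
        event = {"entry_id": document["_id"], "type": "posts", "content": document}
        # 4. Stringify the JSON and publish it to the RabbitMQ queue.
        channel.basic_publish(exchange="", routing_key="mongo_events", body=json.dumps(event))

On the other side of the queue, a Bytewax pipeline would consume these messages, then clean, chunk, embed, and load them into Qdrant.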
Lesson 3: Change Data Capture: Enabling Event-Driven Architectures (GitHub)

RAG hybrid search with transformers-based sparse vectors

Hybrid search is standard in advanced RAG systems. The trick is to compute the suitable sparse vectors for it. Here is an article that shows how to use SPLADE to compute sparse vectors using transformers and integrate them into a hybrid search algorithm using Qdrant.

Why bother with sparse vectors when we have dense vector embeddings? Sparse vectors represent data by highlighting only the most relevant features (like keywords), significantly reducing memory usage compared to dense vectors. Also, sparse vectors work great at finding specific keywords, which is why they pair so well with dense vectors, which are used for finding similarities in semantics but not particular words.
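As a rough illustration of how a transformer can produce such a sparse vector, here is a minimal SPLADE-style sketch. The checkpoint name (naver/splade-cocondenser-ensembledistil) is an assumed public model, and this is not the referenced article's exact code.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # assumed public SPLADE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Change Data Capture syncs MongoDB with the Qdrant vector DB."
tokens = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**tokens).logits  # shape: (1, seq_len, vocab_size)

# SPLADE-style aggregation: ReLU + log saturation, then max-pool over the sequence.
weights = torch.max(
    torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1), dim=1
).values.squeeze()

# Keep only the non-zero dimensions -- this is the sparse (term -> weight) vector.
indices = weights.nonzero().squeeze().tolist()
sparse_vector = {tokenizer.decode([i]): round(weights[i].item(), 2) for i in indices}
print(sparse_vector)

The resulting index/weight pairs can then be stored alongside dense embeddings in Qdrant and combined at query time for hybrid search.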
The article highlights:
- Sparse vs. dense vectors
- How SPLADE works: the SPLADE model leverages sparse vectors to perform better than traditional methods like BM25 by computing them with transformer architectures.
- Why SPLADE works: it expands terms based on context rather than just frequency, offering a more nuanced understanding of content relevancy.
- How to implement hybrid search using SPLADE with Qdrant (step-by-step code)

Sparse vectors using transformers

Here is the article: Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search

Images: If not otherwise stated, all images are created by the author.", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/real-time-feature-pipelines-with?r=1ttoeh" }, { "id": "4271a54f-6239-4f50-97e6-b3fa3a9a2fbd", "content": "Building ML Systems Using the FTI Architecture: an introduction to the feature/training/inference (FTI) design pattern to build scalable and modular ML systems using MLOps best practices.
Building ML systems the right way using the FTI architecture. The fundamentals of the FTI architecture that will help you build modular and scalable ML systems using MLOps best practices. Paul Iusztin, Aug 10, 2024.

The feature/training/inference (FTI) architecture builds scalable and modular ML systems using MLOps best practices. We will start by discussing the problems of naively building ML systems. Then, we will examine other potential solutions and their problems. Ultimately, we will present the feature/training/inference (FTI) design pattern and its benefits. We will also understand the benefits of using a feature store and model registry when architecting your ML system.

The problem with building ML systems

Building production-ready ML systems is much more than just training a model. From an engineering point of view, training the model is the most straightforward step in most use cases. However, training a model becomes complex when deciding on the correct architecture and hyperparameters. That's not an engineering problem but a research problem. At this point, we want to focus on how to design a production-ready architecture. Training a model with high accuracy is extremely valuable, but just by training it on a static dataset, you are far from deploying it robustly. We have to consider how to:
- ingest, clean, and validate fresh data
- handle training vs. inference setups
- compute and serve features in the right environment
- serve the model in a cost-effective way
- version, track, and share the datasets and models
- monitor your infrastructure and models
- deploy the model on a scalable infrastructure
- automate the deployments and training

These are the types of problems an ML or MLOps engineer must consider, while the research or data science team is often responsible for training the model.

Figure 1: Components of an ML system. Photo from the Google Cloud Architecture documents.

Figure 1 shows all the components the Google Cloud team suggests a mature ML and MLOps system requires. Along with the ML code, there are many moving pieces. The rest of the system comprises configuration, automation, data collection, data verification, testing and debugging, resource management, model analysis, process and metadata management, serving infrastructure, and monitoring. The point is that there are many components we must consider when productionizing an ML model.

Thus, the critical question is: how do we connect all these components into a single homogeneous system? We must create a boilerplate for clearly designing ML systems to answer that question. Similar solutions exist for classic software. For example, if you zoom out, most software applications can be split between a database, business logic, and UI layer. Every layer can be as complex as needed, but at a high-level overview, the architecture of standard software can be boiled down to these three components. Do we have something similar for ML applications? The first step is to examine previous solutions and why they are unsuitable for building scalable ML systems.

The issue with previous solutions

In Figure 2, you can observe the typical architecture present in most ML applications.
It is based on a monolithic batch architecture that couples the feature creation, model training, and inference into the same component. By taking this approach, you quickly solve one critical problem in the ML world: the training-serving skew. The training-serving skew happens when the features passed to the model are computed differently at training and inference time. In this architecture, the features are created using the same code. Hence, the training-serving skew issue is solved by default. This pattern works fine when working with small data. The pipeline runs on a schedule in batch mode, and the predictions are consumed by a third-party application, such as a dashboard.

Figure 2: Monolithic batch pipeline architecture

Unfortunately, building a monolithic batch system raises many other issues, such as:
- features are not reusable by your system or others
- if the data increases, you have to refactor the whole code to support PySpark or Ray
- it is hard to rewrite the prediction module in a more efficient language such as C++, Java, or Rust
- it is hard to share the work between multiple teams across the features, training, and prediction modules
- it is impossible to switch to a streaming technology for real-time training

In Figure 3, we can see a similar scenario for a real-time system. This use case introduces another issue in addition to what we listed before. To make the predictions, we have to transfer the whole state through the client request so the features can be computed and passed to the model. Consider the scenario of computing movie recommendations for a user. Instead of simply passing the user ID, we must transmit the entire user state, including their name, age, gender, movie history, and more. This approach is fraught with potential errors, as the client must understand how to access this state, and it is tightly coupled with the model service.

Another example would be when implementing an LLM with RAG support. The documents we add as context along with the query represent our external state. If we didn't store the records in a vector DB, we would have to pass them with the user query. To do so, the client must know how to query and retrieve the documents, which is not feasible. It is an antipattern for the client application to know how to access or compute the features. If you don't understand how RAG works, we will explain it in future chapters.

Figure 3: Stateless real-time architecture

In conclusion, our problem is accessing the features to make predictions without passing them in the client's request. For example, based on our first user movie recommendation example, how can we predict the recommendations solely based on the user's ID? Remember these questions, as we will answer them shortly.

The solution: the FTI architecture

The solution is based on creating a clear and straightforward mind map that any team or person can follow to compute the features, train the model, and make predictions. Based on these three critical steps that any ML system requires, the pattern is known as the FTI (feature, training, inference) pipelines. So, how does this differ from what we presented before? The pattern suggests that any ML system can be boiled down to these three pipelines: feature, training, and inference (similar to the database, business logic, and UI layers from classic software). This is powerful, as we can clearly define the scope and interface of each pipeline. Also, it's easier to understand how the three components interact. As shown in Figure 4, we have the feature, training, and inference pipelines.
We will zoom in on each of them and understand their scope and interface. Before going into the details, it is essential to understand that each pipeline is a different component that can run on a different process or hardware. Thus, each pipeline can be written using a different technology, by a different team, or scaled differently. The key idea is that the design is very flexible to the needs of your team. It acts as a mind map for structuring your architecture.

Figure 4: Feature/Training/Inference (FTI) pipelines architecture

The feature pipeline

The feature pipeline takes data as input and outputs the features and labels used to train the model. Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features. By saving the features into a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipeline(s). As the data is versioned, we can always ensure that the training-time and inference-time features match. Thus, we avoid the training-serving skew problem.

The training pipeline

The training pipeline takes the features and labels from the feature store as input and outputs a trained model (or models). The models are stored in a model registry. Its role is similar to that of feature stores, but this time, the model is the first-class citizen. Thus, the model registry will store, version, track, and share the model with the inference pipeline. Also, most modern model registries support a metadata store that allows you to specify essential aspects of how the model was trained. The most important are the features, labels, and their version used to train the model. Thus, we will always know what data the model was trained on.

The inference pipeline

The inference pipeline takes as input the features and labels from the feature store and the trained model from the model registry. With these two, predictions can easily be made in either batch or real-time mode. As this is a versatile pattern, it is up to you to decide what you do with your predictions. If it's a batch system, they will probably be stored in a database. If it's a real-time system, the predictions will be served to the client who requested them. As the features, labels, and model are versioned, we can easily upgrade or roll back the deployment of the model. For example, we will always know that model v1 uses features F1, F2, and F3, and model v2 uses F2, F3, and F4. Thus, we can quickly change the connections between the model and features.

Benefits of the FTI architecture

To conclude, the most important thing you must remember about the FTI pipelines is their interface:
- The feature pipeline takes in data and outputs features and labels saved to the feature store.
- The training pipeline queries the feature store for features and labels and outputs a model to the model registry.
- The inference pipeline uses the features from the feature store and the model from the model registry to make predictions.

It doesn't matter how complex your ML system gets; these interfaces will remain the same. Now that we better understand how the pattern works, we want to highlight its main benefits:
- as you have just three components, it is intuitive to use and easy to understand
- each component can be written in its own tech stack, so we can quickly adapt it to specific needs, such as big or streaming data.
Also, it allows us to pick the best tools for the job.
- as there is a transparent interface between the three components, each one can be developed by a different team (if necessary), making the development more manageable and scalable
- every component can be deployed, scaled, and monitored independently

The final thing you must understand about the FTI pattern is that the system doesn't have to contain only three pipelines. In most cases, it will include more. For example, the feature pipeline can be composed of a service that computes the features and one that validates the data. Also, the training pipeline can be composed of the training and evaluation components. The FTI pipelines act as logical layers. Thus, it is perfectly fine for each to be complex and contain multiple services. However, what is essential is to stick to the same interface on how the FTI pipelines interact with each other through the feature store and model registries. By doing so, each FTI component can evolve differently, without knowing the details of the others and without breaking the system on new changes.

Conclusion

In this article, we understood the fundamental problems with naively building ML systems. We also looked at potential solutions and their downsides. Ultimately, we presented the FTI architecture, its benefits, and how to apply it to modern ML systems.

My latest book, the LLM Engineer's Handbook, inspired me to write this article. If you liked this article, consider supporting me by buying my book and enjoy a lot more similar content compressed into a single book: LLM Engineer's Handbook (book cover).

References

Literature
[1] Jim Dowling, From MLOps to ML Systems with Feature-Training-Inference Pipelines (2023), Hopsworks blog

Images
If not otherwise stated, all images are created by the author.", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/building-ml-systems-the-right-way?r=1ttoeh" }, { "id": "2ce3c5d1-730b-4258-88ab-07009eddaf33", "content": "Reduce your PyTorch code latency by 82%, by Paul Iusztin. How not to optimize the inference of your DL models. Computer science is dead. Paul Iusztin, Aug 03, 2024.

Decoding ML Notes

This week's topics:
- Reduce the latency of your PyTorch code by 82%
- How I failed to optimize the inference of my DL models
- Computer science is dead

New book on engineering end-to-end LLM systems, from data collection and fine-tuning to LLMOps (deployment, monitoring).
I kept this one a secret, but in the past months, in collaboration with Packt, Alex Vesa, and Maxime Labonne, we started working on the LLM Engineer's Handbook: a book that will walk you through everything you need to know to build a production-ready LLM project. I am a big advocate of learning with hands-on examples while being anchored in real-world use cases. That is why this is not a standard theoretical book. While reading the book, you will learn to build a complex LLM project: an LLM Twin. In contrast, theoretical aspects will back everything to understand why we make certain decisions. However, our ultimate goal is to present a framework that can be applied to most LLM projects.

Here is a sneak peek of what you will learn within the LLM Engineer's Handbook:
- collect unstructured data
- create instruction datasets from raw data to fine-tune LLMs
- SFT techniques such as LoRA and QLoRA
- LLM evaluation techniques
- preference alignment using DPO
- inference optimization methods (key optimization, model parallelism, quantization, attention mechanisms)
- advanced RAG algorithms using LangChain as our LLM framework and Qdrant as our vector DB
- design LLM systems using the FTI architecture
- use AWS SageMaker to fine-tune and deploy open-source LLMs
- use ZenML to orchestrate all the pipelines and track the data as artifacts
- LLMOps patterns such as CT/CI/CD pipelines, model registries, and using Comet for experiment tracking and prompt monitoring

The book is still a work in progress, but we are very excited about it! Thank you, Packt, for making this possible, and Maxime and Alex for this remarkable collaboration. If you are curious, you can currently pre-order it from Amazon. The whole book should be released by the end of September 2024.
LLM Engineer's Handbook: Master the art of engineering Large Language Models from concept to production.

Reduce the latency of your PyTorch code by 82%

This is how I reduced the latency of my PyTorch code by 82% using only Python and PyTorch. NO fancy tools involved!

The problem? During inference, I am running 5 DL models on 25k images at once. The script took around 4 hours to run. The problem is that this isn't a batch job that runs overnight... Various people across the company required it to run in real time, multiple times a day.

The solution? The first thing that might come to your mind is to start using some fancy optimizer (e.g., TensorRT). Even though that should be done at some point, first you should ask yourself:
- I/O bottlenecks: can reading/writing images, preprocessing, and postprocessing be parallelized?
- are the CUDA cores used at their maximum potential?
- is the bandwidth between the CPU and GPU throttled?
- can we move more computation to the GPU?

That being said, here is what I did to decrease the latency of the script by 82%:
1. Batched the inference samples

Batching is not only valuable for training but also mighty in speeding up your inference time. Otherwise, you waste your GPU's CUDA cores. Instead of passing samples through the models one at a time, I now process 64 at once.

2. Leveraged PyTorch's DataLoader

This has two main advantages:
- parallel data loading and preprocessing on multiple processes (NOT threads)
- copying your input images directly into pinned memory (avoiding an extra CPU-to-CPU copy operation)

3. Moved as much of the postprocessing as possible onto the GPU

I saw that the tensor was moved to the CPU too early and mapped to a NumPy array. I refactored the code to keep it on the GPU as much as possible, which had two main advantages:
- tensors are processed faster on the GPU
- at the end of the logic, I had smaller tensors, resulting in smaller transfers between the CPU and GPU

4. Multithreading for all my I/O write operations

For I/O bottlenecks, using Python threads is extremely powerful. I moved all my writes under a ThreadPoolExecutor, batching my write operations.

Note that I used only good old Python and PyTorch code. When the code is poorly written, no tool can save you. Only then is it time to add fancy tooling, such as TensorRT.

So remember, to optimize the PyTorch code by 82%:
1. Batch the inference samples
2. Leverage PyTorch's DataLoader
3. Move as much of the postprocessing as possible onto the GPU
4. Use multithreading for all I/O write operations

What other methods do you have in mind?
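For reference, here is a compressed, illustrative sketch of how the four ideas fit together. The model, dataset, and save function are toy placeholders, not the author's production code.

from concurrent.futures import ThreadPoolExecutor

import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    """Placeholder dataset -- in practice this would read and preprocess the images."""
    def __len__(self):
        return 25_000
    def __getitem__(self, idx):
        return torch.rand(3, 224, 224)

def save_result(tensor: torch.Tensor) -> None:
    ...  # placeholder for the expensive disk / network write

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 8, 3).to(device).eval()  # stand-in for the real DL model

# 1.-2. Batch the samples and let the DataLoader parallelize loading into pinned memory.
# (Wrap in `if __name__ == "__main__":` when num_workers > 0 on Windows/macOS.)
loader = DataLoader(ImageDataset(), batch_size=64, num_workers=4, pin_memory=True)

with ThreadPoolExecutor(max_workers=8) as pool, torch.no_grad():
    for batch in loader:
        batch = batch.to(device, non_blocking=True)
        # 3. Keep the postprocessing on the GPU; only move the (smaller) result back to the CPU.
        output = model(batch).mean(dim=(2, 3)).cpu()
        # 4. Offload the I/O writes to threads so the GPU never waits on disk.
        pool.submit(save_result, output)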
Leave them in the comments!

How I failed to optimize the inference of my DL models

This is how I FAILED to optimize the inference of my DL models when running them on an Nvidia GPU. Let me tell you what to avoid.

I had a simple task: to reduce the latency of the DL models used in production. We had 4 DL models that were running on Nvidia GPUs. After a first look at the inference code, I saw that the inputs to the models weren't batched. We were processing one sample at a time. I said to myself: "Aha! That's it. I cracked it. We just have to batch as many samples as possible, and we are done." So, I did just that...

After 2-3 days of work adding the extra batch dimension to the PyTorch preprocessing and postprocessing code, I realized I WAS WRONG. Here is why.

We were using Nvidia GPUs from the A family (A6000, A5000, etc.). As these GPUs have a lot of memory (40 GB), I managed to max out the VRAM and squeeze a batch of 256 images onto the GPU. Relative to using a batch of 1, it was faster, but not A LOT faster, as I expected. Then I tried batches of 128, 64, 32, 16, and 8... and realized that everything above a batch of 16 was running slower than using a batch of 16. A batch of 16 was the sweet spot. But that is not good, as I was using only 10% of the VRAM...

Why is that? The Nvidia A family of GPUs is known for having a lot of VRAM but not being very fast: the memory transfer between the CPU and GPU and the number of CUDA cores aren't that great. That being said, my program was throttled. Even if my GPU could handle much more memory-wise, the memory transfer and processing speeds weren't keeping up.

In the end, it was a good optimization: 75% faster. But the lesson of this story is: ALWAYS KNOW YOUR HARDWARE. Most probably, running a bigger batch on an A100 or V100 wouldn't have the same problem. I plan to try that. But that is why...
you always have to optimize the parameters of your system based on your hardware!
In theory, I knew this, but it is completely different when you encounter it in production. Let me know in the comments if you want more similar stories about DO NOTs from my experience.
Computer science is dead
Computer science is dead. Do this instead.
In a recent talk, Jensen Huang, the CEO of Nvidia, said that kids shouldn't learn programming anymore. He said that until now, most of us thought that everyone should learn to program at some point, but that the opposite is actually the truth: with the rise of AI, nobody should have or need to learn to program anymore. He highlights that with AI tools, the technology divide between non-programmers and engineers is closing.
As an engineer, my ego is hurt, and my first reaction is to say it is stupid. But after thinking about it more thoroughly, I tend to agree with him. After all, even now, almost anybody can work with AI. This probably won't happen in the next 10 years, but at some point, 100% of people will do it. At some point, we will ask our AI companion to write a program that does X for us, or whatever.
But I think this is a great thing, as it will give us more time and energy to focus on what matters, such as: solving real-world problems (not just tech problems), moving to the next level of technology (bioengineering, interplanetary colonization, etc.), thinking about the grand scheme of things, being more creative, spending more time connecting with our families, and having more time to take care of ourselves.
I personally think it is a significant step for humanity. What do you think? As an engineer, do you see your job still being present in the next 10 years?
Here is the full talk.
Images: If not otherwise stated, all images are created by the author.
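To make the batch size sweep from the GPU story above concrete, here is a minimal timing sketch. The model and input shapes are placeholders, not the production models from the story; the point is only the pattern of timing several batch sizes on your actual hardware before committing to one.

import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(), torch.nn.Flatten()
).to(device).eval()  # placeholder model

def throughput(batch_size: int, n_iters: int = 20) -> float:
    # Measure samples per second for a given batch size on this hardware.
    x = torch.randn(batch_size, 3, 224, 224)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_iters):
            _ = model(x.to(device, non_blocking=True))
        if device.type == 'cuda':
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed

for bs in (1, 8, 16, 32, 64, 128, 256):
    print(f'batch={bs:4d} -> {throughput(bs):,.0f} samples/s')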
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/reduce-your-pytorchs-code-latency?r=1ttoeh" }, { "id": "7a276ac3-5c78-42d3-9ecf-05ff7f76fe31", "content": "LLM Agents Demystified by Li (Decoding ML Newsletter) Hands on ReAct Agent implementation with the AdalFlow library Li Jul 27, 2024
Hi, all! I'm Li Yin, author of AdalFlow and ex AI researcher at MetaAI. Find me on LinkedIn. Handy links: AdalFlow GitHub, Open in Colab.
_AdalFlow is an LLM library that not only helps developers build but also optimizes LLM task pipelines. Embracing a design pattern similar to PyTorch, AdalFlow is light, modular, and robust, with a 100% readable codebase._
_There are many tutorials that show users how to call high level agent APIs, but none of them explain how it really works in depth. This is where the AdalFlow library aims to make a difference._
_In this blog, you will not only learn how to use the ReAct Agent but, more importantly, also understand how it was implemented and how you can customize or build your own agent with AdalFlow._
_Let's get started!_
_Image source: credits to Growtika_
Introduction
_An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future._ (Franklin and Graesser, 1997)
Alongside the well known RAGs, agents [1] are another popular family of LLM applications. What makes agents stand out is their ability to reason, plan, and act via accessible tools. When it comes to implementation, AdalFlow has simplified it down to a generator that can use tools, taking multiple steps (sequential or parallel) to complete a user query.
Table of Contents: 1. What is ReAct Agent 2. Introduction on tools function calls 3. ReAct Agent implementation 4. ReAct Agent in action
1 . What is ReAct Agent
ReAct [2] is a general paradigm for building agents that sequentially interleaves thought, action, and observation steps.
Thought: the reasoning behind taking an action.
Action: the action to take from a predefined set of actions. In particular, these are the functional tools we introduce in the tools section.
Observation: in the simplest scenario, the execution result of the action in string format. To be more robust, this can be defined in any way that provides the right amount of execution information for the LLM to plan the next step.
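Since the thought, action, observation loop is easier to grasp in code than in prose, here is a minimal, framework free sketch of one ReAct style step loop. It is not AdalFlow's actual implementation (that is covered in the sections below); the llm_call planner and the tool set are placeholders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    thought: str
    action: str       # name of the tool the planner chose
    observation: str  # result of executing that tool

def llm_call(prompt: str) -> tuple[str, str, str]:
    # Placeholder planner: a real one calls an LLM and parses its structured output.
    if 'Paris' in prompt:
        return 'The observation answers the query.', 'finish', 'Paris'
    return 'I should look this up.', 'search', 'capital of France'

def react_loop(query: str, tools: dict[str, Callable[[str], str]], max_steps: int = 6) -> str:
    history: list[Step] = []
    for _ in range(max_steps):
        prompt = f'Query: {query}\nSteps so far: {history}'
        thought, tool_name, tool_input = llm_call(prompt)
        if tool_name == 'finish':            # the agent decides it is done
            return tool_input
        observation = tools[tool_name](tool_input)  # execute the chosen tool
        history.append(Step(thought, tool_name, observation))
    return 'Max steps reached without a final answer.'

answer = react_loop('What is the capital of France?', {'search': lambda q: 'Paris'})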
Prompt and Data Models
_The prompt is the most straightforward way to understand any LLM application. Always read the prompt._
AdalFlow uses jinja2 syntax for the prompt. DEFAULT_REACT_AGENT_SYSTEM_PROMPT is the default prompt for the ReAct agent's LLM planner. We can categorize the prompt template into four parts:
1. Task description. This part is the overall role setup and task description for the agent.
task_desc = r'''You are a helpful assistant. Answer the user's query using the tools provided below with minimal steps and maximum accuracy. Each step you will read the previous Thought, Action, and Observation (execution result of the action) and then provide the next Thought and Action.'''
2. Tools, output format, and example. This part of the template is exactly the same as how we were calling functions in the tools section. The output_format_str is generated by FunctionExpression via JsonOutputParser. It includes the actual output format and examples of a list of FunctionExpression instances. We use the thought and action fields of the FunctionExpression as the agent's response. _You can easily visualize the whole pipeline later by simply calling_ print(react).
tools = r'''{% if tools %}<TOOLS>{% for tool in tools %}{{ loop.index }}. {{ tool }}{% endfor %}</TOOLS>{% endif %}{{ output_format_str }}'''
3. Task specification to teach the planner how to think. We provide more detailed instructions to ensure the agent will always end with a finish action to complete the task. Additionally, we teach it how to handle simple and complex queries. For simple queries, we instruct the agent to finish with as few steps as possible. For complex queries, we teach the agent a divide and conquer strategy to solve the query step by step.
task_spec = r'''<TASK_SPEC> For simple queries: Directly call the finish action and provide the answer. For complex queries: Step 1: Read the user query and potentially divide it into subqueries. And get started with the first subquery. Call one available tool at a time to solve each subquery subquestion. At step finish, join all subqueries answers and finish the task. Remember: Action must call one of the above tools with name. It can not be empty. You will always end with finish action to finish the task. The answer can be the final answer or failure message. </TASK_SPEC>'''
We put all these three parts together within the <SYS></SYS> tag.
4. Agent step history. We use StepOutput to record the agent's step history, including: action (the FunctionExpression instance predicted by the agent) and observation (the execution result of the action). In particular, we format the step history after the user query as follows:
step_history = r'''User query: {{ input_str }} Step History: {% if step_history %}<STEPS>{% for history in step_history %}Step {{ loop.index }}. Thought: {{ history.action.thought }}, Action: {{ history.action.action }}, Observation: {{ history.observation }} {% endfor %}</STEPS>{% endif %} You'''
2 . Introduction on tools function calls
In addition to the tools provided by users, by default we add a new tool named finish to allow the agent to stop and return the final answer.
def finish(answer: str) -> str:
    '''Finish the task with answer.'''
    return answer
Simply returning a string might not fit all scenarios, and we might consider allowing users to define their own finish function in the future for more complex cases. Additionally, since the provided tools cannot always solve user queries, we allow users to configure whether an LLM model should be used to solve a subquery via the add_llm_as_fallback parameter. This LLM will use the same model client and model arguments as the agent's planner.
Here is our code to specify the fallback LLM tool _additional_llm_tool Generator model_client model_client, model_kwargs model_kwargs if self.add_llm_as_fallback else None def llm_tool input str str I answer any input query with llm s world knowledge. Use me as a fallback tool or when the query is simple. use the generator to answer the query try output GeneratorOutput _additional_llm_tool prompt_kwargs input_str input response output.data if output else None return response except Exception as e log.error f Error using the generator e print f Error using the generator e return None 3 . ReAct Agent implementation We define the class ReActAgent to put everything together. It will orchestrate two components planner A Generator that works with a JsonOutputParser to parse the output format and examples of the function calls using FunctionExpression . ToolManager Manages a given list of tools, the finish function, and the LLM tool. It is responsible for parsing and executing the functions. Additionally, it manages step_history as a list of StepOutput instances for the agent s internal state. Prompt the agent with an input query and process the steps to generate a response. 4 . ReAct Agent in action We will set up two sets of models, llama3 70b 8192 by Groq and gpt 3.5 turbo by OpenAI, to test two queries. For comparison, we will compare these with a vanilla LLM response without using the agent. Here are the code snippets from lightrag.components.agent import ReActAgent from lightrag.core import Generator, ModelClientType, ModelClient from lightrag.utils import setup_env setup_env Define tools def multiply a int, b int int Multiply two numbers. return a b def add a int, b int int Add two numbers. return a b def divide a float, b float float Divide two numbers. return float a b llama3_model_kwargs model llama3 70b 8192 , llama3 70b works better than 8b here. temperature 0.0, gpt_model_kwargs model gpt 3.5 turbo , temperature 0.0, def test_react_agent model_client ModelClient, model_kwargs dict tools multiply, add, divide queries What is the capital of France? and what is 465 times 321 then add 95297 and then divide by 13.2? , Give me 5 words rhyming with cool, and make a 4 sentence poem using them , define a generator without tools for comparison generator Generator model_client model_client, model_kwargs model_kwargs, react ReActAgent max_steps 6, add_llm_as_fallback True, tools tools, model_client model_client, model_kwargs model_kwargs, print react for query in queries print f Query query agent_response react.call query llm_response generator.call prompt_kwargs input_str query print f Agent response agent_response print f LLM response llm_response print The structure of React using print react , including the initialization arguments and two major components tool_manager and planner . You can visualize the structure from our colab. Now, let s run the test function to see the agent in action. test_react_agent ModelClientType.GROQ , llama3_model_kwargs test_react_agent ModelClientType.OPENAI , gpt_model_kwargs Our agent will show the core steps for developers via colored printout, including input_query, steps, and the final answer. The printout of the first query with llama3 is shown below without the color here 2024 07 10 16 48 47 react.py 287 call input_query What is the capital of France? 
and what is 465 times 321 then add 95297 and then divide by 13.2 2024 07 10 16 48 48 react.py 266 _run_one_step Step 1 StepOutput step 1, action FunctionExpression thought Let s break down the query into subqueries and start with the first one. , action llm_tool input What is the capital of France? , function Function thought None, name llm_tool , args , kwargs input What is the capital of France? , observation The capital of France is Paris! _______ 2024 07 10 16 48 49 react.py 266 _run_one_step Step 2 StepOutput step 2, action FunctionExpression thought Now, let s move on to the second subquery. , action multiply a 465, b 321 , function Function thought None, name multiply , args , kwargs a 465, b 321 , observation 149265 _______ 2024 07 10 16 48 49 react.py 266 _run_one_step Step 3 StepOutput step 3, action FunctionExpression thought Now, let s add 95297 to the result. , action add a 149265, b 95297 , function Function thought None, name add , args , kwargs a 149265, b 95297 , observation 244562 _______ 2024 07 10 16 48 50 react.py 266 _run_one_step Step 4 StepOutput step 4, action FunctionExpression thought Now, let s divide the result by 13.2. , action divide a 244562, b 13.2 , function Function thought None, name divide , args , kwargs a 244562, b 13.2 , observation 18527.424242424244 _______ 2024 07 10 16 48 50 react.py 266 _run_one_step Step 5 StepOutput step 5, action FunctionExpression thought Now, let s combine the answers of both subqueries. , action finish answer The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244. , function Function thought None, name finish , args , kwargs answer The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244. , observation The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244. _______ 2024 07 10 16 48 50 react.py 301 call answer The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244. The comparison between the agent and the vanilla LLM response is shown below Answer with agent The capital of France is Paris! and the result of the mathematical operation is 18527.424242424244. Answer without agent GeneratorOutput data I d be happy to help you with that! n nThe capital of France is Paris. n nNow, let s tackle the math problem n n1. 465 321 149,485 n2. Add 95,297 to that result 149,485 95,297 244,782 n3. Divide the result by 13.2 244,782 13.2 18,544.09 n nSo, the answer is 18,544.09! , error None, usage None, raw_response I d be happy to help you with that! n nThe capital of France is Paris. n nNow, let s tackle the math problem n n1. 465 321 149,485 n2. Add 95,297 to that result 149,485 95,297 244,782 n3. Divide the result by 13.2 244,782 13.2 18,544.09 n nSo, the answer is 18,544.09! , metadata None The ReAct agent is particularly helpful for answering queries that require capabilities like computation or more complicated reasoning and planning. However, using it on general queries might be an overkill, as it might take more steps than necessary to answer the query. 5 . Optional Customization Please refer to our tutorial for how to customize ReAct to your use case. 
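As a quick sanity check on the numbers above (and on why tool use helps here), the chain of tool calls the agent made can be reproduced with plain Python; the vanilla LLM's 18,544.09 comes from a wrong first multiplication, while the agent's tools give the exact result:

# Reproduce the agent's tool chain for: 465 * 321, then + 95297, then / 13.2
product = 465 * 321        # 149265 (the vanilla LLM answered 149,485 here)
total = product + 95297    # 244562
result = total / 13.2      # 18527.424242424244, matching the agent's final answer
print(product, total, result)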
References
[1] A survey on large language model based autonomous agents, Paitesanshi LLM Agent Survey
[2] ReAct, https://arxiv.org/abs/2210.03629
[3] Tool Tutorial, https://lightrag.sylph.ai/tutorials/tool_helper.html
API References: components.agent.react.ReActAgent, core.types.StepOutput, components.agent.react.DEFAULT_REACT_AGENT_SYSTEM_PROMPT
A guest post by Li, author of AdalFlow, founder at SylphAI, ex AI researcher at MetaAI. GitHub: liyin2015", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/llm-agents-demystified?r=1ttoeh" }, { "id": "12ad5863-ba57-4f5c-9ab7-4600c7edbf5c", "content": "Scalable RAG ingestion pipeline using 74.3% less code Tutorial on building a scalable, modular, advanced RAG feature pipeline to chunk, embed and ingest multiple data categories into a vector DB using Superlinked. End to end implementation for an advanced RAG feature pipeline Paul Iusztin Jul 20, 2024
_ the 1st lesson of the Superlinked bonus series from the LLM Twin free course_
Why is this course different? _By finishing the LLM Twin: Building Your Production Ready AI Replica free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices._ _Why should you care? No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system._ _More details on what you will learn within the LLM Twin course, here._
Latest lessons of the LLM Twin course: Lesson 8: Best practices when evaluating fine tuned LLM models (quantitative and qualitative evaluation metrics, human in the loop, LLM eval). Lesson 9: Architect scalable and cost effective LLM RAG inference pipelines (monolithic vs. microservice, Qwak deployment, RAG pipeline walkthrough). Lesson 10: How to evaluate your RAG using the RAGAs framework (RAG evaluation best practices, RAGAs framework). Lesson 11: Build a scalable RAG ingestion pipeline using 74.3% less code.
Lessons 11 and 12 are part of a bonus series in which we will take the advanced RAG system from the LLM Twin course (written in LangChain) and refactor it using Superlinked, a framework specialized in vector computing for information retrieval.
In Lesson 11 (this article), we will learn to build a highly scalable, real time RAG feature pipeline that ingests multiple data categories into a Redis vector database. More concretely, we will take the ingestion pipeline implemented in Lesson 4 and swap the chunking, embedding, and vector DB logic with Superlinked. _You don't have to read Lesson 4 to read this article.
We will give enough context to make sense of it._ In the 12th lesson , we will use Superlinked to implement a multi index query strategy and further optimize the advanced RAG retrieval module initially built in Lesson 5 . _The value of this article lies in understanding how easy it is to build complex advanced RAG systems usingSuperlinked._ _ Using Superlinked , we reduced the number of RAG related lines of code by 74.3 . Powerful, right?_ By the end of this article , you will learn to build a production ready feature pipeline built in Superlinked that uses Bytewax as a stream engine to process data in real time ingests multiple data categories from a RabbitMQ queue validates the data with Pydantic chunks, and embeds data using Superlinked for doing RAG loads the embedded vectors along their metadata to a Redis vector DB Ultimately, on the infrastructure side, we will show you how to deploy a Superlinked vector compute server. Quick intro in feature pipelines The feature pipeline is the first pipeline presented in the FTI pipeline architecture feature, training and inference pipelines. A feature pipeline takes raw data as input, processes it into features, and stores it in a feature store, from which the training inference pipelines will use it. The component is completely isolated from the training and inference code. All the communication is done through the feature store. _To avoid repeating myself, if you are unfamiliar with the FTI pipeline architecture , check out Lesson 1 for a refresher._ Table of Contents 1. What is Superlinked? 2. The old architecture of the RAG feature pipeline 3. The new Superlinked architecture of the RAG feature pipeline 4. Understanding the streaming flow for real time processing 5. Loading data to Superlinked 6. Exploring the RAG Superlinked server 7. Using Redis as a vector DB _ Check out the code on GitHub 1 and support us with a _ 1 . What is Superlinked? _Superlinked is a computing framework for turning complex data into vectors._ It lets you quickly build multimodal vectors and define weights at query time, so you don t need a custom reranking algorithm to optimize results. It s focused on turning complex data into vector embeddings within your RAG, Search, RecSys and Analytics stack. I love how Daniel Svonava, the CEO of Superlinked, described the value of vector compute and implicitly Superlinked _Daniel Svonava, CEO at Superlinked _ _ Vectors power most of what you already do online hailing a cab, finding a funny video, getting a date, scrolling through a feed or paying with a tap. And yet, building production systems powered by vectors is still too hard! Our goal is to help enterprises put vectors at the center of their data compute infrastructure, to build smarter and more reliable software. _ To conclude, Superlinked is a framework that puts the vectors in the center of their universe and allows you to chunk and embed embeddings store multi index vectors in a vector DB do complex vector search queries on top of your data. Screenshot from Superlinked s landing page 2 . The old architecture of the RAG feature pipeline Here is a quick recap of the critical aspects of the architecture of the RAG feature pipeline presented in the 4th lesson of the LLM Twin course. _We are working with 3 different data categories _ posts e.g., LinkedIn, Twitter articles e.g., Medium, Substack, or any other blog repositories e.g., GitHub, GitLab Every data category has to be preprocessed differently. 
For example, you want to chunk the posts into smaller documents while keeping the articles in bigger ones. _The solution is based on CDC , a queue, a streaming engine, and a vector DB _ The raw data is collected from multiple social platforms and is stored in MongoDB. Lesson 2 CDC adds any change made to the MongoDB to a RabbitMQ queue Lesson 3 . the RabbitMQ queue stores all the events until they are processed. The Bytewax streaming engine reads the messages from the RabbitMQ queue and cleans, chunks, and embeds them. The processed data is uploaded to a Qdrant vector DB. The old feature streaming pipeline architecture that was presented in Lesson 4. Why is this design robust? Here are 4 core reasons 1. The data is processed in real time . 2. Out of the box recovery system If the streaming pipeline fails to process a message, it will be added back to the queue 3. Lightweight No need for any diffs between databases or batching too many records 4. No I O bottlenecks on the source database What is the issue with this design? In this architecture, we had to write custom logic to chunk, embed, and load the data to Qdrant. The issue with this approach is that we had to leverage various libraries, such as LangChain and unstructured, to get the job done. Also, because we have 3 data categories, we had to write a dispatcher layer that calls the right function depending on its category, which resulted in tons of boilerplate code. Ultimately, as the chunking and embedding logic is implemented directly in the streaming pipeline, it is harder to scale horizontally. The embedding algorithm needs powerful GPU machines, while the rest of the operations require a strong CPU. This results in more time spent on development more code to maintain the code can quickly become less readable less freedom to scale. Superlinked can speed up this process by providing a very intuitive and powerful Python API that can speed up the development of our ingestion and retrieval logic. Thus, let s see how to redesign the architecture using Superlinked 3 . The new Superlinked architecture of the RAG feature pipeline The core idea of the architecture will be the same. We still want to use a Bytewax streaming engine for real time processing read new events from RabbitMQ clean, chunk, and embed the new incoming raw data load the processed data to a vector DB. The question is , how will we do this with Superlinked? As you can see in the image below, Superlinked will replace the logic for the following operations chunking embedding vector storage queries. Also, we have to swap Qdrant with a Redis vector DB because Superlinked didn t support Qdrant when I wrote this article. But they plan to add it in future months along with many other vector DBs . What will remain unchanged are the following the Bytewax streaming layer the RabbitMQ queue ingestion component the cleaning logic. _By seeing what we must change to the architecture to integrate Superlinked, we can see the framework s core features ._ The components that can be refactored into the Superlinked framework. Now, let s take a deeper look at the new architecture. All the Superlinked logic will sit on its own server, completely decoupling the vector compute component from the rest of the feature pipeline. We can quickly scale the streaming pipeline or the Superlinked server horizontally based on our needs. 
Also, this makes it easier to run the embedding models from Superlinked on a machine with a powerful GPU while keeping the streaming pipeline on a machine optimized for network I/O operations.
All the communication to Superlinked (ingesting or querying data) will be done through a REST API, automatically generated based on the schemas and queries you define in your Superlinked application.
The Bytewax streaming pipeline will perform the following operations: concurrently read messages from RabbitMQ, clean each message based on its data category, and send the cleaned document to the Superlinked server through an HTTP request.
On the Superlinked server side, we have defined an ingestion endpoint for each data category (article, post or code). Each endpoint will know how to chunk, embed and store every data point based on its category. Also, we have a query endpoint (automatically generated) for each data category that will take care of embedding the query and performing a vector semantic search operation to retrieve similar results.
The RAG feature pipeline architecture after refactoring.
Now, let's finally jump into the code
4 . Understanding the streaming flow for real time processing
The Bytewax flow is the central point of the streaming pipeline. It defines all the required steps, following the simplified pattern: _input, processing, output_.
Here is the Bytewax flow and its core steps:
flow = Dataflow('Streaming RAG feature pipeline')
stream = op.input('input', flow, RabbitMQSource())
stream = op.map('raw', stream, RawDispatcher.handle_mq_message)
stream = op.map('clean', stream, CleaningDispatcher.dispatch_cleaner)
op.output('superlinked_output', stream, SuperlinkedOutputSink(client=SuperlinkedClient()))
5 . Loading data to Superlinked
Before we explore the Superlinked application, let's review our Bytewax _SuperlinkedOutputSink_ and _SuperlinkedClient_ classes.
The purpose of the _SuperlinkedOutputSink_ class is to instantiate a new _SuperlinkedSinkPartition_ instance for each worker within the Bytewax cluster. Thus, we can optimize the system for I/O operations by scaling our output workers horizontally.
class SuperlinkedOutputSink(DynamicSink):
    def __init__(self, client: SuperlinkedClient) -> None:
        self._client = client

    def build(self, worker_index: int, worker_count: int) -> StatelessSinkPartition:
        return SuperlinkedSinkPartition(client=self._client)
The _SuperlinkedSinkPartition_ class inherits the _StatelessSinkPartition Bytewax base class_ used to create custom stateless partitions. This class takes batches of items as input and sends them to Superlinked through the _SuperlinkedClient_.
class SuperlinkedSinkPartition(StatelessSinkPartition):
    def __init__(self, client: SuperlinkedClient):
        self._client = client

    def write_batch(self, items: list[Document]) -> None:
        for item in tqdm(items, desc='Sending items to Superlinked...'):
            match item.type:
                case 'repositories':
                    self._client.ingest_repository(item)
                case 'posts':
                    self._client.ingest_post(item)
                case 'articles':
                    self._client.ingest_article(item)
                case _:
                    logger.error(f'Unknown item type {item.type}')
The _SuperlinkedClient_ is a basic wrapper that makes HTTP requests to the Superlinked server that contains all the RAG logic. We use _httpx_ to make POST requests for ingesting or searching data.
class SuperlinkedClient ...
def ingest_repository self, data RepositoryDocument None self.__ingest f self.base_url api v1 ingest repository_schema , data def ingest_post self, data PostDocument None self.__ingest f self.base_url api v1 ingest post_schema , data def ingest_article self, data ArticleDocument None self.__ingest f self.base_url api v1 ingest article_schema , data def __ingest self, url str, data T None ... def search_repository self, search_query str, platform str, author_id str, , limit int 3 list RepositoryDocument return self.__search f self.base_url api v1 search repository_query , RepositoryDocument, search_query, platform, author_id, limit limit, def search_post self, search_query str, platform str, author_id str, , limit int 3 list PostDocument ... URL f self.base_url api v1 search post_query def search_article self, search_query str, platform str, author_id str, , limit int 3 list ArticleDocument ... URL f self.base_url api v1 search article_query def __search self, url str, document_class type T , search_query str, ... list T ... The Superlinked server URLs are automatically generated as follows the ingestion URLs are generated based on the data schemas you defined e.g., repository schema, post schema, etc. the search URLs are created based on the Superlinked queries defined within the application 6 . Exploring the RAG Superlinked server As the RAG Superlinked server is a different component than the Bytewax one, the implementation sits under the server folder at _6 bonus superlinked rag server src app.py._ _Here is a step by step implementation of the Superlinked application _ Settings class Use Pydantic settings to define a global configuration class. class Settings BaseSettings EMBEDDING_MODEL_ID str sentence transformers all mpnet base v2 REDIS_HOSTNAME str redis REDIS_PORT int 6379 settings Settings Schemas Superlinked requires you to define your data structure through a set of schemas, which are very similar to data classes or Pydantic models. Superlinked will use these schemas as ORMs to save your data to a specified vector DB. It will also use them to define ingestion URLs automatically as POST HTTP methods that expect the request body to have the same signature as the schema. Simple and effective. Cool, right? schema class PostSchema id IdField platform String content String author_id String type String schema class ArticleSchema id IdField platform String link String content String author_id String type String schema class RepositorySchema id IdField platform String name String link String content String author_id String type String post PostSchema article ArticleSchema repository RepositorySchema Spaces The spaces are where you define your chunking and embedding logic. A space is scoped at the field of a schema. Thus, if you want to embed multiple attributes of a single schema, you must define multiple spaces and combine them later into a multi index. Let s take the spaces for the article category as an example articles_space_content TextSimilaritySpace text chunk article.content, chunk_size 500, chunk_overlap 50 , model settings.EMBEDDING_MODEL_ID, articles_space_plaform CategoricalSimilaritySpace category_input article.platform, categories medium , superlinked , negative_filter 5.0, Chunking is done simply by calling the _chunk _ function on a given schema field and specifying standard parameters such as _chunk_size _ and _chunk_overlap _. The embedding is done through the _TextSimilaritySpace _ and _CategoricalSimilaritySpace _ classes. 
As the name suggests, the _ TextSimilaritySpace _embeds text data using the model specified within the _ model _ parameter. It supports any HuggingFace model. We are using _ sentence transformers all mpnet base v2 ._ The _ CategoricalSimilaritySpace _ class uses an _n hot encoded vector_ with the option to apply a negative filter for unmatched categories, enhancing the distinction between matching and non matching category items. You must also specify all the available categories through the _categories_ parameter to encode them in n hot. Indexes The indexes define how a collection can be queried. They take one or multiple spaces from the same schema. Here is what the article index looks like article_index Index articles_space_content, articles_space_plaform , fields article.author_id , As you can see, the vector index combines the article s content and the posted platform. When the article collection is queried, both embeddings will be considered. Also, we index the author_id field to filter articles written by a specific author. It is nothing fancy it is just a classic filter. However, indexing the fields used in filters is often good practice. Queries We will quickly introduce what a query looks like. But in the 14th lesson, we will insist on the advanced retrieval part, hence on queries. Here is what the article query looks like article_query Query article_index, weights articles_space_content Param content_weight , articles_space_plaform Param platform_weight , , .find article .similar articles_space_content.text, Param search_query .similar articles_space_plaform.category, Param platform .filter article.author_id Param author_id .limit Param limit and here is what it does it queries the _article_index_ using a weighted multi index between the content and platform vectors e.g., 0.9 content_embedding 0.1 platform_embedding the search text used to compute query content embedding is specified through the search_query parameter and similar for the platform embedding through the platform parameter we filter the results based on the author_id take only the top results using the limit parameter. These parameters are automatically exposed on the REST API endpoint, as seen in the _SuperlinkedClient _ class. Sources The sources wrap the schemas and allow you to save that schema in the database. In reality, the source maps the schema to an ORM and automatically generates REST API endpoints to ingest data points. article_source RestSource article Executor The last step is to define the executor that wraps all the sources, indices, queries and vector DB into a single entity executor RestExecutor sources article_source, repository_source, post_source , indices article_index, repository_index, post_index , queries RestQuery RestDescriptor article_query , article_query , RestQuery RestDescriptor repository_query , repository_query , RestQuery RestDescriptor post_query , post_query , , vector_database InMemoryVectorDatabase , Now, the last step is to register the executor to the Superlinked engine SuperlinkedRegistry.register executor and that s it! Joking there is something more. We have to use a Redis database instead of the in memory one. 7 . Using Redis as a vector DB First, we have to spin up a Redis vector database that we can work with. We used Docker and attached a Redis image as a service in a _docker compose_ file along with the Superlinked poller and executor which comprise the Superlinked server version 3 services poller ... executor ... 
redis: image redis/redis-stack:latest, ports 6379:6379 and 8001:8001, volumes redis-data:/data; volumes: redis-data
Now, Superlinked makes everything easy. The last step is to define a RedisVectorDatabase connector provided by Superlinked:
vector_database = RedisVectorDatabase(settings.REDIS_HOSTNAME, settings.REDIS_PORT)
and swap it in the executor with the _InMemoryVectorDatabase_ one:
executor = RestExecutor(..., vector_database=vector_database)
Now we are done!
Conclusion
_Congratulations! You learned to write advanced RAG systems using Superlinked._
More concretely, in Lesson 11, you learned: what Superlinked is; how to design a streaming pipeline using Bytewax; how to design a RAG server using Superlinked; how to take a standard RAG feature pipeline and refactor it using Superlinked; how to split the feature pipeline into 2 services, one that reads messages from RabbitMQ in real time and one that chunks, embeds, and stores the data to a vector DB; and how to use a Redis vector DB.
Lesson 12 will teach you how to implement multi index queries to optimize the RAG retrieval layer further.
_Check out the code on GitHub and support us with a star_
Next Steps
Step 1: This is just the short version of Lesson 11 on building scalable RAG ingestion pipelines. For the full implementation, a full deep dive into the code, and more on RAG, Bytewax and Superlinked, check out the full version of Lesson 11 on our Medium publication. It's still FREE: Lesson 11 on Medium.
Step 2: Consider checking out the LLM Twin GitHub repository and try it yourself. _Nothing compares with getting your hands dirty and doing it yourself!_ LLM Twin Course GitHub
Images: If not otherwise stated, all images are created by the author.", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-ultimate-mlops-tool?r=1ttoeh" }, { "id": "0eae1447-70c8-40b2-a5c4-96f6de69f04b", "content": "The ultimate MLOps tool by Paul Iusztin. 6 steps to build your AWS infrastructure that will work for 90% of your projects. How to build a real time news search engine. Paul Iusztin Jul 13, 2024
_Decoding ML Notes_
Based on your feedback from last week's poll, we will post exclusively on Saturdays starting now.
Enjoy today's article!
This week's topics: The ultimate MLOps tool; 6 steps to build your AWS infrastructure that will work for 90% of your projects; How to build a real time news search engine.
The ultimate MLOps tool
I tested this orchestrator tool for my ML pipelines and loved it! It is the ultimate MLOps tool to glue everything together for reproducibility and continuous training.
In the past months, I have tested most of the top orchestrator tools out there: Airflow, Prefect, Argo, Kubeflow, Metaflow... You name it! But one stood out to me. I am talking about ZenML!
Why? They realized they don't have to compete with tools such as Airflow or AWS in the orchestrator and MLOps race, but to join them. Instead of being yet another orchestrator tool, they have built an abstraction layer on top of the MLOps ecosystem: experiment trackers and model registries (e.g., Weights & Biases, Comet), orchestrators (e.g., Apache Airflow, Kubeflow), container registries for your Docker images, and model deployers (Hugging Face, BentoML, Seldon). They wrote a clever wrapper that integrates the whole MLOps ecosystem!
Also, integrating it into your Python code is not intrusive. As long as your code is modular (which it should be anyway), you only have to annotate your DAG steps and the pipeline entry point with ZenML's decorators, as in the sketch below and the code snippet screenshots from the original post (ZenML Steps, ZenML Pipelines).
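A minimal sketch of what that annotation looks like, assuming a recent ZenML release where step and pipeline are importable from the top level package; the step names and logic are placeholders, not the LLM twin project's actual pipeline.

from zenml import pipeline, step

@step
def load_data() -> list[str]:
    # placeholder: a real project would pull raw documents from a data source
    return ['doc one', 'doc two']

@step
def train_model(documents: list[str]) -> int:
    # placeholder training logic: just count the documents
    return len(documents)

@pipeline
def training_pipeline():
    documents = load_data()
    train_model(documents)

if __name__ == '__main__':
    # Calling the function runs the DAG on whatever ZenML stack is currently active.
    training_pipeline()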
They also provide the concept of a stack. This allows you to configure multiple tools and infrastructure sets your pipeline can run on. For example: a local stack that uses a local orchestrator, artifact store, and compute for quick testing (so you don't have to set up other dependencies), and an AWS stack that uses the AWS SageMaker orchestrator, Comet, and Seldon (ZenML Stacks).
As I am still learning ZenML, this was just an intro post to share my excitement. I plan to integrate it into Decoding ML's LLM twin open source project and share the process with you! Meanwhile, consider checking out their starter guide: https://lnkd.in/dPzXHvjH
6 steps to build your AWS infrastructure that will work for 90% of your projects
6 steps to build your AWS infrastructure using IaC and a CI/CD pipeline that will work for 90% of your projects.
We will use the data collection pipeline from our free digital twin course as an example, but it can easily be extrapolated to most of your projects.
First, let's see what is in our toolbelt: Docker, AWS ECR, AWS Lambda, MongoDB, Pulumi, GitHub Actions.
Secondly, let's quickly understand what the data collection pipeline is doing. It automates your digital data collection from LinkedIn, Medium, Substack, and GitHub. The normalized data will be loaded into MongoDB.
Now, let's understand how the AWS infrastructure and CI/CD pipeline work:
1 . We wrap the application's entry point with a handle(event, context: LambdaContext) function. The AWS Lambda serverless computing service will default to the handle function.
2 . Build a Docker image of your application inheriting the public.ecr.aws/lambda/python:3.11 base Docker image. Now, you can quickly check your AWS Lambda function locally by making HTTP requests to your Docker container.
3 . Use Pulumi IaC to create your AWS infrastructure programmatically: an ECR repository as your Docker registry, an AWS Lambda service, a MongoDB cluster, and the VPC for the whole infrastructure.
4 . Now that we have our Docker image and infrastructure, we can build our CI/CD pipeline using GitHub Actions.
The first step is to build the Docker image inside the CI and push it to ECR when a new PR is merged into the main branch.
5 . On the CD part, we take the fresh Docker image from ECR and deploy it to AWS Lambda.
6 . Repeat the same logic with the Pulumi code: add a CD GitHub Action that updates the infrastructure whenever the IaC changes.
With this flow, you will do fine for 90% of your projects.
To summarize, the CI/CD will look like this: feature PR merged to main, build Docker image, push to ECR, deploy to AWS Lambda. (LLM Twin AWS architecture)
Want to run the code yourself? Consider checking out Lesson 2 from the FREE LLM Twin course: _The Importance of Data Pipelines in the Era of Generative AI_
How to build a real time news search engine
Decoding ML released an article and code on building a Real time News Search Engine using Kafka, Vector DBs and streaming engines. Everything in Python!
The end goal? Learn to build a production ready semantic search engine for news that is synced in real time with multiple news sources using a streaming engine, Kafka, and a vector DB.
The problem? According to a research study by earthweb.com, the daily influx of news articles, both online and offline, is between 2 and 3 million.
How would you constantly sync these data sources with your vector DB to stay in sync with the outside world?
The solution! Here is where the streaming pipeline kicks in. As soon as a new data point is available, it is ingested, processed, and loaded to a vector DB, in real time, by the streaming pipeline.
Here is what you will learn from the article: set up your own Upstash Kafka and Vector DB clusters; structure and validate your data points using Pydantic; simulate multiple Kafka clients using a ThreadPoolExecutor and KafkaProducer; stream processing using Bytewax; build a real time RAG ingestion pipeline; batch upsert embeddings and metadata to the Upstash Vector DB; build a Q&A UI using Streamlit; unit testing. Yes, we even added unit testing! (A small sketch of the validation plus simulated producers idea follows below.)
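As a flavor of the validation plus simulated clients idea from the list above, here is a minimal sketch. It uses only Pydantic and a ThreadPoolExecutor; the NewsArticle fields and the fake_producer are illustrative placeholders, while the real article wires this into Upstash Kafka and Bytewax.

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from pydantic import BaseModel, ValidationError

class NewsArticle(BaseModel):
    # hypothetical schema: validate every data point before it is produced
    headline: str
    url: str
    published_at: datetime

def fake_producer(producer_id: int, raw_items: list[dict]) -> int:
    # stand-in for a KafkaProducer: validate each item and pretend to send it
    sent = 0
    for raw in raw_items:
        try:
            article = NewsArticle(**raw)
            sent += 1  # a real producer would serialize the model and send it to a Kafka topic
        except ValidationError as err:
            print(f'producer {producer_id} dropped an invalid item: {len(err.errors())} errors')
    return sent

raw_batch = [
    {'headline': 'Markets rally', 'url': 'https://example.com/a', 'published_at': '2024-07-13T10:00:00'},
    {'headline': 'Missing url and date'},  # will fail validation
]

# simulate several clients pushing batches concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(lambda i: fake_producer(i, raw_batch), range(4)))
print(f'sent {sum(totals)} valid articles')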
Curious to level up your Python, streaming, and RAG game? Then consider checking out the article and code. Everything is free. Article: How to build a real time News Search Engine using Vector DBs. GitHub code. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-ultimate-mlops-tool?r=1ttoeh" }, { "id": "1436e3e5-eb7c-4632-a538-00fd69c01998", "content": "The new king of Infrastructure as Code (IaC). Monitoring your DL models while in production. How to build a scalable data collection pipeline. Paul Iusztin, Jun 29, 2024. _Decoding ML Notes_ This week's topics: The new king of Infrastructure as Code (IaC); How to build a scalable data collection pipeline; Monitoring your DL models while in production. The new king of Infrastructure as Code (IaC): this is the new king of Infrastructure as Code (IaC).
Here is why it is better than Terraform or CDK. I am talking about Pulumi. Let's see what it is made of. What is Pulumi and how is it different? Unlike other IaC tools that use YAML, JSON, or a Domain Specific Language (DSL), Pulumi lets you write code in languages like Python, TypeScript, Node.js, etc. This enables you to leverage existing programming knowledge and tooling for IaC tasks. Pulumi integrates with familiar testing libraries for unit and integration testing of your infrastructure code, and it integrates with most cloud providers (AWS, GCP, Azure, Oracle, etc.). Benefits of using Pulumi: Flexibility, because you use your preferred programming language for IaC and it works for most clouds out there; Efficiency, because you leverage existing programming skills and tooling; Testability, because you can write unit and integration tests for your infrastructure code; Collaboration, because it enables Dev and Ops to work together using the same language. If you disagree, try to apply OOP or logic (if and for statements) to Terraform's HCL syntax. It works, but it quickly becomes a living hell. How Pulumi works: Pulumi uses a declarative approach. You define the desired state of your infrastructure, and Pulumi manages it using a state file. When changes are made to the code, Pulumi compares the desired state with the current state and creates a plan to achieve the desired state. The plan shows what resources will be created, updated, or deleted, and you can review and confirm it before Pulumi executes it. It works similarly to Terraform, but with all the benefits your favorite programming language and existing tooling provide. It also works similarly to CDK, but faster and for your favorite cloud infrastructure, not only AWS. Pulumi code example (a minimal sketch follows below). What do you think? Have you used Pulumi? We started using it for the LLM Twin course, and so far, we love it! I will probably wholly migrate from Terraform to Pulumi in future projects.
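To make the idea concrete, here is a minimal, hypothetical Pulumi sketch in Python that provisions an ECR repository and exports its URL; the resource name is illustrative and this is not the course's actual infrastructure code.

import pulumi
import pulumi_aws as aws

# Hypothetical example: provision an ECR repository for the crawler image.
repo = aws.ecr.Repository("crawler-images")

# Export the repository URL so the CI pipeline knows where to push images.
pulumi.export("repository_url", repo.repository_url)

Running "pulumi up" would then show the plan (create, update, delete) for this resource and ask for confirmation, exactly as described above.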
More on Pulumi. How to build a scalable data collection pipeline: build and deploy to AWS, with IaC and CI/CD, a data collection pipeline that crawls your digital data. What do you need? The end goal: a scalable data pipeline that crawls, collects, and stores all your digital data from LinkedIn, Medium, Substack, and GitHub. To build it, here is what you need: 1. Selenium, a Python tool for automating web browsers. It is used here to interact with web pages programmatically (logging into LinkedIn, navigating through profiles, etc.). 2. BeautifulSoup, a Python library for parsing HTML and XML documents. It creates parse trees that help us extract the data quickly. 3. MongoDB (or any other NoSQL DB), because a NoSQL database fits like a glove on our unstructured text data. 4. An ODM, a technique that maps between an object model in an application and a document database.
5. Docker and AWS ECR: to deploy our code, we have to containerize it, build an image for every change of the main branch, and push it to AWS ECR. 6. AWS Lambda: we will deploy our Docker image to AWS Lambda, a serverless computing service that lets you run code without provisioning or managing servers. It executes your code only when needed and scales automatically, from a few daily requests to thousands per second. 7. Pulumi: the IaC tool used to programmatically create the AWS infrastructure (MongoDB instance, ECR, Lambdas, and the VPC). 8. GitHub Actions: used to build our CI/CD pipeline; on any PR merged to the main branch, it builds and pushes a new Docker image and deploys it to the AWS Lambda service. ETL architecture to collect digital data from social media platforms. Curious how these tools work together? Then check out Lesson 2 from the FREE LLM Twin Course created by Decoding ML, where we walk you step by step through the architecture and code of the data pipeline: The Importance of Data Pipelines in the Era of Generative AI. Monitoring your DL models while in production: monitoring is THE key MLOps element in ensuring your models in production are fail safe.
Here is an article on ML monitoring using Triton, Prometheus, and Grafana. Razvant Alexandru wrote a fantastic step by step article in the Decoding ML Newsletter on monitoring your DL models while in production. In his article, he starts with an example where, in one of his projects, a main processing task was supposed to take 5 hours, but in production it jumped to 8 hours. This, or something similar, will happen to all of us, even the greatest. It is impossible to always anticipate everything that will happen in production (sometimes it is a waste of time even to try). That is why you always need eyes and ears on your production ML system. Otherwise, imagine how many users he would have lost if he had not detected the 3 to 4 hour loss in performance as fast as possible. Afterward, he explains step by step how to use cAdvisor to scrape RAM and CPU usage per container, Triton Inference Server to serve ML models and yield GPU specific metrics, Prometheus to bind between the metrics generators and the consumer, and Grafana to visualize the metrics (a short metrics-exposure sketch follows below). Check it out on Decoding ML: How to ensure your models are fail safe in production?
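As a complement to that stack, here is a minimal, hypothetical sketch (not from the referenced article) of exposing a custom application metric with the prometheus_client Python library so a Prometheus server can scrape it; the metric name and values are made up.

from prometheus_client import Gauge, start_http_server
import random
import time

# Hypothetical gauge tracking how long the main processing task takes.
task_duration_hours = Gauge(
    "main_task_duration_hours",
    "Duration of the main processing task in hours",
)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        # In a real system this value would be measured, not simulated.
        task_duration_hours.set(5 + random.random() * 3)
        time.sleep(30)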
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-new-king-of-infrastructure-as?r=1ttoeh" }, { "id": "fd48444e-ab32-49b9-afdc-14fe8ecafd41", "content": "Data Ingestion Architecture for ML and Marketing Intelligence: building a highly scalable data collection pipeline for AI, ML and marketing intelligence leveraging the AWS cloud, Python, data crawling, and Docker. Highly Scalable Data Ingestion Architecture for ML and Marketing Intelligence: leveraging the AWS ecosystem and data crawling for scalable and adaptive data pipelines. Rares Istoc, Jun 27, 2024. Today's article is written by our guest, Rares Istoc, a veteran with over 7 years of experience building scalable software and data engineering systems in the industry. Here is his LinkedIn. Machine learning without data is like a chef without ingredients: all the skills but nothing to cook. These days, everything circulates around data, from personalized ads to streaming recommendations. Data drives decisions in business, healthcare, and sports. Without it, apps would be clueless, smart devices would be dumb, and predictions would be nothing more than guesses. In this digital age, data is the lifeblood of innovation and efficiency. Ok, but why another article about data ingestion? There are many ways to build data ingestion pipelines, and with all the new tools created over the last decade, selecting the best ones can be challenging. The answer often depends on your project's specific needs. In this article, you'll explore an end to end solution for marketing intelligence. Using AWS's ecosystem, you can create a scalable data ingestion pipeline for data crawling and integrate it into various analytical processes like sales, competitor analysis, market analysis, and customer insights. I'll also present the challenges encountered while building this solution. Finding a complete working solution is tough, with most answers scattered across the Internet. You can access the full solution code on GitHub. _IMPORTANT NOTE: Before diving into this solution, you must be aware of the legal implications of ingesting data from some data sources, like social media pages, so we can make sure nobody goes to jail. Please read the terms and conditions of each major platform; these will restrict you from crawling user profiles and private pages._ Table of Contents: 1. Architecture Overview 2. Implementation 3. Challenges and Pitfalls 4. Local Testing 5. Deployment. 1. Architecture Overview: this is what we are about to build. Here are some non functional requirements I've aimed to achieve with this architecture. Scalability: the solution can process many pages simultaneously and easily add more, handling growth at any time. Maintainability and Adaptability: each component is designed for easy modification and expansion without significant development time.
Components Overview. Scheduler: triggers crawler lambdas for each page link. Crawler: extracts various posts and information from the page link. If unfamiliar with crawling, look it up before proceeding; details will follow in the implementation part. Database: MongoDB is used for our data lake storage, housing posts for later use. It excels at handling semi structured data. The complete flow: the scheduler triggers a crawler lambda for each page, sending the page name and link. The crawler extracts posts from the past week, storing the raw content, creation date, link, and name. The scheduler waits for all lambdas to finish, aggregates the posts from the database, and sends them to ChatGPT using prompt templates to generate reports. 2. Implementation: in this section, I'll provide a detailed overview of the main components, breaking them down with code samples and explanations. 2.1. Scheduler: I'll not focus much on the reporting part, though you can find it along with all the code shared in this article. The main focus is the scheduling part, the entry point of the system where the flow starts and is orchestrated:

import json
import time
from datetime import datetime, timedelta

import boto3
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.typing import LambdaContext

from src.constants import PAGE_LINK
from src.db import database
from src.utils import monitor

logger = Logger(service="decodingml/scheduler")
_client = boto3.client("lambda")


def lambda_handler(event, context: LambdaContext):
    correlation_ids = []

    # Scatter: invoke one crawler lambda per page link, asynchronously.
    for link in PAGE_LINK:
        response = _client.invoke(
            FunctionName="lambda",
            InvocationType="Event",
            Payload=json.dumps({"link": link}),
        )
        logger.info(f"Triggered crawler for {link}")
        correlation_ids.append(response["ResponseMetadata"]["RequestId"])

    logger.info(f"Monitoring {len(correlation_ids)} crawler processes")

    # Wait until every crawler lambda has finished.
    while True:
        time.sleep(15)
        completed = monitor(correlation_ids)
        correlation_ids = [c for c in correlation_ids if c not in completed]
        if not correlation_ids:
            break
        logger.info(f"Still waiting for {len(correlation_ids)} crawlers to complete")

    # Gather: collect all posts crawled in the last week.
    now = datetime.now()
    posts = list(
        database.profiles.find({"date": {"$gte": now - timedelta(days=7), "$lte": now}})
    )
    logger.info(f"Gathered {len(posts)} posts")

    if not posts:
        logger.info("Cannot generate report, no new posts available")
        return

    reports = generate_profiles_report(posts)
    logger.info("Generated new report!")

The scheduler acts as a scatterer, iterating over a list of page links and invoking a crawler asynchronously with the InvocationType parameter set to Event, ensuring the scheduler won't block on a single page. It stores each lambda's correlation ID in a list and waits for all lambdas to finish, with a 15 second wait time, adjustable based on your crawler's average completion time. Finally, it finds all crawled posts and sends them to the report generation phase. 2.2. Crawler: here I'll break down the actual crawling process.

import abc
import os
from datetime import datetime, timedelta
from itertools import takewhile, dropwhile
from typing import Any, Dict, List
from urllib.parse import urlparse

import instaloader


class BaseAbstractCrawler(abc.ABC):
    @abc.abstractmethod
    def extract(self, link: str, **kwargs) -> None:
        ...
class InstagramCrawler(BaseAbstractCrawler):
    def __init__(self, link: str, proxy=None):
        self.link = link
        self.loader = instaloader.Instaloader()
        self._until = datetime.now()
        self._since = self._until - timedelta(days=7)
        self._proxy = proxy

    def extract(self, **kwargs) -> List[Dict[str, Any]]:
        parsed_url = urlparse(self.link)

        if self._proxy:
            os.environ["https_proxy"] = self._proxy.__dict__().get("http")

        profile = instaloader.Profile.from_username(
            self.loader.context, parsed_url.path.strip("/").split("/")[0]
        )
        # Keep only posts published within the last week.
        posts = takewhile(
            lambda p: p.date > self._since,
            dropwhile(lambda p: p.date > self._until, profile.get_posts()),
        )
        return [
            {"content": post.caption, "date": post.date, "link": self.link}
            for post in posts
        ]

I've defined a main abstraction point for all crawlers, establishing a common interface that all derived crawlers must implement. Each subclass must provide its own implementation of the extract method, ensuring reusability and uniformity.

import re

from src.crawlers.base import BaseAbstractCrawler
from src.crawlers.instagram import InstagramCrawler


class CrawlerDispatcher:
    def __init__(self) -> None:
        self._crawlers = {}

    def register(self, domain: str, crawler: type) -> None:
        # Map a URL pattern for the domain to its crawler class.
        self._crawlers[r"https://(www\.)?{}\.com/*".format(re.escape(domain))] = crawler

    def get_crawler(self, url: str) -> BaseAbstractCrawler:
        for pattern, crawler in self._crawlers.items():
            if re.match(pattern, url):
                return crawler(url)  # instantiate the matching crawler for this link
        else:
            raise ValueError("No crawler found for the provided link")


dispatcher = CrawlerDispatcher()
dispatcher.register("instagram", InstagramCrawler)

To promote and call each crawler automatically, I've built a dispatcher that selects and instantiates the correct crawler class based on the provided link. This acts as a registry and factory for the crawlers, managed under a unified interface and structure. Advantages: flexibility and scalability, because it allows easy addition of new domains and specialized crawlers without modifying the existing codebase; encapsulation and modularity, because the dispatcher encapsulates the logic for determining which crawler to use, making the system modular and allowing each crawler to focus on its core business logic.

from datetime import datetime, timedelta

from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.typing import LambdaContext

from src.crawlers import dispatcher
from src.db import database

logger = Logger(service="decodingml/crawler")


def lambda_handler(event, context: LambdaContext):
    link = event.get("link")
    logger.info(f"Start extracting posts for {link}")

    crawler = dispatcher.get_crawler(link)
    posts = [
        {**page, "correlation_id": context.aws_request_id} for page in crawler.extract()
    ]

    # Filter out posts that were already stored in the last week.
    now = datetime.now()
    existing_posts = database.profiles.find(
        {"date": {"$gte": now - timedelta(days=7), "$lte": now}, "name": link},
        projection={"date": 1},
    )
    existing_dates = [post.get("date") for post in list(existing_posts)]
    posts = [post for post in posts if post.get("date") not in existing_dates]

    if not posts:
        logger.info("No new posts on page")
        return

    logger.info(f"Successfully extracted {len(posts)} posts")
    database.profiles.insert_many(posts)
    logger.info("Successfully inserted data in db")

The main entry point assembles the link from the event body, selects the correct crawler, and starts the extraction job. After extraction, it checks for existing posts to avoid duplicates and adds only new posts to the database. A small sketch of extending the dispatcher with another crawler follows below.
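To show how the registry pattern extends, here is a small, hypothetical sketch of registering a second crawler; the LinkedInCrawler class and the URL are illustrative and not part of the article's code.

# Hypothetical: register another crawler and let the dispatcher pick it by URL.
class LinkedInCrawler(BaseAbstractCrawler):
    def __init__(self, link: str, proxy=None):
        self.link = link

    def extract(self, **kwargs):
        return []  # real scraping logic would go here


dispatcher.register("linkedin", LinkedInCrawler)
crawler = dispatcher.get_crawler("https://www.linkedin.com/company/decodingml")
posts = crawler.extract()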
3. Challenges and Pitfalls. 3.1. Running a headless browser instance with Selenium in the Lambda runtime environment: this caused the most headaches. The Lambda execution environment is read only, so writing to disk requires using a temporary directory, which complicates automatic binary driver installation. Therefore, you need to install the driver directly in the Docker image and reference it manually in Selenium's driver options. The only usable driver for this setup, in my case, was the Google binary driver.

FROM public.ecr.aws/lambda/python:3.11 as build

# Download the Chrome driver and browser and manually unpack them into their folders.
RUN yum install -y unzip && \
    curl -Lo /tmp/chromedriver-linux64.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chromedriver-linux64.zip && \
    curl -Lo /tmp/chrome-linux64.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/linux64/chrome-linux64.zip && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/

FROM public.ecr.aws/lambda/python:3.11

# Install the function's OS dependencies using yum.
RUN yum install -y atk cups-libs gtk3 libXcomposite alsa-lib libXcursor libXdamage libXext libXi \
    libXrandr libXScrnSaver libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb xorg-x11-xauth \
    dbus-glib dbus-glib-devel nss mesa-libgbm ffmpeg libxext6 libssl-dev libcurl4-openssl-dev libpq-dev

COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/
COPY pyproject.toml poetry.lock ./

# Install Poetry, export dependencies to requirements.txt, install them into the Lambda
# task directory, and finally clean up the manifest files.
RUN python3 -m pip install --upgrade pip && pip install poetry && \
    poetry export -f requirements.txt > requirements.txt && \
    pip3 install --no-cache-dir -r requirements.txt --target "${LAMBDA_TASK_ROOT}" && \
    rm requirements.txt pyproject.toml poetry.lock

# Copy function code.
COPY ./src ${LAMBDA_TASK_ROOT}/src

The main idea in this Dockerfile is that I manually downloaded the Chrome driver and browser and unpacked them in a location where Selenium can access them, something it would usually do automatically. This is a mandatory step for the Lambda environment. Since everything is read only, the next code sample shows how to point Selenium to the correct driver and browser locations:

from tempfile import mkdtemp

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def init_driver(self):
    options = Options()
    # Set up the driver binary location manually.
    options.binary_location = "/opt/chrome/chrome"
    # Run the browser in headless mode.
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--single-process")
    options.add_argument("--window-size=1420,1080")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-popup-blocking")
    options.add_argument("--disable-notifications")
    options.add_argument("--disable-dev-tools")
    options.add_argument("--log-level=3")
    options.add_argument("--ignore-certificate-errors")
    options.add_argument("--no-zygote")
    options.add_argument(f"--user-data-dir={mkdtemp()}")
    options.add_argument(f"--data-path={mkdtemp()}")
    options.add_argument(f"--disk-cache-dir={mkdtemp()}")
    options.add_argument("--remote-debugging-port=9222")

    self._driver = webdriver.Chrome(
        service=Service("/opt/chromedriver"),
        options=options,
    )

I hardcoded the driver and browser locations in the Dockerfile. Additionally, I pointed several folders (e.g., user-data-dir, disk-cache-dir) to temporary directories to prevent Selenium from creating them automatically, which would cause errors due to Lambda's disk limitations. 3.2. Aggregate Empty Pages: my initial monitoring algorithm was basic, looping over lambda invocation correlation IDs and checking the database for generated posts. However, it ran into an infinite loop when no new posts were created for some pages.
import datetime
import re
from typing import List

import boto3

_client = boto3.client("logs")


def monitor(correlation_ids: List[str]):
    finished = []

    # Look at crawler log events starting from one day ago (milliseconds timestamp).
    now = int((datetime.datetime.now() - datetime.timedelta(days=1)).timestamp() * 1000)

    response = _client.filter_log_events(
        logGroupName="/aws/lambda/crawler",
        startTime=now,
        filterPattern="REPORT RequestId",
    )

    for event in response["events"]:
        match = re.search(r"REPORT RequestId: (\S+)", event.get("message"))
        if match:
            correlation_id = match.group(1)
            if correlation_id in correlation_ids:
                finished.append(correlation_id)

    return finished

Here, I search through the log streams each lambda generated that day and look for the message that usually has the format _REPORT RequestId: correlation_id_. This indicates that the lambda has reached the end of its execution, so I can mark which correlation IDs have finished. 3.3. Avoid being blocked by social media platforms: this was a pity error, the kind you could spend days on, and the solution was to look at it from a different perspective. Popular social media platforms implement many anti-bot protection mechanisms to prevent crawling, from request header analysis to rate limiting to IP blocking. And because we run our browser in headless mode to mimic realistic user-browser interaction, and all our crawlers send requests under the same IP address to multiple pages at the same time, repeatedly, this screams "please block me." To address this, I used a proxy to mask my IP address and location:

import os


class ProxyConnection:
    def __init__(
        self,
        host: str = None,
        port: str = None,
        username: str = None,
        password: str = None,
        verify_ssl: bool = False,
    ):
        self.host = host or os.getenv("PROXY_HOST")
        self.port = port or os.getenv("PROXY_PORT")
        self.username = username or os.getenv("PROXY_USERNAME")
        self.password = password or os.getenv("PROXY_PASSWORD")
        self.verify_ssl = verify_ssl
        self._url = f"{self.username}:{self.password}@{self.host}:{self.port}"

    def __dict__(self):
        return {
            "https": "https://{}".format(self._url.replace(" ", "")),
            "http": "http://{}".format(self._url.replace(" ", "")),
            "no_proxy": "localhost, 127.0.0.1",
            "verify_ssl": self.verify_ssl,
        }

Paid proxies like SmartProxy offer a pool of rotating IPs, assigning a different IP to each crawler and mimicking regular user behavior. Additionally, using a proxy lets you pick a country without access restrictions to public pages, ensuring smooth crawling. 4. Local Testing: to prove this works, I wrote a makefile containing some simple commands for the crawler and the scheduler lambda. The catch is that I only managed to test the crawler locally; since the scheduler spins up crawlers, they must already be deployed on AWS.

local-test-crawler: # Send a test command locally to test the crawler lambda
	curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" \
		-d '{"link": "https://www.instagram.com/mcdonalds"}'

local-test-scheduler: # Send a test command locally to test the scheduler lambda
	curl -X POST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

Now, most people, when testing lambda functions in a local environment, use AWS Lambda RIE (Runtime Interface Emulator), which allows you to test your lambda function packages in a container; basically, it emulates a lambda execution environment on your local machine. As you can see, I managed to do this without the emulator, which slightly simplified my environment. You can use these commands to test each component.
For example, if you would like to test the crawler, go into your terminal and run: make local-test-crawler. As you can see, the crawling process starts, and for this page we found three new posts in the last seven days. 5. Deployment: the deployment process is defined in our GitHub repository under the ops folder, where you can explore the whole solution written in Pulumi. You can play with the Makefile; it contains all the necessary commands to get your infrastructure up and running. Conclusion: in this article, we explored a complete, robust, end to end solution for building a highly scalable data ingestion pipeline that can leverage existing data from multiple crawlable sources for various processes like ML training, data analysis, etc. We also went through the specific challenges you might face in this process and how to overcome them. Check out the code on GitHub and support us with a star. Within our newsletter, we keep things short and sweet. If you enjoyed reading this article, consider checking out the full version on Medium. It's still free: Full article on Medium. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/highly-scalable-data-ingestion-architecture?r=1ttoeh" }, { "id": "9c6f5239-fc76-4fe9-a8e2-77f662d0c69f", "content": "2 Key LLMOps Concepts by Alex Razvant: how to monitor LLM RAG applications, evaluate your RAG like a pro, and learn about memory and compute requirements of LLMs. Alex Razvant, Jun 22, 2024. _Decoding ML Notes_ This week's topics: a powerful framework to evaluate RAG pipelines; why do LLMs require so much VRAM?; LLMOps chain monitoring. One framework to evaluate your RAG: RAGAs. Building a RAG pipeline is fairly simple. You just need a Vector DB knowledge base, an LLM to process your prompts, plus additional logic for the interactions between these modules (Lesson 10: Evaluating the RAG pipeline). However, reaching a satisfying performance level imposes its challenges due to the separate components:
1. The Retriever, which takes care of querying the knowledge DB and retrieves additional context that matches the user's query. 2. The Generator, which encompasses the LLM module, generating an answer based on the context augmented prompt. When evaluating a RAG pipeline, we must evaluate both components separately and together. What is RAGAs? A framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. One of the core concepts of RAGAs is Metric Driven Development (MDD), a product development approach that relies on data to make well informed decisions. What metrics does RAGAs expose? For the Retrieval stage: Context Precision evaluates the precision of the context used to generate an answer, ensuring relevant information is selected from the context; Context Relevancy measures how relevant the selected context is to the question; Context Recall measures whether all the relevant information required to answer the question was retrieved; Context Entities Recall evaluates the recall of entities within the context, ensuring that no important entities are overlooked. For the Generation stage: Faithfulness measures how accurately the generated answer reflects the source content, ensuring the generated content is truthful and reliable; Answer Relevance validates that the response directly addresses the user's query; Answer Semantic Similarity shows whether the generated content is semantically aligned with expected responses; Answer Correctness focuses on fact checking, assessing the factual accuracy of the generated answer. How to evaluate using RAGAs? 1. Prepare your questions, answers, contexts and ground_truths. 2. Compose a Dataset object. 3. Select metrics. 4. Evaluate. 5. Monitor the scores or log the entire evaluation chain to a platform like CometML. A minimal sketch of these steps follows below.
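As a rough illustration of those five steps, here is a minimal, hypothetical sketch using the ragas library; the sample data is made up, and exact column names and metric imports can differ between ragas versions.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# 1. Prepare questions, answers, contexts and ground truths (toy data).
samples = {
    "question": ["What does the feature pipeline do?"],
    "answer": ["It ingests, cleans, chunks and embeds documents into a vector DB."],
    "contexts": [["The feature pipeline ingests raw documents and stores embeddings in a vector DB."]],
    "ground_truth": ["It processes raw documents and loads embeddings into the vector DB."],
}

# 2. Compose a Dataset object.
dataset = Dataset.from_dict(samples)

# 3. Select metrics and 4. evaluate.
scores = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

# 5. Monitor the scores or log them to an experiment tracker.
print(scores)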
For a full end to end workflow of RAGAs evaluation in practice, see the LLM Twin Course article: How to Evaluate RAGs (Medium article). Why are LLMs so memory hungry? LLMs require lots of GPU memory; let's see why that is the case. What is an LLM parameter? LLMs, like Mistral 7B or Llama3 8B, have billions of parameters, and each parameter is a weight that is stored and accessed during computation. How much GPU VRAM is required? There are three popular precision formats that LLMs are trained in: FP32 (32-bit floating point) and FP16/BF16 (16-bit floating point). Most use mixed precision, e.g., matmul in BF16 and accumulations in FP32. For this example, we'll use half precision (BF16). Here's a deeper dive on this topic: Google BFloat16 LLMs Precision Benchmark. Let's calculate the VRAM required:

\begin{align}
\text{VRAM} &= \text{Size}_{\text{params}} + \text{Size}_{\text{activations}} \\
\text{Size}_{\text{params}} &= \text{Params} \times \text{Precision (bytes)}
\end{align}

As 1 byte = 8 bits, we get: FP32 = 32 bits = 4 bytes; FP16/BF16 = 16 bits = 2 bytes. Now, for a 7B model, we would require VRAM = 7 x 10^9 (billion) x 2 bytes = 14 x 10^9 bytes. Knowing that 1 GB = 10^9 bytes, we get 14 GB as the required VRAM to load a 7B model for inference in half (BF16) precision. This is purely for loading the parameters. A quick worked computation follows below.
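To make the arithmetic concrete, here is a tiny, hypothetical Python helper that reproduces the back of the envelope estimate above for a few precisions; it is an estimate for the weights only, not a measured value.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def param_vram_gb(n_params: float, precision: str) -> float:
    """Memory needed just to load the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


print(param_vram_gb(7e9, "bf16"))  # 14.0 -> ~14 GB for a 7B model in BF16
print(param_vram_gb(7e9, "fp32"))  # 28.0 -> ~28 GB for the same model in full FP32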
Ever encountered the CUDA OOM error (e.g., "Tried to allocate 56MB ...") when inferencing? Here's the most plausible cause: no GPU VRAM left for the activations. Let's figure out the activation size required, using Llama2 7B as an example. Activations are a combination of the following model parameters: Context Length (N), Hidden Size (H), and Precision (P). After a quick look at the Llama2 7B model configuration, we get these values: Context Length N = 4096 tokens, Hidden Size H = 4096 dims, Precision P = BF16 = 2 bytes (Llama2 7B model params: shorturl.at/CWOJ9). Consult this interactive LLM VRAM calculator to check the different memory segments reserved when inferencing or training LLMs: Inference/Training VRAM Calculator. For training, things are a little different, as more factors come into play; memory is also allocated for full activations (considering the number of heads and layers), optimizer states (which differ based on the optimizer type), and gradients. Here's a tutorial on PEFT and QLoRA fine tuning in action: LLM Fine Tuning Medium Article. Other resources: Model Anatomy shorturl.at/nJeu0, VRAM for Serving shorturl.at/9UPBE, LLM VRAM Explorer shorturl.at/yAcTU. One key LLMOps concept: chain monitoring. In traditional ML systems it is easier to backtrack to a problem than in generative AI systems based on LLMs. When working with LLMs, their generative nature can lead to complex and sometimes unpredictable behavior. A solution for that? Log prompts or entire chains with representative metadata when testing and evaluating your LLM. One platform that I like and have been using for this task is CometML LLM. Here are a few cases where it proves beneficial. For summarisation tasks: you might have a query that represents the larger text and the LLM's response, which is the summary; you could calculate the ROUGE score inline between query and response and add it to the metadata field, then compose a JSON with query, response, and rouge_score and log it to Comet.
For Q&A tasks: here, you could log the Q&A pairs separately, or even add an evaluation step using a larger model to evaluate the response. Each pair would be composed of Q, A, GT, and True/False to mark the evaluation. For generation tasks: you could log the query and response, and append a few qualitative metrics in the metadata (e.g., relevance, cohesiveness). For RAG: if you have complex chains within your RAG application, you could log prompt structures (sys_prompt, query) and LLM responses and track the chain execution step by step. For NER: you could define the entity fields and log the query, response, entities_list, and extracted_entities in the same prompt payload. For vision transformers: CometML LLM also allows you to log images associated with a prompt or a chain. If you're working with GPT-4 Vision, for example, you could log the query and the generated image in the same payload. Also, besides the actual prompt payload, you can inspect the processing time of each step of a chain. For example, a three step chain in a RAG application might query the vector DB, compose the prompt, and pass it to the LLM; when logging the chain to CometML, you can see the processing time per chain step. To set it up, you'll need the CometML pip package, a CometML API key, and a workspace name and project name (a minimal logging sketch follows below). I've used this approach when evaluating a fine tuned LLM on a custom instruction dataset. For a detailed walkthrough: Evaluating LLMs Medium Article.
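For illustration, here is a minimal, hypothetical sketch of logging a single prompt payload with the comet_llm Python package; the metadata fields and strings are made up, and the exact API surface may differ across comet_llm versions.

import comet_llm

# Assumes COMET_API_KEY (and workspace/project settings) are configured in the environment.
# Hypothetical example: log one summarisation query/response pair with its ROUGE score.
comet_llm.log_prompt(
    prompt="Summarize: <the larger input text>",
    output="<the generated summary>",
    metadata={"task": "summarisation", "rouge_score": 0.42},
)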
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/2-key-llmops-concepts?r=1ttoeh" }, { "id": "87f34471-9a5b-4641-8272-15b6a18a9be7", "content": "The LLM Twin Free Course on Production Ready RAG applications: learn how to build a full end to end, production ready LLM RAG system, and follow and code along each component by yourself. Alex Razvant, Jun 20, 2024. The last lesson of the LLM Twin free course. What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM. Image by DALL-E. Why is this course different? By finishing the _LLM Twin: Building Your Production Ready AI Replica_ free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices. Why should you care? No more isolated scripts or notebooks! Learn production ML by building and deploying an end to end, production grade LLM system. More details on what you will learn within the LLM Twin course here. The LLM Twin Free Course: this course teaches you how to design, build, and deploy a production ready LLM RAG system. It covers all the components: system design, data ingestion, the streaming pipeline, the fine tuning pipeline, and the inference pipeline, alongside production monitoring and more. What is the course about? We're building a production ready RAG system, able to write content based on your unique style, by scraping previous posts, articles, and code snippets written by you to construct a fresh and continuously updated knowledge base, generating a dataset to fine tune a capable and efficient open source LLM, and then interconnecting all components for a full end to end deployment while integrating evaluation and post deployment monitoring. This course follows MLOps and LLMOps best practices, focusing on the 3 pipeline design pattern for building ML centered applications. Lesson 1: Presenting the Architecture. Presenting and describing each component, the tooling used, and the intended workflow of implementation. The first lesson prepares the ground by offering a wide overview of each component and consideration. We recommend you start here. Lesson 1: An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin (LLM twin system architecture). Lesson 2: Data Pipelines. In this lesson, we start by explaining what a data pipeline is and the key concepts of data processing and streaming, and then dive into the data scraping and processing logic. Lesson 2: The Importance of Data Pipelines in the Era of Generative AI (The Data Collection Pipeline). Lesson 3: Change Data Capture and Data Processing. In this lesson, we showcase the CDC (Change Data Capture) integration within the LLM Twin data pipeline. We show how to set up MongoDB, the CDC approach for event driven processing, RabbitMQ for message queuing, and efficient low latency database querying using the MongoDB Oplog. Lesson 3: CDC Enabling Event Driven Architectures (Event Driven Processing using RabbitMQ, CDC, and MongoDB). Lesson 4: Efficient Data Streaming Pipelines. In this lesson, we focus on the feature pipeline.
Here, we showcase how we ingest the data gathered in the previous lesson, and how we built a stream processing workflow with Bytewax that fetches raw samples, structures them using Pydantic models, cleans, chunks, encodes, and stores them in our Qdrant vector database. Lesson 4: SOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! (Efficient Data Streaming Pipelines using Bytewax and the Qdrant Vector DB). Lesson 5: Advanced RAG Optimization Techniques. In this lesson, we showcase a few advanced techniques to increase the similarity and accuracy of the embedded data samples from our Qdrant vector database. The contents of this lesson can make the difference between a naive RAG application and a production ready one. Lesson 5: The 4 Advanced RAG Algorithms You Must Know to Implement (Advanced RAG Optimization Techniques). Lesson 6: Dataset preparation for LLM fine tuning. In this lesson, we discuss the core concepts to consider when creating task specific custom datasets to fine tune LLMs. We use the cleaned data from our vector database and engineer specific prompt templates, alongside the GPT-3.5 Turbo API, to generate our custom dataset and version it on Comet ML. Lesson 6: The Role of Feature Stores in Fine Tuning LLMs (Generate custom datasets using Knowledge Distillation). Lesson 7: Fine tuning LLMs on custom datasets. We show how to implement a fine tuning workflow for a Mistral 7B Instruct model using the custom dataset we versioned previously. We present in depth the key concepts, including LoRA adapters, PEFT, quantization, and how to deploy on Qwak. Lesson 7: How to fine tune LLMs on custom datasets at scale using Qwak and CometML (Fine tuning LLMs on custom datasets using Qwak and CometML). Lesson 8: Evaluating the fine tuned LLM. In this lesson, we discuss one core ML concept: evaluation. We present the evaluation workflow and showcase the full process of assessing the model's performance using the GPT-3.5 Turbo model and custom engineered evaluation templates. Lesson 8: Best Practices When Evaluating Fine Tuned LLMs (Evaluating the quality of our custom fine tuned LLM). Lesson 9: Deploying the Inference Pipeline Stack. In this lesson, we showcase how to design and implement the LLM RAG inference pipeline based on a set of detached Python microservices. We split the ML and business logic into two components, describe each one, and show how to wrap up and deploy the inference pipeline on Qwak as a scalable and reproducible system. Lesson 9: Architect scalable and cost effective LLM RAG inference pipelines (Architecting the LLM RAG inference pipeline). Lesson 10: RAG Pipeline Evaluation. In this lesson, we cover RAG evaluation, which is of great importance: if no proper evaluation metrics are monitored or techniques are used, the RAG system might underperform and hallucinate badly. Here, we describe the workflow of evaluating RAG pipelines using the powerful RAGAs framework, compose the expected RAGAs evaluation format, and capture evaluation scores, which are included in full LLM execution chains and logged on Comet ML LLM. Lesson 10: Evaluating RAG Systems using the RAGAs Framework (Evaluating the RAG pipeline). Next Steps. Step 1: Check out the full versions of all Lessons 1-11 on our Medium publication, under the LLM Twin Course group tag.
_It s still FREE _ The LLM Twin Course Step 2 Check out the LLM Twin GitHub repository and try it yourself _Nothing compares with getting your hands dirty and building it yourself!_ LLM Twin Course GitHub Images If not otherwise stated, all images are created by the author. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-llm-twin-free-course-on-production?r=1ttoeh" }, { "id": "d3cb26a9-45fe-42e0-9a79-7a2f358fc875", "content": "A blueprint for designing production LLM systems From Notebooks to production . How to get a GitHub Copilot subscription for FREE to 5x writing code. Learn to build production ML systems by building an LLM application. Paul Iusztin Jun 15, 2024 _Decoding ML Notes_ This week s topics How to get a GitHub Copilot subscription for FREE to 5x writing code A blueprint for designing production LLM systems From Notebooks to production Learn to build production ML systems by building an LLM application How to get a GitHub Copilot subscription for FREE to 5x writing code There are other alternatives, but GitHub Copilot is still the leading solution due to 2 factors performance and convenience. If you can get it for free, there are 0 reasons not to use it sneaky move, Microsoft . So what is the solution? There is no secret. As stated in their docs Verified students, teachers, and maintainers of popular open source projects on GitHub are eligible to use Copilot Individual for free. Docs To become a student or teacher when you are not is not a solution. But... To become a maintainer of a popular open source project is!
So what are the criteria for becoming a maintainer of a popular open source project? I don t know the exact formula, but here are some examples. I am eligible because I am the owner of a GitHub repository with 2.2k stars and 350 forks the Hands on LLMs Course . After digging into some Reddit threads, one person said that a repo with 520 stars and 299 forks got them the free subscription. The idea is that you don t have to be a maintainer of Pandas or PyTorch to become eligible. The conclusion is to... start contributing to open source or creating your own cool project, which will complete the job! _If you know the secret formula criteria better, please leave it in the comments for others to know._ Also, let me know if you know how much you must contribute to open source before you become eligible. A blueprint for designing production LLM systems From Notebooks to production I am quitting creating content...
Joking, but here is how to build your LLM twin for generating posts or articles using your voice. What is an LLM twin? It s an AI character who writes like you, using your writing style and personality. Why not directly use ChatGPT? You may ask... When generating content using an LLM, the results tend to be very generic and unarticulated, contain misinformation due to hallucination , and require tedious prompting to achieve the desired result. That is why, for generating content, you need a specialized tool that is fine tuned on your digital content to replicate your persona and has access to a vector DB with relevant data to avoid hallucinating and write only about concrete facts. Here are the main steps required to build your production ready LLM twin 1 . A data collection pipeline will gather your digital data from Medium, Substack, LinkedIn and GitHub. It will be normalized and saved to a Mongo DB. 2 . Using CDC, you listen to any changes made to the Mongo DB and add them as events to a RabbitMQ queue. 3 . A Bytewax streaming ingestion pipeline will listen to the queue to clean, chunk, and embed the data in real time. 4 . The cleaned and embedded data is loaded to a Qdrant vector DB. 5 .
On the training pipeline side, you use a vector DB retrieval client to build your training dataset, which consists of the cleaned data augmented using RAG . 6 . You fine tune an open source Mistral LLM using QLoRA and push all the experiment artifacts to a Comet experiment tracker. 7 . Based on the best experiment, you push the LLM candidate to Comet s model registry. You carefully evaluate the LLM candidate using Comet s prompt monitoring dashboard. If the evaluation passes, you tag it as accepted. 8 . On the inference pipeline side, you deploy the new LLM model by pulling it from the model registry, loading it, and quantizing it. 9 . The inference pipeline is wrapped by a REST API, which allows users to make ChatGPT like requests. A rough sketch of step 2 the CDC hop from MongoDB to RabbitMQ is shown right after this list.
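To make step 2 above more concrete, here is a minimal, illustrative sketch of the CDC hop from MongoDB to RabbitMQ using pymongo change streams and pika. The connection settings, database, collection, and queue names are invented for the example, change streams require MongoDB to run as a replica set, and the course s actual implementation differs in its details.

```python
import json

import pika
from pymongo import MongoClient

# Assumed, example-only connection settings and names.
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["llm_twin"]["raw_documents"]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="cdc_events", durable=True)

# watch() tails MongoDB's change stream (built on the oplog) and yields one
# event per insert/update/delete, which we forward to RabbitMQ as JSON.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        event = {
            "operation": change["operationType"],
            "document": change.get("fullDocument"),
        }
        channel.basic_publish(
            exchange="",
            routing_key="cdc_events",
            body=json.dumps(event, default=str),
        )
```

The design point illustrated here is that the pipeline never polls the database: every insert or update becomes an event on the queue, which is what keeps the downstream feature pipeline in sync in real time.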
Learn to build production ML systems by building an LLM application Taking in mind the _blueprint for designing production LLM systems presented above_ , we want to let you know that _ We are close to wrapping our LLM twin course lessons and code._ To give more context for newcomers, in the past weeks we started releasing an end to end course on production LLMs by teaching you how to build an LLM twin _Your Production Ready AI Replica_ . So if you are looking for an end to end FREE course on how to build production ready LLM systems, consider checking the course s first FREE lesson. _The course will walk you through a full stack process_ from data gathering... ...until deploying and monitoring your LLM twin using LLMOps . With that in mind... The 1st lesson will walk you through the issues of generating content using ChatGPT or other similar solutions , the 3 pipeline design , and the system design and architecture of the LLM twin . Within the system design section, we will present all the architectural decisions on how to build a data collection pipeline a real time feature pipeline using a streaming engine hook the data and feature pipelines using the CDC pattern a continuous fine tuning pipeline an inference pipeline deployed as a REST API A particular focus will be on integrating MLOps LLMOps good practices prompt versioning model registries experiment tracker prompt monitoring CI CD IaC Docker . Want to dig into the 1st lesson? Check it out. It s FREE, and no registration is required Lesson 1 An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin Images If not otherwise stated, all images are created by the author.
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/a-blueprint-for-designing-production?r=1ttoeh" }, { "id": "9d858911-52d4-4240-8d6e-91f6b426baa0", "content": "The difference between development and continuous training ML environments . Looking to become a PRO in LangChain? How to write a streaming retrieval system for RAG on social media data. Paul Iusztin Jun 08, 2024 _Decoding ML Notes_ This week s topics Looking to become a PRO in LangChain? The difference between development and continuous training ML environments How to write a streaming retrieval system for RAG on social media data _ First , I want to thank everyone who supported our Hands on LLMs course repo_ The Hands on LLMs FREE course passed 2.1k stars on GitHub the place to learn the fundamentals of LLM systems and LLMOps. _The course is the go to hub for learning the fundamentals of production ready LLMs and LLMOps._ It will walk you through an end to end
process... ...from data preparation to deployment monitoring the 3 pipeline design building your custom financial dataset using GPT 4 a streaming pipeline to ingest financial news in real time fine tuning an LLM using QLoRA building a custom RAG pipeline deploying the streaming pipeline to AWS deploying the training inference pipelines to Beam using MLOps components model registries, experiment trackers, prompt monitoring Check it out Hands on LLMs Course Learn to Train and Deploy a Real Time Financial Advisor Looking to become a PRO in LangChain? Then check out this book on hands on LangChain from beginner to advanced . It s called _Generative AI with LangChain Build LLM apps with Python, ChatGPT, and other LLMs_ by Ben Auffarth , published by Packt. Here is a short breakdown It begins with some theoretical chapters on LLMs LangChain It explores the critical components of LangChain chains, agents, memory, tools Then, my favorite part...
It jumps directly into hands on examples WITH PYTHON CODE takes off with beginner friendly examples of using LangChain with agents, HuggingFace, GCP VertexAI, Azure, Anthropic, etc. shows an end to end example of building a customer services application with LangChain VertexAI how to mitigate hallucinations using the LLMCheckerChain class how to implement map reduce pipelines how to monitor token usage costs how to extract information from documents such as PDFs building a Streamlit interface how reasoning works in agents building a chatbot like ChatGPT from SCRATCH . I haven t finished it yet, but I love it so far I plan to finish it soon. . Who is this for? If you are starting out in the LLM world, this is a great book to read end to end. Even if you are experienced, I think it is extremely useful to skim it to refresh the fundamentals, learn new details, and see how everything is implemented in LangChain. Generative AI with LangChain By Ben Auffarth Is this for you? Check it out Generative AI with LangChain By Ben Auffarth The difference between development and continuous training ML environments They might do the same thing, but their design is entirely different ML Development Environment At this point, your main goal is to ingest the raw and preprocessed data through versioned artifacts or a feature store , analyze it, and generate as many experiments as possible to find the best model hyperparameters augmentations . Based on your business requirements, you must maximize some specific metrics, find the best latency accuracy trade offs, etc.
You will use an experiment tracker to compare all these experiments. After you settle on the best one, the output of your ML development environment will be a new version of the code and a new version of the configuration artifact. Here is where the research happens. Thus, you need flexibility. That is why we decouple it from the rest of the ML systems through artifacts data, config, code artifacts . The difference between ML development continuous training environments Continuous Training Environment Here is where you want to take the data, code, and config artifacts and train the model on all the required data output a staging versioned model artifact test the staging model artifact if the test passes, label it as the new production model artifact deploy it to the inference services A common strategy is to build a CI CD pipeline that e.g., using GitHub Actions builds a docker image from the code artifact e.g., triggered manually or when a new artifact version is created starts the training pipeline inside the docker container that pulls the feature and config artifacts and outputs the staging model artifact manually look over the training report If everything went fine, manually trigger the testing pipeline manually look over the testing report if everything worked fine e.g., the model is better than the previous one , manually trigger the CD pipeline that deploys the new model to your inference services Note how the model registry quickly helps you to decouple all the components. Also, because training and testing metrics are not always black and white, it is challenging to automate the CI CD pipeline 100% . Thus, you need a human in the loop when deploying ML models. To conclude... The ML development environment is where you do your research to find better models. The continuous training environment is used to train and test the production model at scale. A minimal sketch of this train test promote gate is shown below.
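To illustrate the gate described above, here is a minimal, plain Python sketch of the train test promote logic. The ModelRegistry class, train_fn, and test_fn are stand ins invented for the example, not the Comet ML model registry or the GitHub Actions pipeline mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ModelRegistry:
    """Toy stand-in for a real model registry."""
    production: object = None
    production_metrics: Dict[str, float] = field(
        default_factory=lambda: {"accuracy": 0.0}
    )

    def promote(self, model: object, metrics: Dict[str, float]) -> None:
        self.production = model
        self.production_metrics = metrics


def continuous_training_run(
    train_fn: Callable[[], object],
    test_fn: Callable[[object], Dict[str, float]],
    registry: ModelRegistry,
) -> bool:
    candidate = train_fn()        # train on all the required data -> staging artifact
    metrics = test_fn(candidate)  # test the staging model artifact

    # Promote only if the staging model beats the current production model.
    # In practice this comparison is rarely black and white, which is why a
    # human usually reviews the training/testing reports before the CD step.
    if metrics["accuracy"] > registry.production_metrics["accuracy"]:
        registry.promote(candidate, metrics)
        return True
    return False
```

The sketch only captures the decision logic; in the setup described above, the training step runs inside a Docker container and the promotion is a manual trigger rather than an automatic branch.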
How to write a streaming retrieval system for RAG on social media data Batch systems are the past. Here is how to write a streaming retrieval system for RAG on social media data. Why streaming over batch? In environments where data evolves quickly e.g., social media platforms , the system s response time is critical for your application s user experience. That is why TikTok is so addictive. Its recommender system adapts in real time based on your interaction with the app. How would it be if the recommendations were updated daily or hourly? Well, it would work, but you would probably get bored of the app much faster. The same applies to RAG for highly intensive data sources... where you must sync your source and vector DB in real time for up to date retrievals. Let s see how it works. I wrote an article on how to build a real time retrieval system for RAG on LinkedIn data in collaboration with Superlinked . The retrieval system is based on 2 detached components the streaming ingestion pipeline the retrieval client The streaming ingestion pipeline runs 24 7 to keep the vector DB synced with the current raw LinkedIn posts data source. The retrieval client is used in RAG applications to query the vector DB. These 2 components are completely decoupled and communicate with each other through the vector DB. 1 . The streaming ingestion pipeline Implemented in Bytewax a streaming engine built in Rust speed and reliability that exposes a Python interface. Main flow uses CDC to add changes from the source DB to a queue listens to the queue for new events cleans, chunks, and embeds the LI posts loads them to a Qdrant vector DB and... everything in real time! Advanced RAG architecture source from Superlinked Vectorhub . A minimal sketch of the clean chunk embed load steps is shown below.
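Here is a minimal sketch of the clean chunk embed load steps, assuming a sentence transformers embedding model and the qdrant client Python package. The model name, collection name, and the in memory list standing in for the RabbitMQ queue are arbitrary choices for the example, and the real pipeline is a Bytewax dataflow rather than a plain loop.

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = QdrantClient(":memory:")  # swap for the real Qdrant URL in production
client.recreate_collection(
    collection_name="linkedin_posts",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def clean(text: str) -> str:
    # Placeholder for real cleaning logic (strip markup, emojis, etc.).
    return " ".join(text.split())


def chunk(text: str, max_words: int = 128) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Stand-in for messages consumed from the RabbitMQ queue.
incoming_events = [{"text": "Raw LinkedIn post pulled off the queue ..."}]

for event in incoming_events:
    for piece in chunk(clean(event["text"])):
        client.upsert(
            collection_name="linkedin_posts",
            points=[
                PointStruct(
                    id=str(uuid.uuid4()),
                    vector=embedder.encode(piece).tolist(),
                    payload={"text": piece},
                )
            ],
        )
```

The same preprocessing functions should also be applied to user queries at retrieval time, which is exactly what the retrieval client described next does.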
2 . The retrieval client A standard Python module. The goal is to retrieve similar posts using various query types, such as posts, questions, and sentences. Main flow preprocess user queries the same way as they were ingested search the Qdrant vector DB for the most similar results use rerank to improve the retrieval system s accuracy visualize the results on a 2D plot using UMAP . You don t believe me? Check out the full article and code on Decoding ML _A Real time Retrieval System for RAG on Social Media Data_ Images If not otherwise stated, all images are created by the author. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-difference-between-development?r=1ttoeh" }, { "id": "20beb560-6063-4158-b7b5-c2083b299ec5", "content": "Architect LLM RAG inference pipelines by Paul Iusztin Design, build, deploy and monitor LLM and RAG inference pipelines using LLMOps best practices. Integrate it with a model registry and vector DB. Architect scalable and cost effective LLM RAG inference pipelines Design, build and deploy a RAG inference pipeline using LLMOps best practices. Paul Iusztin Jun 06, 2024 the 9th out of 11 lessons of the LLM Twin free course Why is this course different?
_By finishing the LLM Twin Building Your Production Ready AI Replica _ _free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices_. _ Why should you care? _ _ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system._ _More details on what you will learn within the LLM Twin course , here _ Latest Lessons of the LLM Twin Course Lesson 6 The Role of Feature Stores in Fine Tuning LLMs Custom Dataset Generation, Artifact Versioning, GPT3.5 Turbo Distillation, Qdrant Lesson 7 How to fine tune LLMs on custom datasets at Scale using Qwak and CometML QLoRA, PEFT, Fine tuning Mistral 7b Instruct on custom dataset, Qwak, Comet ML Lesson 8 Best practices when evaluating fine tuned LLM models LLM Evaluation techniques Does and don ts, Quantitive and manual LLM evaluation techniques Lesson 9 Architect scalable and cost effective LLM RAG inference pipelines In Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM twin system. First , we will design and implement a scalable LLM RAG inference pipeline based on microservices, separating the ML and business logic into two layers. Secondly , we will use Comet ML to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis. Ultimately , we will deploy the inference pipeline to Qwak and make the LLM twin service available worldwide. Context from previous lessons. What you must know. This lesson is part of a more extensive series in which we learn to build an end to end LLM system using LLMOps best practices. _If you haven t read the whole series, for this one to make sense, you have to know that we have a _ Qdrant vector DB populated with digital data posts, articles, and code snippets vector DB retrieval module to do advanced RAG fine tuned open source LLM available in a model registry from Comet ML _ In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud._ Table of Contents 1. The architecture of the inference pipeline 2. The training vs. the inference pipeline 3. The RAG business module 4. The LLM microservice 5. Prompt monitoring 6. Deploying and running the inference pipeline 7. Conclusion 1 . The architecture of the inference pipeline Our inference pipeline contains the following core elements a fine tuned LLM a RAG module a monitoring service Let s see how to hook these into a scalable and modular system. The interface of the inference pipeline As we follow the feature training inference FTI pipeline architecture, the communication between the 3 core components is clear. Our LLM inference pipeline needs 2 things a fine tuned LLM pulled from the model registry features for RAG pulled from a vector DB which we modeled as a logical feature store This perfectly aligns with the FTI architecture. _ If you are unfamiliar with the FTI pipeline architecture, we recommend you reviewLesson 1 s section on the 3 pipeline architecture._ Monolithic vs. microservice inference pipelines Usually, the inference steps can be split into 2 big layers t he LLM service where the actual inference is being done the business service domain specific logic We can design our inference pipeline in 2 ways. Option 1 Monolithic LLM business service In a monolithic scenario, we implement everything into a single service. 
_Pros _ easy to implement easy to maintain _Cons _ harder to scale horizontally based on the specific requirements of each component harder to split the work between multiple teams not being able to use different tech stacks for the two services Monolithic vs. microservice inference pipelines Option 2 Different LLM business microservices The LLM and business services are implemented as two different components that communicate with each other through the network, using protocols such as REST or gRPC. _Pros _ each component can scale horizontally individually each component can use the best tech stack at hand _Cons _ harder to deploy harder to maintain Let s focus on the each component can scale individually part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do the job only with a CPU. Microservice architecture of the LLM twin inference pipeline Let s understand how we applied the microservice pattern to our concrete LLM twin inference pipeline. As explained in the sections above, we have the following components 1. A business microservice 2. An LLM microservice 3. A prompt monitoring microservice The business microservice is implemented as a Python module that contains the advanced RAG logic, which calls the vector DB and GPT 4 API for advanced RAG operations calls the LLM microservice through a REST API using the prompt computed utilizing the user s query and retrieved context sends the prompt and the answer generated by the LLM to the prompt monitoring microservice. As you can see, the business microservice is light. It glues all the domain steps together and delegates the computation to other services. The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application. However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud. Microservice architecture of the LLM twin inference pipeline The LLM microservice is deployed on Qwak. This component is wholly niched on hosting and calling the LLM. It runs on powerful GPU enabled machines. How does the LLM microservice work? It loads the fine tuned LLM twin model from Comet s model registry 2 . It exposes a REST API that takes in prompts and outputs the generated answer. When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer. That s it! The prompt monitoring microservice is based on Comet ML s LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM. 2 . The training vs. the inference pipeline Along with the obvious reason that the training pipeline takes care of training while the inference pipeline takes care of inference Duh! , there are some critical differences you have to understand. The input of the pipeline How the data is accessed Do you remember our logical feature store based on the Qdrant vector DB and Comet ML artifacts? If not, consider checking out Lesson 6 for a refresher. The core idea is that during training , the data is accessed from an offline data storage in batch mode, optimized for throughput and data lineage. 
Our LLM twin architecture uses Comet ML artifacts to access, version, and track all our data. The data is accessed in batches and fed to the training loop. During inference , you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove. During inference, you don t care about data versioning and lineage. You just want to access your features quickly for a good user experience. The data comes directly from the user and is sent to the inference logic. The training vs. the inference pipeline The output of the pipeline The training pipeline s final output is the trained weights stored in Comet s model registry. The inference pipeline s final output is the predictions served directly to the user. The infrastructure The training pipeline requires more powerful machines with as many GPUs as possible. _Why?_ During training, you batch your data and have to hold in memory all the gradients required for the optimization steps. Because of the optimization algorithm, the training is more compute hungry than the inference. Thus, more computing and VRAM result in bigger batches, which means less training time and more experiments. If you run a batch pipeline, you will still pass batches to the model but don t perform any optimization steps. If you run a real time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step. Are there any overlaps? Yes! This is where the training serving skew comes in. To avoid the training serving skew, you must carefully apply the same preprocessing and postprocessing steps during training and inference. 3 . The RAG business module We will define the RAG business module under the _LLMTwin_ class. The LLM twin logic is directly correlated with our business logic. We don t have to introduce the word business in the naming convention of the classes. Let s dig into the _generate _ method of the _LLMTwin_ class, where we call the RAG module create the prompt using the prompt template, query and context call the LLM microservice log the prompt, prompt template, and answer to Comet ML s prompt monitoring service. Inference pipeline business module generate method GitHub Let s look at how our LLM microservice is implemented using Qwak. 4 . The LLM microservice As the LLM microservice is deployed on Qwak, we must first inherit from the _QwakModel_ class and implement some specific functions. _initialize_model _ where we load the fine tuned model from the model registry at serving time _schema _ where we define the input and output schema _predict _ where we implement the actual inference logic Note The _build _ function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline. LLM microservice GitHub Let s zoom into the implementation and the life cycle of the Qwak model. The _schema _ method is used to define how the input and output of the _predict _ method look like. This will automatically validate the structure and type of the _predict _ method. For example, the LLM microservice will throw an error if the variable instruction is a JSON instead of a string. The other Qwak specific methods are called in the following order 1. ___init__ _ when deploying the model 2. _initialize_model _ when deploying the model 3. 
_predict _ on every request to the LLM microservice Note that these methods are called only during serving time and not during training . Qwak exposes your model as a RESTful API, where the _predict _ method is called on each request. Inside the prediction method, we perform the following steps map the input text to token IDs using the LLM specific tokenizer move the token IDs to the provided device GPU or CPU pass the token IDs to the LLM and generate the answer extract only the generated tokens from the _generated_ids_ variable by slicing it using the shape of the _input_ids_ decode the _generated_ids_ back to text return the generated text a generic sketch of these steps is included at the end of this lesson . The final step is to look at Comet s prompt monitoring service. 5 . Prompt monitoring Comet makes prompt monitoring straightforward. There is just one API call where you connect to your project and workspace and send the following to a single function the prompt and LLM output the prompt template and variables that created the final output your custom metadata specific to your use case here, you add information about the model, prompt token count, token generation costs, latency, etc.

import comet_llm

# "settings" is the project s configuration object (Comet credentials, model type, etc.).

class PromptMonitoringManager:
    @classmethod
    def log(
        cls,
        prompt: str,
        output: str,
        prompt_template: str | None = None,
        prompt_template_variables: dict | None = None,
        metadata: dict | None = None,
    ) -> None:
        # Always attach the model type; merge any user-provided metadata.
        metadata = {
            "model": settings.MODEL_TYPE,
            **(metadata or {}),
        }

        comet_llm.log_prompt(
            workspace=settings.COMET_WORKSPACE,
            project=f"{settings.COMET_PROJECT}-monitoring",
            api_key=settings.COMET_API_KEY,
            prompt=prompt,
            prompt_template=prompt_template,
            prompt_template_variables=prompt_template_variables,
            output=output,
            metadata=metadata,
        )

This is how Comet ML s prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM. You can click on any prompt and see everything we logged programmatically using the _PromptMonitoringManager_ class. Screenshot from Comet ML s dashboard Besides what we logged, adding various tags and the inference duration can be valuable. 6 . Deploying and running the inference pipeline We can deploy the LLM microservice using the following Qwak command qwak models deploy realtime --model-id llm_twin --instance gpu.a10.2xl --timeout 50000 --replicas 2 --server-workers 2 We deployed two replicas of the LLM twin. Each replica has access to a machine with x1 A10 GPU. Also, each replica has two workers running on it. More on Qwak instance types Two replicas and two workers result in 4 microservices that run in parallel and can serve our users. You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by listening to the consumption of GPU, CPU or RAM. To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies. Conclusion _Congratulations! You are close to the end of the LLM twin series._ In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems. First , you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the difference in designing the training and inference pipelines. Secondly , we walked you through implementing the RAG business module and LLM twin microservice. Also, we showed you how to log all the prompts, answers, and metadata for Comet s prompt monitoring service. Ultimately , we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.
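As a generic recap of the predict flow listed in section 4 tokenize, move to device, generate, slice off the prompt tokens, decode here is a minimal Hugging Face transformers sketch. It is not the course s Qwak implementation, and the model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


def predict(instruction: str, max_new_tokens: int = 256) -> str:
    # 1. Map the input text to token IDs and move them to the GPU/CPU.
    input_ids = tokenizer(instruction, return_tensors="pt").input_ids.to(device)
    # 2. Pass the token IDs to the LLM and generate the answer.
    generated_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # 3. Keep only the newly generated tokens by slicing with the prompt length.
    answer_ids = generated_ids[:, input_ids.shape[1]:]
    # 4. Decode back to text and return it.
    return tokenizer.batch_decode(answer_ids, skip_special_tokens=True)[0]
```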
In Lesson 10 , we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLM s answers relative to the query and context. See you there! _ Check out the code on GitHub 1 and support us with a _ Next Steps Step 1 This is just the short version of Lesson 9 on architecting scalable and cost effective LLM RAG inference pipelines. For The full implementation. Full deep dive into the code. More on the RAG, LLM and monitoring services. Check out the full version of Lesson 9 on our Medium publication . It s still FREE Lesson 9 on Medium Step 2 Consider checking out the LLM Twin GitHub repository and try it yourself _Nothing compares with getting your hands dirty and doing it yourself!_ LLM Twin Course GitHub Images If not otherwise stated, all images are created by the author. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/architect-scalable-and-cost-effective?r=1ttoeh" }, { "id": "95d64d1d-83f2-47e9-8eda-9a687b98e6eb", "content": "7 tips to reduce your VRAM when training LLMs . 3 techniques you must know to evaluate your LLMs. Introduction to deploying private LLMs with AWS SageMaker. Paul Iusztin May 18, 2024 _Decoding ML Notes_ This week s topics 3 techniques you must know to evaluate your LLMs 7 tips you must know to reduce your VRAM consumption of your LLMs during training Introduction to deploying private LLMs with AWS SageMaker On the 3rd of May, I hosted a free session on Maven for 94 people on how to Architect Your LLM Twin. If you missed it, here is how you can access it for free.
Key takeaways were Why I started building my LLM Twin The 3 pipeline design The FTI pipeline architecture System design of the LLM Twin Architecture Break down the RAG system of the LLM Twin Architecture Live Demo . If you want the recording, you can watch it for free here https://bit.ly/3PZGV0S Also, here are other useful links slides https://lnkd.in/d_MdqGwS LLM Twin course GitHub https://lnkd.in/dzat6PB6 LLM Twin FREE lessons https://lnkd.in/dX__4mhX 3 techniques you must know to evaluate your LLMs Here are 3 techniques you must know to evaluate your LLMs quickly. Manually testing the output of your LLMs is a tedious and painful process you need to automate it. In generative AI, most of the time, you cannot leverage standard metrics. Thus, the real question is, how do you evaluate the outputs of an LLM? 1 . Structured answers you know exactly what you want to get Even if you use an LLM to generate text, you can ask it to generate a response in a structured format e.g., JSON that can be parsed. You know exactly what you want e.g., a list of products extracted from the user s question . Thus, you can easily compare the generated and ideal answers using classic approaches. For example, when extracting the list of products from the user s input, you can do the following check if the LLM outputs a valid JSON structure use a classic method to compare the generated and real answers 2 . No right answer e.g., generating descriptions, summaries, etc. When generating sentences, the LLM can use different styles, words, etc.
2. No right answer (e.g., generating descriptions, summaries, etc.)
When generating sentences, the LLM can use different styles, words, etc. Thus, traditional metrics (e.g., the BLEU score) are too rigid to be useful. You can leverage another LLM to test the output of your initial LLM. The trick is in what questions to ask. Here, we have 2 sub scenarios.
2.1. When you don't have an ideal answer to compare the answer to (you don't have ground truth)
You don't have access to an expert to write an ideal answer for a given question to compare it to. Based on the initial prompt and generated answer, you can compile a set of questions and pass them to an LLM. Usually, these are Y/N questions that you can easily quantify and use to check the validity of the generated answer. This is known as Rubric Evaluation. For example:
Is there any disagreement between the response and the context? (Y or N)
Count how many questions the user asked. (output a number)
...
This strategy is intuitive, as you can ask the LLM any question you are interested in, as long as it can output a quantifiable answer (Y/N or a number).
2.2. When you do have an ideal answer to compare the response to (you have ground truth)
When you have access to an answer manually created by a group of experts, things are easier. You will use an LLM to compare the generated and ideal answers based on semantics, not structure. For example:
A: The submitted answer is a subset of the expert answer and entirely consistent with it.
...
E: The answers differ, but these differences don't matter.
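To illustrate the rubric evaluation idea from 2.1, here is a minimal sketch assuming the OpenAI Python client as the judge LLM; the rubric questions, model name, and JSON output convention are illustrative:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """You are evaluating an LLM answer.
Context: {context}
Answer: {answer}
Reply with a JSON object containing:
- "disagreement_with_context": "Y" or "N"
- "num_user_questions": an integer
"""

def rubric_evaluate(context: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": RUBRIC.format(context=context, answer=answer)}],
    )
    # Each rubric item is quantifiable (Y/N or a number), so the result can be checked automatically.
    # In practice you may need to strip code fences or re-ask if the judge returns invalid JSON.
    return json.loads(response.choices[0].message.content)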
7 tips you must know to reduce the VRAM consumption of your LLMs during training
Here are 7 tips you must know to reduce the VRAM consumption of your LLMs during training so you can fit them on a single GPU.
1. Mixed precision
During training you use both FP32 and FP16 in the following way: FP32 weights -> FP16 weights -> FP16 gradients -> FP32 gradients -> update the FP32 weights, and repeat. As you can see, the forward and backward passes are done in FP16, and only the optimization step is done in FP32, which reduces both the VRAM usage and the runtime.
2. Lower precision
All your computations are done in FP16 instead of FP32. But the key is using bfloat16 (Brain Floating Point), a numerical representation Google developed for deep learning. It allows you to represent very large and very small numbers, avoiding overflowing or underflowing scenarios.
3. Reducing the batch size
This one is straightforward. Fewer samples per training iteration result in smaller VRAM requirements. The downside of this method is that you can't go too low with your batch size without impacting your model's performance.
4. Gradient accumulation
It is a simple but powerful trick to increase your batch size virtually. You compute the gradients for micro batches (forward and backward passes). Once the accumulated gradients reach the given virtual target, the model weights are updated with the accumulated gradients. For example, with a batch size of 4 and a micro batch size of 1, each forward backward pass uses only 1 sample, while the optimization step uses the aggregated gradient of the 4 samples.
5. Use a stateless optimizer
Adam is the most popular optimizer. It is one of the most stable optimizers, but the downside is that it keeps 2 additional states (a mean and a variance) for every model parameter. If you use a stateless optimizer, such as SGD, you can reduce the number of stored values by roughly 2/3, which is significant for LLMs.
6. Gradient or activation checkpointing
It drops specific activations during the forward pass and recomputes them during the backward pass. Thus, it eliminates the need to hold all activations simultaneously in VRAM. This technique reduces VRAM consumption but makes the training slower.
7. CPU parameter offloading
The parameters that do not fit on your GPU's VRAM are loaded on the CPU. Intuitively, you can see it as a form of model parallelism between your GPU and CPU.
Image by DALL E
Most of these methods are orthogonal, so you can combine them and drastically reduce your VRAM requirements during training.
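To make tips 1 and 4 concrete, here is a minimal PyTorch sketch that combines automatic mixed precision with gradient accumulation; the model, data loader, and accumulation factor are placeholders, not a prescribed setup:

import torch

def train(model, loader, optimizer, accumulation_steps=4, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()  # scales FP16 gradients so the FP32 update stays stable
    model.to(device).train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):  # loader yields (inputs, targets) batches
        inputs, targets = inputs.to(device), targets.to(device)
        with torch.cuda.amp.autocast():  # forward pass runs in reduced precision
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        # Divide the loss so the accumulated gradient matches the virtual batch size.
        scaler.scale(loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)  # the optimizer update itself stays in FP32
            scaler.update()
            optimizer.zero_grad()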
Introduction to deploying private LLMs with AWS SageMaker
Ever wondered how to deploy, in 30 minutes, open source LLMs, such as Llama2, on AWS SageMaker? Then wonder no more.
Step 1: Deploy the LLM to AWS SageMaker
The sweet thing about SageMaker is that it accelerates the development process, enabling a more efficient and rapid transition to the production stage. Vesa Alexandru smashed it with his first article on DML, showing step by step how to deploy an LLM from HuggingFace to AWS SageMaker using good practices, such as: designing a config class for the deployment of the LLM; setting up AWS and deploying the LLM to SageMaker; implementing an inference class to call the deployed LLM in real time through a web endpoint; defining a prompt template function to ensure reproducibility and consistency; ...and, ultimately, how to play with your freshly deployed LLM yourself.
_Here is the full article explaining how to deploy the LLM to AWS SageMaker_: DML: Introduction to Deploying Private LLMs with AWS SageMaker (Focus on Llama2 7b chat), Vesa Alexandru, Jan 18. Read full story.
Step 2: Call the SageMaker inference endpoint
You've just deployed your Mistral LLM to SageMaker. Now what? Unfortunately, you are not done. That was just the beginning of the journey. Now, you have to write a Python client that calls the LLM.
Let's use a document summary task as an example.
Step 1: Define a Settings object using pydantic.
Step 2: Create an inference interface that inherits from ABC.
Step 3: Implement an AWS SageMaker version of the inference interface by specifying how to construct the HTTP payload and call the SageMaker endpoint. We want to keep this class independent of the summarization prompt!
Step 4: Create the summarization prompt.
Step 5: Encapsulate the summarization prompt and the Python SageMaker client into a SummarizeShortDocument task.
Step 6: Wrap the SummarizeShortDocument task with a FastAPI endpoint.
...and bam! You have an LLM for summarizing any document.
Here are some advantages of the design described above:
by using an inference interface, you can quickly swap the LLM implementation;
by decoupling the prompt construction logic from the inference class, you can reuse the inference client with any prompt;
by wrapping everything with a SummarizeShortDocument task, you can quickly define and configure multiple types of tasks and leverage polymorphism to run them.
_Here is the full article explaining how to design the inference module_: Steal my code to solve real world problems, Vesa Alexandru, Feb 29. Read full story.
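As a minimal sketch of the design described in the steps above, assuming boto3 and Pydantic: the endpoint name, region, and response payload format are illustrative, and SummarizeShortDocument is the only class name taken from the article itself.

import json
from abc import ABC, abstractmethod
import boto3
from pydantic import BaseModel

class Settings(BaseModel):
    endpoint_name: str = "my-llm-endpoint"  # placeholder
    region_name: str = "eu-central-1"       # placeholder

class Inference(ABC):
    @abstractmethod
    def predict(self, prompt: str) -> str: ...

class SageMakerInference(Inference):
    def __init__(self, settings: Settings):
        self._settings = settings
        self._client = boto3.client("sagemaker-runtime", region_name=settings.region_name)

    def predict(self, prompt: str) -> str:
        # Build the HTTP payload and call the SageMaker endpoint.
        response = self._client.invoke_endpoint(
            EndpointName=self._settings.endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt}),
        )
        # The response shape depends on the deployed container; this assumes a HF text generation response.
        return json.loads(response["Body"].read())[0]["generated_text"]

SUMMARY_PROMPT = "Summarize the following document:\n{document}"

class SummarizeShortDocument:
    def __init__(self, inference: Inference):
        self._inference = inference  # any Inference implementation can be swapped in

    def run(self, document: str) -> str:
        return self._inference.predict(SUMMARY_PROMPT.format(document=document))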
Images: If not otherwise stated, all images are created by the author.
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/7-tips-to-reduce-your-vram-when-training?r=1ttoeh" }, { "id": "d0c592eb-82bc-46c4-9632-388f9dd144ce", "content": "Using this Python package, you can x10 your text preprocessing pipelines
End to end framework for production ready LLMs. Top 6 ML platform features you must know and use in your ML system.
Paul Iusztin, May 11, 2024
_Decoding ML Notes_
This week's topics: Top 6 ML platform features you must know and use in your ML system; Using this Python package, you can x10 your text preprocessing pipelines; End to end framework for production ready LLMs.
Top 6 ML platform features you must know and use in your ML system
Here they are:
1. Experiment Tracking
In your ML development phase, you generate lots of experiments. Tracking and comparing the metrics between them is crucial in finding the optimal model.
2. Metadata Store
Its primary purpose is reproducibility. To know how a model was generated, you need to know: the version of the code; the version of the packages; hyperparameters and config; total compute; the version of the dataset; ...and more.
3. Visualisations
Most of the time, along with the metrics, you must log a set of visualizations for your experiment, such as: images, videos, prompts, t SNE graphs, 3D point clouds, ...and more.
4. Reports
You don't work in a vacuum. You have to present your work to other colleagues or clients. A report lets you take the metadata and visualizations from your experiment ...and create, deliver and share a targeted presentation for your clients or peers.
5. Artifacts
The most powerful feature out of them all. An artifact is a versioned object that is an input or output for your task.
Everything can be an artifact, but the most common cases are data, model, and code. Wrapping your assets around an artifact ensures reproducibility. For example, you wrap your features into an artifact (e.g., features 3.1.2), which you can consume in your ML development step. The ML development step will generate config (e.g., config 1.2.4) and code (e.g., code 1.0.2) artifacts used in the continuous training pipeline. Doing so lets you quickly answer questions such as What did I use to generate the model? and Which version?
6. Model Registry
The model registry is the ultimate way to make your model accessible to your production ecosystem. For example, in your continuous training pipeline, after the model is trained, you load the weights as an artifact into the model registry (e.g., model 1.2.4). You label this model as staging under a new version and prepare it for testing. If the tests pass, mark it as production under a new version and prepare it for deployment (e.g., model 2.1.5).
All of these features are used in a mature ML system. What is your favorite one?
Using this Python package, you can x10 your text preprocessing pipelines
Any text preprocessing pipeline has to clean, partition, extract, or chunk text data to feed it into your LLMs. unstructured offers a rich and clean API that allows you to quickly:
partition your data into smaller segments from various data sources (e.g., HTML, CSV, PDFs, even images, etc.)
clean the text of anomalies (e.g., wrong ASCII characters), any irrelevant information (e.g., white spaces, bullets, etc.), and fill in missing values
extract information from pieces of text (e.g., datetimes, addresses, IP addresses, etc.)
chunk your text segments into pieces of text that can be inserted into your embedding model
embed data (e.g., wrappers over OpenAIEmbeddingEncoder, HuggingFaceEmbeddingEncoders, etc.)
stage your data to be fed into various tools (e.g., Label Studio, Label Box, etc.)
All these steps are essential for: feeding your data into your LLMs; embedding the data and ingesting it into a vector DB; doing RAG; labeling; recommender systems; ...basically any LLM or multimodal application. Implementing all these steps from scratch will take a lot of time; a minimal sketch of such a pipeline with unstructured is shown below.
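This sketch assumes the unstructured package is installed with its HTML extras; the file name, cleaner options, and chunk size are placeholders, and the exact helpers may differ between library versions:

from unstructured.partition.html import partition_html
from unstructured.cleaners.core import clean
from unstructured.chunking.title import chunk_by_title

# 1) Partition the raw document into elements (titles, paragraphs, lists, ...).
elements = partition_html(filename="newsletter.html")  # placeholder file

# 2) Clean each element of bullets, dashes, and extra whitespace.
for element in elements:
    element.apply(lambda text: clean(text, bullets=True, extra_whitespace=True, dashes=True))

# 3) Chunk the cleaned elements into pieces that fit your embedding model.
chunks = chunk_by_title(elements, max_characters=500)
print([chunk.text for chunk in chunks])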
I know some Python packages already do parts of this, but the functionality is scattered across multiple packages. unstructured packages everything together under a nice, clean API.
End to end framework for production ready LLMs
Want to learn to build production ready LLMs in a structured way? For FREE? Then you should take our NEW course on how to implement an end to end framework for production ready LLM systems.
Decoding ML and I are starting a new FREE course on learning how to architect and build a real world LLM system by building an LLM Twin: from start to finish, from data collection to deployment; production ready; from no MLOps to experiment trackers, model registries, prompt monitoring, and versioning.
The course is called LLM Twin: Building Your Production Ready AI Replica ...and here is what you will learn to build:
4 Python microservices:
The data collection pipeline: Crawl your digital data from various social media platforms. Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. Deployed on AWS.
The feature pipeline: Consume messages from a queue through a Bytewax streaming pipeline. Every message will be cleaned, chunked, embedded and loaded into a Qdrant vector DB in real time. Deployed on AWS.
The training pipeline: Create a custom dataset based on your digital data. Fine tune an LLM using QLoRA. Use Comet ML's experiment tracker to monitor the experiments. Evaluate and save the best model to Comet's model registry. Deployed on Qwak.
The inference pipeline: Load and quantize the fine tuned LLM from Comet's model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet's prompt monitoring dashboard. Deployed on Qwak.
_Along the 4 microservices, you will learn to integrate 3 serverless tools_: Comet as your ML platform; Qdrant as your vector DB; Qwak as your ML infrastructure.
To stay updated on the LLM Twin: Building Your Production Ready AI Replica course...
...check it out on GitHub and support us with a star: LLM Twin: Building Your Production Ready AI Replica.
Images: If not otherwise stated, all images are created by the author.
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/using-this-python-package-you-can?r=1ttoeh" }, { "id": "46f9a4cc-cf3b-43c6-9026-6c9cddf8674a", "content": "The 4 Advanced RAG Algorithms You Must Know to Implement, by Paul Iusztin
Implement from scratch 4 advanced RAG retrieval and post retrieval techniques to optimize your vector DB searches and integrate the RAG retrieval module into a production LLM system.
Paul Iusztin, May 09, 2024
_The 5th out of 11 lessons of the LLM Twin free course._
Why is this course different? _By finishing the LLM Twin: Building Your Production Ready AI Replica free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices._ _Why should you care?_ _No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system._ More details on what you will learn within the LLM Twin course, here.
Latest Lessons of the LLM Twin Course
Lesson 2: The importance of Data Pipelines in the era of Generative AI (data crawling, ETL pipelines, ODM, NoSQL Database)
Lesson 3: CDC: Enabling Event Driven Architectures (Change Data Capture (CDC), MongoDB Watcher, RabbitMQ queue)
Lesson 4: Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time!
(Feature pipeline, Bytewax streaming engine, Pydantic models, the dispatcher layer)
Lesson 5: The 4 Advanced RAG Algorithms You Must Know to Implement
In Lesson 5, we will focus on building an advanced retrieval module used for RAG. We will show you how to implement 4 retrieval and post retrieval advanced optimization techniques to improve the accuracy of your RAG retrieval step. In this lesson, we will focus only on the retrieval part of the RAG system. In Lesson 4, we showed you how to clean, chunk, embed, and load social media data to a Qdrant vector DB (the ingestion part of RAG). In future lessons, we will integrate this retrieval module into the inference pipeline for a full fledged RAG system.
Retrieval Python Module Architecture
1. Overview of advanced RAG optimization techniques
A production RAG system is split into 3 main components:
ingestion: clean, chunk, embed, and load your data to a vector DB
retrieval: query your vector DB for context
generation: attach the retrieved context to your prompt and pass it to an LLM
The ingestion component sits in the _feature pipeline_, while the retrieval and generation components are implemented inside the _inference pipeline_. You can also use the retrieval and generation components in your _training pipeline_ to fine tune your LLM further on domain specific prompts. You can apply advanced techniques to optimize your RAG system for ingestion, retrieval and generation.
_That being said, there are 3 main types of advanced RAG techniques_:
Pre retrieval optimization (ingestion): tweak how you create the chunks
Retrieval optimization (retrieval): improve the queries to your vector DB
Post retrieval optimization (retrieval): process the retrieved chunks to filter out the noise
The generation step can be improved through fine tuning or prompt engineering, which will be explained in future lessons. The pre retrieval optimization techniques are explained in Lesson 4. In this lesson, we will show you some popular retrieval and post retrieval optimization techniques.
2. Advanced RAG techniques applied to the LLM twin
Retrieval optimization: _we will combine 3 techniques_: Query Expansion, Self Query, and Filtered vector search.
Post retrieval optimization: we will use the rerank pattern using GPT 4 and prompt engineering instead of Cohere or an open source re ranker (cross encoder) [4].
I don't want to spend too much time on the theoretical aspects. There are plenty of articles on that. _So, we will jump straight to implementing and integrating these techniques in our LLM twin system._ But first, let's clarify why we picked Qdrant as our vector DB.
2.1. Why Qdrant?
There are many vector DBs out there, too many. But since we discovered Qdrant, we loved it. Why?
It is built in Rust. Apache 2.0 license, open source.
It has a great and intuitive Python SDK.
It has a freemium self hosted version to build PoCs for free.
It supports unlimited document sizes and vector dims of up to 65,536.
It is production ready. Companies such as Disney, Mozilla, and Microsoft already use it.
It is one of the most popular vector DBs out there.
_To put that in perspective_, Pinecone, one of its biggest competitors, supports only documents with up to 40k tokens and vectors with up to 20k dimensions... and a proprietary license. I could go on and on, but if you are curious to find out more, _check out Qdrant_.
3. Retrieval optimization 1: Query expansion
Query expansion is quite intuitive. You use an LLM to generate multiple queries based on your initial query. These queries should contain multiple perspectives of the initial query. Thus, when embedded, they hit different areas of your embedding space that are still relevant to the initial question. You can do query expansion with a detailed zero shot prompt (Query expansion template: GitHub Code). A minimal sketch of such a prompt is shown below.
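The exact template lives in the course repository (linked above); the wording here is illustrative, and it follows the same pattern as the SelfQueryTemplate shown in the next section, with BasePromptTemplate being the lesson's own abstract base class:

from langchain.prompts import PromptTemplate

class QueryExpansionTemplate(BasePromptTemplate):  # illustrative, not the repository's exact prompt
    prompt: str = """You are an AI language model assistant. Your task is to generate {to_expand_to_n}
    different versions of the given user question to retrieve relevant documents from a vector database.
    Provide these alternative questions separated by newlines.
    Original question: {question}"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["question", "to_expand_to_n"])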
4. Retrieval optimization 2: Self query
What if you could extract the tags within the query and use them along with the embedded query? That is what self query is all about! You use an LLM to extract various metadata fields that are critical for your business use case (e.g., tags, author ID, number of comments, likes, shares, etc.). In our custom solution, we are extracting just the author ID. Thus, a zero shot prompt engineering technique will do the job. _Self queries work hand in hand with vector filter searches, which we will explain in the next section._
To define the _SelfQueryTemplate_, we have to: subclass the base abstract class; define the self query prompt; create the LangChain PromptTemplate wrapper:

class SelfQueryTemplate(BasePromptTemplate):
    prompt: str = """You are an AI language model assistant. Your task is to extract information from a user question.
    The required information that needs to be extracted is the user id.
    Your response should consist of only the extracted id (e.g. 1345256), nothing else.
    User question: {question}"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["question"], verbose=True)

5. Retrieval optimization 3: Hybrid filtered vector search
Combine the vector search technique with one or more complementary search strategies, which works great for finding exact words. It is not defined which algorithms are combined, but the most standard strategy for hybrid search is to combine traditional keyword based search with modern vector search.
_How are these combined?_ _The first method is to merge the similarity scores of the 2 techniques as follows_:
hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score
where alpha takes a value between [0, 1], with alpha = 1 being pure vector search and alpha = 0 being pure keyword search. Also, the similarity scores are defined as follows: sparse_score is the result of the _keyword search_ that, behind the scenes, uses the BM25 algorithm [7], which sits on top of TF IDF; dense_score is the result of the _vector search_, which most commonly uses a similarity metric such as cosine distance.
_The second method uses the vector search technique as usual and applies a filter based on your keywords on top of the metadata of the retrieved results._ This is also known as filtered vector search. In this use case, the similarity score is not changed based on the provided keywords. It is just a fancy word for a simple filter applied to the metadata of your vectors. But it is essential to understand the difference between the first and second methods: the first method combines the similarity score between the keywords and vectors using the alpha parameter; the second method is a simple filter on top of your vector search.
How does this fit into our architecture? Remember that during the self query step, we extracted the author_id as an exact field that we have to match. Thus, we will search for the author_id using the keyword search algorithm and attach it to the 5 queries generated by the query expansion step.
_As we want the most relevant chunks from a given author, it makes the most sense to use a filter on the author_id, as follows (filtered vector search)_:

self._qdrant_client.search(
    collection_name="vector_posts",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_id",
                match=models.MatchValue(value=metadata_filter_value),
            )
        ]
    ),
    query_vector=self._embedder.encode(generated_query).tolist(),
    limit=k,
)

Note that we can easily extend this with multiple keywords (e.g., tags), making the combination of self query and hybrid search a powerful retrieval duo. The only question you have to ask yourself is whether you want to use a simple vector search filter or the more complex hybrid search strategy.
6. Implement the advanced retrieval Python class
_Now that you've understood the advanced retrieval optimization techniques we're using, let's combine them into a Python retrieval class_: Query expansion chains wrapper (GitHub Code). Now the final step is to call Qdrant for each query generated by the query expansion step: VectorRetriever main search function (GitHub Code). _Note that we have 3 types of data: posts, articles, and code repositories._ Thus, we have to make a query for each collection and combine the results in the end. We gather data from each collection individually and keep the best retrieved results using rerank, which is the final step of the article.
7. Post retrieval optimization: Rerank using GPT 4
We made a different search in the Qdrant vector DB for each of the N prompts generated by the query expansion step. Each search returns K results. Thus, we end up with N x K chunks. In our particular case, N = 5 and K = 3, so we end up with 15 chunks.
Post retrieval optimization: rerank
We will use rerank to order all the N x K chunks based on their relevance relative to the initial question, where the first one will be the most relevant and the last chunk the least. Ultimately, we will pick the TOP K most relevant chunks. Rerank works really well when combined with query expansion. _A natural flow when using rerank is as follows_: search for K chunks, reorder using rerank, take the top K. Thus, when combined with query expansion, we gather potentially useful context from multiple points in space rather than just looking for more than K samples in a single location. _Now the flow looks like_: search for N x K chunks, reorder using rerank, take the top K.
A typical solution for reranking is to use open source cross encoders from sentence transformers [4]. These models take both the question and the context as input and return a score from 0 to 1. In this article, we want to take a different approach and use GPT 4 prompt engineering as our reranker. If you want to see how to apply rerank using open source algorithms, check out this hands on article from Decoding ML: A Real time Retrieval System for RAG on Social Media Data, Paul Iusztin, Mar 7. Read full story.
Now let's see our implementation using GPT 4 prompt engineering. Similar to what we did for the expansion and self query chains, we define a template and a chain builder:

class RerankingTemplate(BasePromptTemplate):
    prompt: str = """You are an AI language model assistant. Your task is to rerank passages related to a query based on their relevance.
    The most relevant passages should be put at the beginning.
    You should only pick at max k passages.
    The following are passages related to this query: {question}.
    Passages: {passages}"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["question", "passages"])

...and that's it!
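To see how the pieces fit together end to end, here is a schematic sketch of the retrieval flow with N = 5 and K = 3, as in the text; the helper functions expand_query, search_collection, and rerank_with_gpt4 are hypothetical placeholders standing in for the chains and Qdrant calls described above:

def retrieve(query: str, author_id: str, n: int = 5, k: int = 3) -> list[str]:
    # 1) Query expansion: 1 query -> N queries.
    expanded_queries = expand_query(query, to_expand_to_n=n)  # hypothetical helper
    # 2) Filtered vector search for each expanded query -> N x K chunks.
    chunks: list[str] = []
    for expanded_query in expanded_queries:
        chunks += search_collection(expanded_query, author_id, limit=k)  # hypothetical helper
    # 3) Rerank all N x K chunks against the original question and keep only the top K.
    return rerank_with_gpt4(question=query, passages=chunks, keep_top_k=k)  # hypothetical helper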
Conclusion
_Congratulations!_ In Lesson 5, you learned to build an advanced RAG retrieval module optimized for searching posts, articles, and code repositories from a Qdrant vector DB.
First, you learned about where the RAG pipeline can be optimized: pre retrieval, retrieval, and post retrieval. After that, you learned how to build from scratch, without using LangChain's utilities, the following advanced RAG retrieval and post retrieval optimization techniques: query expansion, self query, hybrid search, and rerank.
Ultimately, you understood where the retrieval component sits in a production RAG LLM system, where the code is shared between multiple microservices and doesn't sit in a single Notebook.
_Next week, in Lesson 6, we will move to the training pipeline and show you how to automatically transform the data crawled from LinkedIn, Substack, Medium, and GitHub into an instruction dataset using GPT 4 to fine tune your LLM Twin._ See you there!
Next Steps
Step 1: This is just the short version of Lesson 5 on the advanced RAG retrieval module. For the full implementation, a discussion of our custom implementation vs. LangChain, more on the problems these 4 advanced RAG techniques solve, and how to use the retrieval module, check out the full version of Lesson 5 on our Medium publication. It's still FREE: Lesson 5 FREE Medium Article.
Step 2: Check out the LLM Twin GitHub repository and try it yourself. _Nothing compares with getting your hands dirty and building it yourself!_ LLM Twin Course GitHub.
Images: If not otherwise stated, all images are created by the author.
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/the-4-advanced-rag-algorithms-you?r=1ttoeh" }, { "id": "037e6362-8be7-4860-992f-1f075921a669", "content": "Problems deploying your ML models? Here is your solution!
PyTorch CUDA ultimate guide. Synthetic data generation. Serverless infrastructure.
Paul Iusztin, Apr 27, 2024
_Decoding ML Notes_
This week's topics: The ultimate guide on installing PyTorch with CUDA support in all possible ways; Generate a synthetic domain specific Q A dataset in 30 minutes; The power of serverless in the world of ML.
Exciting news: I was invited by Maven to speak in their Lightning Lesson series about how to Architect Your LLM Twin. Register here, it's free. This 30 min session is for ML and MLOps engineers who want to learn:
LLM system design of your LLM Twin, using the 3 pipeline architecture and MLOps good practices
Design a data collection pipeline (data crawling, ETLs, CDC, AWS)
Design a feature pipeline (streaming engine in Python, data ingestion for fine tuning and RAG, vector DBs)
Design a training pipeline (create a custom dataset, fine tuning, model registries, experiment trackers, LLM evaluation)
Design an inference pipeline (real time deployment, REST API, RAG, LLM monitoring)
Join LIVE on Fri, May 3! Register here, it's free.
The ultimate guide on installing PyTorch with CUDA support in all possible ways
Ever wanted to quit ML while wrestling with CUDA errors? I know I did. Discover how to install CUDA and PyTorch painlessly in all possible ways. Here is the story of most ML people:
1. You just got excited about a new model that came out.
2. You want to try it out.
3. You install everything.
4. You run the model.
5. Bam... CUDA error.
6. You fix the error.
7. Bam... Another CUDA error.
8. You fix the error.
9.
...Yet another CUDA error. You get the idea. Now it is 3:00 am, and you finally solved all your CUDA errors and ran your model. Now it's time to do your actual work. Do you relate? If so... I started a Medium article where I documented good practices and step by step instructions on how to install CUDA and PyTorch with: Pip; Conda or Mamba; Poetry; Docker (plus a Docker entrypoint bash template).
Check it out: _The ultimate guide on installing PyTorch with CUDA support in all possible ways_.
Note: Feel free to comment with any improvements on how to install CUDA and PyTorch. Let's make the ultimate tutorial on installing these 2 beasts.
Generate a synthetic domain specific Q A dataset in 30 minutes
How do you generate a synthetic domain specific Q A dataset in 30 minutes to fine tune your open source LLM? This method is also known as finetuning with distillation. Here are its 3 main steps.
For example, let's generate a Q A fine tuning dataset used to fine tune a financial advisor LLM.
Step 1: Manually generate a few input examples
Generate a few input samples (around 3) that have the following structure:
user_context: describes the type of investor (e.g., I am a 28 year old marketing professional)
question: describes the user's intention (e.g., Is Bitcoin a good investment option?)
Step 2: Expand the input examples with the help of a teacher LLM
Use a powerful LLM as a teacher (e.g., GPT4, Falcon 180B, etc.) to generate up to N similar input examples. We generated 100 input examples in our use case, but you can generate more. You will use the manually filled input examples to do few shot prompting. This will guide the LLM to give you domain specific samples. _The prompt will look like this_: Generate 100 more examples with the following pattern: USER CONTEXT 1 ... QUESTION 1 ... USER CONTEXT 2 ...
Step 3: Use the teacher LLM to generate outputs for all the input examples
Now, you will have the same powerful LLM as a teacher, but this time, it will answer all your N input examples. But first, to introduce more variance, we will use RAG to enrich the input examples with news context. Afterward, we will use the teacher LLM to answer all N input examples. ...and bam! You generated a domain specific Q A dataset with almost 0 manual work.
Now, you will use this data to train a smaller LLM (e.g., Falcon 7B) on a niched task, such as financial advising.
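To make steps 2 and 3 concrete, here is a minimal sketch assuming the OpenAI Python client as the teacher LLM; the prompts, model name, and the parse_examples and get_news_context helpers are illustrative placeholders:

from openai import OpenAI

client = OpenAI()  # teacher LLM; assumes OPENAI_API_KEY is set

def ask_teacher(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Step 2: expand a handful of manually written examples via few shot prompting.
seed_examples = [
    {"user_context": "I am a 28 year old marketing professional", "question": "Is Bitcoin a good investment option?"},
]
expansion_prompt = "Generate 100 more examples with the following pattern:\n" + "\n".join(
    f"USER CONTEXT: {ex['user_context']}\nQUESTION: {ex['question']}" for ex in seed_examples
)
generated_inputs = ask_teacher(expansion_prompt)

# Step 3: have the teacher answer every generated input, enriched with news context via RAG.
dataset = []
for example in parse_examples(generated_inputs):      # hypothetical parsing helper
    context = get_news_context(example["question"])   # hypothetical RAG helper
    answer = ask_teacher(f"{example['user_context']}\nNews context: {context}\n{example['question']}")
    dataset.append({**example, "answer": answer})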
This technique is known as finetuning with distillation because you use a powerful LLM as the teacher (e.g., GPT4, Falcon 180B) to generate the data, which will be used to fine tune a smaller LLM (e.g., Falcon 7B) that acts as the student.
Generate a Q A dataset in 30 minutes
Note: To ensure that the generated data is of high quality, you can hire a domain expert to check and refine it.
The power of serverless in the world of ML
Deploying and managing ML models is hard, especially when running your models on GPUs. But serverless makes things easy. Using Beam as your serverless provider, deploying and managing ML models can be as easy as:
Define your infrastructure and dependencies: In a few lines of code, you define the application that contains the requirements of your infrastructure (such as the CPU, RAM, and GPU), the dependencies of your application, and the volumes from which you can load your data and store your artifacts.
Deploy your jobs: Using the Beam application, you can quickly decorate your Python functions to run them once on the given serverless application, put your task or job in a queue to be processed (or even schedule it using a CRON based syntax), or even deploy it as a RESTful API endpoint.
As you can see in the image below, you can have one central function for training or inference, and with minimal effort, you can switch between all these deployment methods. Also, you don't have to bother at all with managing the infrastructure on which your jobs run. You specify what you need, and Beam takes care of the rest. By doing so, you can directly start to focus on your application and stop worrying about the infrastructure. This is the power of serverless!
Beam example
Check out Beam to learn more.
Images: If not otherwise stated, all images are created by the author.
Please turn on JavaScript or unblock scripts en", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/problems-deploying-your-ml-models?r=1ttoeh" }, { "id": "c91e76e3-774c-43e7-91db-01c0c6bff57a", "content": "Streaming Pipelines for LLMs and RAG by Paul Iusztin SOTA streaming pipeline in Python to clean, chunk, embed and load data to a vector DB feature store in real time for fine tuning LLMs and RAG on AWS . SubscribeSign in Share this post SOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! decodingml.substack.com Copy link Facebook Email Note Other SOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! Use a Python streaming engine to populate a feature store from 4 data sources Paul Iusztin Apr 25, 2024 11 Share this post SOTA Python Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! decodingml.substack.com Copy link Facebook Email Note Other Share the 4th out of 11 lessons of the LLM Twin free course What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM. Image by DALL E Why is this course different? _By finishing the LLM Twin Building Your Production Ready AI Replica _ _free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices_. _ Why should you care? _ _ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system._ More details on what you will learn within the LLM Twin course , here Latest Lessons of the LLM Twin Course Lesson 1 An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin LLM Twin Concept, 3 Pipeline Architecture, System Design for LLM Twin Lesson 2 The importance of Data Pipeline in the era of Generative AI Data crawling, ETL pipelines, ODM, NoSQL Database Lesson 3 CDC Enabling Event Driven Architectures Change Data Capture CDC , MongoDB Watcher, RabbitMQ queue Lesson 4 Streaming Pipelines for Fine tuning LLMs and RAG in Real Time! In the 4th lesson , we will focus on the feature pipeline. The feature pipeline is the first pipeline presented in the 3 pipeline architecture feature, training and inference pipelines. A feature pipeline takes raw data as input, processes it into features, and stores it in a feature store, from which the training inference pipelines will use it. The component is completely isolated from the training and inference code. All the communication is done through the feature store. By the end of this article , you will learn to design and build a production ready feature pipeline that uses Bytewax as a stream engine to process data in real time ingests data from a RabbitMQ queue uses SWE practices to process multiple data types posts, articles, code cleans, chunks, and embeds data for LLM fine tuning and RAG loads the features to a Qdrant vector DB. Note that we will only cover the vector DB retrieval client and advanced retrieval techniques in the 5th lesson ! _Excited? Let s get started!_ Table of Contents 1. Why are we doing this? 2. System design of the feature pipeline 3. The Bytewax streaming flow 4. Pydantic data models 5. Load data to Qdrant our feature store 6. The dispatcher layer Check out the code on GitHub 1 and support us with a 1 . Why are we doing this? 
A quick reminder from previous lessons To give you some context, in Lesson 2, we crawl data from LinkedIn, Medium, and GitHub, normalize it, and load it to MongoDB. In Lesson 3, we are using CDC to listen to changes to the MongoDB database and emit events in a RabbitMQ queue based on any CRUD operation done on MongoDB. The problem we are solving In our LLM Twin use case, the feature pipeline constantly syncs the MongoDB warehouse with the Qdrant vector DB our feature store while processing the raw data into features. Why we are solving it The feature store will be the central point of access for all the features used within the training and inference pipelines. The training pipeline will use the feature store to create fine tunin g datasets for your LLM twin . The inference pipeline will use the feature store for RAG . 2 . System design of the feature pipeline our solution _Our solution is based on CDC , a queue, a streaming engine, and a vector DB _ CDC adds any change made to the Mongo DB to the queue read more in Lesson 3 . the RabbitMQ queue stores all the events until they are processed. The Bytewax streaming engine cleans, chunks, and embeds the data. A streaming engine works naturally with a queue based system. The data is uploaded to a Qdrant vector DB on the fly Why is this powerful? Here are 4 core reasons 1. The data is processed in real time . 2. Out of the box recovery system If the streaming pipeline fails to process a message will be added back to the queue 3. Lightweight No need for any diffs between databases or batching too many records 4. No I O bottlenecks on the source database It solves all our problems! Streaming ingestion pipeline architecture and integration with the rest of the components How do we process multiple data types? How do you process multiple types of data in a single streaming pipeline without writing spaghetti code ? Yes, that is for you, data scientists! Joking am I ? We have 3 data types posts, articles, and code. Each data type and its state will be modeled using Pydantic models . To process them, we will write a dispatcher layer , which will use a creational factory pattern to instantiate a handler implemented for that specific data type post, article, code and operation cleaning, chunking, embedding . The handler follows the strategy behavioral pattern. Streaming over batch Nowadays, using tools such as Bytewax makes implementing streaming pipelines a lot more frictionless than using their JVM alternatives. The key aspect of choosing a streaming vs. a batch design is real time synchronization between your source and destination DBs. In our particular case, we will process social media data, which changes fast and irregularly. Also, for our digital twin, it is important to do RAG on up to date data. We don t want to have any delay between what happens in the real world and what your LLM twin sees. That being said, choosing a streaming architecture seemed natural in our use case. 3 . The Bytewax streaming flow The Bytewax flow is the central point of the streaming pipeline . It defines all the required steps, following the next simplified pattern _ input processing output ._ As I come from the AI world, I like to see it as the graph of the streaming pipeline , where you use the _input _ , _map _ , and _output _ Bytewax functions to define your graph, which in the Bytewax world is called a _ flow _. As you can see in the code snippet below, we ingest posts, articles or code messages from a RabbitMQ queue. After we clean, chunk and embed them. 
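Since the actual snippet lives in the GitHub repo, here is a minimal, self-contained sketch of the flow's shape, assuming Bytewax's 0.18-style operators API; the clean/chunk/embed functions and the in-memory source and stdout sink are placeholders for the course's custom RabbitMQ source, data-type handlers, and Qdrant sink.

```python
# Sketch of an input -> map -> output Bytewax flow (assumes Bytewax >= 0.18).
# TestingSource / StdOutSink stand in for the real RabbitMQ source and Qdrant sink.
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource
from bytewax.connectors.stdio import StdOutSink

def clean(message: dict) -> dict:
    # strip markup, normalize whitespace, etc.
    return {**message, "text": " ".join(message["text"].split())}

def chunk(message: dict) -> list[dict]:
    # naive fixed-size chunking; the course uses data-type-specific handlers
    text = message["text"]
    return [{**message, "chunk": text[i : i + 500]} for i in range(0, len(text), 500)]

def embed(chunked: dict) -> dict:
    # placeholder embedding; the course calls an embedding model here
    chunked["embedding"] = [float(len(chunked["chunk"]))]
    return chunked

flow = Dataflow("streaming_ingestion")
messages = op.input("input", flow, TestingSource([{"type": "post", "text": "raw LinkedIn post ..."}]))
cleaned = op.map("clean", messages, clean)
chunked = op.flat_map("chunk", cleaned, chunk)
embedded = op.map("embed", chunked, embed)
op.output("output", embedded, StdOutSink())
```

You would run such a flow with `python -m bytewax.run your_module:flow`; swapping the testing source and stdout sink for the RabbitMQ source and Qdrant sink gives the shape of the real pipeline.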
Ultimately, we load the cleaned and embedded data to a Qdrant vector DB, which in our LLM twin use case will represent the feature store of our system. To structure and validate the data, between each Bytewax step, we map and pass a different Pydantic model based on its current state raw, cleaned, chunked, or embedded. Bytewax flow GitHub Code We have a single streaming pipeline that processes everything. As we ingest multiple data types posts, articles, or code snapshots , we have to process them differently. To do this the right way, we implemented a dispatcher layer that knows how to apply data specific operations based on the type of message. More on this in the next sections Why Bytewax? _Bytewax is an open source streaming processing framework that _ is built in Rust for performance has Python bindings for leveraging its powerful ML ecosystem so, for all the Python fanatics out there, no more JVM headaches for you. Jokes aside, here is why Bytewax is so powerful Bytewax local setup is plug and play can quickly be integrated into any Python project you can go wild even use it in Notebooks can easily be integrated with other Python packages NumPy, PyTorch, HuggingFace, OpenCV, SkLearn, you name it out of the box connectors for Kafka and local files, or you can quickly implement your own We used Bytewax to build the streaming pipeline for the LLM Twin course and loved it. To learn more about Bytewax , check out their Substack , where you have the chance to dive deeper into streaming engines . In Python. For FREE Bytewax Newsletter 4 . Pydantic data models Let s take a look at what our Pydantic models look like. We defined a hierarchy of Pydantic models for all our data types posts, articles, or code all our states raw, cleaned, chunked, and embedded This is how the set of classes for the posts will look like Pydantic posts model structure GitHub Code We repeated the s ame process for the articles and code model hierarchy . 5 . Load data to Qdrant our feature store The first step is to implement our custom Bytewax _DynamicSink_ class Qdrant DynamicSink GitHub Code Next, for every type of operation we need output cleaned or embedded data , we have to subclass the _StatelessSinkPartition_ Bytewax class they also provide a stateful option more in their docs An instance of the class will run on every partition defined within the Bytewax deployment. In the course, we are using a single partition per worker. But, by adding more partitions and workers , you can quickly scale your Bytewax pipeline horizontally. Remember why we upload the data to Qdrant in two stages , as the Qdrant vector DB will act as our feature store 1. The _cleaned data_ will be used for _LLM fine tuning_ used by the training pipeline 2. The _chunked embedded_ data will be used for _RAG used by the inference pipeline _ Qdrant worker partitions GitHub Code Note that we used Qdrant s Batch method to upload all the available points simultaneously. By doing so, we reduce the latency on the network I O side more on that here 6 . The dispatcher layer Now that we have the Bytewax flow and all our data models. How do we map a raw data model to a cleaned data model? 
All our domain logic is modeled by a set of _Handler_ classes: CleaningDataHandler, ChunkingDataHandler, and EmbeddingDataHandler. Now, to build our dispatcher, we need 2 last components: a factory class, which instantiates the right handler based on the type of the event, and a dispatcher class, the glue code that calls the factory class and the handler. Here is what the cleaning dispatcher and factory look like: The dispatcher and factory classes GitHub Code. Note that we will have a different Handler for every (data_type, state) pair, resulting in 3 x 3 = 9 different handlers. For example, we will have 3 handlers based on their data type for the cleaned post state: PostCleaningHandler, ArticleCleaningHandler, and RepositoryCleaningHandler. By repeating the same logic, we will end up with the following set of dispatchers: _RawDispatcher_ (no factory class required, as the data is not processed), _CleaningDispatcher_ (with a _CleaningHandlerFactory_ class), _ChunkingDispatcher_ (with a _ChunkingHandlerFactory_ class), and _EmbeddingDispatcher_ (with an _EmbeddingHandlerFactory_ class). To summarize: in Lesson 4 of the LLM Twin course, we learned how to design a streaming pipeline in Python using Bytewax, load data to a Qdrant vector DB, use Pydantic models to add types and validation to the data points, and implement a dispatcher layer to process multiple data types in a modular way. _In Lesson 5, which will be held in two weeks, we will focus on the vector DB retrieval client and advanced retrieval techniques._ Next Steps: to dig into the details of the streaming pipeline (how to implement cleaning, chunking, and embedding strategies for digital data, design the AWS infrastructure for the streaming pipeline, and understand how to run the component), check out the full fledged version of the article on our Medium publication. Lesson 4 FREE Medium Article. Images: if not otherwise stated, all images are created by the author. ", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/sota-python-streaming-pipelines-for?r=1ttoeh" }, { "id": "53bc94d1-8cfd-4e65-b55c-9b3582f6ed64", "content": "Ready for production ML? Here are the 4 pillars to build production ML systems: ML Platforms, MLOps Components, and RAG. RAG: what problems does it solve, and how is it integrated into LLM powered applications? Paul Iusztin Apr 13, 2024
_Decoding ML Notes_ This week's topics: Using an ML Platform is critical to integrating MLOps into your project; The 4 pillars to build production ML systems; RAG: what problems does it solve, and how is it integrated into LLM powered applications? Using an ML Platform is critical to integrating MLOps into your project. Here are 6 ML platform features you must know and use ...and let's use Comet ML as a concrete example. 1. Experiment Tracking. In your ML development phase, you generate lots of experiments. Tracking and comparing the metrics between them is crucial in finding the optimal model hyperparameters. 2. Metadata Store. Its primary purpose is reproducibility. To know how a model from a specific experiment was generated, you must know the version of the code, the version of the dataset, the hyperparameters config, the total compute ...and more. 3. Visualisations. Most of the time, along with the scalar metrics, you must log visual results, such as images, videos, prompts, t-SNE graphs, 3D point clouds ...and more. 4. Artifacts. The most powerful feature out of them all. An artifact is a versioned object that acts as an input or output for your job. Everything can be an artifact (data, model, code), but the most common case is for your data. Wrapping your assets around an artifact ensures reproducibility and shareability. For example, you wrap your features into an artifact (e.g., features:3.1.2), which you can consume and share across multiple ML environments (development or continuous training). Using an artifact to wrap your data allows you to quickly respond to questions such as "What data have I used to generate the model?" and "What version?" 5. Model Registry. The model registry is the ultimate way to version your models and make them accessible to all your services. For example, your continuous training pipeline will log the weights as an artifact into the model registry after it trains the model. You label this model as v1.1.5 staging and prepare it for testing. If the tests pass, mark it as v1.1.0 production and trigger the CI/CD pipeline to deploy it to production. 6. Webhooks. Webhooks let you integrate the Comet model registry with your CI/CD pipeline. For example, when the model status changes from Staging to Production, a POST request triggers a GitHub Actions workflow to deploy your new model. Image by the Author. Check out Comet to learn more. The 4 pillars to build production ML systems. Before building a production ready system, it is critical to consider a set of questions that will later determine the nature of your ML system architecture.
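As a quick aside before the four pillars, here is a minimal sketch of what the experiment tracking, artifact, and model registry features described above can look like in code, assuming Comet ML's Python SDK; the project, metric, and file names are hypothetical.

```python
# Minimal sketch of ML-platform features, assuming Comet ML's Python SDK.
# Project, metric, and file names are hypothetical placeholders.
from comet_ml import Artifact, Experiment

experiment = Experiment(project_name="llm-twin-training")  # reads COMET_API_KEY from the environment

# 1-3. experiment tracking, metadata, and visual results
experiment.log_parameters({"lr": 3e-4, "lora_rank": 16, "dataset_version": "3.1.2"})
experiment.log_metric("train/loss", 0.42, step=100)
experiment.log_image("confusion_matrix.png")  # hypothetical local file

# 4. wrap the dataset into a versioned, shareable artifact
features = Artifact(name="features", artifact_type="dataset")
features.add("data/features.parquet")  # hypothetical local file
experiment.log_artifact(features)

# 5. log the trained weights so they can later be promoted in the model registry
experiment.log_model("llm-twin", "output/adapter_model.bin")  # hypothetical path

experiment.end()
```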
Here are the 4 pillars that you always have to consider before designing any system. Data: What data types do you have (e.g., tabular data, images, text, etc.)? What does the data look like (e.g., for text data, is it in a single language or multiple)? How do you collect the data? At what frequency do you have to collect the data? How do you collect labels for the data (crucial for how you plan to evaluate and monitor the model in production)? Throughput: What are the throughput requirements? You must know at least the throughput's minimum, average, and maximum statistics. How many requests must the system handle simultaneously (1, 10, 1k, 1 million, etc.)? Latency: What are the latency requirements (1 millisecond, 10 milliseconds, 1 second, etc.)? The throughput vs. latency trade-off. The accuracy vs. speed trade-off. Infrastructure: Batch vs. real-time architecture (closely related to the throughput vs. latency trade-off). How should the system scale (e.g., based on CPU workload, number of requests, queue size, data size, etc.)? Cost requirements. Do you see how we shifted the focus from model performance towards how it is integrated into a more extensive system? When building production ready ML, the model's accuracy is no longer the holy grail but a bullet point in a grander scheme. To summarize, the 4 pillars to keep in mind before designing an ML architecture are: Data, Throughput, Latency, and Infrastructure. Image by the Author. RAG: what problems does it solve, and how is it integrated into LLM powered applications? Let's find out. RAG is a popular strategy when building LLMs to add external data to your prompt. Problem: working with LLMs has 3 main issues. 1. The world moves fast. LLMs learn an internal knowledge base. However, the issue is that their knowledge is limited to the training dataset. The world moves fast. New data flows on the internet every second. Thus, the model's knowledge base can quickly become obsolete. One solution is to fine tune the model every minute or day... If you have some billions to spend around, go for it. 2. Hallucinations. An LLM is full of testosterone and likes to be blindly confident. Even if the answer looks 100% legit, you can never fully trust it. 3.
Lack of reference links. It is hard to trust the response of the LLM if we can't see the source of its decisions, especially for important decisions (e.g., health, financials). Solution: Surprise! It is RAG. 1. Avoid fine tuning. Using RAG, you use the LLM as a reasoning engine and the external knowledge base as the main memory (e.g., a vector DB). The memory is volatile, so you can quickly introduce or remove data. 2. Avoid hallucinations. By forcing the LLM to answer solely based on the given context, the LLM will provide an answer as follows: use the external data to respond to the user's question if it contains the necessary insights, or reply "I don't know" if not. 3. Add reference links. Using RAG, you can easily track the source of the data and highlight it to the user. How does RAG work? Let's say we want to use RAG to build a financial assistant. What do we need? A data source with historical and real-time financial news (e.g., Alpaca), a stream processing engine (e.g., Bytewax), an encoder-only model for embedding the docs (e.g., pick one from sentence transformers), and a vector DB (e.g., Qdrant). How does it work? On the feature pipeline side: 1. using Bytewax, you ingest the financial news and clean them; 2. you chunk the news documents and embed them; 3. you insert the embeddings of the docs along with their metadata (e.g., the initial text, source_url, etc.) into Qdrant. On the inference pipeline side: 4. the user question is embedded using the same embedding model; 5. using this embedding, you extract the top K most similar news documents from Qdrant; 6. along with the user question, you inject the necessary metadata from the extracted top K documents into the prompt template (e.g., the text of the documents and their source_url); 7. you pass the whole prompt to the LLM for the final answer. Image by the Author.
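Here is a minimal sketch of the inference-side steps (4 to 7), assuming a Qdrant collection already populated by the feature pipeline; the collection name, embedding model, and payload fields are illustrative, not taken from a specific codebase.

```python
# Sketch of RAG inference: embed the question, retrieve from Qdrant, build the prompt.
# Assumes a "financial_news" collection populated with {"text", "source_url"} payloads.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # encoder-only embedding model
qdrant = QdrantClient("localhost", port=6333)

def build_rag_prompt(question: str, top_k: int = 3) -> str:
    # 4. embed the user question with the same model used at ingestion time
    query_vector = embedder.encode(question).tolist()

    # 5. retrieve the top K most similar news documents from the vector DB
    hits = qdrant.search(
        collection_name="financial_news",
        query_vector=query_vector,
        limit=top_k,
    )

    # 6. inject the documents' text and source URLs into the prompt template
    context = "\n".join(
        f"- {hit.payload['text']} (source: {hit.payload['source_url']})" for hit in hits
    )
    return (
        "Answer the question using only the news context below. "
        "If the context is not enough, say 'I don't know'.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# 7. pass the returned prompt to the LLM of your choice for the final answer
```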
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/ready-for-production-ml-here-are?r=1ttoeh" }, { "id": "20a85606-a880-4894-bfb7-6b0cad8b3f1f", "content": "My monthly recommendations for leveling up in ML: in Vector DBs, RAG, MLOps, and LLMs. Paul Iusztin Apr 06, 2024 _Decoding ML Notes_ Today is about learning. Here is a list of learning resources I used and filtered in the past months. It is some of the most helpful content on Vector DBs, RAG, MLOps, and LLMs out there. This week's topics: Pick the right vector DB for your exact use case; 4 video lectures on hands-on LLMs; 7 steps you have to check to achieve 100% MLOps maturity; Advanced RAG. Pick the right vector DB for your exact use case. This is the only resource you need to pick the right vector DB for your exact use case. Since ChatGPT made AI cool, besides the millions of ChatGPT posts you got tired of and blocked, you realized that a new type of tool started to hit the scene: vector DBs. As vector DBs play a crucial role in most LLM applications, they popped up everywhere. As of today, there are 37 vector DB solutions that are constantly changing and adding features. Now, how the h*ll should I pick one? Screenshot from Superlinked. Here is where the Vector DB Comparison kicks in. It is an effort managed by Superlinked, where they carefully compared all these 37 vector DBs across 29 features, such as license; GitHub stars; support for text, image or struct models; RAG, RecSys, LangChain or LlamaIndex APIs; pricing; sharding; document size; vector dims ...and more! I won't list all 29 features. You have to check it out to see them for yourself: Vector DB Comparison. Note: to keep the table updated or add more features, you can contribute to it yourself.
4 video lectures on hands-on LLMs. Want to build your first LLM project but don't know where to start? Here are 4 FREE lectures, made by Pau Labarta Bajo from Real World Machine Learning, to put you on the right track. 1. Fine-tuning pipeline for open-source LLMs. You will learn: what model fine-tuning is, why it is useful, and when to use it; why to fine-tune an LLM using QLoRA; how to architect a fine-tuning pipeline in a real-world project. 2. Hands-on fine-tuning. Let's apply what we learned in lesson 1 to build our first fine-tuning pipeline. 3. Build and deploy a real-time streaming pipeline. You will learn: how to transform HTML docs into vector embeddings; how to process data in real time; how to store and retrieve embeddings from a vector DB; how to deploy it to AWS. 4. Inference pipeline. Finally, you will learn how to use LangChain to glue together your fine-tuned LLM and your financial news stored as embeddings in a vector DB to serve predictions behind a RESTful API. 7 steps you have to check to achieve 100% MLOps maturity. One of the most confusing words in the ML world is MLOps, a new interdisciplinary process that isn't fully defined yet.
The good news is that there is a strong movement towards defining a clear structure for scoring the level of MLOps maturity within your organization or project. Here are 7 steps you have to check to achieve 100% MLOps maturity. None other than Maria Vechtomova from MarvelousMLOps has proposed it. Here they are. Must haves: 1. Documentation: project, ML model, and technical documentation. 2. Traceability and reproducibility: infrastructure traceability and reproducibility (versioned IaC under CI/CD) and ML code traceability and reproducibility (versioned code, data, and models, along with metadata and lineage attached to the data and model). 3. Code quality: infrastructure code and ML model code quality requirements (tests run on PRs under the CI pipeline, PR reviews, formatting checks). 4. Monitoring support: infrastructure, application, model performance, business KPIs, data drift, and outlier monitoring. Beyond basic MLOps: 5.
Data transformation pipelines and a Feature store: all the features are shared and versioned from a central feature store. 6. Model Explainability: a human can understand the reasoning of the model and not treat it as a black box. 7. A/B testing and a feedback loop: inputs and outputs of the model are stored automatically, and A/B testing is performed regularly. Check out the entire questionnaire on the MarvelousMLOps blog: MLOps maturity assessment. MLOps Maturity Assessment by Marvelous MLOps. What level of MLOps maturity is your organization at? For now, you will rarely see 100%. Advanced RAG. RAG systems are far from perfect. This free course teaches you how to improve your RAG system. I recently finished the Advanced Retrieval for AI with Chroma free course from DeepLearning.AI. Screenshot from the Advanced Retrieval for AI with Chroma course. If you are into RAG, I find it among the most valuable learning sources. The course already assumes you know what RAG is. Its primary focus is to show you all the current issues of RAG and why it is far from perfect. Afterward, it shows you the latest SoTA techniques to improve your RAG system, such as query expansion, cross-encoder re-ranking, and embedding adaptors. I am not affiliated with DeepLearning.AI (I wouldn't mind, though). This is a great course you should take if you are into RAG systems. The good news is that it is free and takes only 1 hour. Check it out: Advanced Retrieval for AI with Chroma.
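To make one of those techniques concrete, here is a minimal sketch of cross-encoder re-ranking with sentence-transformers; the model name and candidate documents are illustrative, not taken from the course.

```python
# Sketch of cross-encoder re-ranking: score each (query, document) pair jointly,
# then keep the best-scoring documents for the final RAG prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do rising interest rates affect tech stocks?"
candidates = [
    "Tech valuations are sensitive to discount rates, so higher rates compress multiples.",
    "A recipe for banana bread with walnuts.",
    "Growth stocks tend to fall when rates rise because future cash flows are discounted more heavily.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[:2])  # the two most relevant documents for the prompt
```

The idea is to retrieve a generous top K with the (cheap) bi-encoder first, then let the (more expensive but more precise) cross-encoder decide which few documents actually enter the prompt.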
Please turn on JavaScript or unblock scripts en", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/my-ml-monthly-learning-resource-recommendations?r=1ttoeh" }, { "id": "ab66f3dc-2957-4ab9-9ed7-ece653d3f725", "content": "End to End Framework for Production Ready LLMs FREE course on designing, training, deploying, and monitoring a production ready LLM system powered by LLMs, vector DBs LLMOps by building your LLM twin. SubscribeSign in Share this post An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin decodingml.substack.com Copy link Facebook Email Note Other An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin From data gathering to productionizing LLMs using LLMOps good practices. Paul Iusztin Mar 28, 2024 35 Share this post An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin decodingml.substack.com Copy link Facebook Email Note Other Share _ the 1st out of 11 lessons of the LLM Twin free course_ What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality and voice into an LLM. Image by DALL E Why is this course different? _By finishing the LLM Twin Building Your Production Ready AI Replica _ _free course, you will learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices_. _ Why should you care? _ _ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end to end production grade LLM system._ More details on what you will learn within the LLM Twin course , here Are you ready to build your AI replica? Let s start with Lesson 1 Lesson 1 End to end framework for production ready LLM systems In the first lesson , we will present the project you will build during the course _your production ready LLM Twin AI replica._ Afterward , we will dig into the LLM project system design . We will present all our architectural decisions regarding the design of the _data collection pipeline_ for social media data and how we applied _the 3 pipeline architecture_ to our LLM microservices. In the following lessons , we will examine each component s code and learn how to implement and deploy it to AWS and Qwak. LLM twin system architecture Image by the Author What you will learn to build during this course. Table of Contents 1. What are you going to build? The LLM twin concept 2. LLM twin system design 1 . What are you going to build? The LLM twin concept The outcome of this course is to learn to build your own AI replica . We will use an LLM to do that, hence the name of the course _ LLM Twin Building Your Production Ready AI Replica. _ But what is an LLM twin? Shortly, your LLM twin will be an AI character who writes like you, using your writing style and personality. It will not be you. It will be your writing copycat. More concretely, you will build an AI replica that writes social media posts or technical articles like this one using your own voice. Why not directly use ChatGPT? You may ask When trying to generate an article or post using an LLM, the results tend to be very generic and unarticulated, contain misinformation due to hallucination , require tedious prompting to achieve the desired result. 
_ But here is what we are going to do to fix that _ First , we will fine tune an LLM on your digital data gathered from LinkedIn, Medium, Substack and GitHub. By doing so, the LLM will align with your writing style and online personality. It will teach the LLM to talk like the online version of yourself. Our use case will focus on an LLM twin who writes social media posts or articles that reflect and articulate your voice. Secondly , we will give the LLM access to a vector DB to access external information to avoid hallucinating. Ultimately , in addition to accessing the vector DB for information, you can provide external links that will act as the building block of the generation process. Excited? Let s get started 2 . LLM Twin System design Let s understand how to apply the 3 pipeline architecture to our LLM system . The architecture of the LLM twin is split into 4 Python microservices 1. The data collection pipeline 2. The feature pipeline 3. The training pipeline 4. The inference pipeline LLM twin system architecture Image by the Author _Now, let s zoom in on each component to understand how they work individually and interact with each other. _ 2.1. The data collection pipeline Its scope is to crawl data for a given user from Medium articles Substack articles LinkedIn posts GitHub code As every platform is unique, we implemented a different Extract Transform Load ETL pipeline for each website. However, the baseline steps are the same for each platform . _Thus, for each ETL pipeline, we can abstract away the following baseline steps _ log in using your credentials use _selenium_ to crawl your profile use _BeatifulSoup_ to parse the HTML clean normalize the extracted HTML save the normalized but still raw data to Mongo DB Important note We are crawling only our data, as most platforms do not allow us to access other people s data due to privacy issues. But this is perfect for us, as to build our LLM twin, we need only our own digital data. Why Mongo DB? We wanted a NoSQL database that quickly allows us to store unstructured data aka text . How will the data pipeline communicate with the feature pipeline? We will use the Change Data Capture CDC pattern to inform the feature pipeline of any change on our Mongo DB. To explain the CDC briefly, a watcher listens 24 7 for any CRUD operation that happens to the Mongo DB. The watcher will issue an event informing us what has been modified. We will add that event to a RabbitMQ queue. The feature pipeline will constantly listen to the queue, process the messages, and add them to the Qdrant vector DB. For example, when we write a new document to the Mongo DB, the watcher creates a new event. The event is added to the RabbitMQ queue ultimately, the feature pipeline consumes and processes it. Where will the data pipeline be deployed? The data collection pipeline and RabbitMQ service will be deployed to AWS. We will also use the freemium serverless version of Mongo DB. 2.2. The feature pipeline The feature pipeline is implemented usingBytewax a Rust streaming engine with a Python interface . Thus, in our specific use case , we will also refer to it as a streaming ingestion pipeline . It is an entirely different service than the data collection pipeline. How does it communicate with the data pipeline? As explained above, the feature pipeline communicates with the data pipeline through a RabbitMQ queue . Currently, the streaming pipeline doesn t care how the data is generated or where it comes from. 
It knows it has to listen to a given queue, consume messages from there and process them. By doing so, we decouple the two components entirely. What is the scope of the feature pipeline? It represents the ingestion component of the RAG system . It will take the raw data passed through the queue and clean the data chunk it embed it using the embedding models from Superlinked load it to the Qdrant vector DB. What data will be stored? The training pipeline will have access only to the feature store , which, in our case, is represented by the Qdrant vector DB. _With this in mind, we will store in Qdrant 2 snapshots of our data _ 1 . The cleaned data without using vectors as indexes store them in a NoSQL fashion . 2 . The cleaned, chunked, and embedded data leveraging the vector indexes of Qdrant The training pipeline needs access to the data in both formats as we want to fine tune the LLM on standard and augmented prompts. Why implement a streaming pipeline instead of a batch pipeline? There are 2 main reasons. The first one is that, coupled with the CDC pattern , it is the most efficient way to sync two DBs between each other. Using CDC a streaming pipeline, you process only the changes to the source DB without any overhead. The second reason is that by doing so, your source and vector DB will always be in sync . Thus, you will always have access to the latest data when doing RAG. Why Bytewax? Bytewax is a streaming engine built in Rust that exposes a Python interface. We use Bytewax because it combines Rust s impressive speed and reliability with the ease of use and ecosystem of Python. It is incredibly light, powerful, and easy for a Python developer. Where will the feature pipeline be deployed? The feature pipeline will be deployed to AWS. We will also use the freemium serverless version of Qdrant. 2.3. The training pipeline How do we have access to the training features? As section 2.2 highlights, all the training data will be accessed from the feature store . In our case, the feature store is the Qdrant vector DB that contains the cleaned digital data from which we will create prompts answers we will use the chunked embedded data for RAG to augment the cleaned data. _We will implement a different vector DB retrieval client for each of our main types of data posts, articles, code ._ What will the training pipeline do? The training pipeline contains a data to prompt layer that will preprocess the data retrieved from the vector DB into prompts. It will also contain an LLM fine tuning module that inputs a HuggingFace dataset and uses QLoRA to fine tune a given LLM e.g., Mistral . All the experiments will be logged into Comet ML s experiment tracker . We will use a bigger LLM e.g., GPT4 to evaluate the results of our fine tuned LLM. These results will be logged into Comet s experiment tracker. Where will the production candidate LLM be stored? We will compare multiple experiments, pick the best one, and issue an LLM production candidate for the model registry. After, we will inspect the LLM production candidate manually using Comet s prompt monitoring dashboard. Where will the training pipeline be deployed? The training pipeline will be deployed to Qwak. Qwak is a serverless solution for training and deploying ML models. It makes scaling your operation easy while you can focus on building. Also, we will use the freemium version of Comet ML for the following experiment tracker model registry prompt monitoring. 2.4. 
The inference pipeline The inference pipeline is the final component of the LLM system . It is the one the clients will interact with . It will be wrapped under a REST API . The clients can call it through HTTP requests, similar to your experience with ChatGPT or similar tools. How do we access the features? We will grab the features solely from the feature store. We will use the same Qdrant vector DB retrieval clients as in the training pipeline to use the features we need for RAG. How do we access the fine tuned LLM? The fine tuned LLM will always be downloaded from the model registry based on its tag e.g., accepted and version e.g., v1.0.2, latest, etc. . What are the components of the inference pipeline? The first one is the retrieval client used to access the vector DB to do RAG. After we have a query to prompt the layer, that will map the prompt and retrieved documents from Qdrant into a prompt. After the LLM generates its answer, we will log it to Comet s prompt monitoring dashboard and return it to the clients. For example, the client will request the inference pipeline to Write a 1000 word LinkedIn post about LLMs, and the inference pipeline will go through all the steps above to return the generated post. Where will the inference pipeline be deployed? The inference pipeline will be deployed to Qwak. As for the training pipeline, we will use a serverless freemium version of Comet for its prompt monitoring dashboard. Conclusion This is the 1st article of the _ LLM Twin Building Your Production Ready AI Replica _ free course. In this lesson, we presented what you will build during the course. Ultimately, we went through the system design of the course and presented the architecture of each microservice and how they interact with each other 1. The data collection pipeline 2. The feature pipeline 3. The training pipeline 4. The inference pipeline In Lesson 2 , we will dive deeper into the data collection pipeline , learn how to implement crawlers for various social media platforms, clean the gathered data, store it in a Mongo DB, and finally, show you how to deploy it to AWS. _ Check out the code on GitHub 1 and support us with a _ This is how we can further help you In the Decoding ML newsletter , we want to keep things short sweet . To dive deeper into all the concepts presented in this article Check out the full fledged version of the article on our Medium publication . It s FREE Detailed Lesson 1 on Medium 35 Share this post An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin decodingml.substack.com Copy link Facebook Email Note Other Share PreviousNext Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Paul Iusztin Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/an-end-to-end-framework-for-production?r=1ttoeh" }, { "id": "c4ad61cb-4875-41f6-a9d9-f0da74303586", "content": "Upskill your LLM knowledge base with these tools. Speed up your LLM inference and dissect the Attention Mechanism with step by step animation. SubscribeSign in Share this post Upskill your LLM knowledge base with these tools. 
Speed up your LLM inference and dissect the attention mechanism with step by step animation. Alex Razvant Mar 23, 2024 _Decoding ML Notes_ The LLM Twin Course development has taken off! Join aboard and learn how to design, build, and implement an end to end LLM replica, by following along in a step by step, hands on manner with the development of data pipelines, ingestion, LLM fine tuning, serving, monitoring, and more. The first 2 of 11 lessons are out, make sure to check them out here: Lesson 1: An End to End Framework for Production Ready LLM Systems by Building Your LLM Twin; Lesson 2: The Importance of Data Pipelines in the Era of Generative AI. This week's topics: fast inference on LLMs; visualize the attention mechanism; a commonly misunderstood CUDA issue! Fast inference on LLMs. For the last few years, LLMs have been a hot topic: new models, RAGs, new papers, the rise of open-source models, etc. The attention mechanism is easy to understand but hungry to compute, so multiple methods aim to fill the performance gap in model serving. Here are the top 4 LLM inference solutions: 1. vLLM: a fast and easy to use library for LLM inference and serving. Key aspects are: it is open source; state of the art serving throughput; fast model execution with optimized CUDA kernels and graphs; efficient memory management using PagedAttention; support for AMD GPUs (ROCm); deploy support with NVIDIA Triton, KServe, Docker. Get started: shorturl.at nAFPW. 2. TensorRT-LLM: a library that accelerates and optimizes inference performance of the latest LLMs. Key aspects are: it is open source; built on a strong TensorRT foundation; leverages custom optimized CUDA kernels for transformers; enhances customization; supports various optimizations (quantization, tensor parallelism); takes advantage of the NVIDIA Toolkit (perf analyzer, Triton). Get started: shorturl.at dluMX. 3. Ollama: a tool that allows you to run open source language models locally. Key aspects are: multi-modal model support; optimizes setup and configuration details, including GPU usage; bundles weights, configuration, and data into a single Modelfile package. Get started: shorturl.at dGZ46. 4.
Chat with RTX: a solution from NVIDIA that allows users to build their own personalized chatbot experience. Key aspects are: emphasizes a no-code, ChatGPT-like interface; one can connect custom documents, videos, notes, and PDFs; easy to set up RAG (Retrieval Augmented Generation) support for the latest LLMs; leverages TensorRT-LLM and RTX acceleration; downloadable installer (35GB), with out of the box Mistral and LLaMA 7B versions. Get started: shorturl.at ekuK6. Visualize the attention mechanism. LLM models are complex, and the key to understanding the process is the attention mechanism. Here are 3 tools to help you interactively visualize attention: 1. AttentionViz (shorturl.at DSY58): configurable number of heads; configurable number of layers; has ViT, BERT, GPT2 included; 2D visualization and 3D zoom-ins on selected layers. 2. PyTorch MM (shorturl.at lqJQY): custom operations; extensible in a graph-like fashion;
Visualize the attention mechanism

LLM models are complex, and the key to understanding the process is the attention mechanism. Here are 3 tools to help you interactively visualize attention:

1. AttentionViz: shorturl.at DSY58 configurable number of heads; configurable number of layers; has ViT, BERT, and GPT2 included; 2D visualization and 3D zoom ins on selected layers.

2. PyTorch MM: shorturl.at lqJQY custom operations; extensible in a graph like fashion; has GPT2 nano and the LoRA technique included; 3D visualization.

3. BByCroft: shorturl.at ivCR1 inspect, step by step, how 1 token is predicted; has GPT2 small, GPT3, GPT nano, and GPT2 XL included; straightforward to use.

A commonly misunderstood CUDA issue!

The problem was that nvidia smi was showing a different GPU device order compared to Docker or Python. Thus, errors regarding disjoint memory regions appeared. Here s the trick:

System Layer: nvidia smi works at the system level and orders GPUs respecting the top down order in which the physical video cards are inserted into the PCI Express slots on the motherboard.
Software Layer: at this layer, Python, Docker, or any other program sees the GPUs, by default, in the FASTEST_FIRST order, meaning it will take the GPU with the highest CC (CUDA compute capability) on the first index. The solution is to condition the applications at the Software Layer to respect the System Layer ordering by setting the environment variable CUDA_DEVICE_ORDER=PCI_BUS_ID.
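For illustration, one way to apply this fix from Python is the minimal sketch below; the key detail is that the variable must be set before the CUDA runtime initializes (e.g., before importing torch). Exporting it in the shell or a Dockerfile works just as well.

import os

# Must be set before any CUDA-aware library initializes the driver.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Optionally pin the process to specific cards, now indexed in PCI bus order.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch  # imported after setting the env variables on purpose

print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])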
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/upskill-your-llm-knowledge-base-with?r=1ttoeh" }, { "id": "4d1d7d1c-ebd2-445e-a8d7-bdfc1c90cfc6", "content": "An end to end framework for production ready LLM systems Learn how to design, train, and deploy a production ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.

Learn an end to end framework for production ready LLM systems by building your LLM twin. Why you should take our new production ready LLMs course. Paul Iusztin Mar 16, 2024 _Decoding ML Notes_

Want to learn an end to end framework for production ready LLM systems by building your LLM twin? Then you are in luck. The Decoding ML team and I will release, in a few days, a FREE course called the LLM Twin: Building Your Production Ready AI Replica.

What is an LLM Twin? It is an AI character that learns to write like somebody by incorporating that person s style and personality into an LLM.

Within the course, you will learn how to architect, train, and deploy a production ready LLM twin of yourself, powered by LLMs, vector DBs, and LLMOps good practices, such as: experiment trackers, model registries, prompt monitoring, versioning, deploying LLMs, and more!
It is an end to end LLM course where you will build a real world LLM system from start to finish: from data collection to deployment; production ready; from no MLOps to experiment trackers, model registries, prompt monitoring, and versioning. Image by DALL E

Who is this for? Audience: MLEs, DEs, DSs, or SWEs who want to learn to engineer production ready LLM systems using LLMOps good principles. Level: intermediate. Prerequisites: basic knowledge of Python, ML, and the cloud.

How will you learn? The course contains 11 hands on written lessons and the open source code you can access on GitHub (WIP). You can read everything at your own pace.

Costs? The articles and code are completely free, and they will always remain free. This time, the Medium articles won t be behind any paywall. I want to make them entirely available to everyone.

Meet your teachers! The course is created under the Decoding ML umbrella by: Paul Iusztin, Senior ML MLOps Engineer; Alex Vesa, Senior AI Engineer; Alex Razvant, Senior ML MLOps Engineer.

What will you learn to build? LLM twin system architecture Image by the Author

The LLM architecture of the course is split into 4 Python microservices:

The data collection pipeline: crawl your digital data from various social media platforms. Clean, normalize, and load the data into a NoSQL DB through a series of ETL pipelines. Send database changes to a queue using the CDC pattern. Deployed on AWS.

The feature pipeline: consume messages from the queue through a Bytewax streaming pipeline. Every message is cleaned, chunked, embedded (using Superlinked), and loaded into a Qdrant vector DB in real time. Deployed on AWS.

The training pipeline: create a custom dataset based on your digital data. Fine tune an LLM using QLoRA.
Use Comet ML s experiment tracker to monitor the experiments. Evaluate and save the best model to Comet s model registry. Deployed on Qwak.

The inference pipeline: load and quantize the fine tuned LLM from Comet s model registry. Deploy it as a REST API. Enhance the prompts using RAG. Generate content using your LLM twin. Monitor the LLM using Comet s prompt monitoring dashboard. Deployed on Qwak.

Along the 4 microservices, you will learn to integrate 3 serverless tools: Comet ML as your ML platform; Qdrant as your vector DB; Qwak as your ML infrastructure.

Soon, we will release the first lesson from the LLM Twin: Building Your Production Ready AI Replica course. To stay updated, check it out on GitHub and support us with a star: _LLM Twin: Building Your Production Ready AI Replica Course GitHub Repository_

", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/want-to-learn-an-end-to-end-framework?r=1ttoeh" }, { "id": "1dbefe69-acbf-4b86-8b52-0670b28dbab4", "content": "Fix your messy ML configs in your Python projects 2024 MLOps learning roadmap. Python syntax sugar that will help you write cleaner code.
Fix your messy ML configs in your Python projects. 2024 MLOps learning roadmap. Python syntax sugar that will help you write cleaner code. Paul Iusztin Mar 09, 2024 _Decoding ML Notes_

This week, our main focus will be a classic: Python. More concretely, how to write cleaner code and applications in Python. Is that even possible?

This week s topics: My favorite way to implement a configuration layer in Python. Some Python syntax sugar that will help you write cleaner code. 2024 MLOps learning roadmap.

Since creating content, I learned one crucial thing: everybody likes to read and learn differently. Do you prefer to read content on Medium? Then you are in luck: Decoding ML is also on Medium. Substack vs. Medium? On Medium, we plan to post more extended and detailed content, while on Substack, we will write on the same topics but in a shorter and more concentrated manner. If you want more code and less talking, check out our Medium publication: Decoding ML Medium publication

My favorite way to implement a configuration layer in Python

This is my favorite way to implement a configuration settings system in Python for all my apps. The core is based on pydantic, a data validation library for Python, and more precisely on its BaseSettings class.

Why use the pydantic BaseSettings class? You can quickly load values from .env files (or even JSON or YAML); add default values for the configuration of your application; and, the MOST IMPORTANT one, it validates the types of the loaded variables. Thus, you will always be sure you use the correct variables to configure your system.
How do you implement it? It is pretty straightforward. You subclass the BaseSettings class and define all your settings at the class level. It is similar to a Python dataclass but with an extra layer of data validation and factory methods. If you assign a value to a variable, it becomes optional. If you leave it empty, providing it in your .env file is mandatory.

How do you integrate it with your ML code? You often have a training or inference configuration in a JSON or YAML file (I prefer YAML files as they are easier to read). You shouldn t pollute your pydantic settings class with all the hyperparameters related to the module, as they are a lot, A LOT. Also, to isolate the application and ML settings, the easiest way is to add the training_config_path to your settings and use a TrainingConfig class to load it independently. Doing so lets you leverage your favorite way (probably the one you already have in your ML code) of loading a config file for the ML configuration: plain YAML or JSON files, hydra, or other fancier methods. Another plus is that you don t have to hardcode the path anywhere on your system, which is a nightmare when you start using git with multiple people. pydantic BaseSettings example Image by the Author

What do you say? Would you start using the pydantic BaseSettings class in your ML applications?
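The referenced image is not reproduced here, so the following is only a rough sketch of what such a settings layer might look like, assuming the pydantic v1 style BaseSettings (pydantic v2 moved it to the separate pydantic-settings package); the field names, the TrainingConfig class, and the YAML path are illustrative, not the author s exact code.

from pathlib import Path

import yaml
from pydantic import BaseSettings  # in pydantic v2: from pydantic_settings import BaseSettings


class AppSettings(BaseSettings):
    # Loaded from the environment or an .env file; types are validated on load.
    qdrant_url: str = "http://localhost:6333"   # has a default -> optional
    comet_api_key: str                          # no default -> must be provided
    training_config_path: Path = Path("configs/training.yaml")

    class Config:
        env_file = ".env"


class TrainingConfig:
    """Keeps the many ML hyperparameters out of the application settings."""

    def __init__(self, **hyperparameters) -> None:
        self.hyperparameters = hyperparameters

    @classmethod
    def from_yaml(cls, path: Path) -> "TrainingConfig":
        with path.open() as f:
            return cls(**yaml.safe_load(f))


settings = AppSettings()
training_config = TrainingConfig.from_yaml(settings.training_config_path)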
Some Python syntax sugar that will help you write cleaner code

Here is some Python syntax sugar that will help you write cleaner code. I am talking about the walrus operator, denoted by the := symbol. It was introduced in Python 3.8, but I rarely see it used. Thus, as a clean code freak, I wanted to dedicate a post to it.

What does the walrus operator do? It s an assignment expression that allows you to assign and return a value in the same expression.

Why should you use it? Conciseness: it reduces the number of lines needed for variable assignment and checking, making code more concise. Readability: it can enhance readability by keeping related logic close, although this depends on the context and the reader s familiarity with exotic Python syntax.

Here are some examples: 1. Using the walrus operator, you can directly assign the result of the len function inside an if statement. 2. Avoid calling the same function twice in a while loop; the benefit is less code, and everything becomes more readable. 3. Another use case arises in list comprehensions, where a value computed in a filtering condition is also needed in the expression body. Before the walrus operator, if you had to apply a function to an item from a list and filter it based on some criteria, you had to refactor it into a standard for loop.

When writing clean code, the details matter. The details make the difference between a codebase that can be read like a book and one that produces 10 WTFs per second. The walrus operator examples Image by the Author

What do you think? Does the walrus operator make the Python code more readable and concise?
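The actual examples live in the referenced image; a minimal sketch of the three patterns described above might look like this (the data and helper names are made up for illustration).

import io

# 1. Assign the result of len() directly inside the if statement.
data = [1, 2, 3]
if (n := len(data)) > 2:
    print(f"List is too long ({n} elements)")

# 2. Avoid calling the same function twice in a while loop.
stream = io.StringIO("some text to read in small chunks")
while chunk := stream.read(8):
    print(chunk)

# 3. Reuse a value computed in the filtering condition of a list comprehension.
def clean(post: str) -> str:
    return post.strip().lower()

posts = ["  Hello  ", "", "  World "]
cleaned = [c for post in posts if (c := clean(post))]
print(cleaned)  # ['hello', 'world']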
2024 MLOps learning roadmap

Want to learn MLOps but got stuck at the 100th tool you think you must know? Here is the MLOps roadmap for 2024.

MLOps vs. ML engineer: in theory, MLEs focus on deploying models to production, while MLOps engineers build the platform used by MLEs. I think this is heavily dependent on the scale of the company: as the company gets smaller, these 2 roles start to overlap more. This roadmap will teach you how to build such a platform, from programming skills to MLOps components and infrastructure as code.

Here is the MLOps roadmap for 2024, suggested by Maria Vechtomova from MarvelousMLOps:
1. Programming: Python and IDEs; Bash basics and command line editors.
2. Containerization and Kubernetes: Docker; Kubernetes.
3. Machine learning fundamentals.
...until now, we laid down the fundamentals. Now let s get into MLOps:
4. MLOps principles: reproducible, testable, and evolvable ML powered software.
5. MLOps components: version control; CI CD pipelines; orchestration; experiment tracking and model registries; data lineage and feature stores; model training and serving; monitoring and observability.
6. Infrastructure as code: Terraform.
2024 MLOps Learning Roadmap Image by the Author

As a self learner, I wish I had access to this step by step plan when I started learning MLOps. Remember, you should pick up and tailor this roadmap to the level you are currently at. Find more details about the roadmap in Maria Vechtomova s article: MLOps roadmap 2024

", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/my-favorite-way-to-implement-a-configuration?r=1ttoeh" }, { "id": "ba6ba94f-b2d0-4ad8-9dbc-638f5eb1a081", "content": "A Real time Retrieval System for RAG on Social Media Data Use a Bytewax streaming engine to build a real time ingestion pipeline to populate a Qdrant vector DB. Implement a RAG retrieval client using rerank.
A Real time Retrieval System for RAG on Social Media Data. Use a streaming engine to populate a vector DB in real time. Use rerank and UMAP to improve the accuracy of your retrieved documents. Paul Iusztin Mar 07, 2024

We are putting in a lot of time to create high quality content, and we want to make it as convenient as possible for you to read it. That is why we will experiment with the posting time and move it to Thursday at 3:00 PM CET.

In this article, you will learn how to build a real time retrieval system for social media data. In our example, we will use only my LinkedIn posts, but our implementation can easily be extended to other platforms supporting written content, such as X, Instagram, or Medium.

In this article, you will learn how to: build a streaming pipeline that ingests LinkedIn posts into a vector DB in real time; clean, chunk, and embed LinkedIn posts; build a retrieval client to query LinkedIn posts; use a rerank pattern to improve retrieval accuracy; visualize content retrieved for a given query in a 2D plot using UMAP.

Our implementation focuses on just the retrieval part of a RAG system. But you can quickly hook the retrieved LinkedIn posts to an LLM for post analysis or personalized content generation.

Table of Contents: 1. System Design 2. Data 3. Streaming ingestion pipeline 4. Retrieval client 5. Conclusion

1. System Design

The architecture of the retrieval system Image by the Author, in collaboration with VectorHub.

The retrieval system is based on 2 detached components: 1. the streaming ingestion pipeline; 2. the retrieval client. The streaming ingestion pipeline runs 24/7 to keep the vector DB synced with the current raw LinkedIn posts data source, while the retrieval client is used in RAG applications to query the vector DB. These 2 components communicate with each other only through the vector DB.

1.1. The streaming ingestion pipeline

The streaming ingestion pipeline implements the Change Data Capture (CDC) pattern between a data source containing the raw LinkedIn posts and the vector DB used for retrieval. In a real world scenario, the streaming pipeline listens to a queue populated by all the changes made to the source database. But because we are focusing primarily on the retrieval system, we simulate the data within the queue with a couple of JSON files.

The streaming pipeline is built in Python using Bytewax, and it cleans, chunks, and embeds the LinkedIn posts before loading them into a Qdrant vector DB.

Why do we need a stream engine? Because LinkedIn posts (or any other social media data) evolve frequently, your vector DB can quickly get out of sync. To handle this, you could build a batch pipeline that runs every minute. But to really minimize data lag and make sure your vector DB stays current with new social media posts, you need a streaming pipeline that immediately takes every new item the moment it s posted, preprocesses it, and loads it into the vector DB.

Why Bytewax? Bytewax is a streaming engine built in Rust that exposes a Python interface. We use Bytewax because it combines the impressive speed and reliability of Rust with the ease of use and ecosystem of Python.

1.2.
The retrieval client

Our retrieval client is a standard Python module that preprocesses user queries and searches the vector DB for the most similar results. The Qdrant vector DB lets us decouple the retrieval client from the streaming ingestion pipeline.

Using a semantic based retrieval system lets us query our LinkedIn post collection very flexibly. For example, we can retrieve similar posts using a variety of query types, e.g., posts, questions, or sentences. Also, to improve the retrieval system s accuracy, we use a rerank pattern. Lastly, to better understand and explain the retrieval process for particular queries, we visualize our results on a 2D plot using UMAP.

2. Data

We will ingest 215 LinkedIn posts from my LinkedIn profile (Paul Iusztin). Though we simulate the post ingestion step using JSON files, the posts themselves are authentic.

Before diving into the code, let s take a look at an example LinkedIn post to familiarize ourselves with the challenges it will introduce (in the original post, many of the highlighted words are written with Unicode bold and italic characters and emojis):

text: "What do you need to fine tune an open source LLM to create your own financial advisor? This is the LLM fine tuning kit you must know. Dataset. The key component of any successful ML project is the data. You need a 100 1000 sample Q&A (questions and answers) dataset with financial scenarios. The best approach is to hire a bunch of experts to create it manually. But, for a PoC, that might get expensive and slow. The good news is that a method called Finetuning with distillation exists. ... Along with ease of deployment, you can easily add your training code to your CI CD to add the final piece of the MLOps puzzle, called CT (continuous training). Beam: https lnkd.in dedCaMDh . To see all these components in action, check out my FREE Hands on LLMs course give it a https lnkd.in dZgqtf8f hashtag machinelearning hashtag mlops hashtag datascience", image: https media.licdn.com dms image D4D10AQHWQzZcToQQ1Q image shrink_800 0 1698388219549?e 1705082400 v beta t 9mrDC_NooJgD7u7Qk0PmrTGGaZtuwDIFKh3bEqeBsm0

The following features of the above post are not compatible with embedding models; we ll need to find some way of handling them in our preprocessing step: emojis; bold and italic text; other non ASCII characters; URLs; content that exceeds the context window limit of the embedding model.

Emojis and bolded and italic text are represented by Unicode characters that are not available in the vocabulary of the embedding model.
Thus, these items cannot be tokenized and passed to the model; we have to remove them or normalize them to something that can be parsed by the tokenizer. The same holds true for all other non ASCII characters.

URLs take up space in the context window without providing much semantic value. Still, knowing that there s a URL in the sentence may add context. For this reason, we replace all URLs with a URL token. This lets us ingest whatever value the URL s presence conveys without it taking up valuable space.

3. Streaming ingestion pipeline

Let s dive into the streaming pipeline, starting from the top and working our way to the bottom.

3.1. The Bytewax flow

The Bytewax flow transparently conveys all the steps of the streaming pipeline. The first step is ingesting every LinkedIn post from our JSON files. In the next steps, every map operation has a single responsibility: validate the ingested data using a _RawPost pydantic model_; clean the posts; chunk the posts (because chunking outputs a list of ChunkedPost objects, we use a flat_map operation to flatten them out); embed the posts; load the posts into a Qdrant vector DB.

def build_flow():
    embedding_model = EmbeddingModelSingleton()

    flow = Dataflow("flow")
    stream = op.input("input", flow, JSONSource(["data/paul.json"]))
    stream = op.map("raw_post", stream, RawPost.from_source)
    stream = op.map("cleaned_post", stream, CleanedPost.from_raw_post)
    stream = op.flat_map(
        "chunked_post",
        stream,
        lambda cleaned_post: ChunkedPost.from_cleaned_post(
            cleaned_post, embedding_model=embedding_model
        ),
    )
    stream = op.map(
        "embedded_chunked_post",
        stream,
        lambda chunked_post: EmbeddedChunkedPost.from_chunked_post(
            chunked_post, embedding_model=embedding_model
        ),
    )
    op.inspect("inspect", stream, print)
    op.output(
        "output", stream, QdrantVectorOutput(vector_size=embedding_model.embedding_size)
    )

    return flow

3.2. The processing steps

Every processing step is incorporated into a _pydantic model_. This way, we can easily validate the data at each step and reuse the code in the retrieval module. We isolate every step of the ingestion pipeline into its own class: cleaning; chunking; embedding. Doing so, we follow the separation of concerns good SWE practice, so every class has its own responsibility. Now the code is easy to read and understand. Also, it s future proof, as it s extremely easy to change or extend any of the 3 steps: cleaning, chunking, and embedding.

Here is the interface of the _pydantic models_:

class RawPost(BaseModel):
    post_id: str
    text: str
    image: Optional[str]

    @classmethod
    def from_source(cls, k_v: Tuple[str, dict]) -> "RawPost":
        ...  # Map a dictionary to a validated RawPost pydantic model.
        return cls(...)

class CleanedPost(BaseModel):
    post_id: str
    raw_text: str
    text: str
    image: Optional[str]

    @classmethod
    def from_raw_post(cls, raw_post: RawPost) -> "CleanedPost":
        ...  # Clean the raw post.
        return cls(...)

class ChunkedPost(BaseModel):
    post_id: str
    chunk_id: str
    full_raw_text: str
    text: str
    image: Optional[str]

    @classmethod
    def from_cleaned_post(
        cls, cleaned_post: CleanedPost, embedding_model: EmbeddingModelSingleton
    ) -> list["ChunkedPost"]:
        chunks = ...  # Compute chunks.
        return [cls(...) for chunk in chunks]

class EmbeddedChunkedPost(BaseModel):
    post_id: str
    chunk_id: str
    full_raw_text: str
    text: str
    text_embedding: list
    image: Optional[str] = None
    score: Optional[float] = None
    rerank_score: Optional[float] = None

    @classmethod
    def from_chunked_post(
        cls, chunked_post: ChunkedPost, embedding_model: EmbeddingModelSingleton
    ) -> "EmbeddedChunkedPost":
        ...  # Compute the embedding.
        return cls(...)

Now, the data at each step is validated and has a clear structure.
Note: providing different types when instantiating a _pydantic_ model will throw a validation error. For example, if the _post_id_ is defined as a _string_ and we try to instantiate an _EmbeddedChunkedPost_ with a _None_ or _int_ _post_id_, it will throw an error. Check out the full implementation on our GitHub Articles Hub repository.

3.3. Load to Qdrant

To load the LinkedIn posts to Qdrant, you have to override Bytewax s _StatelessSinkPartition_ class, which acts as an output in a Bytewax flow:

class QdrantVectorSink(StatelessSinkPartition):
    def __init__(self, client: QdrantClient, collection_name: str):
        self._client = client
        self._collection_name = collection_name

    def write_batch(self, chunks: list[EmbeddedChunkedPost]):
        ...  # Map chunks to ids, embeddings, and metadata.
        self._client.upsert(
            collection_name=self._collection_name,
            points=Batch(ids=ids, vectors=embeddings, payloads=metadata),
        )

Within this class, you must overwrite the _write_batch_ method, where we serialize every _EmbeddedChunkedPost_ into the format expected by Qdrant and load it into the vector DB.

4. Retrieval client

Here, we focus on preprocessing a user s query, searching the vector DB, and postprocessing the retrieved posts for maximum results. To design the retrieval step, we implement a _QdrantVectorDBRetriever_ class to expose all the necessary features for our retrieval client:

class QdrantVectorDBRetriever:
    def __init__(
        self,
        embedding_model: EmbeddingModelSingleton,
        vector_db_client: QdrantClient,
        cross_encoder_model: CrossEncoderModelSingleton,
        vector_db_collection: str,
    ):
        self._embedding_model = embedding_model
        self._vector_db_client = vector_db_client
        self._cross_encoder_model = cross_encoder_model
        self._vector_db_collection = vector_db_collection

    def search(
        self, query: str, limit: int = 3, return_all: bool = False
    ) -> Union[list[EmbeddedChunkedPost], dict[str, list]]:
        ...  # Search the Qdrant vector DB based on the given query.

    def embed_query(self, query: str) -> list[list[float]]:
        ...  # Embed the given query.

    def rerank(self, query: str, posts: list[EmbeddedChunkedPost]) -> list[EmbeddedChunkedPost]:
        ...  # Rerank the posts relative to the given query.

    def render_as_html(self, post: EmbeddedChunkedPost) -> None:
        ...  # Map the embedded post to HTML to display it.

4.1. Embed query

We must embed the query in precisely the same way we ingested our posts into the vector DB. Because the streaming pipeline is written in Python (thanks to Bytewax), and every preprocessing operation is modular, we can quickly replicate all the steps necessary to embed the query:

class QdrantVectorDBRetriever:
    ...

    def embed_query(self, query: str) -> list[list[float]]:
        cleaned_query = CleanedPost.clean(query)
        chunks = ChunkedPost.chunk(cleaned_query, self._embedding_model)
        embedded_queries = [
            self._embedding_model(chunk, to_list=True) for chunk in chunks
        ]
        return embedded_queries

Check out the full implementation on our GitHub repository.

4.2. Plain retrieval

Let s try to retrieve a set of posts without using the rerank algorithm.
vector_db_retriever = QdrantVectorDBRetriever(
    embedding_model=EmbeddingModelSingleton(),
    vector_db_client=build_qdrant_client(),
)

query = "Posts about Qdrant"
retrieved_results = vector_db_retriever.search(query=query)
for post in retrieved_results["posts"]:
    vector_db_retriever.render_as_html(post)

Here are the top 2 retrieved results, sorted by cosine similarity score:

Result 1: Result 1 for the "Posts about Qdrant" query, without reranking Image by the Author, in collaboration with VectorHub
Result 2: Result 2 for the "Posts about Qdrant" query, without reranking Image by the Author, in collaboration with VectorHub

You can see from the results above that, starting from the second post, the results are irrelevant. Even though it has a cosine similarity score of 0.69, the post doesn t contain any information about Qdrant or vector DBs.

Note: we looked over the top 5 retrieved results. Nothing after the first post was relevant. We haven t added them here, as the article is already too long.

4.3. Visualize retrieval

To visualize our retrieval, we implement a dedicated class that uses the UMAP dimensionality reduction algorithm. We picked UMAP because, when points are projected onto lower dimensions, it preserves the geometric properties between them (e.g., the distance in higher dimensions) better than its peers (e.g., PCA, t-SNE).

The _RetrievalVisualizer_ computes the projected embeddings for the entire vector space once. Afterwards, it uses the render method to project only the given query and retrieved posts, and plots them on a 2D graph:

class RetrievalVisualizer:
    def __init__(self, posts: list[EmbeddedChunkedPost]):
        self._posts = posts
        self._umap_transform = self._fit_model(self._posts)
        self._projected_post_embeddings = self.project_posts(self._posts)

    def _fit_model(self, posts: list[EmbeddedChunkedPost]) -> umap.UMAP:
        umap_transform = ...  # Fit a UMAP model on the given posts.
        return umap_transform

    def project_posts(self, posts: list[EmbeddedChunkedPost]) -> np.ndarray:
        embeddings = np.array([post.text_embedding for post in posts])
        return self._project(embeddings=embeddings)

    def _project(self, embeddings: np.ndarray) -> np.ndarray:
        ...  # Project the embeddings to 2D using UMAP.
        return umap_embeddings

    def render(
        self,
        embedded_queries: list[list[float]],
        retrieved_posts: list[EmbeddedChunkedPost],
    ) -> None:
        ...  # Render the given queries and retrieved posts using matplotlib.

Let s take a look at the result for the "Posts about Qdrant" query: Visualization of the "Posts about Qdrant" query using UMAP, without reranking Image by the Author, in collaboration with VectorHub.

Our results are not great. You can see how far the retrieved posts are from our query in the vector space. Can we improve the quality of our retrieval system using the rerank algorithm?

4.4. Rerank

We use the _reranking_ algorithm to refine our retrieval for the initial query. Because our initial retrieval step uses cosine similarity (or similar distance metrics) to compute the distance between the query and post embeddings, it may miss more complex but essential relationships between the query and the documents in the vector space. Reranking leverages the power of transformer models that are capable of understanding more nuanced semantic relationships.

We use a cross encoder model to implement the reranking step, so we can score the query relative to all retrieved posts individually. These scores take into consideration more complex relationships than cosine similarity can.
Under the hood is a BERT classifier that outputs a number between 0 and 1 according to how similar the 2 given sentences are: 0 if they are entirely different and 1 if they are a perfect match. Bi Encoder vs. Cross Encoder Image by the Author, in collaboration with VectorHub

But, you might ask, why not use the cross encoder model from the start if it is that much better? The answer, in a word, is speed. Using a cross encoder model to search your whole collection is much slower than using cosine similarity. To optimize your retrieval, therefore, your reranking process should involve 2 steps: 1. an initial rough retrieval step using cosine similarity, which retrieves the top N items as potential candidates; 2. filtering the rough search using the rerank strategy, which retrieves the top K items as your final results.

The implementation is relatively straightforward. For each retrieved post, we create a pair consisting of the cleaned query and the text of the post. We do this for all retrieved posts, resulting in a list of pairs. Next, we call a _cross encoder ms marco MiniLM L 6 v2_ model from sentence transformers to give the retrieved posts their rerank scores. We then sort the posts in descending order based on their rerank score. Check out the rerank algorithm implementation on our GitHub repository.
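As a rough sketch of that two step pattern (not necessarily the article s exact implementation), the rerank step with the sentence transformers cross encoder could look like this; the candidate texts are assumed to come from the initial cosine similarity search.

from sentence_transformers import CrossEncoder

# Candidate texts returned by the initial cosine-similarity search (top N).
candidates = ["post text 1", "post text 2", "post text 3"]
query = "Posts about Qdrant"

# Score every (query, post) pair with the cross-encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, text) for text in candidates])

# Keep the top K posts, sorted by rerank score in descending order.
top_k = 2
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)[:top_k]
for text, score in reranked:
    print(f"{score:.3f} {text}")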
4.5. Visualize retrieval with rerank

Now that we ve added the rerank pattern to our retrieval system, let s see if it improves the results of our "Posts about Qdrant" query:

Result 1: Result 1 for the "Posts about Qdrant" query, using reranking Image by the Author, in collaboration with VectorHub
Result 2: Result 2 for the "Posts about Qdrant" query, using reranking Image by the Author, in collaboration with VectorHub

The improvement is remarkable! All our results are about Qdrant and vector DBs. Note: we looked over the top 5 retrieved results. The top 4 out of 5 posts are relevant to our query, which is incredible.

Now, let s look at the UMAP visualization: Visualization of the "Posts about Qdrant" query using UMAP, with reranking Image by the Author, in collaboration with VectorHub. While the returned posts aren t very close to the query, they are a lot closer than when we weren t reranking the retrieved posts.

5. Conclusion

In this article, we learned how to adapt a RAG retrieval pattern to improve LinkedIn post retrieval. To keep our database up to date with rapidly changing social media data, we implemented a real time streaming pipeline that uses CDC to sync the raw LinkedIn posts data source with a vector DB. You also saw how to use Bytewax to write, using only Python, a streaming pipeline that cleans, chunks, and embeds LinkedIn posts. Finally, you learned how to implement a standard retrieval client for RAG and saw how to improve it using the rerank pattern. As retrieval is complex to evaluate, you saw how to visualize the retrieval for a given query by rendering all the posts, the query, and the retrieved posts in a 2D space using UMAP.

This article is a summary of my contribution to VectorHub. Check out the full article there to dig into the details, the code, and more experiments.

", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/a-real-time-retrieval-system-for?r=1ttoeh" }, { "id": "cb6e689e-e718-42c8-80b1-44db7d568c3b", "content": "4 key decoding strategies for LLMs that you must know The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd.

4 key decoding strategies for LLMs that you must know. The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd. Paul Iusztin Feb 15, 2024

Hello everyone, I hope you enjoyed what Alex R. and Alex V. have prepared for you in their previous articles. I promised that the 3 of us would dig deeper into more exciting topics about production ready LLM and CV models. But this is just the beginning. Stay tuned for more production ML.

This week s topics: 4 key decoding strategies for LLMs that you must know. The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd.

Want to build your first LLM project but don t know where to start? If you want to learn in a structured way to build hands on LLM systems using good LLMOps principles, we want to announce that we just released 8 Medium lessons for the Hands on LLMs course that will put you on the right track.

Within the 8 Medium lessons, you will go step by step through the theory, system design, and code to learn how to build: a real time streaming pipeline (deployed on AWS) that uses Bytewax as the stream engine to listen to financial news, clean and embed the documents, and load them into a vector DB; a fine tuning pipeline (deployed as serverless continuous training) that fine tunes an LLM on financial data using QLoRA, monitors the experiments using an experiment tracker, and saves the best model to a model registry; an inference pipeline built in LangChain (deployed as a serverless RESTful API) that loads the fine tuned LLM from the model registry and answers financial questions using RAG, leveraging the vector DB populated with financial news.

We will also show you how to integrate various serverless tools, such as Comet ML as your ML platform, Qdrant as your vector DB, and Beam as your infrastructure. The architecture of the system you will learn to build during the Hands on LLMs course Image by the Author.

Who is this for? The series targets MLEs, DEs, DSs, or SWEs who want to learn to engineer LLM systems using LLMOps good principles. How will you learn? The series contains 4 hands on video lessons and the open source code you can access on GitHub. Curious?
Check out the 8 Medium lessons of the Hands on LLMs course and start building your own LLM system: The Hands on LLMs Medium Series

4 key decoding strategies for LLMs that you must know

You see, LLMs don t just spit out text. They calculate logits, which are mapped to probabilities for every possible token in their vocabulary. The model uses the previous token IDs to predict the next most likely token (the auto regressive nature of decoder models). The real magic happens in the decoding strategy you pick: Greedy Search, Beam Search, Top K Sampling, or Nucleus Sampling.

Greedy Search: it only holds onto the most likely token at each stage. It s fast and efficient, but it is short sighted.

Beam Search: this time, you are not looking at just the token with the highest probability, but at the N most likely tokens. This creates a tree like structure, where each node has N children. The procedure repeats until you hit a maximum length or an end of sequence token. Ultimately, you pick the leaf with the biggest score and recursively pick its parent until you hit the root node. For example, in the graph below, we have beams = 2 and length = 3.

Top K Sampling: this technique extends the beam search strategy and adds a dash of randomness to the generation process. Instead of just picking the most likely tokens, it selects a token randomly from the top k most likely choices. Thus, the tokens with the highest probability will appear more often, but other tokens will be generated occasionally to add some randomness (creativity).

Nucleus Sampling: in this case, you re not just picking the top k most probable tokens. You re picking a cutoff value _p_ and forming a nucleus of tokens. In other words, rather than selecting the top k most probable tokens, nucleus sampling chooses a cutoff value p such that the sum of the probabilities of the selected tokens exceeds p. Thus, at every step, you will have a varying number of possible tokens included in the nucleus from which you sample. This introduces even more diversity and creativity into your output.

Note: for top k and nucleus sampling, you can also use the temperature hyperparameter to tweak the output probabilities. It is a parameter that typically ranges from 0 to 1. A low temperature (e.g., 0.1) will decrease the entropy (randomness), making the generation more stable. 4 key decoding strategies for LLMs that you must know Image by the Author

To summarize... there are 2 main decoding strategies for LLMs: greedy search and beam search. To add more variability and creativity to beam search, you can use top k sampling and nucleus sampling.
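To tie these strategies to concrete knobs, here is a minimal sketch using the Hugging Face transformers generate API (the model name is only an example): num_beams selects beam search, while do_sample combined with top_k, top_p, and temperature switches to top k or nucleus sampling.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The key to fast LLM inference is", return_tensors="pt")

# Greedy search: always take the most likely next token.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Beam search: keep the N most likely partial sequences.
beam = model.generate(**inputs, max_new_tokens=30, num_beams=4, do_sample=False)

# Top-k / nucleus sampling with temperature: sample from a truncated distribution.
sampled = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))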
To summarize: there are 2 main decoding strategies for LLMs, greedy search and beam search. To add more variability and creativity to beam search, you can use top-k sampling and nucleus sampling. The only 6 prompt engineering techniques you need to know. The whole field of prompt engineering can be reduced to these 6 techniques I use almost daily when using ChatGPT or other LLMs. Here they are: 1. Few-shot prompting: add 2 or 3 high-quality demonstrations to your prompt, each consisting of both the input and the desired output on the target task. The LLM will better understand your intention and what kind of answers you expect based on concrete examples. 2. Self-consistency sampling: sample multiple outputs with a temperature above 0 and select the best one out of these candidates. How do you pick the best candidate? It varies from task to task, but here are 2 primary scenarios: (1) some tasks are easy to validate, such as programming questions, in which case you can write unit tests to verify the correctness of the generated code; (2) for more complicated tasks, you can manually inspect the outputs or use another LLM (or another specialized model) to rank them. 3. Chain of Thought (CoT): you want to force the LLM to explain its thought process step by step, which eventually leads to the final answer. This helps the LLM reason about complex tasks better. Use CoT for complicated reasoning tasks with large models (e.g., with more than 50B parameters); simple tasks only benefit slightly from CoT prompting. Here are a few methods to achieve CoT: provide a list of bullet points with all the steps you expect the LLM to take; use few-shot prompting to teach the LLM to think in steps; or, my favorite, use sentences such as Let's think step by step. 4. Augmented prompts: the LLM's internal knowledge is limited to the data it was trained on, and it often forgets specific details of older training datasets. That is why it is beneficial to use the LLM as a reasoning engine that parses and extracts information from a reliable source of information given as context in the prompt. The most common use case is Retrieval-Augmented Generation (RAG). Why? You avoid retraining the model on new data, you avoid hallucinations, and you get access to references to the source. 5. A single responsibility per prompt: quite self-explanatory. It is similar to the DRY principle in SWE. Having only one task per prompt is good practice to avoid confusing the LLM.
If you have more complex tasks, split them into granular ones and merge the results later in a different prompt. 6. Be as explicit as possible: the LLM cannot read your mind. To maximize the probability of getting precisely what you want, imagine the LLM as a 7-year-old to whom you must explain everything step by step to be sure they understood. Note: the level of detail in the prompt should be inversely proportional to the size and complexity of the model. (Image generated by DALL-E.) The truth is that prompt engineering is quite intuitive, and we don't have to overthink it too much. What would you add to this list? A small prompt sketch tying a few of these techniques together follows at the end of this post. One thing that I do that sets me apart from the crowd. Here is one thing that I do that sets me apart from the crowd: I am okay with being the dumb one that asks many questions. Hmm... why? The reality is that even the brightest minds cannot understand everything on the first pass. It is not necessarily that you cannot understand the concepts. There are other factors, such as: you are tired, you haven't paid enough attention, the concept wasn't explained at your level, the presenter wasn't clear enough, etc. Also, the truth is that many of us don't understand everything on the first pass when presented with a new concept. But because of our ego, we are afraid to speak up and ask something, worried that we will sound stupid. The joke is on you. Most people will be grateful you broke the ice and asked for the concept to be explained again. Why? It will help the team learn the new concepts better. It will start a discussion to dig deeper into the subject. It will piss off or annoy the people you don't like. It will help other people ask questions next time. It will open up new perspectives on the problem. To conclude: ignore your ego and what people think of you. Own your curiosity and ask questions when you feel like it. It is OK not to know everything. It is better to be stupid for 5 minutes than for your entire life. Congrats on learning something new today! Don't hesitate to share your thoughts; we would love to hear them. Remember, when ML looks encoded, we'll help you decode it. See you next Thursday at 9:00 am CET. Have a fantastic weekend!
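As promised above, here is a minimal sketch that combines a few of the techniques (few-shot prompting, CoT, a single responsibility per prompt); the task, examples, and wording are placeholders of mine, not taken from the newsletter.

FEW_SHOT_EXAMPLES = [
    {"input": "Revenue grew 12% YoY while costs stayed flat.", "output": "positive"},
    {"input": "The company issued a profit warning for Q3.", "output": "negative"},
]

def build_prompt(news: str) -> str:
    # Single responsibility: the prompt asks for exactly one thing (a sentiment label).
    header = "You classify the sentiment of financial news as positive, negative, or neutral.\n"
    # Few-shot prompting: 2-3 high-quality input/output demonstrations.
    shots = "".join(f"News: {ex['input']}\nSentiment: {ex['output']}\n" for ex in FEW_SHOT_EXAMPLES)
    # Chain of Thought: nudge the model to reason step by step before answering.
    cot = "Let's think step by step before giving the final label.\n"
    return f"{header}{shots}{cot}News: {news}\nSentiment:"

print(build_prompt("The central bank unexpectedly raised interest rates."))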
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/4-key-decoding-strategies-for-llms?r=1ttoeh" }, { "id": "50a5a621-5799-4214-990d-3387ecc704e1", "content": "DML: New year, the new and improved Decoding ML. What to expect? How we plan to grow, provide more qualitative hands-on content, and real-world ML projects to expand your professional skills. Paul Iusztin, Alex Razvant, and Vesa Alexandru, Jan 11, 2024. Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps, one week at a time. This newsletter will differ from the others, as I want to share my plans for the Decoding ML newsletter with you. From now on, it will cost 1000 a month. Joking. It will still be free. It's not about the money but about growth, better quality, and added value. To be 100% transparent with you, I started this newsletter as an experiment, but when I saw people who actually read it, the perfectionist in me screamed that I should improve it and move to the next step. This is the next step. And I'm taking you with me. The big news is that I will go all in, pouring more time and resources into growing the Decoding ML newsletter. My main goals are to: push better-quality content every week; bring more real-world projects to increase your hands-on skills; increase the number of articles with code examples to make the newsletter practical, so you can benefit from it even more at your job. As the world constantly changes, especially in AI and MLE/MLOps, you cannot stagnate. Decoding ML's growth is about providing you with all the MLE/MLOps resources necessary to grow with it and smash it at your projects and job. So, how do I plan to grow the Decoding ML newsletter? Well, there are 3 main steps. 1. Rebranding: from now on, my face will no longer be the logo of Decoding ML. This will be the new logo of Decoding ML, so you don't have to see my annoying face every Thursday morning in your email. 2. Bringing in talent: as I wanted to push more content of higher quality, I had to bring in more talented people to write beside me. I was lucky enough to know Alex Razvant and Alex Vesa, who are 2 fantastic MLE/MLOps engineers with 10 years of hands-on experience in the AI industry. From now on, they will start contributing to the Decoding ML newsletter and team along with me. Maybe you know the famous saying: if you want to go fast, go alone; if you want to go far, go together. And I want Decoding ML to go far. Our primary goal is to help you level up in MLE/MLOps by offering hands-on examples that you can use at your job. I plan to improve the quality of the articles by including more code and concrete examples besides the system design talks we have had so far. And here enter the Alexes. I have worked with them, and I know they are talented experts with fantastic hands-on MLE/MLOps skills and insights to share with you.
Starting now, Decoding ML will no longer be a one-person brand but a brand by itself, hosted by the new Decoding ML team: myself, Alex Vesa, and Alex Razvant. 2.1. Now, let the team introduce itself. Alex Vesa. Main niche: Deep Learning, Computer Vision, ML System Infrastructure, Startups, Business. LinkedIn. Hello everyone, I'm very grateful for this opportunity. I consider creativity and inspiration to flourish when there's a merger of minds from various individuals. My professional journey began in 2015, initially focusing on software engineering with a keen interest in Python and AI technologies. I quickly progressed, taking on challenging roles and AI projects. My experience in various startups as a CTO focused on leading teams in developing innovative software solutions. I worked in multiple sectors, notably healthcare and automotive, where I've implemented AI-driven systems to enhance operational efficiency. My technical skills are broad, encompassing Python, Django, and AWS. I'm dedicated to leveraging my AI and software development expertise to drive organizational success in this dynamic field. I value knowledge sharing in our community, and my objective is to bring solid expertise in practical, real-world AI/ML systems to help you in your day-to-day work and enhance your creativity and vision in product development. Ultimately, I want to share with you the endless capabilities you can possess to evolve. Alex Razvant. Main niche: ML and CV systems in production, MLOps, Edge ML deployments. LinkedIn. Hey everyone, I'm really happy about this merger, as you'll get 3x more quality content in a concise, valuable, and actionable manner, directly to your inbox! Here are a few words about who I am. I started my journey as a SWE in 2015, diving into full-stack web development. After a few internships, hackathons, and a few failed projects, the ML field caught my eye, and I haven't looked back ever since. My journey includes over 15 successful freelance projects, earning a Top Rated ML Engineer badge on Upwork, collaborating with BMW on AI for self-driving cars, authoring a paper for IEEE RA-L 2020, and developing scalable Computer Vision systems to analyze 1000 hours of CCTV footage. I aim to bring solid expertise via code tutorials, diagrams, and system designs to help you overcome challenges in building and deploying ML/CV systems in cloud or edge environments, following the best practices I've learned in SWE, ML, and MLOps. Follow them and check them out on LinkedIn to see their incredible experience in AI. 2.2. Will we start approaching different topics? TL;DR: No! I was meticulous in bringing in more people with the same vision. Thus, Decoding ML will approach the same niche as it has so far: production-ready MLE/MLOps topics. So you don't have to unsubscribe. We will keep talking about the same topics you chose to follow in our newsletter: hands-on MLE/MLOps topics. However, the advantage of having more people with different backgrounds on the team is that we all come with different perspectives and domain knowledge. For example: Alex Razvant worked a lot with Computer Vision, Deep Learning, and MLOps technologies in the world of retail; Alex Vesa has a lot of experience with Deep Learning and infrastructure projects in the medical field; I am passionate about generative AI, MLOps, and SWE. Combining our knowledge will result in exciting production-ready MLE/MLOps articles that will significantly benefit you.
3. Expanding to new distribution channels. Every person consumes content differently. So, we'd like to give you the best fit to enjoy our content. We already started a Decoding ML Medium publication, where this month we will start to push a deep dive into the code of the Hands-on LLMs course. Slowly, we will expand to video-format content on YouTube, Instagram, and TikTok. Also, we started planning a set of eBooks about MLE, MLOps, and LLMOps, and a new course about LLMs and LLMOps. So, what happens next? I hope you are excited about the news. For sure, I am. Next Thursday at 9:00 a.m. CET, Alex Vesa will make his grand opening by writing a step-by-step article on how you can deploy a LLaMA2-7b LLM using Amazon SageMaker and HuggingFace. To conclude, you don't have to do anything on your side. Decoding ML follows its natural course by bringing in more people and expanding to other platforms to give you more value for your time and a more personalized way to enjoy our content. See you next Thursday! Have a fantastic weekend! Paul", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-new-year-the-new-and-improved?r=1ttoeh" }, { "id": "e85a60a3-6667-45fe-81fd-9384322b7cea", "content": "DML: 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer. How to successfully present MLOps ideas to upper management. How I generated PyDocs for 100 Python functions in 1 hour. Paul Iusztin, Jan 04, 2024. Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps, one week at a time. The last Hands-on LLMs series finished last week. In case you are curious, here are the top 3 out of 9 lessons of the series: 1. Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor? 2. Lesson 7: How do you generate a Q&A dataset in 30 minutes to fine-tune your LLMs? 3. Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG? This week's topics: 1. 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer. 2. How to successfully present MLOps ideas to upper management.
3. How I generated PyDocs for 100 Python functions in 1 hour. Before diving into the topics, I have one important thing to share with you. We finally finished the code and video lessons for the Hands-on LLMs course. By finishing the Hands-on LLMs free course, you will learn how to use the 3-pipeline architecture and good LLMOps practices to design, build, and deploy a real-time financial advisor powered by LLMs and vector DBs. We will primarily focus on the engineering and MLOps aspects. Thus, by the end of this series, you will know how to build and deploy a real ML system, not some isolated code in notebooks. More precisely, these are the 3 components you will learn to build: (1) a real-time streaming pipeline deployed on AWS that listens to financial news, cleans and embeds the documents, and loads them to a vector DB; (2) a fine-tuning pipeline deployed as serverless continuous training that fine-tunes an LLM on financial data using QLoRA, monitors the experiments using an experiment tracker, and saves the best model to a model registry; (3) an inference pipeline built in LangChain and deployed as a serverless RESTful API that loads the fine-tuned LLM from the model registry and answers financial questions using RAG, leveraging the vector DB populated with financial news in real time. We will also show you how to integrate various serverless tools, such as Comet ML as your ML platform, Qdrant as your vector DB, and Beam as your infrastructure. Who is this for? The series targets MLEs, DEs, DSs, or SWEs who want to learn to engineer LLM systems using good LLMOps principles. How will you learn? The series contains 4 hands-on video lessons and the open-source code you can access on GitHub. Curious? Check it out and support us. The architecture of a financial bot powered by LLMs, vector DBs, and MLOps (Image by the Authors).
1. 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer. These are the 8 types of MLOps tools that must be in your toolbelt to be a successful MLOps engineer. If you are into MLOps, you are aware of the 1000 tools in the space that you think you have to know. The reality is that all of these tools can be boiled down to 8 main categories. If you learn the fundamentals and master one tool from each category, you will be fine. Başak Tuğçe Eskili and Maria Vechtomova from MarvelousMLOps wrote an excellent summary highlighting these 8 categories: (1) Version control: crucial for the traceability and reproducibility of an ML model deployment or run. Without a version control system, it is difficult to find out what exact code version was responsible for specific runs or errors you might have in production. GitHub, GitLab, etc. (2) CI/CD: automated tests are triggered upon pull request creation; deployment to production should only occur through the CD pipeline. GitHub Actions, GitLab CI/CD, Jenkins, etc. (3) Workflow orchestration: manage complex dependencies between different tasks, such as data preprocessing, feature engineering, and ML model training. Airflow, ZenML, AWS Step Functions, etc. (4) Model registry: store, version, and share trained ML model artifacts, together with additional metadata. Comet ML, W&B, MLflow, etc. (5) Docker registry: store, version, and share Docker images. Basically, all your code will be wrapped up in Docker images and shared through this registry. Docker Hub, ECR, etc. (6-7) Model training and serving infrastructure: if on-premise, you will likely have to go with Kubernetes. There are multiple choices if you are on a cloud provider: Azure ML on Azure, SageMaker on AWS, and Vertex AI on GCP. (8) Monitoring: monitoring in ML systems goes beyond what is needed for monitoring regular software applications.
The distinction lies in the fact that model predictions can fail even if all typical health metrics appear to be in good condition. SageMaker, NannyML, Arize, etc. The secret sauce in MLOps is knowing how to glue all these pieces together while keeping things simple. (Image from Marvelous MLOps.) To read more about these components, check out the article on MarvelousMLOps. 2. How to successfully present MLOps ideas to upper management. Have you ever presented your MLOps ideas to upper management just to get ghosted? In that case... Raphaël Hoogvliets, Başak Tuğçe Eskili, and Maria Vechtomova from MarvelousMLOps presented a great step-by-step strategy for pitching your MLOps ideas to your upper management and getting attention and resources to implement them. Here are the 6 steps you have to know: (1) Collect all the pain points: talk to data scientists, product owners, and stakeholders in your organization to gather issues such as time to deployment, poor-quality deployment, non-existing monitoring, and lack of collaboration with external parties. (2) Educate people: organize workshops, meetings, etc., to present what MLOps is and how it can help. I think it's critical to present it to your target audience; for example, an engineer looks at the problem differently than the business stakeholders do. (3) Present before-and-after scenarios: show how MLOps can solve the company's challenges and deliver tangible benefits to the organization, such as lower cost, faster deployment, better collaboration, and less risk. (4) Prove it: use concrete examples to support your ideas, such as how a competitor or an organization in the same or a related field benefited from introducing MLOps, or build a PoC within your organization. (5) Set up your team: choose 2-3 experienced individuals (not juniors) to set up the foundations in your team or organization, with an emphasis on starting with experienced engineers and only later bringing more juniors to the party. (6) Keep on keeping on: once you successfully apply MLOps to one use case, you can take on more responsibility by growing your team and taking on more projects. All of these are great tips for integrating MLOps in your organization. I love their present before-and-after scenarios approach. You can extrapolate this strategy to any other new process, not only MLOps. To learn the details, check out the full article on MarvelousMLOps.
3. How I generated PyDocs for 100 Python functions in 1 hour. The most boring part of programming is writing PyDocs, so I usually write clean code and let it speak for itself. But for open-source projects where you have to generate robust documentation, PyDocs are a must. The good news is that you can now automate this process using Copilot. You can see in the video below an example of how easy it is. I tested it on more complex functions and classes, and it works well. I chose this example because it fits nicely on one screen. Once I tested the Copilot experience, I knew I would never go back. It is true that, in some cases, you have to make minor adjustments. But that is still vastly more efficient than writing it from scratch. If you want more examples, check out our Hands-on LLMs course, where 99% of the PyDocs were generated using Copilot in 1 hour. That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul. Whenever you're ready, here is how I can help you: 1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step by step through how to design, implement, train, deploy, and monitor an ML batch system using good MLOps practices. It contains the source code and 2.5 hours of reading and video materials on Medium. 2. Machine Learning and MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps. 3. Machine Learning and MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-8-types-of-mlops-tools-that-must?r=1ttoeh" }, { "id": "8ff6064c-9c09-494f-a42d-a60b0e80387c", "content": "DML: This is what you need to build an inference pipeline for a financial assistant powered by LLMs, vector DBs, and LLMOps. Lesson 9, The Hands-on LLMs Series. Paul Iusztin, Dec 28, 2023. Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps, one week at a time. Lesson 9, The Hands-on LLMs Series. This is the last lesson within the Hands-on LLMs series... but certainly not the last MLE/MLOps series. We are cooking some exciting stuff. I hope you had fun and learned a lot during this series. Now, let's see how to glue everything we have done so far under the inference pipeline.
Enjoy! Table of Contents: 1. Inference pipeline video lesson. 2. What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs? 3. How can you build and deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices? Previous Lessons: Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor? Lesson 7: How do you generate a Q&A dataset in 30 minutes to fine-tune your LLMs? Lesson 8: 7 steps on how to fine-tune an open-source LLM to create your real-time financial advisor. Check out the Hands-on LLMs course and support it. 1. Inference pipeline video lesson. We released the final video lesson of the Hands-on LLMs FREE course, which will teach you how to build and deploy an inference pipeline for a financial advisor using LangChain, LLMOps, and vector DBs. Here are the key topics covered in the video lesson made by Pau Labarta and me: (1) an overview of the architecture of the inference pipeline and how to apply good LLMOps practices; (2) how to build a RAG agent from scratch using LangChain (ContextExtractorChain, FinancialBotQAChain); (3) how to attach a callback class to log input prompts and LLM answers to Comet LLMOps; (4) setting up and running the code locally; (5) deploying the inference pipeline to Beam as a RESTful API. Curious? Check out the video lesson Pau Labarta Bajo and I did. 2. What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs? Here are its 7 key components:
(1) A vector DB populated with financial news: this is the output of the feature pipeline. More concretely, a Qdrant vector DB populated with chunks of financial news from Alpaca. During the inference pipeline, we will use it to query valuable chunks of information and do RAG. (2) An embedding language model: to embed the user question and query the vector DB, you need the same embedding model used in the feature pipeline, more concretely all-MiniLM-L6-v2 from sentence-transformers. Using the same encoder-only model is crucial, as the query vector and the vector DB index vectors have to be in the same space. (3) A fine-tuned open-source LLM: the output of the training pipeline is a Falcon 7B fine-tuned on financial tasks. (4) A model registry: the fine-tuned model is shared between the training and inference pipelines through Comet's model registry. By doing so, you entirely decouple the 2 components, and the model can easily be shared under specific environments (e.g., staging, prod) and versions (e.g., v1.0.1). (5) A framework for LLM applications: you need LangChain, as your LLM framework, to glue all the steps together, such as querying the vector DB, storing the history of the conversation, creating the prompt, and calling the LLM. LangChain provides out-of-the-box solutions to chain all these steps together quickly. (6) Deploy the LLM app as a RESTful API: one of the final steps is to deploy your LLM financial assistant under a RESTful API. You can quickly do this using Beam as your serverless infrastructure provider. Beam specializes in DL; thus, it offers quick ways to load your LLM application on GPU machines and expose it under a RESTful API. (7) Prompt monitoring: the last step is to add eyes on top of your system. You can do this using Comet's LLMOps features, which allow you to track and monitor all the prompts and responses of the system. Check out how these components work together in our Hands-on LLMs free course. The sketch below shows how the vector DB, the embedding model, and the retrieval step fit together.
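As a rough illustration of how components 1, 2, and 5 meet at query time, here is a minimal sketch using sentence-transformers and the Qdrant Python client; the collection name, connection details, and prompt wording are placeholder assumptions, not the course's actual code.

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Same encoder-only model used by the feature pipeline, so query vectors and
# index vectors live in the same embedding space.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(host="localhost", port=6333)  # placeholder connection

def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    # Embed the (cleaned) user question.
    query_vector = embedder.encode(question).tolist()
    # Query the vector DB for the most similar financial news chunks.
    hits = qdrant.search(
        collection_name="financial_news",  # placeholder collection name
        query_vector=query_vector,
        limit=top_k,
    )
    return [hit.payload["text"] for hit in hits]

context = retrieve_context("How did tech stocks react to the latest Fed decision?")
prompt = "Answer using only this context:\n" + "\n".join(context)

In a full system, the retrieved chunks would be injected into the prompt template alongside the chat history before calling the fine-tuned LLM.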
3. How can you build and deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices? How can you build and deploy an inference pipeline for a real-time financial advisor with LangChain, powered by LLMs and vector DBs, while considering good LLMOps practices? As a quick reminder from previous posts, here is what we already have: a Qdrant vector DB populated with financial news (the output of the feature pipeline) and fine-tuned Falcon 7B LoRA weights stored in Comet's model registry (the output of the training pipeline). The Qdrant vector DB is accessed through a Python client. A specific version of the Falcon 7B LoRA weights is downloaded from Comet's model registry and loaded in memory using QLoRA. The goal of the inference pipeline is to use LangChain to glue the 2 components into a single FinancialAssistant entity. The FinancialAssistant entity is deployed in a request-response fashion under a RESTful API. We used Beam to deploy it quickly under a serverless web endpoint. Deploying any model using Beam as a RESTful API is as easy as writing the following Python decorator: @financial_bot.rest_api(keep_warm_seconds=300, loader=load_bot) def run(inputs): ... Now let's understand the flow of the FinancialAssistant chain: (1) Clean the user's input prompt and use a pre-trained all-MiniLM-L6-v2 encoder-only model to embed it (the same LM used to populate the vector DB). (2) Using the embedded user input, query the Qdrant vector DB and extract the top 3 most similar financial news chunks based on the cosine similarity distance. These 2 steps were necessary to do RAG.
If you don't know how RAG works, check out Lesson 3. (3) Build the final prompt using a PromptTemplate class (the same one used for training) that formats the following components: a system prompt, the user's input prompt, the financial news context, and the chat history. (4) Now that our prompt contains all the necessary data, we pass it to the fine-tuned Falcon 7B LLM for the final answer. The input prompt and LLM answer are logged and monitored by Comet LLMOps. (5) You can get the answer in one shot or use the TextIteratorStreamer class from HuggingFace to stream it token by token. (6) Store the user's input prompt and LLM answer in the chat history. (7) Pass the final answer to the client. Note: you can use the TextIteratorStreamer class and wrap the FinancialAssistant under a WebSocket (instead of the RESTful API) to stream the bot's answer token by token, similar to what you see in the interface of ChatGPT; a minimal streaming sketch follows below. How: Inference pipeline. Build and deploy an inference pipeline using LangChain, powered by LLMs and vector DBs (Image by the Author).
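For step 5 and the note above, here is a minimal sketch of token-by-token streaming with Hugging Face's TextIteratorStreamer; the model name and prompt are placeholders, and the surrounding FinancialAssistant wiring is omitted.

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "tiiuae/falcon-7b-instruct"  # placeholder; the course uses a fine-tuned Falcon 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "What is a good strategy to hedge against inflation?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The streamer yields decoded text chunks as soon as the model produces them.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=128)

# generate() blocks, so it runs in a background thread while we consume the stream.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for token_text in streamer:
    print(token_text, end="", flush=True)  # or push each chunk over a WebSocket
thread.join()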
Check out the Hands-on LLMs course and support it. That's it for today. With this, we concluded the Hands-on LLMs series. I hope you enjoyed it. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-this-is-what-you-need-to-build?r=1ttoeh" }, { "id": "ceacd8d8-91dc-42a7-ad33-97964bf91387", "content": "DML: 7 steps on how to fine-tune an open-source LLM to create your real-time financial advisor. Lesson 8, The Hands-on LLMs Series. Paul Iusztin, Dec 21, 2023. Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps, one week at a time. Lesson 8, The Hands-on LLMs Series. Table of Contents: 1. What is Beam? How does serverless make deploying ML models easy? 2. 7 tips you must know to reduce the VRAM consumption of your LLMs during training. 3. 7 steps on how to fine-tune an open-source LLM to create your real-time financial advisor. Previous Lessons: Lesson 5: Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering? Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor? Lesson 7: How do you generate a Q&A dataset in 30 minutes to fine-tune your LLMs? Check out the Hands-on LLMs course and support it. 1. What is Beam? How does serverless make deploying ML models easy? Deploying and managing ML models is hard, especially when running your models on GPUs. But serverless makes things easy. Using Beam as your serverless provider, deploying and managing ML models can be as easy as: Define your infrastructure and dependencies: in a few lines of code, you define the application that contains the requirements of your infrastructure (such as the CPU, RAM, and GPU), the dependencies of your application, and the volumes from which you can load your data and store your artifacts. Deploy your jobs: using the Beam application, you can quickly decorate your Python functions to run them once on the given serverless application, put your task or job in a queue to be processed, or even schedule it using a CRON-based syntax, or deploy it as a RESTful API endpoint. How do you use Beam as your serverless provider? (Image by the Author.) As you can see in the image below, you can have one central function for training or inference, and with minimal effort, you can switch between all these deployment methods. Also, you don't have to bother at all with managing the infrastructure on which your jobs run. You specify what you need, and Beam takes care of the rest. By doing so, you can directly start to focus on your application and stop worrying about the infrastructure. This is the power of serverless! Check out Beam to learn more.
2. 7 tips you must know to reduce the VRAM consumption of your LLMs during training. Here are 7 tips you must know to reduce the VRAM consumption of your LLMs during training so you can fit them on a single GPU. When training LLMs, one of the pain points is having enough VRAM on your system. The good news is that the gods of DL are with us, and there are methods to lower your VRAM consumption without a significant impact on your performance. (1) Mixed precision: during training you use both FP32 and FP16 in the following way: FP32 weights -> FP16 weights -> FP16 gradients -> FP32 gradients -> update weights -> FP32 weights (and repeat). As you can see, the forward and backward passes are done in FP16, and only the optimization step is done in FP32, which reduces both the VRAM and the runtime. (2) Lower precision: all your computations are done in FP16 instead of FP32. But the key is using bfloat16 (Brain Floating Point), a numerical representation Google developed for deep learning. It allows you to represent very large and very small numbers, avoiding overflow or underflow scenarios. (3) Reducing the batch size: this one is straightforward. Fewer samples per training iteration result in smaller VRAM requirements. The downside of this method is that you can't go too low with your batch size without impacting your model's performance. (4) Gradient accumulation: a simple and powerful trick to increase your batch size virtually. You compute the gradients for micro-batches (forward and backward passes). Once the accumulated gradients reach the given virtual target, the model weights are updated with the accumulated gradients. For example, if you have a batch size of 4 and a micro-batch size of 1, then the forward and backward passes will be done using only one sample, and the optimization step will use the aggregated gradient of the 4 samples. (5) Use a stateless optimizer: Adam is the most popular optimizer.
It is one of the most stable optimizers, but the downside is that it has 2 additional parameters (a mean and a variance) for every model parameter. If you use a stateless optimizer, such as SGD, you can reduce the number of parameters by 2/3, which is significant for LLMs. (6) Gradient (or activation) checkpointing: it drops specific activations during the forward pass and recomputes them during the backward pass. Thus, it eliminates the need to hold all activations in VRAM simultaneously. This technique reduces VRAM consumption but makes training slower. (7) CPU parameter offloading: as the name suggests, the parameters that do not fit in your GPU's VRAM are loaded onto the CPU. Intuitively, you can see it as model parallelism between your GPU and CPU. (A happy dude going for a walk with his GPU. Image by DALL-E.) Most of these methods are orthogonal, so you can combine them and drastically reduce your VRAM requirements during training. A short configuration sketch follows below.
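As a rough illustration of how several of these tips map to Hugging Face TrainingArguments flags, here is a minimal sketch; the values are placeholders of mine, not the course's configuration, and the optimizer choice assumes a recent transformers version that supports it.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,    # tip 3: small per-step batch size
    gradient_accumulation_steps=4,    # tip 4: virtual batch size of 4
    bf16=True,                        # tips 1-2: mixed/lower precision with bfloat16 (needs hardware support)
    gradient_checkpointing=True,      # tip 6: recompute activations during the backward pass
    optim="sgd",                      # tip 5: a stateless optimizer instead of Adam (assumes this optim value is available)
    num_train_epochs=1,
)

CPU offloading (tip 7) is usually handled by the loading or parallelism layer (e.g., device_map settings or DeepSpeed/ZeRO offload) rather than by TrainingArguments alone.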
3. 7 steps on how to fine-tune an open-source LLM to create your real-time financial advisor. In the past weeks, we covered why you have to fine-tune an LLM and what resources and tools you need: a Q&A dataset, a pre-trained LLM (Falcon 7B) with QLoRA, MLOps tooling (experiment tracker, model registry, prompt monitoring via Comet ML), and a compute platform (Beam). Now, let's see how you can hook all of these pieces together into a single fine-tuning module. (1) Load the Q&A dataset: our Q&A samples have the following structure keys: about_me, user_context, question, and answer. For task-specific fine-tuning, you need only 100 to 1000 samples, so you can directly load the whole JSON in memory. After that, you map every sample to a list of Python dataclasses to validate the structure and types of the ingested instances. (2) Preprocess the Q&A dataset into prompts: the first step is to use unstructured to clean every sample by removing redundant characters. Then, as every sample consists of multiple fields, you must map it to a single piece of text, also known as the prompt. To do so, you define a PromptTemplate class to manage all your prompts. You will use it to map all the sample keys to a prompt using a Python f-string. The last step is to map the list of Python dataclasses to a HuggingFace dataset and map every sample to a prompt, as discussed above. (3) Load the LLM using QLoRA: load a pre-trained Falcon 7B LLM by passing a bitsandbytes quantization configuration that loads all the weights in 4 bits. Then, using LoRA, you freeze the weights of the original Falcon LLM and attach a set of trainable adapters to it. (4) Fine-tuning: the trl Python package makes this step extremely simple. You pass the training arguments, the dataset, and the model to the SFTTrainer class and call the train method. One crucial aspect is configuring an experiment tracker, such as Comet ML, to log the loss and other vital metrics and artifacts. (5) Push the best model to the model registry: one of the final steps is to attach a callback to the SFTTrainer class that runs when the training ends to push the model with the lowest loss to the model registry as the new production candidate. (6) Evaluate the new production candidate: evaluating generative AI models can be pretty tricky. You can run the LLM on the test set and log the prompts and answers to Comet ML's monitoring system to check them manually. If the provided answers are valid, you manually release it through the model registry dashboard to replace the old LLM. (7) Deploy to Beam: it is as easy as wrapping the training and inference functions or classes with a Python app.run decorator. Below is a rough sketch of steps 3 and 4.
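Here is a minimal sketch of steps 3 and 4 (QLoRA loading plus trl's SFTTrainer); the dataset, hyperparameters, and prompt field are placeholders, and the trl API has changed across versions, so treat this as a rough outline under those assumptions rather than the course's exact code.

import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_name = "tiiuae/falcon-7b"  # the pre-trained base model

# Step 3: load the LLM with a 4-bit bitsandbytes quantization config (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Freeze the base weights and attach trainable LoRA adapters.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Placeholder dataset: each sample already rendered into a single prompt string.
train_dataset = Dataset.from_list([{"text": "### Question: ...\n### Answer: ..."}])

# Step 4: trl's SFTTrainer wires the model, dataset, adapters, and training arguments together.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # older trl versions take this directly; newer ones use SFTConfig
    args=TrainingArguments(output_dir="./out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=4, bf16=True),
)
trainer.train()

An experiment-tracker callback (e.g., for Comet ML) and a push-to-registry callback would be attached to the trainer in the same place.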
(A step-by-step guide on fine-tuning an LLM to create a real-time financial advisor. Image by the Author.) Check out the Hands-on LLMs course and support it.

That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 9, the last lesson of the Hands-On LLMs series. Paul

Whenever you're ready, here is how I can help you: 1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code and 2.5 hours of reading and video materials on Medium. 2. Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps. 3. Machine Learning & MLOps Hub: a place where all my work is aggregated (courses, articles, webinars, podcasts, etc.).", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-7-steps-on-how-to-fine-tune-an?r=1ttoeh" }, { "id": "dffed5e0-c824-40db-9388-a26fa09f7b49", "content": "DML: How do you generate a Q&A dataset in 30 minutes to fine-tune your LLMs? Lesson 7 of the Hands-on LLMs Series. Paul Iusztin, Dec 14, 2023.

Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps one week at a time.

Table of Contents: 1. Real-time feature pipeline video lesson 2. How do you generate a synthetic domain-specific Q&A dataset in 30 minutes to fine-tune your open-source LLM? 3. My personal list of filtered resources about LLMs and vector DBs

Previous Lessons: Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG? Lesson 5: Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering? Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?

Check out the Hands-on LLMs course and support it.
1. Real-time feature pipeline video lesson

I know we are currently talking about the training pipeline and Q&A dataset generation, but sometimes mixing the information you have to remember and making new connections is healthy. Or maybe that is only an excuse to share the video lesson about the feature pipeline that wasn't ready when I started this series. It will teach you how to ingest financial news in real time from Alpaca, clean and embed the documents, and load them into a vector DB.

Here is an overview of the video: 1. Step-by-step instructions on how to set up the streaming pipeline code and a Qdrant vector DB serverless cluster 2. Why we used Bytewax to build the streaming pipeline 3. How we used Bytewax to ingest financial news in real time leveraging a WebSocket, clean the documents, chunk them, embed them, and ingest them into the Qdrant vector DB 4. How we adapted the Bytewax streaming pipeline to also work in batch mode to populate the vector DB with historical data 5. How to run the code 6. How to deploy the code to AWS. Here it is, enjoy!

2. How do you generate a synthetic domain-specific Q&A dataset in 30 minutes to fine-tune your open-source LLM?

This method is also known as fine-tuning with distillation. Here are its 3 main steps. For example, let's generate a Q&A fine-tuning dataset used to fine-tune a financial advisor LLM.
Step 1: Manually generate a few input examples. Generate a few input samples (~3) that have the following structure: user_context (describes the type of investor, e.g., "I am a 28-year-old marketing professional") and question (describes the user's intention, e.g., "Is Bitcoin a good investment option?").

Step 2: Expand the input examples with the help of a teacher LLM. Use a powerful LLM as a teacher (e.g., GPT-4, Falcon 180B, etc.) to generate up to N similar input examples (see the sketch after this list). We generated 100 input examples in our use case, but you can generate more. You will use the manually filled input examples to do few-shot prompting. This will guide the LLM to give you domain-specific samples. The prompt will look like this: "Generate 100 more examples with the following pattern: USER CONTEXT 1 ... QUESTION 1 ... USER CONTEXT 2 ..."

Step 3: Use the teacher LLM to generate outputs for all the input examples. Now, you will use the same powerful LLM as a teacher, but this time it will answer all your N input examples. But first, to introduce more variance, we will use RAG to enrich the input examples with news context. Afterward, we will use the teacher LLM to answer all N input examples. ...and bam! You generated a domain-specific Q&A dataset with almost zero manual work. Now, you will use this data to train a smaller LLM (e.g., Falcon 7B) on a niche task, such as financial advising.
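To make steps 2 and 3 concrete, here is a minimal sketch that uses the OpenAI Python client as the teacher LLM. The seed samples, model name, and prompt wording are illustrative placeholders, not the course's exact code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: a few manually written seed samples (made-up examples).
seed_samples = [
    {"user_context": "I am a 28-year-old marketing professional.",
     "question": "Is Bitcoin a good investment option?"},
    {"user_context": "I am a 45-year-old teacher saving for retirement.",
     "question": "Should I buy index funds or individual stocks?"},
]

# Step 2: few-shot prompt the teacher LLM to expand the seeds into similar samples.
few_shot = "\n\n".join(
    f"USER CONTEXT: {s['user_context']}\nQUESTION: {s['question']}" for s in seed_samples
)
expansion = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Generate 10 more examples with the following pattern:\n\n{few_shot}",
    }],
)
print(expansion.choices[0].message.content)

# Step 3 would then loop over the generated (context, question) pairs, optionally enrich
# each prompt with retrieved news context (RAG), and ask the same teacher LLM to answer them.
```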
This technique is known as fine-tuning with distillation because you use a powerful LLM as the teacher (e.g., GPT-4, Falcon 180B) to generate the data, which will be used to fine-tune a smaller LLM (e.g., Falcon 7B), which acts as the student. Note: To ensure that the generated data is of high quality, you can hire a domain expert to check and refine it. (How do you generate a Q&A dataset in 30 minutes to fine-tune your LLMs? Image by the Author.) To learn more about this technique, check out "How to generate a Q&A dataset in less than 30 minutes", Pau Labarta's article from Real World Machine Learning.

3. My personal list of filtered resources about LLMs and vector DBs

The internet is full of learning resources about LLMs and vector DBs, but most of it is trash. After 6 months of researching LLMs and vector DBs, here is a list of filtered resources that I personally use:

Blogs: philschmid, Chip Huyen, eugeneyan, LLM Learning Lab, Lil'Log, VectorHub by SuperLinked, Qdrant Blog

Articles: Patterns for Building LLM-based Systems and Products, RLHF: Reinforcement Learning from Human Feedback, Illustrating Reinforcement Learning from Human Feedback (RLHF), Understanding Encoder and Decoder LLMs, Building LLM applications for production, Prompt Engineering, Transformers, Bidirectional Encoder Representations from Transformers (BERT), Multimodality and Large Multimodal Models (LMMs) by Chip Huyen

Videos: Word Embedding and Word2Vec, Clearly Explained!!!, Let's build GPT: from scratch, in code, spelled out, Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!!, Large Language Models with Semantic Search, Decoder-Only Transformers, ChatGPT's specific Transformer, Clearly Explained!!!
Code Repositories: OpenAI Cookbook, generative-ai-for-beginners

Courses: LangChain for LLM Application Development, Building Systems with the ChatGPT API, ChatGPT Prompt Engineering for Developers

...and hopefully, my Hands-on LLMs course will soon appear among them. (Image by DALL-E.) Let me know what you think of this list, and have fun learning!

That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 8 of the Hands-On LLMs series. Paul", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-do-you-generate-a-q-and-a?r=1ttoeh" }, { "id": "15c3831b-67fd-4279-970a-a720aafefa67", "content": "DML: What do you need to fine-tune an open-source LLM to create your financial advisor? Lesson 6 of the Hands-on LLMs Series. Paul Iusztin, Dec 07, 2023.

Table of Contents: 1. The difference between encoders, decoders, and encoder-decoder LLMs 2. You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data 3. What do you need to fine-tune an open-source LLM to create your own financial advisor?

Previous Lessons: Lesson 3: Why and what do you need a streaming pipeline for when implementing RAG in your LLM applications? Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG? Lesson 5: Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?
Check out the Hands-on LLMs course and support it.

1. The difference between encoders, decoders, and encoder-decoder LLMs

Let's see when to use each architecture. As embeddings are everywhere, both encoders and decoders use self-attention layers to encode word tokens into embeddings. The devil is in the details. Let's clarify it.

The Original Transformer: It is an encoder-decoder setup. The encoder processes the input text and hands off its understanding as embeddings to the decoder, which will generate the final output. The key difference between an encoder and a decoder is in how each processes its inputs and outputs.

Encoders: The role of an encoder is to extract relevant information from the whole input and encode it into an embedding (e.g., BERT, RoBERTa). Within the multi-head attention of the transformer, all the tokens are allowed to speak to each other: a token at position t can attend to all previous tokens [0, t-1] and all future tokens [t+1, T]. This means the attention mask is computed over the whole sequence. Thus, because the encoder processes the whole input, it is helpful for classification tasks (e.g., sentiment analysis) and for creating embeddings for clustering, recommender systems, vector DB indexes, etc.

Decoders: On the flip side, if you want to generate text, use decoder-only models (e.g., the GPT family). Only the current and previous tokens (not the whole input) are used to predict the next token. Within the masked multi-head attention, the future positions are masked to maintain the autoregressive property of the decoding process. For example, instead of all the tokens talking to each other, a token at position t will have access only to the previous tokens at positions t-1, t-2, t-3, ..., 0.

Encoder-decoder: This setup is used when you have to understand the entire input sequence (encoder) and the previously generated sequence (decoder, autoregressive). Typical use cases are text translation and summarization (the original transformer was built for text translation), where the output heavily relies on the input. Why? Because the decoding step always has to be conditioned on the encoded information. Through what is known as cross-attention, the decoder queries the encoded information to guide the decoding process. For example, when translating English to Spanish, every predicted Spanish token is conditioned on both the previously predicted Spanish tokens and the entire English sentence.

(Encoder vs. Decoder vs. Encoder-Decoder LLMs. Image by the Author.)

To conclude: a decoder takes previous tokens as input and predicts the next one in an autoregressive way; by dropping the masking logic from the masked multi-head attention, you process the whole input, transforming the decoder into an encoder; and if you hook the encoder to the decoder through a cross-attention layer, you have an encoder-decoder architecture.
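As a tiny illustration of the masking difference, here is a minimal PyTorch sketch with toy tensor shapes (not tied to any particular model) that contrasts full bidirectional attention with causal, decoder-style attention:

```python
import torch
import torch.nn.functional as F

seq_len, dim = 5, 8
q = torch.randn(1, seq_len, dim)
k = torch.randn(1, seq_len, dim)
v = torch.randn(1, seq_len, dim)

# Encoder-style attention: every token attends to every other token.
scores = q @ k.transpose(-2, -1) / dim**0.5
encoder_out = torch.softmax(scores, dim=-1) @ v

# Decoder-style (masked) attention: token t only sees positions 0..t.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
decoder_out = torch.softmax(masked_scores, dim=-1) @ v

# The same causal behavior via PyTorch's fused helper.
decoder_out_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(decoder_out.shape, decoder_out_fused.shape)
```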
2. You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data

Stage 1: Pre-training for completion. You start with a bare, randomly initialized LLM. This stage aims to teach the model to spit out tokens. More concretely, based on previous tokens, the model learns to predict the next token with the highest probability. For example, if your input to the model is "The best programming language is ___", it will answer, "The best programming language is Rust." Intuitively, at this stage, the LLM learns to speak. Data: ~1 trillion tokens (~15 million books). The data quality doesn't have to be great; hence, you can scrape data from the internet.

Stage 2: Supervised fine-tuning (SFT) for dialogue. You start with the pre-trained model from stage 1. This stage aims to teach the model to respond to the user's questions. For example, without this step, when prompted "What is the best programming language?", it has a high probability of continuing with a series of questions such as "What is MLOps? What is MLE?" etc. As the model mimics the training data, you must fine-tune it on Q&A (questions and answers) data to align the model to respond to questions instead of just predicting the following tokens. After the fine-tuning step, when prompted "What is the best programming language?", it will respond "Rust." Data: 10K-100K Q&A examples. Note: After aligning the model to respond to questions, you can further single-task fine-tune the model on Q&A data for a specific use case to specialize the LLM.
Stage 3: Reinforcement learning from human feedback (RLHF). Demonstration data tells the model what kind of responses to give but doesn't tell the model how good or bad a response is. The goal is to align your model with user feedback (what users liked or didn't like) to increase the probability of generating answers that users find helpful. RLHF is split in 2: 1. Using the LLM from stage 2, train a reward model to act as a scoring function using (prompt, winning_response, losing_response) samples, also known as comparison data. The model will learn to maximize the difference between these two scores (a tiny sketch of this pairwise objective follows below). After training, this model outputs a reward for any (prompt, response) tuple. Data: 100K-1M comparisons. 2. Use an RL algorithm (e.g., PPO) to fine-tune the LLM from stage 2. Here, you use the reward model trained above to give a score to every (prompt, response) pair. The RL algorithm aligns the LLM to generate responses with higher rewards, increasing the probability of generating answers that users liked. Data: 10K-100K prompts.

(The 3 main stages of training an LLM that you must know. Image by the Author.) Note: Post inspired by Chip Huyen's "RLHF: Reinforcement Learning from Human Feedback" article.
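To make the reward-model idea from step 1 concrete, here is a minimal PyTorch sketch of the pairwise ranking loss typically used on comparison data (random scores stand in for the outputs of a real reward model):

```python
import torch
import torch.nn.functional as F

# Scores a reward model would assign to (prompt, winning_response) and
# (prompt, losing_response) pairs; random values here purely for illustration.
reward_chosen = torch.randn(4, requires_grad=True)    # r(prompt, winning_response)
reward_rejected = torch.randn(4, requires_grad=True)  # r(prompt, losing_response)

# Pairwise ranking loss: push the winning score above the losing score.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(loss.item())
```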
3. What do you need to fine-tune an open-source LLM to create your own financial advisor?

This is the LLM fine-tuning kit you must know.

Dataset: The key component of any successful ML project is the data. You need a 100-1000 sample Q&A (questions and answers) dataset with financial scenarios. The best approach is to hire a bunch of experts to create it manually. But for a PoC, that might get expensive and slow. The good news is that a method called fine-tuning with distillation exists. In a nutshell, this is how it works: use a big, powerful LLM (e.g., GPT-4) to generate your fine-tuning data; after, use this data to fine-tune a smaller model (e.g., Falcon 7B). For specializing smaller LLMs on specific use cases (e.g., financial advisors), this is an excellent method to kick off your project.

Pre-trained open-source LLM: You never (or rarely) want to start training your LLM from scratch. Why? Because you need trillions of tokens and millions of dollars in compute power. You want to fine-tune your LLM on your specific task. The good news is that you can find a plethora of open-source LLMs on HuggingFace (e.g., Falcon, LLaMA, etc.).

Parameter-efficient fine-tuning: As LLMs are big... duh... they don't fit on a single GPU. As you only want to fine-tune the LLM, the community invented clever techniques that quantize the LLM to fit on a single GPU and fine-tune only a set of smaller adapters. One popular approach is QLoRA, which can be implemented using HF's peft Python package.

MLOps: As you want your project to get to production, you have to integrate the following MLOps components: an experiment tracker to monitor and compare your experiments; a model registry to version and share your models between the FTI pipelines; prompt monitoring to debug and track complex chains. All of them are available on ML platforms, such as Comet ML.

Compute platform: The most common approach is to train your LLM on your on-prem Nvidia GPU cluster or rent GPUs from cloud providers such as AWS, Paperspace, etc. But what if I told you that there is an easier way? There is! It is called serverless. For example, Beam is a GPU serverless provider that makes deploying your training pipeline as easy as decorating your Python function with app.run(). Along with ease of deployment, you can easily add your training code to your CI/CD to add the final piece of the MLOps puzzle, called CT (continuous training). (Beam: What? Training pipeline. Image by the Author.)

To see all these components in action, check out our FREE Hands-on LLMs course.

That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 7 of the Hands-On LLMs series. Paul
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-what-do-you-need-to-fine-tune?r=1ttoeh" }, { "id": "174d6f07-42f4-4190-9150-bb4ad35f8413", "content": "DML: Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering? Lesson 5 of the Hands-on LLMs Series. Paul Iusztin, Nov 30, 2023.

Table of Contents: 1. Using this Python package, you can x10 your text preprocessing pipeline development 2. Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering? 3. Fine-tuning video lessons

Previous Lessons: Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs (LLMOps vs. MLOps) Lesson 3: Why and what do you need a streaming pipeline for when implementing RAG in your LLM applications? Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?

Check out the Hands-on LLMs course and support it.

1. Using this Python package, you can x10 your text preprocessing pipeline development

Any text preprocessing pipeline has to clean, partition, extract, or chunk text data to feed it into your LLMs. unstructured offers a rich and clean API that allows you to quickly: partition your data into smaller segments from various data sources (e.g., HTML, CSV, PDFs, even images, etc.); clean the text of anomalies (e.g., wrong ASCII characters) and irrelevant information (e.g., white spaces, bullets, etc.), and fill in missing values; extract information from pieces of text (e.g., datetimes, addresses, IP addresses, etc.);
chunk your text segments into pieces of text that can be inserted into your embedding model; embed data (e.g., wrappers over OpenAIEmbeddingEncoder, HuggingFaceEmbeddingEncoders, etc.); stage your data to be fed into various tools (e.g., Label Studio, Label Box, etc.). (Unstructured. Image by the Author.)

All these steps are essential for: feeding your data into your LLMs; embedding the data and ingesting it into a vector DB; doing RAG; labeling; recommender systems; ...basically any LLM or multimodal application. Implementing all these steps from scratch will take a lot of time. I know some Python packages already do this, but the functionality is scattered across multiple packages. unstructured packages everything together under a nice, clean API. Check it out, and see the small sketch below for a taste of it.
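As a quick taste of that API, here is a minimal sketch using unstructured's HTML partitioning and cleaning helpers (the HTML snippet is made up, and the exact cleaner arguments may vary between unstructured versions):

```python
from unstructured.partition.html import partition_html
from unstructured.cleaners.core import clean, clean_non_ascii_chars, replace_unicode_quotes

html = "<html><body><h1>Tesla stock jumps</h1><p>Shares rose 5% after earnings...</p></body></html>"

# Partition: split the raw HTML into structured elements (titles, paragraphs, ...).
elements = partition_html(text=html)

# Clean: strip bullets, dashes, extra whitespace, odd quotes, and non-ASCII noise.
for element in elements:
    text = clean(
        element.text,
        bullets=True, extra_whitespace=True, dashes=True,
        trailing_punctuation=True, lowercase=True,
    )
    text = replace_unicode_quotes(clean_non_ascii_chars(text))
    print(text)
```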
2. Why and when do you need to fine-tune open-source LLMs? What about fine-tuning vs. prompt engineering?

Fine-tuning is the process of taking a pre-trained model and further refining it on a specific task. First, let's clarify what methods of fine-tuning an open-source LLM exist: Continued pre-training: utilize domain-specific data to apply the same pre-training process (next-token prediction) on the pre-trained base model. Instruction fine-tuning: the pre-trained base model is fine-tuned on a Q&A dataset to learn to answer questions. Single-task fine-tuning: the pre-trained model is refined for a specific task, such as toxicity detection, coding, medical advice, etc. RLHF: it requires collecting human preferences (e.g., pairwise comparisons), which are then used to train a reward model. The reward model is used to fine-tune the LLM via RL techniques such as PPO. Common approaches are to take a pre-trained LLM (next-word prediction) and apply instruction and single-task fine-tuning.

Why do you need to fine-tune the LLM? You do instruction fine-tuning to make the LLM learn to answer your questions. The exciting part is when you want to fine-tune your LLM on a single task. Here is why: Performance: it will improve your LLM's performance on given use cases (e.g., coding, extracting text, etc.). Mainly, the LLM will specialize in a given task (a specialist will always beat a generalist in its domain). Control: you can refine how your model should behave on specific inputs and outputs, resulting in a more robust product. Modularization: you can create an army of smaller models, where each is specialized in a particular task, increasing the overall system's performance. Usually, when you fine-tune on one task, it reduces performance on the other tasks (known as the alignment tax). Thus, having an expert system of multiple smaller models can improve the overall performance.

What about prompt engineering vs. fine-tuning? Data: use prompting when you don't have data available (2 examples are enough); fine-tuning needs at least ~100 examples to work. Cost: prompting forces you to write long, detailed prompts to reach your required level of performance. You pay per token (API- or compute-wise); thus, when a prompt gets bigger, your costs increase. But when fine-tuning an LLM, you incorporate all that knowledge inside the model, so you can use smaller prompts with similar performance. (Fine-tuning LLMs. Image by the Author.)

When you start a project, a good strategy is to write a wrapper over an API (e.g., OpenAI's GPT-4, Anyscale, etc.) that defines a desired interface that can easily be swapped with your open-source implementation in future iterations; a small sketch of this idea follows below. Check out the Hands-on LLMs course to see this in action.
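Here is one possible shape of such a wrapper, as a hedged sketch; the class and method names are made up for illustration and are not from the course's codebase:

```python
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """The minimal interface the rest of the application depends on."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class OpenAIClient(LLMClient):
    """Backend #1: a closed-source API, handy for the first iterations."""

    def __init__(self, model: str = "gpt-4"):
        from openai import OpenAI  # imported lazily so the dependency stays optional
        self._client = OpenAI()
        self._model = model

    def complete(self, prompt: str) -> str:
        response = self._client.chat.completions.create(
            model=self._model, messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class LocalLLMClient(LLMClient):
    """Backend #2: a drop-in replacement backed by your fine-tuned open-source model."""

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("Swap in your fine-tuned Falcon 7B inference code here.")


def answer(client: LLMClient, question: str) -> str:
    # Application code only sees the interface, so backends can be swapped freely.
    return client.complete(question)
```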
3. Fine-tuning video lessons

As you might know, Pau Labarta Bajo from Real World Machine Learning and I are also working on a free Hands-on LLMs course that contains the open-source code and a set of video lessons. Here are the 2 video lessons about fine-tuning.

01: Hands-on LLMs, theoretical part. Here is a summary of the 1st video lesson.

Why fine-tune large language models? 1. Performance: fine-tuning a large language model (LLM) can improve performance, especially for specialized tasks. 2. Economics: fine-tuned models are smaller and thus cheaper to run. This is crucial, given that LLMs can have billions of parameters.

What do you need to implement a fine-tuning pipeline? 1. Dataset: you need a dataset of input-output examples. This dataset can be created manually or semi-automatically using existing LLMs like GPT-3.5. 2. Base LLM: choose an open-source LLM from repositories like Hugging Face's Model Hub (e.g., Falcon 7B). 3. Fine-tuning script: data loader + trainer. 4. Advanced fine-tuning techniques to fine-tune the model on cheap hardware: QLoRA. 5. MLOps: experiment tracker, model registry.
6. Infrastructure: Comet + Beam.

02: Hands-on LLMs, diving into the code. Here is a short walkthrough of the lesson: 1. How to set up the code and environment using Poetry. 2. How to configure Comet and Beam. 3. How to start the training pipeline locally (if you have a CUDA-enabled GPU) or on Beam (to run your training pipeline on serverless infrastructure, no matter what hardware you have). 4. An overview of the code. 5. Clarifying why we integrated Poetry, a model registry, and linting within the training pipeline. This video is critical for everyone who wants to replicate the training pipeline of our course on their system. The previous lesson focused on the theoretical parts of the training pipeline. To find the code and all the videos, check out the Hands-on LLMs GitHub repository.

That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 6 of the Hands-On LLMs series. Paul", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-why-and-when-do-you-need-to-fine?r=1ttoeh" }, { "id": "b6d86294-1bcc-4226-8218-3a63cab813a2", "content": "DML: How to implement a streaming pipeline to populate a vector DB for real-time RAG? Lesson 4 of the Hands-on LLMs Series. Paul Iusztin, Nov 23, 2023.
Table of Contents: 1. What is Bytewax? 2. Why have vector DBs become so popular? Why are they so crucial for most ML applications? 3. How to implement a streaming pipeline to populate a vector DB for real-time RAG?

Previous Lessons: Lesson 1: How to design an LLM system for a financial assistant using the 3-pipeline design Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs (LLMOps vs. MLOps) Lesson 3: Why and what do you need a streaming pipeline for when implementing RAG in your LLM applications?

Check out the Hands-on LLMs course and support it.

1. What is Bytewax?

Are you afraid of writing streaming pipelines? Or do you think they are hard to implement? I did, until I discovered Bytewax. Let me show you. Bytewax is an open-source stream processing framework that is built in Rust for performance and has Python bindings for ease of use... so for all the Python fanatics out there, no more JVM headaches for you.

Jokes aside, here is why Bytewax is so powerful: the local setup is plug-and-play; it can quickly be integrated into any Python project (you can go wild and even use it in notebooks); it can easily be integrated with other Python packages (NumPy, PyTorch, HuggingFace, OpenCV, SkLearn, you name it); it has out-of-the-box connectors for Kafka and local files, or you can quickly implement your own; and it has a CLI tool to easily deploy it to K8s, AWS, or GCP. For example, look at the image below: 1. We defined a streaming app in a few lines of code. 2. We ran the streaming app with one command. (A tiny dataflow in that spirit is sketched below.)

The thing is that I worked with Kafka Streams in Kotlin for one year. I loved and understood the power of building streaming applications. The only thing that stood in my way was, well... Java. I don't have anything against Java; it is a powerful language. However, building an ML application in Java + Python takes much time due to the greater resistance to integrating the two. ...and that's where Bytewax kicks in. We used Bytewax to build the streaming pipeline for the Hands-on LLMs course and loved it. (What is Bytewax? Image by the Author.)
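For flavor, here is a minimal Bytewax dataflow sketch. The Bytewax API has changed noticeably across releases, so this assumes the operator-based style of recent versions and uses a made-up toy input; it is an illustration, not the course's pipeline:

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource

# A toy stream standing in for the real-time financial news feed.
headlines = ["Tesla stock jumps after earnings", "Fed holds interest rates steady"]

flow = Dataflow("financial_news_demo")
news = op.input("news_in", flow, TestingSource(headlines))
cleaned = op.map("lowercase", news, str.lower)  # a stand-in for the real cleaning step
op.output("print", cleaned, StdOutSink())

# Run it with: python -m bytewax.run this_module:flow
```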
2. Why have vector DBs become so popular? Why are they so crucial for most ML applications?

In the world of ML, everything can be represented as an embedding. A vector DB is an intelligent way to use your data embeddings as an index and perform fast and scalable searches between unstructured data points. Simply put, a vector DB allows you to find matches between anything and anything (e.g., use an image as a query to find similar pieces of text, video, other images, etc.).

In a nutshell, this is how you can integrate a vector DB into real-world scenarios. Using various DL techniques, you project your data points (images, videos, text, audio, user interactions) into the same vector space (aka the embeddings of the data). You load the embeddings, along with a payload (e.g., a URL to the image, date of creation, image description, properties, etc.), into the vector DB, where the data is indexed along the vector, the payload, and the text within the payload. Now that the embeddings index your data, you can query the vector DB by embedding any data point. For example, you can query the vector DB with an image of your cat and use a filter to retrieve only black cats. To do so, you must embed the image using the same model you used to embed the data within your vector DB. Then you query the database using a given distance (e.g., cosine distance between 2 vectors) to find similar embeddings. These similar embeddings have their payload attached, which contains valuable information such as the URL to an image, a URL to a site, the ID of a user, a chapter from a book about the cat of a witch, etc. Using this technique, I used Qdrant to implement RAG for a financial assistant powered by LLMs.
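Here is a minimal sketch of that query-plus-filter pattern using the qdrant-client Python package. The collection, vectors, and payloads are toy values (a 4-dimensional made-up embedding stands in for a real image or text embedding):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams

client = QdrantClient(":memory:")  # throwaway in-memory instance for the example
client.create_collection(
    collection_name="cats",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Each point stores an embedding plus a payload with extra attributes.
client.upsert(
    collection_name="cats",
    points=[
        PointStruct(id=1, vector=[0.9, 0.1, 0.0, 0.0], payload={"color": "black", "url": "https://example.com/1"}),
        PointStruct(id=2, vector=[0.8, 0.2, 0.0, 0.1], payload={"color": "white", "url": "https://example.com/2"}),
    ],
)

# Query with an embedding (a made-up vector standing in for an embedded cat photo)
# and a payload filter so that only black cats are returned.
hits = client.search(
    collection_name="cats",
    query_vector=[0.85, 0.15, 0.0, 0.05],
    query_filter=Filter(must=[FieldCondition(key="color", match=MatchValue(value="black"))]),
    limit=3,
)
for hit in hits:
    print(hit.id, hit.score, hit.payload)
```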
But vector DBs go beyond LLMs and RAG. Here is a list of what you can build using vector DBs (e.g., Qdrant): similar image search; semantic text search (instead of plain text search); recommender systems; RAG for chatbots; anomaly detection. Check out Qdrant's guides and tutorials to learn more about vector DBs. (Qdrant's architecture. Image from the Qdrant docs.)

3. How to implement a streaming pipeline to populate a vector DB for real-time RAG?

This is how you can implement a streaming pipeline to populate a vector DB to do RAG for a financial assistant powered by LLMs. In a previous post, I covered why you need a streaming pipeline over a batch pipeline when implementing RAG. Now, we will focus on the how, aka the implementation details. All the following steps are wrapped in Bytewax functions and connected in a single streaming pipeline.

Extract financial news from Alpaca: You need 2 types of inputs. 1. A WebSocket API to listen to financial news in real time. This will be used to listen 24/7 for new data and ingest it as soon as it is available. 2. A RESTful API to ingest historical data in batch mode. When you deploy a fresh vector DB, you must populate it with data between a given range (date_start, date_end). You wrap the ingested HTML document and its metadata in a pydantic NewsArticle model to validate its schema; a small sketch of such a model follows below.
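For illustration, here is what such a pydantic model could look like. The field names and types are hypothetical, chosen to match the elements mentioned in this post, and may differ from the course's actual NewsArticle model:

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, HttpUrl


class NewsArticle(BaseModel):
    # Hypothetical schema: headline, summary, raw HTML content, plus metadata.
    article_id: str
    headline: str
    summary: Optional[str] = None
    content: str                 # raw HTML body before parsing and cleaning
    source_url: HttpUrl
    published_at: datetime


article = NewsArticle(
    article_id="abc123",
    headline="Tesla stock jumps",
    summary=None,
    content="<p>Shares rose 5% after earnings...</p>",
    source_url="https://example.com/news/tesla",
    published_at=datetime(2023, 11, 20, 14, 30),
)
print(article.headline, article.source_url)
```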
Regardless of the input type, the ingested data is the same. Thus, the following steps are the same for both data inputs.

Parse the HTML content: As the ingested financial news is in HTML, you must extract the text from particular HTML tags. unstructured makes it as easy as calling partition_html(document), which recursively returns the text within all essential HTML tags. The parsed NewsArticle model is mapped into another pydantic model to validate its new schema. The elements of the news article are the headline, summary, and full content.

Clean the text: Now we have a bunch of text that has to be cleaned. Again, unstructured makes things easy. Calling a few functions, we clean the dashes and bullets, extra whitespace and trailing punctuation, non-ASCII chars, and invalid quotes. Finally, we standardize everything to lowercase.

Chunk the text: As the text can exceed the context window of the embedding model, we have to chunk it. Yet again, unstructured provides a valuable function that splits the text based on the tokenized text and the expected input length of the embedding model. This strategy is naive, as it doesn't consider the text's structure, such as chapters, paragraphs, etc. As the news is short, this is not an issue, but LangChain provides a RecursiveCharacterTextSplitter class that does this if required.

Embed the chunks: You pass all the chunks through an encoder-only model. We used all-MiniLM-L6-v2 from sentence-transformers, a small model that can run on a CPU and outputs a 384-dimensional embedding. But based on the size and complexity of your data, you might need more complex and bigger models.

Load the data into the Qdrant vector DB: Finally, you insert the embedded chunks and their metadata into the Qdrant vector DB. The metadata contains the embedded text, the source_url, and the publish date. (A condensed sketch of the embed-and-load steps is shown below.)

(How to implement a streaming pipeline to populate a vector DB for real-time RAG. Image by the Author.) Check out the Hands-on LLMs course to see this in action.
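Here is a condensed sketch of the last two steps, using sentence-transformers and qdrant-client; the chunk texts, collection name, and metadata values are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Embed the cleaned chunks with the same small encoder mentioned above.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["tesla shares rose 5% after earnings", "the fed kept interest rates unchanged"]
vectors = encoder.encode(chunks)  # shape: (num_chunks, 384)

# Load the embeddings plus their metadata into Qdrant (in-memory instance for the sketch).
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="financial_news",
    vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
)
client.upsert(
    collection_name="financial_news",
    points=[
        PointStruct(
            id=i,
            vector=vector.tolist(),
            payload={"text": chunk, "source_url": "https://example.com", "published_at": "2023-11-20"},
        )
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ],
)
```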
That's it for today! See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 5 of the Hands-on LLMs series. Paul
Whenever you're ready, here is how I can help you:
1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that walks you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code and 2.5 hours of reading and video materials on Medium.
2. Machine Learning MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
3. Machine Learning MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-implement-a-streaming?r=1ttoeh" }, { "id": "b2296169-eed0-4b28-864a-08b061f5ee45", "content": "DML: Why and what do you need a streaming pipeline when implementing RAG in your LLM applications? Lesson 3 The Hands-on LLMs Series
Paul Iusztin, Nov 16, 2023
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps one week at a time.
Lesson 3 The Hands-on LLMs Series
Table of Contents:
1. RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
2. Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
3. What do you need to implement a streaming pipeline for a financial assistant?
Previous Lessons:
Lesson 1: How to design an LLM system for a financial assistant using the 3-pipeline design
Lesson 2: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs; LLMOps vs. MLOps
Check out the Hands-on LLMs course and support it with a star.
1. RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Let's find out. RAG is a popular strategy when building LLM applications to add external data to your prompt.
Problem: Working with LLMs has 3 main issues:
1. The world moves fast. An LLM learns an internal knowledge base, but its knowledge is limited to its training dataset. New data flows onto the internet every second, so the model's knowledge base can quickly become obsolete. One solution is to fine-tune the model every minute or day... if you have some billions to spend, go for it.
2. Hallucinations. An LLM tends to be blindly confident. Even if the answer looks 100% legit, you can never fully trust it.
3. Lack of reference links. It is hard to trust the response of an LLM if you can't see the source of its decisions, especially for important decisions (e.g., health, financials).
Solution: Surprise! It is RAG.
1. Avoid fine-tuning. Using RAG, you use the LLM as a reasoning engine and the external knowledge base as the main memory (e.g., a vector DB). The memory is volatile, so you can quickly introduce or remove data.
2. Avoid hallucinations. By forcing the LLM to answer solely based on the given context, it will either use the external data to respond to the user's question (if it contains the necessary insights) or answer "I don't know" if not.
3. Add reference links. Using RAG, you can easily track the source of the data and highlight it to the user.
How does RAG work?
Let's say we want to use RAG to build a financial assistant. What do we need?
- a data source with historical and real-time financial news (e.g., Alpaca)
- a stream processing engine (e.g., Bytewax)
- an encoder-only model for embedding the documents (e.g., one from sentence-transformers)
- a vector DB (e.g., Qdrant)
How does it work?
On the feature pipeline side:
1. Using Bytewax, you ingest the financial news and clean it.
2. You chunk the news documents and embed them.
3. You insert the embeddings of the documents, along with their metadata (e.g., the initial text, source_url, etc.), into Qdrant.
On the inference pipeline side:
4. The user's question is embedded using the same embedding model.
5. Using this embedding, you extract the top K most similar news documents from Qdrant.
6. Along with the user's question, you inject the necessary metadata from the extracted top K documents into the prompt template (e.g., the text of the documents and their source_url).
7. You pass the whole prompt to the LLM for the final answer.
What is Retrieval Augmented Generation (RAG)? (image by the author)
Check out the Hands-on LLMs course to see this in action.
2. Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
The quality of your RAG implementation is only as good as the quality and freshness of your data. Thus, depending on your use case, you have to ask: how fresh does the data in my vector DB have to be to provide accurate answers? For the best user experience, the data has to be as fresh as possible, aka real-time data. For example, when implementing a financial assistant, being aware of the latest financial news is critical; a single new piece of information can completely change the course of your strategy. Hence, when implementing RAG, one critical aspect is keeping your vector DB synced with all your external data sources in real time. A batch pipeline will work if your use case accepts a particular delay (e.g., one hour, one day), but with tools like Bytewax, building streaming applications becomes much more accessible. So why not aim for the best?
Streaming vs. batch pipelines when doing RAG (image by the author)
3. What do you need to implement a streaming pipeline for a financial assistant?
- A financial news data source exposed through a WebSocket (e.g., Alpaca).
- A Python stream processing framework. For example, Bytewax is built in Rust for efficiency and exposes a Python interface for ease of use; you don't need the Java ecosystem to implement real-time pipelines anymore.
- A Python package to process, clean, and chunk documents. unstructured offers a rich set of features that makes parsing HTML documents extremely convenient.
- An encoder-only language model that maps your chunked documents into embeddings. sentence-transformers is well integrated with Hugging Face and has a huge list of models of various sizes.
- A vector DB where you insert your embeddings and their metadata (e.g., the embedded text, the source_url, the creation date). For example, Qdrant provides a rich set of features and a seamless experience.
- A way to deploy your streaming pipeline. Docker and AWS will never disappoint you.
- A CI/CD pipeline for continuous tests and deployments. GitHub Actions is a great serverless option with a rich ecosystem.
This is everything you need to build and deploy a streaming pipeline written solely in Python (see the Bytewax sketch below).
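As a rough illustration of how those pieces snap together, here is a tiny Bytewax dataflow skeleton. It assumes Bytewax's 0.18+ operator API (older releases chain methods on the Dataflow object instead), uses a TestingSource in place of the real Alpaca WebSocket/REST connectors, and replaces the real parse, chunk, and embed logic with trivial stand-ins.

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main


# Trivial stand-ins for the unstructured/sentence-transformers helpers sketched earlier.
def parse_and_clean(html: str) -> str:
    return html.replace("<p>", " ").replace("</p>", " ").strip().lower()


def naive_chunk(text: str, max_chars: int = 500) -> list[str]:
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


def embed(chunk: str) -> list[float]:
    return [float(len(chunk))]  # placeholder vector; use sentence-transformers in practice


fake_news = ["<html><p>Fed keeps rates unchanged.</p></html>"]  # stand-in for Alpaca connectors

flow = Dataflow("financial_news_to_qdrant")
raw = op.input("news_in", flow, TestingSource(fake_news))
cleaned = op.map("parse_and_clean", raw, parse_and_clean)
chunks = op.flat_map("chunk", cleaned, naive_chunk)          # one stream item per chunk
embedded = op.map("embed", chunks, lambda c: (c, embed(c)))
op.output("out", embedded, StdOutSink())                     # a real sink would upsert into Qdrant

run_main(flow)
```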
Check out the Hands-on LLMs course to see this in action.
That's it for today! See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 4 of the Hands-on LLMs series. Paul
Whenever you're ready, here is how I can help you:
1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that walks you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code and 2.5 hours of reading and video materials on Medium.
2. Machine Learning MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
3. Machine Learning MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-why-and-what-do-you-need-a-streaming?r=1ttoeh" }, { "id": "032f3296-b891-484d-9e00-c2872bbb9bbe", "content": "DML: Unwrapping the 3-pipeline design of a financial assistant powered by LLMs; LLMOps vs. MLOps Lesson 2 The Hands-on LLMs Series
Paul Iusztin, Nov 09, 2023
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps one week at a time.
Lesson 2 The Hands-on LLMs Series
Table of Contents:
1. Introduction video lessons
2. What is LLMOps? MLOps vs. LLMOps
3.
Unwrapping step by step the 3 pipeline design of a financial assistant powered by LLMs Previous Lessons Lesson 1 How to design an LLM system for a financial assistant using the 3 pipeline design Check out the Hands on LLMs course and support it with a . 1. Introduction video lessons We started releasing the first video lessons of the course. This is a recording of me, where I presented at a webinar hosted by Gathers, a 1.5 hour overview of the \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00 \ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 course. Check it out to get a gut feeling of the LLM system This is the 1st official lesson of the Hands on LLMs course presented by no other but Pau Labarta Bajo from the Real World Machine Learning newsletter if you wonder, the course is the result of our collaboration . Pau is one of the best teachers I know. If you have some spare time, it is worth it Check out the Hands on LLMs course and support it with a . 2. What is LLMOps? MLOps vs. LLMOps LLMOps here, LLMOps there, but did you take the time to see how it differs from MLOps? If not, here is a 2 min LLMOps vs. MLOps summary \ud835\uddea\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\uddf6\ud835\ude00 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\udde2\ud835\uddfd\ud835\ude00? Well, everything revolves around the idea that Size matters. LLMOps is about best practices for efficient deployment, monitoring and maintenance, but this time for large language models. LLMOps is a subset of MLOps, focusing on training deploying large models trained on big data. Intuitive right? \ud835\uddd5\ud835\ude02\ud835\ude01 \ud835\uddf5\ud835\uddf2\ud835\uddff\ud835\uddf2 \ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\udff1 \ud835\udddf\ud835\udddf\ud835\udde0\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfe\ud835\ude02\ud835\uddf2 \ud835\uddf3\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\ude00 \ud835\ude01\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\ude00\ud835\uddf2\ud835\ude01 \ud835\uddf6\ud835\ude01 \ud835\uddee\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\ude01 \ud835\uddf3\ud835\uddff\ud835\uddfc\ud835\uddfa \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\udfed . \ud835\uddd6\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude02\ud835\ude01\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\uddee\ud835\uddf9 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfc\ud835\ude02\ud835\uddff\ud835\uddf0\ud835\uddf2\ud835\ude00 training your models on CUDA enabled GPUs is more critical than ever, along with knowing how to run your jobs on a cluster of GPUs leveraging data model parallelism using techniques such as ZeRO from DeepSpeed. Also, the high cost of inference makes model compression techniques essential for deployment. \ud835\udfee . \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddfb\ud835\ude00\ud835\uddf3\ud835\uddf2\ud835\uddff \ud835\uddf9\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 training models from scratch is a thing of the past. In most use cases, you will fine tune the model on specific tasks, leveraging techniques such as LLaMA Adapters or QLora. \ud835\udfef . \ud835\udddb\ud835\ude02\ud835\uddfa\ud835\uddee\ud835\uddfb \ud835\uddf3\ud835\uddf2\ud835\uddf2\ud835\uddf1\ud835\uddef\ud835\uddee\ud835\uddf0\ud835\uddf8 reinforcement learning from human feedback RLHF showed much potential in improving the quality of generated outputs. 
But to do RLHF, you have to introduce a feedback loop within your ML system that lets you evaluate the generated results based on human feedback, which are even further used to fine tune your LLMs. \ud835\udff0 . \ud835\uddda\ud835\ude02\ud835\uddee\ud835\uddff\ud835\uddf1\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddf9\ud835\ude00 to create safe systems, you must protect your systems against harmful or violent inputs and outputs. Also, when designing your prompt templates, you must consider hallucinations and prompt hacking. \ud835\udff1 . \ud835\udde0\ud835\uddfc\ud835\uddfb\ud835\uddf6\ud835\ude01\ud835\uddfc\ud835\uddff\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddee\ud835\uddfb\ud835\uddee\ud835\uddf9\ud835\ude06\ud835\ude07\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\ude01\ud835\ude00 most ML platforms e.g., Comet ML introduced specialized logging tools to debug and monitor your LLMs to help you find better prompt templates and protect against hallucination and hacking. What is LLMOps? LLMOps vs. MLOps Image by the Author To conclude... LLMOps isn t anything new for those familiar with MLOps and Deep Learning. For example, training deep learning models on clusters of GPUs or fine tuning them isn t new, but now it is more important than ever to master these skills as models get bigger and bigger. But it indeed introduced novel techniques to fine tune models e.g., QLora , to merge the fields of RL and DL, and a plethora of tools around prompt manipulation storing, such as vector DBs e.g., Qdrant prompt chaining e.g., LangChain prompt logging analytics e.g., Comet LLMOps . But with the new multi modal large models trend, these tips tricks will converge towards all deep learning models e.g., computer vision , and soon, we will change the name of LLMOps to DLOps or LMOps. What do you think? Is the term of LLMOps going to stick around? 3. Unwrapping step by step the 3 pipeline design of a financial assistant powered by LLMs Here is a step by step guide on designing the architecture of a financial assistant powered by LLMs, vector DBs and MLOps. The 3 pipeline design, also known as the FTI architecture, makes things simple \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 We want to build a streaming pipeline that listens to real time financial news, embeds the news, and loads everything in a vector DB. The goal is to add up to date news to the user s questions using RAG to avoid retraining. 1 . We listen 24 7 to financial news from Alpaca through a WebSocket wrapped over a Bytewax connector 2 . Once any financial news is received, these are passed to the Bytewax flow that extracts cleans the necessary information from the news HTML document chunks the text based on the LLM s max context window embeds all the chunks using the all MiniLM L6 v2 encoder only model from sentence transformers inserts all the embeddings along their metadata to Qdrant 3 . The streaming pipeline is deployed to an EC2 machine that runs multiple Bytewax processes. It can be deployed to K8s into a multi node setup to scale up. \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 We want to fine tune a pretrained LLM to specialize the model to answer financial based questions. 1 . 
Manually fill in ~100 financial questions.
2. Use RAG to enrich the questions with financial news from the Qdrant vector DB.
3. Use a powerful model such as GPT-4 to answer them, or hire an expert if you have more time and resources.
4. Load Falcon from HuggingFace using QLoRA to fit it on a single GPU.
5. Preprocess the Q&A dataset into prompts.
6. Fine-tune the LLM and log all the artifacts to Comet's experiment tracker (loss, model weights, etc.).
7. For every epoch, run the LLM on your test set, log the prompts to Comet's prompt logging feature, and compute the metrics.
8. Send the best LoRA weights to the model registry as the next production candidate.
9. Deploy steps 4-8 to Beam to run the training on an A10G or A100 Nvidia GPU. (A minimal QLoRA loading sketch is attached at the end of this post.)
Inference Pipeline
We want to hook the financial news stored in the Qdrant vector DB and the fine-tuned Falcon model into a single entity exposed under a RESTful API. Steps 1-7 are chained together using LangChain.
1. Use the all-MiniLM-L6-v2 encoder-only model to embed the user's question.
2. Using the question embedding, query the Qdrant vector DB to find the top 3 related financial news items.
3. Attach the text stored as metadata along with the embeddings of the news to the prompt (aka RAG).
4. Download Falcon's pretrained weights from HF and the LoRA fine-tuned weights from Comet's model registry.
5. Load the LLM and pass it the prompt (the user's question, financial news, history).
6. Store the conversation in LangChain's memory.
7. Deploy steps 1-7 under a RESTful API using Beam.
3-pipeline architecture (image by the author)
Check out the Hands-on LLMs course to see this in action.
That's it for today! See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 3 of the Hands-on LLMs series. Paul
Whenever you're ready, here is how I can help you:
1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that walks you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code and 2.5 hours of reading and video materials on Medium.
2. Machine Learning MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
3. Machine Learning MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).
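To ground step 4 of the training pipeline above, here is a minimal sketch of loading a base model in 4-bit with bitsandbytes and attaching LoRA adapters with peft. The model name, target modules, and hyperparameters are illustrative assumptions, not the course's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model_id = "tiiuae/falcon-7b"  # assumption: any causal LM is loaded the same way

# 4-bit quantization config (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters: only these low-rank matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # assumption: Falcon-style attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```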
Please turn on JavaScript or unblock scripts en", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-unwrapping-the-3-pipeline-design?r=1ttoeh" }, { "id": "21c92489-204c-4791-b4dd-f0c2487f7e82", "content": "DML How to design an LLM system for a financial assistant using the 3 pipeline design Lesson 1 The Hands on LLMs Series SubscribeSign in Share this post DML How to design an LLM system for a financial assistant using the 3 pipeline design decodingml.substack.com Copy link Facebook Email Note Other DML How to design an LLM system for a financial assistant using the 3 pipeline design Lesson 1 The Hands on LLMs Series Paul Iusztin Nov 02, 2023 5 Share this post DML How to design an LLM system for a financial assistant using the 3 pipeline design decodingml.substack.com Copy link Facebook Email Note Other Share _Hello there, I am Paul Iusztin _ _Within this newsletter, I will help you decode complex topics about ML MLOps one week at a time _ As promised, starting this week, we will begin the series based on the Hands on LLMs FREE course . Note that this is not the course itself. It is an overview for all the busy people who will focus on the key aspects. The entire course will soon be available on GitHub. Lesson 1 The Hands on LLMs Series Table of Contents 1. What is the 3 pipeline design 2. How to apply the 3 pipeline design in architecting a financial assistant powered by LLMs 3. The tech stack used to build an end to end LLM system for a financial assistant As the Hands on LLMs course is still a \ud835\ude04\ud835\uddfc\ud835\uddff\ud835\uddf8 \ud835\uddf6\ud835\uddfb \ud835\uddfd\ud835\uddff\ud835\uddfc\ud835\uddf4\ud835\uddff\ud835\uddf2\ud835\ude00\ud835\ude00, we want to \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude02\ud835\uddfd\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 on our progress Thus, we opened up the \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\uddf0\ud835\ude02\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\ude01\ud835\uddee\ud835\uddef under the course s GitHub Repository, where we will \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude02\ud835\uddfd\ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 with everything is happening. Also, if you have any \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\ude00, \ud835\ude00\ud835\ude02\ud835\uddf4\ud835\uddf4\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, \ud835\uddfe\ud835\ude02\ud835\uddf2\ud835\ude00\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00 or want to \ud835\uddf0\ud835\uddf5\ud835\uddee\ud835\ude01, we encourage you to \ud835\uddf0\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\uddee \ud835\uddfb\ud835\uddf2\ud835\ude04 \ud835\uddf1\ud835\uddf6\ud835\ude00\ud835\uddf0\ud835\ude02\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\uddfc\ud835\uddfb . We want the course to fill your real needs Hence, if your suggestion fits well with our hands on course direction, we will consider implementing it. Hands on LLMs course discussions section Image by the Author . Check it out and leave a if you like what you see Hands on LLMs course 1. 
What is the 3 pipeline design We all know how \ud835\uddfa\ud835\uddf2\ud835\ude00\ud835\ude00\ud835\ude06 \ud835\udde0\ud835\udddf \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa\ud835\ude00 can get. That is where the \ud835\udfef \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddf8\ud835\uddf6\ud835\uddf0\ud835\uddf8\ud835\ude00 \ud835\uddf6\ud835\uddfb. The 3 pipeline design is a way to bring structure modularity to your ML system and improve your MLOps processes. This is how \ud835\udde3\ud835\uddff\ud835\uddfc\ud835\uddef\ud835\uddf9\ud835\uddf2\ud835\uddfa Despite advances in MLOps tooling, transitioning from prototype to production remains challenging. In 2022, only 54 of the models get into production. Auch. So what happens? Sometimes the model is not mature enough, sometimes there are some security risks, but most of the time... ...the architecture of the ML system is built with research in mind, or the ML system becomes a massive monolith that is extremely hard to refactor from offline to online. So, good processes and a well defined architecture are as crucial as good tools and models. \ud835\udde6\ud835\uddfc\ud835\uddf9\ud835\ude02\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\ude1b\ud835\ude29\ud835\ude26 3 \ud835\ude31\ud835\ude2a\ud835\ude31\ud835\ude26\ud835\ude2d\ud835\ude2a\ud835\ude2f\ud835\ude26 \ud835\ude22\ud835\ude33\ud835\ude24\ud835\ude29\ud835\ude2a\ud835\ude35\ud835\ude26\ud835\ude24\ud835\ude35\ud835\ude36\ud835\ude33\ud835\ude26. First, let s understand what the 3 pipeline design is. It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components 1 . the feature pipeline 2 . the training pipeline 3 . the inference pipeline ...also known as the Feature Training Inference FTI architecture. . \ud835\udfed. The feature pipeline transforms your data into features labels, which are stored and versioned in a feature store. \ud835\udfee. The training pipeline ingests a specific version of the features labels from the feature store and outputs the trained models, which are stored and versioned inside a model registry. \ud835\udfef. The inference pipeline takes a given version of the features and trained models and outputs the predictions to a client. . This is why the 3 pipeline design is so beautiful it is intuitive it brings structure, as on a higher level, all ML systems can be reduced to these 3 components it defines a transparent interface between the 3 components, making it easier for multiple teams to collaborate the ML system has been built with modularity in mind since the beginning the 3 components can easily be divided between multiple teams if necessary every component can use the best stack of technologies available for the job every component can be deployed, scaled, and monitored independently the feature pipeline can easily be either batch, streaming or both But the most important benefit is that... ...by following this pattern, you know 100 that your ML model will move out of your Notebooks into production. What is the 3 pipeline design Why should you adopt it in your ML systems? Image by the Author . What do you think about the 3 pipeline architecture? Have you used it? 
If you want to know more about the 3 pipeline design, I recommend this awesome article from Hopsworks From MLOps to ML Systems with Feature Training Inference Pipelines 2. How to apply the 3 pipeline design in architecting a financial assistant powered by LLMs Building ML systems is hard, right? Wrong. Here is how the \ud835\udfef \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf6\ud835\uddf4\ud835\uddfb can make \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 the \ud835\udde0\ud835\udddf \ud835\ude00\ud835\ude06\ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfa for a \ud835\uddf3\ud835\uddf6\ud835\uddfb\ud835\uddee\ud835\uddfb\ud835\uddf0\ud835\uddf6\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\ude00\ud835\ude00\ud835\uddf6\ud835\ude00\ud835\ude01\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06 . I already covered the concepts of the 3 pipeline design in my previous post, but here is a quick recap It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components 1 . the feature pipeline 2 . the training pipeline 3 . the inference pipeline ...also known as the Feature Training Inference FTI architecture. . Now, let s see how you can use the FTI architecture to build a financial assistant powered by LLMs \ud835\udfed. \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 The feature pipeline is designed as a streaming pipeline that extracts real time financial news from Alpaca and cleans and chunks the news documents embeds the chunks using an encoder only LM loads the embeddings their metadata in a vector DB deploys it to AWS In this architecture, the vector DB acts as the feature store. The vector DB will stay in sync with the latest news to attach real time context to the LLM using RAG. \ud835\udfee. \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 The training pipeline is split into 2 main steps \ud835\udde4 \ud835\uddd4 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee\ud835\ude00\ud835\uddf2\ud835\ude01 \ud835\ude00\ud835\uddf2\ud835\uddfa\ud835\uddf6 \ud835\uddee\ud835\ude02\ud835\ude01\ud835\uddfc\ud835\uddfa\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\uddf1 \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd It takes the vector DB feature store and a set of predefined questions manually written as input. 
After, you use RAG to inject the context along the predefined questions use a large powerful model, such as GPT 4, to generate the answers save the generated dataset under a new version \ud835\uddd9\ud835\uddf6\ud835\uddfb\ud835\uddf2 \ud835\ude01\ud835\ude02\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\ude00\ud835\ude01\ud835\uddf2\ud835\uddfd download a pre trained LLM from Huggingface load the LLM using QLoRA preprocesses the generated Q A dataset into a format expected by the LLM fine tune the LLM push the best QLoRA weights model to a model registry deploy it using a serverless solution as a continuous training pipeline \ud835\udfef. \ud835\udddc\ud835\uddfb\ud835\uddf3\ud835\uddf2\ud835\uddff\ud835\uddf2\ud835\uddfb\ud835\uddf0\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 The inference pipeline is the financial assistant that the clients actively use. It uses the vector DB feature store and QLoRA weights model from the model registry in the following way download the pre trained LLM from Huggingface load the LLM using the pretrained QLoRA weights connect the LLM and vector DB into a chain use RAG to add relevant financial news from the vector DB deploy it using a serverless solution under a RESTful API The architecture of a financial assistant using the 3 pipeline design Image by the Author . Here are the main benefits of using the FTI architecture it defines a transparent interface between the 3 modules every component can use different technologies to implement and deploy the pipeline the 3 pipelines are loosely coupled through the feature store model registry every component can be scaled independently See this architecture in action in my \ud835\udddb\ud835\uddee\ud835\uddfb\ud835\uddf1\ud835\ude00 \ud835\uddfc\ud835\uddfb \ud835\udddf\ud835\udddf\ud835\udde0\ud835\ude00 FREE course. 3. The tech stack used to build an end to end LLM system for a financial assistant The tools are divided based on the \ud835\udfef \ud835\uddfd\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 aka \ud835\uddd9\ud835\udde7\ud835\udddc \ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf6\ud835\ude01\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\uddd9\ud835\uddf2\ud835\uddee\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 What do you need to build a streaming pipeline? streaming processing framework Bytewax brings the speed of Rust into our beloved Python ecosystem parse, clean, and chunk documents unstructured validate document structure pydantic encoder only language model HuggingFace sentence transformers, PyTorch vector DB Qdrant deploy Docker, AWS CI CD GitHub Actions \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\udde3\ud835\uddf6\ud835\uddfd\ud835\uddf2\ud835\uddf9\ud835\uddf6\ud835\uddfb\ud835\uddf2 What do you need to build a fine tuning pipeline? 
- pretrained LLM: HuggingFace Hub
- parameter-efficient fine-tuning method: peft (LoRA)
- quantization: bitsandbytes (QLoRA)
- training: HuggingFace transformers, PyTorch, trl
- distributed training: accelerate
- experiment tracking: Comet ML
- model registry: Comet ML
- prompt monitoring: Comet ML
- continuous training, serverless deployment: Beam
Inference Pipeline
What do you need to build a financial assistant?
- framework for developing applications powered by language models: LangChain
- model registry: Comet ML
- inference: HuggingFace transformers, PyTorch, peft (to load the LoRA weights)
- quantization: bitsandbytes
- distributed inference: accelerate
- encoder-only language model: HuggingFace sentence-transformers
- vector DB: Qdrant
- prompt monitoring: Comet ML
- RESTful API, serverless service: Beam
As you can see, some tools overlap between the FTI pipelines, but not all. This is the beauty of the 3-pipeline design: every component represents a different entity for which you can pick the best stack to build, deploy, and monitor. (You can go wild and use TensorFlow in one of the components if you want your colleagues to hate you.)
See the tools in action in my Hands-on LLMs FREE course. A minimal sketch of the inference-side LoRA loading is attached at the end of this post.
That's it for today! See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend, and see you next week for Lesson 2 of the Hands-on LLMs series. Paul
Whenever you're ready, here is how I can help you:
1. The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that walks you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code and 2.5 hours of reading and video materials on Medium.
2. Machine Learning MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
3. Machine Learning MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).
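As referenced above, here is a minimal sketch of the inference-side loading: pull the pretrained base model from the HuggingFace Hub and attach the fine-tuned LoRA weights with peft. The adapter path is a placeholder; in the setup described here it would be downloaded from the model registry.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "tiiuae/falcon-7b"                 # assumption: same base model used for training
adapter_path = "path/to/downloaded-lora-adapter"   # placeholder: fetched from the model registry

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Wrap the frozen base model with the fine-tuned LoRA adapters.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

prompt = "You are a financial assistant. Question: What moved the markets today?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```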
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-design-an-llm-system-for?r=1ttoeh" }, { "id": "007833f1-fb36-470f-adad-78143f817fee", "content": "DML: Synced Vector DBs; A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
Paul Iusztin, Oct 26, 2023
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML and MLOps one week at a time.
This week's ML and MLOps topics:
1. Synced Vector DBs: A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
2. Story: If anyone told you that ML or MLOps is easy, they were right. A simple trick I learned the hard way.
This week's newsletter is shorter than usual, but I have some great news: next week, within the Decoding ML newsletter, I will start a step-by-step series based on the Hands-On LLMs course I am developing. By the end of this series, you will know how to design, build, and deploy a financial assistant powered by LLMs, all of this for FREE inside the Decoding ML newsletter. Check out the Hands-On LLMs course GitHub page and give it a star to stay updated with our progress.
1. Synced Vector DBs: A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
To successfully use RAG in your LLM applications, your vector DB must constantly be updated with the latest data. Here is how you can implement a streaming pipeline to keep your vector DB in sync with your datasets.
RAG is a popular strategy when building LLMs to add context to your prompt about your private datasets. Leveraging your domain data using RAG provides 2 significant benefits: you don't need to fine-tune your model as often (or at all), and you avoid hallucinations.
On the bot side, to implement RAG, you have to:
1. Embed the user's question using an embedding model (e.g., BERT) and use the embedding to query your vector DB, finding the most similar vectors with a distance function (e.g., cosine similarity).
2. Get the top N closest vectors and their metadata.
3. Attach the extracted top N vectors' metadata and the chat history to the input prompt.
4. Pass the prompt to the LLM.
5. Insert the user question and the assistant's answer into the chat history.
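Here is a minimal sketch of that bot-side flow with sentence-transformers and the qdrant-client, leaving the final LLM call as a stub. The collection name, payload fields, and the call_llm helper are assumptions, and newer qdrant-client releases expose query_points instead of search.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")  # assumption: local Qdrant instance

chat_history: list[tuple[str, str]] = []


def answer(question: str, top_n: int = 3) -> str:
    # 1. Embed the user's question and query the vector DB for the most similar vectors.
    query_vector = embedder.encode(question).tolist()
    hits = client.search(
        collection_name="financial_news",  # assumption: collection populated by the pipeline
        query_vector=query_vector,
        limit=top_n,
    )

    # 2-3. Attach the retrieved metadata and the chat history to the input prompt.
    context = "\n".join(hit.payload.get("text", "") for hit in hits)
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history)
    prompt = (
        "Answer only based on the context below.\n"
        f"Context:\n{context}\n\nHistory:\n{history}\n\nQuestion: {question}\nAnswer:"
    )

    # 4. Pass the prompt to the LLM (placeholder for your model of choice).
    reply = call_llm(prompt)

    # 5. Keep the conversation history for the next turn.
    chat_history.append((question, reply))
    return reply
```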
But the question is, \ud835\uddf5\ud835\uddfc\ud835\ude04 do you \ud835\uddf8\ud835\uddf2\ud835\uddf2\ud835\uddfd \ud835\ude06\ud835\uddfc\ud835\ude02\ud835\uddff \ud835\ude03\ud835\uddf2\ud835\uddf0\ud835\ude01\ud835\uddfc\ud835\uddff \ud835\uddd7\ud835\uddd5 \ud835\ude02\ud835\uddfd \ud835\ude01\ud835\uddfc \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddf2 \ud835\ude04\ud835\uddf6\ud835\ude01\ud835\uddf5 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf9\ud835\uddee\ud835\ude01\ud835\uddf2\ud835\ude00\ud835\ude01 \ud835\uddf1\ud835\uddee\ud835\ude01\ud835\uddee? You need a real time streaming pipeline. How do you implement it? You need 2 components A streaming processing framework. For example, Bytewax is built in Rust for efficiency and exposes a Python interface for ease of use you don t need Java to implement real time pipelines anymore. Bytewax A vector DB. For example, Qdrant provides a rich set of features and a seamless experience. Qdrant . Here is an example of how to implement a streaming pipeline for financial news \ud835\udfed. Financial news data source e.g., Alpaca To populate your vector DB, you need a historical API e.g., RESTful API to add data to your vector DB in batch mode between a desired start_date, end_date range. You can tweak the number of workers to parallelize this step as much as possible. You run this once in the beginning. You need the data exposed under a web socket to ingest news in real time. So, you ll be able to listen to the news and ingest it in your vector DB as soon as they are available. Listens 24 7 for financial news. \ud835\udfee. Build the streaming pipeline using Bytewax Implement 2 input connectors for the 2 different types of APIs RESTful API web socket. The rest of the steps can be shared between both connectors Clean financial news documents. Chunk the documents. Embed the documents e.g., using Bert . Insert the embedded documents their metadata to the vector DB e.g., Qdrant . \ud835\udfef \ud835\udff3. When the users ask a financial question, you can leverage RAG with an up to date vector DB to search for the latest news in the industry. Synced Vector DBs A Guide to Streaming Pipelines for Real Time Rag in Your LLM Applications Image by the Author Story. If anyone told you that ML or MLOps is easy, they were right. A simple trick I learned the hard way. If anyone told you that \ud835\udde0\ud835\udddf or \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 is \ud835\uddf2\ud835\uddee\ud835\ude00\ud835\ude06, they were \ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01. Here is a simple trick that I learned the hard way If you are in this domain, you already know that everything changes fast a new tool every month a new model every week a new project every day You know what I did? I stopped caring about all these changes and switched my attention to the real gold. Which is \ud835\uddd9\ud835\uddfc\ud835\uddf0\ud835\ude02\ud835\ude00 \ud835\uddfc\ud835\uddfb \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddf3\ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddee\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01\ud835\uddee\ud835\uddf9\ud835\ude00. . Let me explain When you constantly chase the latest models aka FOMO , you will only have a shallow understanding of that new information except if you are a genius or already deep into that niche . But the joke s on you. In reality, most of what you think you need to know, you don t. So you won t use what you learned and forget most of it after 1 2 months. What a waste of time, right? . 
But... If you master the fundamentals of the topic, you want to learn. For example, for deep learning, you have to know how models are built how they are trained groundbreaking architectures Resnet, UNet, Transformers, etc. parallel training deploying a model, etc. ...when in need e.g., you just moved on to a new project , you can easily pick up the latest research. Thus, after you have laid the foundation, it is straightforward to learn SoTA approaches when needed if needed . Most importantly, what you learn will stick with you, and you will have the flexibility to jump from one project to another quickly. . I am also guilty. I used to FOMO into all kinds of topics until I was honest with myself and admitted I am no Leonardo Da Vinci. But here is what I did and worked well building projects replicating the implementations of famous papers teaching the subject I want to learn ... and most importantly, take my time to relax and internalize the information. To conclude learn ahead only the fundamentals learn the latest trend only when needed Image by the Author That s it for today See you next Thursday at 9 00 a.m. CET. Have a fantastic weekend! and see you next week for the beginning of the Hands On LLMs series Paul Whenever you re ready, here is how I can help you 1. The Full Stack 7 Steps MLOps Framework a 7 lesson FREE course that will walk you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code 2.5 hours of reading video materials on Medium. 2. Machine Learning MLOps Blog in depth topics about designing and productionizing ML systems using MLOps. 3. Machine Learning MLOps Hub a place where all my work is aggregated in one place courses, articles, webinars, podcasts, etc. . 4 Share this post DML Synced Vector DBs A Guide to Streaming Pipelines for Real Time RAG in Your LLM Applications decodingml.substack.com Copy link Facebook Email Note Other Share PreviousNext Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Paul Iusztin Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts en", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-synced-vector-dbs-a-guide-to?r=1ttoeh" }, { "id": "e9353901-9ba9-483c-8c59-2de649c9743a", "content": "DML What is the difference between your ML development and continuous training environments? 3 techniques you must know to evaluate your LLMs quickly. Experimentation vs. continuous training environments. SubscribeSign in Share this post DML What is the difference between your ML development and continuous training environments? decodingml.substack.com Copy link Facebook Email Note Other DML What is the difference between your ML development and continuous training environments? 3 techniques you must know to evaluate your LLMs quickly. Experimentation vs. continuous training environments. Paul Iusztin Oct 19, 2023 3 Share this post DML What is the difference between your ML development and continuous training environments? 
decodingml.substack.com Copy link Facebook Email Note Other Share _Hello there, I am Paul Iusztin _ _Within this newsletter, I will help you decode complex topics about ML MLOps one week at a time _ This week s ML MLOps topics 1. 3 techniques you must know to evaluate your LLMs quickly 2. What is the difference between your ML development and continuous training environments? Story Job roles tell you there is just one type of MLE, but there are actually 3. But first, I want to let you know that after 1 year of making content, I finally decided to share my content on Twitter X . I took this decision because everybody has a different way of reading and interacting with their socials. ...and I want everyone to enjoy my content on their favorite platform. I even bought that stu blue ticker to see that I am serious about this So... If you like my content and you are a Twitter X person follow at \ud835\udc22\ud835\udc2e\ud835\udc2c\ud835\udc33\ud835\udc2d\ud835\udc22\ud835\udc27\ud835\udc29\ud835\udc1a\ud835\udc2e\ud835\udc25 1. 3 techniques you must know to evaluate your LLMs quickly Manually testing the output of your LLMs is a tedious and painful process you need to automate it. In generative AI, most of the time, you cannot leverage standard metrics. Thus, the real question is, how do you evaluate the outputs of an LLM? Depending on your problem, here is what you can do \ud835\udfed. \ud835\udde6\ud835\ude01\ud835\uddff\ud835\ude02\ud835\uddf0\ud835\ude01\ud835\ude02\ud835\uddff\ud835\uddf2\ud835\uddf1 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff\ud835\ude00 \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf8\ud835\uddfb\ud835\uddfc\ud835\ude04 \ud835\uddf2\ud835\ude05\ud835\uddee\ud835\uddf0\ud835\ude01\ud835\uddf9\ud835\ude06 \ud835\ude04\ud835\uddf5\ud835\uddee\ud835\ude01 \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\ude04\ud835\uddee\ud835\uddfb\ud835\ude01 \ud835\ude01\ud835\uddfc \ud835\uddf4\ud835\uddf2\ud835\ude01 Even if you use an LLM to generate text, you can ask it to generate a response in a structured format e.g., JSON that can be parsed. You know exactly what you want e.g., a list of products extracted from the user s question . Thus, you can easily compare the generated and ideal answers using classic approaches. For example, when extracting the list of products from the user s input, you can do the following check if the LLM outputs a valid JSON structure use a classic method to compare the generated and real answers \ud835\udfee. \ud835\udde1\ud835\uddfc \ud835\uddff\ud835\uddf6\ud835\uddf4\ud835\uddf5\ud835\ude01 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\uddf2.\ud835\uddf4., \ud835\uddf4\ud835\uddf2\ud835\uddfb\ud835\uddf2\ud835\uddff\ud835\uddee\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddf1\ud835\uddf2\ud835\ude00\ud835\uddf0\ud835\uddff\ud835\uddf6\ud835\uddfd\ud835\ude01\ud835\uddf6\ud835\uddfc\ud835\uddfb\ud835\ude00, \ud835\ude00\ud835\ude02\ud835\uddfa\ud835\uddfa\ud835\uddee\ud835\uddff\ud835\uddf6\ud835\uddf2\ud835\ude00, \ud835\uddf2\ud835\ude01\ud835\uddf0. When generating sentences, the LLM can use different styles, words, etc. Thus, traditional metrics e.g., BLUE score are too rigid to be useful. You can leverage another LLM to test the output of our initial LLM. The trick is in what questions to ask. When testing LLMs, you won t have a big testing split size as you are used to. A set of 10 100 tricky examples usually do the job it won t be costly . 
Here, we have another 2 sub scenarios \ud835\udfee.\ud835\udfed \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb \ud835\ude01 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc\ud835\uddfb \ud835\ude01 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5 You don t have access to an expert to write an ideal answer for a given question to compare it to. Based on the initial prompt and generated answer, you can compile a set of questions and pass them to an LLM. Usually, these are Y N questions that you can easily quantify and check the validity of the generated answer. This is known as Rubric Evaluation For example Is there any disagreement between the response and the context? Y or N Count how many questions the user asked. output a number ... This strategy is intuitive, as you can ask the LLM any question you are interested in as long it can output a quantifiable answer Y N or a number . \ud835\udfee.\ud835\udfee. \ud835\uddea\ud835\uddf5\ud835\uddf2\ud835\uddfb \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf1\ud835\uddfc \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddee\ud835\uddfb \ud835\uddf6\ud835\uddf1\ud835\uddf2\ud835\uddee\ud835\uddf9 \ud835\uddee\ud835\uddfb\ud835\ude00\ud835\ude04\ud835\uddf2\ud835\uddff \ud835\ude01\ud835\uddfc \ud835\uddf0\ud835\uddfc\ud835\uddfa\ud835\uddfd\ud835\uddee\ud835\uddff\ud835\uddf2 \ud835\ude01\ud835\uddf5\ud835\uddf2 \ud835\uddff\ud835\uddf2\ud835\ude00\ud835\uddfd\ud835\uddfc\ud835\uddfb\ud835\ude00\ud835\uddf2 \ud835\ude01\ud835\uddfc \ud835\ude06\ud835\uddfc\ud835\ude02 \ud835\uddf5\ud835\uddee\ud835\ude03\ud835\uddf2 \ud835\uddf4\ud835\uddff\ud835\uddfc\ud835\ude02\ud835\uddfb\ud835\uddf1 \ud835\ude01\ud835\uddff\ud835\ude02\ud835\ude01\ud835\uddf5 When you can access an answer manually created by a group of experts, things are easier. You will use an LLM to compare the generated and ideal answers based on semantics, not structure. For example A The submitted answer is a subset of the expert answer and entirely consistent. ... E The answers differ, but these differences don t matter. 3 techniques you must know to evaluate your LLMs quickly Image by the Author . 2. What is the difference between your ML development and continuous training environments? 
They might do the same thing, but their design is entirely different \ud835\udde0\ud835\udddf \ud835\uddd7\ud835\uddf2\ud835\ude03\ud835\uddf2\ud835\uddf9\ud835\uddfc\ud835\uddfd\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 At this point, your main goal is to ingest the raw and preprocessed data through versioned artifacts or a feature store , analyze it generate as many experiments as possible to find the best model hyperparameters augmentations Based on your business requirements, you must maximize some specific metrics, find the best latency accuracy trade offs, etc. You will use an experiment tracker to compare all these experiments. After you settle on the best one, the output of your ML development environment will be a new version of the code a new version of the configuration artifact Here is where the research happens. Thus, you need flexibility. That is why we decouple it from the rest of the ML systems through artifacts data, config, code artifacts . \ud835\uddd6\ud835\uddfc\ud835\uddfb\ud835\ude01\ud835\uddf6\ud835\uddfb\ud835\ude02\ud835\uddfc\ud835\ude02\ud835\ude00 \ud835\udde7\ud835\uddff\ud835\uddee\ud835\uddf6\ud835\uddfb\ud835\uddf6\ud835\uddfb\ud835\uddf4 \ud835\uddd8\ud835\uddfb\ud835\ude03\ud835\uddf6\ud835\uddff\ud835\uddfc\ud835\uddfb\ud835\uddfa\ud835\uddf2\ud835\uddfb\ud835\ude01 Here is where you want to take the data, code, and config artifacts and train the model on all the required data output a staging versioned model artifact test the staging model artifact if the test passes, label it as the new production model artifact deploy it to the inference services A common strategy is to build a CI CD pipeline that e.g., using GitHub Actions builds a docker image from the code artifact e.g., triggered manually or when a new artifact version is created start the training pipeline inside the docker container that pulls the feature and config artifacts and outputs the staging model artifact manually look over the training report If everything went fine, manually trigger the testing pipeline manually look over the testing report if everything worked fine e.g., the model is better than the previous one , manually trigger the CD pipeline that deploys the new model to your inference services Note how the model registry quickly helps you to decouple all the components. Also, because training and testing metrics are not always black white, it is tough to 100 automate the CI CD pipeline. Thus, you need a human in the loop when deploying ML models. . What is the difference between your ML development and continuous training environments Image by the Author To conclude... The ML development environment is where you do your research to find better models \ud835\ude2a\ud835\ude2f\ud835\ude31\ud835\ude36\ud835\ude35 data artifact \ud835\ude30\ud835\ude36\ud835\ude35\ud835\ude31\ud835\ude36\ud835\ude35 code config artifacts The continuous training environment is used to train test the production model at scale \ud835\ude2a\ud835\ude2f\ud835\ude31\ud835\ude36\ud835\ude35 data, code, config artifacts \ud835\ude30\ud835\ude36\ud835\ude35\ud835\ude31\ud835\ude36\ud835\ude35 model artifact This is not a fixed solution, as ML systems are still an open question. But if you want to see this strategy in action Check out my The Full Stack 7 Steps MLOps Framework FREE Course. 
Story Job roles tell you there is just one type of MLE, but there are actually 3 Here they are These are the 3 ML engineering personas I found while working with different teams in the industry \ud835\udfed. \ud835\udde5\ud835\uddf2\ud835\ude00\ud835\uddf2\ud835\uddee\ud835\uddff\ud835\uddf0\ud835\uddf5\ud835\uddf2\ud835\uddff\ud835\ude00 \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\uddf0\ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff They like to stay in touch with the latest papers, understand the architecture of models, optimize them, run experiments, etc. They are great at picking the best models but not that great at writing clean code and scaling the solution. \ud835\udfee. \ud835\udde6\ud835\uddea\ud835\uddd8 \ud835\ude02\ud835\uddfb\ud835\uddf1\ud835\uddf2\ud835\uddff\ud835\uddf0\ud835\uddfc\ud835\ude03\ud835\uddf2\ud835\uddff They pretend they read papers but don t maybe only when they have to . They are more concerned with writing modular code and data quality than the latest hot models. Usually, these are the data centric people. They are great at writing clean code processing data at scale but lack deep mathematical skills to develop complex DL solutions. \ud835\udfef. \ud835\udde0\ud835\udddf\ud835\udde2\ud835\uddfd\ud835\ude00 \ud835\uddf3\ud835\uddff\ud835\uddf2\ud835\uddee\ud835\uddf8\ud835\ude00 They ultimately don t care about the latest research hot models. They are more into the latest MLOps tools and building ML systems. They love to automate everything and use as many tools as possible. Great at scaling the solution and building ML pipelines, but not great at running experiments tweaking ML models. They love to treat the ML model as a black box. Image by the Author. I started as 1. , until I realized I hated it now I am a mix of \ud835\udfed. 20 \ud835\udfee. 40 \ud835\udfef. 40 But that doesn t mean one is better these types are complementary. A great ML team should have at least one of each persona. What do you think? Did I get it right? That s it for today See you next Thursday at 9 00 a.m. CET. Have a fantastic weekend! Paul Whenever you re ready, here is how I can help you 1. The Full Stack 7 Steps MLOps Framework a 7 lesson FREE course that will walk you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code 2.5 hours of reading video materials on Medium. 2. Machine Learning MLOps Blog in depth topics about designing and productionizing ML systems using MLOps. 3. Machine Learning MLOps Hub a place where all my work is aggregated in one place courses, articles, webinars, podcasts, etc. . 3 Share this post DML What is the difference between your ML development and continuous training environments? decodingml.substack.com Copy link Facebook Email Note Other Share PreviousNext Discussion about this post Comments Restacks Top Latest Discussions No posts Ready for more? Subscribe 2024 Paul Iusztin Privacy Terms Collection notice Start WritingGet the app Substack is the home for great culture Share Copy link Facebook Email Note Other This site requires JavaScript to run correctly. 
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-what-is-the-difference-between?r=1ttoeh" }, { "id": "aa199018-9dcc-4768-9e99-1b2356af2c21", "content": "DML: 7 steps to build a production-ready financial assistant using LLMs. How to fine-tune any LLM at scale in under 5 minutes. Paul Iusztin, Oct 12, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week's ML & MLOps topics: 1. Writing your own ML models is history: how to fine-tune any LLM at scale in under 5 minutes. 2. 7 steps to chain your prompts to build a production-ready financial assistant using LLMs. Extra: 3 key resources on how to monitor your ML models.
1. Writing your own ML models is history: how to fine-tune any LLM at scale in under 5 minutes.
Writing your own ML models is history. The true value is in your data, how you prepare it, and your compute power. To demonstrate my statement, here is how you can write a Python script to train your LLM at scale in under 5 minutes:
1. Load your data in JSON format and convert it into a Hugging Face Dataset.
2. Use Hugging Face to load the LLM and pass it to the SFTTrainer, along with the tokenizer and the training and evaluation datasets.
3. Wrap your training script with a serverless solution, such as Beam, which quickly lets you access a cluster of GPUs to train large models.
As you can see, the secret ingredients are not the LLM but: the amount of data, the quality of the data, how you process the data, your compute power, and the ability to scale the system. (3 steps to write a Python script to train your LLMs at scale. Image by the Author)
My advice: if you don't plan to become an ML researcher, shift your focus from the latest models to your data and infrastructure.
Note: Integrating serverless services, such as Beam, makes the deployment of your training pipeline fast and seamless, leaving you to focus only on the last piece of the puzzle: your data. Check out Beam's docs to find out more.
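As a rough illustration of those 3 steps, this is roughly what the core of such a script could look like with the Hugging Face datasets, transformers, and trl libraries. Treat it as a sketch: the model name, file names, and hyperparameters are placeholders, and the exact SFTTrainer keyword arguments shift between trl versions (newer releases move several of them into SFTConfig and rename tokenizer to processing_class).

# Sketch of steps 1-2: JSON file -> Hugging Face Dataset -> SFTTrainer.
# Model name, file paths and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

dataset = load_dataset('json', data_files={'train': 'train.json', 'eval': 'eval.json'})

model_name = 'facebook/opt-350m'  # placeholder; swap in Falcon, Llama 2, etc.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,                      # called 'processing_class' in newer trl versions
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval'],
    dataset_text_field='text',                # the dataset column holding prompt + answer
    max_seq_length=512,
    args=TrainingArguments(output_dir='out', num_train_epochs=1, per_device_train_batch_size=2),
)
trainer.train()
# Step 3: wrap this script with your serverless runtime of choice (e.g., Beam)
# so it runs on a remote GPU cluster instead of your laptop.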
2. 7 steps to chain your prompts to build a production-ready financial assistant using LLMs.
When building LLM applications, you frequently have to divide your application into multiple steps/prompts, which is known as chaining prompts. Here are 7 standard steps when building a financial assistant (or any other assistant) using LLMs:
Step 1: Check if the user's question is safe using OpenAI's Moderation API. If the user's query is safe, move to Step 2.
Step 2: Query your proprietary data (e.g., financial news) to enrich the prompt with fresh data and additional context. To do so, you have to: use an embedding model to embed the user's input; use the embedding to query your proprietary data stored in a vector DB. Note: You must use the same embedding model for both the data stored in the vector DB and the user's question used to query it.
Step 3: Build the prompt using: a predefined template; the user's question; the extracted financial news as context; your conversation history as context.
Step 4: Call the LLM.
Step 5: Check if the assistant's answer is safe using OpenAI's Moderation API. If the assistant's answer is safe, move to Step 6.
Step 6: Use an LLM to check if the final answer is satisfactory. To do so, you build a prompt using the following: a predefined validation template; the user's initial question; the assistant's answer. The LLM has to give a yes or no answer. If it answers yes, we show the final answer to the user. Otherwise, we return a predefined response, such as: "Sorry, we couldn't answer your question because we don't have enough information."
Step 7: Add the user's question and the assistant's answer to a history cache, which will be used to enrich the following prompts with the current conversation. Just to remind you, the assistant should support a conversation, so it needs to know what happened in the previous questions. In practice, you usually keep only the latest N (question, answer) tuples or a conversation summary to keep your context length under control.
(7 Steps to Build a Production-Ready Financial Assistant Using LLMs. Image by the Author)
If you want to see this strategy in action, check out our new FREE Hands-on LLMs course (work in progress); give it a star on GitHub to stay updated with its latest progress.
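Here is a minimal sketch of the 7 steps wired together with the OpenAI Python SDK. The model names are placeholders, and search_financial_news is a stub standing in for whatever vector DB client you use (e.g., Qdrant); the point is only the control flow between the chained prompts.

# Sketch of the 7 chained steps; model names are placeholders, vector DB is stubbed.
from openai import OpenAI

client = OpenAI()
history = []  # Step 7 lives here: the last N (question, answer) tuples

def search_financial_news(embedding, top_k=3):
    return ['<fresh financial news retrieved from the vector DB would go here>']  # stub

def answer(question):
    # Step 1: is the user's question safe?
    if client.moderations.create(input=question).results[0].flagged:
        return 'Sorry, I cannot help with that request.'
    # Step 2: embed the question and fetch context (same embedding model as at ingestion time!)
    emb = client.embeddings.create(model='text-embedding-3-small', input=question).data[0].embedding
    news_block = '\n'.join(search_financial_news(emb))
    # Step 3: build the prompt from a template + context + conversation history
    prompt = f'Context:\n{news_block}\n\nHistory: {history[-3:]}\n\nQuestion: {question}'
    # Step 4: call the LLM
    reply = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'system', 'content': 'You are a helpful financial assistant.'},
                  {'role': 'user', 'content': prompt}],
    ).choices[0].message.content
    # Step 5: is the assistant's answer safe?
    if client.moderations.create(input=reply).results[0].flagged:
        return 'Sorry, we could not answer your question.'
    # Step 6: ask the LLM to validate the answer (expects yes/no)
    verdict = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user',
                   'content': f'Question: {question}\nAnswer: {reply}\nIs the answer satisfactory? Reply yes or no.'}],
    ).choices[0].message.content
    if not verdict.lower().startswith('yes'):
        return 'Sorry, we could not answer your question because we do not have enough information.'
    # Step 7: remember the exchange for the next turn
    history.append((question, reply))
    return reply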
Extra: 3 key resources on how to monitor your ML models. In the last month, I read 100 ML monitoring articles. I trimmed them down for you to 3 key resources:
1. A series of excellent articles made by Arize AI that will make you understand what ML monitoring is all about (Arize Articles).
2. The Evidently AI Blog, where you can find answers to all your questions regarding ML monitoring (Evidently Blog).
3. The monitoring hands-on examples hosted by DataTalksClub, which will teach you how to implement an ML monitoring system (DataTalks Course).
After wasting a lot of time reading other resources, I can say these 3 are a solid start for learning about monitoring ML systems.
That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-7-steps-to-build-a-production?r=1ttoeh" }, { "id": "de3f1dc2-70e9-4621-825b-56dd9a8f99be", "content": "DML: Chain of Thought Reasoning: write robust and explainable prompts for your LLM. Everything you need to know about chaining prompts: increase your LLM's accuracy, debug and explain your LLM. Paul Iusztin, Oct 05, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week's ML & MLOps topics: 1. Chaining prompts to reduce costs, increase accuracy, and easily debug your LLMs. 2. Chain of Thought Reasoning: write robust and explainable prompts for your LLM. Extra: Why any ML system should use an ML platform as its central nervous system.
But first, I want to share with you this quick 7-minute guide teaching you how stable diffusion models are trained and generate new images. Diffusion models are the cornerstone of most modern computer vision generative AI applications. Thus, if you are into generative AI, it is essential to have an intuition of how a diffusion model works. Check out my article to quickly understand the general picture of how diffusion models work: how they generate new images, how they are trained, and how they are controlled by a given context (e.g., text). Busy?
This Is Your Quick Guide to Opening the Diffusion Models Black Box.
1. Chaining prompts to reduce costs, increase accuracy, and easily debug your LLMs. Here it is: chaining prompts is an intuitive technique that states that you must split your prompts into multiple calls. Why? Let's understand this with some analogies. When cooking, you follow a recipe split into multiple steps. You want to move to the next step only when you know that what you have done so far is correct. You want every prompt to be simple and focused. Another analogy is between reading all the code in one monolith/god class and using DRY to separate the logic between multiple modules. You want to understand and debug every prompt easily.
Chaining prompts is a powerful tool for building a stateful system, where you must take different actions depending on the current state. In other words, you control what happens between 2 chained prompts.
Byproducts of chaining prompts: increased accuracy; fewer tokens, hence lower costs; skipping steps of the workflow when they are not needed; avoiding context limitations; easier to include a human in the loop; easier to control, moderate, test, and debug; easier to use external tools/plugins (web search, APIs, databases, a calculator, etc.).
Example: You want to build a virtual assistant to respond to customer service queries. Instead of adding the system message, all the available products, and the user inquiry to one single prompt, you can split it into the following: 1. Use a prompt to extract the products and categories of interest. 2. Enrich the context only with the products of interest. 3. Call the LLM for the final answer. You can evolve this example by adding another prompt that classifies the nature of the user inquiry and, based on that, redirects it to billing, technical support, account management, or a general LLM (similar to the complex system behind GPT-4). A minimal sketch of this flow follows below. (Chaining Prompts to Reduce Costs, Increase Accuracy and Easily Debug Your LLMs. Image by the Author)
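A toy sketch of that customer-service chain is shown below. call_llm is a stub (no API key needed), so what you can see is the part you actually own: the state and the routing logic that sit between two chained prompts. The catalog and prompt wording are my own illustrative choices.

# Toy sketch of the customer-service chain; call_llm is a stub for a real LLM call.
def call_llm(prompt):
    return '<model reply for: ' + prompt[:30] + '...>'  # placeholder answer

CATALOG = {'pro subscription': 'Pro plan: 30 EUR/month', 'starter': 'Starter plan: free'}

def handle(query):
    # Prompt 1: classify the nature of the inquiry so we can route it.
    route = call_llm(f'Classify this inquiry as billing/technical/account/general: {query}')
    # Prompt 2 (here done with plain matching): extract only the products of interest.
    products = [p for p in CATALOG if p in query.lower()]
    # Enrich the context with just those products, not the whole catalog.
    context = '\n'.join(CATALOG[p] for p in products)
    # Prompt 3: one small, focused prompt for the final answer.
    return call_llm(f'[{route}]\nContext:\n{context}\nCustomer: {query}\nAnswer politely:')

print(handle('I want to know the price of the pro subscription plan.'))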
To summarize: instead of writing a giant prompt that includes multiple steps, split the god prompt into multiple modular prompts that let you keep track of the state externally and orchestrate the program. In other words, you want modular prompts that you can combine easily (same as when writing standard functions and classes). To avoid overengineering, use this technique only when your prompt contains more than one instruction. You can leverage the DRY principle from software: one prompt, one instruction. Tools to chain prompts: LangChain. Tools to monitor and debug prompts: Comet LLMOps tools.
2. Chain of Thought Reasoning: write robust and explainable prompts for your LLM. Chain of Thought Reasoning is a powerful prompt engineering technique to improve your LLM's accuracy and explain its answer. Let me explain. It is a method to force the LLM to follow a set of predefined steps.
Why do we need Chain of Thought Reasoning? In complex scenarios, the LLM must thoroughly reason about a problem before responding to the question. Otherwise, the LLM might rush to an incorrect conclusion. By forcing the model to follow a set of steps, we can guide it to think more methodically about the problem. It also helps us explain and debug how the model reached a specific answer.
Inner monologue: The inner monologue is all the steps needed to reach the final answer. Often, we want to hide the reasoning steps from the end user. In fancy words, we want to mimic an inner monologue and output only the final answer. Each reasoning step is structured into a parsable format.
Thus, we can quickly load it into a data structure and output only the desired steps to the user.
Let's better understand this with an example. The input prompt to the LLM consists of a system message plus the user's question. The secret is in defining the system message as follows: "You are a virtual assistant helping clients... Follow the next steps to answer the customer queries. Step 1: Decide if it is a question about a product... Step 2: Retrieve the product... Step 3: Extract the user's assumptions... Step 4: Validate the user's assumptions... Step 5: Answer politely... Make sure to answer in the following format: Step 1: <step_1_answer> Step 2: <step_2_answer> Step 3: <step_3_answer> Step 4: <step_4_answer> Response to the user: <final_response>".
By enforcing the LLM to follow these steps, we ensure it answers the right questions. Ultimately, we show the user only the final_response part of the answer. The other steps (aka the inner monologue) help the model to reason and the developer to debug. Have you used this technique when writing prompts? (Chain of Thought Reasoning: Write robust and explainable prompts for your LLM. Image by the Author)
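Here is a small sketch of the parsing side, assuming the model answers in a "Step N: ..." / "Response to the user: ..." format like the one in the system message above. Only the final response reaches the user; the parsed steps stay available for debugging.

# Sketch: parse the step-formatted output and show the user only the final response.
import re

def split_inner_monologue(llm_output):
    steps = dict(re.findall(r'Step (\d+):\s*(.+)', llm_output))
    match = re.search(r'Response to the user:\s*(.+)', llm_output, re.DOTALL)
    final_answer = match.group(1).strip() if match else llm_output
    return steps, final_answer

raw = (
    'Step 1: The user asks about a product\n'
    'Step 2: The product is the Pro plan\n'
    'Step 3: The user assumes it is free\n'
    'Step 4: The assumption is wrong\n'
    'Response to the user: The Pro plan costs 30 EUR/month.'
)
reasoning, answer = split_inner_monologue(raw)
print(answer)           # shown to the user
print(reasoning['4'])   # kept for the developer to debug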
Extra: Why any ML system should use an ML platform as its central nervous system. Here is why: the primary role of an ML platform is to bring structure to your experiments, visualizations, models, datasets, and documentation. Its role is also to decouple your data preprocessing, experiment, training, and inference pipelines. An ML platform helps you automate everything mentioned above using these 6 features: 1. experiment tracking: log and compare experiments; 2. metadata store: know how a model (aka experiment) was generated; 3. visualizations: a central hub for your visualizations; 4. reports: create documents out of your experiments; 5. artifacts: version and share your datasets; 6. model registry: version and share your models. (Why any ML system should use an ML platform as its central nervous system. GIF by the Author) I have used many ML platforms before, but lately I started using Comet, and I love it (Comet ML). What is your favorite ML platform?
That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-chain-of-thought-reasoning-write?r=1ttoeh" }, { "id": "3d7e4ad6-60d2-4e20-bf42-e158930d168c", "content": "DML: Build and serve a production-ready classifier in 1 hour using LLMs. Stop manually creating your ML AWS infrastructure: use Terraform! Paul Iusztin, Sep 21, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week's ML & MLOps topics: 1. Stop manually creating your ML AWS infrastructure. Use Terraform! 2. Build and serve a production-ready classifier in 1 hour using LLMs.
Before going into our subject of the day, I have some news to share with you. Do you want to quickly learn, in a structured way, how to build end-to-end ML systems using LLMs, with an emphasis on real-world examples?
I want to let you know that on September 28th I am invited to a webinar to present an overview of the Hands-on LLMs course I am creating. I will show you a hands-on example of how to build a financial bot using LLMs. Here is what I will cover: creating your Q&A dataset in a semi-automated way (OpenAI GPT); fine-tuning an LLM on your new dataset using QLoRA (HuggingFace, Peft, Comet ML, Beam); building a streaming pipeline to ingest news in real time into a vector DB (Bytewax, Qdrant, AWS); building a financial bot based on the fine-tuned model and real-time financial news (LangChain, Comet ML, Beam); building a simple UI to interact with the financial bot. No notebooks or fragmented examples: I want to show you how to build a real product. More precisely, I will focus on the engineering and system design, showing you how the components described above work together. If this is something you want to learn, be sure to register using the link below: Engineering an End-to-End ML System for a Financial Assistant Using LLMs (September 28th). See you there! Now back to business.
1. Stop manually creating your ML AWS infrastructure. Use Terraform! I was uselessly spending 1,000 dollars every month on cloud machines until I started using this tool: Terraform!
First, let's understand why we need Terraform. When you want to deploy a software application, there are two main steps: 1. provisioning the infrastructure; 2. deploying the applications. A regular workflow is that, before deploying your applications or building your CI/CD pipelines, you manually go and spin up your, let's say, AWS machines. Initially, this workflow is just fine, but there are two scenarios where it gets problematic. 1. Your infrastructure gets too big and complicated; thus, replicating it manually is cumbersome and might yield bugs. 2. In the world of AI, there are many cases when you want to spin up a GPU machine to train your models and afterward you don't need it anymore. If you forget to shut it down, you will end up uselessly paying a lot of money. With Terraform, you can solve both of these issues. So... What is Terraform?
It sits on the provisioning-infrastructure layer as an infrastructure-as-code tool that: is declarative (you focus on the WHAT, not on the HOW); automates and manages your infrastructure; is open source. Yeah... yeah... that sounds fancy. But what can I do with it? Let's take AWS as an example, where you have to: create a VPC; create AWS users and permissions; spin up EC2 machines; install programs (e.g., Docker); create a K8s cluster. Using Terraform, you can do all of that just by providing a configuration file that reflects the desired state of your infrastructure. Basically, it helps you create all the infrastructure you need programmatically. Isn't that awesome? (Terraform. Image by the Author) If you want to quickly understand Terraform well enough to start using it in your own projects, check out my 7-minute-read article: Stop Manually Creating Your AWS Infrastructure. Use Terraform!
2. Build and serve a production-ready classifier in 1 hour using LLMs. LLMs are a lot more than chatbots. These models are revolutionizing how ML systems are built. Using the standard approach to building an end-to-end ML application, you had to: get labeled data (1 month); train the model (2 months); serve the model (3 months). These 3 steps might take around 6 months to implement. So far, it worked great. But here is the catch: you can reach almost the same result in a few hours or days using a prompt-based learning approach. Let's take a classification task as an example.
Step 1: You write a system prompt explaining to the model the task and what types of inputs and outputs it will get: "You will be provided with customer service queries.
Classify each query into the following categories: Billing, Account Management, General Inquiry."
Step 2: You can give the model an example to make sure it understands the task (known as one-shot learning): "User: I want to know the price of the pro subscription plan. Assistant: Billing."
Step 3: Attach the user prompt and create the input prompt, which now consists of the system, example, and user prompts.
Step 4: Call the LLM's API... and boom, you built a classifier in under one hour. Cool, right? Using this approach, the only time-consuming step is tweaking the prompt until it reaches the desired result. (How to quickly build a classifier using LLMs. GIF by the Author)
To conclude... In today's LLM world, to build a classifier you have to: write a system prompt; add an example; attach the user prompt; pass the input prompt to the LLM API.
That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul
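For illustration, here is what those 4 steps can look like with the OpenAI Python SDK; the model name is a placeholder, and the categories mirror the example above.

# Sketch of the 4 steps: system prompt + one-shot example + user prompt + API call.
from openai import OpenAI

client = OpenAI()

SYSTEM = ('You will be provided with customer service queries. '
          'Classify each query into one of: Billing, Account Management, General Inquiry. '
          'Answer with the category name only.')
ONE_SHOT = [
    {'role': 'user', 'content': 'I want to know the price of the pro subscription plan.'},
    {'role': 'assistant', 'content': 'Billing'},
]

def classify(query):
    messages = [{'role': 'system', 'content': SYSTEM}, *ONE_SHOT,
                {'role': 'user', 'content': query}]
    reply = client.chat.completions.create(model='gpt-4o-mini', messages=messages, temperature=0)
    return reply.choices[0].message.content.strip()

print(classify('I cannot log into my account since yesterday.'))  # expected: Account Management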
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-build-and-serve-a-production?r=1ttoeh" }, { "id": "49e2912f-313d-439d-8de6-522dc8379cb2", "content": "DML: 4 key ideas you must know to train an LLM successfully. My time series forecasting Python code was a disaster until I started using this package. Paul Iusztin, Sep 14, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week's ML & MLOps topics: 1. My time series forecasting Python code was a disaster until I started using this package. 2. 4 key ideas you must know to train an LLM successfully. Extra: My favorite ML & MLOps newsletter.
1. My time series forecasting Python code was a disaster until I started using this package. Does building time series models sound more complicated than modeling standard tabular datasets? Well... maybe it is... but that is precisely why you need to learn more about sktime! When I first built forecasting models, I manually coded the required preprocessing and postprocessing steps. What a newbie I was... How easy would my life have been if I had used sktime from the beginning?
What is sktime? sktime is a Python package that adds time series functionality on top of well-known packages such as statsmodels, fbprophet, scikit-learn, autoarima, xgboost, etc. All of a sudden, all your beloved packages support time series features such as: easily swapping between different models (e.g., xgboost, lightgbm, decision trees, etc.); out-of-the-box windowing transformations and aggregations; functionality for multivariate, panel, and hierarchical learning; cross-validation adapted to time series; cool visualizations; and more... (sktime example. Image by the Author) If you want to see sktime in action, check out my article: A Guide to Building Effective Training Pipelines for Maximum Results.
2. 4 key ideas you must know to train an LLM successfully. These are 4 key ideas you must know to train an LLM successfully.
How is the model learning? LLMs still leverage supervised learning. A standard NLP task is to build a classifier: you have a sequence of tokens as input and, as output, a set of classes (e.g., negative and positive). When training an LLM for text generation, the input is a sequence of tokens, and its task is to predict the next token. Input: "JavaScript is all you ..." Output: "need". This is known as an autoregressive process.
Words != tokens: Tokens are created based on the frequency of sequences of characters. For example, in the sentence "Learning new things is fun!", every word is a different token, as each is frequently used. In the sentence "Prompting is a ...", the word "prompting" is divided into 3 tokens: "prom", "pt", and "ing". This is important because different LLMs have different limits for the number of input tokens. (How to train an LLM cheatsheet. Image by the Author)
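You can see this for yourself with any tokenizer; the sketch below uses the GPT-2 tokenizer from transformers purely as an illustration (the exact splits and limits differ from model to model).

# Sketch: the same sentence can be roughly one token per word or several tokens per word,
# depending on how frequent the character sequences are (GPT-2 tokenizer used for illustration).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
print(tokenizer.tokenize('Learning new things is fun!'))   # mostly one token per word
print(tokenizer.tokenize('Prompting is all you need'))     # 'Prompting' splits into sub-word tokens
print(tokenizer.model_max_length)                           # every LLM has an input token limit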
Types of LLMs: There are 3 primary types of LLMs: base LLM; instruction-tuned LLM; RLHF-tuned LLM.
Steps to get from a base to an instruction-tuned LLM: 1. Train the base LLM on a lot of data (trillions of tokens), trained for months on massive GPU clusters. 2. Fine-tune the base LLM on a Q&A dataset (millions of tokens), trained for hours or days on modest-size computational resources. 3. (Optional) Fine-tune the LLM further on human ratings that reflect the quality of different LLM outputs, on criteria such as whether the answer is helpful, honest, and harmless, using RLHF. This increases the probability of generating more highly rated outputs.
How to build the prompt to fine-tune the LLM on a Q&A dataset: The most common approach consists of 4 steps: 1. A system message that sets the general tone and behavior. 2. The context that adds more information to help the model answer (optional). 3. The user's question. 4. The answer to the question. Note that you need to know the answer to the question during training; you can intuitively see it as your label.
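A minimal sketch of assembling such a training sample is shown below. The tags and the template are my own illustrative convention, not a required format; what matters is that the answer is part of the rendered text and acts as the label.

# Sketch of the 4-part training prompt: system + (optional) context + question + answer.
def build_prompt(system, context, question, answer):
    context_block = f'Context: {context}\n' if context else ''   # the context is optional
    return (
        f'### System: {system}\n'
        f'{context_block}'
        f'### User: {question}\n'
        f'### Assistant: {answer}'                                # the answer acts as the label
    )

sample = build_prompt(
    system='You are a financial assistant. Answer briefly and honestly.',
    context='Tesla closed 3% higher on Friday.',
    question='How did Tesla stock end the week?',
    answer='Tesla closed the week up, gaining about 3% on Friday.',
)
print(sample)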
Extra: My favorite ML & MLOps newsletter. Do you want to learn ML & MLOps from real-world experience? Then I suggest you join Pau Labarta Bajo's Real-World Machine Learning weekly newsletter, along with another 8k ML developers. Pau Labarta Bajo inspired me to start my weekly newsletter and is a great teacher who makes learning seamless: Real-World Machine Learning, every Saturday morning.
That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-4-key-ideas-you-must-know-to?r=1ttoeh" }, { "id": "0b152bfd-0a90-4220-a1b8-77709ecb06d0", "content": "DML: How to add real-time monitoring metrics to your ML system. How to easily add retry policies to your Python code. Paul Iusztin, Sep 07, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week's ML & MLOps topics: 1. How to add real-time monitoring metrics to your ML system. 2. How to easily add retry policies to your Python code. Storytime: How am I writing code in 2023? I don't.
But first, I have some big news to share with you. Want to learn how to fine-tune an LLM, build a streaming pipeline, use a vector DB, build a financial bot, and deploy everything using a serverless solution? Then you will enjoy looking at this new free course that Pau Labarta Bajo from the RWML newsletter and I are cooking. The course will teach you how to build an end-to-end LLM solution. It is structured into 4 modules:
Module 1: Learn how to generate a financial Q&A dataset in a semi-automated way using the OpenAI API.
Module 2: Fine-tune the LLM (e.g., Falcon, Llama 2, etc.) using HuggingFace and Peft. We will also show you how to integrate an experiment tracker and model registry and how to monitor the prompts using Comet.
Module 3: Build a streaming pipeline using Bytewax that listens to financial news through a web socket, cleans it, embeds it, and loads it into a vector database using Qdrant.
Module 4: Wrap the fine-tuned model and the vector DB into a financial bot using LangChain and deploy it behind a RESTful API.
But all of this is useless if it isn't deployed. We will use Beam to deploy everything quickly. Beam is a serverless solution that lets you focus on your problem and quickly serve all your ML components: say bye-bye to access policies and network configuration.
Note: This is still a work in progress, but the first 3 modules are almost done. (Architecture built during the Hands-on LLMs course. GIF by the Author) Curious? Then check out the repository and give it a star: Course GitHub Repository.
1. How to add real-time monitoring metrics to your ML system. Your model is exposed to performance degradation after it is deployed to production. That is why you need to monitor it constantly. The most common way to monitor an ML model is to compute its metrics. But for that, you need the ground truth. In production, you can automatically access the ground truth in 3 main scenarios: 1. near real time: you can access it quite quickly; 2. delayed: you can access it after a considerable amount of time (e.g., one month); 3. never: you have to label the data manually. For use cases 2 and 3,
you can quickly build your monitoring pipeline in the following way: store the model predictions and the ground truth as soon as they are available (these two will be out of sync, so you can't compute the metrics right away); build a DAG (e.g., using Airflow) that extracts the predictions and the ground truth, computes the metrics in batch mode, and loads them into another storage (e.g., GCS); use an orchestration tool to run the DAG in one of the following scenarios: 1. scheduled: if the ground truth is available in near real time (e.g., hourly), it makes sense to run your monitoring pipeline at that known frequency; 2. triggered: if the ground truth is delayed and you don't know when it may come up, you can implement a webhook to trigger your monitoring pipeline; finally, attach a consumer to your storage to use and display the metrics (e.g., trigger alarms and display them in a dashboard). A minimal sketch follows below. (How to add real-time monitoring metrics to your ML system. Image by the Author) If you want to see how to implement a near real-time monitoring pipeline using Airflow and GCS, check out my article: Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring.
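As an illustration, here is a toy Airflow DAG for the batch metrics step, using the TaskFlow API. The hard-coded predictions and ground truth stand in for reads from your prediction and GT storage, and the schedule argument name varies slightly across Airflow versions (schedule vs schedule_interval).

# Toy Airflow DAG: extract predictions + ground truth, compute metrics in batch, load them.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule='@hourly', start_date=datetime(2023, 9, 1), catchup=False)
def monitoring_pipeline():
    @task
    def extract():
        # in reality: pull the stored predictions and the ground truth received since the last run
        return {'y_true': [0, 1, 1, 0], 'y_pred': [0, 1, 0, 0]}

    @task
    def compute_metrics(batch):
        hits = sum(t == p for t, p in zip(batch['y_true'], batch['y_pred']))
        return {'accuracy': hits / len(batch['y_true'])}

    @task
    def load(metrics):
        print('write to GCS / dashboard storage:', metrics)  # the consumer reads from here

    load(compute_metrics(extract()))

monitoring_pipeline()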
2. How to easily add retry policies to your Python code. One strategy that makes the difference between good code and great code is adding retry policies. Implementing them manually can get tedious and complicated. Retry policies are a must when you make calls to an external API, read from a queue, etc. Using the Tenacity Python package, you can quickly decorate your functions and add customizable retry policies, such as: 1. Add fixed and random wait times between multiple retries. 2. Add a maximum number of attempts or a maximum computation time. 3. Retry only when specific errors are thrown (or not thrown). As you can see, you can easily compose these policies with each other. The cherry on top is that you can access the statistics of the retries of a specific function: print(raise_my_exception.retry.statistics). (Examples of retry policies using tenacity. Image by the Author) tenacity repository.
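Here is a minimal tenacity sketch composing the three policies above; the flaky function and its error are toys (it fails twice, then succeeds).

# Sketch of composing tenacity retry policies.
from tenacity import retry, stop_after_attempt, wait_fixed, wait_random, retry_if_exception_type

ATTEMPTS = {'count': 0}

@retry(
    wait=wait_fixed(2) + wait_random(0, 1),            # 1. fixed + random wait between retries
    stop=stop_after_attempt(5),                         # 2. maximum number of attempts
    retry=retry_if_exception_type(ConnectionError),     # 3. retry only on specific errors
)
def call_external_api():
    ATTEMPTS['count'] += 1
    if ATTEMPTS['count'] < 3:                            # toy failure: the first two calls time out
        raise ConnectionError('the external API timed out')
    return 'ok'

print(call_external_api())
print(call_external_api.retry.statistics)                # the retry statistics mentioned above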
Storytime: How am I writing code in 2023? I don't. As an engineer, you are paid to think and solve problems; how you do that doesn't matter. Let me explain. The truth is that I am lazy. That is why I am a good engineer. With the rise of LLMs, my laziness hit all-time highs. Thus, this is how I write my code these days: 50% Copilot (Tab is the new Ctrl-C + Ctrl-V); 30% ChatGPT / Bard; 10% StackOverflow (call me insane, but I still use StackOverflow from time to time); 10% writing my own code. The thing is that I am more productive than ever... and that 10% of writing my own code is the final step that connects all the dots and brings real value to the table. In reality, as an engineer, you mostly have to: ask the right questions; understand and improve the architecture of the system; debug code; understand business requirements; communicate with other teams... not write code. (Image by the Author) Writing code as we know it will most probably disappear with the rise of AI (it kind of already did). What do you think? How do you write code these days?
That's it for today. See you next Thursday at 9:00 a.m. CET. Have a fantastic weekend! Paul
", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-how-to-add-real-time-monitoring?r=1ttoeh" }, { "id": "a520fdac-65b4-4340-9ee2-d16a1390b838", "content": "DML: Top 6 ML platform features you must know to build an ML system. Why is serving an ML model using a batch architecture so powerful? Paul Iusztin, Aug 31, 2023.
Hello there, I am Paul Iusztin. Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time. This week we will cover: 1. Top 6 ML platform features you must know to build an ML system. 2. Why is serving an ML model using a batch architecture so powerful? Story: I never forget anything, said no one... but your second brain. This week, no shameless promotion.
1. Top 6 ML platform features you must know to build an ML system. Here they are:
1. Experiment tracking: In your ML development phase, you generate lots of experiments. Tracking and comparing the metrics between them is crucial for finding the optimal model.
2. Metadata store: Its primary purpose is reproducibility. To know how a model was generated, you need to know: the version of the code; the versions of the packages; the hyperparameters/config; the total compute; the version of the dataset... and more.
3. Visualizations: Most of the time, along with the metrics, you must log a set of visualizations for your experiment, such as: images; videos; prompts; t-SNE graphs; 3D point clouds... and more.
4. Reports: You don't work in a vacuum.
You have to present your work to other colleagues or clients. A report lets you take the metadata and visualizations from your experiment and create, deliver, and share a targeted presentation for your clients or peers.
5. Artifacts: The most powerful feature of them all. An artifact is a versioned object that is an input or output of your task. Everything can be an artifact, but the most common cases are data, model, and code. Wrapping your assets in an artifact ensures reproducibility. For example, you wrap your features into an artifact (e.g., features:3.1.2), which you can consume in your ML development step. The ML development step will generate config (e.g., config:1.2.4) and code (e.g., code:1.0.2) artifacts used in the continuous training pipeline. Doing so lets you quickly answer questions such as "What did I use to generate the model?" and "Which version?"
6. Model registry: The model registry is the ultimate way to make your model accessible to your production ecosystem. For example, in your continuous training pipeline, after the model is trained, you load the weights as an artifact into the model registry (e.g., model:1.2.4). You label this model as staging under a new version and prepare it for testing. If the tests pass, you mark it as production under a new version and prepare it for deployment (e.g., model:2.1.5). (Top 6 ML platform features you must know. Image by the Author)
All of these features are used in a mature ML system. What is your favorite one? You can see all these features in action in my The Full Stack 7 Steps MLOps Framework FREE course.
2. Why is serving an ML model using a batch architecture so powerful? When you first start deploying your ML model, you want an initial end-to-end flow as fast as possible. Doing so lets you quickly provide value, get feedback, and even collect data. But here is the catch: successfully serving an ML model is tricky, as you need many iterations to optimize your model to work in real time with low latency and high throughput. Initially, serving your model in batch mode is like a hack. By storing the model's predictions in dedicated storage, you automatically move your model from offline mode to a real-time online model. Thus, you no longer have to care about your model's latency and throughput; the consumer will directly load the predictions from the given storage.
These are the main steps of a batch architecture: extract raw data from a real data source; clean, validate, and aggregate the raw data within a feature pipeline; load the cleaned data into a feature store; experiment to find the best model transformations using the data from the feature store; upload the best model from the training pipeline into the model registry; inside a batch prediction pipeline, use the best model from the model registry to compute the predictions; store the predictions in some storage; the consumer downloads the predictions from the storage; repeat the whole process hourly, daily, weekly, etc. (it depends on your context). The main downside of deploying your model in batch mode is that the predictions will have a level of lag. For example, in a recommender system, if you make your predictions daily, it won't capture a user's behavior in real time, and it will update the predictions only at the end of the day. Moving to other architectures, such as request-response or streaming, will be natural after your system matures in batch mode. ML Batch Architecture Design (Image by the Author). So remember: when you initially deploy your model, using a batch mode architecture will be your best shot at a good user experience. _Story_: I never forget anything, said no one... but your second brain. After 6 months of refinement, this is my second brain strategy. Tiago Forte's book inspired me, but I adapted his system to my needs. 0. Collect. This is where you are bombarded with information from all over the place. 1. The Graveyard. This is where I save everything that looks interesting. I won't use 90% of what is here, but it satisfies my urge to save that cool article I saw on LinkedIn. Tools: mostly browser bookmarks, but I rarely use GitHub stars, Medium lists, etc. 2. The Board. Here, I start converging the information and planning what to do next. Tools: Notion. 3. The Field. Here is where I express myself through learning, coding, writing, etc. Tools: whatever you need to express yourself. Steps 2 and 3 are iterative processes, so I often bounce between them until the information is distilled. 4. The Warehouse. Here is where I take the distilled information and write it down for cold storage. Tools: Notion, Google Drive. When I want to search for a piece of information, I start from the Warehouse and go backward until I find what I need.
As a minimalist, I kept my tools to a minimum. I primarily use only Brave, Notion, and Google Drive. You don't need 100 tools to be productive; they just want to take your money from you. My second brain strategy (Image by the Author). So remember: you have to collect, link, plan, distill, and store. That's it for today. See you next Thursday at 9:00 am CET. Have a fantastic weekend! Paul. Whenever you're ready, here is how I can help you: 1. The Full Stack 7 Steps MLOps Framework: a 7-lesson FREE course that will walk you step by step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium. 2. Machine Learning & MLOps Blog: here, I approach in-depth topics about designing and productionizing ML systems using MLOps. 3. Machine Learning & MLOps Hub: a place where I will constantly aggregate all my work (courses, articles, webinars, podcasts, etc.).", "platform": "decodingml.substack.com", "author_id": "b5fa1f08-75f0-402d-8e88-d1357e346d9e", "author_full_name": "Paul Iusztin", "link": "https://decodingml.substack.com/p/dml-top-6-ml-platform-features-you?r=1ttoeh" } ] }