---
title: AIE5 MidTerm
emoji: 👁
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
license: mit
---

# AIE 5 Mid Term

My Loom Video

## Defining your Problem and Audience

  1. Write a succinct 1-sentence description of the problem

    Shopify Flow is a powerful but complex automation tool that can save Shopify merchants a significant amount of time and money by handling repetitive tasks. However, the complexity of the tool can be overwhelming for non-technical merchants, often forcing them to hire expensive domain experts to automate and manage their business.

  2. Write 1-2 paragraphs on why this is a problem for your specific user

    Smaller Shopify merchants are often non-technical and may not have the resources to hire domain experts to automate their business processes. This can result in a significant loss of time and money, as they are unable to take full advantage of Shopify Flow's automation capabilities. By generating Shopify Flow automations from natural language, we can help these merchants save time and money and focus on growing their business.

## Propose a Solution

  1. Write 1-2 paragraphs on your proposed solution. How will it look and feel to the user?

    Shopify Flow automations can be imported and exported in a standardized JSON format that represents a directed acyclic graph (DAG) of triggers, conditions, and actions. By fine-tuning a base model to output JSON objects that satisfy Shopify Flow's project format, we can generate these Shopify Flow JSON objects from natural language, eliminating the need for merchants to hire expensive domain experts to automate and manage their business.

    Shopify merchants will describe the automation they need in natural language. The model will ask additional questions to gather requirements and important considerations. With this information, our model will generate the Shopify Flow JSON object that satisfies the merchant's automation request. The merchant can import this JSON file into Shopify Flow and simply activate the automation. They can do this first in a development environment to test the automation before deploying it to production.
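For illustration, a generated automation might look roughly like the Python dict below. The field names here are hypothetical placeholders to show the trigger/condition/action DAG idea, not the actual Shopify Flow export schema:

```python
import json

# Hypothetical sketch of a generated automation as a trigger/condition/action
# graph. Field names are illustrative placeholders, NOT the real Shopify Flow
# export schema.
automation = {
    "name": "Tag high-value orders",
    "trigger": {"type": "order_created"},
    "conditions": [
        {"field": "order.total_price", "operator": "greater_than", "value": 500}
    ],
    "actions": [
        {"type": "add_order_tag", "tag": "high-value"}
    ],
}

# The agent would write this out as the importable file for the merchant.
print(json.dumps(automation, indent=2))
```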

  2. Describe the tools you plan to use in each part of your stack. Write one sentence on why you made each tooling choice.

    1. LLM

      We will use a fine-tuned large language model (LLM) to generate Shopify Flow JSON objects from natural language. We will need to fine-tune a powerful reasoning model (likely DeepSeek R1) to understand the complex requirements and translate them into a logical flow diagram.

    2. Embedding Model

      We will use a fine-tuned embedding model to retrieve the example Shopify Flow JSON objects most relevant to the merchant's request, which will help guide generation of the final JSON object.
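The retrieval step boils down to nearest-neighbor search over embeddings. A minimal pure-Python sketch, with toy 3-d vectors standing in for the fine-tuned model's embeddings of annotated example flows (the example names are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors standing in for embeddings of annotated example flows.
examples = {
    "tag-high-value-orders": [0.9, 0.1, 0.0],
    "notify-on-refund": [0.1, 0.8, 0.2],
    "restock-alert": [0.0, 0.2, 0.9],
}

# Embedding of the merchant's natural-language request.
query = [0.85, 0.15, 0.05]
best = max(examples, key=lambda name: cosine(query, examples[name]))
print(best)  # tag-high-value-orders
```

In the real app, Qdrant performs this search at scale; the principle is the same.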

    3. Orchestration

      We will use LangGraph to orchestrate the LLM and Embedding Model to generate the final Shopify Flow JSON object.

    4. Vector Database

      We will use Qdrant to store the Shopify Flow JSON objects and embeddings for quick retrieval and comparison.

    5. Monitoring

      I haven't thought about this yet as it has not been covered in class AFAIK. I will need to do some research and learn more about monitoring.

    6. Evaluation

      We will use RAGAS to evaluate the quality of the generated Shopify Flow JSON objects.

    7. User Interface

      We will use Chainlit to implement a simple chatbot interface where the merchant can describe the automation they need and answer clarifying questions from the agent. It will produce a downloadable file containing the generated Shopify Flow JSON object.

    8. (Optional) Serving & Inference

      I'm not sure yet, as I don't know much about hosting my own fine-tuned models for inference. I need to learn more and do some research.

  3. Where will you use an agent or agents? What will you use “agentic reasoning” for in your app?

    Because we are generating Shopify Flow JSON objects from natural language, we will use an agent to interact with the merchant to gather requirements and important considerations. The agent will use agentic reasoning to ask clarifying questions and ensure the generated Shopify Flow JSON object satisfies the merchant's automation request. We will need to fine-tune a powerful reasoning model (likely DeepSeek R1) to understand the complex requirements and translate them into a logical flow diagram.
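The requirement-gathering part of that agent can be thought of as slot filling: keep asking clarifying questions until every slot needed to generate a flow is filled. A sketch, where the slot names and question wording are illustrative placeholders:

```python
# Sketch of the agent's requirement-gathering loop. Slot names and question
# wording are illustrative placeholders, not a fixed schema.
REQUIRED_SLOTS = {
    "trigger": "What event should start the automation?",
    "condition": "Under what condition should it run?",
    "action": "What should happen when the condition is met?",
}

def next_question(slots):
    """Return the next clarifying question, or None once all slots are filled."""
    for name, question in REQUIRED_SLOTS.items():
        if not slots.get(name):
            return question
    return None  # ready to generate the Flow JSON object

# Merchant has only described the trigger so far; the agent asks about conditions.
print(next_question({"trigger": "order created"}))
```

In the actual app this loop would be a node in the LangGraph graph, with the LLM deciding which slot is still missing rather than a hard-coded dict.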

## Dealing with the Data

  1. Describe all of your data sources and external APIs, and describe what you’ll use them for.

    I don't believe I will need external APIs because I don't need real-time data. All of my data will come from documents. I will need a library of annotated Shopify Flow JSON objects to fine-tune the LLM and embedding model. I will also need to use these examples and synthetic data generation (SDG) to create additional examples for fine-tuning and evaluating the models.

  2. Describe the default chunking strategy that you will use. Why did you make this decision?

    I will start with a default chunking strategy of ~1000 tokens with a 500-token overlap. I will tweak these numbers as I test different scenarios, but it feels like a good place to start.
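Concretely, 1000-token chunks with a 500-token overlap means each new chunk starts 500 tokens after the previous one (stride = size − overlap). A minimal sketch over a token list:

```python
def chunk_tokens(tokens, size=1000, overlap=500):
    """Split a token list into chunks of `size` tokens, with `overlap` tokens
    shared between consecutive chunks (stride = size - overlap)."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

tokens = list(range(2500))  # stand-in for a tokenized documentation page
chunks = chunk_tokens(tokens)
print(len(chunks))   # 4
print(chunks[1][0])  # 500 -- second chunk starts halfway into the first
```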

  3. [Optional] Will you need specific data for any other part of your application? If so, explain.

    Yes I will need to use lots of examples of Shopify Flow JSON objects to fine-tune the LLM and Embedding Model. I will also need to use these examples and SDG to synthetically generate additional examples to fine-tune and evaluate the models.

## Building a Quick End-to-End Prototype

  1. Build an end-to-end prototype and deploy it to a Hugging Face Space (or other endpoint)

    The initial prototype just answers general questions about Shopify Flow and is trained on the publicly available documentation at https://shopify.dev/docs/flow. It is available here: https://huggingface.co/spaces/thomfoolery/AIE5-MidTerm. It is meant to satisfy the requirements of the midterm and is not very useful in its current state.

    I plan to work on getting it to generate Shopify Flow JSON objects from natural language next.

## Creating a Golden Test Data Set

  1. Assess your pipeline using the RAGAS framework including key metrics faithfulness, response relevance, context precision, and context recall. Provide a table of your output results.

    | Metric | Value |
    | --- | --- |
    | Context Recall | 0.8069444444 |
    | Faithfulness | 0.9318181818 |
    | Factual Correctness | 0.373 |
    | Answer Relevancy | 0.9581163434 |
    | Context Entity Recall | 0.3598076345 |
    | Noise Sensitivity (Relevant) | 0.3017243635 |

  2. What conclusions can you draw about the performance and effectiveness of your pipeline with this information?

    I found that my RAG application performs well except on Factual Correctness and Context Entity Recall. I'm not sure whether 0.3 is bad for Noise Sensitivity (Relevant), but it seems OK to me. Poor factual correctness is concerning; I wonder if fine-tuned embeddings will improve this metric. I will need to investigate further. I also wonder whether a smaller chunk size and overlap could help with context entity recall.

## Fine-Tuning Open-Source Embeddings

  1. Swap out your existing embedding model for the new fine-tuned version. Provide a link to your fine-tuned embedding model on the Hugging Face Hub.

    My fine-tuned embedding model is available here: https://huggingface.co/thomfoolery/AIE5-MidTerm-finetuned-embeddings

## Assessing Performance

  1. How does the performance compare to your original RAG application? Test the fine-tuned embedding model using the RAGAS frameworks to quantify any improvements.

    Provide results in a table.

    | Metric | Original Value | Fine-Tuned Value |
    | --- | --- | --- |
    | Context Recall | 0.8069444444 | 0.7104166667 |
    | Faithfulness | 0.9318181818 | 0.8851135006 |
    | Factual Correctness | 0.373 | 0.4325 |
    | Answer Relevancy | 0.9581163434 | 0.9614896608 |
    | Context Entity Recall | 0.3598076345 | 0.3135290494 |
    | Noise Sensitivity (Relevant) | 0.3017243635 | 0.2666666667 |

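To make the comparison easier to read, a small script can compute the per-metric change between the two RAGAS runs (values copied from the tables above, rounded to four decimals):

```python
# Per-metric change between the original and fine-tuned RAGAS runs,
# using the values from the tables above (rounded to 4 decimals).
original = {
    "context_recall": 0.8069, "faithfulness": 0.9318,
    "factual_correctness": 0.3730, "answer_relevancy": 0.9581,
    "context_entity_recall": 0.3598, "noise_sensitivity_relevant": 0.3017,
}
fine_tuned = {
    "context_recall": 0.7104, "faithfulness": 0.8851,
    "factual_correctness": 0.4325, "answer_relevancy": 0.9615,
    "context_entity_recall": 0.3135, "noise_sensitivity_relevant": 0.2667,
}

deltas = {m: round(fine_tuned[m] - original[m], 4) for m in original}

# Print metrics from largest regression to largest improvement.
for metric, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{metric:28s} {delta:+.4f}")
```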
  2. Articulate the changes that you expect to make to your app in the second half of the course. How will you improve your application?

    Context Entity Recall, Context Recall, and Faithfulness dropped significantly. Factual Correctness, Answer Relevancy, and Noise Sensitivity improved slightly. Overall I think this is a good start, but I need to try a few variations to see if I can improve the metrics across the board. I assume chunk size and overlap are having a negative impact, so I will try a few different values to see if I can improve the metrics.