---
title: AIE5 MidTerm
emoji: 👁
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
license: mit
---

# AIE 5 Mid Term

My Loom Video

## Defining your Problem and Audience

  1. Write a succinct 1-sentence description of the problem

    Shopify Flow is a powerful but complex automation tool that can save Shopify merchants a significant amount of time and money by handling repetitive tasks. However, the complexity of the tool can be overwhelming for non-technical merchants, often forcing them to hire expensive domain experts to automate and manage their business.

  2. Write 1-2 paragraphs on why this is a problem for your specific user

    Smaller Shopify merchants are often non-technical and may not have the resources to hire domain experts to automate their business processes. This can result in a significant loss of time and money, as they are unable to take full advantage of Shopify Flow's automation capabilities. By generating Shopify Flow automations from natural language, we can help these merchants save time and money and focus on growing their business.

## Propose a Solution

  1. Write 1-2 paragraphs on your proposed solution. How will it look and feel to the user?

    Shopify Flow automations can be imported and exported in a standardized JSON format that represents a directed acyclic graph (DAG) of triggers, conditions, and actions. By fine-tuning a base model to output JSON objects that satisfy Shopify Flow's project format, we can generate these Shopify Flow JSON objects from natural language, eliminating the need for merchants to hire expensive domain experts to automate and manage their business.

    Shopify merchants will describe the automation they need in natural language. The model will ask additional questions to gather requirements and important considerations. With this information, our model will generate the Shopify Flow JSON object that satisfies the merchant's automation request. The merchant can import this JSON file into Shopify Flow and simply activate the automation. They can do this first in a development environment to test the automation before deploying it to production.
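For illustration, a generated automation might look roughly like the Python dict below. The field names here are hypothetical placeholders to show the trigger/condition/action DAG idea, not the actual Shopify Flow export schema:

```python
import json

# Hypothetical sketch of a generated automation as a trigger/condition/action
# graph. Field names are illustrative placeholders, NOT the real Shopify Flow
# export schema.
automation = {
    "name": "Tag high-value orders",
    "trigger": {"type": "order_created"},
    "conditions": [
        {"field": "order.total_price", "operator": "greater_than", "value": 500}
    ],
    "actions": [
        {"type": "add_order_tag", "tag": "high-value"}
    ],
}

# The agent would write this out as the importable file for the merchant.
print(json.dumps(automation, indent=2))
```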

  2. Describe the tools you plan to use in each part of your stack. Write one sentence on why you made each tooling choice.

    1. LLM

      We will use a fine-tuned large language model (LLM) to generate Shopify Flow JSON objects from natural language. We will need to fine-tune a powerful reasoning model (likely DeepSeek R1) to understand the complex requirements and translate them into a logical flow diagram.

    2. Embedding Model

      We will use a fine-tuned embedding model to retrieve the example Shopify Flow JSON objects most relevant to the merchant's request, which will help guide generation of the final JSON object.
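The retrieval step boils down to nearest-neighbor search over embeddings. A minimal pure-Python sketch, with toy 3-d vectors standing in for the fine-tuned model's embeddings of annotated example flows (the example names are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors standing in for embeddings of annotated example flows.
examples = {
    "tag-high-value-orders": [0.9, 0.1, 0.0],
    "notify-on-refund": [0.1, 0.8, 0.2],
    "restock-alert": [0.0, 0.2, 0.9],
}

# Embedding of the merchant's natural-language request.
query = [0.85, 0.15, 0.05]
best = max(examples, key=lambda name: cosine(query, examples[name]))
print(best)  # tag-high-value-orders
```

In the real app, Qdrant performs this search at scale; the principle is the same.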

    3. Orchestration

      We will use LangGraph to orchestrate the LLM and Embedding Model to generate the final Shopify Flow JSON object.

    4. Vector Database

      We will use Qdrant to store the Shopify Flow JSON objects and embeddings for quick retrieval and comparison.

    5. Monitoring

      I haven't thought about this yet as it has not been covered in class AFAIK. I will need to do some research and learn more about monitoring.

    6. Evaluation

      We will use RAGAS to evaluate the quality of the generated Shopify Flow JSON objects.

    7. User Interface

      We will use Chainlit to implement a simple chatbot interface where the merchant can describe the automation they need and answer clarifying questions from the agent. It will produce a downloadable file containing the generated Shopify Flow JSON object.

    8. (Optional) Serving & Inference

      I'm not sure yet, as I don't know much about hosting my own fine-tuned models for inference. I need to learn more and do some research.

  3. Where will you use an agent or agents? What will you use “agentic reasoning” for in your app?

    Because we are generating Shopify Flow JSON objects from natural language, we will use an agent to interact with the merchant to gather requirements and important considerations. The agent will use agentic reasoning to ask clarifying questions and ensure the generated Shopify Flow JSON object satisfies the merchant's automation request. We will need to fine-tune a powerful reasoning model (likely DeepSeek R1) to understand the complex requirements and translate them into a logical flow diagram.
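The requirement-gathering part of that agent can be thought of as slot filling: keep asking clarifying questions until every slot needed to generate a flow is filled. A sketch, where the slot names and question wording are illustrative placeholders:

```python
# Sketch of the agent's requirement-gathering loop. Slot names and question
# wording are illustrative placeholders, not a fixed schema.
REQUIRED_SLOTS = {
    "trigger": "What event should start the automation?",
    "condition": "Under what condition should it run?",
    "action": "What should happen when the condition is met?",
}

def next_question(slots):
    """Return the next clarifying question, or None once all slots are filled."""
    for name, question in REQUIRED_SLOTS.items():
        if not slots.get(name):
            return question
    return None  # ready to generate the Flow JSON object

# Merchant has only described the trigger so far; the agent asks about conditions.
print(next_question({"trigger": "order created"}))
```

In the actual app this loop would be a node in the LangGraph graph, with the LLM deciding which slot is still missing rather than a hard-coded dict.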

## Dealing with the Data

  1. Describe all of your data sources and external APIs, and describe what you’ll use them for.

    I don't believe I will need external APIs because I don't need real-time data. All of my data will come from documents. I will need a library of annotated Shopify Flow JSON objects to fine-tune the LLM and embedding model. I will also need to use these examples and synthetic data generation (SDG) to create additional examples for fine-tuning and evaluating the models.

  2. Describe the default chunking strategy that you will use. Why did you make this decision?

    I will start with a default chunking strategy of ~1000 tokens with a 500-token overlap. I will tweak these numbers as I test different scenarios, but it feels like a good place to start.
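Concretely, 1000-token chunks with a 500-token overlap means each new chunk starts 500 tokens after the previous one (stride = size − overlap). A minimal sketch over a token list:

```python
def chunk_tokens(tokens, size=1000, overlap=500):
    """Split a token list into chunks of `size` tokens, with `overlap` tokens
    shared between consecutive chunks (stride = size - overlap)."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

tokens = list(range(2500))  # stand-in for a tokenized documentation page
chunks = chunk_tokens(tokens)
print(len(chunks))   # 4
print(chunks[1][0])  # 500 -- second chunk starts halfway into the first
```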

  3. [Optional] Will you need specific data for any other part of your application? If so, explain.

    Yes I will need to use lots of examples of Shopify Flow JSON objects to fine-tune the LLM and Embedding Model. I will also need to use these examples and SDG to synthetically generate additional examples to fine-tune and evaluate the models.

## Building a Quick End-to-End Prototype

  1. Build an end-to-end prototype and deploy it to a Hugging Face Space (or other endpoint)

    The initial prototype just answers general questions about Shopify Flow and is trained on the publicly available documentation at https://shopify.dev/docs/flow. It is available here: https://huggingface.co/spaces/thomfoolery/AIE5-MidTerm. It is meant to satisfy the requirements of the midterm and is not very useful in its current state.

    I plan to work on getting it to generate Shopify Flow JSON objects from natural language next.

## Creating a Golden Test Data Set

  1. Assess your pipeline using the RAGAS framework including key metrics faithfulness, response relevance, context precision, and context recall. Provide a table of your output results.

    | Metric | Value |
    | --- | --- |
    | Context Recall | 0.8069444444 |
    | Faithfulness | 0.9318181818 |
    | Factual Correctness | 0.373 |
    | Answer Relevancy | 0.9581163434 |
    | Context Entity Recall | 0.3598076345 |
    | Noise Sensitivity (Relevant) | 0.3017243635 |

  2. What conclusions can you draw about the performance and effectiveness of your pipeline with this information?

    I found that my RAG application performs well except on Factual Correctness and Context Entity Recall. I'm not sure whether 0.3 is bad for Noise Sensitivity (Relevant), but it seems OK to me. Poor factual correctness is concerning; I wonder if fine-tuned embeddings will improve this metric. I will need to investigate further. I also wonder whether a smaller chunk size and overlap could help with context entity recall.

## Fine-Tuning Open-Source Embeddings

  1. Swap out your existing embedding model for the new fine-tuned version. Provide a link to your fine-tuned embedding model on the Hugging Face Hub.

    My fine-tuned embedding model is available here: https://huggingface.co/thomfoolery/AIE5-MidTerm-finetuned-embeddings

## Assessing Performance

  1. How does the performance compare to your original RAG application? Test the fine-tuned embedding model using the RAGAS frameworks to quantify any improvements.

    Provide results in a table.

    | Metric | Original Value | Fine-Tuned Value |
    | --- | --- | --- |
    | Context Recall | 0.8069444444 | 0.7104166667 |
    | Faithfulness | 0.9318181818 | 0.8851135006 |
    | Factual Correctness | 0.373 | 0.4325 |
    | Answer Relevancy | 0.9581163434 | 0.9614896608 |
    | Context Entity Recall | 0.3598076345 | 0.3135290494 |
    | Noise Sensitivity (Relevant) | 0.3017243635 | 0.2666666667 |

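To make the comparison easier to read, a small script can compute the per-metric change between the two RAGAS runs (values copied from the tables above, rounded to four decimals):

```python
# Per-metric change between the original and fine-tuned RAGAS runs,
# using the values from the tables above (rounded to 4 decimals).
original = {
    "context_recall": 0.8069, "faithfulness": 0.9318,
    "factual_correctness": 0.3730, "answer_relevancy": 0.9581,
    "context_entity_recall": 0.3598, "noise_sensitivity_relevant": 0.3017,
}
fine_tuned = {
    "context_recall": 0.7104, "faithfulness": 0.8851,
    "factual_correctness": 0.4325, "answer_relevancy": 0.9615,
    "context_entity_recall": 0.3135, "noise_sensitivity_relevant": 0.2667,
}

deltas = {m: round(fine_tuned[m] - original[m], 4) for m in original}

# Print metrics from largest regression to largest improvement.
for metric, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{metric:28s} {delta:+.4f}")
```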
  2. Articulate the changes that you expect to make to your app in the second half of the course. How will you improve your application?

    Context Entity Recall, Context Recall, and Faithfulness dropped significantly. Factual Correctness, Answer Relevancy, and Noise Sensitivity improved slightly. Overall I think this is a good start, but I need to try a few variations to see if I can improve the metrics across the board. I assume chunk size and overlap are having a negative impact, so I will try a few different values to see if I can improve the metrics.