Spaces:

Testys
/

thery.ai

Running

File size: 163,097 Bytes
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RAG in a straightforward way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "document_url = \"https://arxiv.org/pdf/2312.10997.pdf\"\n",
    "loader = PyPDFLoader(document_url)\n",
    "pages = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "Retrieval-Augmented Generation for Large\n",
      "Language Models: A Survey\n",
      "Yunfan Gaoa, Yun Xiongb, Xinyu \n"
     ]
    }
   ],
   "source": [
    "print(pages[0].page_content[0:100])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='1\\nRetrieval-Augmented Generation for Large\\nLanguage Models: A Survey\\nYunfan Gaoa, Yun Xiongb, Xinyu Gaob, Kangxiang Jiab, Jinliu Panb, Yuxi Bic, Yi Daia, Jiawei Suna, Meng\\nWangc, and Haofen Wanga,c\\naShanghai Research Institute for Intelligent Autonomous Systems, Tongji University\\nbShanghai Key Laboratory of Data Science, School of Computer Science, Fudan University'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='cCollege of Design and Innovation, Tongji University\\nAbstract —Large Language Models (LLMs) showcase impres-\\nsive capabilities but encounter challenges like hallucination,\\noutdated knowledge, and non-transparent, untraceable reasoning\\nprocesses. Retrieval-Augmented Generation (RAG) has emerged\\nas a promising solution by incorporating knowledge from external'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='databases. This enhances the accuracy and credibility of the\\ngeneration, particularly for knowledge-intensive tasks, and allows\\nfor continuous knowledge updates and integration of domain-\\nspecific information. RAG synergistically merges LLMs’ intrin-\\nsic knowledge with the vast, dynamic repositories of external\\ndatabases. This comprehensive review paper offers a detailed'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='examination of the progression of RAG paradigms, encompassing\\nthe Naive RAG, the Advanced RAG, and the Modular RAG.\\nIt meticulously scrutinizes the tripartite foundation of RAG\\nframeworks, which includes the retrieval, the generation and the\\naugmentation techniques. The paper highlights the state-of-the-\\nart technologies embedded in each of these critical components,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='providing a profound understanding of the advancements in RAG\\nsystems. Furthermore, this paper introduces up-to-date evalua-\\ntion framework and benchmark. At the end, this article delineates\\nthe challenges currently faced and points out prospective avenues\\nfor research and development1.\\nIndex Terms —Large language model, retrieval-augmented gen-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='eration, natural language processing, information retrieval\\nI. I NTRODUCTION\\nLARGE language models (LLMs) have achieved remark-\\nable success, though they still face significant limitations,\\nespecially in domain-specific or knowledge-intensive tasks [1],\\nnotably producing “hallucinations” [2] when handling queries\\nbeyond their training data or requiring current information. To'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='overcome challenges, Retrieval-Augmented Generation (RAG)\\nenhances LLMs by retrieving relevant document chunks from\\nexternal knowledge base through semantic similarity calcu-\\nlation. By referencing external knowledge, RAG effectively\\nreduces the problem of generating factually incorrect content.\\nIts integration into LLMs has resulted in widespread adoption,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='establishing RAG as a key technology in advancing chatbots\\nand enhancing the suitability of LLMs for real-world applica-\\ntions.\\nRAG technology has rapidly developed in recent years, and\\nthe technology tree summarizing related research is shown\\nCorresponding Author.Email:[email protected]\\n1Resources are available at https://github.com/Tongji-KGLLM/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='RAG-Surveyin Figure 1. The development trajectory of RAG in the era\\nof large models exhibits several distinct stage characteristics.\\nInitially, RAG’s inception coincided with the rise of the\\nTransformer architecture, focusing on enhancing language\\nmodels by incorporating additional knowledge through Pre-\\nTraining Models (PTM). This early stage was characterized'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='by foundational work aimed at refining pre-training techniques\\n[3]–[5].The subsequent arrival of ChatGPT [6] marked a\\npivotal moment, with LLM demonstrating powerful in context\\nlearning (ICL) capabilities. RAG research shifted towards\\nproviding better information for LLMs to answer more com-\\nplex and knowledge-intensive tasks during the inference stage,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='leading to rapid development in RAG studies. As research\\nprogressed, the enhancement of RAG was no longer limited\\nto the inference stage but began to incorporate more with LLM\\nfine-tuning techniques.\\nThe burgeoning field of RAG has experienced swift growth,\\nyet it has not been accompanied by a systematic synthesis that\\ncould clarify its broader trajectory. This survey endeavors to'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='fill this gap by mapping out the RAG process and charting\\nits evolution and anticipated future paths, with a focus on the\\nintegration of RAG within LLMs. This paper considers both\\ntechnical paradigms and research methods, summarizing three\\nmain research paradigms from over 100 RAG studies, and\\nanalyzing key technologies in the core stages of “Retrieval,”'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='“Generation,” and “Augmentation.” On the other hand, current\\nresearch tends to focus more on methods, lacking analysis and\\nsummarization of how to evaluate RAG. This paper compre-\\nhensively reviews the downstream tasks, datasets, benchmarks,\\nand evaluation methods applicable to RAG. Overall, this\\npaper sets out to meticulously compile and categorize the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='foundational technical concepts, historical progression, and\\nthe spectrum of RAG methodologies and applications that\\nhave emerged post-LLMs. It is designed to equip readers and\\nprofessionals with a detailed and structured understanding of\\nboth large models and RAG. It aims to illuminate the evolution\\nof retrieval augmentation techniques, assess the strengths and'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 0}, page_content='weaknesses of various approaches in their respective contexts,\\nand speculate on upcoming trends and innovations.\\nOur contributions are as follows:\\n•In this survey, we present a thorough and systematic\\nreview of the state-of-the-art RAG methods, delineating\\nits evolution through paradigms including naive RAG,arXiv:2312.10997v5  [cs.CL]  27 Mar 2024'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='2\\nFig. 1. Technology tree of RAG research. The stages of involving RAG mainly include pre-training, fine-tuning, and inference. With the emergence of LLMs,\\nresearch on RAG initially focused on leveraging the powerful in context learning abilities of LLMs, primarily concentrating on the inference stage. Subsequent'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='research has delved deeper, gradually integrating more with the fine-tuning of LLMs. Researchers have also been exploring ways to enhance language models\\nin the pre-training stage through retrieval-augmented techniques.\\nadvanced RAG, and modular RAG. This review contex-\\ntualizes the broader scope of RAG research within the\\nlandscape of LLMs.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='landscape of LLMs.\\n•We identify and discuss the central technologies integral\\nto the RAG process, specifically focusing on the aspects\\nof “Retrieval”, “Generation” and “Augmentation”, and\\ndelve into their synergies, elucidating how these com-\\nponents intricately collaborate to form a cohesive and\\neffective RAG framework.\\n•We have summarized the current assessment methods of'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='RAG, covering 26 tasks, nearly 50 datasets, outlining\\nthe evaluation objectives and metrics, as well as the\\ncurrent evaluation benchmarks and tools. Additionally,\\nwe anticipate future directions for RAG, emphasizing\\npotential enhancements to tackle current challenges.\\nThe paper unfolds as follows: Section II introduces the\\nmain concept and current paradigms of RAG. The following'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='three sections explore core components—“Retrieval”, “Gen-\\neration” and “Augmentation”, respectively. Section III focuses\\non optimization methods in retrieval,including indexing, query\\nand embedding optimization. Section IV concentrates on post-\\nretrieval process and LLM fine-tuning in generation. Section V\\nanalyzes the three augmentation processes. Section VI focuses'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='on RAG’s downstream tasks and evaluation system. Sec-\\ntion VII mainly discusses the challenges that RAG currentlyfaces and its future development directions. At last, the paper\\nconcludes in Section VIII.\\nII. O VERVIEW OF RAG\\nA typical application of RAG is illustrated in Figure 2.\\nHere, a user poses a question to ChatGPT about a recent,\\nwidely discussed news. Given ChatGPT’s reliance on pre-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='training data, it initially lacks the capacity to provide up-\\ndates on recent developments. RAG bridges this information\\ngap by sourcing and incorporating knowledge from external\\ndatabases. In this case, it gathers relevant news articles related\\nto the user’s query. These articles, combined with the original\\nquestion, form a comprehensive prompt that empowers LLMs'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='to generate a well-informed answer.\\nThe RAG research paradigm is continuously evolving, and\\nwe categorize it into three stages: Naive RAG, Advanced\\nRAG, and Modular RAG, as showed in Figure 3. Despite\\nRAG method are cost-effective and surpass the performance\\nof the native LLM, they also exhibit several limitations.\\nThe development of Advanced RAG and Modular RAG is'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 1}, page_content='a response to these specific shortcomings in Naive RAG.\\nA. Naive RAG\\nThe Naive RAG research paradigm represents the earli-\\nest methodology, which gained prominence shortly after the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='3\\nFig. 2. A representative instance of the RAG process applied to question answering. It mainly consists of 3 steps. 1) Indexing. Documents are split into chunks,\\nencoded into vectors, and stored in a vector database. 2) Retrieval. Retrieve the Top k chunks most relevant to the question based on semantic similarity. 3)'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='Generation. Input the original question and the retrieved chunks together into LLM to generate the final answer.\\nwidespread adoption of ChatGPT. The Naive RAG follows\\na traditional process that includes indexing, retrieval, and\\ngeneration, which is also characterized as a “Retrieve-Read”\\nframework [7].\\nIndexing starts with the cleaning and extraction of raw data'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='in diverse formats like PDF, HTML, Word, and Markdown,\\nwhich is then converted into a uniform plain text format. To\\naccommodate the context limitations of language models, text\\nis segmented into smaller, digestible chunks. Chunks are then\\nencoded into vector representations using an embedding model\\nand stored in vector database. This step is crucial for enabling'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='efficient similarity searches in the subsequent retrieval phase.\\nRetrieval . Upon receipt of a user query, the RAG system\\nemploys the same encoding model utilized during the indexing\\nphase to transform the query into a vector representation.\\nIt then computes the similarity scores between the query\\nvector and the vector of chunks within the indexed corpus.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='The system prioritizes and retrieves the top K chunks that\\ndemonstrate the greatest similarity to the query. These chunks\\nare subsequently used as the expanded context in prompt.\\nGeneration . The posed query and selected documents are\\nsynthesized into a coherent prompt to which a large language\\nmodel is tasked with formulating a response. The model’s'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='approach to answering may vary depending on task-specific\\ncriteria, allowing it to either draw upon its inherent parametric\\nknowledge or restrict its responses to the information con-\\ntained within the provided documents. In cases of ongoing\\ndialogues, any existing conversational history can be integrated\\ninto the prompt, enabling the model to engage in multi-turn'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='dialogue interactions effectively.\\nHowever, Naive RAG encounters notable drawbacks:Retrieval Challenges . The retrieval phase often struggles\\nwith precision and recall, leading to the selection of misaligned\\nor irrelevant chunks, and the missing of crucial information.\\nGeneration Difficulties . In generating responses, the model\\nmay face the issue of hallucination, where it produces con-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='tent not supported by the retrieved context. This phase can\\nalso suffer from irrelevance, toxicity, or bias in the outputs,\\ndetracting from the quality and reliability of the responses.\\nAugmentation Hurdles . Integrating retrieved information\\nwith the different task can be challenging, sometimes resulting\\nin disjointed or incoherent outputs. The process may also'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='encounter redundancy when similar information is retrieved\\nfrom multiple sources, leading to repetitive responses. Deter-\\nmining the significance and relevance of various passages and\\nensuring stylistic and tonal consistency add further complexity.\\nFacing complex issues, a single retrieval based on the original\\nquery may not suffice to acquire adequate context information.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='Moreover, there’s a concern that generation models might\\noverly rely on augmented information, leading to outputs that\\nsimply echo retrieved content without adding insightful or\\nsynthesized information.\\nB. Advanced RAG\\nAdvanced RAG introduces specific improvements to over-\\ncome the limitations of Naive RAG. Focusing on enhancing re-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 2}, page_content='trieval quality, it employs pre-retrieval and post-retrieval strate-\\ngies. To tackle the indexing issues, Advanced RAG refines\\nits indexing techniques through the use of a sliding window\\napproach, fine-grained segmentation, and the incorporation of\\nmetadata. Additionally, it incorporates several optimization\\nmethods to streamline the retrieval process [8].'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='4\\nFig. 3. Comparison between the three paradigms of RAG. (Left) Naive RAG mainly consists of three parts: indexing, retrieval and generation. (Middle)\\nAdvanced RAG proposes multiple optimization strategies around pre-retrieval and post-retrieval, with a process similar to the Naive RAG, still following a'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='chain-like structure. (Right) Modular RAG inherits and develops from the previous paradigm, showcasing greater flexibility overall. This is evident in the\\nintroduction of multiple specific functional modules and the replacement of existing modules. The overall process is not limited to sequential retrieval and\\ngeneration; it includes methods such as iterative and adaptive retrieval.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='Pre-retrieval process . In this stage, the primary focus is\\non optimizing the indexing structure and the original query.\\nThe goal of optimizing indexing is to enhance the quality of\\nthe content being indexed. This involves strategies: enhancing\\ndata granularity, optimizing index structures, adding metadata,\\nalignment optimization, and mixed retrieval. While the goal'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='of query optimization is to make the user’s original question\\nclearer and more suitable for the retrieval task. Common\\nmethods include query rewriting query transformation, query\\nexpansion and other techniques [7], [9]–[11].\\nPost-Retrieval Process . Once relevant context is retrieved,\\nit’s crucial to integrate it effectively with the query. The main'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='methods in post-retrieval process include rerank chunks and\\ncontext compressing. Re-ranking the retrieved information to\\nrelocate the most relevant content to the edges of the prompt is\\na key strategy. This concept has been implemented in frame-\\nworks such as LlamaIndex2, LangChain3, and HayStack [12].\\nFeeding all relevant documents directly into LLMs can lead'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='to information overload, diluting the focus on key details with\\nirrelevant content.To mitigate this, post-retrieval efforts con-\\ncentrate on selecting the essential information, emphasizing\\ncritical sections, and shortening the context to be processed.\\n2https://www.llamaindex.ai\\n3https://www.langchain.com/C. Modular RAG\\nThe modular RAG architecture advances beyond the for-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='mer two RAG paradigms, offering enhanced adaptability and\\nversatility. It incorporates diverse strategies for improving its\\ncomponents, such as adding a search module for similarity\\nsearches and refining the retriever through fine-tuning. Inno-\\nvations like restructured RAG modules [13] and rearranged\\nRAG pipelines [14] have been introduced to tackle specific'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='challenges. The shift towards a modular RAG approach is\\nbecoming prevalent, supporting both sequential processing and\\nintegrated end-to-end training across its components. Despite\\nits distinctiveness, Modular RAG builds upon the foundational\\nprinciples of Advanced and Naive RAG, illustrating a progres-\\nsion and refinement within the RAG family.\\n1) New Modules: The Modular RAG framework introduces'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='additional specialized components to enhance retrieval and\\nprocessing capabilities. The Search module adapts to spe-\\ncific scenarios, enabling direct searches across various data\\nsources like search engines, databases, and knowledge graphs,\\nusing LLM-generated code and query languages [15]. RAG-\\nFusion addresses traditional search limitations by employing'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 3}, page_content='a multi-query strategy that expands user queries into diverse\\nperspectives, utilizing parallel vector searches and intelligent\\nre-ranking to uncover both explicit and transformative knowl-\\nedge [16]. The Memory module leverages the LLM’s memory\\nto guide retrieval, creating an unbounded memory pool that'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='5\\naligns the text more closely with data distribution through iter-\\native self-enhancement [17], [18]. Routing in the RAG system\\nnavigates through diverse data sources, selecting the optimal\\npathway for a query, whether it involves summarization,\\nspecific database searches, or merging different information\\nstreams [19]. The Predict module aims to reduce redundancy'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='and noise by generating context directly through the LLM,\\nensuring relevance and accuracy [13]. Lastly, the Task Adapter\\nmodule tailors RAG to various downstream tasks, automating\\nprompt retrieval for zero-shot inputs and creating task-specific\\nretrievers through few-shot query generation [20], [21] .This\\ncomprehensive approach not only streamlines the retrieval pro-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='cess but also significantly improves the quality and relevance\\nof the information retrieved, catering to a wide array of tasks\\nand queries with enhanced precision and flexibility.\\n2) New Patterns: Modular RAG offers remarkable adapt-\\nability by allowing module substitution or reconfiguration\\nto address specific challenges. This goes beyond the fixed'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='structures of Naive and Advanced RAG, characterized by a\\nsimple “Retrieve” and “Read” mechanism. Moreover, Modular\\nRAG expands this flexibility by integrating new modules or\\nadjusting interaction flow among existing ones, enhancing its\\napplicability across different tasks.\\nInnovations such as the Rewrite-Retrieve-Read [7]model\\nleverage the LLM’s capabilities to refine retrieval queries'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='through a rewriting module and a LM-feedback mechanism\\nto update rewriting model., improving task performance.\\nSimilarly, approaches like Generate-Read [13] replace tradi-\\ntional retrieval with LLM-generated content, while Recite-\\nRead [22] emphasizes retrieval from model weights, enhanc-\\ning the model’s ability to handle knowledge-intensive tasks.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='Hybrid retrieval strategies integrate keyword, semantic, and\\nvector searches to cater to diverse queries. Additionally, em-\\nploying sub-queries and hypothetical document embeddings\\n(HyDE) [11] seeks to improve retrieval relevance by focusing\\non embedding similarities between generated answers and real\\ndocuments.\\nAdjustments in module arrangement and interaction, such'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='as the Demonstrate-Search-Predict (DSP) [23] framework\\nand the iterative Retrieve-Read-Retrieve-Read flow of ITER-\\nRETGEN [14], showcase the dynamic use of module out-\\nputs to bolster another module’s functionality, illustrating a\\nsophisticated understanding of enhancing module synergy.\\nThe flexible orchestration of Modular RAG Flow showcases'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='the benefits of adaptive retrieval through techniques such as\\nFLARE [24] and Self-RAG [25]. This approach transcends\\nthe fixed RAG retrieval process by evaluating the necessity\\nof retrieval based on different scenarios. Another benefit of\\na flexible architecture is that the RAG system can more\\neasily integrate with other technologies (such as fine-tuning'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='or reinforcement learning) [26]. For example, this can involve\\nfine-tuning the retriever for better retrieval results, fine-tuning\\nthe generator for more personalized outputs, or engaging in\\ncollaborative fine-tuning [27].\\nD. RAG vs Fine-tuning\\nThe augmentation of LLMs has attracted considerable atten-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='tion due to their growing prevalence. Among the optimizationmethods for LLMs, RAG is often compared with Fine-tuning\\n(FT) and prompt engineering. Each method has distinct charac-\\nteristics as illustrated in Figure 4. We used a quadrant chart to\\nillustrate the differences among three methods in two dimen-\\nsions: external knowledge requirements and model adaption'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='requirements. Prompt engineering leverages a model’s inherent\\ncapabilities with minimum necessity for external knowledge\\nand model adaption. RAG can be likened to providing a model\\nwith a tailored textbook for information retrieval, ideal for pre-\\ncise information retrieval tasks. In contrast, FT is comparable\\nto a student internalizing knowledge over time, suitable for'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='scenarios requiring replication of specific structures, styles, or\\nformats.\\nRAG excels in dynamic environments by offering real-\\ntime knowledge updates and effective utilization of external\\nknowledge sources with high interpretability. However, it\\ncomes with higher latency and ethical considerations regarding\\ndata retrieval. On the other hand, FT is more static, requiring'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='retraining for updates but enabling deep customization of the\\nmodel’s behavior and style. It demands significant compu-\\ntational resources for dataset preparation and training, and\\nwhile it can reduce hallucinations, it may face challenges with\\nunfamiliar data.\\nIn multiple evaluations of their performance on various\\nknowledge-intensive tasks across different topics, [28] re-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='vealed that while unsupervised fine-tuning shows some im-\\nprovement, RAG consistently outperforms it, for both exist-\\ning knowledge encountered during training and entirely new\\nknowledge. Additionally, it was found that LLMs struggle\\nto learn new factual information through unsupervised fine-\\ntuning. The choice between RAG and FT depends on the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='specific needs for data dynamics, customization, and com-\\nputational capabilities in the application context. RAG and\\nFT are not mutually exclusive and can complement each\\nother, enhancing a model’s capabilities at different levels.\\nIn some instances, their combined use may lead to optimal\\nperformance. The optimization process involving RAG and FT'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='may require multiple iterations to achieve satisfactory results.\\nIII. R ETRIEVAL\\nIn the context of RAG, it is crucial to efficiently retrieve\\nrelevant documents from the data source. There are several\\nkey issues involved, such as the retrieval source, retrieval\\ngranularity, pre-processing of the retrieval, and selection of\\nthe corresponding embedding model.\\nA. Retrieval Source'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='A. Retrieval Source\\nRAG relies on external knowledge to enhance LLMs, while\\nthe type of retrieval source and the granularity of retrieval\\nunits both affect the final generation results.\\n1) Data Structure: Initially, text is s the mainstream source\\nof retrieval. Subsequently, the retrieval source expanded to in-\\nclude semi-structured data (PDF) and structured data (Knowl-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 4}, page_content='edge Graph, KG) for enhancement. In addition to retrieving\\nfrom original external sources, there is also a growing trend in\\nrecent researches towards utilizing content generated by LLMs\\nthemselves for retrieval and enhancement purposes.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='6\\nTABLE I\\nSUMMARY OF RAG METHODS\\nMethod Retrieval SourceRetrieval\\nData TypeRetrieval\\nGranularityAugmentation\\nStageRetrieval\\nprocess\\nCoG [29] Wikipedia Text Phrase Pre-training Iterative\\nDenseX [30] FactoidWiki Text Proposition Inference Once\\nEAR [31] Dataset-base Text Sentence Tuning Once\\nUPRISE [20] Dataset-base Text Sentence Tuning Once\\nRAST [32] Dataset-base Text Sentence Tuning Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='Self-Mem [17] Dataset-base Text Sentence Tuning Iterative\\nFLARE [24] Search Engine,Wikipedia Text Sentence Tuning Adaptive\\nPGRA [33] Wikipedia Text Sentence Inference Once\\nFILCO [34] Wikipedia Text Sentence Inference Once\\nRADA [35] Dataset-base Text Sentence Inference Once\\nFilter-rerank [36] Synthesized dataset Text Sentence Inference Once\\nR-GQA [37] Dataset-base Text Sentence Pair Tuning Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='LLM-R [38] Dataset-base Text Sentence Pair Inference Iterative\\nTIGER [39] Dataset-base Text Item-base Pre-training Once\\nLM-Indexer [40] Dataset-base Text Item-base Tuning Once\\nBEQUE [9] Dataset-base Text Item-base Tuning Once\\nCT-RAG [41] Synthesized dataset Text Item-base Tuning Once\\nAtlas [42] Wikipedia, Common Crawl Text Chunk Pre-training Iterative'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='RA VEN [43] Wikipedia Text Chunk Pre-training Once\\nRETRO++ [44] Pre-training Corpus Text Chunk Pre-training Iterative\\nINSTRUCTRETRO [45] Pre-training corpus Text Chunk Pre-training Iterative\\nRRR [7] Search Engine Text Chunk Tuning Once\\nRA-e2e [46] Dataset-base Text Chunk Tuning Once\\nPROMPTAGATOR [21] BEIR Text Chunk Tuning Once\\nAAR [47] MSMARCO,Wikipedia Text Chunk Tuning Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='RA-DIT [27] Common Crawl,Wikipedia Text Chunk Tuning Once\\nRAG-Robust [48] Wikipedia Text Chunk Tuning Once\\nRA-Long-Form [49] Dataset-base Text Chunk Tuning Once\\nCoN [50] Wikipedia Text Chunk Tuning Once\\nSelf-RAG [25] Wikipedia Text Chunk Tuning Adaptive\\nBGM [26] Wikipedia Text Chunk Inference Once\\nCoQ [51] Wikipedia Text Chunk Inference Iterative'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='Token-Elimination [52] Wikipedia Text Chunk Inference Once\\nPaperQA [53] Arxiv,Online Database,PubMed Text Chunk Inference Iterative\\nNoiseRAG [54] FactoidWiki Text Chunk Inference Once\\nIAG [55] Search Engine,Wikipedia Text Chunk Inference Once\\nNoMIRACL [56] Wikipedia Text Chunk Inference Once\\nToC [57] Search Engine,Wikipedia Text Chunk Inference Recursive'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='SKR [58] Dataset-base,Wikipedia Text Chunk Inference Adaptive\\nITRG [59] Wikipedia Text Chunk Inference Iterative\\nRAG-LongContext [60] Dataset-base Text Chunk Inference Once\\nITER-RETGEN [14] Wikipedia Text Chunk Inference Iterative\\nIRCoT [61] Wikipedia Text Chunk Inference Recursive\\nLLM-Knowledge-Boundary [62] Wikipedia Text Chunk Inference Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='RAPTOR [63] Dataset-base Text Chunk Inference Recursive\\nRECITE [22] LLMs Text Chunk Inference Once\\nICRALM [64] Pile,Wikipedia Text Chunk Inference Iterative\\nRetrieve-and-Sample [65] Dataset-base Text Doc Tuning Once\\nZemi [66] C4 Text Doc Tuning Once\\nCRAG [67] Arxiv Text Doc Inference Once\\n1-PAGER [68] Wikipedia Text Doc Inference Iterative\\nPRCA [69] Dataset-base Text Doc Inference Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='QLM-Doc-ranking [70] Dataset-base Text Doc Inference Once\\nRecomp [71] Wikipedia Text Doc Inference Once\\nDSP [23] Wikipedia Text Doc Inference Iterative\\nRePLUG [72] Pile Text Doc Inference Once\\nARM-RAG [73] Dataset-base Text Doc Inference Iterative\\nGenRead [13] LLMs Text Doc Inference Iterative\\nUniMS-RAG [74] Dataset-base Text Multi Tuning Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='CREA-ICL [19] Dataset-base Crosslingual,Text Sentence Inference Once\\nPKG [75] LLM Tabular,Text Chunk Inference Once\\nSANTA [76] Dataset-base Code,Text Item Pre-training Once\\nSURGE [77] Freebase KG Sub-Graph Tuning Once\\nMK-ToD [78] Dataset-base KG Entity Tuning Once\\nDual-Feedback-ToD [79] Dataset-base KG Entity Sequence Tuning Once\\nKnowledGPT [15] Dataset-base KG Triplet Inference Muti-time'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 5}, page_content='FABULA [80] Dataset-base,Graph KG Entity Inference Once\\nHyKGE [81] CMeKG KG Entity Inference Once\\nKALMV [82] Wikipedia KG Triplet Inference Iterative\\nRoG [83] Freebase KG Triplet Inference Iterative\\nG-Retriever [84] Dataset-base TextGraph Sub-Graph Inference Once'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='7\\nFig. 4. RAG compared with other model optimization methods in the aspects of “External Knowledge Required” and “Model Adaption Required”. Prompt\\nEngineering requires low modifications to the model and external knowledge, focusing on harnessing the capabilities of LLMs themselves. Fine-tuning, on'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='the other hand, involves further training the model. In the early stages of RAG (Naive RAG), there is a low demand for model modifications. As research\\nprogresses, Modular RAG has become more integrated with fine-tuning techniques.\\nUnstructured Data , such as text, is the most widely used\\nretrieval source, which are mainly gathered from corpus. For'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='open-domain question-answering (ODQA) tasks, the primary\\nretrieval sources are Wikipedia Dump with the current major\\nversions including HotpotQA4(1st October , 2017), DPR5(20\\nDecember, 2018). In addition to encyclopedic data, common\\nunstructured data includes cross-lingual text [19] and domain-\\nspecific data (such as medical [67]and legal domains [29]).'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='Semi-structured data . typically refers to data that contains a\\ncombination of text and table information, such as PDF. Han-\\ndling semi-structured data poses challenges for conventional\\nRAG systems due to two main reasons. Firstly, text splitting\\nprocesses may inadvertently separate tables, leading to data\\ncorruption during retrieval. Secondly, incorporating tables into'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='the data can complicate semantic similarity searches. When\\ndealing with semi-structured data, one approach involves lever-\\naging the code capabilities of LLMs to execute Text-2-SQL\\nqueries on tables within databases, such as TableGPT [85].\\nAlternatively, tables can be transformed into text format for\\nfurther analysis using text-based methods [75]. However, both'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='of these methods are not optimal solutions, indicating substan-\\ntial research opportunities in this area.\\nStructured data , such as knowledge graphs (KGs) [86] ,\\nwhich are typically verified and can provide more precise in-\\nformation. KnowledGPT [15] generates KB search queries and\\nstores knowledge in a personalized base, enhancing the RAG'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='model’s knowledge richness. In response to the limitations of\\nLLMs in understanding and answering questions about textual\\ngraphs, G-Retriever [84] integrates Graph Neural Networks\\n4https://hotpotqa.github.io/wiki-readme.html\\n5https://github.com/facebookresearch/DPR(GNNs), LLMs and RAG, enhancing graph comprehension\\nand question-answering capabilities through soft prompting'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='of the LLM, and employs the Prize-Collecting Steiner Tree\\n(PCST) optimization problem for targeted graph retrieval. On\\nthe contrary, it requires additional effort to build, validate,\\nand maintain structured databases. On the contrary, it requires\\nadditional effort to build, validate, and maintain structured\\ndatabases.\\nLLMs-Generated Content. Addressing the limitations of'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='external auxiliary information in RAG, some research has\\nfocused on exploiting LLMs’ internal knowledge. SKR [58]\\nclassifies questions as known or unknown, applying retrieval\\nenhancement selectively. GenRead [13] replaces the retriever\\nwith an LLM generator, finding that LLM-generated contexts\\noften contain more accurate answers due to better alignment'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='with the pre-training objectives of causal language modeling.\\nSelfmem [17] iteratively creates an unbounded memory pool\\nwith a retrieval-enhanced generator, using a memory selec-\\ntor to choose outputs that serve as dual problems to the\\noriginal question, thus self-enhancing the generative model.\\nThese methodologies underscore the breadth of innovative'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='data source utilization in RAG, striving to improve model\\nperformance and task effectiveness.\\n2) Retrieval Granularity: Another important factor besides\\nthe data format of the retrieval source is the granularity of\\nthe retrieved data. Coarse-grained retrieval units theoretically\\ncan provide more relevant information for the problem, but'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 6}, page_content='they may also contain redundant content, which could distract\\nthe retriever and language models in downstream tasks [50],\\n[87]. On the other hand, fine-grained retrieval unit granularity\\nincreases the burden of retrieval and does not guarantee seman-\\ntic integrity and meeting the required knowledge. Choosing'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='8\\nthe appropriate retrieval granularity during inference can be\\na simple and effective strategy to improve the retrieval and\\ndownstream task performance of dense retrievers.\\nIn text, retrieval granularity ranges from fine to coarse,\\nincluding Token, Phrase, Sentence, Proposition, Chunks, Doc-\\nument. Among them, DenseX [30]proposed the concept of'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='using propositions as retrieval units. Propositions are defined\\nas atomic expressions in the text, each encapsulating a unique\\nfactual segment and presented in a concise, self-contained nat-\\nural language format. This approach aims to enhance retrieval\\nprecision and relevance. On the Knowledge Graph (KG),\\nretrieval granularity includes Entity, Triplet, and sub-Graph.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='The granularity of retrieval can also be adapted to downstream\\ntasks, such as retrieving Item IDs [40]in recommendation tasks\\nand Sentence pairs [38]. Detailed information is illustrated in\\nTable I.\\nB. Indexing Optimization\\nIn the Indexing phase, documents will be processed, seg-\\nmented, and transformed into Embeddings to be stored in a'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='vector database. The quality of index construction determines\\nwhether the correct context can be obtained in the retrieval\\nphase.\\n1) Chunking Strategy: The most common method is to split\\nthe document into chunks on a fixed number of tokens (e.g.,\\n100, 256, 512) [88]. Larger chunks can capture more context,\\nbut they also generate more noise, requiring longer processing'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='time and higher costs. While smaller chunks may not fully\\nconvey the necessary context, they do have less noise. How-\\never, chunks leads to truncation within sentences, prompting\\nthe optimization of a recursive splits and sliding window meth-\\nods, enabling layered retrieval by merging globally related\\ninformation across multiple retrieval processes [89]. Never-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='theless, these approaches still cannot strike a balance between\\nsemantic completeness and context length. Therefore, methods\\nlike Small2Big have been proposed, where sentences (small)\\nare used as the retrieval unit, and the preceding and following\\nsentences are provided as (big) context to LLMs [90].\\n2) Metadata Attachments: Chunks can be enriched with'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='metadata information such as page number, file name, au-\\nthor,category timestamp. Subsequently, retrieval can be filtered\\nbased on this metadata, limiting the scope of the retrieval.\\nAssigning different weights to document timestamps during\\nretrieval can achieve time-aware RAG, ensuring the freshness\\nof knowledge and avoiding outdated information.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='In addition to extracting metadata from the original doc-\\numents, metadata can also be artificially constructed. For\\nexample, adding summaries of paragraph, as well as intro-\\nducing hypothetical questions. This method is also known as\\nReverse HyDE. Specifically, using LLM to generate questions\\nthat can be answered by the document, then calculating the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='similarity between the original question and the hypothetical\\nquestion during retrieval to reduce the semantic gap between\\nthe question and the answer.\\n3) Structural Index: One effective method for enhancing\\ninformation retrieval is to establish a hierarchical structure for\\nthe documents. By constructing In structure, RAG system can'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='expedite the retrieval and processing of pertinent data.Hierarchical index structure . File are arranged in parent-\\nchild relationships, with chunks linked to them. Data sum-\\nmaries are stored at each node, aiding in the swift traversal\\nof data and assisting the RAG system in determining which\\nchunks to extract. This approach can also mitigate the illusion\\ncaused by block extraction issues.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='caused by block extraction issues.\\nKnowledge Graph index . Utilize KG in constructing the\\nhierarchical structure of documents contributes to maintaining\\nconsistency. It delineates the connections between different\\nconcepts and entities, markedly reducing the potential for\\nillusions. Another advantage is the transformation of the\\ninformation retrieval process into instructions that LLM can'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='comprehend, thereby enhancing the accuracy of knowledge\\nretrieval and enabling LLM to generate contextually coherent\\nresponses, thus improving the overall efficiency of the RAG\\nsystem. To capture the logical relationship between document\\ncontent and structure, KGP [91] proposed a method of building\\nan index between multiple documents using KG. This KG'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='consists of nodes (representing paragraphs or structures in the\\ndocuments, such as pages and tables) and edges (indicating\\nsemantic/lexical similarity between paragraphs or relationships\\nwithin the document structure), effectively addressing knowl-\\nedge retrieval and reasoning problems in a multi-document\\nenvironment.\\nC. Query Optimization\\nOne of the primary challenges with Naive RAG is its'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='direct reliance on the user’s original query as the basis for\\nretrieval. Formulating a precise and clear question is difficult,\\nand imprudent queries result in subpar retrieval effectiveness.\\nSometimes, the question itself is complex, and the language\\nis not well-organized. Another difficulty lies in language\\ncomplexity ambiguity. Language models often struggle when'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='dealing with specialized vocabulary or ambiguous abbrevi-\\nations with multiple meanings. For instance, they may not\\ndiscern whether “LLM” refers to large language model or a\\nMaster of Laws in a legal context.\\n1) Query Expansion: Expanding a single query into mul-\\ntiple queries enriches the content of the query, providing\\nfurther context to address any lack of specific nuances, thereby'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='ensuring the optimal relevance of the generated answers.\\nMulti-Query . By employing prompt engineering to expand\\nqueries via LLMs, these queries can then be executed in\\nparallel. The expansion of queries is not random, but rather\\nmeticulously designed.\\nSub-Query . The process of sub-question planning represents\\nthe generation of the necessary sub-questions to contextualize'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='and fully answer the original question when combined. This\\nprocess of adding relevant context is, in principle, similar\\nto query expansion. Specifically, a complex question can be\\ndecomposed into a series of simpler sub-questions using the\\nleast-to-most prompting method [92].\\nChain-of-Verification(CoVe) . The expanded queries undergo\\nvalidation by LLM to achieve the effect of reducing halluci-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 7}, page_content='nations. Validated expanded queries typically exhibit higher\\nreliability [93].'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='9\\n2) Query Transformation: The core concept is to retrieve\\nchunks based on a transformed query instead of the user’s\\noriginal query.\\nQuery Rewrite .The original queries are not always optimal\\nfor LLM retrieval, especially in real-world scenarios. There-\\nfore, we can prompt LLM to rewrite the queries. In addition to\\nusing LLM for query rewriting, specialized smaller language'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='models, such as RRR (Rewrite-retrieve-read) [7]. The imple-\\nmentation of the query rewrite method in the Taobao, known\\nas BEQUE [9] has notably enhanced recall effectiveness for\\nlong-tail queries, resulting in a rise in GMV .\\nAnother query transformation method is to use prompt\\nengineering to let LLM generate a query based on the original'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='query for subsequent retrieval. HyDE [11] construct hypothet-\\nical documents (assumed answers to the original query). It\\nfocuses on embedding similarity from answer to answer rather\\nthan seeking embedding similarity for the problem or query.\\nUsing the Step-back Prompting method [10], the original\\nquery is abstracted to generate a high-level concept question'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='(step-back question). In the RAG system, both the step-back\\nquestion and the original query are used for retrieval, and both\\nthe results are utilized as the basis for language model answer\\ngeneration.\\n3) Query Routing: Based on varying queries, routing to\\ndistinct RAG pipeline,which is suitable for a versatile RAG\\nsystem designed to accommodate diverse scenarios.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='Metadata Router/ Filter . The first step involves extracting\\nkeywords (entity) from the query, followed by filtering based\\non the keywords and metadata within the chunks to narrow\\ndown the search scope.\\nSemantic Router is another method of routing involves\\nleveraging the semantic information of the query. Specific\\napprach see Semantic Router6. Certainly, a hybrid routing'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='approach can also be employed, combining both semantic and\\nmetadata-based methods for enhanced query routing.\\nD. Embedding\\nIn RAG, retrieval is achieved by calculating the similarity\\n(e.g. cosine similarity) between the embeddings of the ques-\\ntion and document chunks, where the semantic representation\\ncapability of embedding models plays a key role. This mainly'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='includes a sparse encoder (BM25) and a dense retriever (BERT\\narchitecture Pre-training language models). Recent research\\nhas introduced prominent embedding models such as AngIE,\\nV oyage, BGE,etc [94]–[96], which are benefit from multi-task\\ninstruct tuning. Hugging Face’s MTEB leaderboard7evaluates\\nembedding models across 8 tasks, covering 58 datasests. Ad-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='ditionally, C-MTEB focuses on Chinese capability, covering\\n6 tasks and 35 datasets. There is no one-size-fits-all answer\\nto “which embedding model to use.” However, some specific\\nmodels are better suited for particular use cases.\\n1) Mix/hybrid Retrieval : Sparse and dense embedding\\napproaches capture different relevance features and can ben-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='efit from each other by leveraging complementary relevance\\ninformation. For instance, sparse retrieval models can be used\\n6https://github.com/aurelio-labs/semantic-router\\n7https://huggingface.co/spaces/mteb/leaderboardto provide initial search results for training dense retrieval\\nmodels. Additionally, pre-training language models (PLMs)\\ncan be utilized to learn term weights to enhance sparse'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='retrieval. Specifically, it also demonstrates that sparse retrieval\\nmodels can enhance the zero-shot retrieval capability of dense\\nretrieval models and assist dense retrievers in handling queries\\ncontaining rare entities, thereby improving robustness.\\n2) Fine-tuning Embedding Model: In instances where the\\ncontext significantly deviates from pre-training corpus, partic-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='ularly within highly specialized disciplines such as healthcare,\\nlegal practice, and other sectors replete with proprietary jargon,\\nfine-tuning the embedding model on your own domain dataset\\nbecomes essential to mitigate such discrepancies.\\nIn addition to supplementing domain knowledge, another\\npurpose of fine-tuning is to align the retriever and generator,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='for example, using the results of LLM as the supervision signal\\nfor fine-tuning, known as LSR (LM-supervised Retriever).\\nPROMPTAGATOR [21] utilizes the LLM as a few-shot query\\ngenerator to create task-specific retrievers, addressing chal-\\nlenges in supervised fine-tuning, particularly in data-scarce\\ndomains. Another approach, LLM-Embedder [97], exploits'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='LLMs to generate reward signals across multiple downstream\\ntasks. The retriever is fine-tuned with two types of supervised\\nsignals: hard labels for the dataset and soft rewards from\\nthe LLMs. This dual-signal approach fosters a more effective\\nfine-tuning process, tailoring the embedding model to diverse\\ndownstream applications. REPLUG [72] utilizes a retriever'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='and an LLM to calculate the probability distributions of the\\nretrieved documents and then performs supervised training\\nby computing the KL divergence. This straightforward and\\neffective training method enhances the performance of the\\nretrieval model by using an LM as the supervisory signal,\\neliminating the need for specific cross-attention mechanisms.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='Moreover, inspired by RLHF (Reinforcement Learning from\\nHuman Feedback), utilizing LM-based feedback to reinforce\\nthe retriever through reinforcement learning.\\nE. Adapter\\nFine-tuning models may present challenges, such as in-\\ntegrating functionality through an API or addressing con-\\nstraints arising from limited local computational resources.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='Consequently, some approaches opt to incorporate an external\\nadapter to aid in alignment.\\nTo optimize the multi-task capabilities of LLM, UP-\\nRISE [20] trained a lightweight prompt retriever that can\\nautomatically retrieve prompts from a pre-built prompt pool\\nthat are suitable for a given zero-shot task input. AAR\\n(Augmentation-Adapted Retriver) [47] introduces a universal'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='adapter designed to accommodate multiple downstream tasks.\\nWhile PRCA [69] add a pluggable reward-driven contextual\\nadapter to enhance performance on specific tasks. BGM [26]\\nkeeps the retriever and LLM fixed,and trains a bridge Seq2Seq\\nmodel in between. The bridge model aims to transform the\\nretrieved information into a format that LLMs can work with'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 8}, page_content='effectively, allowing it to not only rerank but also dynami-\\ncally select passages for each query, and potentially employ\\nmore advanced strategies like repetition. Furthermore, PKG'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='10\\nintroduces an innovative method for integrating knowledge\\ninto white-box models via directive fine-tuning [75]. In this\\napproach, the retriever module is directly substituted to gen-\\nerate relevant documents according to a query. This method\\nassists in addressing the difficulties encountered during the\\nfine-tuning process and enhances model performance.\\nIV. G ENERATION'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='IV. G ENERATION\\nAfter retrieval, it is not a good practice to directly input all\\nthe retrieved information to the LLM for answering questions.\\nFollowing will introduce adjustments from two perspectives:\\nadjusting the retrieved content and adjusting the LLM.\\nA. Context Curation\\nRedundant information can interfere with the final gener-\\nation of LLM, and overly long contexts can also lead LLM'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='to the “Lost in the middle” problem [98]. Like humans, LLM\\ntends to only focus on the beginning and end of long texts,\\nwhile forgetting the middle portion. Therefore, in the RAG\\nsystem, we typically need to further process the retrieved\\ncontent.\\n1) Reranking: Reranking fundamentally reorders document\\nchunks to highlight the most pertinent results first, effectively'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='reducing the overall document pool, severing a dual purpose\\nin information retrieval, acting as both an enhancer and a\\nfilter, delivering refined inputs for more precise language\\nmodel processing [70]. Reranking can be performed using\\nrule-based methods that depend on predefined metrics like\\nDiversity, Relevance, and MRR, or model-based approaches'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='like Encoder-Decoder models from the BERT series (e.g.,\\nSpanBERT), specialized reranking models such as Cohere\\nrerank or bge-raranker-large, and general large language mod-\\nels like GPT [12], [99].\\n2) Context Selection/Compression: A common misconcep-\\ntion in the RAG process is the belief that retrieving as many\\nrelevant documents as possible and concatenating them to form'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='a lengthy retrieval prompt is beneficial. However, excessive\\ncontext can introduce more noise, diminishing the LLM’s\\nperception of key information .\\n(Long) LLMLingua [100], [101] utilize small language\\nmodels (SLMs) such as GPT-2 Small or LLaMA-7B, to\\ndetect and remove unimportant tokens, transforming it into\\na form that is challenging for humans to comprehend but'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='well understood by LLMs. This approach presents a direct\\nand practical method for prompt compression, eliminating the\\nneed for additional training of LLMs while balancing language\\nintegrity and compression ratio. PRCA tackled this issue by\\ntraining an information extractor [69]. Similarly, RECOMP\\nadopts a comparable approach by training an information'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='condenser using contrastive learning [71]. Each training data\\npoint consists of one positive sample and five negative sam-\\nples, and the encoder undergoes training using contrastive loss\\nthroughout this process [102] .\\nIn addition to compressing the context, reducing the num-\\nber of documents aslo helps improve the accuracy of the\\nmodel’s answers. Ma et al. [103] propose the “Filter-Reranker”'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='paradigm, which combines the strengths of LLMs and SLMs.In this paradigm, SLMs serve as filters, while LLMs function\\nas reordering agents. The research shows that instructing\\nLLMs to rearrange challenging samples identified by SLMs\\nleads to significant improvements in various Information\\nExtraction (IE) tasks. Another straightforward and effective'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='approach involves having the LLM evaluate the retrieved\\ncontent before generating the final answer. This allows the\\nLLM to filter out documents with poor relevance through LLM\\ncritique. For instance, in Chatlaw [104], the LLM is prompted\\nto self-suggestion on the referenced legal provisions to assess\\ntheir relevance.\\nB. LLM Fine-tuning\\nTargeted fine-tuning based on the scenario and data char-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='acteristics on LLMs can yield better results. This is also one\\nof the greatest advantages of using on-premise LLMs. When\\nLLMs lack data in a specific domain, additional knowledge can\\nbe provided to the LLM through fine-tuning. Huggingface’s\\nfine-tuning data can also be used as an initial step.\\nAnother benefit of fine-tuning is the ability to adjust the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='model’s input and output. For example, it can enable LLM to\\nadapt to specific data formats and generate responses in a par-\\nticular style as instructed [37]. For retrieval tasks that engage\\nwith structured data, the SANTA framework [76] implements\\na tripartite training regimen to effectively encapsulate both\\nstructural and semantic nuances. The initial phase focuses on'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='the retriever, where contrastive learning is harnessed to refine\\nthe query and document embeddings.\\nAligning LLM outputs with human or retriever preferences\\nthrough reinforcement learning is a potential approach. For\\ninstance, manually annotating the final generated answers\\nand then providing feedback through reinforcement learning.\\nIn addition to aligning with human preferences, it is also'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='possible to align with the preferences of fine-tuned models\\nand retrievers [79]. When circumstances prevent access to\\npowerful proprietary models or larger parameter open-source\\nmodels, a simple and effective method is to distill the more\\npowerful models(e.g. GPT-4). Fine-tuning of LLM can also\\nbe coordinated with fine-tuning of the retriever to align pref-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='erences. A typical approach, such as RA-DIT [27], aligns the\\nscoring functions between Retriever and Generator using KL\\ndivergence.\\nV. A UGMENTATION PROCESS IN RAG\\nIn the domain of RAG, the standard practice often involves\\na singular (once) retrieval step followed by generation, which\\ncan lead to inefficiencies and sometimes is typically insuffi-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='cient for complex problems demanding multi-step reasoning,\\nas it provides a limited scope of information [105]. Many\\nstudies have optimized the retrieval process in response to this\\nissue, and we have summarised them in Figure 5.\\nA. Iterative Retrieval\\nIterative retrieval is a process where the knowledge base\\nis repeatedly searched based on the initial query and the text'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 9}, page_content='generated so far, providing a more comprehensive knowledge'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='11\\nFig. 5. In addition to the most common once retrieval, RAG also includes three types of retrieval augmentation processes. (left) Iterative retrieval involves\\nalternating between retrieval and generation, allowing for richer and more targeted context from the knowledge base at each step. (Middle) Recursive retrieval'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='involves gradually refining the user query and breaking down the problem into sub-problems, then continuously solving complex problems through retrieval\\nand generation. (Right) Adaptive retrieval focuses on enabling the RAG system to autonomously determine whether external knowledge retrieval is necessary'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='and when to stop retrieval and generation, often utilizing LLM-generated special tokens for control.\\nbase for LLMs. This approach has been shown to enhance\\nthe robustness of subsequent answer generation by offering\\nadditional contextual references through multiple retrieval\\niterations. However, it may be affected by semantic discon-\\ntinuity and the accumulation of irrelevant information. ITER-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='RETGEN [14] employs a synergistic approach that lever-\\nages “retrieval-enhanced generation” alongside “generation-\\nenhanced retrieval” for tasks that necessitate the reproduction\\nof specific information. The model harnesses the content\\nrequired to address the input task as a contextual basis for\\nretrieving pertinent knowledge, which in turn facilitates the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='generation of improved responses in subsequent iterations.\\nB. Recursive Retrieval\\nRecursive retrieval is often used in information retrieval and\\nNLP to improve the depth and relevance of search results.\\nThe process involves iteratively refining search queries based\\non the results obtained from previous searches. Recursive\\nRetrieval aims to enhance the search experience by gradu-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='ally converging on the most pertinent information through a\\nfeedback loop. IRCoT [61] uses chain-of-thought to guide\\nthe retrieval process and refines the CoT with the obtained\\nretrieval results. ToC [57] creates a clarification tree that\\nsystematically optimizes the ambiguous parts in the Query. It\\ncan be particularly useful in complex search scenarios where'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='the user’s needs are not entirely clear from the outset or where\\nthe information sought is highly specialized or nuanced. The\\nrecursive nature of the process allows for continuous learning\\nand adaptation to the user’s requirements, often resulting in\\nimproved satisfaction with the search outcomes.\\nTo address specific data scenarios, recursive retrieval and'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='multi-hop retrieval techniques are utilized together. Recursiveretrieval involves a structured index to process and retrieve\\ndata in a hierarchical manner, which may include summarizing\\nsections of a document or lengthy PDF before performing a\\nretrieval based on this summary. Subsequently, a secondary\\nretrieval within the document refines the search, embodying'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='the recursive nature of the process. In contrast, multi-hop\\nretrieval is designed to delve deeper into graph-structured data\\nsources, extracting interconnected information [106].\\nC. Adaptive Retrieval\\nAdaptive retrieval methods, exemplified by Flare [24] and\\nSelf-RAG [25], refine the RAG framework by enabling LLMs\\nto actively determine the optimal moments and content for'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='retrieval, thus enhancing the efficiency and relevance of the\\ninformation sourced.\\nThese methods are part of a broader trend wherein\\nLLMs employ active judgment in their operations, as seen\\nin model agents like AutoGPT, Toolformer, and Graph-\\nToolformer [107]–[109]. Graph-Toolformer, for instance, di-\\nvides its retrieval process into distinct steps where LLMs'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='proactively use retrievers, apply Self-Ask techniques, and em-\\nploy few-shot prompts to initiate search queries. This proactive\\nstance allows LLMs to decide when to search for necessary\\ninformation, akin to how an agent utilizes tools.\\nWebGPT [110] integrates a reinforcement learning frame-\\nwork to train the GPT-3 model in autonomously using a'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 10}, page_content='search engine during text generation. It navigates this process\\nusing special tokens that facilitate actions such as search\\nengine queries, browsing results, and citing references, thereby\\nexpanding GPT-3’s capabilities through the use of external\\nsearch engines. Flare automates timing retrieval by monitoring\\nthe confidence of the generation process, as indicated by the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='12\\nprobability of generated terms [24]. When the probability falls\\nbelow a certain threshold would activates the retrieval system\\nto collect relevant information, thus optimizing the retrieval\\ncycle. Self-RAG [25] introduces “reflection tokens” that allow\\nthe model to introspect its outputs. These tokens come in\\ntwo varieties: “retrieve” and “critic”. The model autonomously'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='decides when to activate retrieval, or alternatively, a predefined\\nthreshold may trigger the process. During retrieval, the gen-\\nerator conducts a fragment-level beam search across multiple\\nparagraphs to derive the most coherent sequence. Critic scores\\nare used to update the subdivision scores, with the flexibility\\nto adjust these weights during inference, tailoring the model’s'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='behavior. Self-RAG’s design obviates the need for additional\\nclassifiers or reliance on Natural Language Inference (NLI)\\nmodels, thus streamlining the decision-making process for\\nwhen to engage retrieval mechanisms and improving the\\nmodel’s autonomous judgment capabilities in generating ac-\\ncurate responses.\\nVI. T ASK AND EVALUATION\\nThe rapid advancement and growing adoption of RAG'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='in the field of NLP have propelled the evaluation of RAG\\nmodels to the forefront of research in the LLMs community.\\nThe primary objective of this evaluation is to comprehend\\nand optimize the performance of RAG models across diverse\\napplication scenarios.This chapter will mainly introduce the\\nmain downstream tasks of RAG, datasets, and how to evaluate\\nRAG systems.\\nA. Downstream Task'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='RAG systems.\\nA. Downstream Task\\nThe core task of RAG remains Question Answering (QA),\\nincluding traditional single-hop/multi-hop QA, multiple-\\nchoice, domain-specific QA as well as long-form scenarios\\nsuitable for RAG. In addition to QA, RAG is continuously\\nbeing expanded into multiple downstream tasks, such as Infor-\\nmation Extraction (IE), dialogue generation, code search, etc.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='The main downstream tasks of RAG and their corresponding\\ndatasets are summarized in Table II.\\nB. Evaluation Target\\nHistorically, RAG models assessments have centered on\\ntheir execution in specific downstream tasks. These evaluations\\nemploy established metrics suitable to the tasks at hand. For\\ninstance, question answering evaluations might rely on EM'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='and F1 scores [7], [45], [59], [72], whereas fact-checking\\ntasks often hinge on Accuracy as the primary metric [4],\\n[14], [42]. BLEU and ROUGE metrics are also commonly\\nused to evaluate answer quality [26], [32], [52], [78]. Tools\\nlike RALLE, designed for the automatic evaluation of RAG\\napplications, similarly base their assessments on these task-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='specific metrics [160]. Despite this, there is a notable paucity\\nof research dedicated to evaluating the distinct characteristics\\nof RAG models.The main evaluation objectives include:\\nRetrieval Quality . Evaluating the retrieval quality is crucial\\nfor determining the effectiveness of the context sourced by'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='the retriever component. Standard metrics from the domainsof search engines, recommendation systems, and information\\nretrieval systems are employed to measure the performance of\\nthe RAG retrieval module. Metrics such as Hit Rate, MRR, and\\nNDCG are commonly utilized for this purpose [161], [162].\\nGeneration Quality . The assessment of generation quality'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='centers on the generator’s capacity to synthesize coherent and\\nrelevant answers from the retrieved context. This evaluation\\ncan be categorized based on the content’s objectives: unlabeled\\nand labeled content. For unlabeled content, the evaluation\\nencompasses the faithfulness, relevance, and non-harmfulness\\nof the generated answers. In contrast, for labeled content,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='the focus is on the accuracy of the information produced by\\nthe model [161]. Additionally, both retrieval and generation\\nquality assessments can be conducted through manual or\\nautomatic evaluation methods [29], [161], [163].\\nC. Evaluation Aspects\\nContemporary evaluation practices of RAG models empha-\\nsize three primary quality scores and four essential abilities,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='which collectively inform the evaluation of the two principal\\ntargets of the RAG model: retrieval and generation.\\n1) Quality Scores: Quality scores include context rele-\\nvance, answer faithfulness, and answer relevance. These qual-\\nity scores evaluate the efficiency of the RAG model from\\ndifferent perspectives in the process of information retrieval\\nand generation [164]–[166].'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='and generation [164]–[166].\\nContext Relevance evaluates the precision and specificity\\nof the retrieved context, ensuring relevance and minimizing\\nprocessing costs associated with extraneous content.\\nAnswer Faithfulness ensures that the generated answers\\nremain true to the retrieved context, maintaining consistency\\nand avoiding contradictions.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='and avoiding contradictions.\\nAnswer Relevance requires that the generated answers are\\ndirectly pertinent to the posed questions, effectively addressing\\nthe core inquiry.\\n2) Required Abilities: RAG evaluation also encompasses\\nfour abilities indicative of its adaptability and efficiency:\\nnoise robustness, negative rejection, information integration,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='and counterfactual robustness [167], [168]. These abilities are\\ncritical for the model’s performance under various challenges\\nand complex scenarios, impacting the quality scores.\\nNoise Robustness appraises the model’s capability to man-\\nage noise documents that are question-related but lack sub-\\nstantive information.\\nNegative Rejection assesses the model’s discernment in'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='refraining from responding when the retrieved documents do\\nnot contain the necessary knowledge to answer a question.\\nInformation Integration evaluates the model’s proficiency in\\nsynthesizing information from multiple documents to address\\ncomplex questions.\\nCounterfactual Robustness tests the model’s ability to rec-\\nognize and disregard known inaccuracies within documents,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 11}, page_content='even when instructed about potential misinformation.\\nContext relevance and noise robustness are important for\\nevaluating the quality of retrieval, while answer faithfulness,\\nanswer relevance, negative rejection, information integration,\\nand counterfactual robustness are important for evaluating the\\nquality of generation.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='13\\nTABLE II\\nDOWNSTREAM TASKS AND DATASETS OF RAG\\nTask Sub Task Dataset Method\\nQA Single-hop Natural Qustion(NQ) [111][26], [30], [34], [42], [45], [50], [52], [59], [64], [82]\\n[3], [4], [22], [27], [40], [43], [54], [62], [71], [112]\\n[20], [44], [72]\\nTriviaQA(TQA) [113][13], [30], [34], [45], [50], [64]\\n[4], [27], [59], [62], [112]\\n[22], [25], [43], [44], [71], [72]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='[22], [25], [43], [44], [71], [72]\\nSQuAD [114] [20], [23], [30], [32], [45], [69], [112]\\nWeb Questions(WebQ) [115] [3], [4], [13], [30], [50], [68]\\nPopQA [116] [7], [25], [67]\\nMS MARCO [117] [4], [40], [52]\\nMulti-hop HotpotQA [118][23], [26], [31], [34], [47], [51], [61], [82]\\n[7], [14], [22], [27], [59], [62], [69], [71], [91]\\n2WikiMultiHopQA [119] [14], [24], [48], [59], [61], [91]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='MuSiQue [120] [14], [51], [61], [91]\\nLong-form QA ELI5 [121] [27], [34], [43], [49], [51]\\nNarrativeQA(NQA) [122] [45], [60], [63], [123]\\nASQA [124] [24], [57]\\nQMSum(QM) [125] [60], [123]\\nDomain QA Qasper [126] [60], [63]\\nCOVID-QA [127] [35], [46]\\nCMB [128],MMCU Medical [129] [81]\\nMulti-Choice QA QuALITY [130] [60], [63]\\nARC [131] [25], [67]\\nCommonsenseQA [132] [58], [66]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='CommonsenseQA [132] [58], [66]\\nGraph QA GraphQA [84] [84]\\nDialog Dialog Generation Wizard of Wikipedia (WoW) [133] [13], [27], [34], [42]\\nPersonal Dialog KBP [134] [74], [135]\\nDuleMon [136] [74]\\nTask-oriented Dialog CamRest [137] [78], [79]\\nRecommendation Amazon(Toys,Sport,Beauty) [138] [39], [40]\\nIE Event Argument Extraction WikiEvent [139] [13], [27], [37], [42]\\nRAMS [140] [36], [37]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='RAMS [140] [36], [37]\\nRelation Extraction T-REx [141],ZsRE [142] [27], [51]\\nReasoning Commonsense Reasoning HellaSwag [143] [20], [66]\\nCoT Reasoning CoT Reasoning [144] [27]\\nComplex Reasoning CSQA [145] [55]\\nOthers Language Understanding MMLU [146] [7], [27], [28], [42], [43], [47], [72]\\nLanguage Modeling WikiText-103 [147] [5], [29], [64], [71]\\nStrategyQA [148] [14], [24], [48], [51], [55], [58]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='Fact Checking/Verification FEVER [149] [4], [13], [27], [34], [42], [50]\\nPubHealth [150] [25], [67]\\nText Generation Biography [151] [67]\\nText Summarization WikiASP [152] [24]\\nXSum [153] [17]\\nText Classification VioLens [154] [19]\\nTREC [155] [33]\\nSentiment SST-2 [156] [20], [33], [38]\\nCode Search CodeSearchNet [157] [76]\\nRobustness Evaluation NoMIRACL [56] [56]\\nMath GSM8K [158] [73]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 12}, page_content='Math GSM8K [158] [73]\\nMachine Translation JRC-Acquis [159] [17]'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='14\\nTABLE III\\nSUMMARY OF METRICS APPLICABLE FOR EVALUATION ASPECTS OF RAG\\nContext\\nRelevanceFaithfulnessAnswer\\nRelevanceNoise\\nRobustnessNegative\\nRejectionInformation\\nIntegrationCounterfactual\\nRobustness\\nAccuracy ✓ ✓ ✓ ✓ ✓ ✓ ✓\\nEM ✓\\nRecall ✓\\nPrecision ✓ ✓\\nR-Rate ✓\\nCosine Similarity ✓\\nHit Rate ✓\\nMRR ✓\\nNDCG ✓\\nBLEU ✓ ✓ ✓\\nROUGE/ROUGE-L ✓ ✓ ✓\\nThe specific metrics for each evaluation aspect are sum-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='marized in Table III. It is essential to recognize that these\\nmetrics, derived from related work, are traditional measures\\nand do not yet represent a mature or standardized approach for\\nquantifying RAG evaluation aspects. Custom metrics tailored\\nto the nuances of RAG models, though not included here, have\\nalso been developed in some evaluation studies.\\nD. Evaluation Benchmarks and Tools'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='D. Evaluation Benchmarks and Tools\\nA series of benchmark tests and tools have been proposed\\nto facilitate the evaluation of RAG.These instruments furnish\\nquantitative metrics that not only gauge RAG model perfor-\\nmance but also enhance comprehension of the model’s capabil-\\nities across various evaluation aspects. Prominent benchmarks\\nsuch as RGB, RECALL and CRUD [167]–[169] focus on'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='appraising the essential abilities of RAG models. Concur-\\nrently, state-of-the-art automated tools like RAGAS [164],\\nARES [165], and TruLens8employ LLMs to adjudicate the\\nquality scores. These tools and benchmarks collectively form\\na robust framework for the systematic evaluation of RAG\\nmodels, as summarized in Table IV.\\nVII. D ISCUSSION AND FUTURE PROSPECTS'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='VII. D ISCUSSION AND FUTURE PROSPECTS\\nDespite the considerable progress in RAG technology, sev-\\neral challenges persist that warrant in-depth research.This\\nchapter will mainly introduce the current challenges and future\\nresearch directions faced by RAG.\\nA. RAG vs Long Context\\nWith the deepening of related research, the context of LLMs\\nis continuously expanding [170]–[172]. Presently, LLMs can'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='effortlessly manage contexts exceeding 200,000 tokens9. This\\ncapability signifies that long-document question answering,\\npreviously reliant on RAG, can now incorporate the entire\\ndocument directly into the prompt. This has also sparked\\ndiscussions on whether RAG is still necessary when LLMs\\n8https://www.trulens.org/trulens eval/core concepts ragtriad/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='9https://kimi.moonshot.cnare not constrained by context. In fact, RAG still plays an\\nirreplaceable role. On one hand, providing LLMs with a\\nlarge amount of context at once will significantly impact its\\ninference speed, while chunked retrieval and on-demand input\\ncan significantly improve operational efficiency. On the other\\nhand, RAG-based generation can quickly locate the original'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='references for LLMs to help users verify the generated an-\\nswers. The entire retrieval and reasoning process is observable,\\nwhile generation solely relying on long context remains a\\nblack box. Conversely, the expansion of context provides new\\nopportunities for the development of RAG, enabling it to\\naddress more complex problems and integrative or summary'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='questions that require reading a large amount of material to\\nanswer [49]. Developing new RAG methods in the context of\\nsuper-long contexts is one of the future research trends.\\nB. RAG Robustness\\nThe presence of noise or contradictory information during\\nretrieval can detrimentally affect RAG’s output quality. This\\nsituation is figuratively referred to as “Misinformation can'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='be worse than no information at all”. Improving RAG’s\\nresistance to such adversarial or counterfactual inputs is gain-\\ning research momentum and has become a key performance\\nmetric [48], [50], [82]. Cuconasu et al. [54] analyze which\\ntype of documents should be retrieved, evaluate the relevance\\nof the documents to the prompt, their position, and the'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='number included in the context. The research findings reveal\\nthat including irrelevant documents can unexpectedly increase\\naccuracy by over 30%, contradicting the initial assumption\\nof reduced quality. These results underscore the importance\\nof developing specialized strategies to integrate retrieval with\\nlanguage generation models, highlighting the need for further'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 13}, page_content='research and exploration into the robustness of RAG.\\nC. Hybrid Approaches\\nCombining RAG with fine-tuning is emerging as a leading\\nstrategy. Determining the optimal integration of RAG and\\nfine-tuning whether sequential, alternating, or through end-to-\\nend joint training—and how to harness both parameterized'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='15\\nTABLE IV\\nSUMMARY OF EVALUATION FRAMEWORKS\\nEvaluation Framework Evaluation Targets Evaluation Aspects Quantitative Metrics\\nRGB† Retrieval Quality\\nGeneration QualityNoise Robustness\\nNegative Rejection\\nInformation Integration\\nCounterfactual RobustnessAccuracy\\nEM\\nAccuracy\\nAccuracy\\nRECALL†Generation Quality Counterfactual Robustness R-Rate (Reappearance Rate)\\nRAGAS‡ Retrieval Quality'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='RAGAS‡ Retrieval Quality\\nGeneration QualityContext Relevance\\nFaithfulness\\nAnswer Relevance*\\n*\\nCosine Similarity\\nARES‡ Retrieval Quality\\nGeneration QualityContext Relevance\\nFaithfulness\\nAnswer RelevanceAccuracy\\nAccuracy\\nAccuracy\\nTruLens‡ Retrieval Quality\\nGeneration QualityContext Relevance\\nFaithfulness\\nAnswer Relevance*\\n*\\n*\\nCRUD† Retrieval Quality\\nGeneration QualityCreative Generation'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='Generation QualityCreative Generation\\nKnowledge-intensive QA\\nError Correction\\nSummarizationBLEU\\nROUGE-L\\nBertScore\\nRAGQuestEval\\n† represents a benchmark, and ‡ represents a tool. * denotes customized quantitative metrics, which deviate from traditional\\nmetrics. Readers are encouraged to consult pertinent literature for the specific quantification formulas associated with these'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='metrics, as required.\\nand non-parameterized advantages are areas ripe for explo-\\nration [27]. Another trend is to introduce SLMs with specific\\nfunctionalities into RAG and fine-tuned by the results of RAG\\nsystem. For example, CRAG [67] trains a lightweight retrieval\\nevaluator to assess the overall quality of the retrieved docu-\\nments for a query and triggers different knowledge retrieval'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='actions based on confidence levels.\\nD. Scaling laws of RAG\\nEnd-to-end RAG models and pre-trained models based\\non RAG are still one of the focuses of current re-\\nsearchers [173].The parameters of these models are one of\\nthe key factors.While scaling laws [174] are established for\\nLLMs, their applicability to RAG remains uncertain. Initial'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='studies like RETRO++ [44] have begun to address this, yet the\\nparameter count in RAG models still lags behind that of LLMs.\\nThe possibility of an Inverse Scaling Law10, where smaller\\nmodels outperform larger ones, is particularly intriguing and\\nmerits further investigation.\\nE. Production-Ready RAG\\nRAG’s practicality and alignment with engineering require-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='ments have facilitated its adoption. However, enhancing re-\\ntrieval efficiency, improving document recall in large knowl-\\nedge bases, and ensuring data security—such as preventing\\n10https://github.com/inverse-scaling/prizeinadvertent disclosure of document sources or metadata by\\nLLMs—are critical engineering challenges that remain to be\\naddressed [175].'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='addressed [175].\\nThe development of the RAG ecosystem is greatly impacted\\nby the progression of its technical stack. Key tools like\\nLangChain and LLamaIndex have quickly gained popularity\\nwith the emergence of ChatGPT, providing extensive RAG-\\nrelated APIs and becoming essential in the realm of LLMs.The\\nemerging technology stack, while not as rich in features as'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='LangChain and LLamaIndex, stands out through its specialized\\nproducts. For example, Flowise AI prioritizes a low-code\\napproach, allowing users to deploy AI applications, including\\nRAG, through a user-friendly drag-and-drop interface. Other\\ntechnologies like HayStack, Meltano, and Cohere Coral are\\nalso gaining attention for their unique contributions to the field.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='In addition to AI-focused vendors, traditional software and\\ncloud service providers are expanding their offerings to include\\nRAG-centric services. Weaviate’s Verba11is designed for\\npersonal assistant applications, while Amazon’s Kendra12\\noffers intelligent enterprise search services, enabling users to\\nbrowse various content repositories using built-in connectors.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 14}, page_content='In the development of RAG technology, there is a clear\\ntrend towards different specialization directions, such as: 1)\\nCustomization - tailoring RAG to meet specific requirements.\\n2) Simplification - making RAG easier to use to reduce the\\n11https://github.com/weaviate/Verba\\n12https://aws.amazon.com/cn/kendra/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='16\\nFig. 6. Summary of RAG ecosystem\\ninitial learning curve. 3) Specialization - optimizing RAG to\\nbetter serve production environments.\\nThe mutual growth of RAG models and their technology\\nstacks is evident; technological advancements continuously\\nestablish new standards for existing infrastructure. In turn,\\nenhancements to the technology stack drive the development'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='of RAG capabilities. RAG toolkits are converging into a\\nfoundational technology stack, laying the groundwork for\\nadvanced enterprise applications. However, a fully integrated,\\ncomprehensive platform concept is still in the future, requiring\\nfurther innovation and development.\\nF . Multi-modal RAG\\nRAG has transcended its initial text-based question-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='answering confines, embracing a diverse array of modal data.\\nThis expansion has spawned innovative multimodal models\\nthat integrate RAG concepts across various domains:\\nImage . RA-CM3 [176] stands as a pioneering multimodal\\nmodel of both retrieving and generating text and images.\\nBLIP-2 [177] leverages frozen image encoders alongside\\nLLMs for efficient visual language pre-training, enabling zero-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='shot image-to-text conversions. The “Visualize Before You\\nWrite” method [178] employs image generation to steer the\\nLM’s text generation, showing promise in open-ended text\\ngeneration tasks.\\nAudio and Video . The GSS method retrieves and stitches\\ntogether audio clips to convert machine-translated data into\\nspeech-translated data [179]. UEOP marks a significant ad-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='vancement in end-to-end automatic speech recognition by\\nincorporating external, offline strategies for voice-to-text con-\\nversion [180]. Additionally, KNN-based attention fusion lever-\\nages audio embeddings and semantically related text embed-\\ndings to refine ASR, thereby accelerating domain adaptation.Vid2Seq augments language models with specialized temporal'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='markers, facilitating the prediction of event boundaries and\\ntextual descriptions within a unified output sequence [181].\\nCode . RBPS [182] excels in small-scale learning tasks by\\nretrieving code examples that align with developers’ objectives\\nthrough encoding and frequency analysis. This approach has\\ndemonstrated efficacy in tasks such as test assertion genera-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='tion and program repair. For structured knowledge, the CoK\\nmethod [106] first extracts facts pertinent to the input query\\nfrom a knowledge graph, then integrates these facts as hints\\nwithin the input, enhancing performance in knowledge graph\\nquestion-answering tasks.\\nVIII. C ONCLUSION\\nThe summary of this paper, as depicted in Figure 6, empha-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='sizes RAG’s significant advancement in enhancing the capa-\\nbilities of LLMs by integrating parameterized knowledge from\\nlanguage models with extensive non-parameterized data from\\nexternal knowledge bases. The survey showcases the evolution\\nof RAG technologies and their application on many different\\ntasks. The analysis outlines three developmental paradigms'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='within the RAG framework: Naive, Advanced, and Modu-\\nlar RAG, each representing a progressive enhancement over\\nits predecessors. RAG’s technical integration with other AI\\nmethodologies, such as fine-tuning and reinforcement learning,\\nhas further expanded its capabilities. Despite the progress in\\nRAG technology, there are research opportunities to improve'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 15}, page_content='its robustness and its ability to handle extended contexts.\\nRAG’s application scope is expanding into multimodal do-\\nmains, adapting its principles to interpret and process diverse\\ndata forms like images, videos, and code. This expansion high-\\nlights RAG’s significant practical implications for AI deploy-\\nment, attracting interest from academic and industrial sectors.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='17\\nThe growing ecosystem of RAG is evidenced by the rise in\\nRAG-centric AI applications and the continuous development\\nof supportive tools. As RAG’s application landscape broadens,\\nthere is a need to refine evaluation methodologies to keep\\npace with its evolution. Ensuring accurate and representative\\nperformance assessments is crucial for fully capturing RAG’s'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='contributions to the AI research and development community.\\nREFERENCES\\n[1] N. Kandpal, H. Deng, A. Roberts, E. Wallace, and C. Raffel, “Large\\nlanguage models struggle to learn long-tail knowledge,” in Interna-\\ntional Conference on Machine Learning . PMLR, 2023, pp. 15 696–\\n15 707.\\n[2] Y . Zhang, Y . Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='Y . Zhang, Y . Chen et al. , “Siren’s song in the ai ocean: A survey on hal-\\nlucination in large language models,” arXiv preprint arXiv:2309.01219 ,\\n2023.\\n[3] D. Arora, A. Kini, S. R. Chowdhury, N. Natarajan, G. Sinha, and\\nA. Sharma, “Gar-meets-rag paradigm for zero-shot information re-\\ntrieval,” arXiv preprint arXiv:2310.20158 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='[4] P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal,\\nH. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel et al. , “Retrieval-\\naugmented generation for knowledge-intensive nlp tasks,” Advances in\\nNeural Information Processing Systems , vol. 33, pp. 9459–9474, 2020.\\n[5] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Milli-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='can, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark et al. ,\\n“Improving language models by retrieving from trillions of tokens,”\\ninInternational conference on machine learning . PMLR, 2022, pp.\\n2206–2240.\\n[6] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin,\\nC. Zhang, S. Agarwal, K. Slama, A. Ray et al. , “Training language'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='models to follow instructions with human feedback,” Advances in\\nneural information processing systems , vol. 35, pp. 27 730–27 744,\\n2022.\\n[7] X. Ma, Y . Gong, P. He, H. Zhao, and N. Duan, “Query rewrit-\\ning for retrieval-augmented large language models,” arXiv preprint\\narXiv:2305.14283 , 2023.\\n[8] I. ILIN, “Advanced rag techniques: an il-\\nlustrated overview,” https://pub.towardsai.net/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='advanced-rag-techniques-an-illustrated-overview-04d193d8fec6,\\n2023.\\n[9] W. Peng, G. Li, Y . Jiang, Z. Wang, D. Ou, X. Zeng, E. Chen et al. ,\\n“Large language model based long-tail query rewriting in taobao\\nsearch,” arXiv preprint arXiv:2311.03758 , 2023.\\n[10] H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V . Le,\\nand D. Zhou, “Take a step back: Evoking reasoning via abstraction in'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='large language models,” arXiv preprint arXiv:2310.06117 , 2023.\\n[11] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-shot dense retrieval\\nwithout relevance labels,” arXiv preprint arXiv:2212.10496 , 2022.\\n[12] V . Blagojevi, “Enhancing rag pipelines in haystack: Introducing diver-\\nsityranker and lostinthemiddleranker,” https://towardsdatascience.com/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='enhancing-rag-pipelines-in-haystack-45f14e2bc9f5, 2023.\\n[13] W. Yu, D. Iter, S. Wang, Y . Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng,\\nand M. Jiang, “Generate rather than retrieve: Large language models\\nare strong context generators,” arXiv preprint arXiv:2209.10063 , 2022.\\n[14] Z. Shao, Y . Gong, Y . Shen, M. Huang, N. Duan, and W. Chen,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='“Enhancing retrieval-augmented large language models with iterative\\nretrieval-generation synergy,” arXiv preprint arXiv:2305.15294 , 2023.\\n[15] X. Wang, Q. Yang, Y . Qiu, J. Liang, Q. He, Z. Gu, Y . Xiao,\\nand W. Wang, “Knowledgpt: Enhancing large language models with\\nretrieval and storage access on knowledge bases,” arXiv preprint\\narXiv:2308.11761 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='arXiv:2308.11761 , 2023.\\n[16] A. H. Raudaschl, “Forget rag, the future\\nis rag-fusion,” https://towardsdatascience.com/\\nforget-rag-the-future-is-rag-fusion-1147298d8ad1, 2023.\\n[17] X. Cheng, D. Luo, X. Chen, L. Liu, D. Zhao, and R. Yan, “Lift\\nyourself up: Retrieval-augmented text generation with self memory,”\\narXiv preprint arXiv:2305.02437 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='arXiv preprint arXiv:2305.02437 , 2023.\\n[18] S. Wang, Y . Xu, Y . Fang, Y . Liu, S. Sun, R. Xu, C. Zhu, and\\nM. Zeng, “Training data is more valuable than you think: A simple\\nand effective method by retrieving from training data,” arXiv preprint\\narXiv:2203.08773 , 2022.[19] X. Li, E. Nie, and S. Liang, “From classification to generation:'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='Insights into crosslingual retrieval augmented icl,” arXiv preprint\\narXiv:2311.06595 , 2023.\\n[20] D. Cheng, S. Huang, J. Bi, Y . Zhan, J. Liu, Y . Wang, H. Sun,\\nF. Wei, D. Deng, and Q. Zhang, “Uprise: Universal prompt retrieval\\nfor improving zero-shot evaluation,” arXiv preprint arXiv:2303.08518 ,\\n2023.\\n[21] Z. Dai, V . Y . Zhao, J. Ma, Y . Luan, J. Ni, J. Lu, A. Bakalov, K. Guu,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='K. B. Hall, and M.-W. Chang, “Promptagator: Few-shot dense retrieval\\nfrom 8 examples,” arXiv preprint arXiv:2209.11755 , 2022.\\n[22] Z. Sun, X. Wang, Y . Tay, Y . Yang, and D. Zhou, “Recitation-augmented\\nlanguage models,” arXiv preprint arXiv:2210.01296 , 2022.\\n[23] O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts,\\nand M. Zaharia, “Demonstrate-search-predict: Composing retrieval'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='and language models for knowledge-intensive nlp,” arXiv preprint\\narXiv:2212.14024 , 2022.\\n[24] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y . Yang,\\nJ. Callan, and G. Neubig, “Active retrieval augmented generation,”\\narXiv preprint arXiv:2305.06983 , 2023.\\n[25] A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag:'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='Learning to retrieve, generate, and critique through self-reflection,”\\narXiv preprint arXiv:2310.11511 , 2023.\\n[26] Z. Ke, W. Kong, C. Li, M. Zhang, Q. Mei, and M. Bendersky,\\n“Bridging the preference gap between retrievers and llms,” arXiv\\npreprint arXiv:2401.06954 , 2024.\\n[27] X. V . Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Ro-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='driguez, J. Kahn, G. Szilvasy, M. Lewis et al. , “Ra-dit: Retrieval-\\naugmented dual instruction tuning,” arXiv preprint arXiv:2310.01352 ,\\n2023.\\n[28] O. Ovadia, M. Brief, M. Mishaeli, and O. Elisha, “Fine-tuning or\\nretrieval? comparing knowledge injection in llms,” arXiv preprint\\narXiv:2312.05934 , 2023.\\n[29] T. Lan, D. Cai, Y . Wang, H. Huang, and X.-L. Mao, “Copy is all'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='you need,” in The Eleventh International Conference on Learning\\nRepresentations , 2022.\\n[30] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao, D. Yu, and\\nH. Zhang, “Dense x retrieval: What retrieval granularity should we\\nuse?” arXiv preprint arXiv:2312.06648 , 2023.\\n[31] F. Luo and M. Surdeanu, “Divide & conquer for entailment-aware'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='multi-hop evidence retrieval,” arXiv preprint arXiv:2311.02616 , 2023.\\n[32] Q. Gou, Z. Xia, B. Yu, H. Yu, F. Huang, Y . Li, and N. Cam-Tu,\\n“Diversify question generation with retrieval-augmented style transfer,”\\narXiv preprint arXiv:2310.14503 , 2023.\\n[33] Z. Guo, S. Cheng, Y . Wang, P. Li, and Y . Liu, “Prompt-guided re-\\ntrieval augmentation for non-knowledge-intensive tasks,” arXiv preprint'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='arXiv:2305.17653 , 2023.\\n[34] Z. Wang, J. Araki, Z. Jiang, M. R. Parvez, and G. Neubig, “Learning\\nto filter context for retrieval-augmented generation,” arXiv preprint\\narXiv:2311.08377 , 2023.\\n[35] M. Seo, J. Baek, J. Thorne, and S. J. Hwang, “Retrieval-augmented\\ndata augmentation for low-resource domain tasks,” arXiv preprint\\narXiv:2402.13482 , 2024.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='arXiv:2402.13482 , 2024.\\n[36] Y . Ma, Y . Cao, Y . Hong, and A. Sun, “Large language model is not\\na good few-shot information extractor, but a good reranker for hard\\nsamples!” arXiv preprint arXiv:2303.08559 , 2023.\\n[37] X. Du and H. Ji, “Retrieval-augmented generative question answering\\nfor event argument extraction,” arXiv preprint arXiv:2211.07067 , 2022.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='[38] L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context\\nexamples for large language models,” arXiv preprint arXiv:2307.07164 ,\\n2023.\\n[39] S. Rajput, N. Mehta, A. Singh, R. H. Keshavan, T. Vu, L. Heldt,\\nL. Hong, Y . Tay, V . Q. Tran, J. Samost et al. , “Recommender systems\\nwith generative retrieval,” arXiv preprint arXiv:2305.05065 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='[40] B. Jin, H. Zeng, G. Wang, X. Chen, T. Wei, R. Li, Z. Wang, Z. Li,\\nY . Li, H. Lu et al. , “Language models as semantic indexers,” arXiv\\npreprint arXiv:2310.07815 , 2023.\\n[41] R. Anantha, T. Bethi, D. V odianik, and S. Chappidi, “Context tuning\\nfor retrieval augmented generation,” arXiv preprint arXiv:2312.05708 ,\\n2023.\\n[42] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 16}, page_content='J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Few-shot\\nlearning with retrieval augmented language models,” arXiv preprint\\narXiv:2208.03299 , 2022.\\n[43] J. Huang, W. Ping, P. Xu, M. Shoeybi, K. C.-C. Chang, and B. Catan-\\nzaro, “Raven: In-context learning with retrieval augmented encoder-\\ndecoder language models,” arXiv preprint arXiv:2308.07922 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='18\\n[44] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y . Dong,\\nO. Kuchaiev, B. Li, C. Xiao et al. , “Shall we pretrain autoregressive\\nlanguage models with retrieval? a comprehensive study,” arXiv preprint\\narXiv:2304.06762 , 2023.\\n[45] B. Wang, W. Ping, L. McAfee, P. Xu, B. Li, M. Shoeybi, and B. Catan-\\nzaro, “Instructretro: Instruction tuning post retrieval-augmented pre-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='training,” arXiv preprint arXiv:2310.07713 , 2023.\\n[46] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana,\\nand S. Nanayakkara, “Improving the domain adaptation of retrieval\\naugmented generation (rag) models for open domain question answer-\\ning,” Transactions of the Association for Computational Linguistics ,\\nvol. 11, pp. 1–17, 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='vol. 11, pp. 1–17, 2023.\\n[47] Z. Yu, C. Xiong, S. Yu, and Z. Liu, “Augmentation-adapted retriever\\nimproves generalization of language models as generic plug-in,” arXiv\\npreprint arXiv:2305.17331 , 2023.\\n[48] O. Yoran, T. Wolfson, O. Ram, and J. Berant, “Making retrieval-\\naugmented language models robust to irrelevant context,” arXiv\\npreprint arXiv:2310.01558 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='preprint arXiv:2310.01558 , 2023.\\n[49] H.-T. Chen, F. Xu, S. A. Arora, and E. Choi, “Understanding re-\\ntrieval augmentation for long-form question answering,” arXiv preprint\\narXiv:2310.12150 , 2023.\\n[50] W. Yu, H. Zhang, X. Pan, K. Ma, H. Wang, and D. Yu, “Chain-of-note:\\nEnhancing robustness in retrieval-augmented language models,” arXiv\\npreprint arXiv:2311.09210 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='preprint arXiv:2311.09210 , 2023.\\n[51] S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua, “Search-in-the-\\nchain: Towards accurate, credible and traceable large language models\\nfor knowledgeintensive tasks,” CoRR, vol. abs/2304.14732 , 2023.\\n[52] M. Berchansky, P. Izsak, A. Caciularu, I. Dagan, and M. Wasserblat,\\n“Optimizing retrieval-augmented reader models via token elimination,”'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv preprint arXiv:2310.13682 , 2023.\\n[53] J. L ´ala, O. O’Donoghue, A. Shtedritski, S. Cox, S. G. Rodriques,\\nand A. D. White, “Paperqa: Retrieval-augmented generative agent for\\nscientific research,” arXiv preprint arXiv:2312.07559 , 2023.\\n[54] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano,\\nY . Maarek, N. Tonellotto, and F. Silvestri, “The power of noise:'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='Redefining retrieval for rag systems,” arXiv preprint arXiv:2401.14887 ,\\n2024.\\n[55] Z. Zhang, X. Zhang, Y . Ren, S. Shi, M. Han, Y . Wu, R. Lai, and\\nZ. Cao, “Iag: Induction-augmented generation framework for answer-\\ning reasoning questions,” in Proceedings of the 2023 Conference on\\nEmpirical Methods in Natural Language Processing , 2023, pp. 1–14.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='[56] N. Thakur, L. Bonifacio, X. Zhang, O. Ogundepo, E. Kamalloo,\\nD. Alfonso-Hermelo, X. Li, Q. Liu, B. Chen, M. Rezagholizadeh et al. ,\\n“Nomiracl: Knowing when you don’t know for robust multilingual\\nretrieval-augmented generation,” arXiv preprint arXiv:2312.11361 ,\\n2023.\\n[57] G. Kim, S. Kim, B. Jeon, J. Park, and J. Kang, “Tree of clarifica-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='tions: Answering ambiguous questions with retrieval-augmented large\\nlanguage models,” arXiv preprint arXiv:2310.14696 , 2023.\\n[58] Y . Wang, P. Li, M. Sun, and Y . Liu, “Self-knowledge guided\\nretrieval augmentation for large language models,” arXiv preprint\\narXiv:2310.05002 , 2023.\\n[59] Z. Feng, X. Feng, D. Zhao, M. Yang, and B. Qin, “Retrieval-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='generation synergy augmented large language models,” arXiv preprint\\narXiv:2310.05149 , 2023.\\n[60] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian,\\nE. Bakhturina, M. Shoeybi, and B. Catanzaro, “Retrieval meets long\\ncontext large language models,” arXiv preprint arXiv:2310.03025 ,\\n2023.\\n[61] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Interleav-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='ing retrieval with chain-of-thought reasoning for knowledge-intensive\\nmulti-step questions,” arXiv preprint arXiv:2212.10509 , 2022.\\n[62] R. Ren, Y . Wang, Y . Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J.-\\nR. Wen, and H. Wang, “Investigating the factual knowledge boundary\\nof large language models with retrieval augmentation,” arXiv preprint\\narXiv:2307.11019 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv:2307.11019 , 2023.\\n[63] P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D.\\nManning, “Raptor: Recursive abstractive processing for tree-organized\\nretrieval,” arXiv preprint arXiv:2401.18059 , 2024.\\n[64] O. Ram, Y . Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-\\nBrown, and Y . Shoham, “In-context retrieval-augmented language'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='models,” arXiv preprint arXiv:2302.00083 , 2023.\\n[65] Y . Ren, Y . Cao, P. Guo, F. Fang, W. Ma, and Z. Lin, “Retrieve-and-\\nsample: Document-level event argument extraction via hybrid retrieval\\naugmentation,” in Proceedings of the 61st Annual Meeting of the\\nAssociation for Computational Linguistics (Volume 1: Long Papers) ,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='2023, pp. 293–306.[66] Z. Wang, X. Pan, D. Yu, D. Yu, J. Chen, and H. Ji, “Zemi: Learning\\nzero-shot semi-parametric language models from multiple tasks,” arXiv\\npreprint arXiv:2210.00185 , 2022.\\n[67] S.-Q. Yan, J.-C. Gu, Y . Zhu, and Z.-H. Ling, “Corrective retrieval\\naugmented generation,” arXiv preprint arXiv:2401.15884 , 2024.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='[68] P. Jain, L. B. Soares, and T. Kwiatkowski, “1-pager: One pass answer\\ngeneration and evidence retrieval,” arXiv preprint arXiv:2310.16568 ,\\n2023.\\n[69] H. Yang, Z. Li, Y . Zhang, J. Wang, N. Cheng, M. Li, and J. Xiao, “Prca:\\nFitting black-box large language models for retrieval question answer-\\ning via pluggable reward-driven contextual adapter,” arXiv preprint\\narXiv:2310.18347 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv:2310.18347 , 2023.\\n[70] S. Zhuang, B. Liu, B. Koopman, and G. Zuccon, “Open-source large\\nlanguage models are strong zero-shot query likelihood models for\\ndocument ranking,” arXiv preprint arXiv:2310.13243 , 2023.\\n[71] F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval-augmented\\nlms with compression and selective augmentation,” arXiv preprint\\narXiv:2310.04408 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv:2310.04408 , 2023.\\n[72] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettle-\\nmoyer, and W.-t. Yih, “Replug: Retrieval-augmented black-box lan-\\nguage models,” arXiv preprint arXiv:2301.12652 , 2023.\\n[73] E. Melz, “Enhancing llm intelligence with arm-rag: Auxiliary ra-\\ntionale memory for retrieval augmented generation,” arXiv preprint\\narXiv:2311.04177 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv:2311.04177 , 2023.\\n[74] H. Wang, W. Huang, Y . Deng, R. Wang, Z. Wang, Y . Wang, F. Mi,\\nJ. Z. Pan, and K.-F. Wong, “Unims-rag: A unified multi-source\\nretrieval-augmented generation for personalized dialogue systems,”\\narXiv preprint arXiv:2401.13256 , 2024.\\n[75] Z. Luo, C. Xu, P. Zhao, X. Geng, C. Tao, J. Ma, Q. Lin, and D. Jiang,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='“Augmented large language models with parametric knowledge guid-\\ning,” arXiv preprint arXiv:2305.04757 , 2023.\\n[76] X. Li, Z. Liu, C. Xiong, S. Yu, Y . Gu, Z. Liu, and G. Yu, “Structure-\\naware language model pretraining improves dense retrieval on struc-\\ntured data,” arXiv preprint arXiv:2305.19912 , 2023.\\n[77] M. Kang, J. M. Kwak, J. Baek, and S. J. Hwang, “Knowledge'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='graph-augmented language models for knowledge-grounded dialogue\\ngeneration,” arXiv preprint arXiv:2305.18846 , 2023.\\n[78] W. Shen, Y . Gao, C. Huang, F. Wan, X. Quan, and W. Bi, “Retrieval-\\ngeneration alignment for end-to-end task-oriented dialogue system,”\\narXiv preprint arXiv:2310.08877 , 2023.\\n[79] T. Shi, L. Li, Z. Lin, T. Yang, X. Quan, and Q. Wang, “Dual-feedback'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='knowledge retrieval for task-oriented dialogue systems,” arXiv preprint\\narXiv:2310.14528 , 2023.\\n[80] P. Ranade and A. Joshi, “Fabula: Intelligence report generation\\nusing retrieval-augmented narrative construction,” arXiv preprint\\narXiv:2310.13848 , 2023.\\n[81] X. Jiang, R. Zhang, Y . Xu, R. Qiu, Y . Fang, Z. Wang, J. Tang,\\nH. Ding, X. Chu, J. Zhao et al. , “Think and retrieval: A hypothesis'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='knowledge graph enhanced medical large language models,” arXiv\\npreprint arXiv:2312.15883 , 2023.\\n[82] J. Baek, S. Jeong, M. Kang, J. C. Park, and S. J. Hwang,\\n“Knowledge-augmented language model verification,” arXiv preprint\\narXiv:2310.12836 , 2023.\\n[83] L. Luo, Y .-F. Li, G. Haffari, and S. Pan, “Reasoning on graphs: Faithful\\nand interpretable large language model reasoning,” arXiv preprint'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='arXiv:2310.01061 , 2023.\\n[84] X. He, Y . Tian, Y . Sun, N. V . Chawla, T. Laurent, Y . LeCun,\\nX. Bresson, and B. Hooi, “G-retriever: Retrieval-augmented generation\\nfor textual graph understanding and question answering,” arXiv preprint\\narXiv:2402.07630 , 2024.\\n[85] L. Zha, J. Zhou, L. Li, R. Wang, Q. Huang, S. Yang, J. Yuan, C. Su,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='X. Li, A. Su et al. , “Tablegpt: Towards unifying tables, nature language\\nand commands into one gpt,” arXiv preprint arXiv:2307.08674 , 2023.\\n[86] M. Gaur, K. Gunaratna, V . Srinivasan, and H. Jin, “Iseeq: Information\\nseeking question generation using dynamic meta-information retrieval\\nand knowledge graphs,” in Proceedings of the AAAI Conference on'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='Artificial Intelligence , vol. 36, no. 10, 2022, pp. 10 672–10 680.\\n[87] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Sch ¨arli,\\nand D. Zhou, “Large language models can be easily distracted by\\nirrelevant context,” in International Conference on Machine Learning .\\nPMLR, 2023, pp. 31 210–31 227.\\n[88] R. Teja, “Evaluating the ideal chunk size for a rag'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 17}, page_content='system using llamaindex,” https://www.llamaindex.ai/blog/\\nevaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5,\\n2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='19\\n[89] Langchain, “Recursively split by character,” https://python.langchain.\\ncom/docs/modules/data connection/document transformers/recursive\\ntext splitter, 2023.\\n[90] S. Yang, “Advanced rag 01: Small-to-\\nbig retrieval,” https://towardsdatascience.com/\\nadvanced-rag-01-small-to-big-retrieval-172181b396d4, 2023.\\n[91] Y . Wang, N. Lipka, R. A. Rossi, A. Siu, R. Zhang, and T. Derr,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='“Knowledge graph prompting for multi-document question answering,”\\narXiv preprint arXiv:2308.11730 , 2023.\\n[92] D. Zhou, N. Sch ¨arli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schu-\\nurmans, C. Cui, O. Bousquet, Q. Le et al. , “Least-to-most prompting\\nenables complex reasoning in large language models,” arXiv preprint\\narXiv:2205.10625 , 2022.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='arXiv:2205.10625 , 2022.\\n[93] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz,\\nand J. Weston, “Chain-of-verification reduces hallucination in large\\nlanguage models,” arXiv preprint arXiv:2309.11495 , 2023.\\n[94] X. Li and J. Li, “Angle-optimized text embeddings,” arXiv preprint\\narXiv:2309.12871 , 2023.\\n[95] V oyageAI, “V oyage’s embedding models,” https://docs.voyageai.com/'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='embeddings/, 2023.\\n[96] BAAI, “Flagembedding,” https://github.com/FlagOpen/\\nFlagEmbedding, 2023.\\n[97] P. Zhang, S. Xiao, Z. Liu, Z. Dou, and J.-Y . Nie, “Retrieve anything\\nto augment large language models,” arXiv preprint arXiv:2310.07554 ,\\n2023.\\n[98] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni,\\nand P. Liang, “Lost in the middle: How language models use long'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='contexts,” arXiv preprint arXiv:2307.03172 , 2023.\\n[99] Y . Gao, T. Sheng, Y . Xiang, Y . Xiong, H. Wang, and J. Zhang, “Chat-\\nrec: Towards interactive and explainable llms-augmented recommender\\nsystem,” arXiv preprint arXiv:2303.14524 , 2023.\\n[100] N. Anderson, C. Wilson, and S. D. Richardson, “Lingua: Addressing\\nscenarios for live interpretation and automatic dubbing,” in Proceedings'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='of the 15th Biennial Conference of the Association for Machine\\nTranslation in the Americas (Volume 2: Users and Providers Track\\nand Government Track) , J. Campbell, S. Larocca, J. Marciano,\\nK. Savenkov, and A. Yanishevsky, Eds. Orlando, USA: Association\\nfor Machine Translation in the Americas, Sep. 2022, pp. 202–209.\\n[Online]. Available: https://aclanthology.org/2022.amta-upg.14'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='[101] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y . Lin, Y . Yang, and L. Qiu,\\n“Longllmlingua: Accelerating and enhancing llms in long context\\nscenarios via prompt compression,” arXiv preprint arXiv:2310.06839 ,\\n2023.\\n[102] V . Karpukhin, B. O ˘guz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen,\\nand W.-t. Yih, “Dense passage retrieval for open-domain question'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='answering,” arXiv preprint arXiv:2004.04906 , 2020.\\n[103] Y . Ma, Y . Cao, Y . Hong, and A. Sun, “Large language model is\\nnot a good few-shot information extractor, but a good reranker for\\nhard samples!” ArXiv , vol. abs/2303.08559, 2023. [Online]. Available:\\nhttps://api.semanticscholar.org/CorpusID:257532405\\n[104] J. Cui, Z. Li, Y . Yan, B. Chen, and L. Yuan, “Chatlaw: Open-source'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='legal large language model with integrated external knowledge bases,”\\narXiv preprint arXiv:2306.16092 , 2023.\\n[105] O. Yoran, T. Wolfson, O. Ram, and J. Berant, “Making retrieval-\\naugmented language models robust to irrelevant context,” arXiv\\npreprint arXiv:2310.01558 , 2023.\\n[106] X. Li, R. Zhao, Y . K. Chia, B. Ding, L. Bing, S. Joty, and S. Poria,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='“Chain of knowledge: A framework for grounding large language mod-\\nels with structured knowledge bases,” arXiv preprint arXiv:2305.13269 ,\\n2023.\\n[107] H. Yang, S. Yue, and Y . He, “Auto-gpt for online decision\\nmaking: Benchmarks and additional opinions,” arXiv preprint\\narXiv:2306.02224 , 2023.\\n[108] T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, L. Zettle-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='moyer, N. Cancedda, and T. Scialom, “Toolformer: Language models\\ncan teach themselves to use tools,” arXiv preprint arXiv:2302.04761 ,\\n2023.\\n[109] J. Zhang, “Graph-toolformer: To empower llms with graph rea-\\nsoning ability via prompt augmented by chatgpt,” arXiv preprint\\narXiv:2304.11116 , 2023.\\n[110] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='C. Hesse, S. Jain, V . Kosaraju, W. Saunders et al. , “Webgpt: Browser-\\nassisted question-answering with human feedback,” arXiv preprint\\narXiv:2112.09332 , 2021.\\n[111] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh,\\nC. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al. , “Natural'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='questions: a benchmark for question answering research,” Transactionsof the Association for Computational Linguistics , vol. 7, pp. 453–466,\\n2019.\\n[112] Y . Liu, S. Yavuz, R. Meng, M. Moorthy, S. Joty, C. Xiong, and Y . Zhou,\\n“Exploring the integration strategies of retriever and large language\\nmodels,” arXiv preprint arXiv:2308.12574 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='[113] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large\\nscale distantly supervised challenge dataset for reading comprehen-\\nsion,” arXiv preprint arXiv:1705.03551 , 2017.\\n[114] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+\\nquestions for machine comprehension of text,” arXiv preprint\\narXiv:1606.05250 , 2016.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='arXiv:1606.05250 , 2016.\\n[115] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on\\nfreebase from question-answer pairs,” in Proceedings of the 2013\\nconference on empirical methods in natural language processing , 2013,\\npp. 1533–1544.\\n[116] A. Mallen, A. Asai, V . Zhong, R. Das, H. Hajishirzi, and D. Khashabi,\\n“When not to trust language models: Investigating effectiveness and'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='limitations of parametric and non-parametric memories,” arXiv preprint\\narXiv:2212.10511 , 2022.\\n[117] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder,\\nand L. Deng, “Ms marco: A human-generated machine reading com-\\nprehension dataset,” 2016.\\n[118] Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdi-\\nnov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explain-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='able multi-hop question answering,” arXiv preprint arXiv:1809.09600 ,\\n2018.\\n[119] X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa, “Constructing a\\nmulti-hop qa dataset for comprehensive evaluation of reasoning steps,”\\narXiv preprint arXiv:2011.01060 , 2020.\\n[120] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Musique:'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='Multihop questions via single-hop question composition,” Transactions\\nof the Association for Computational Linguistics , vol. 10, pp. 539–554,\\n2022.\\n[121] A. Fan, Y . Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli, “Eli5:\\nLong form question answering,” arXiv preprint arXiv:1907.09190 ,\\n2019.\\n[122] T. Ko ˇcisk`y, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='and E. Grefenstette, “The narrativeqa reading comprehension chal-\\nlenge,” Transactions of the Association for Computational Linguistics ,\\nvol. 6, pp. 317–328, 2018.\\n[123] K.-H. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer, “A human-\\ninspired reading agent with gist memory of very long contexts,” arXiv\\npreprint arXiv:2402.09727 , 2024.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='preprint arXiv:2402.09727 , 2024.\\n[124] I. Stelmakh, Y . Luan, B. Dhingra, and M.-W. Chang, “Asqa: Factoid\\nquestions meet long-form answers,” arXiv preprint arXiv:2204.06092 ,\\n2022.\\n[125] M. Zhong, D. Yin, T. Yu, A. Zaidi, M. Mutuma, R. Jha, A. H.\\nAwadallah, A. Celikyilmaz, Y . Liu, X. Qiu et al. , “Qmsum: A new\\nbenchmark for query-based multi-domain meeting summarization,”'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='arXiv preprint arXiv:2104.05938 , 2021.\\n[126] P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner,\\n“A dataset of information-seeking questions and answers anchored in\\nresearch papers,” arXiv preprint arXiv:2105.03011 , 2021.\\n[127] T. M ¨oller, A. Reina, R. Jayakumar, and M. Pietsch, “Covid-qa: A\\nquestion answering dataset for covid-19,” in ACL 2020 Workshop on'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='Natural Language Processing for COVID-19 (NLP-COVID) , 2020.\\n[128] X. Wang, G. H. Chen, D. Song, Z. Zhang, Z. Chen, Q. Xiao, F. Jiang,\\nJ. Li, X. Wan, B. Wang et al. , “Cmb: A comprehensive medical\\nbenchmark in chinese,” arXiv preprint arXiv:2308.08833 , 2023.\\n[129] H. Zeng, “Measuring massive multitask chinese understanding,” arXiv\\npreprint arXiv:2304.12986 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='preprint arXiv:2304.12986 , 2023.\\n[130] R. Y . Pang, A. Parrish, N. Joshi, N. Nangia, J. Phang, A. Chen, V . Pad-\\nmakumar, J. Ma, J. Thompson, H. He et al. , “Quality: Question an-\\nswering with long input texts, yes!” arXiv preprint arXiv:2112.08608 ,\\n2021.\\n[131] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='and O. Tafjord, “Think you have solved question answering? try arc,\\nthe ai2 reasoning challenge,” arXiv preprint arXiv:1803.05457 , 2018.\\n[132] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa:\\nA question answering challenge targeting commonsense knowledge,”\\narXiv preprint arXiv:1811.00937 , 2018.\\n[133] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 18}, page_content='“Wizard of wikipedia: Knowledge-powered conversational agents,”\\narXiv preprint arXiv:1811.01241 , 2018.\\n[134] H. Wang, M. Hu, Y . Deng, R. Wang, F. Mi, W. Wang, Y . Wang, W.-\\nC. Kwan, I. King, and K.-F. Wong, “Large language models as source'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='20\\nplanner for personalized knowledge-grounded dialogue,” arXiv preprint\\narXiv:2310.08840 , 2023.\\n[135] ——, “Large language models as source planner for personal-\\nized knowledge-grounded dialogue,” arXiv preprint arXiv:2310.08840 ,\\n2023.\\n[136] X. Xu, Z. Gou, W. Wu, Z.-Y . Niu, H. Wu, H. Wang, and S. Wang,\\n“Long time no see! open-domain conversation with long-term persona'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='memory,” arXiv preprint arXiv:2203.05797 , 2022.\\n[137] T.-H. Wen, M. Gasic, N. Mrksic, L. M. Rojas-Barahona, P.-H.\\nSu, S. Ultes, D. Vandyke, and S. Young, “Conditional generation\\nand snapshot learning in neural dialogue systems,” arXiv preprint\\narXiv:1606.03352 , 2016.\\n[138] R. He and J. McAuley, “Ups and downs: Modeling the visual evolution'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='of fashion trends with one-class collaborative filtering,” in proceedings\\nof the 25th international conference on world wide web , 2016, pp.\\n507–517.\\n[139] S. Li, H. Ji, and J. Han, “Document-level event argument extraction\\nby conditional generation,” arXiv preprint arXiv:2104.05919 , 2021.\\n[140] S. Ebner, P. Xia, R. Culkin, K. Rawlins, and B. Van Durme, “Multi-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='sentence argument linking,” arXiv preprint arXiv:1911.03766 , 2019.\\n[141] H. Elsahar, P. V ougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest,\\nand E. Simperl, “T-rex: A large scale alignment of natural language\\nwith knowledge base triples,” in Proceedings of the Eleventh Inter-\\nnational Conference on Language Resources and Evaluation (LREC\\n2018) , 2018.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='2018) , 2018.\\n[142] O. Levy, M. Seo, E. Choi, and L. Zettlemoyer, “Zero-shot relation ex-\\ntraction via reading comprehension,” arXiv preprint arXiv:1706.04115 ,\\n2017.\\n[143] R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi, “Hel-\\nlaswag: Can a machine really finish your sentence?” arXiv preprint\\narXiv:1905.07830 , 2019.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='arXiv:1905.07830 , 2019.\\n[144] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, and M. Seo,\\n“The cot collection: Improving zero-shot and few-shot learning of\\nlanguage models via chain-of-thought fine-tuning,” arXiv preprint\\narXiv:2305.14045 , 2023.\\n[145] A. Saha, V . Pahuja, M. Khapra, K. Sankaranarayanan, and S. Chandar,\\n“Complex sequential question answering: Towards learning to converse'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='over linked question answer pairs with a knowledge graph,” in Proceed-\\nings of the AAAI conference on artificial intelligence , vol. 32, no. 1,\\n2018.\\n[146] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and\\nJ. Steinhardt, “Measuring massive multitask language understanding,”\\narXiv preprint arXiv:2009.03300 , 2020.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='arXiv preprint arXiv:2009.03300 , 2020.\\n[147] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel\\nmixture models,” arXiv preprint arXiv:1609.07843 , 2016.\\n[148] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant,\\n“Did aristotle use a laptop? a question answering benchmark with\\nimplicit reasoning strategies,” Transactions of the Association for'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='Computational Linguistics , vol. 9, pp. 346–361, 2021.\\n[149] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal, “Fever: a\\nlarge-scale dataset for fact extraction and verification,” arXiv preprint\\narXiv:1803.05355 , 2018.\\n[150] N. Kotonya and F. Toni, “Explainable automated fact-checking for\\npublic health claims,” arXiv preprint arXiv:2010.09926 , 2020.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='[151] R. Lebret, D. Grangier, and M. Auli, “Neural text generation from\\nstructured data with application to the biography domain,” arXiv\\npreprint arXiv:1603.07771 , 2016.\\n[152] H. Hayashi, P. Budania, P. Wang, C. Ackerson, R. Neervannan,\\nand G. Neubig, “Wikiasp: A dataset for multi-domain aspect-based\\nsummarization,” Transactions of the Association for Computational'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='Linguistics , vol. 9, pp. 211–225, 2021.\\n[153] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details,\\njust the summary! topic-aware convolutional neural networks for ex-\\ntreme summarization,” arXiv preprint arXiv:1808.08745 , 2018.\\n[154] S. Saha, J. A. Junaed, M. Saleki, A. S. Sharma, M. R. Rifat, M. Rahouti,\\nS. I. Ahmed, N. Mohammed, and M. R. Amin, “Vio-lens: A novel'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='dataset of annotated social network posts leading to different forms\\nof communal violence and its evaluation,” in Proceedings of the First\\nWorkshop on Bangla Language Processing (BLP-2023) , 2023, pp. 72–\\n84.\\n[155] X. Li and D. Roth, “Learning question classifiers,” in COLING 2002:\\nThe 19th International Conference on Computational Linguistics , 2002.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='[156] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y . Ng,\\nand C. Potts, “Recursive deep models for semantic compositionality\\nover a sentiment treebank,” in Proceedings of the 2013 conference on\\nempirical methods in natural language processing , 2013, pp. 1631–\\n1642.[157] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='“Codesearchnet challenge: Evaluating the state of semantic code\\nsearch,” arXiv preprint arXiv:1909.09436 , 2019.\\n[158] K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,\\nM. Plappert, J. Tworek, J. Hilton, R. Nakano et al. , “Training verifiers\\nto solve math word problems,” arXiv preprint arXiv:2110.14168 , 2021.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='[159] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufis,\\nand D. Varga, “The jrc-acquis: A multilingual aligned parallel corpus\\nwith 20+ languages,” arXiv preprint cs/0609058 , 2006.\\n[160] Y . Hoshi, D. Miyashita, Y . Ng, K. Tatsuno, Y . Morioka, O. Torii,\\nand J. Deguchi, “Ralle: A framework for developing and eval-'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='uating retrieval-augmented large language models,” arXiv preprint\\narXiv:2308.10633 , 2023.\\n[161] J. Liu, “Building production-ready rag applications,” https://www.ai.\\nengineer/summit/schedule/building-production-ready-rag-applications,\\n2023.\\n[162] I. Nguyen, “Evaluating rag part i: How to evaluate document retrieval,”\\nhttps://www.deepset.ai/blog/rag-evaluation-retrieval, 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='[163] Q. Leng, K. Uhlenhuth, and A. Polyzotis, “Best practices for\\nllm evaluation of rag applications,” https://www.databricks.com/blog/\\nLLM-auto-eval-best-practices-RAG, 2023.\\n[164] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “Ragas: Au-\\ntomated evaluation of retrieval augmented generation,” arXiv preprint\\narXiv:2309.15217 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='arXiv:2309.15217 , 2023.\\n[165] J. Saad-Falcon, O. Khattab, C. Potts, and M. Zaharia, “Ares: An\\nautomated evaluation framework for retrieval-augmented generation\\nsystems,” arXiv preprint arXiv:2311.09476 , 2023.\\n[166] C. Jarvis and J. Allard, “A survey of techniques for\\nmaximizing llm performance,” https://community.openai.\\ncom/t/openai-dev-day-2023-breakout-sessions/505213#'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='a-survey-of-techniques-for-maximizing-llm-performance-2, 2023.\\n[167] J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large lan-\\nguage models in retrieval-augmented generation,” arXiv preprint\\narXiv:2309.01431 , 2023.\\n[168] Y . Liu, L. Huang, S. Li, S. Chen, H. Zhou, F. Meng, J. Zhou, and\\nX. Sun, “Recall: A benchmark for llms robustness against external'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='counterfactual knowledge,” arXiv preprint arXiv:2311.08147 , 2023.\\n[169] Y . Lyu, Z. Li, S. Niu, F. Xiong, B. Tang, W. Wang, H. Wu, H. Liu,\\nT. Xu, and E. Chen, “Crud-rag: A comprehensive chinese benchmark\\nfor retrieval-augmented generation of large language models,” arXiv\\npreprint arXiv:2401.17043 , 2024.\\n[170] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='E. Bakhturina, M. Shoeybi, and B. Catanzaro, “Retrieval meets long\\ncontext large language models,” arXiv preprint arXiv:2310.03025 ,\\n2023.\\n[171] C. Packer, V . Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gon-\\nzalez, “Memgpt: Towards llms as operating systems,” arXiv preprint\\narXiv:2310.08560 , 2023.\\n[172] G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='streaming language models with attention sinks,” arXiv preprint\\narXiv:2309.17453 , 2023.\\n[173] T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, and J. E.\\nGonzalez, “Raft: Adapting language model to domain specific rag,”\\narXiv preprint arXiv:2403.10131 , 2024.\\n[174] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess,'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws\\nfor neural language models,” arXiv preprint arXiv:2001.08361 , 2020.\\n[175] U. Alon, F. Xu, J. He, S. Sengupta, D. Roth, and G. Neubig, “Neuro-\\nsymbolic language modeling with automaton-augmented retrieval,” in\\nInternational Conference on Machine Learning . PMLR, 2022, pp.\\n468–485.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='468–485.\\n[176] M. Yasunaga, A. Aghajanyan, W. Shi, R. James, J. Leskovec, P. Liang,\\nM. Lewis, L. Zettlemoyer, and W.-t. Yih, “Retrieval-augmented multi-\\nmodal language modeling,” arXiv preprint arXiv:2211.12561 , 2022.\\n[177] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-\\nimage pre-training with frozen image encoders and large language'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='models,” arXiv preprint arXiv:2301.12597 , 2023.\\n[178] W. Zhu, A. Yan, Y . Lu, W. Xu, X. E. Wang, M. Eckstein, and W. Y .\\nWang, “Visualize before you write: Imagination-guided open-ended\\ntext generation,” arXiv preprint arXiv:2210.03765 , 2022.\\n[179] J. Zhao, G. Haffar, and E. Shareghi, “Generating synthetic speech from\\nspokenvocab for speech translation,” arXiv preprint arXiv:2210.08174 ,\\n2022.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 19}, page_content='2022.\\n[180] D. M. Chan, S. Ghosh, A. Rastrow, and B. Hoffmeister, “Using external\\noff-policy speech-to-text mappings in contextual end-to-end automated\\nspeech recognition,” arXiv preprint arXiv:2301.02736 , 2023.'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 20}, page_content='21\\n[181] A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev,\\nJ. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual\\nlanguage model for dense video captioning,” in Proceedings of the\\nIEEE/CVF Conference on Computer Vision and Pattern Recognition ,\\n2023, pp. 10 714–10 726.\\n[182] N. Nashid, M. Sintaha, and A. Mesbah, “Retrieval-based prompt'), Document(metadata={'source': 'https://arxiv.org/pdf/2312.10997.pdf', 'page': 20}, page_content='selection for code-related few-shot learning,” in 2023 IEEE/ACM 45th\\nInternational Conference on Software Engineering (ICSE) , 2023, pp.\\n2450–2462.')]\n"
     ]
    }
   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    " \n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=400,\n",
    "    chunk_overlap=40,\n",
    "    length_function=len,\n",
    "    is_separator_regex=False\n",
    ")\n",
    "\n",
    "chunks = text_splitter.split_documents(pages)\n",
    "print(chunks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/testys/.local/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.embeddings import HuggingFaceBgeEmbeddings\n",
    "\n",
    "model_name = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
    "model_kwargs = {\"device\": \"cpu\"}\n",
    "encode_kwargs = {\"padding\": \"max_length\", \"max_length\": 512, \"truncation\": True, \"normalize_embeddings\": True}\n",
    "embeddings = HuggingFaceBgeEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs\n",
    ")\n",
    "\n",
    "# chunk_text = list(map(lambda x: x.page_content, chunks))\n",
    "# embeddings = embeddings.embed_documents(chunk_text)\n",
    "# print(embeddings[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.vectorstores import FAISS\n",
    "\n",
    "\n",
    "\n",
    "db = FAISS.from_documents(chunks, embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"What is the main drawback of the RAG method based on the paper?\"\n",
    "\n",
    "# results = db.search(query=query, k=5, search_type=\"similarity\")\n",
    "\n",
    "# print(results[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "chat_prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"system\", \"You are a researcher who has just read a paper on a new method for solving a problem in your field. You are excited about the potential of the method, but you have some questions about the details of the method and its limitations.\"),\n",
    "        (\"human\", \"{question}\")    \n",
    "    ]\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "import os\n",
    "\n",
    "load_dotenv()\n",
    "\n",
    "api_key = os.getenv(\"OPENAI_API_KEY\")\n",
    "\n",
    "chat_model = ChatOpenAI(model_name=\"gpt-4o\",\n",
    "                        api_key=api_key,\n",
    "                        temperature=0.9,\n",
    "                        max_tokens=1000\n",
    "                        )\n",
    "\n",
    "chain = chat_prompt | chat_model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = chain.invoke(\n",
    "    {\n",
    "        \"context\": \"\\n\\n\".join(list(map(lambda x: x.page_content, chunks))),\n",
    "        \"question\":query\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'According to the paper, one of the main drawbacks of the Retrieval-Augmented Generation (RAG) method is its reliance on the quality and relevance of the retrieved documents. If the retrieval component fails to find relevant information, the generation component may produce inaccurate or irrelevant outputs. This dependency highlights a few specific issues:\\n\\n1. **Retrieval Quality**: If the underlying retrieval algorithm is not robust or the index from which the documents are retrieved is not comprehensive and up to date, the entire process can be compromised.\\n\\n2. **Noise in Retrieved Documents**: The method might retrieve documents that contain irrelevant or even erroneous information, which could negatively influence the generated responses.\\n\\n3. **Computational Complexity**: Integrating retrieval and generation components can introduce additional computational overhead, which might not be feasible in real-time or resource-constrained environments.\\n\\n4. **Fine-Tuning Requirements**: The method may require extensive fine-tuning to strike a balance between retrieval and generation, which can be resource-intensive and may not generalize well across different domains.\\n\\nThese limitations suggest that while RAG has strong potential, its effectiveness is closely tied to the sophistication and accuracy of its retrieval mechanism.'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chatbot with RAG method\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "model = ChatOpenAI(model=\"gpt-4o\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.messages import HumanMessage, SystemMessage, AIMessage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AIMessage(content=\"Hello Bobby! I'm an AI, so I don't have feelings, but I'm here and ready to help you with whatever you need. How can I assist you today?\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 33, 'prompt_tokens': 15, 'total_tokens': 48, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_c17d3befe7', 'finish_reason': 'stop', 'logprobs': None}, id='run-41ec5a6b-f78e-4262-9048-b18d73a048e0-0', usage_metadata={'input_tokens': 15, 'output_tokens': 33, 'total_tokens': 48})"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.invoke(\n",
    "    [\n",
    "        HumanMessage(content=\"Hello, how are you?, I'm Bobby\"),\n",
    "        # AIMessage(content=\"Hello Bobby! I'm fine, how can I help you?\")\n",
    "\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langgraph.checkpoint.memory import MemorySaver\n",
    "from langgraph.graph import START, MessagesState, StateGraph\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "workflow = StateGraph(\n",
    "    state_schema=MessagesState\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def call_model(state:MessagesState):\n",
    "    response = model.invoke(state[\"messages\"])\n",
    "    return {\"messages\":response}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<langgraph.graph.state.StateGraph at 0x70a0e1f96b10>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "workflow.add_edge(START, \"model\")\n",
    "workflow.add_node(\"model\", call_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "memory = MemorySaver()\n",
    "app = workflow.compile(checkpointer=memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "    \"configurable\": {\"thread_id\": \"1234\"}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"Hi! I'm Bobby, and you?\"\n",
    "\n",
    "input_message = [HumanMessage(content=query)]\n",
    "output = app.invoke(\n",
    "    {\n",
    "        \"messages\": input_message\n",
    "    }, \n",
    "    config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'messages': [HumanMessage(content=\"Hi! I'm Bobby, and you?\", additional_kwargs={}, response_metadata={}, id='82d2044f-9517-458a-95b8-f0bfa3fb9300'),\n",
       "  AIMessage(content=\"Hello, Bobby! I'm an AI developed by OpenAI. How can I assist you today?\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 15, 'total_tokens': 34, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_c17d3befe7', 'finish_reason': 'stop', 'logprobs': None}, id='run-169455ca-94b7-49dd-b3d7-4f89fbddf7b3-0', usage_metadata={'input_tokens': 15, 'output_tokens': 19, 'total_tokens': 34})]}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================\u001b[1m Human Message \u001b[0m=================================\n",
      "\n",
      "Hi! I'm Bobby, and you?\n"
     ]
    }
   ],
   "source": [
    "output[\"messages\"][0].pretty_print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_query = \"What is my name?\"\n",
    "\n",
    "input_message = [HumanMessage(content=new_query)]\n",
    "output = app.invoke(\n",
    "    {\n",
    "        \"messages\": input_message\n",
    "    }, \n",
    "    config\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'messages': [HumanMessage(content=\"Hi! I'm Bobby, and you?\", additional_kwargs={}, response_metadata={}, id='82d2044f-9517-458a-95b8-f0bfa3fb9300'),\n",
       "  AIMessage(content=\"Hello, Bobby! I'm an AI developed by OpenAI. How can I assist you today?\", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 15, 'total_tokens': 34, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_c17d3befe7', 'finish_reason': 'stop', 'logprobs': None}, id='run-169455ca-94b7-49dd-b3d7-4f89fbddf7b3-0', usage_metadata={'input_tokens': 15, 'output_tokens': 19, 'total_tokens': 34}),\n",
       "  HumanMessage(content='What is my name', additional_kwargs={}, response_metadata={}, id='7eaa30f5-5969-4274-8914-ecc4d157eff0'),\n",
       "  AIMessage(content='You mentioned that your name is Bobby. How can I assist you today, Bobby?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 17, 'prompt_tokens': 46, 'total_tokens': 63, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_c17d3befe7', 'finish_reason': 'stop', 'logprobs': None}, id='run-fa57f662-b5b7-4918-911a-e68886bb70f3-0', usage_metadata={'input_tokens': 46, 'output_tokens': 17, 'total_tokens': 63}),\n",
       "  HumanMessage(content='What is my name?', additional_kwargs={}, response_metadata={}, id='be3f2b50-122f-4839-99ef-d207c2da7073'),\n",
       "  AIMessage(content='You told me earlier that your name is Bobby. How can I help you today, Bobby?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 76, 'total_tokens': 95, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_057232b607', 'finish_reason': 'stop', 'logprobs': None}, id='run-178f2c9b-07ed-4ef0-8178-69a109bb2d6f-0', usage_metadata={'input_tokens': 76, 'output_tokens': 19, 'total_tokens': 95})]}"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "from redis import Redis\n",
    "import json\n",
    "\n",
    "r = Redis(host=\"localhost\", port=6379, db=3)\n",
    "\n",
    "def save_message(chat_id, message_type, message_content, metadata=None):\n",
    "    \"\"\"\n",
    "    Save a chat message to Redis.\n",
    "    \n",
    "    Args:\n",
    "        chat_id (str): Unique identifier for the conversation (chat session).\n",
    "        message_type (str): 'user' or 'ai' to denote who sent the message.\n",
    "        message_content (str): The actual message content.\n",
    "        metadata (dict, optional): Additional metadata like tokens used, model information, etc.\n",
    "        \n",
    "    \"\"\"\n",
    "    key = f\"chat:{chat_id}\"\n",
    "    \n",
    "    # Message object to store\n",
    "    message = {\n",
    "        \"type\": message_type,  # 'user' or 'ai'\n",
    "        \"content\": message_content,\n",
    "        \"metadata\": metadata if metadata else {}\n",
    "    }\n",
    "    \n",
    "    # Save the message to a Redis list\n",
    "    r.rpush(key, json.dumps(message))\n",
    "\n",
    "\n",
    "def get_chat_history(chat_id):\n",
    "    key = f\"chat:{chat_id}\"\n",
    "\n",
    "    messages = r.lrange(key, 0, -1)\n",
    "\n",
    "    chat_history = [json.loads(msg) for msg in messages]\n",
    "\n",
    "    return chat_history\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "82d2044f-9517-458a-95b8-f0bfa3fb9300\n",
      "run-169455ca-94b7-49dd-b3d7-4f89fbddf7b3-0\n",
      "7eaa30f5-5969-4274-8914-ecc4d157eff0\n",
      "run-fa57f662-b5b7-4918-911a-e68886bb70f3-0\n",
      "be3f2b50-122f-4839-99ef-d207c2da7073\n",
      "run-178f2c9b-07ed-4ef0-8178-69a109bb2d6f-0\n"
     ]
    }
   ],
   "source": [
    "for message in output[\"messages\"]:\n",
    "    print(message.id)\n",
    "    print(message.type)\n",
    "    print(message.content)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}