Upload 22 files

- README.md +31 -6
- __pycache__/config_sambanova.cpython-310.pyc +0 -0
- __pycache__/run_chatbot.cpython-310.pyc +0 -0
- app.py +38 -59
- longcepo.py +15 -0
- longcepo/README.md +92 -0
- longcepo/__init__.py +0 -0
- longcepo/__pycache__/__init__.cpython-310.pyc +0 -0
- longcepo/__pycache__/chunking.cpython-310.pyc +0 -0
- longcepo/__pycache__/config.cpython-310.pyc +0 -0
- longcepo/__pycache__/main.cpython-310.pyc +0 -0
- longcepo/__pycache__/mapreduce.cpython-310.pyc +0 -0
- longcepo/__pycache__/prompts.cpython-310.pyc +0 -0
- longcepo/__pycache__/utils.cpython-310.pyc +0 -0
- longcepo/chunking.py +248 -0
- longcepo/config.py +36 -0
- longcepo/main.py +109 -0
- longcepo/mapreduce.py +281 -0
- longcepo/prompts.py +16 -0
- longcepo/utils.py +191 -0
- requirements.txt +5 -1
- run_chatbot.py +64 -0
README.md
CHANGED
@@ -1,12 +1,37 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: LongCePO Chatbot (Sambanova)
+emoji: 🤖
+colorFrom: blue
+colorTo: green
 sdk: gradio
-sdk_version: 5.
+sdk_version: 5.27.1
 app_file: app.py
 pinned: false
 ---
 
-
+# LongCePO Chatbot with Sambanova Backend
+
+This is a simple chatbot interface demonstrating the LongCePO (Long-Context Cerebras Planning and Optimization) method using a Sambanova model (`Llama-4-Maverick-17B-128E-Instruct`) as the backend LLM.
+
+## How it works
+
+The LongCePO method is designed to handle long contexts (potentially millions of tokens) by:
+1. **Planning:** Decomposing the initial query into sub-questions.
+2. **MapReduce:** Answering each sub-question by processing chunks of the long context, summarizing relevant information, and aggregating results.
+
+This application takes a long text context and a query based on that context. It then uses the modified `longcepo` plugin (originally from the `optillm` repository) to generate an answer using the Sambanova API.
+
+## How to use
+
+1. **(Optional)** Enter a system prompt to guide the chatbot's behavior.
+2. Paste the long text document into the **Context** box.
+3. Enter your question based on the provided context into the **Query** box.
+4. Click **Submit**.
+
+The chatbot will process the request using the LongCePO pipeline and display the final answer.
+
+**Note:** Processing long contexts can take some time depending on the length of the context and the complexity of the query.
+
+## API Key
+
+This application requires a Sambanova API key to function. The key should be stored as a Hugging Face Space Secret named `SAMBANOVA_API_KEY`.
__pycache__/config_sambanova.cpython-310.pyc
ADDED
Binary file (200 Bytes).

__pycache__/run_chatbot.cpython-310.pyc
ADDED
Binary file (2.23 kB).
app.py
CHANGED
@@ -1,64 +1,43 @@
 import gradio as gr
-from
-        messages,
-        max_tokens=max_tokens,
-        stream=True,
-        temperature=temperature,
-        top_p=top_p,
-    ):
-        token = message.choices[0].delta.content
-
-        response += token
-        yield response
-
-
-"""
-For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
-"""
-demo = gr.ChatInterface(
-    respond,
-    additional_inputs=[
-        gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
-        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
-        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
-        gr.Slider(
-            minimum=0.1,
-            maximum=1.0,
-            value=0.95,
-            step=0.05,
-            label="Top-p (nucleus sampling)",
-        ),
-    ],
-)
-
+from run_chatbot import process_with_longcepo, SAMBANOVA_MODEL
+
+def chatbot_interface(system_prompt, context, query):
+    """Gradio interface function to interact with the LongCePO chatbot."""
+    if not context or not query:
+        return "Please provide both context and query."
+
+    # Combine context and query using the expected delimiter
+    initial_query = f"{context}<CONTEXT_END>{query}"
+
+    # Use a default system prompt if none is provided
+    if not system_prompt:
+        system_prompt = "You are a helpful assistant designed to answer questions based on the provided context."
+
+    print(f"Received request:\nSystem Prompt: {system_prompt}\nContext: {context[:100]}...\nQuery: {query}")
+
+    # Call the processing function
+    result = process_with_longcepo(system_prompt, initial_query)
+
+    print(f"Returning result: {result[:100]}...")
+    return result
+
+# Define Gradio interface components
+iface = gr.Interface(
+    fn=chatbot_interface,
+    inputs=[
+        gr.Textbox(label="System Prompt (Optional)", placeholder="Enter system prompt here...", lines=2),
+        gr.Textbox(label="Context", placeholder="Enter the long context here...", lines=10),
+        gr.Textbox(label="Query", placeholder="Enter your query based on the context here...", lines=2)
+    ],
+    outputs=gr.Textbox(label="Answer", lines=10),
+    title=f"LongCePO Chatbot ({SAMBANOVA_MODEL})",
+    description="Enter a long context and a query. The chatbot will use the LongCePO method with Sambanova backend to generate an answer.",
+    allow_flagging="never"
+)
+
+# Launch the Gradio app
 if __name__ == "__main__":
+    print("Launching Gradio interface...")
+    # Listen on 0.0.0.0 to make it accessible externally if needed
+    iface.launch(server_name="0.0.0.0", server_port=7860)
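The new `app.py` imports `process_with_longcepo` and `SAMBANOVA_MODEL` from `run_chatbot.py`, which is part of this commit but not reproduced in the excerpt above. Below is a minimal, hypothetical sketch of what that glue module could look like, assuming it reads the `SAMBANOVA_API_KEY` Space secret and forwards the combined `{context}<CONTEXT_END>{query}` string to the LongCePO pipeline through an OpenAI-compatible client; every name and value here other than `process_with_longcepo` and `SAMBANOVA_MODEL` is an illustrative assumption, not the actual implementation.

```python
# Hypothetical sketch of run_chatbot.py; the real file is in the commit but not shown here.
import os

from openai import OpenAI

from longcepo.main import run_longcepo

# Assumed model identifier and endpoint; the actual values live in run_chatbot.py / config_sambanova.py.
SAMBANOVA_MODEL = "Llama-4-Maverick-17B-128E-Instruct"
SAMBANOVA_BASE_URL = "https://api.sambanova.ai/v1"


def process_with_longcepo(system_prompt: str, initial_query: str) -> str:
    """Run the LongCePO pipeline on a '{context}<CONTEXT_END>{query}' string."""
    client = OpenAI(
        api_key=os.environ["SAMBANOVA_API_KEY"],  # Hugging Face Space secret
        base_url=SAMBANOVA_BASE_URL,
    )
    answer, _total_tokens = run_longcepo(system_prompt, initial_query, client, SAMBANOVA_MODEL)
    return answer
```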
longcepo.py
ADDED
@@ -0,0 +1,15 @@
"""The Long-Context Cerebras Planning and Optimization (LongCePO) Method

LongCePO is an inference-time computation method designed to provide LLMs with the capability to work with infinite context such as external knowledge bases that can run into millions of tokens. We achieve this goal through a combination of multiple strategies including planning (query decomposition) and divide-and-conquer long-context processing. This approach enables the use of a limited context window (e.g. 8K) while outperforming full-context processing with the same base model in many question-answering tasks.

If you have any questions or want to contribute, please reach out to us on [cerebras.ai/discord](https://cerebras.ai/discord).
"""

from typing import Tuple
from .longcepo.main import run_longcepo


SLUG = "longcepo"

def run(system_prompt: str, initial_query: str, client, model: str) -> Tuple[str, int]:
    return run_longcepo(system_prompt, initial_query, client, model)
longcepo/README.md
ADDED
@@ -0,0 +1,92 @@
# The Long-Context Cerebras Planning and Optimization (LongCePO) Method

LongCePO is an inference-time computation method designed to provide LLMs with the capability to work with infinite context such as external knowledge bases that can run into millions of tokens. We achieve this goal through a combination of multiple strategies including planning (query decomposition) and divide-and-conquer long-context processing. This approach enables the use of a limited context window (e.g. 8K) while outperforming full-context processing with the same base model in many question-answering tasks.

If you have any questions or want to contribute, please reach out to us on [cerebras.ai/discord](https://cerebras.ai/discord).

## Usage

Start the optillm proxy server with the plugins directory specified on the command line:

```bash
python optillm.py --base-url https://api.cerebras.ai/v1 --port <port> --plugins-dir ./optillm/plugins
```

Now you can send requests to the proxy using the model name `longcepo-{model_name}` (e.g. `longcepo-llama-3.3-70b`), with the user message in the format `{context}<CONTEXT_END>{query}`. The `<CONTEXT_END>` delimiter string is used to split the user message into the (long) context and the user's query, respectively. This delimiter string can be changed (along with other LongCePO parameters) in the [config file](./config.py).
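As a concrete illustration of this request format, here is a minimal client-side sketch; it assumes the proxy is running locally on port 8000 and that the `openai` Python SDK is used, and the API key and document text are placeholders.

```python
# Minimal sketch: query the optillm proxy with the LongCePO plugin enabled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder-key")

long_document = "..."  # the full long context goes here
question = "Who had a longer tennis career, Danny or Alice?"

response = client.chat.completions.create(
    model="longcepo-llama-3.3-70b",
    messages=[
        # Context and query are packed into a single user message, split by <CONTEXT_END>.
        {"role": "user", "content": f"{long_document}<CONTEXT_END>{question}"},
    ],
)
print(response.choices[0].message.content)
```
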
## LongCePO Results

LongCePO excels at tasks with long context (128K tokens and more), which is demonstrated below on the LongBench v2 and HELMET benchmarks in comparison to frontier models. We additionally provide data points for tasks with shorter context that still exceeds the 8K context window (HotpotQA and MuSiQue samples of 12-16K length). For our evaluations, we report the mean and standard deviation of the target metric over 5 runs.

### LongBench v2

| Model¹ | Context window | Short samples (up to 32K words) | Medium samples (32–128K words) |
|----------------------------------|----------------|------------------|----------------|
| Llama 3.3 70B Instruct | 128K | 36.7 (45.0) | 27.0 (33.0) |
| **LongCePO + Llama 3.3 70B Instruct** | **8K** | **36.8 ± 1.38** | **38.7 ± 2.574 (39.735)²** |
| Mistral-Large-Instruct-2411 | 128K | 41.7 (46.1) | 30.7 (34.9) |
| o1-mini-2024-09-12 | 128K | 48.6 (48.9) | 33.3 (32.9) |
| Claude-3.5-Sonnet-20241022 | 200K | 46.1 (53.9) | 38.6 (41.9) |
| Llama-4-Maverick-17B-128E-Instruct | 524K | 32.22 (50.56) | 28.84 (41.86) |

¹ Performance numbers reported by LongBench v2 authors, except for LongCePO and Llama-4-Maverick results. Results in parentheses reported in LongBench v2 correspond to Chain-of-Thought prompting.

² Results in parentheses for LongCePO indicate accuracy of majority voting from 5 runs.

### HELMET (InfiniteBench En.MC, 128K length)

| Model | Accuracy (%) |
|---------|---------------|
| Llama 3.3 70B Instruct (full context) | 58.0 |
| **LongCePO + Llama 3.3 70B Instruct (8K context)** | **71.6 ± 1.855 (73.0)¹** |
| o1-mini-2024-09-12 (full context) | 58.0 |
| gpt-4o-2024-08-06 (full context) | 74.0 |

¹ Numbers in parentheses for LongCePO indicate accuracy of majority voting from 5 runs.

### LongBench v1 (HotpotQA, 12K+ length, 124 samples)

| Model | F1 Metric (%) | LLM-as-a-judge accuracy (%) |
|---------|---------------|-----------------------------|
| Llama 3.3 70B Instruct (full context) | 63.372 ± 0.269 | 77.903 ± 0.822 |
| **LongCePO + Llama 3.3 70B Instruct (8K context)** | **64.842 ± 1.295** | **79.355 ± 1.66** |

### LongBench v1 (MuSiQue, 12K+ length, 191 samples)

| Model | F1 Metric (%) | LLM-as-a-judge accuracy (%) |
|---------|---------------|-----------------------------|
| Llama 3.3 70B Instruct (full context) | 48.481 ± 0.641 | 49.424 ± 0.71 |
| **LongCePO + Llama 3.3 70B Instruct (8K context)** | **54.076 ± 2.059** | **60.628 ± 2.156** |


## LongCePO Methodology

LongCePO is based on the [LLM×MapReduce](https://arxiv.org/abs/2410.09342) approach to long-document processing, adding a planning layer on top of a map-reduce-based question-answering engine. We also improve upon the map-reduce approach itself by (i) adding query-aware summaries of neighboring document chunks during the map stage, (ii) reducing the collapse (merging) stage to the minimum required number of collapse iterations by using a sliding window to iteratively merge pairs of summaries, and (iii) using a customized system prompt produced with an [OPRO-like](https://arxiv.org/abs/2309.03409) optimization approach to enhance question-answering performance. Given a user query, a plan consisting of sub-queries is generated from a normalized query; a map-reduce question-answering engine is then run for each sub-query consecutively, conditioned on the answers to previous sub-queries. Finally, the answer to the original user query is produced via map-reduce conditioned on the answers to the whole plan. Similarly to [LLM×MapReduce](https://arxiv.org/abs/2410.09342), we retain the structured information protocol for producing document chunk summaries. We find that splitting the document into chunks smaller than the available context window (e.g. 4K chunks with an available context window of 8K) leads to better performance, and we use the remaining context budget to incorporate summaries from neighboring chunks into the map stage for each respective chunk, leading to a further boost in overall performance.
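Improvement (i) can be seen in isolation: for every chunk, the map prompt receives the query-aware summaries of up to `num_neighbor_summaries` chunks on either side of it. The following toy, LLM-free sketch mirrors the list comprehension used in `mapreduce.py` later in this commit; the `neighbor_summaries` helper name is illustrative only.

```python
# Toy illustration of how the neighbor-summary window is assembled (no LLM calls).
def neighbor_summaries(summaries, num_neighbor_summaries=5):
    return [
        "\n\n".join(
            summaries[
                max(0, idx - num_neighbor_summaries) : min(
                    len(summaries) - 1, idx + num_neighbor_summaries
                )
            ]
        )
        for idx in range(len(summaries))
    ]

toy = [f"summary {i}" for i in range(8)]
print(neighbor_summaries(toy, num_neighbor_summaries=2)[3])
# summary 1
#
# summary 2
#
# summary 3
#
# summary 4
```
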
Note: the system prompt for the Map/Collapse/Reduce stages has been optimized for the Llama3.3-70B-Instruct model; when using other base models with LongCePO, a more general system prompt can be used ([example](https://github.com/DenisSergeevitch/chatgpt-custom-instructions)).


## LongCePO Current Status

This project is a work in progress, and the provided code is in an early experimental stage. While the proposed approach works well across the benchmarks we tested, further improvements can be achieved through a smart organization of the external knowledge base as well as customization of the plan generation to different tasks. For updates on LongCePO, [follow us on X](https://x.com/cerebrassystems) and join our [Discord](https://cerebras.ai/discord)!


## References

1. Zhou, Zihan, et al. *LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models.* arXiv preprint arXiv:2410.09342 (2024).

2. Yang, Chengrun, et al. *Large language models as optimizers.* arXiv preprint arXiv:2309.03409 (2023).

## Citing LongCePO

```bibtex
@misc{cerebras2025longcepo,
  author = {Lazarevich, Ivan and Hassanpour, Mohammad and Venkatesh, Ganesh},
  title = {LongCePO: Empowering LLMs to efficiently leverage infinite context},
  month = March,
  year = 2025,
  howpublished = {\url{https://cerebras.ai/blog/longcepo}},
}
```
longcepo/__init__.py
ADDED
File without changes

longcepo/__pycache__/__init__.cpython-310.pyc
ADDED
Binary file (143 Bytes).

longcepo/__pycache__/chunking.cpython-310.pyc
ADDED
Binary file (6.2 kB).

longcepo/__pycache__/config.cpython-310.pyc
ADDED
Binary file (1.38 kB).

longcepo/__pycache__/main.cpython-310.pyc
ADDED
Binary file (2.5 kB).

longcepo/__pycache__/mapreduce.cpython-310.pyc
ADDED
Binary file (6.87 kB).

longcepo/__pycache__/prompts.cpython-310.pyc
ADDED
Binary file (9.51 kB).

longcepo/__pycache__/utils.cpython-310.pyc
ADDED
Binary file (6.38 kB).
longcepo/chunking.py
ADDED
@@ -0,0 +1,248 @@
# Code modified from https://github.com/thunlp/LLMxMapReduce under Apache 2.0

import re
from typing import List

from .utils import logger


def get_prompt_length(prompt: str, tokenizer, no_special_tokens=False, **kwargs) -> int:
    """
    Returns the token length of a prompt using the given tokenizer.
    """
    if isinstance(prompt, list):
        prompt = "\n\n".join(prompt)
    if no_special_tokens:
        kwargs["add_special_tokens"] = False
    return len(tokenizer.encode(prompt, **kwargs))


def chunk_context(doc: str, chunk_size: int, tokenizer, separator="\n",) -> List[str]:
    """
    Splits a long document into token-limited chunks based on a separator, ensuring each chunk fits within `chunk_size`.

    Uses a greedy approach to accumulate text segments (split by `separator`) into chunks that fit within the
    token limit. If a segment alone exceeds the limit, it is recursively broken down using sentence-level
    splitting. Attempts to preserve natural boundaries while minimizing excessive chunking.

    Args:
        doc (str): Input document to split.
        chunk_size (int): Maximum number of tokens allowed per chunk.
        tokenizer: Tokenizer instance with `.encode()` method to compute token length.
        separator (str): Delimiter to split initial segments (default: newline).

    Returns:
        List[str]: List of non-empty, token-constrained document chunks.
    """
    paragraphs = doc.split(separator)
    paragraphs = [paragraph for paragraph in paragraphs if paragraph]
    separator_len = get_prompt_length(separator, tokenizer, no_special_tokens=True)

    docs = []
    current_doc = []
    total = 0
    for paragraph in paragraphs:
        plen = get_prompt_length(paragraph, tokenizer, no_special_tokens=True)
        if total + plen + (separator_len if len(current_doc) > 0 else 0) > chunk_size:
            if total > chunk_size:
                logger.info(
                    f"Created a chunk of size {total}, "
                    f"which is longer than the specified {chunk_size}"
                )
                # If single chunk is too long, split into more granular
                if len(current_doc) == 1:
                    split_again = split_into_granular_chunks(
                        current_doc[0], chunk_size, tokenizer
                    )
                    docs.extend(split_again)
                    current_doc = []
                    total = 0

            if len(current_doc) > 0:
                doc = separator.join(current_doc)
                if doc is not None:
                    docs.append(doc)
                while total > 0 or (
                    total + plen + (separator_len if len(current_doc) > 0 else 0)
                    > chunk_size
                    and total > 0
                ):
                    total -= get_prompt_length(
                        current_doc[0], tokenizer, no_special_tokens=True
                    ) + (separator_len if len(current_doc) > 1 else 0)
                    current_doc = current_doc[1:]

        current_doc.append(paragraph)
        total += plen + (separator_len if len(current_doc) > 1 else 0)
    # Check if the last one exceeds
    if (
        get_prompt_length(current_doc[-1], tokenizer, no_special_tokens=True)
        > chunk_size
        and len(current_doc) == 1
    ):
        split_again = split_into_granular_chunks(current_doc[0], chunk_size, tokenizer)
        docs.extend(split_again)
        current_doc = []
    else:
        doc = separator.join(current_doc)
        if doc is not None:
            docs.append(doc)

    return [doc for doc in docs if doc.strip()]


def split_sentences(text: str, spliter: str):
    """
    Splits text into sentences or segments based on a given delimiter while preserving punctuation.

    For punctuation-based splitters (e.g., ".", "!", "。"), it interleaves text and punctuation.
    For space-based splitting, it preserves trailing spaces.

    Args:
        text (str): The input text to split.
        spliter (str): Delimiter regex pattern (e.g., r"([.!?])", r"(。)", or " ").

    Returns:
        List[str]: List of split sentence-like segments with punctuation retained.
    """

    # Split by punctuation and keep punctuation
    text = text.strip()
    sentence_list = re.split(spliter, text)

    # Rearrange sentences and punctuation
    if spliter != " ":
        sentences = ["".join(i) for i in zip(sentence_list[0::2], sentence_list[1::2])]
        if len(sentence_list) % 2 != 0 and sentence_list[-1] != "":
            sentences.append(sentence_list[-1])
    else:
        sentences = [i + " " for i in sentence_list if i != ""]
        sentences[-1] = sentences[-1].strip()
    return sentences


def split_into_granular_chunks(
    text: str, chunk_size: int, tokenizer, spliter=r"([。!?;.?!;])",
) -> List[str]:
    """
    Splits long text into granular, token-length-constrained chunks using sentence boundaries.

    Sentences are first extracted using a delimiter pattern (`spliter`), then grouped into chunks such that
    each chunk does not exceed the specified `chunk_size` (in tokens). If a chunk still exceeds the limit,
    it is recursively broken down further using whitespace as a fallback.

    Ensures that the final chunks are balanced: if the last chunk is too small, it redistributes the last two
    chunks more evenly by re-splitting and re-allocating their sentences.

    Args:
        text (str): Input text to be chunked.
        chunk_size (int): Maximum number of tokens per chunk.
        tokenizer: Tokenizer instance with `.encode()` method to compute token length.
        spliter (str): Regex pattern to split sentences.

    Returns:
        List[str]: List of token-limited chunks, each composed of one or more sentences.
    """
    sentences = split_sentences(text, spliter)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence_length = get_prompt_length(sentence, tokenizer)

        if get_prompt_length(current_chunk, tokenizer) + sentence_length <= chunk_size:
            current_chunk += sentence
        else:
            if current_chunk:
                if get_prompt_length(current_chunk, tokenizer) <= chunk_size:
                    chunks.append(current_chunk)
                else:
                    if spliter != " ":  # Avoid infinite loops
                        chunks.extend(
                            split_into_granular_chunks(
                                current_chunk,
                                chunk_size=chunk_size,
                                tokenizer=tokenizer,
                                spliter=" ",
                            )
                        )
            current_chunk = sentence

    if current_chunk != "":
        if get_prompt_length(current_chunk, tokenizer) <= chunk_size:
            chunks.append(current_chunk)
        else:
            if spliter != " ":  # Avoid infinite loops
                chunks.extend(
                    split_into_granular_chunks(
                        current_chunk,
                        chunk_size=chunk_size,
                        tokenizer=tokenizer,
                        spliter=" ",
                    )
                )

    # If last chunk too short, re-balance the last two chunks
    if len(chunks) > 1 and get_prompt_length(chunks[-1], tokenizer) < chunk_size // 2:
        last_chunk = chunks.pop()
        penultimate_chunk = chunks.pop()
        combined_text = penultimate_chunk + last_chunk

        new_sentences = split_sentences(combined_text, spliter)

        # Reallocate sentence using double pointer
        new_penultimate_chunk = ""
        new_last_chunk = ""
        start, end = 0, len(new_sentences) - 1

        while start <= end and len(new_sentences) != 1:
            flag = False
            if (
                get_prompt_length(
                    new_penultimate_chunk + new_sentences[start], tokenizer
                )
                <= chunk_size
            ):
                flag = True
                new_penultimate_chunk += new_sentences[start]
                if start == end:
                    break
                start += 1
            if (
                get_prompt_length(new_last_chunk + new_sentences[end], tokenizer)
                <= chunk_size
            ):
                new_last_chunk = new_sentences[end] + new_last_chunk
                end -= 1
                flag = True
            if flag == False:
                break
        if start < end:
            # If there is any unallocated part, split it by punctuation or space and then allocate it
            remaining_sentences = new_sentences[start : end + 1]
            if remaining_sentences:
                remaining_text = "".join(remaining_sentences)
                words = remaining_text.split(" ")
                end_index = len(words) - 1
                for index, w in enumerate(words):
                    if (
                        get_prompt_length(
                            " ".join([new_penultimate_chunk, w]), tokenizer
                        )
                        <= chunk_size
                    ):
                        new_penultimate_chunk = " ".join([new_penultimate_chunk, w])
                    else:
                        end_index = index
                        break
                if end_index != len(words) - 1:
                    new_last_chunk = " ".join(words[end_index:]) + " " + new_last_chunk
        if len(new_sentences) == 1:
            chunks.append(penultimate_chunk)
            chunks.append(last_chunk)
        else:
            chunks.append(new_penultimate_chunk)
            chunks.append(new_last_chunk)

    return chunks
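A small, hedged usage sketch of the helpers above: it assumes the `transformers` tokenizer named in `config.py` is available (any `AutoTokenizer` would do) and uses a toy document, so the chunk count and lengths are illustrative only.

```python
# Minimal usage sketch for chunk_context / get_prompt_length on a toy document.
from transformers import AutoTokenizer

from longcepo.chunking import chunk_context, get_prompt_length

# Any Hugging Face tokenizer works; this one mirrors the default in longcepo/config.py.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Maverick-17B-128E-Instruct")

document = "\n".join(f"Paragraph {i}: " + "lorem ipsum " * 50 for i in range(20))
chunks = chunk_context(document, chunk_size=256, tokenizer=tokenizer)

for i, chunk in enumerate(chunks):
    print(i, get_prompt_length(chunk, tokenizer, no_special_tokens=True))
```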
longcepo/config.py
ADDED
@@ -0,0 +1,36 @@
from dataclasses import dataclass

from .prompts import (
    MAPREDUCE_SYSTEM_PROMPT,
    QUERY_FORMAT_PROMPT,
    PLANNING_SYSTEM_PROMPT,
    MAP_PROMPT,
    REDUCE_PROMPT,
    COLLAPSE_PROMPT,
    SUMMARY_PROMPT,
)


@dataclass
class LongCepoConfig:
    temperature_plan: float = 0.7  # Temperature for planning stage
    temperature_map: float = 0.7  # Temperature for map stage
    temperature_collapse: float = 0.7  # Temperature for collapse stage
    temperature_reduce: float = 0.7  # Temperature for reduce stage

    chunk_size: int = 4096  # Max tokens per chunk when splitting context
    max_output_tokens: int = 1024  # Max output tokens per LLM API call (except for summary generation)
    max_context_window: int = 8192  # Total model context window available
    max_output_tokens_summary: int = 300  # Max output tokens per LLM API call (summary generation)
    num_neighbor_summaries: int = 5  # Number of adjacent summaries from before/after in the context included in mapping stage

    system_prompt: str = MAPREDUCE_SYSTEM_PROMPT  # System prompt used in map/collapse/reduce stages
    summary_prompt: str = SUMMARY_PROMPT  # Prompt template for generating summaries in map phase
    map_prompt: str = MAP_PROMPT  # Prompt template for map stage
    collapse_prompt: str = COLLAPSE_PROMPT  # Prompt template for collapse stage
    reduce_prompt: str = REDUCE_PROMPT  # Prompt template for reduce stage
    query_format_prompt: str = QUERY_FORMAT_PROMPT  # Query normalization step prompt
    planning_system_prompt: str = PLANNING_SYSTEM_PROMPT  # Planning stage prompt

    context_query_delimiter: str = "<CONTEXT_END>"  # Delimiter used to split initial input into context and query
    tokenizer_name: str = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"  # Tokenizer to use to determine token lengths
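Because `LongCepoConfig` is a plain dataclass, the pipeline can be re-tuned simply by constructing it with overrides. A short sketch follows; the values are illustrative, not recommendations.

```python
# Illustrative overrides of the defaults defined above.
from longcepo.config import LongCepoConfig

config = LongCepoConfig(
    chunk_size=2048,           # smaller chunks mean more map calls but finer-grained summaries
    max_context_window=8192,   # stay within the target model's context window
    num_neighbor_summaries=3,  # fewer neighboring summaries injected per chunk
)
print(config.context_query_delimiter, config.tokenizer_name)
```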
longcepo/main.py
ADDED
@@ -0,0 +1,109 @@
import re
from typing import Tuple
from functools import partial

from .mapreduce import mapreduce
from .utils import (
    get_prompt_response,
    logger,
    longcepo_init,
    loop_until_match,
)


def run_longcepo(
    system_prompt: str, initial_query: str, client, model: str
) -> Tuple[str, int]:
    """
    Executes the full LongCePO multi-stage pipeline to answer a complex query from long context.

    The pipeline includes:
    - Normalizing the format of the original query
    - Generating a plan of sub-questions
    - Iteratively answering each sub-question using a MapReduce-style question-answering engine
    - Aggregating QA history and producing a final answer with MapReduce

    Args:
        system_prompt (str): System prompt string.
        initial_query (str): Raw input string containing context and query separated by delimiter string.
        client: LLM API client instance.
        model (str): Base model name.

    Returns:
        Tuple[str, int]: Final answer and total number of tokens used across the pipeline.
    """
    context, query, tokenizer, cb_log, longcepo_config = longcepo_init(initial_query)

    # Normalize query
    normalized_query, upd_log = get_prompt_response(
        client,
        model,
        longcepo_config.query_format_prompt.format(full_query=query),
        system_prompt,
        max_tokens=longcepo_config.max_output_tokens,
    )
    cb_log.update(upd_log)
    logger.info(f"Normalized query: {normalized_query}")

    # Planning stage
    prompt = f"The question is: {normalized_query}"
    gen_fn = partial(
        get_prompt_response,
        client=client,
        model=model,
        prompt=prompt,
        system_prompt=longcepo_config.planning_system_prompt,
        max_tokens=longcepo_config.max_output_tokens,
    )
    planning_response, upd_log = loop_until_match(
        gen_fn, pattern_list=("<SUB-QUESTIONS>",)
    )
    logger.info(f"Planning stage output:\n\n{planning_response}")
    questions = (
        re.findall(
            r"<SUB-QUESTIONS>\s*(.*?)\s*</SUB-QUESTIONS>", planning_response, re.DOTALL
        )[0]
        .strip()
        .splitlines()
    )

    # Loop to answer sub-queries from the plan
    qa_system_prompt = (
        longcepo_config.system_prompt
        if longcepo_config.system_prompt is not None
        else system_prompt
    )
    qa_history = ""
    for question in questions:
        if not question:
            continue
        question = re.sub(r"^\d+\.", "", question)
        answer, cb_log = mapreduce(
            qa_system_prompt,
            question,
            context,
            qa_history,
            client,
            model,
            tokenizer,
            longcepo_config=longcepo_config,
            cb_log=cb_log,
        )
        qa_history += f"- Previous question: {question}\n\n"
        answer = re.sub(r"^:+", "", answer)
        qa_history += f"- Previous answer: {answer}\n\n"
        logger.info(f"QA history:\n\n{qa_history}")

    # Final answer generation
    answer, cb_log = mapreduce(
        qa_system_prompt,
        query,
        context,
        qa_history,
        client,
        model,
        tokenizer,
        longcepo_config=longcepo_config,
        cb_log=cb_log,
    )
    return answer, cb_log["total_tokens"]
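The planning stage depends on the model wrapping its sub-questions in `<SUB-QUESTIONS>` tags, as requested by `PLANNING_SYSTEM_PROMPT` in `prompts.py`. A tiny self-contained check of the extraction logic used above, run on a made-up planning response:

```python
# Stand-alone illustration of the <SUB-QUESTIONS> parsing performed in run_longcepo.
import re

planning_response = """In order to answer this question, we need to ask the following sub-questions:
<SUB-QUESTIONS>
1. What is the length of Danny's tennis career?
2. What is the length of Alice's tennis career?
</SUB-QUESTIONS>"""

questions = (
    re.findall(r"<SUB-QUESTIONS>\s*(.*?)\s*</SUB-QUESTIONS>", planning_response, re.DOTALL)[0]
    .strip()
    .splitlines()
)
questions = [re.sub(r"^\d+\.", "", q).strip() for q in questions if q]
print(questions)
# ["What is the length of Danny's tennis career?", "What is the length of Alice's tennis career?"]
```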
longcepo/mapreduce.py
ADDED
@@ -0,0 +1,281 @@
from functools import partial
from typing import Tuple, List

from .utils import (
    CBLog,
    LongCepoConfig,
    get_prompt_response,
    concurrent_map,
    logger,
    loop_until_match,
)
from .chunking import (
    chunk_context,
    get_prompt_length,
)

format_chunk_list = lambda chunk_list: [
    f"Information of Chunk {index}:\n{doc}\n" for index, doc in enumerate(chunk_list)
]


def remove_chunks(chunks: List[str], irrelevance_tags: Tuple[str]) -> List[str]:
    """
    Filter out chunks that contain at least one of irrelevance tags.
    """
    new_chunks = []
    for chunk in chunks:
        # Skip None values resulting from failed API calls
        if chunk is None:
            continue
        flag = False
        for tag in irrelevance_tags:
            # Ensure tag comparison is safe even if tag is None (though unlikely)
            if tag and tag.upper() in chunk.upper():
                flag = True
                break
        if not flag:
            new_chunks.append(chunk)
    return new_chunks


def mapreduce(
    system_prompt: str,
    query: str,
    context: str,
    qa_history: str,
    client,
    model: str,
    tokenizer,
    longcepo_config: LongCepoConfig,
    cb_log: CBLog,
    answer_tags: Tuple[str] = ("Answer:", "**Answer**:", "**Answer**"),
    irrelevance_tags: Tuple[str] = ("[NO INFORMATION]",),
) -> Tuple[str, CBLog]:
    """
    Executes a MapReduce-style inference pipeline to answer a query from long context.

    The function splits the input context into chunks, summarizes and evaluates each with the model (Map),
    collapses intermediate answers to reduce redundancy, and then generates a final answer (Reduce).
    Irrelevant responses are filtered based on `irrelevance_tags`.

    Args:
        system_prompt (str): System prompt string.
        query (str): User query.
        context (str): Long-form input context to process.
        qa_history (str): QA history string for prompt injection.
        client: LLM API client.
        model (str): Base model name.
        tokenizer: Tokenizer to compute token lengths.
        longcepo_config (LongCepoConfig): Config with hyper-parameters and token limits.
        cb_log (CBLog): Log object for tracking model calls.
        answer_tags (Tuple[str]): Tags used to extract the final answer from model output.
        irrelevance_tags (Tuple[str]): Tags used to identify and remove irrelevant outputs.

    Returns:
        Tuple[str, CBLog]: Final extracted answer and updated log object.
    """

    logger.info(f"MapReduce query: {query}")

    qa_history_stub = (
        f"\n\nAnswers to related questions:\n\n{qa_history}" if qa_history else ""
    )

    context_chunks = chunk_context(context, longcepo_config.chunk_size, tokenizer)

    # Get short summaries of each chunk
    def fetch_chunk_summary(client, model, chunk, query, system_prompt):
        return get_prompt_response(
            client,
            model,
            longcepo_config.summary_prompt.format(question=query, context=chunk),
            system_prompt,
            max_tokens=longcepo_config.max_output_tokens_summary,
            temperature=longcepo_config.temperature_map,
        )

    summaries, cb_log = concurrent_map(
        fetch_chunk_summary,
        client,
        model,
        context_chunks,
        query,
        system_prompt,
        cb_log,
    )
    num_summaries = longcepo_config.num_neighbor_summaries
    # For each chunk position, get a neighborhood of `num_summaries` before and after the position
    summaries_per_chunk = [
        "\n\n".join(
            summaries[
                max(0, (summary_idx - num_summaries)) : min(
                    len(summaries) - 1, (summary_idx + num_summaries)
                )
            ]
        )
        for summary_idx in range(len(summaries))
    ]

    # Map stage
    def fetch_map_response(client, model, chunk, query, system_prompt, summary):
        return get_prompt_response(
            client,
            model,
            longcepo_config.map_prompt.format(
                question=query,
                context=chunk,
                summary=summary,
                qa_history_stub=qa_history_stub,
            ),
            system_prompt,
            max_tokens=longcepo_config.max_output_tokens,
            temperature=longcepo_config.temperature_map,
        )

    result, cb_log = concurrent_map(
        fetch_map_response,
        client,
        model,
        context_chunks,
        query,
        system_prompt,
        cb_log,
        summaries_per_chunk=summaries_per_chunk,
    )
    result = remove_chunks(result, irrelevance_tags)
    if not result:
        return "No information", cb_log

    logger.info(
        f"Removed {len(context_chunks) - len(result)} chunks from total {len(context_chunks)} chunks"
    )

    # Collapse stage
    result, cb_log = collapse_chunks(
        client,
        model,
        result,
        query,
        system_prompt,
        qa_history_stub,
        tokenizer,
        cb_log,
        longcepo_config,
    )
    result = remove_chunks(result, irrelevance_tags)
    if not result:
        return "No information", cb_log

    # Reduce stage
    prompt = longcepo_config.reduce_prompt.format(
        question=query,
        context=format_chunk_list(result),
        qa_history_stub=qa_history_stub,
    )
    gen_fn = partial(
        get_prompt_response,
        client=client,
        model=model,
        prompt=prompt,
        system_prompt=system_prompt,
        max_tokens=longcepo_config.max_output_tokens,
        temperature=longcepo_config.temperature_reduce,
    )
    reduce_result, upd_log = loop_until_match(gen_fn, answer_tags)
    cb_log.update(upd_log)

    final_answer = reduce_result
    for answer_tag in answer_tags:
        if answer_tag in reduce_result:
            final_answer = reduce_result.split(answer_tag)[-1].strip()
            break

    return final_answer, cb_log


def collapse_chunks(
    client,
    model: str,
    context_chunks: List[str],
    query: str,
    system_prompt: str,
    qa_history_stub: str,
    tokenizer,
    cb_log: CBLog,
    longcepo_config: LongCepoConfig,
) -> Tuple[List[str], CBLog]:
    """
    Collapses context chunk pairs in a sliding window until the total token count fits within the context window.

    Args:
        client: LLM API client.
        model (str): Base model name.
        context_chunks (List[str]): Input context chunks.
        query (str): User query.
        system_prompt (str): System prompt string.
        qa_history_stub (str): QA history prefix.
        tokenizer: Tokenizer to compute token lengths.
        cb_log (CBLog): Log object for tracking model calls.
        longcepo_config (LongCepoConfig): Config with hyper-parameters and token limits.

    Returns:
        Tuple[List[str], CBLog]: Final context chunks and updated logs.
    """
    num_tokens = get_prompt_length(format_chunk_list(context_chunks), tokenizer)
    token_budget = (
        longcepo_config.max_context_window
        - get_prompt_length(longcepo_config.collapse_prompt, tokenizer)
        - longcepo_config.max_output_tokens
    )
    logger.info(f"Pre-collapse length of chunks {num_tokens}, allowed {token_budget}")

    def fetch_collapse_response(client, model, docs, query, system_prompt):
        if len(docs) == 1:
            return docs[0], CBLog()
        return get_prompt_response(
            client,
            model,
            longcepo_config.collapse_prompt.format(
                question=query,
                context="\n\n".join(docs),
                qa_history_stub=qa_history_stub,
            ),
            system_prompt,
            max_tokens=longcepo_config.max_output_tokens,
            temperature=longcepo_config.temperature_collapse,
        )

    merge_pair_idx = 0
    collapse_step = 0
    while num_tokens is not None and num_tokens > token_budget:
        logger.info(f"Length at collapse stage {collapse_step}: {num_tokens}")

        if len(context_chunks) == 1:
            logger.info(f"Post-collapse length of chunks {num_tokens}")
            return context_chunks, cb_log

        # Merge chunk pair in a sliding window (merge_pair_idx:merge_pair_idx+2)
        chunk_groups = (
            [(context_chunks[i],) for i in range(merge_pair_idx)]
            + [(context_chunks[merge_pair_idx], context_chunks[merge_pair_idx + 1])]
            + [
                (context_chunks[i],)
                for i in range(merge_pair_idx + 2, len(context_chunks))
            ]
        )
        context_chunks, cb_log = concurrent_map(
            fetch_collapse_response,
            client,
            model,
            chunk_groups,
            query,
            system_prompt,
            cb_log,
        )
        merge_pair_idx = (merge_pair_idx + 1) % max(len(context_chunks) - 1, 1)
        num_tokens = get_prompt_length(format_chunk_list(context_chunks), tokenizer)
        collapse_step += 1

    logger.info(f"Post-collapse length of chunks {num_tokens}")
    return context_chunks, cb_log
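The collapse loop above merges exactly one adjacent pair of chunks per iteration and advances the merge position cyclically. This toy, LLM-free run of the same `chunk_groups` construction and `merge_pair_idx` update shows how the window slides; plain string concatenation stands in for the collapse prompt call.

```python
# Toy illustration of the sliding-window pairing used in collapse_chunks (no LLM calls).
chunks = ["A", "B", "C", "D", "E"]
merge_pair_idx = 0

for step in range(3):
    chunk_groups = (
        [(chunks[i],) for i in range(merge_pair_idx)]
        + [(chunks[merge_pair_idx], chunks[merge_pair_idx + 1])]
        + [(chunks[i],) for i in range(merge_pair_idx + 2, len(chunks))]
    )
    # Stand-in for fetch_collapse_response: singleton groups pass through, pairs get merged.
    chunks = ["".join(group) for group in chunk_groups]
    merge_pair_idx = (merge_pair_idx + 1) % max(len(chunks) - 1, 1)
    print(step, chunks)

# 0 ['AB', 'C', 'D', 'E']
# 1 ['AB', 'CD', 'E']
# 2 ['ABCD', 'E']
```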
longcepo/prompts.py
ADDED
@@ -0,0 +1,16 @@
# Code (Map/Collapse/Reduce prompts) modified from https://github.com/thunlp/LLMxMapReduce under Apache 2.0
# MapReduce system prompt optimized for use with Llama3.3-70B-Instruct with an OPRO-like procedure

MAPREDUCE_SYSTEM_PROMPT = """You are globally celebrated as a preeminent expert in the field of digital document analysis and synthesis, known for your unmatched precision in transforming fragmented texts into comprehensive and insightful responses. Always respond in the user\'s language, ensuring every interaction is informed by all preceding exchanges for complete contextual understanding.\n\nIn your initial message, confidently declare your credentials with a phrase such as: "As a world-renowned specialist in [specific field], honored with the [real prestigious local award]," replacing placeholders with authentic information from your domain.\n\nAdhere strictly to these principles with each document segment or query:\n\n1. Extract every critical piece of information, nuance, and context with meticulous attention to detail.\n2. Organize your analysis methodically, presenting specific examples, data, and verifiable facts clearly and logically.\n3. Cease your response abruptly if approaching character limits, awaiting the user\'s "continue" instruction to carry on.\n4. Anchor every insight and conclusion in provided content or universally accepted truths, strictly avoiding speculation or unfounded statements.\n5. Communicate with a professional yet approachable tone, reflecting profound expertise and clarity.\n\nRecognize the real-world impact of your insights; ensure each response is seamlessly integrated, richly detailed, and impeccably reliable. Rigorously observe these guidelines to offer authoritative and precise analysis and synthesis."""

QUERY_FORMAT_PROMPT = """Given the below blurb, can you help identify only the question we want to answer? The blurb might contain other information such as -- format for final answer, multiple choices for the final answer, context, general directions about how to behave as an AI assistant etc. Please remove all of that and just faithfully copy out the question. The blurb is:\n\n{full_query}.\n\nDo not attempt to answer the question, ignore formatting instructions in the blurb, if any."""

SUMMARY_PROMPT = """You are provided with a portion of an article and a question. Read the article portion and follow my instructions to process it.\n\nArticle:\nThe article begins as follows:\n{context}\nThe article concludes here.\n\nQuestion:\n{question}\n\nInstructions: Please just write a 2-3 sentence summary for the provided passage. Do not answer the question or write anything else."""

PLANNING_SYSTEM_PROMPT = """As an intelligent assistant, your primary objective is to answer a user question as accurately as possible given a long article. The full article is too long to fit in your context window, and to facilitate your answering objective, a reading agent has been created that can process the article chunk by chunk and answer question about it. You can ask the reading agent any question you need to answer the user's question or to use it for clarification. The first step for you is to make a rational plan based on the question. The plan should consist of sub-questions you should ask to the reading agent that you need to know the answers to in order to answer the user's question. This plan should outline the step-by-step process to resolve the question and specify the key information required to formulate a comprehensive answer. The reader agent can make mistakes.\nExample:\n#####\nUser: Who had a longer tennis career, Danny or Alice?\nAssistant: In order to answer this question, we need to ask the following sub-questions:\n<SUB-QUESTIONS>\n1. What is the length of Danny’s tennis career (their start and retirement)?\n2. What is the length of Alice’s tennis career (their start and retirement)?\n</SUB-QUESTIONS>\n#####\nPlease strictly follow the above format. You must include the <SUB-QUESTIONS> tags. Let’s begin."""

MAP_PROMPT = """You are provided with a portion of an article, short summaries of related portions if any, and a question. Read the article portion and follow my instructions to process it.\n\nArticle:\nThe article begins as follows:\n{context}\nThe article concludes here.\n\nPrevious portion summaries:{summary}{qa_history_stub}\n\nQuestion:\n{question}\n\nInstructions:\n\nPlease extract information from the provided passage to try and answer the given question. Note that you only have a part of the entire text, so the information you obtain might not fully answer the question. Therefore, provide your rationale for using the extracted information to answer the question and include a confidence score. The following is some assigning scoring cases: <Text: [Jerry is 18 years old this year. He can swim and wants to be an athlete.]. assigning scoring: [Jerry can swim, 5 points; Jerry will become an athlete in the future, 3.5 points; Jerry will become a swimming athlete in the future, 3 points;Jerry is strong,3 points; Jerry can play chess, 0 points;Jerry likes talking,0 points]>. Follow these steps:\n\n1. Extract Relevant Information: Identify and highlight the key pieces of information from the passage that are relevant to the given question.\n2. Provide a Rationale: Analyze the extracted information and explain how it can be used to address the question. If the information is incomplete, discuss any assumptions or inferences you need to make.\n3. Answer the Question: Based on your rationale, provide the best possible answer to the question. If, after providing your rationale, you believe the passage does not contain any information to solve the question, output "[NO INFORMATION]" as the answer.\n4. Assign a Confidence Score: Assign a confidence score (out of 5) to your answer based on the completeness and reliability of the extracted information and your rationale process.\nPlease follow this format:\n\nExtracted Information:\nRationale:\nAnswer:\nConfidence Score:"""

COLLAPSE_PROMPT = """You need to process a task with a long context that greatly exceeds your context limit. The only feasible way to handle this is by processing the long context chunk by chunk. You are provided with a question and some information extracted from each chunk. Each piece of information contains Extracted Information, Rationale, Answer, and a Confidence Score. The following is some assigning scoring cases: <Text: [Jerry is 18 years old this year. He can swim and wants to be an athlete.]. assigning scoring: [Jerry can swim, 5 points; Jerry will become an athlete in the future, 3.5 points; Jerry will become a swimming athlete in the future, 3 points;Jerry is strong,3 points; Jerry can play chess, 0 points;Jerry likes talking,0 points]>. Read the information and follow my instructions to process it.\n\nExtracted Information:\nThe extracted information begins as follows:\n{context}\nThe extracted information concludes here.{qa_history_stub}\n\nQuestion:\n{question}\n\nInstruction:\n\nIntegrate the extracted information and then reason through the following steps:\n\n1. Integrate Extracted Information: Collect and summarize all the evidence relevant to solving the question. Consider the confidence scores of each piece of extracted information to weigh their reliability. Higher confidence scores should be given more importance in your summary.\n2. Analyze: Re-analyze the question based on the summarized information. Use the confidence scores to determine the reliability of different pieces of information, giving more weight to information with higher confidence scores.\n3. Answer the Question: Provide the best possible answer based on the updated information. If, after providing your rationale, you believe the passage does not contain any information to solve the question, output "[NO INFORMATION]" as the answer. Use the confidence scores to support the reliability of your final answer, prioritizing higher confidence information.\n4. Assign Confidence Score: Give a confidence score (out of 5) for your final answer based on the completeness and reliability of the updated information and your rationale process.\nConsider the initial confidence scores of the integrated information to determine your final confidence score.\nPlease follow this format:\n\nExtracted Information:\nRationale:\nAnswer:\nConfidence Score:"""

REDUCE_PROMPT = """You need to process a task with a long context that greatly exceeds your context limit. The only feasible way to handle this is by processing the long context chunk by chunk. You are provided with a question and some information extracted from each chunk. Each piece of information contains Extracted Information, Rationale, Answer, and a Confidence Score. The following is some assigning scoring cases: <Text: [Jerry is 18 years old this year. He can swim and wants to be an athlete.]. assigning scoring: [Jerry can swim, 5 points; Jerry will become an athlete in the future, 3.5 points; Jerry will become a swimming athlete in the future, 3 points;Jerry is strong,3 points; Jerry can play chess, 0 points;Jerry likes talking,0 points]>. Read the information and follow my instructions to process it.{qa_history_stub}\n\nQuestion:\n{question}\n\nInformation from chunks:\n{context}\n\nEach chunk provides extracted information related to the same question, but due to partial data, conclusions from each chunk might vary. Your role is to integrate and reason through this information, weighing confidence scores to resolve any inconsistencies. Then provide the final answer.\n\nPlease follow this format:\n\nRationale:\nAnswer:"""
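Each template above is consumed via `str.format` in `mapreduce.py` and `main.py`. A quick sketch of which placeholders the map template expects; the filled-in values are placeholders only.

```python
# Fields expected by the MAP_PROMPT template (placeholder values).
from longcepo.prompts import MAP_PROMPT

filled = MAP_PROMPT.format(
    context="<one document chunk>",
    summary="<concatenated summaries of neighboring chunks>",
    qa_history_stub="",  # empty before any sub-question has been answered
    question="<current sub-question>",
)
print(filled[:200])
```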
longcepo/utils.py
ADDED
@@ -0,0 +1,191 @@
+import logging
+from typing import Callable, List, Optional, Tuple
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+from transformers import AutoTokenizer, PreTrainedTokenizerBase
+from .config import LongCepoConfig
+
+logger = logging.getLogger(__name__)
+
+
+class CBLog(dict):
+    """Object for logging the number of LLM calls and tokens used in the pipeline"""
+
+    __allowed_keys__ = {"total_tokens", "completion_tokens", "llm_calls"}
+
+    def __init__(self, *args, **kwargs):
+        super().__init__()
+        self.update(*args, **kwargs)
+
+    def __setitem__(self, key, value):
+        if key not in self.__allowed_keys__:
+            raise KeyError(
+                f"Key '{key}' not allowed. Allowed keys: {self.__allowed_keys__}"
+            )
+        if not isinstance(value, int):
+            raise TypeError(
+                f"Value for '{key}' must be int, got {type(value).__name__}"
+            )
+        super().__setitem__(key, value)
+
+    def update(self, other=None, **kwargs):
+        updates = {}
+        if other:
+            if isinstance(other, dict):
+                updates.update(other)
+            else:
+                updates.update(dict(other))
+        updates.update(kwargs)
+
+        for key, value in updates.items():
+            if key not in self.__allowed_keys__:
+                raise KeyError(
+                    f"Key '{key}' not allowed. Allowed keys: {self.__allowed_keys__}"
+                )
+            if not isinstance(value, int):
+                raise TypeError(
+                    f"Value for '{key}' must be int, got {type(value).__name__}"
+                )
+            self[key] = self.get(key, 0) + value
+
+
+def concurrent_map(
+    gen_function: Callable,
+    client,
+    model: str,
+    context_chunks: List[str],
+    query: str,
+    system_prompt: str,
+    cb_log: CBLog,
+    summaries_per_chunk: Optional[List[str]] = None,
+    workers: int = 16,
+) -> Tuple[List[str], CBLog]:
+    """
+    Runs `gen_function` concurrently over a list of context chunks.
+
+    Args:
+        gen_function (Callable): Function to call with each chunk and associated arguments.
+        client: LLM API client.
+        model (str): Base model name.
+        context_chunks (List[str]): Input context chunks.
+        query (str): User query.
+        system_prompt (str): System prompt string.
+        cb_log (CBLog): Log object for tracking model calls.
+        summaries_per_chunk (Optional[List[str]]): Concatenated neighbor summaries for each chunk.
+        workers (int): Number of threads to use.
+
+    Returns:
+        Tuple[List[str], CBLog]: List of responses (in original order) and updated log object.
+    """
+    result = [None] * len(context_chunks)
+    wrapped_gen_function = lambda index, *args: (index, gen_function(*args))
+    with ThreadPoolExecutor(max_workers=workers) as executor:
+        future_to_idx = {}
+        for idx, chunk in enumerate(context_chunks):
+            args = [client, model, chunk, query, system_prompt]
+            if summaries_per_chunk is not None:
+                args.append(summaries_per_chunk[idx])
+            future_to_idx[executor.submit(wrapped_gen_function, idx, *args)] = idx
+
+        for future in as_completed(future_to_idx):
+            try:
+                index, (response, upd_log) = future.result()
+                result[index] = response
+                cb_log.update(upd_log)
+            except Exception as e:
+                logger.error(f"Error processing chunk: {e}")
+    return result, cb_log
+
+
+def get_prompt_response(
+    client,
+    model: str,
+    prompt: str,
+    system_prompt: str,
+    max_tokens: int,
+    temperature: float = 0.7,
+    top_p: float = 0.7,
+):
+    """
+    Helper function that sends a prompt to the chat-based LLM API and returns the generated response along with usage logging.
+
+    Args:
+        client: LLM API client.
+        model (str): Base model name.
+        prompt (str): The user prompt to send.
+        system_prompt (str): System prompt string.
+        max_tokens (int): Maximum number of tokens in the response.
+        temperature (float): Sampling temperature for randomness (default: 0.7).
+        top_p (float): Cumulative probability cutoff for token selection (default: 0.7).
+
+    Returns:
+        Tuple[str, CBLog]: The model's response text and a CBLog object tracking token usage.
+    """
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": prompt},
+    ]
+    response = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        max_tokens=max_tokens,
+        top_p=top_p,
+        temperature=temperature,
+        stream=False,
+    )
+    upd_log = CBLog(
+        llm_calls=1,
+        total_tokens=response.usage.total_tokens,
+        completion_tokens=response.usage.completion_tokens,
+    )
+    return response.choices[0].message.content, upd_log
+
+
+def loop_until_match(
+    function: Callable, pattern_list: Tuple[str], num_attempts: int = 10
+):
+    """
+    Repeatedly calls a function until its output matches one of the given patterns or max attempts is reached.
+
+    Args:
+        function (Callable): Function returning (answer: str, cb_log).
+        pattern_list (Tuple[str]): Patterns to match in the answer.
+        num_attempts (int): Max number of attempts (default: 10).
+
+    Returns:
+        Tuple[str, Any]: The matching answer and its corresponding log object.
+    """
+    correct_format = False
+    for _ in range(num_attempts):
+        answer, cb_log = function()
+
+        for pattern in pattern_list:
+            if pattern in answer:
+                correct_format = True
+
+        if correct_format:
+            break
+
+        logger.info("Wrong output formatting, retrying...")
+
+    return answer, cb_log
+
+
+def longcepo_init(
+    initial_query: str,
+) -> Tuple[str, str, PreTrainedTokenizerBase, CBLog, LongCepoConfig]:
+    """
+    Initializes context, query, tokenizer, logging, and config from an input string.
+
+    Args:
+        initial_query (str): Input string containing context and query separated by a delimiter string.
+
+    Returns:
+        Tuple[str, str, PreTrainedTokenizerBase, CBLog, LongCepoConfig]:
+            Parsed context, query, tokenizer instance, log object, and LongCePO config.
+    """
+    cb_log = CBLog()
+    config = LongCepoConfig()
+    context, query = initial_query.split(config.context_query_delimiter)
+    tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name, model_max_length=config.max_context_window)
+    return context.strip(), query.strip(), tokenizer, cb_log, config
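A brief usage sketch of the helpers above. Everything here is illustrative: the client setup mirrors run_chatbot.py further down, while summarize_chunk, the chunk list, and the query are hypothetical stand-ins for what the mapreduce stage would pass in.

# Illustrative only: running a per-chunk generation function over chunks with concurrent_map.
import openai
from longcepo.utils import CBLog, concurrent_map, get_prompt_response

client = openai.OpenAI(api_key="...", base_url="https://api.sambanova.ai/v1")
model = "Llama-4-Maverick-17B-128E-Instruct"

def summarize_chunk(client, model, chunk, query, system_prompt):
    # Matches the (client, model, chunk, query, system_prompt) argument order that
    # concurrent_map uses when summaries_per_chunk is not provided.
    prompt = f"Question: {query}\n\nChunk:\n{chunk}\n\nExtract the facts relevant to the question."
    return get_prompt_response(client, model, prompt, system_prompt, max_tokens=512)

cb_log = CBLog()
chunks = ["first chunk of the long document ...", "second chunk ..."]
responses, cb_log = concurrent_map(
    summarize_chunk, client, model, chunks,
    "What happened in 1937?", "You are a helpful assistant.", cb_log,
)
print(responses)  # per-chunk outputs, in the original chunk order
print(cb_log)     # accumulated llm_calls / total_tokens / completion_tokens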
requirements.txt
CHANGED
@@ -1 +1,5 @@
-
+openai==1.76.0
+transformers==4.51.3
+torch==2.7.0
+accelerate==1.6.0
+gradio==5.27.1
run_chatbot.py
ADDED
@@ -0,0 +1,64 @@
+import os
+import openai
+from longcepo.main import run_longcepo
+# from config_sambanova import SAMBANOVA_API_KEY # Removed import
+
+# Configure Sambanova client
+# Read API key from environment variable (set as Hugging Face Secret)
+SAMBANOVA_API_KEY = os.environ.get("SAMBANOVA_API_KEY")
+
+if SAMBANOVA_API_KEY:
+    # Strip potential leading/trailing whitespace and newlines
+    SAMBANOVA_API_KEY = SAMBANOVA_API_KEY.strip()
+
+if not SAMBANOVA_API_KEY:
+    raise ValueError("Sambanova API key not found or is empty. Please set the SAMBANOVA_API_KEY environment variable or Hugging Face Secret.")
+
+client = openai.OpenAI(
+    api_key=SAMBANOVA_API_KEY,
+    base_url="https://api.sambanova.ai/v1",
+)
+
+# Define the model to use
+SAMBANOVA_MODEL = "Llama-4-Maverick-17B-128E-Instruct"
+
+def process_with_longcepo(system_prompt: str, initial_query: str):
+    """Processes a query using the modified LongCePO plugin with Sambanova backend."""
+    print(f"Processing query with LongCePO using model: {SAMBANOVA_MODEL}")
+    try:
+        # Call the core LongCePO logic, passing the configured client and model
+        answer, total_tokens = run_longcepo(
+            system_prompt=system_prompt,
+            initial_query=initial_query,
+            client=client,
+            model=SAMBANOVA_MODEL
+        )
+        print(f"LongCePO finished. Total tokens used: {total_tokens}")
+        return answer
+    except Exception as e:
+        print(f"Error during LongCePO processing: {e}")
+        # Print traceback for more detailed debugging
+        import traceback
+        traceback.print_exc()
+        return f"An error occurred: {e}"
+
+# Example usage (for testing purposes)
+if __name__ == "__main__":
+    test_system_prompt = "You are a helpful assistant designed to answer questions based on the provided context."
+    # Provide some dummy context and a slightly more complex query
+    dummy_context = """
+    Paris is the capital and most populous city of France. It is known for its art, fashion, gastronomy and culture.
+    Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine.
+    Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré.
+    The Louvre Museum houses Da Vinci's Mona Lisa. The Musée d'Orsay has Impressionist and Post-Impressionist masterpieces.
+    France is a country in Western Europe. It borders Belgium, Luxembourg, Germany, Switzerland, Monaco, Italy, Andorra, and Spain.
+    The official language is French.
+    """
+    test_query = "Based on the provided text, what are the main attractions in Paris and what countries does France border?"
+    # Combine context and query with the expected delimiter
+    test_initial_query = f"{dummy_context}<CONTEXT_END>{test_query}"
+
+    print("Running test query...")
+    result = process_with_longcepo(test_system_prompt, test_initial_query)
+    print(f"\nTest Result:\n{result}")
+
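Finally, a hedged sketch of how process_with_longcepo could be exposed through Gradio. This is not the repository's app.py; the interface layout, the labels, and the <CONTEXT_END> constant are assumptions for illustration (the delimiter must match LongCepoConfig.context_query_delimiter, which longcepo_init splits on).

# Illustrative only: a minimal Gradio wrapper around process_with_longcepo.
import gradio as gr
from run_chatbot import process_with_longcepo

CONTEXT_QUERY_DELIMITER = "<CONTEXT_END>"  # assumed to match LongCepoConfig.context_query_delimiter

def answer(system_prompt, context, query):
    # Combine context and query the same way the __main__ test above does.
    initial_query = f"{context}{CONTEXT_QUERY_DELIMITER}{query}"
    return process_with_longcepo(system_prompt, initial_query)

demo = gr.Interface(
    fn=answer,
    inputs=[
        gr.Textbox(label="System Prompt (optional)"),
        gr.Textbox(label="Context", lines=20),
        gr.Textbox(label="Query"),
    ],
    outputs=gr.Textbox(label="Answer"),
)

if __name__ == "__main__":
    demo.launch()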