Not all of the tool-calling features shown above are used by all models. Some use tool call IDs, others simply use the function name and match tool calls to results using the ordering, and there are several models that use neither and only issue one tool call at a time to avoid confusion. If you want your code to be compatible across as many models as possible, we recommend structuring your tool calls like we've shown here, and returning tool results in the order that they were issued by the model. The chat templates on each model should handle the rest.

## Understanding tool schemas

Each function you pass to the `tools` argument of `apply_chat_template` is converted into a JSON schema. These schemas are then passed to the model chat template. In other words, tool-use models do not see your functions directly, and they never see the actual code inside them. What they care about is the function definitions and the arguments they need to pass to them - they care about what the tools do and how to use them, not how they work! It is up to you to read their outputs, detect if they have requested to use a tool, pass their arguments to the tool function, and return the response in the chat.

Generating JSON schemas to pass to the template should be automatic and invisible as long as your functions follow the specification above, but if you encounter problems, or you simply want more control over the conversion, you can handle the conversion manually. Here is an example of a manual schema conversion.

```python
from transformers.utils import get_json_schema

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

schema = get_json_schema(multiply)
print(schema)
```

This will yield:

```json
{
  "type": "function",
  "function": {
    "name": "multiply",
    "description": "A function that multiplies two numbers",
    "parameters": {
      "type": "object",
      "properties": {
        "a": {
          "type": "number",
          "description": "The first number to multiply"
        },
        "b": {
          "type": "number",
          "description": "The second number to multiply"
        }
      },
      "required": ["a", "b"]
    }
  }
}
```

If you wish, you can edit these schemas, or even write them from scratch yourself without using `get_json_schema` at all. JSON schemas can be passed directly to the `tools` argument of `apply_chat_template` - this gives you a lot of power to define precise schemas for more complex functions. Be careful, though - the more complex your schemas, the more likely the model is to get confused when dealing with them! We recommend simple function signatures where possible, keeping arguments (and especially complex, nested arguments) to a minimum.
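As a concrete illustration of editing a generated schema, here is a short sketch that starts from the `multiply` schema produced by `get_json_schema`, tweaks a couple of descriptions by hand, and passes the result straight to the template. The specific edits are purely illustrative, and `tokenizer` and `messages` are assumed to be defined as in the earlier examples:

```python
from transformers.utils import get_json_schema

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

# Start from the automatically generated schema...
schema = get_json_schema(multiply)

# ...then edit it by hand. These particular edits are just examples.
schema["function"]["description"] = "Multiply two numbers and return the product."
schema["function"]["parameters"]["properties"]["a"]["description"] = "The first factor"

# Hand-edited schemas are passed to `tools` exactly like plain functions
model_input = tokenizer.apply_chat_template(messages, tools=[schema])
```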
Here is an example of defining schemas by hand, and passing them directly to `apply_chat_template`:

```python
# A simple function that takes no arguments
current_time = {
    "type": "function",
    "function": {
        "name": "current_time",
        "description": "Get the current local time as a string.",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    }
}

# A more complete function that takes two numerical arguments
multiply = {
    "type": "function",
    "function": {
        "name": "multiply",
        "description": "A function that multiplies two numbers",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {
                    "type": "number",
                    "description": "The first number to multiply"
                },
                "b": {
                    "type": "number",
                    "description": "The second number to multiply"
                }
            },
            "required": ["a", "b"]
        }
    }
}

model_input = tokenizer.apply_chat_template(
    messages,
    tools=[current_time, multiply]
)
```

## Advanced: Retrieval-augmented generation

"Retrieval-augmented generation" or "RAG" LLMs can search a corpus of documents for information before responding to a query. This allows models to vastly expand their knowledge base beyond their limited context size. Our recommendation for RAG models is that their template should accept a `documents` argument. This should be a list of documents, where each "document" is a single dict with `title` and `contents` keys, both of which are strings. Because this format is much simpler than the JSON schemas used for tools, no helper functions are necessary.

Here's an example of a RAG template in action:

```python
document1 = {
    "title": "The Moon: Our Age-Old Foe",
    "contents": "Man has always dreamed of destroying the moon. In this essay, I shall"
}

document2 = {
    "title": "The Sun: Our Age-Old Friend",
    "contents": "Although often underappreciated, the sun provides several notable benefits"
}

model_input = tokenizer.apply_chat_template(
    messages,
    documents=[document1, document2]
)
```

## Advanced: How do chat templates work?

The chat template for a model is stored on the `tokenizer.chat_template` attribute. If no chat template is set, the default template for that model class is used instead. Let's take a look at the template for `BlenderBot`:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"
```

That's kind of intimidating. Let's clean it up a little to make it more readable. In the process, though, we also make sure that the newlines and indentation we add don't end up being included in the template output - see the tip on trimming whitespace below!

```
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- ' ' }}
    {%- endif %}
    {{- message['content'] }}
    {%- if not loop.last %}
        {{- '  ' }}
    {%- endif %}
{%- endfor %}
{{- eos_token }}
```

If you've never seen one of these before, this is a Jinja template. Jinja is a templating language that allows you to write simple code that generates text. In many ways, the code and syntax resembles Python. In pure Python, this template would look something like this:

```python
for idx, message in enumerate(messages):
    if message['role'] == 'user':
        print(' ')
    print(message['content'])
    if not idx == len(messages) - 1:  # Check for the last message in the conversation
        print('  ')
print(eos_token)
```

Effectively, the template does three things:
1. For each message, if the message is a user message, add a blank space before it, otherwise print nothing.
2. Add the message content.
3. If the message is not the last message, add two spaces after it. After the final message, print the EOS token.
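To make that concrete, here is a small sketch of what the rendered string looks like for a short conversation, using the `tokenizer` loaded above. The exact output in the comment is approximate and assumes the checkpoint's EOS token is `</s>`:

```python
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]

# tokenize=False returns the formatted string instead of token IDs
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
# Roughly: " Hi there!  Nice to meet you!</s>"
# (a leading space before the user message, two spaces between messages,
#  and the EOS token appended after the final message)
```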
This is a pretty simple template - it doesn't add any control tokens, and it doesn't support "system" messages, which are a common way to give the model directives about how it should behave in the subsequent conversation. But Jinja gives you a lot of flexibility to do those things! Let's see a Jinja template that can format inputs similarly to the way LLaMA formats them (note that the real LLaMA template includes handling for default system messages and slightly different system message handling in general - don't use this one in your actual code!)

```
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
    {%- elif message['role'] == 'system' %}
        {{- '<<SYS>>\\n' + message['content'] + '\\n<</SYS>>\\n\\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- ' ' + message['content'] + ' ' + eos_token }}
    {%- endif %}
{%- endfor %}
```

Hopefully if you stare at this for a little bit you can see what this template is doing - it adds specific tokens based on the "role" of each message, which represents who sent it. User, assistant and system messages are clearly distinguishable to the model because of the tokens they're wrapped in.

## Advanced: Adding and editing chat templates

### How do I create a chat template?

Simple, just write a Jinja template and set `tokenizer.chat_template`. You may find it easier to start with an existing template from another model and simply edit it for your needs! For example, we could take the LLaMA template above and add "[ASST]" and "[/ASST]" to assistant messages:

```
{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
    {%- elif message['role'] == 'system' %}
        {{- '<<SYS>>\\n' + message['content'].strip() + '\\n<</SYS>>\\n\\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }}
    {%- endif %}
{%- endfor %}
```

Now, simply set the `tokenizer.chat_template` attribute. Next time you use [`~PreTrainedTokenizer.apply_chat_template`], it will use your new template! This attribute will be saved in the `tokenizer_config.json` file, so you can use [`~utils.PushToHubMixin.push_to_hub`] to upload your new template to the Hub and make sure everyone's using the right template for your model!

```python
template = tokenizer.chat_template
template = template.replace("SYS", "SYSTEM")  # Change the system token
tokenizer.chat_template = template  # Set the new template
tokenizer.push_to_hub("model_name")  # Upload your new template to the Hub!
```

The method [`~PreTrainedTokenizer.apply_chat_template`] which uses your chat template is called by the [`TextGenerationPipeline`] class, so once you set the correct chat template, your model will automatically become compatible with [`TextGenerationPipeline`].

If you're fine-tuning a model for chat, in addition to setting a chat template, you should probably add any new chat control tokens as special tokens in the tokenizer. Special tokens are never split, ensuring that your control tokens are always handled as single tokens rather than being tokenized in pieces. You should also set the tokenizer's `eos_token` attribute to the token that marks the end of assistant generations in your template. This will ensure that text generation tools can correctly figure out when to stop generating text.
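As a rough sketch of that last point, assuming the `[ASST]` template above and that `tokenizer` and `model` are the tokenizer and model you are fine-tuning, the setup might look like this (the token names are illustrative - adapt them to whatever control tokens your own template uses):

```python
# Register the new chat control tokens as special tokens so the tokenizer
# never splits them into pieces
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[INST]", "[/INST]", "[ASST]", "[/ASST]"]}
)

# If the vocabulary grew, resize the model's embeddings to match
model.resize_token_embeddings(len(tokenizer))

# If your template ended assistant turns with a custom token instead of the
# existing EOS token, you would also point eos_token at it, e.g.:
# tokenizer.eos_token = "[/ASST]"
```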