If you're fine-tuning a model for chat, in addition to setting a chat template, you should probably add any new chat control tokens as special tokens in the tokenizer. Special tokens are never split, ensuring that your control tokens are always handled as single tokens rather than being tokenized in pieces. You should also set the tokenizer's `eos_token` attribute to the token that marks the end of assistant generations in your template. This will ensure that text generation tools can correctly figure out when to stop generating text.

### Why do some models have multiple templates?

Some models use different templates for different use cases. For example, they might use one template for normal chat and another for tool-use, or retrieval-augmented generation. In these cases, `tokenizer.chat_template` is a dictionary. This can cause some confusion, and where possible, we recommend using a single template for all use-cases. You can use Jinja statements like `if tools is defined` and `{% macro %}` definitions to easily wrap multiple code paths in a single template.

When a tokenizer has multiple templates, `tokenizer.chat_template` will be a `dict`, where each key is the name of a template. The `apply_chat_template` method has special handling for certain template names: Specifically, it will look for a template named `default` in most cases, and will raise an error if it can't find one. However, if a template named `tool_use` exists when the user has passed a `tools` argument, it will use that instead. To access templates with other names, pass the name of the template you want to the `chat_template` argument of `apply_chat_template()`.

We find that this can be a bit confusing for users, though - so if you're writing a template yourself, we recommend trying to put it all in a single template where possible!

### What are "default" templates?

Before the introduction of chat templates, chat handling was hardcoded at the model class level.
For backwards compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline` class and methods like `apply_chat_template` will use the class template instead. You can find out what the default template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute.

This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by setting the `chat_template` attribute explicitly to make it clear to users that your model has been correctly configured for chat.

Now that actual chat templates have been adopted more widely, default templates have been deprecated and will be removed in a future release. We strongly recommend setting the `chat_template` attribute for any tokenizers that still depend on them!

### What template should I use?

When setting the template for a model that's already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you're training the model further - you will probably get the best performance if you keep the chat tokens constant. This is very analogous to tokenization - you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training.

If you're training a model from scratch, or fine-tuning a base language model for chat, on the other hand, you have a lot of freedom to choose an appropriate template! LLMs are smart enough to learn to handle lots of different input formats. One popular choice is the `ChatML` format, and this is a good, flexible choice for many use-cases.
It looks like this:

```
{%- for message in messages %}
    {{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor %}
```

If you like this one, here it is in one-liner form, ready to copy into your code. The one-liner also includes handy support for generation prompts, but note that it doesn't add BOS or EOS tokens! If your model expects those, they won't be added automatically by `apply_chat_template` - in other words, the text will be tokenized with `add_special_tokens=False`. This is to avoid potential conflicts between the template and the `add_special_tokens` logic. If your model expects special tokens, make sure to add them to the template!

```python
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
```

This template wraps each message in `<|im_start|>` and `<|im_end|>` tokens, and simply writes the role as a string, which allows for flexibility in the roles you train with. The output looks like this:

```text
<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>
```

The "user", "system" and "assistant" roles are the standard for chat, and we recommend using them when it makes sense, particularly if you want your model to operate well with [`TextGenerationPipeline`]. However, you are not limited to these roles - templating is extremely flexible, and any string can be a role.

### I want to add some chat templates! How should I get started?
If you have any chat models, you should set their `tokenizer.chat_template` attribute and test it using [`~PreTrainedTokenizer.apply_chat_template`], then push the updated tokenizer to the Hub. This applies even if you're not the model owner - if you're using a model with an empty chat template, or one that's still using the default class template, please open a pull request to the model repository so that this attribute can be set properly!

Once the attribute is set, that's it, you're done! `tokenizer.apply_chat_template` will now work correctly for that model, which means it is also automatically supported in places like `TextGenerationPipeline`!

By ensuring that models have this attribute, we can make sure that the whole community gets to use the full power of open-source models. Formatting mismatches have been haunting the field and silently harming performance for too long - it's time to put an end to them!

## Advanced: Template writing tips

If you're unfamiliar with Jinja, we generally find that the easiest way to write a chat template is to first write a short Python script that formats messages the way you want, and then convert that script into a template.

Remember that the template handler will receive the conversation history as a variable called `messages`. You will be able to access `messages` in your template just like you can in Python, which means you can loop over it with `{% for message in messages %}` or access individual messages with `{{ messages[0] }}`, for example.

You can also use the following tips to convert your code to Jinja:

### Trimming whitespace

By default, Jinja will print any whitespace that comes before or after a block. This can be a problem for chat templates, which generally want to be very precise with whitespace!
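
As a rough illustration of the problem, the sketch below renders a loosely written template with a plain `jinja2` `Environment` using its default settings. This setup is an assumption made for the demonstration: other environments may be configured with different whitespace options, so the exact stray characters can differ. The `messages` list is made up for the example.

```python
from jinja2 import Environment

# A plain jinja2 Environment with default whitespace settings
# (an assumption for this sketch; other setups may trim differently).
env = Environment()

# A template written without any whitespace control.
loose_template = (
    "{% for message in messages %}\n"
    "    {{ message['role'] + message['content'] }}\n"
    "{% endfor %}"
)

# Hypothetical one-message conversation, used only for illustration.
messages = [{"role": "user", "content": "Hi"}]
rendered = env.from_string(loose_template).render(messages=messages)

# repr() makes the stray newlines and indentation visible.
print(repr(rendered))
```

The printed value contains the newline and indentation from the template source around the message text, which is exactly the kind of noise a chat model should not see.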
To avoid this, we strongly recommend writing your templates like this:

```
{%- for message in messages %}
    {{- message['role'] + message['content'] }}
{%- endfor %}
```

rather than like this:

```
{% for message in messages %}
    {{ message['role'] + message['content'] }}
{% endfor %}
```

Adding `-` will strip any whitespace that comes before the block. The second example looks innocent, but the newline and indentation may end up being included in the output, which is probably not what you want!

### For loops

For loops in Jinja look like this:

```
{%- for message in messages %}
    {{- message['content'] }}
{%- endfor %}
```

Note that whatever's inside the {{ expression block }} will be printed to the output. You can use operators like `+` to combine strings inside expression blocks.

### If statements

If statements in Jinja look like this:

```
{%- if message['role'] == 'user' %}
    {{- message['content'] }}
{%- endif %}
```

Note that where Python uses whitespace to mark the beginnings and ends of `for` and `if` blocks, Jinja requires you to explicitly end them with `{% endfor %}` and `{% endif %}`.

### Special variables

Inside your template, you will have access to the list of `messages`, but you can also access several other special variables. These include special tokens like `bos_token` and `eos_token`, as well as the `add_generation_prompt` variable that we discussed above. You can also use the `loop` variable to access information about the current loop iteration, for example using `{% if loop.last %}` to check if the current message is the last message in the conversation. Here's an example that puts these ideas together to add a generation prompt at the end of the conversation if `add_generation_prompt` is `True`:

```
{%- if loop.last and add_generation_prompt %}
    {{- bos_token + 'Assistant:\n' }}
{%- endif %}
```

### Compatibility with non-Python Jinja

There are multiple implementations of Jinja in various languages.
They generally have the same syntax, but a key difference is that when you're writing a template in Python you can use Python methods, such as `.lower()` on strings or `.items()` on dicts. This will break if someone tries to use your template on a non-Python implementation of Jinja. Non-Python implementations are particularly common in deployment environments, where JS and Rust are very popular.

Don't panic, though! There are a few easy changes you can make to your templates to ensure they're compatible across all implementations of Jinja:

- Replace Python methods with Jinja filters. These usually have the same name, for example `string.lower()` becomes `string|lower`, and `dict.items()` becomes `dict|items`. One notable change is that `string.strip()` becomes `string|trim`. See the list of built-in filters in the Jinja documentation for more.
- Replace `True`, `False` and `None`, which are Python-specific, with `true`, `false` and `none`.
- Directly rendering a dict or list may give different results in other implementations (for example, string entries might change from single-quoted to double-quoted). Adding the `tojson` filter can help to ensure consistency here.
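
To make the first and last tips concrete, here is a minimal sketch that compares the Python-method style with the filter style, and direct dict rendering with `tojson`. It assumes the Python `jinja2` package with a default `Environment`; the variable names (`text`, `args`) and their values are made up for illustration, and the sketch says nothing about how any particular non-Python implementation behaves.

```python
import json
from jinja2 import Environment

env = Environment()

# Python-method style: relies on str.strip() and str.lower(),
# which are only available when the template runs under Python.
python_only = env.from_string("{{ text.strip().lower() }}")

# Filter style: uses Jinja's built-in trim and lower filters instead.
portable = env.from_string("{{ text | trim | lower }}")

text = "  Hello World  "  # hypothetical input
python_only_out = python_only.render(text=text)
portable_out = portable.render(text=text)

# Rendering a dict directly falls back to Python's own string form,
# while the tojson filter emits JSON text.
args = {"city": "Paris"}  # hypothetical tool arguments
raw_out = env.from_string("{{ args }}").render(args=args)
json_out = env.from_string("{{ args | tojson }}").render(args=args)

print(python_only_out == portable_out)  # both styles agree under Python
print(raw_out)   # Python-style quoting
print(json_out)  # JSON-style quoting
```

Under Python both styles give the same text, so switching to filters costs nothing there, while the `tojson` output no longer depends on how the host language prints a dict.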