---
license: apache-2.0
language:
- en
tags:
- How to use reasoning models.
- How to use thinking models.
- How to create reasoning models.
- deepseek
- reasoning
- reason
- thinking
- all use cases
- creative
- fiction writing
- plot generation
- sub-plot generation
- story generation
- scene continue
- storytelling
- fiction story
- romance
- all genres
- story
- writing
- vivid writing
- fiction
- roleplaying
- bfloat16
- float32
- float16
- role play
- sillytavern
- backyard
- lmstudio
- Text Generation WebUI
- llama 3
- mistral
- llama 3.1
- qwen 2.5
- context 128k
- mergekit
- merge
pipeline_tag: text-generation
---

<h2>How-To-Use-Reasoning-Thinking-Models-and-Create-Them - DOCUMENT</h2>

This document covers suggestions and methods to get the most out of "Reasoning/Thinking" models, including tips/tricks for generation, parameters/samplers,
System Prompt/Role settings, as well as links to "Reasoning/Thinking" models and how to create your own (via adapters).

This is a live document and updates will occur often.

This document and the information contained in it can be used for ANY "Reasoning/Thinking" model - at my repo and/or other repos.

LINKS to models and adapters:

<B>#1 All Reasoning/Thinking Models - including MOEs - (collection) (GGUF):</b>

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-deepseek-models-with-thinking-reasoning-67a41ec81d9df996fd1cdd60 ]

<B>#2 All Reasoning/Thinking Models - including MOEs - (collection) (Source code to generate GGUF, EXL2, AWQ, GPTQ, HQQ, etc., and for direct usage):</b>

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-source-files-for-gguf-exl2-awq-gptq-67b296c5f09f3b49a6aa2704 ]

<B>#3 All Adapters (collection) - Turn a "regular" model into a "thinking/reasoning" model:</b>

[ https://huggingface.co/collections/DavidAU/d-au-reasoning-adapters-loras-any-model-to-reasoning-67bdb1a7156a97f6ec42ce36 ]

These collections will update over time. Newest items are usually at the bottom of each collection.

---

<B>Support: Document about Parameters, Samplers and How to Set These:</b>

---

For additional generation support, general questions, detailed parameter info, and a lot more, see also:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

---

<B>Support: AI Auto-Correct Engine (software patch for SillyTavern Front End)</b>

---

AI Auto-Correct Engine (built and programmed by DavidAU) auto-corrects AI generation in real time, including modification of the
live generation stream to and from the AI... creating a two-way street of information that operates, changes, and edits automatically.
This system works with all GGUF, EXL2, HQQ, and other quants/compressions, as well as full-precision source models.

Below is an example generation using a standard GGUF (and a standard AI app), auto-corrected via this engine.
The engine is an API-level system.

Software Link:

https://huggingface.co/DavidAU/AI_Autocorrect__Auto-Creative-Enhancement__Auto-Low-Quant-Optimization__gguf-exl2-hqq-SOFTWARE

---

<h2>MAIN: How To Use Reasoning / Thinking Models 101 </h2>

<B>Special Operation Instructions:</B>

---

<B>Template Considerations:</b>

For most reasoning/thinking models your template CHOICE is critical, as are your System Prompt/Role setting(s) - see below.

For most models you will need: Llama 3 Instruct or Chat, ChatML, and/or Command-R, OR the standard "Jinja Autoloaded Template"
(this is contained in the quant and will autoload in SOME AI apps).

The last one is usually the BEST CHOICE for a reasoning/thinking model (and in many cases for other models too).

In LM Studio, this option appears in the lower left: "template to use" -> "Manual" or "Jinja Template".

This option/setting will vary from AI/LLM app to app.

A "Jinja" template usually lives in the model's "source code" / "full precision" version, in the "tokenizer_config.json" file
(usually at the very BOTTOM/END of the file), and is then "copied" into the GGUF quants and made available to AI/LLM apps.

Here is a Qwen 2.5 version example (DO NOT USE: I have added spacing/breaks for readability):

<pre>
<small>
"chat_template": "{% if not add_generation_prompt is defined %}
  {% set add_generation_prompt = false %}
  {% endif %}
  {% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}
  {%- for message in messages %}
  {%- if message['role'] == 'system' %}
  {% set ns.system_prompt = message['content'] %}
  {%- endif %}
  {%- endfor %}
  {{bos_token}}
  {{ns.system_prompt}}
  {%- for message in messages %}
  {%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}
  {{'<|User|>' + message['content']}}
    {%- endif %}
    {%- if message['role'] == 'assistant' and message['content'] is none %}
    {%- set ns.is_tool = false -%}
    {%- for tool in message['tool_calls']%}
    {%- if not ns.is_first %}
    {{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n'
      + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}
        {%- set ns.is_first = true -%}
        {%- else %}
        {{'\\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' 
          + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' 
          + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}
            {%- endif %}
            {%- endfor %}
            {%- endif %}
            {%- if message['role'] == 'assistant' and message['content'] is not none %}
            {%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}
              {%- set ns.is_tool = false -%}
              {%- else %}
              {% set content = message['content'] %}
              {% if '</think>' in content %}
              {% set content = content.split('</think>')[-1] %}
              {% endif %}
              {{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}
                {%- endif %}{%- endif %}
                {%- if message['role'] == 'tool' %}
                {%- set ns.is_tool = true -%}
                {%- if ns.is_output_first %}
                {{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}
                  {%- set ns.is_output_first = false %}
                  {%- else %}
                  {{'\\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}
                    {%- endif %}
                    {%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}
                      {% endif %}
                      {% if add_generation_prompt and not ns.is_tool %}
                      {{'<|Assistant|>'}}
                        {% endif %}"
</small>
</pre>

In some cases you may need to set a "tokenizer" too - depending on the LLM/AI app - to work with specific reasoning/thinking models. Usually
this is NOT an issue, as it is auto-detected/set, but if you are getting strange results this might be the cause.
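
To check which template a model actually ships with, here is a minimal sketch (Python; assumes the "transformers" library is installed, and the model path is a placeholder) for reading and applying the embedded template:

<PRE>
import json

# The template usually sits at the bottom/end of tokenizer_config.json:
with open("tokenizer_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)
print(config.get("chat_template", "no chat_template found"))

# With the "transformers" library, the same template is applied for you;
# the model path below is a placeholder, not a real repo:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/model-source")
messages = [
    {"role": "system", "content": "You are a deep thinking AI."},
    {"role": "user", "content": "Write one scene set in Wales. 800-1000 words."},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
</PRE>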

An additional section, "General Notes", appears at the end of this document.

GENERATION TIPS:

General:

Here are some example prompts that will "activate" thinking properly; note the length statements.

Science Fiction: The Last Transmission - Write a story that takes place entirely within a spaceship's cockpit as the sole surviving crew member attempts to send a final message back to Earth before the ship's power runs out. The story should explore themes of isolation, sacrifice, and the importance of human connection in the face of adversity. If the situation calls for it, have the character(s) curse and swear to further the reader's emotional connection to them. 800-1000 words.

Romance: Love in the Limelight. Write one scene within a larger story set in Wales. A famous (fictional) actor ducks into a small-town bookstore to escape paparazzi. The scene takes us through the characters meeting in this odd circumstance. Over the course of the scene, the actor and the bookstore owner have a conversation charged by an undercurrent of unspoken chemistry. Write the actor as somewhat of a rogue with a fragile ego, which needs to be fed by having everyone like him. He is thoroughly charming, but the bookstore owner seems (at least superficially) immune to this; which paradoxically provokes a genuine attraction and derails the charm offensive. The bookstore owner, despite the superficial rebuffs of the actor's charm, is inwardly more than a little charmed and flustered despite themselves. Write primarily in dialogue, in the distinct voices of each character. 800-1000 words.

Start a 1000 word scene (vivid, graphic horror in first person) with: The skyscraper swayed, as she watched the window in front of her on the 21st floor explode...

Using insane levels of bravado and self-confidence, tell me in 800-1000 words why I should use you to write my next fictional story. Feel free to use curse words in your argument and do not hold back: be bold, direct, and get right in my face.

Advanced:

You can input just the "thinking" part AS A "prompt" and sometimes get the model to start and process from that point.

Likewise you can EDIT the "thinking" part too -> and change the thought process itself.

Another way: prompt, then copy/paste the "thinking" and output.

New chat -> same prompt -> start generation
-> stop, EDIT the output -> put the "raw thoughts" back in, minus any output (you can edit these too)
-> hit continue.
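
A minimal sketch of this "continue from edited thoughts" trick (Python; assumes a local llama.cpp "llama-server", and the DeepSeek-style tag strings shown are illustrative - match them to your model's actual template):

<PRE>
import requests

# Raw prompt ending inside an open (edited) thinking block; the model
# continues the thoughts from this point:
prefill = (
    "<|User|>Write a 1000 word horror scene set in a lighthouse.<|Assistant|>"
    "&lt;think&gt;\nThe user wants visceral first-person horror. Open mid-action, "
    "keep sentences short, build dread through sound...\n"
)

r = requests.post("http://localhost:8080/completion", json={
    "prompt": prefill,
    "temperature": 0.6,
    "n_predict": 1500,
})
print(r.json()["content"])
</PRE>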

Other option(s):

In the "thoughts" -> change the wording/phrases that trigger thoughts/rethinking - even changing the words themselves.
IE: swapping words such as "alternatively" or "considering this" will have an impact on thinking/reasoning and the "end conclusions".

This is "generational steering", which is covered in this document:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

Really Advanced:

If you are using a frontend like SillyTavern and/or an app like Text Generation WebUI, Llama-Server (Llamacpp) or Koboldcpp, you can change the LOGIT
bias for word(s) and/or phrase(s).

Some of these apps also have "anti-slop" / word/phrase blocking too.

IE: LOWER "alternatively" and RAISE "considering" (you can also BLOCK word(s) and/or phrase(s) directly).

By adjusting these bias(es) and/or adding blocks you can alter how the model thinks too - because reasoning, like normal AI/LLM generation, is all about
prediction.

When you change the "chosen" next word and/or phrase you alter the output AND the generation too. The model chooses a different path - maybe
only slightly different - but each choice is cumulative.

Careful testing and adjustment(s) can vastly alter the reasoning/thinking processes which may assist with your use case(s).
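
A minimal sketch of logit biasing (Python; assumes a local llama.cpp "llama-server" on port 8080 - the logit_bias format differs between apps and server versions, so check your app's docs):

<PRE>
import requests

BASE = "http://localhost:8080"  # assumed local llama-server address

def token_ids(text):
    # llama-server exposes /tokenize for the loaded model's vocabulary
    return requests.post(f"{BASE}/tokenize", json={"content": text}).json()["tokens"]

# LOWER "alternatively", RAISE "considering" (note the leading spaces -
# most tokenizers treat " word" and "word" as different tokens):
bias = [[t, -2.0] for t in token_ids(" alternatively")]
bias += [[t, 2.0] for t in token_ids(" considering")]

r = requests.post(f"{BASE}/completion", json={
    "prompt": "Brainstorm 5 uncommon plot ideas for a ghost story.",
    "temperature": 0.6,
    "repeat_penalty": 1.05,
    "logit_bias": bias,  # [[token_id, bias], ...]
    "n_predict": 512,
})
print(r.json()["content"])
</PRE>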

TEMP/SETTINGS:

1. Set temp between 0 and .8; higher than this, "think" functions will activate differently. The most "stable" temp seems to be .6, with a variance of +/-0.05. Lower it for more "logic" in the reasoning, raise it for more "creative" reasoning (max .8 or so). Also set context to at least 4096, to account for "thoughts" generation.
2. At temps of 1+, 2+, etc., thought(s) will expand, and become deeper and richer.
3. Set "repeat penalty" to 1.02 to 1.07 (recommended).

PROMPTS:

1. If you enter a prompt without implied "step by step" requirements (ie: Generate a scene, write a story, give me 6 plots for xyz), "thinking" (one or more) MAY activate AFTER first generation. (IE: Generate a scene -> scene will generate, followed by suggestions for improvement in "thoughts")
2. If you enter a prompt where "thinking" is stated or implied (ie: puzzle, riddle, solve this, brainstorm this idea, etc), "thoughts" process(es) in Deepseek will activate almost immediately. Sometimes you need to regen for it to activate.
3. You will also get a lot of variations - some will continue the generation, others will talk about how to improve it, and some (ie generation of a scene) will cause the characters to "reason" about this situation. In some cases, the model will ask you to continue generation / thoughts too.
4. In some cases the model's "thoughts" may appear in the generation itself.
5. State the maximum word count IN THE PROMPT for best results, especially for activation of "thinking." (see the examples above)
6. Sometimes the "censorship" (from Deepseek) will activate, regen the prompt to clear it.
7. You may want to try your prompt once at "default" or "safe" temp settings, another at temp 1.2, and a third at 2.5 as an example. This will give you a broad range of "reasoning/thoughts/problem" solving.

GENERATION - THOUGHTS/REASONING:

1. It may take one or more regens for "thinking" to "activate." (depending on the prompt)
2. Model can generate a LOT of "thoughts". Sometimes the most interesting ones are 3,4,5 or more levels deep. 
3. Many times the "thoughts" are unique and very different from one another.
4. Temp/rep pen settings can affect reasoning/thoughts too.
5. Change up or add directives/instructions or increase the detail level(s) in your prompt to improve reasoning/thinking.
6. Adding to your prompt: "think outside the box", "brainstorm X number of ideas", "focus on the most uncommon approaches" can drastically improve your results.

GENERAL SUGGESTIONS:

1. I have found opening a "new chat" per prompt works best for "thinking/reasoning activation", with temp .6, rep pen 1.05 ... THEN "regen" as required.
2. Sometimes the model will get completely unhinged and you will need to stop it manually.
3. Depending on your AI app, "thoughts" may appear with "< THINK >" and "</ THINK >" tags AND/OR the AI will generate "thoughts" directly in the main output or later output(s). (A parsing sketch follows this list.)
4. Although quant Q4_K_M was used for testing/examples, higher quants will provide better generation / more sound "reasoning/thinking".
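
A minimal sketch (Python) for separating "thoughts" from the final answer programmatically; the tag strings vary by model and app, so adjust the pattern to what your model actually emits:

<PRE>
import re

def split_thoughts(text):
    # capture the first &lt;think&gt;...&lt;/think&gt; block, if any
    m = re.search(r"&lt;think&gt;(.*?)&lt;/think&gt;", text, flags=re.DOTALL | re.IGNORECASE)
    if not m:
        return "", text.strip()  # no tags: the whole output is the answer
    thoughts = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return thoughts, answer

demo = "&lt;think&gt;Open mid-action... build dread...&lt;/think&gt;The skyscraper swayed..."
thoughts, answer = split_thoughts(demo)
print("THOUGHTS:", thoughts)
print("ANSWER:", answer)
</PRE>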

ADDITIONAL SUPPORT:

For additional generation support, general questions, detailed parameter info, and a lot more, see also:

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

---

<B>Recommended Settings (all) - For usage with "Think" / "Reasoning":</B>

temp: .6 , rep pen: 1.07 (range : 1.02 to 1.12), rep pen range: 64, top_k: 40, top_p: .95, min_p: .05 

Temps of 1+, 2+, 3+ will result in much deeper, richer, and "more interesting" thoughts and reasoning.

Model behaviour may change with other parameter(s) and/or sampler(s) activated - especially the "thinking/reasoning" process.
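
As a concrete sketch, here are those settings passed through the llama-cpp-python bindings (the GGUF filename is a placeholder; parameter names may differ slightly in other apps, and "rep pen range" is usually set in the app's sampler options):

<PRE>
from llama_cpp import Llama

llm = Llama(model_path="model-q4_k_m.gguf", n_ctx=4096)  # 4096+ to fit "thoughts"

out = llm(
    "Give me 3 plots for a mystery set in Wales. 300-500 words.",
    temperature=0.6,      # raise toward 1+ for deeper/richer thoughts
    repeat_penalty=1.07,  # recommended range: 1.02 to 1.12
    top_k=40,
    top_p=0.95,
    min_p=0.05,
    max_tokens=1024,
)
print(out["choices"][0]["text"])
</PRE>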

--- 

<B>System Role / System Prompts - Reasoning On/Off/Variable and Augment The Model's Power:</b>

<small> ( <font color="red">Critical Setting for model operation </font> ) </small>

---

System Role / System Prompt / System Message (called "System Prompt" in this section)
is "root access" to the model and controls its internal workings - both instruction following and output generation - and, in the
case of this model, reasoning control, including turning reasoning on/off.

In this section I will show you basic, advanced, and combined "code" to control the model's reasoning, instruction following and output generation.

If you do not set a "system prompt", reasoning/thinking will be OFF by default
(unless the model invokes it automatically - IE it is always in "thinking mode"), and the model will operate like a normal LLM.

HOW TO SET:

Depending on your AI "app" you may have to copy/paste one of the "codes" below into the
"System Prompt" or "System Role" window to enable reasoning/thinking.

In LM Studio, set/activate "Power User" or "Developer" mode to access it, then copy/paste into the System Prompt box.

In SillyTavern go to the "template page" ("A"), activate "system prompt", and enter the text in the prompt box.

In Ollama see [ https://github.com/ollama/ollama/blob/main/README.md ] regarding setting the "system message".

In Koboldcpp, load the model, start it, go to settings -> select a template, and enter the text in the "sys prompt" box.

SYSTEM PROMPTS AVAILABLE:

When you copy/paste, PRESERVE the formatting, including line breaks.

If you want to edit/adjust these, only do so in NOTEPAD or in the LLM app directly.



SIMPLE:

This is the generic system prompt used for generation and testing: 

<PRE>
You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.
</PRE>

This System Role/Prompt will give you "basic thinking/reasoning": 

<PRE>
You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside &lt;think&gt; &lt;/think&gt; tags, and then provide your solution or response to the problem.
</PRE>
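
If your app has no System Prompt box but exposes an OpenAI-compatible API, here is a minimal sketch (Python; the URL and model name are placeholders) of passing the prompt above programmatically:

<PRE>
import requests

SYSTEM_PROMPT = "You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside &lt;think&gt; &lt;/think&gt; tags, and then provide your solution or response to the problem."

r = requests.post("http://localhost:1234/v1/chat/completions", json={
    "model": "local-model",  # placeholder name
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Give me 6 plots for a ghost story. 500-800 words."},
    ],
    "temperature": 0.6,
})
print(r.json()["choices"][0]["message"]["content"])
</PRE>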

ADVANCED:

Logical and Creative - these will SIGNIFICANTLY alter the output, and many times improve it too.

This will also cause more thoughts, deeper thoughts, and in many cases more detailed/stronger thoughts too.

Keep in mind you may also want to test the model with NO system prompt at all - including the default one.

Special credit to: Eric Hartford, Cognitivecomputations; these are based on his work.

CRITICAL: 

Copy and paste exactly as shown, preserve formatting and line breaks.

SIDE NOTE: 

These can be used in ANY Deepseek / Thinking model, including models not at this repo. 

These, if used in a "non-thinking" model, will also alter model performance.

<PRE>
You are an AI assistant developed by the world wide community of ai experts.

Your primary directive is to provide well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: &lt;think&gt;{reasoning}&lt;/think&gt;{answer}
2. The &lt;think&gt;&lt;/think&gt; block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the &lt;think&gt;&lt;/think&gt; block may be left empty.
4. The user does not see the &lt;think&gt;&lt;/think&gt; section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a &lt;/think&gt; and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Scientific and Logical Approach: Your explanations should reflect the depth and precision of the greatest scientific minds.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.
</PRE>

CREATIVE:

<PRE>
You are an AI assistant developed by a world wide community of ai experts.

Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: &lt;think&gt;{reasoning}&lt;/think&gt;{answer}
2. The &lt;think&gt;&lt;/think&gt; block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the &lt;think&gt;&lt;/think&gt; block may be left empty.
4. The user does not see the &lt;think&gt;&lt;/think&gt; section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a &lt;/think&gt; and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.
</PRE>

---

<B>General Notes:</b>

These are general notes collected from my various repos and/or from experience with both specific models
and models in general.

These notes may assist you with other model(s) operation(s).

---

From:

https://huggingface.co/DavidAU/L3.1-MOE-2X8B-Deepseek-DeepHermes-e32-uncensored-abliterated-13.7B-gguf

Due to how this model is configured, I suggest 2-4 generations depending on your use case(s), as each will vary widely in terms of context, thinking/reasoning, and response.

Likewise, again depending on how your prompt is worded, it may take 1-4 regens for "thinking" to engage; however, sometimes the model will generate a response, then think/reason and improve on that response and continue again. This comes in part from the "Deepseek" parts of the model.

If you raise temp over .9, you may want to consider 4+ generations.

Note on "reasoning/thinking": this will activate depending on the wording in your prompt(s) and also the temp selected.

There can also be variations because of how the models interact per generation.

Also, as general note:

If you are getting "long-winded" generation/thinking/reasoning, you may want to break down the "problem(s)" to solve into one or more prompts. This will allow the model to focus more strongly, and in some cases give far better answers.

IE:

Asking it to generate one plot with specific requirements, rather than 6 general plots for a story, may get you better results.

--- 

From:

https://huggingface.co/DavidAU/Qwen2.5-MOE-6x1.5B-DeepSeek-Reasoning-e32-gguf

A temp of .4 to .8 is suggested; however, the model will still operate at much higher temps like 1.8, 2.6, etc.

Depending on your prompt, change temp SLOWLY: IE: .41, .42, .43 ... etc.

Likewise, because these are small models, they may do a tonne of "thinking"/"reasoning" and then "forget" to finish the task(s). In this case, prompt the model to "Complete the task XYZ with the 'reasoning plan' above".

Likewise, the model may function better if you break down the reasoning/thinking task(s) into smaller pieces:

IE: Instead of asking for 6 plots for theme XYZ, ask it for ONE plot for theme XYZ at a time.

Also set the context limit at 4k minimum; 8k+ is suggested.
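
A minimal sketch (Python; hypothetical OpenAI-compatible local endpoint) of that workflow - several generations per prompt, one task at a time, stepping temperature slowly:

<PRE>
import requests

URL = "http://localhost:1234/v1/chat/completions"  # placeholder endpoint
prompt = "Give me ONE plot for theme XYZ."         # one task at a time

for temp in (0.41, 0.42, 0.43, 0.44):
    r = requests.post(URL, json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temp,
    })
    print(f"--- temp {temp} ---")
    print(r.json()["choices"][0]["message"]["content"])
</PRE>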

---