---
base_model: BAAI/bge-m3
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---

Given a *document*, this retrieval embedding model retrieves *instruction templates* from [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) that are relevant to the entire document or to individual chunks / sections of it.

**Note:** This retrieval embedding is symmetric, so it can also be used to retrieve documents relevant to the [`compatible_document_description` of an instruction template](https://huggingface.co/datasets/fineinstructions/finetemplates).

## Requirements

```
datasets
faiss
huggingface_hub
numpy
pandas
sentence_transformers
```

## Simple Usage Example

```python
import importlib.util
import json
from huggingface_hub import hf_hub_download


def download_and_import_module(module_name, variable):
    """Download `<module_name>.py` from the model repo, import it, and return `variable` from it."""
    module = importlib.util.module_from_spec(
        importlib.util.spec_from_file_location(
            module_name,
            hf_hub_download(
                repo_id="fineinstructions/instruction_template_retrieval_embedding",
                filename=f"{module_name}.py",
            ),
        )
    )
    module.__spec__.loader.exec_module(module)
    return getattr(module, variable)


# Import the retriever helper class
InstructionTemplateRetriever = download_and_import_module("instruction_template_retriever", "InstructionTemplateRetriever")

# Prepare an example document
EXAMPLE_DOC = """
Title: Surprising Facts about Pigeons
Submitted On: September 24, 2008

Fact 1: During World War I, a homing pigeon named Cher Ami played a critical role in saving nearly 200 soldiers who were trapped behind enemy lines. Despite being injured by enemy fire, Cher Ami managed to deliver a crucial message that led to their rescue. For this act of bravery, the French government awarded the pigeon the Croix de Guerre, a military medal of honor. Cher Ami became a symbol of courage and the extraordinary utility of pigeons in wartime communication.

Fact 2: Pigeons possess impressive cognitive abilities, one of the most surprising being their capacity for self-recognition in mirrors. This trait is rare in the animal kingdom and is often considered a marker of higher intelligence. Experiments have shown that pigeons can distinguish themselves from other birds when looking into a mirror, suggesting a level of self-awareness previously thought to be unique to primates and a few other animals.

Fact 3: Thanks to centuries of selective breeding, there are now more than 300 recognized breeds of domestic pigeon. These range from show pigeons with elaborate feather patterns and head crests to performance breeds used in tumbling and racing. The sheer variety reflects the bird’s long history as a companion species to humans.

Fact 4: The Ancient Romans were known for their elaborate grooming rituals, and pigeons played an unexpected role in their beauty routines. Specifically, they used pigeon droppings as a bleaching agent to style and lighten their hair. This unusual practice was part of the broader Roman obsession with fashion and appearance, demonstrating how even the most unexpected materials found a place in early cosmetic treatments.
"""

# Retrieve instruction templates relevant to different chunks / sections of a document
retriever = InstructionTemplateRetriever(
    coverage_chunks=4, sigma=0.05, alpha=1.0  # Ensure instruction templates cover information in the document with 4 chunks/sections
)
print(json.dumps(retriever.search(document=EXAMPLE_DOC), indent=4))

# ******************************************************
# Retrieval results look like:
# ******************************************************
# Instruction Templates for Entire Document:
#   - "What's something a few word description of something remarkable or noteworthy you can tell me"
# Instruction Templates for Chunk 1/4 of the Document:
#   - "write a a few word description of the type of message for a significant achievement or milestone"
# Instruction Templates for Chunk 2/4 of the Document:
#   - "how are a type of organism or entity so exceptionally strong or notable in some way?"
# Instruction Templates for Chunk 3/4 of the Document:
#   - "what are the common a type of organism, creature, or entity?"
# Instruction Templates for Chunk 4/4 of the Document:
#   - "how did a group of people perform a common practice or activity"

# ******************************************************
# Increasing diversity:
# -----------------------
# You can use the `reweight` parameter to increase
# diversity in instruction length, like so:
# `print(json.dumps(retriever.search(document=EXAMPLE_DOC, reweight=True), indent=4))`
# ******************************************************

# ******************************************************
# Documentation:
# -----------------------
# For the full documentation of the `InstructionTemplateRetriever.search` method,
# read the instruction_template_retriever.py file here:
# https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding/tree/main
# ******************************************************
```

---

This model was trained with a synthetic dataset with [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).
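
The note on symmetry near the top of this card can be illustrated with a minimal sketch. It does not call the model: the vectors below are hand-written toy values standing in for real embeddings, and the helper `cosine_similarity_matrix` is a hypothetical name introduced only for this illustration. The point is that one similarity matrix between document vectors and `compatible_document_description` vectors can be read in either direction.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return cosine similarities between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Toy 3-dimensional vectors (assumed values, NOT outputs of the real model).
document_vectors = np.array([
    [0.9, 0.1, 0.0],  # document 0
    [0.0, 0.2, 0.9],  # document 1
])
description_vectors = np.array([
    [0.1, 0.1, 0.8],  # compatible_document_description of template 0
    [0.8, 0.2, 0.1],  # compatible_document_description of template 1
])

# Rows are documents, columns are templates.
scores = cosine_similarity_matrix(document_vectors, description_vectors)

# Document -> template: take the best column in each row.
best_template_per_document = scores.argmax(axis=1)
# Template -> document: take the best row in each column of the same matrix.
best_document_per_template = scores.argmax(axis=0)

print(best_template_per_document.tolist())
print(best_document_per_template.tolist())
```

With the real model, both sets of vectors would presumably come from the same encoder, which is what makes the two lookups interchangeable; how the `InstructionTemplateRetriever` helper itself indexes and scores templates is documented in `instruction_template_retriever.py` and is not reproduced here.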