arxiv:2501.16937

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Published on Jan 28
· Submitted by akhaliq on Jan 29

Abstract

Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B for language tasks and TAID-VLM-2B for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
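To make the core idea concrete, here is a minimal sketch of an interpolated distillation loss in PyTorch. This is an illustration based on the abstract, not the paper's exact objective: the function name, the fixed interpolation coefficient `lam`, and the simplified KL term are assumptions, and TAID's adaptive schedule for the coefficient is not reproduced here.

```python
import torch
import torch.nn.functional as F

def interpolated_distillation_loss(student_logits, teacher_logits, lam):
    """Sketch of a TAID-style loss: match the student to a target that
    interpolates between its own (detached) distribution and the teacher's.

    Args:
        student_logits: (num_tokens, vocab_size) logits from the student.
        teacher_logits: (num_tokens, vocab_size) logits from the teacher.
        lam: interpolation coefficient in [0, 1]; assumed to increase
             from ~0 toward 1 over the course of training.
    """
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)

    # Intermediate target distribution: starts near the student's own
    # distribution (lam ~ 0) and drifts toward the teacher's (lam -> 1).
    # The student term is detached so gradients flow only through the
    # log-probabilities being matched below.
    p_target = (1.0 - lam) * p_student.detach() + lam * p_teacher

    log_p_student = F.log_softmax(student_logits, dim=-1)
    # KL(p_target || p_student), averaged over tokens.
    return F.kl_div(log_p_student, p_target, reduction="batchmean")
```

In a training loop, `lam` would be stepped from near 0 toward 1 (the paper describes an adaptive update for this schedule), which is what gradually shifts the target from the student's initial distribution toward the teacher's.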

Community


I have a question about the experiments.
Looking at the code, it seems the student tokenizer is not being used.
If the teacher and student come from different model families (e.g. teacher: phi-3, student: TinyLlama), each model needs its own tokenizer to produce the correct inputs. If the teacher_tokenizer is currently shared by both models, we should discuss whether the experimental setup in the paper is correct.

Paper author

Hi, thank you for the question.
We chose pairs that use the same tokenizer. For the phi-3 and TinyLlama pair, both use the same Llama tokenizer, i.e. they are compatible.
I hope this helps.
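For anyone wanting to verify this for their own teacher/student pair, a quick tokenizer-compatibility check along these lines should do. The model IDs below are just examples of a phi-3 / TinyLlama pairing, not necessarily the exact checkpoints used in the paper:

```python
from transformers import AutoTokenizer

# Example checkpoints for a phi-3 teacher and a TinyLlama student;
# substitute the pair you actually intend to distill.
teacher_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
student_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Token-level distillation compares distributions over token ids, so both
# tokenizers must map the same ids to the same tokens.
print("Identical vocabularies:", teacher_tok.get_vocab() == student_tok.get_vocab())

# Cheaper spot check: the same text should produce the same input ids.
text = "Knowledge distillation transfers knowledge from a teacher to a student."
print("Same ids for sample text:",
      teacher_tok(text)["input_ids"] == student_tok(text)["input_ids"])
```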

I found this paper while looking at the sakana.ai website. I am surprised TAID has not yet received more attention in the AI/LLM community, since the implications are huge. Efficiency is probably one of the biggest problems for the advancement of AI; GPU capacity, not to mention energy use, could be greatly reduced if TAID can be applied to particular specialised models. Thanks for the great work.


Models citing this paper 19


Datasets citing this paper 0

No datasets citing this paper.

Cite arxiv.org/abs/2501.16937 in a dataset README.md to link it from this page.

Spaces citing this paper 8

Collections including this paper 2