RachidAR's Collections
Ternary LLMs & Knowledge distillation & SOTA
Addition is All You Need for Energy-efficient Language Models (arXiv:2410.00907)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (arXiv:2402.17764)
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710)
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory (arXiv:2405.08707)
Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (arXiv:2308.06744)
TerDiT: Ternary Diffusion Models with Transformers (arXiv:2405.14854)
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)
You Only Cache Once: Decoder-Decoder Architectures for Language Models (arXiv:2405.05254)
Differential Transformer (arXiv:2410.05258)
BitNet a4.8: 4-bit Activations for 1-bit LLMs (arXiv:2411.04965)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (arXiv:2502.11089)