Context limit of 4096 tokens
#25 opened by huggingface-meta
The current open-source release tops out at 4096 tokens. State-of-the-art long-context models (Gemma-128k, Claude-200k, Llama-3-70B-Long) support ≥128k contexts.
I'd like to understand (a) why BitNet stops at 4k and (b) what must change in code, data, and compute to ship a BitNet variant with a >4k token limit (128k being the typical target).
My assumptions distilled from the repo + paper:
- Positional encoding: repo says RoPE; no NTK-scaling constants exposed.
- KV-cache size: 2B params @ 1.58 bit -> weights ~0.40 GB; KV-cache ≈ n_ctx × n_layers × 2 × n_kv_heads × d_head × bytes/elem, so 4k -> ≈260 MB and 128k -> ≈8 GB (back-of-envelope sketch after this list).
- Training data: pre-training used 4T tokens with a 4k seq-length curriculum; no long-seq adaptation published.
- Kernel LUTs: lookup-table kernels support any n_ctx, so the hard stop looks architectural, not kernel-side.
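
For reference, a minimal back-of-envelope KV-cache estimator. The layer/head numbers below (30 layers, 5 KV heads of dim 128, fp16 cache) are my assumptions rather than confirmed repo values, so treat the output as order-of-magnitude only:

```python
# Back-of-envelope KV-cache size estimate.
# ASSUMPTIONS (not confirmed from the repo/config): 30 layers,
# 5 KV heads of dim 128 (GQA), fp16 cache entries.
def kv_cache_bytes(n_ctx: int,
                   n_layers: int = 30,
                   n_kv_heads: int = 5,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # K and V are each [n_ctx, n_kv_heads, head_dim] per layer.
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (4_096, 131_072):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>7}: ~{gib:.2f} GiB")
# With these assumed numbers: ~0.29 GiB at 4k and ~9.4 GiB at 128k,
# i.e. the same ballpark as the estimates above.
```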
Happy to help with experimentation, e.g. adding NTK-scaled RoPE to bitnet.cpp, generating long-seq synthetic pre-training batches, or benchmarking KV-cache pressure on Apple Silicon RAM.
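
To make the RoPE suggestion concrete, here is a rough Python sketch of the standard NTK-aware base rescaling (the real change would live in bitnet.cpp's rotary code; rope_theta=10000 and head_dim=128 are placeholder assumptions, not values read from the repo):

```python
import numpy as np

# Sketch of NTK-aware RoPE scaling: stretch the rotary base so the
# low-frequency dimensions cover a longer context without retraining.
# rope_theta=10000.0 and head_dim=128 are placeholder assumptions.
def ntk_scaled_inv_freq(head_dim: int = 128,
                        rope_theta: float = 10_000.0,
                        scale: float = 32.0) -> np.ndarray:
    # Standard NTK-aware trick: base' = base * scale^(d / (d - 2)).
    theta = rope_theta * scale ** (head_dim / (head_dim - 2))
    return 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)

inv_freq = ntk_scaled_inv_freq(scale=131_072 / 4_096)  # 4k -> 128k
angles = np.outer(np.arange(8), inv_freq)              # per-position rotation angles
print(angles.shape)                                     # (8, 64)
```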
Thanks for clarifying the design constraints and next steps :)