Context limit of 4096 tokens
#25 opened by huggingface-meta
The current open-source release tops out at 4096 tokens. State-of-the-art long-context models (Gemma-128k, Claude-200k, Llama-3-70B-Long) support ≥128k contexts.
I'd like to understand (a) why BitNet stops at 4k and (b) what must change in code, data, and compute to ship a BitNet variant with a >4k token limit (128k being the typical target).
My assumptions distilled from the repo + paper:
- Positional encoding: repo says RoPE; no NTK-scaling constants exposed.
- KV-cache size: 2B params @ 1.58 bit -> weights ~0.40 GB; KV-cache ≈ n_ctx × n_layers × 2 × n_kv_heads × d_head × bytes/elem, so 4k -> ≈260 MB and 128k -> ≈8 GB (back-of-envelope sketch after this list).
- Training data: pre-training used 4T tokens with a 4k seq-length curriculum; no long-seq adaptation published.
- Kernel LUTs: lookup-table kernels support any n_ctx, so the hard stop looks architectural, not kernel-side.
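
For reference, a minimal back-of-envelope KV-cache estimator. The layer/head numbers below (30 layers, 5 KV heads of dim 128, fp16 cache) are my assumptions rather than confirmed repo values, so treat the output as order-of-magnitude only:

```python
# Back-of-envelope KV-cache size estimate.
# ASSUMPTIONS (not confirmed from the repo/config): 30 layers,
# 5 KV heads of dim 128 (GQA), fp16 cache entries.
def kv_cache_bytes(n_ctx: int,
                   n_layers: int = 30,
                   n_kv_heads: int = 5,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # K and V are each [n_ctx, n_kv_heads, head_dim] per layer.
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (4_096, 131_072):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>7}: ~{gib:.2f} GiB")
# With these assumed numbers: ~0.29 GiB at 4k and ~9.4 GiB at 128k,
# i.e. the same ballpark as the estimates above.
```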
Happy to help with experimentation, e.g. adding NTK-scaled RoPE to bitnet.cpp, generating long-seq synthetic pre-training batches, or benchmarking KV-cache pressure on Apple Silicon RAM.
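
To make the RoPE suggestion concrete, here is a rough Python sketch of the standard NTK-aware base rescaling (the real change would live in bitnet.cpp's rotary code; rope_theta=10000 and head_dim=128 are placeholder assumptions, not values read from the repo):

```python
import numpy as np

# Sketch of NTK-aware RoPE scaling: stretch the rotary base so the
# low-frequency dimensions cover a longer context without retraining.
# rope_theta=10000.0 and head_dim=128 are placeholder assumptions.
def ntk_scaled_inv_freq(head_dim: int = 128,
                        rope_theta: float = 10_000.0,
                        scale: float = 32.0) -> np.ndarray:
    # Standard NTK-aware trick: base' = base * scale^(d / (d - 2)).
    theta = rope_theta * scale ** (head_dim / (head_dim - 2))
    return 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)

inv_freq = ntk_scaled_inv_freq(scale=131_072 / 4_096)  # 4k -> 128k
angles = np.outer(np.arange(8), inv_freq)              # per-position rotation angles
print(angles.shape)                                     # (8, 64)
```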
Thanks for clarifying the design constraints and next steps :)