Context limit of 4096 tokens

#25
by huggingface-meta - opened

The current open-source release tops out at 4096 tokens. State-of-the-art long-context models (Gemma-128k, Claude-200k, Llama-3-70B-Long) have context windows of ≥128k tokens.

I'd like to understand a) why BitNet stops at 4k and b) what has to change in code, data, and compute to ship a BitNet variant with a context limit above 4k tokens (ideally the now-typical 128k).

My assumptions distilled from the repo + paper:

  • Positional encoding: repo says RoPE; no NTK-scaling constants exposed.
  • KV-cache size: 2 B params @ 1.58 bit -> weights ~0.40 GB; KV-cache ≈ 2 × n_layers × n_ctx × n_kv_heads × d_head × 2 bytes (fp16):
    4 k -> ≈260 MB, 128 k -> ≈8 GB (see the sizing sketch after this list).
  • Training data: pre-training used 4T tokens with a 4k sequence-length curriculum; no long-sequence adaptation has been published.
  • Kernel LUTs: lookup-table kernels support any n_ctx, so the hard stop looks architectural, not kernel-side.
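
To put numbers on the KV-cache bullet, here is a minimal sizing sketch. The constants (n_layers=30, n_kv_heads=5, d_head=128, fp16 cache) are my assumptions for the 2B checkpoint rather than values taken from the repo, so treat the output as an order-of-magnitude check only:

```python
# Back-of-envelope KV-cache sizing for a decoder-only model.
# NOTE: n_layers / n_kv_heads / d_head are assumed values for the 2B BitNet
# checkpoint, not numbers confirmed by the repo.
def kv_cache_bytes(n_ctx, n_layers=30, n_kv_heads=5, d_head=128, bytes_per_elem=2):
    # 2x for keys and values, cached per layer, per token, per KV head (fp16 by default).
    return 2 * n_layers * n_ctx * n_kv_heads * d_head * bytes_per_elem

for n_ctx in (4_096, 131_072):
    print(f"{n_ctx:>7} tokens -> {kv_cache_bytes(n_ctx) / 2**30:.2f} GiB")
```

With those assumed constants it lands around 0.3 GiB at 4k and ~9.4 GiB at 128k, i.e. the same ballpark as my figures above; the exact numbers depend on the real config and on whether the cache stays in fp16 or gets quantized.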

Happy to help with experimentation, e.g. adding NTK-scaled RoPE to bitnet.cpp (rough sketch below), generating long-sequence synthetic pre-training batches, or benchmarking KV-cache pressure on Apple Silicon RAM.
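
For reference, this is the kind of NTK-aware base scaling I mean; a minimal numpy sketch, where the function names, the 32x scale factor, and the d_head default are illustrative assumptions, not bitnet.cpp's actual API:

```python
import numpy as np

# Minimal sketch of "NTK-aware" RoPE base scaling; parameter defaults are
# illustrative assumptions, not values read from bitnet.cpp.
def ntk_scaled_base(base=10000.0, d_head=128, scale=32.0):
    # Standard NTK-aware formula: base' = base * scale^(d / (d - 2)).
    # Raising the base stretches the rotary wavelengths so that positions up to
    # scale * (training n_ctx) stay inside the frequency range seen in training.
    return base * scale ** (d_head / (d_head - 2))

def rope_inv_freqs(d_head=128, base=10000.0):
    # Standard RoPE inverse frequencies, one per channel pair.
    return base ** (-np.arange(0, d_head, 2) / d_head)

print(f"scaled base for a 32x extension: {ntk_scaled_base():.0f}")
print("first inverse frequencies:", rope_inv_freqs(base=ntk_scaled_base())[:4])
```

In bitnet.cpp terms I would expect this to map onto overriding the rope frequency base at load time, similar to the rope-frequency overrides llama.cpp exposes, but whether the LUT kernels consume that cleanly is exactly the kind of thing I'd like to confirm.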

Thanks for clarifying the design constraints and next steps :)
