PRWKV-7-Reka-Flash-3-21B-Instruct-Preview-v0.1 – A Transformer’s Soul, Rewritten in RNNs

This model uses the older cxa073 architecture.

I will retrain it with the cxa075 architecture as soon as possible :)


PRWKV

🧠 Model Overview

PRWKV-7-Reka-Flash-3-21B is a 21-billion parameter pure-RNN language model, built by distilling the transformer-based Reka Flash 3 (21B) into the advanced RWKV v7 architecture. That’s right — we took one of the fastest attention-based models and said:

“Cool architecture. Now let’s do it without attention at all.”

The result? A model that retains much of the reasoning and linguistic capability of its teacher, while eliminating the need for KV-Cache, attention maps, or massive VRAM overhead. This preview version already performs coherent generation while being lightweight and memory-stable — making it a promising step towards practical, large-scale RNN-based LLMs.


⚙️ Model Specs

  • Parameters: 21.3B (44 layers × 6144 hidden dim, 19968 MLP dim)
  • Architecture: RWKV v7 (no attention blocks, pure RNN)
  • Architecture modifications: GQA-style KV dimensions; token-shift and gate removed; w clamped at -0.6; 96 heads
  • Training: Multi-stage distillation from the original Reka Flash 3 (Transformer)
  • Inference: Static size state, no KV cache required
  • Context length (Preview): 4096 tokens (Stage 2 KL) → 16384 (Stage 3 smoothed SFT)
  • VRAM footprint: Stable and compact — ideal for long-sequence inference on limited hardware
  • Development stage: Stage 2 (KD in progress); experimental preview (no performance guarantees)
  • License: Apache 2.0
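
For orientation, the numbers above map onto roughly the following shape. This is a hypothetical summary in code; the field names are illustrative and not taken from the actual training config:

```python
# Illustrative shape summary derived from the spec list above.
# Field names are hypothetical; this is not the real config file.
from dataclasses import dataclass

@dataclass
class PRWKVShape:
    n_layer: int = 44          # "L44"
    d_model: int = 6144        # "D6144"
    d_mlp: int = 19968         # MLP hidden size
    n_head: int = 96           # Head = 96
    head_dim: int = 64         # 6144 / 96
    ctx_preview: int = 4096    # Stage 2 KL context
    ctx_target: int = 16384    # Stage 3 smoothed SFT context

shape = PRWKVShape()
print(f"{shape.n_layer} layers x {shape.d_model} dim, "
      f"{shape.n_head} heads of {shape.head_dim}")
```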

What is RWKV-7?

RWKV-7 "Goose" with Expressive Dynamic State Evolution.

RWKV-7 can perform state tracking and recognize all regular languages, while retaining the parallelizability of training.
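
For intuition, the "dynamic state evolution" can be sketched as a per-head recurrence roughly like the one below. This is a simplified, didactic reading of the RWKV-7 update (a generalized delta rule), not the production kernel: projections, normalizations, and the fused implementation are omitted, and the sign/transpose conventions here are my own.

```python
# Simplified per-head RWKV-7 style state update (didactic sketch only).
import torch

def rwkv7_step(S, r, w, k, v, kappa_hat, a):
    """One recurrent step for a single head.

    S          : (head_dim, head_dim) running state
    r, w, k, v : (head_dim,) receptance, decay, key, value
    kappa_hat  : (head_dim,) normalized removal key
    a          : (head_dim,) in-context learning rate
    """
    # Decay the old state, remove information along kappa_hat,
    # then write the new key/value pair into the state matrix.
    S = S @ (torch.diag(w) - torch.outer(kappa_hat, a * kappa_hat)) \
        + torch.outer(v, k)
    # Read out with the receptance vector.
    y = S @ r
    return S, y
```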

Key Innovations

This model builds upon and refines the attention replacement approaches pioneered by several notable projects, including:

  • Qwerky7 (Qwen 2.5 72B + QRWKV7 Arch)
  • Qwerky6 (Qwen 2.5 32B,72B + QRWKV7 Arch)
  • ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 Arch)

The primary advantage of using the RWKV architecture is the elimination of KV-Cache requirements, allowing for infinite context generation with static VRAM consumption.
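
In practice this means generation carries a fixed-size recurrent state instead of a cache that grows with every token. A minimal sketch of such a streaming loop, with a hypothetical `model.forward(token, state)` API standing in for the real inference code:

```python
# Minimal streaming-generation loop illustrating the constant-memory property.
# `model`, `init_state`, and the forward signature are hypothetical placeholders,
# not the actual PRWKV inference API.
import torch

def generate(model, prompt_ids, max_new_tokens=256):
    state = model.init_state()             # fixed-size recurrent state
    logits = None
    for tok in prompt_ids:                 # ingest the prompt token by token
        logits, state = model.forward(tok, state)
    out = []
    for _ in range(max_new_tokens):
        next_tok = int(torch.argmax(logits))   # greedy decoding for brevity
        out.append(next_tok)
        logits, state = model.forward(next_tok, state)
    # `state` never grows with sequence length, so VRAM stays flat no matter
    # how many tokens have been generated.
    return out
```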

🔧 Distillation Process

Stage 1 – Hidden State Alignment

The first step involved aligning the internal representations of the teacher (Transformer) and student (RWKV) using mean squared error (MSE) loss between hidden states.

This wasn’t optional — without Stage 1, the KL divergence during Stage 2 stayed at “unreasonable cosmic levels” (read: 18.0). With careful engineering, including:

  • Temporal loss (to preserve sequential evolution)
  • SVD-based dimensional filtering (to reduce noise and chaos)
  • Heavy monitoring (and heavier doses of coffee and energy drinks)

we successfully got the RWKV student to mimic the internal logic of the transformer teacher.
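
A minimal sketch of what such an alignment objective can look like: a plain MSE term, a delta-matching "temporal" term, and an optional SVD projection of the hidden states. The exact weighting and implementation used for this model are not published, so treat everything below as illustrative.

```python
# Hedged sketch of Stage 1 hidden-state alignment (illustrative only).
import torch
import torch.nn.functional as F

def stage1_alignment_loss(h_student, h_teacher, rank=None):
    """h_student, h_teacher: (batch, seq, dim) hidden states of one layer."""
    if rank is not None:
        # SVD-based filtering: project both onto the teacher's top-`rank`
        # singular directions to suppress noisy dimensions.
        flat = h_teacher.reshape(-1, h_teacher.shape[-1]).float()
        _, _, Vh = torch.linalg.svd(flat, full_matrices=False)
        P = Vh[:rank].T @ Vh[:rank]                       # (dim, dim) projector
        h_teacher = (h_teacher.float() @ P).to(h_teacher.dtype)
        h_student = (h_student.float() @ P).to(h_student.dtype)
    mse = F.mse_loss(h_student, h_teacher)
    # Temporal term: also match step-to-step deltas so the student tracks
    # how representations evolve over the sequence, not just their values.
    temporal = F.mse_loss(h_student[:, 1:] - h_student[:, :-1],
                          h_teacher[:, 1:] - h_teacher[:, :-1])
    return mse + temporal
```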

Stage 2 – Knowledge Distillation (KL Loss)

After the hidden state alignment, we applied KL divergence-based soft target distillation with a temperature of 1.0.
Higher temperatures (2.0) resulted in near-random babbling — beautiful syntax, absolutely no sense.

With T=1.0, the model steadily converged to a KL of ~1.0, eventually dipping below 0.35 over time.
Sure, it plateaued. Then we poked it. Lowered the learning rate. Spiked it with CE loss.
And yes, we lost sleep when the loss graph flatlined for 2 days straight... but it came back stronger.
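
The corresponding objective can be sketched as temperature-scaled KL distillation with a small cross-entropy term mixed in when the KL plateaus. The coefficients below are illustrative assumptions, not the values actually used in training.

```python
# Hedged sketch of Stage 2 soft-target distillation (illustrative coefficients).
import torch.nn.functional as F

def stage2_kd_loss(student_logits, teacher_logits, targets,
                   T=1.0, ce_weight=0.1):
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)                    # standard temperature scaling
    # Hard-label cross entropy, used as a stabilizer when KL flatlines.
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    return kd + ce_weight * ce
```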


😅 Struggles from the Front Lines (a.k.a. Things We Survived)

  • Training a 21B model using LoRA, mixed precision, and love is exactly as painful as it sounds.
  • At one point, the model output:

    “プロパスパメイメを説。Iとプロプライフリで進する…”
    ...and we realized that perhaps, just maybe, something went wrong in the KL.

  • Learned the hard way that MLP freezing is not a suggestion — it’s a survival strategy.
  • Context length experiments were like dungeon raids.

    “8192? Too laggy. 4096? Barely stable. 2048? Fine. Don’t touch it.”

  • We rolled back checkpoints more often than we backed up our own laptops.
  • Found out that CE loss can be a gentle savior when your KL starts shouting.

💡 Why This Matters

  • No KV cache → Forget the memory spike. Generation speed is consistent.
  • No attention → Lower computational overhead and simpler inference logic.
  • RWKV-style streaming → Long-context generation is now viable on hardware that doesn’t sound like it’s about to launch a rocket.
  • RNN revival at scale → Proving that attention-free architectures can go big — really big.
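
Some back-of-the-envelope arithmetic using the shape from the specs (44 layers, 96 heads of dim 64, bf16). The KV-cache figure assumes a hypothetical same-shape attention model with full multi-head KV, so the numbers are purely illustrative:

```python
# Rough memory comparison at 16384 tokens of context.
# Assumes bf16 (2 bytes), 44 layers, 96 heads of dim 64. The "KV cache" line
# models a hypothetical same-shape attention model, not Reka Flash 3 itself.
layers, heads, head_dim, seq_len, bytes_per = 44, 96, 64, 16384, 2

# Transformer KV cache: K and V per layer, per head, per token.
kv_cache = 2 * layers * heads * head_dim * seq_len * bytes_per
# RWKV-7 recurrent state: one head_dim x head_dim matrix per head, per layer,
# independent of sequence length.
rwkv_state = layers * heads * head_dim * head_dim * bytes_per

print(f"KV cache @ 16k ctx: {kv_cache / 2**30:.1f} GiB (grows with context)")
print(f"RWKV state        : {rwkv_state / 2**20:.1f} MiB (constant)")
```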

📌 Current Status

This is a preview release, intended to showcase the potential of attention-free modeling at the 21B scale.
It is not yet fully fine-tuned for instruction-following or long-context reasoning, but early results are promising and further SFT work is planned.


🧪 Future Plans

  • Extended context training (4096 → 8192)
  • LoRA/PEFT adapters for task specialization
  • Multilingual instruction tuning
  • Alignment & safety refinements
  • Release of 14B and 7B distilled siblings

Training Infrastructure

  • Hardware: 2× AMD MI300X GPUs
  • Training duration: 3 days (Stages 1 and 2)
  • Cost: ~600 USD (not counting downtime from loss spikes)
  • Stage 1: 40M tokens (LR 1e-4)
  • Stage 2: 60M tokens (KD temperature 1.5 → 1.2 → 1.0; LR 3e-5, then 1e-5 final)
  • Stage 3: 800M tokens (TBD)

Acknowledgements

This work was made possible through the contributions of:

❤️ Final Words

This project was built by a single player with limited compute, stubborn optimism, and a disturbing tolerance for GPU crashes.

If you're an LLM developer, you're not alone in talking to your loss graph like it's a houseplant.
If you’re thinking of distilling a transformer into an RNN, let me tell you:

It’s like teaching a cat to speak Latin — but when it finally meows “E pluribus unum,”
it’s all worth it.

Enjoy the model. More to come.


🛠️ Built with open-source passion.
💬 Powered by caffeine.
🔥 Fueled by failure.
And somehow, still speaking fluent RNN.


License

Released under the Apache 2.0 license.

2025 OpenMOSE

https://x.com/_m0se_
