---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
- mbreuss/flower_vla_pret
pipeline_tag: robotics
tags:
- robotics
- VLA
---

# FlowerVLA - Vision-Language-Action Flow Model for CALVIN ABCD

This is a pretrained FlowerVLA model for robotic manipulation, trained on the CALVIN ABCD dataset. FlowerVLA is an efficient Vision-Language-Action Flow policy for robot learning with only ~1B parameters.

## Model Description

FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture
- Provides an efficient, versatile VLA policy with only ~1B parameters

## Model Performance

This checkpoint contains weights for the CALVIN ABCD challenge, on which it currently ranks first with the following results:

| Train→Test | Method | 1 | 2 | 3 | 4 | 5 | **Avg. Len.** |
|------------|--------|---|---|---|---|---|---------------|
| ABCD→D | FlowerVLA | 99.1% | 97.8% | 95.2% | 92.4% | 87.8% | 4.72 |

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions

## Usage

Check out the full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.

```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05

## Citation

```bibtex
@inproceedings{
reuss2025flower,
# Add citation when available
}
```

## License

This model is released under the MIT license.
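As a complement to the Usage example above, here is a minimal sketch of how inputs matching the documented shapes could be assembled. The batch size, horizon, and 224×224 resolution are assumptions for illustration, and the arrays are dummy placeholders for real camera frames; `model` refers to a loaded FlowerVLA policy.

```python
import numpy as np

# Assumed dimensions for illustration: batch, timesteps, image height/width.
B, T, H, W = 1, 1, 224, 224

# Dummy frames standing in for the static and gripper camera images,
# each shaped (B, T, 3, H, W) as specified in the model card.
obs = {
    "rgb_obs": {
        "rgb_static": np.zeros((B, T, 3, H, W), dtype=np.float32),
        "rgb_gripper": np.zeros((B, T, 3, H, W), dtype=np.float32),
    }
}
goal = {"lang_text": "pick up the blue cube"}

# With a loaded policy, inference would look like:
#   action = model.step(obs, goal)
# where `action` is a (B, T, 7) tensor of delta EEF actions.
```

Real deployments should use the preprocessing expected by the released implementation rather than raw zero arrays.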