mbreuss committed on
Commit 5e07210 · verified · 1 Parent(s): a6abe3b

Update README.md

Files changed (1):
  1. README.md +69 -32
README.md CHANGED
@@ -1,46 +1,83 @@
- # FlowerVLA - Vision-Language-Action Flow Model for {dataset_name}
-
- This is a pretrained FlowerVLA model for robotic manipulation trained on the {dataset_name} dataset. FlowerVLA is an efficient Vision-Language-Action Flow policy for robot learning.
-
- ## Model Description
-
- FlowerVLA is a novel architecture that:
- - Uses Florence-2 for multi-modal vision-language encoding
- - Employs a transformer-based flow matching architecture
- - Provides an efficient policy with ~1B parameters
- - Operates on action chunks for better long-horizon planning
-
- ## Usage
-
- ```python
- from huggingface_hub import snapshot_download
- import torch
- import hydra
- from omegaconf import OmegaConf
- import json
- import os
-
- model_path = snapshot_download(repo_id="{repo_id}")
-
- with open(os.path.join(model_path, "config.json")) as f:
-     config = json.load(f)
-
- model_cfg = OmegaConf.create(config["model_config"])
- model_cfg["_target_"] = "flower.models.flower.FLOWERVLA"
-
- model = hydra.utils.instantiate(model_cfg)
-
- state_dict = torch.load(os.path.join(model_path, "model.pt"))
- model.load_state_dict(state_dict)
-
- model.eval()
-
- # obs = {...}  # Your observation dict
- # goal = {"lang_text": "push the blue block to the right"}
- # action = model.step(obs, goal)
- ```
-
- @inproceedings{
- reuss2024multimodal,
- # Add citation when available
- }
-
+ ---
+ license: mit
+ language:
+ - en
+ base_model:
+ - microsoft/Florence-2-large
+ pipeline_tag: robotics
+ tags:
+ - VLA
+ - LIBERO
+ - Robotics
+ - Flow
+ ---
+ # FlowerVLA - Vision-Language-Action Flow Model finetuned on LIBERO Spatial
+
+ This is a pretrained FlowerVLA model for robotic manipulation, finetuned on the LIBERO Spatial dataset.
+ FlowerVLA is an efficient Vision-Language-Action Flow policy for robot learning with only ~1B parameters.
+
+ ## Model Description
+
+ FlowerVLA is a novel architecture that:
+ - Uses half of Florence-2 for multi-modal vision-language encoding
+ - Employs a transformer-based flow matching architecture (see the sketch below)
+ - Provides an efficient, versatile VLA policy with only ~1B parameters
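+
+ For intuition, here is a minimal, generic flow-matching sampler. This is a sketch only; `velocity_fn` is a hypothetical stand-in for the model's denoising transformer, not FlowerVLA's actual API:
+
+ ```python
+ import torch
+
+ def flow_matching_sample(velocity_fn, shape, steps=10):
+     """Integrate a learned velocity field from noise (t=0) toward data (t=1)."""
+     x = torch.randn(shape)  # start from Gaussian noise
+     dt = 1.0 / steps
+     for i in range(steps):
+         t = torch.full((shape[0],), i * dt)  # current flow time per batch element
+         x = x + dt * velocity_fn(x, t)  # Euler step along the learned flow
+     return x
+ ```
+
+ For an action policy, `shape` would be an action chunk such as `(B, T, 7)`.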
 
+
+ ## Model Performance
+
+ This checkpoint contains the weights for the LIBERO Spatial challenge and achieves the following results:
+
+ | Task | Success Rate |
+ |------|--------------|
+ | Average (all tasks) | 0.9681 |
+ | pick_up_the_black_bowl_between_the_plate_and_the_ramekin_and_place_it_on_the_plate | 0.9792 |
+ | pick_up_the_black_bowl_next_to_the_ramekin_and_place_it_on_the_plate | 0.9808 |
+ | pick_up_the_black_bowl_from_table_center_and_place_it_on_the_plate | 0.9808 |
+ | pick_up_the_black_bowl_on_the_cookie_box_and_place_it_on_the_plate | 1.0000 |
+ | pick_up_the_black_bowl_in_the_top_drawer_of_the_wooden_cabinet_and_place_it_on_the_plate | 1.0000 |
+ | pick_up_the_black_bowl_on_the_ramekin_and_place_it_on_the_plate | 0.8622 |
+ | pick_up_the_black_bowl_next_to_the_cookie_box_and_place_it_on_the_plate | 1.0000 |
+ | pick_up_the_black_bowl_on_the_stove_and_place_it_on_the_plate | 1.0000 |
+ | pick_up_the_black_bowl_next_to_the_plate_and_place_it_on_the_plate | 0.9167 |
+ | pick_up_the_black_bowl_on_the_wooden_cabinet_and_place_it_on_the_plate | 0.9615 |
+
+ ### Input/Output Specifications
+
+ #### Inputs
+ - RGB Static Camera: `(B, T, 3, H, W)` tensor
+ - RGB Gripper Camera: `(B, T, 3, H, W)` tensor
+ - Language Instructions: text strings
+
+ #### Outputs
+ - Action Space: `(B, T, 7)` tensor representing delta EEF actions
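+
+ For illustration, a minimal sketch of dummy observations matching these shapes; the batch size, horizon, and 224x224 resolution here are assumptions, not values required by the model:
+
+ ```python
+ import torch
+
+ # Hypothetical dummy inputs shaped (B, T, 3, H, W)
+ static_image = torch.zeros(1, 1, 3, 224, 224)
+ gripper_image = torch.zeros(1, 1, 3, 224, 224)
+ ```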
+
+ ## Usage
+
+ Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.
+
+ ```python
+ # static_image / gripper_image: RGB tensors shaped (B, T, 3, H, W); see above
+ obs = {
+     "rgb_obs": {
+         "rgb_static": static_image,
+         "rgb_gripper": gripper_image,
+     }
+ }
+ goal = {"lang_text": "pick up the blue cube"}
+ action = model.step(obs, goal)
+ ```
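+
+ A minimal loading sketch, adapted from an earlier revision of this README; the file names and the `flower.models.flower.FLOWERVLA` target are carried over from that revision and may have changed:
+
+ ```python
+ import json
+ import os
+
+ import hydra
+ import torch
+ from huggingface_hub import snapshot_download
+ from omegaconf import OmegaConf
+
+ # Download the checkpoint and rebuild the model from its stored config
+ model_path = snapshot_download(repo_id="{repo_id}")  # fill in this repo's id
+
+ with open(os.path.join(model_path, "config.json")) as f:
+     config = json.load(f)
+
+ model_cfg = OmegaConf.create(config["model_config"])
+ model_cfg["_target_"] = "flower.models.flower.FLOWERVLA"
+
+ model = hydra.utils.instantiate(model_cfg)
+ model.load_state_dict(torch.load(os.path.join(model_path, "model.pt")))
+ model.eval()
+ ```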
+
+ ## Training Details
+
+ ### Configuration
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 2e-5
+ - **Weight Decay**: 0.05
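+
+ In PyTorch terms this corresponds roughly to the following; the remaining AdamW arguments are assumed to be the defaults:
+
+ ```python
+ import torch
+
+ # Hyperparameters from the configuration above; betas/eps assumed default
+ optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)
+ ```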
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{
+ reuss2025flower,
+ % Add citation when available
+ }
+ ```
+
+ ## License
+
+ This model is released under the MIT license.