Gesture-to-Code Adapter for StarCoder2-3B

Model Description

This repository contains a Gesture-to-Code Adapter designed to work with the StarCoder2-3B language model. By injecting gesture embeddings into StarCoder2-3B's token embedding space, the adapter enables real-time translation of recognized gestures into structured programming code. It leverages StarCoder2-3B's powerful code generation capabilities, extending them to multimodal input.

Key Features

  • Base Model: StarCoder2-3B, a 3-billion parameter LLM specialized in code.
  • Adapter: A lightweight MLP-based projection layer that aligns gesture embeddings (from a CNN or other visual encoder) with StarCoder2-3B's 3072-dimensional token embeddings (a sketch follows this list).
  • Training Objective: Mean-squared error (MSE) alignment of gesture–token pairs, plus optional contrastive alignment to refine embeddings.
  • Usage: Real-time sign language to code snippet generation, focusing on accessibility for Deaf or hard-of-hearing programmers.
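As a rough illustration of the adapter described above, the following is a minimal sketch of such an MLP projection layer. The class name, hidden width, and default dimensions are illustrative assumptions, not the exact shipped configuration.

import torch
import torch.nn as nn

class GestureToCodeAdapter(nn.Module):
    """Minimal sketch: project a CNN gesture embedding (256-d or 512-d)
    into StarCoder2-3B's 3072-d token-embedding space."""

    def __init__(self, gesture_dim: int = 512, token_dim: int = 3072, hidden_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(gesture_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, gesture_emb: torch.Tensor) -> torch.Tensor:
        # gesture_emb: (batch, gesture_dim) -> (batch, token_dim)
        return self.proj(gesture_emb)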

Dataset

  • Name: A custom gesture dataset containing images for typical code-related gestures (e.g., “for loop,” “if statement,” “function definition”).
  • Format: Each gesture is an image or short video snippet that is converted to a fixed-size CNN embedding and labeled with the intended code structure (an illustrative sample layout follows this list).
  • Scale: The dataset includes around XX,000 samples, covering ~XX discrete gestural instructions.
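For concreteness, one way such samples might be organized is sketched below. The class and field names are hypothetical and not part of the released dataset.

import torch
from torch.utils.data import Dataset

class GestureCodeDataset(Dataset):
    """Hypothetical sample layout: one precomputed CNN embedding per gesture,
    paired with the code construct it denotes."""

    def __init__(self, samples):
        # samples: list of (embedding, code) pairs, e.g.
        # (torch.Tensor of shape (512,), "for i in range(n):\n    ...")
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        embedding, code = self.samples[idx]
        return {"gesture_embedding": embedding, "target_code": code}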

Training Process

  1. Gesture Encoder: A CNN-based classifier extracts 256- or 512-dimensional embeddings from sign images.
  2. Adapter Learning: We train a simple projection (fully connected layers plus an activation) to map these embeddings into StarCoder2-3B's input embedding space (a training sketch follows this list).
  3. Integration: During code generation, the adapter’s output replaces a special token’s embedding (e.g., <G>). The code model then produces a relevant code snippet conditioned on the recognized gesture.
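A minimal sketch of the adapter-learning step (step 2) is shown below. It assumes the targets are StarCoder2-3B embeddings of the token(s) each gesture should map to, reuses the GestureToCodeAdapter sketch from the Key Features section, and uses synthetic tensors and placeholder hyperparameters.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors stand in for real CNN gesture embeddings and the
# StarCoder2-3B token embeddings each gesture should align to.
gesture_embs = torch.randn(256, 512)      # (num_samples, gesture_dim)
target_tok_embs = torch.randn(256, 3072)  # (num_samples, token_dim)
loader = DataLoader(TensorDataset(gesture_embs, target_tok_embs), batch_size=32)

adapter = GestureToCodeAdapter(gesture_dim=512, token_dim=3072)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for epoch in range(10):
    for gesture_emb, target_emb in loader:
        projected = adapter(gesture_emb)            # (batch, 3072)
        loss = F.mse_loss(projected, target_emb)    # MSE alignment objective
        # An optional contrastive term (e.g., InfoNCE over in-batch negatives)
        # could be added here to further refine the embedding space.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()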

Model Performance

  • Embedding Alignment: cosine similarity between the adapter's outputs and the matched StarCoder2-3B token embeddings (a short computation sketch follows this list).
  • Gesture Recognition: accuracy/F1 on sign-to-code classification for recognized gestures.
  • Code Quality: Preliminary tests show valid syntax ~XX% of the time, with advanced logic requiring additional prompt context or manual checks.
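The cosine-similarity metric can be computed as in the short sketch below; the tensors are random stand-ins for real adapter outputs and matched token embeddings.

import torch
import torch.nn.functional as F

# Random stand-ins for a batch of adapter outputs and their matched
# StarCoder2-3B token embeddings.
projected = torch.randn(32, 3072)
targets = torch.randn(32, 3072)

# Mean cosine similarity across the evaluation batch.
mean_cos = F.cosine_similarity(projected, targets, dim=-1).mean()
print(f"Mean cosine similarity: {mean_cos.item():.3f}")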

Intended Use

  1. Accessibility: Provide a new input modality for coding, especially beneficial for Deaf/hard-of-hearing individuals.
  2. Educational Tools: Enable sign-based code demonstrations in academic settings or coding bootcamps.
  3. Research: Investigate multimodal alignment between visual gestures and textual code embeddings.

Limitations

  • Limited Gesture Set: Only covers a subset of sign language gestures and code constructs. Expanding coverage requires additional labeled data.
  • Hardware Requirements: Real-time inference typically requires GPU acceleration for both CNN and StarCoder2-3B.
  • Complex Code: While StarCoder2-3B is advanced, generating complicated multi-file or large-project code end-to-end may not be feasible.

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Load StarCoder2-3B (the causal-LM head is needed for code generation)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
starcoder = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

# 2. Load the adapter
# e.g., adapter = load_adapter("YourName/gesture2code_adapter")

# 3. Integration
# Recognized gesture -> CNN embedding -> adapter -> StarCoder2-3B token embedding.
# The special token <G>'s embedding is replaced with the adapter output (see the sketch below).
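The end-to-end injection in step 3 could look like the following sketch. Here `adapter` and the random `gesture_embedding` are placeholders from the earlier sketches, and the prompt wording is illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

# Register a placeholder token that marks where the gesture embedding goes.
tokenizer.add_special_tokens({"additional_special_tokens": ["<G>"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "Write the Python construct indicated by <G>:\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Embed the prompt, then overwrite the <G> slot with the adapter output.
    inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])
    g_id = tokenizer.convert_tokens_to_ids("<G>")
    g_pos = (inputs["input_ids"][0] == g_id).nonzero(as_tuple=True)[0]

    gesture_embedding = torch.randn(1, 512)               # stand-in for a CNN output
    inputs_embeds[0, g_pos] = adapter(gesture_embedding)  # (1, 3072)

    # Generate code conditioned on the injected gesture embedding.
    output_ids = model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs["attention_mask"],
        max_new_tokens=64,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))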