Space: Luigi/Video-Human-Fall-Detection-with-CLIP (Running on Zero)

Luigi committed 20 days ago

Commit c5ee215 (1 parent: f994a03)

initial commit


Files changed (3)

README.md +42 -4
app.py +43 -4
requirements.txt +7 -0

README.md CHANGED

@@ -1,8 +1,8 @@
 ---
 title: Video Human Fall Detector
-emoji: 🐢
-colorFrom: pink
-colorTo: pink
+emoji: 🐠
+colorFrom: purple
+colorTo: red
 sdk: gradio
 sdk_version: 5.25.0
 app_file: app.py
@@ -11,4 +11,42 @@ license: apache-2.0
 short_description: Fall Detection Demo using LightCLIP
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Fall Detection Demo using LightCLIP on Hugging Face Spaces
+This project demonstrates a lightweight, transformer-based approach to detect human falls in video clips using a vision–language model (VLM). The demo is designed for complex scenes including multiple persons, obstacles, and varying lighting conditions. It employs a sliding-window technique to check multiple frames for robust detection and aggregates predictions over time to reduce false alarms.
+## Overview
+The demo uses a pre-trained LightCLIP (or CLIP) model to compute image–text similarity scores between video frames and natural language prompts. Two prompts are used:
+- **Fall Prompt:** "A person falling on the ground."
+- **Non-Fall Prompt:** "A person standing or walking."
+For each window of frames extracted from the video, the model computes similarity scores for each frame. The scores are aggregated over a sliding window, and if the average score for the "fall" prompt exceeds a defined threshold, a fall event is registered along with an approximate timestamp.
+## Project Files
+- **app.py:** The main application file containing the Gradio demo.
+- **requirements.txt:** Lists all the required Python libraries.
+- **README.md:** This file.
+## How to Run
+1. **Clone or download the repository** into your Hugging Face Spaces.
+2. Ensure the project is set to use the **GPU plan** in Spaces.
+3. Spaces will automatically install the required libraries from `requirements.txt`.
+4. Launch the demo by running `app.py` (Gradio will start the web interface).
+## Code Overview
+- **Frame Extraction:** The video is processed using OpenCV to extract frames (resized to 224×224).
+- **LightCLIP Inference:** The demo uses the Hugging Face Transformers library to load a CLIP model (acting as LightCLIP). It computes image embeddings for each frame and compares them to text embeddings of the fall and non-fall descriptions.
+- **Temporal Aggregation:** A sliding window (e.g. 16 frames with a stride of 8) is used to calculate average "fall" scores. Windows exceeding a threshold (e.g. 0.8) are flagged as fall events.
+- **User Interface:** A simple Gradio UI allows users to upload a video clip and displays the detection result along with a representative frame and list of detected fall times.
+## Customization
+- **Model:** Replace `"openai/clip-vit-base-patch32"` in `app.py` with your own LightCLIP model checkpoint if available.
+- **Threshold & Window Size:** Adjust parameters such as the detection threshold, window size, and stride for better results on your dataset.
+- **Deployment:** This demo is configured to run on a GPU-backed Hugging Face Space for real-time inference.
+Enjoy experimenting with fall detection!
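
Note on the README's "Temporal Aggregation" step: the app.py diff in this commit scores a single image against one text prompt, so the sliding-window logic the README describes is not shown here. The following is a minimal sketch of that step, assuming per-frame "fall" scores already normalized to [0, 1] and a known frame rate. `detect_fall_windows` and its parameters are hypothetical names; the defaults are taken from the README's examples (window of 16 frames, stride of 8, threshold of 0.8).

```python
from typing import List, Tuple


def detect_fall_windows(
    fall_scores: List[float],
    fps: float,
    window_size: int = 16,
    stride: int = 8,
    threshold: float = 0.8,
) -> List[Tuple[float, float]]:
    """Return (start_time_seconds, mean_score) for every window whose
    mean fall score exceeds the threshold.

    Hypothetical helper: fall_scores holds one score per extracted frame,
    and fps is the rate at which those frames were sampled.
    """
    if window_size <= 0 or stride <= 0 or fps <= 0:
        raise ValueError("window_size, stride and fps must be positive")

    events = []
    # Slide over the scores; windows that would run past the end are skipped.
    for start in range(0, len(fall_scores) - window_size + 1, stride):
        window = fall_scores[start:start + window_size]
        mean_score = sum(window) / window_size
        if mean_score > threshold:
            # Approximate timestamp: the first frame of the window.
            events.append((start / fps, mean_score))
    return events


# Usage: 16 low-score frames followed by 16 high-score frames, sampled at 8 fps.
scores = [0.1] * 16 + [0.9] * 16
events = detect_fall_windows(scores, fps=8.0)
```

With overlapping windows a single fall can be flagged more than once, so a real pipeline would probably merge adjacent detections before reporting them.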

app.py CHANGED

@@ -1,7 +1,46 @@
+import torch
+import spaces  # Import early to avoid potential issues
 import gradio as gr
-def greet(name):
-    return "Hello " + name + "!!"
-demo = gr.Interface(fn=greet, inputs="text", outputs="text")
-demo.launch()
+from transformers import CLIPProcessor, CLIPModel
+# Load the CLIP model and processor on the CPU initially
+model_name = "openai/clip-vit-base-patch32"
+model = CLIPModel.from_pretrained(model_name)
+processor = CLIPProcessor.from_pretrained(model_name)
+@spaces.GPU
+def clip_similarity(image, text):
+    """
+    Computes a similarity score between an input image and text using the CLIP model.
+    This function is decorated with @spaces.GPU so that the model is moved to GPU only when needed.
+    """
+    # Create a torch device for cuda
+    device = torch.device("cuda")
+    # Move the model to GPU within the function
+    model.to(device)
+    # Preprocess the inputs and move tensors to GPU
+    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
+    inputs = {key: val.to(device) for key, val in inputs.items()}
+    # Run inference without tracking gradients
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # logits_per_image has shape [1, 1] here (one image, one text); a higher value indicates a better match
+    return outputs.logits_per_image[0, 0].item()
+# Set up the Gradio interface
+iface = gr.Interface(
+    fn=clip_similarity,
+    inputs=[
+        gr.Image(type="pil", label="Upload Image"),
+        gr.Text(label="Input Text")
+    ],
+    outputs=gr.Number(label="Similarity Score"),
+    title="CLIP Similarity Demo with ZeroGPU"
+)
+if __name__ == "__main__":
+    iface.launch()
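
Note on scoring: `clip_similarity` returns the raw `logits_per_image` value for one prompt. In the Transformers CLIP implementation that value is a cosine similarity multiplied by a learned scale, so it is not confined to [0, 1], and the README's example threshold of 0.8 would need a normalized score. One common option, sketched here as an assumption rather than the author's method, is a softmax over the logits for the fall and non-fall prompts. `fall_probability` is a hypothetical helper, written in plain Python so it runs without a GPU.

```python
import math
from typing import Sequence


def fall_probability(logits: Sequence[float]) -> float:
    """Softmax probability of the fall prompt.

    Hypothetical helper: logits[0] is the image-text logit for the fall
    prompt and logits[1] the logit for the non-fall prompt.
    """
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [fall, non_fall]")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    return exps[0] / sum(exps)


# Usage: equal logits are undecided; a 3-point gap strongly favours "fall".
undecided = fall_probability([25.0, 25.0])
likely_fall = fall_probability([27.0, 24.0])
```

Scores produced this way lie in [0, 1] and can be averaged over a window and compared against a threshold such as the README's 0.8.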

requirements.txt ADDED

@@ -0,0 +1,7 @@
+gradio
+torch>=2.4.0
+transformers>=4.20.0
+opencv-python
+Pillow
+accelerate
+yt_dlp