ZeroGPU Usage Guide for Hugging Face Spaces
Spaces ZeroGPU Overview
ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:
- Free GPU Access: Enables cost-effective GPU usage for Spaces.
- Multi-GPU Support: Allows Spaces to leverage multiple GPUs concurrently.
Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for deploying AI models by maximizing resource utilization. The system allocates GPUs on-demand and releases them when not in use.
Technical Specifications
- GPU Type: NVIDIA H200 slice
- Available VRAM: 70GB per workload
Usage Quotas
- Regular users: Limited daily GPU usage quota
- PRO users: 5x more daily usage quota (1500 seconds per day) and highest priority in GPU queues
Version Compatibility
Always check for compatibility with ZeroGPU. Current supported versions:
- Gradio: 4+ (Current project uses 5.29.0)
- PyTorch: 2.1.2, 2.2.2, 2.4.0, 2.5.1 (Note: 2.3.x is not supported due to a PyTorch bug)
- Python: 3.10.13
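As a quick compatibility check, you can log the versions your Space actually resolved at startup. This is a minimal sketch (the logging setup is assumed, not part of app.py):

```python
import logging

import gradio as gr
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Log resolved versions so mismatches with the supported versions above are easy to spot
logger.info(f"Gradio version: {gr.__version__}")
logger.info(f"PyTorch version: {torch.__version__}")
```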
Always verify that the SDK version in README.md is up to date:

```yaml
sdk: gradio
sdk_version: 5.29.0
```
Hugging Face makes it easy to keep your SDK version current: when you view your Space on the platform, it detects whether a newer SDK version is available and shows an upgrade notification with an "Upgrade" button, so you can update the SDK with a single click without manually editing README.md.
And ensure requirements.txt has compatible versions:

```
transformers>=4.30.0
torch==2.4.0
accelerate>=0.26.0
```
Required Environment Variables
Configure these variables in your Space settings:
Secret Variables
- `HF_TOKEN`: Your valid Hugging Face access token (with appropriate permissions)

Regular Variables
- `ZEROGPU_V2=true`: Enables ZeroGPU v2
- `ZERO_GPU_PATCH_TORCH_DEVICE=1`: Enables device patching for PyTorch
According to community discussions, these environment variables are crucial for proper ZeroGPU functioning, especially when using the API. While the exact reason may not be fully documented, they help resolve common issues like quota exceeded errors.
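These variables are set in the Space settings rather than in code, but it can help to confirm at startup that they are visible to the running process. A minimal sketch (the logging setup is assumed):

```python
import logging
import os

logger = logging.getLogger(__name__)

# Confirm the ZeroGPU-related variables listed above are set
for var in ("ZEROGPU_V2", "ZERO_GPU_PATCH_TORCH_DEVICE"):
    logger.info(f"{var}={os.environ.get(var, '<not set>')}")

# Never log the token itself; only check that it is present
if not os.environ.get("HF_TOKEN"):
    logger.warning("HF_TOKEN is not set; private or gated models may fail to load")
```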
Using the `spaces.GPU` Decorator
The `@spaces.GPU` decorator is essential for ZeroGPU functionality. It requests GPU allocation when the function is called and releases it upon completion.
Example from our app.py:
```python
@spaces.GPU
def generate_text_local(model_path, prompt, max_new_tokens=512, temperature=0.7, top_p=0.95):
    """Local text generation"""
    try:
        # Use the already initialized model
        if model_path in pipelines:
            model_pipeline = pipelines[model_path]

            # Log GPU usage information
            device_info = next(model_pipeline.model.parameters()).device
            logger.info(f"Running text generation with {model_path} on device: {device_info}")

            outputs = model_pipeline(
                prompt,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                clean_up_tokenization_spaces=True,
            )
            return outputs[0]["generated_text"].replace(prompt, "").strip()
        else:
            return f"Error: Model {model_path} not initialized"
    except Exception as e:
        logger.error(f"Error in text generation with {model_path}: {str(e)}")
        return f"Error: {str(e)}"
```
You can also specify custom durations for longer-running functions:
```python
@spaces.GPU(duration=120)  # Set max runtime to 120 seconds
def long_running_function(params):
    # Function code
    ...
```
Verifying GPU Execution
It's important to confirm your code is actually running on a GPU. In our app.py, we do this by:
```python
# Log GPU usage information
device_info = next(model_pipeline.model.parameters()).device
logger.info(f"Running text generation with {model_path} on device: {device_info}")
```
This logs the device in use; you should see a CUDA device (e.g., cuda:0) in the output when running on GPU.
Alternative methods to check:
print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name()}")
Parallel Processing with ZeroGPU
ZeroGPU allows efficient parallel processing with multiple models. In our app.py, we use ThreadPoolExecutor to run multiple models concurrently:
```python
def generate_responses(prompt, max_tokens, temperature, top_p, selected_models):
    # ...
    responses = {}
    futures_to_model = {}

    with ThreadPoolExecutor(max_workers=len(selected_models)) as executor:
        # Submit tasks for each model
        futures = []
        for model_name in selected_models:
            model_path = model_options[model_name]
            future = executor.submit(
                generate_text_local,
                model_path,
                prompt,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p
            )
            futures.append(future)
            futures_to_model[future] = model_name

        # Collect results
        for future in as_completed(futures):
            model_name = futures_to_model[future]
            responses[model_name] = future.result()
    # ...
```
This approach efficiently uses GPU resources by:
- Creating concurrent tasks for each model
- Collecting results as they complete
- Automatically releasing GPU resources when each task finishes
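Below is a hedged sketch of how such a function might be wired into the Gradio UI. The component layout and the `model_options` entries are illustrative, not taken from app.py, and it assumes `generate_responses` is defined as above:

```python
import gradio as gr

# Hypothetical model registry; app.py defines its own model_options mapping
model_options = {
    "Model A": "org/model-a",
    "Model B": "org/model-b",
}

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    models = gr.CheckboxGroup(choices=list(model_options), label="Models")
    max_tokens = gr.Slider(16, 1024, value=512, label="Max new tokens")
    temperature = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.95, label="Top-p")
    output = gr.JSON(label="Responses by model")

    run = gr.Button("Generate")
    run.click(
        generate_responses,  # assumes the function shown above is defined
        inputs=[prompt, max_tokens, temperature, top_p, models],
        outputs=output,
    )

demo.launch()
```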
Best Practices
- Always import the spaces module: add `import spaces` at the top of your script
- Decorate GPU-intensive functions: use `@spaces.GPU` for functions requiring GPU
- Specify appropriate durations: set realistic durations to improve queue priority
- Add user authentication: include the Hugging Face sign-in button in your Space UI (see the sketch after this list)
- Log device information: verify that your code is actually running on GPU
- Handle errors gracefully: implement proper error handling around GPU operations
- Use parallel processing wisely: leverage concurrent execution where appropriate
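For the user-authentication point, Gradio provides a `gr.LoginButton` component that adds Hugging Face sign-in to the UI. A minimal hedged sketch (the surrounding layout is illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Hugging Face sign-in button, as recommended in the best practices above
    gr.LoginButton()
    # ... rest of the Space UI ...

demo.launch()
```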
By following these tips, you can effectively utilize ZeroGPU for your Hugging Face Spaces while maximizing performance and avoiding common quota issues.
References
- Hugging Face Spaces ZeroGPU Documentation - Official documentation on Spaces ZeroGPU, including technical specifications, compatibility, and usage guidelines.
- Hugging Face Community Discussion: "Usage quota exceeded" - Community thread discussing common quota issues with ZeroGPU and potential solutions, including the environment variables mentioned in this guide.