ZeroGPU Usage Guide for Hugging Face Spaces
Spaces ZeroGPU Overview
ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:
- Free GPU Access: Enables cost-effective GPU usage for Spaces.
- Multi-GPU Support: Allows Spaces to leverage multiple GPUs concurrently.
Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for deploying AI models by maximizing resource utilization. The system allocates GPUs on-demand and releases them when not in use.
Technical Specifications
- GPU Type: NVIDIA H200 slice
- Available VRAM: 70GB per workload
Usage Quotas
- Regular users: Limited daily GPU usage quota
- PRO users: 5x more daily usage quota (1500 seconds per day) and highest priority in GPU queues
Version Compatibility
Always check for compatibility with ZeroGPU. Current supported versions:
- Gradio: 4+ (Current project uses 5.29.0)
- PyTorch: 2.1.2, 2.2.2, 2.4.0, 2.5.1 (Note: 2.3.x is not supported due to a PyTorch bug)
- Python: 3.10.13
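As a quick compatibility check, you can log the versions your Space actually resolved at startup. This is a minimal sketch (the logging setup is assumed, not part of app.py):

```python
import logging

import gradio as gr
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Log resolved versions so mismatches with the supported versions above are easy to spot
logger.info(f"Gradio version: {gr.__version__}")
logger.info(f"PyTorch version: {torch.__version__}")
```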
Always verify that the SDK version in README.md is up to date:

```yaml
sdk: gradio
sdk_version: 5.29.0
```
Hugging Face makes it easy to keep your SDK version current: when you view your Space on the platform, it detects whether a newer SDK version is available and shows an upgrade notification with an "Upgrade" button, so you can update the SDK with a single click without manually editing README.md.
And ensure requirements.txt has compatible versions:

```
transformers>=4.30.0
torch==2.4.0
accelerate>=0.26.0
```
Required Environment Variables
Configure these variables in your Space settings:
Secret Variables
- `HF_TOKEN`: Your valid Hugging Face access token (with appropriate permissions)

Regular Variables
- `ZEROGPU_V2=true`: Enables ZeroGPU v2
- `ZERO_GPU_PATCH_TORCH_DEVICE=1`: Enables device patching for PyTorch
According to community discussions, these environment variables are crucial for proper ZeroGPU functioning, especially when using the API. While the exact reason may not be fully documented, they help resolve common issues like quota exceeded errors.
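These variables are set in the Space settings rather than in code, but it can help to confirm at startup that they are visible to the running process. A minimal sketch (the logging setup is assumed):

```python
import logging
import os

logger = logging.getLogger(__name__)

# Confirm the ZeroGPU-related variables listed above are set
for var in ("ZEROGPU_V2", "ZERO_GPU_PATCH_TORCH_DEVICE"):
    logger.info(f"{var}={os.environ.get(var, '<not set>')}")

# Never log the token itself; only check that it is present
if not os.environ.get("HF_TOKEN"):
    logger.warning("HF_TOKEN is not set; private or gated models may fail to load")
```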
Using the `spaces.GPU` Decorator
The `@spaces.GPU` decorator is essential for ZeroGPU functionality. It requests GPU allocation when the function is called and releases it upon completion.
Example from our app.py:
```python
@spaces.GPU
def generate_text_local(model_path, prompt, max_new_tokens=512, temperature=0.7, top_p=0.95):
    """Local text generation"""
    try:
        # Use the already initialized model
        if model_path in pipelines:
            model_pipeline = pipelines[model_path]

            # Log GPU usage information
            device_info = next(model_pipeline.model.parameters()).device
            logger.info(f"Running text generation with {model_path} on device: {device_info}")

            outputs = model_pipeline(
                prompt,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                clean_up_tokenization_spaces=True,
            )
            return outputs[0]["generated_text"].replace(prompt, "").strip()
        else:
            return f"Error: Model {model_path} not initialized"
    except Exception as e:
        logger.error(f"Error in text generation with {model_path}: {str(e)}")
        return f"Error: {str(e)}"
```
You can also specify custom durations for longer-running functions:
```python
@spaces.GPU(duration=120)  # Set max runtime to 120 seconds
def long_running_function(params):
    # Function code
    ...
```
Verifying GPU Execution
It's important to confirm your code is actually running on a GPU. In our app.py, we do this by:
```python
# Log GPU usage information
device_info = next(model_pipeline.model.parameters()).device
logger.info(f"Running text generation with {model_path} on device: {device_info}")
```
This logs the device in use; you should see a CUDA device (e.g., cuda:0) in the output when running on GPU.
Alternative methods to check:
print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name()}")
Parallel Processing with ZeroGPU
ZeroGPU allows efficient parallel processing with multiple models. In our app.py, we use ThreadPoolExecutor to run multiple models concurrently:
```python
def generate_responses(prompt, max_tokens, temperature, top_p, selected_models):
    # ...
    responses = {}
    futures_to_model = {}

    with ThreadPoolExecutor(max_workers=len(selected_models)) as executor:
        # Submit tasks for each model
        futures = []
        for model_name in selected_models:
            model_path = model_options[model_name]
            future = executor.submit(
                generate_text_local,
                model_path,
                prompt,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p
            )
            futures.append(future)
            futures_to_model[future] = model_name

        # Collect results
        for future in as_completed(futures):
            model_name = futures_to_model[future]
            responses[model_name] = future.result()
    # ...
```
This approach efficiently uses GPU resources by:
- Creating concurrent tasks for each model
- Collecting results as they complete
- Automatically releasing GPU resources when each task finishes
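Below is a hedged sketch of how such a function might be wired into the Gradio UI. The component layout and the `model_options` entries are illustrative, not taken from app.py, and it assumes `generate_responses` is defined as above:

```python
import gradio as gr

# Hypothetical model registry; app.py defines its own model_options mapping
model_options = {
    "Model A": "org/model-a",
    "Model B": "org/model-b",
}

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    models = gr.CheckboxGroup(choices=list(model_options), label="Models")
    max_tokens = gr.Slider(16, 1024, value=512, label="Max new tokens")
    temperature = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.95, label="Top-p")
    output = gr.JSON(label="Responses by model")

    run = gr.Button("Generate")
    run.click(
        generate_responses,  # assumes the function shown above is defined
        inputs=[prompt, max_tokens, temperature, top_p, models],
        outputs=output,
    )

demo.launch()
```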
Best Practices
- Always import the spaces module: add `import spaces` at the top of your script
- Decorate GPU-intensive functions: use `@spaces.GPU` for functions requiring GPU
- Specify appropriate durations: set realistic durations to improve queue priority
- Add user authentication: include the Hugging Face sign-in button in your Space UI (see the sketch after this list)
- Log device information: verify that your code is actually running on GPU
- Handle errors gracefully: implement proper error handling around GPU operations
- Use parallel processing wisely: leverage concurrent execution where appropriate
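For the user-authentication point, Gradio provides a `gr.LoginButton` component that adds Hugging Face sign-in to the UI. A minimal hedged sketch (the surrounding layout is illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Hugging Face sign-in button, as recommended in the best practices above
    gr.LoginButton()
    # ... rest of the Space UI ...

demo.launch()
```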
By following these tips, you can effectively utilize ZeroGPU for your Hugging Face Spaces while maximizing performance and avoiding common quota issues.
References
- Hugging Face Spaces ZeroGPU Documentation - Official documentation on Spaces ZeroGPU, including technical specifications, compatibility, and usage guidelines.
- Hugging Face Community Discussion: "Usage quota exceeded" - Community thread discussing common quota issues with ZeroGPU and potential solutions, including the environment variables mentioned in this guide.