# ZeroGPU Usage Guide for Hugging Face Spaces

## Spaces ZeroGPU Overview

ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:

1. **Free GPU Access**: Enables cost-effective GPU usage for Spaces.
2. **Multi-GPU Support**: Allows Spaces to leverage multiple GPUs concurrently.

Unlike traditional single-GPU allocations, ZeroGPU lowers the barrier to deploying AI models by maximizing resource utilization: GPUs are allocated on demand and released as soon as they are no longer needed.

### Technical Specifications

- **GPU Type**: NVIDIA H200 slice
- **Available VRAM**: 70GB per workload

### Usage Quotas

- **Regular users**: Limited daily GPU usage quota
- **PRO users**: 5x more daily usage quota (1500 seconds per day) and highest priority in GPU queues

## Version Compatibility

Always check that your stack is compatible with ZeroGPU. Currently supported versions:

- **Gradio**: 4+ (this project uses 5.29.0)
- **PyTorch**: 2.1.2, 2.2.2, 2.4.0, 2.5.1 (note: 2.3.x is not supported due to a PyTorch bug)
- **Python**: 3.10.13

Always verify that the SDK version in README.md is up to date:

```yaml
sdk: gradio
sdk_version: 5.29.0
```

Hugging Face makes it easy to keep your SDK version current. When viewing your Space on the platform, it automatically detects whether a newer SDK version is available and displays an upgrade notification with an "Upgrade" button, so you can update the SDK version with a single click without manually editing README.md.



Also ensure that requirements.txt lists compatible versions:

```
transformers>=4.30.0
torch==2.4.0
accelerate>=0.26.0
```

## Required Environment Variables

Configure these variables in your Space settings:

### Secret Variables

- `HF_TOKEN`: Your Hugging Face access token (with appropriate permissions)

### Regular Variables

- `ZEROGPU_V2=true`: Enables ZeroGPU v2
- `ZERO_GPU_PATCH_TORCH_DEVICE=1`: Enables device patching for PyTorch

According to community discussions, these environment variables are crucial for ZeroGPU to function properly, especially when using the API. While the exact reason is not fully documented, they help resolve common issues such as "quota exceeded" errors.
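
As a quick sanity check, you can confirm at startup that the secret is actually visible to your app. This is a minimal sketch using only the standard library (the logger setup is illustrative, not part of the app.py shown below):

```python
import logging
import os

logger = logging.getLogger(__name__)

# HF_TOKEN is injected by the Space as a secret; warn early if it is missing
# so that authentication failures are easy to diagnose in the logs.
if not os.environ.get("HF_TOKEN"):
    logger.warning("HF_TOKEN is not set; gated or private models may fail to load.")
```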

## Using the `spaces.GPU` Decorator

The `@spaces.GPU` decorator is essential for ZeroGPU functionality. It requests a GPU allocation when the decorated function is called and releases it upon completion.

Example from our app.py:

```python
@spaces.GPU
def generate_text_local(model_path, prompt, max_new_tokens=512, temperature=0.7, top_p=0.95):
    """Local text generation"""
    try:
        # Use the already initialized model
        if model_path in pipelines:
            model_pipeline = pipelines[model_path]

            # Log GPU usage information
            device_info = next(model_pipeline.model.parameters()).device
            logger.info(f"Running text generation with {model_path} on device: {device_info}")

            outputs = model_pipeline(
                prompt,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                clean_up_tokenization_spaces=True,
            )

            return outputs[0]["generated_text"].replace(prompt, "").strip()
        else:
            return f"Error: Model {model_path} not initialized"
    except Exception as e:
        logger.error(f"Error in text generation with {model_path}: {str(e)}")
        return f"Error: {str(e)}"
```

You can also specify a custom duration for longer-running functions:

```python
@spaces.GPU(duration=120)  # Set max runtime to 120 seconds
def long_running_function(params):
    # Function code
    ...
```
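
Shorter durations improve your position in the ZeroGPU queue, so it pays to set a value close to the function's realistic worst-case runtime. Depending on your version of the `spaces` package, `duration` may also accept a callable that computes the time limit from the function's arguments (the "dynamic duration" feature described in the ZeroGPU documentation). The sketch below assumes that feature is available; verify against the official docs before relying on it:

```python
# Assumed "dynamic duration" usage: the callable receives the same arguments
# as the decorated function and returns a number of seconds.
def get_duration(prompt, max_new_tokens=512):
    # Rough heuristic: allow more time when generating more tokens.
    return 30 if max_new_tokens <= 512 else 90

@spaces.GPU(duration=get_duration)
def generate(prompt, max_new_tokens=512):
    ...
```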

## Verifying GPU Execution

It is important to confirm that your code is actually running on a GPU. In our app.py, we do this by logging the device that the model's parameters live on:

```python
# Log GPU usage information
device_info = next(model_pipeline.model.parameters()).device
logger.info(f"Running text generation with {model_path} on device: {device_info}")
```

You should see "cuda" in the logged device if the GPU is being used.

Alternative methods to check:

```python
print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name()}")
```
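
Keep in mind that ZeroGPU only attaches a GPU while a `@spaces.GPU`-decorated function is executing, so these checks are most meaningful when run inside such a function. A minimal sketch (the function name is illustrative):

```python
import spaces
import torch

@spaces.GPU
def check_gpu():
    # Runs with the ZeroGPU allocation attached, so these queries reflect
    # the actual GPU rather than the CPU-only host process.
    available = torch.cuda.is_available()
    name = torch.cuda.get_device_name() if available else "cpu"
    return f"CUDA available: {available}, device: {name}"
```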

## Parallel Processing with ZeroGPU

ZeroGPU allows efficient parallel processing with multiple models. In our app.py, we use ThreadPoolExecutor to run multiple models concurrently:

```python
def generate_responses(prompt, max_tokens, temperature, top_p, selected_models):
    # ...
    responses = {}
    futures_to_model = {}

    with ThreadPoolExecutor(max_workers=len(selected_models)) as executor:
        # Submit tasks for each model
        futures = []
        for model_name in selected_models:
            model_path = model_options[model_name]
            future = executor.submit(
                generate_text_local,
                model_path,
                prompt,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p
            )
            futures.append(future)
            futures_to_model[future] = model_name

        # Collect results
        for future in as_completed(futures):
            model_name = futures_to_model[future]
            responses[model_name] = future.result()
    # ...
```

This approach efficiently uses GPU resources by:

1. Creating concurrent tasks for each model
2. Collecting results as they complete
3. Automatically releasing GPU resources when each task finishes
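
For illustration, this is how the handler above might be invoked. The model names are hypothetical placeholders for keys in your own model_options mapping, and the sketch assumes the elided code returns the responses dictionary:

```python
# Hypothetical invocation: "Model A" and "Model B" stand in for whatever
# display names your model_options dictionary actually defines.
responses = generate_responses(
    prompt="Explain ZeroGPU in one paragraph.",
    max_tokens=256,
    temperature=0.7,
    top_p=0.95,
    selected_models=["Model A", "Model B"],
)
for name, text in responses.items():
    print(f"--- {name} ---\n{text}")
```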

## Best Practices

1. **Always import the spaces module**: `import spaces` at the top of your script
2. **Decorate GPU-intensive functions**: Use `@spaces.GPU` for functions requiring a GPU
3. **Specify appropriate durations**: Set realistic durations to improve queue priority
4. **Add user authentication**: Include the Hugging Face sign-in button in your Space UI (see the sketch after this list)
5. **Log device information**: Verify that your code is actually running on the GPU
6. **Handle errors gracefully**: Implement proper error handling around GPU operations
7. **Use parallel processing wisely**: Leverage concurrent execution where appropriate
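
For point 4, here is a minimal sketch of adding the sign-in button to a Gradio Blocks app. It assumes your Space has Hugging Face OAuth enabled in its README metadata (e.g. `hf_oauth: true`); check the Spaces OAuth documentation for the exact configuration:

```python
import gradio as gr

with gr.Blocks() as demo:
    # Lets visitors sign in with their Hugging Face account so that their own
    # ZeroGPU quota (e.g. PRO quota) applies to the requests they make.
    gr.LoginButton()
    # ... rest of the UI ...

demo.launch()
```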

By following these tips, you can use ZeroGPU effectively in your Hugging Face Spaces while maximizing performance and avoiding common quota issues.

## References

1. [Hugging Face Spaces ZeroGPU Documentation](https://huggingface.co/docs/hub/spaces-zerogpu) - Official documentation on Spaces ZeroGPU, including technical specifications, compatibility, and usage guidelines.
2. [Hugging Face Community Discussion: Usage quota exceeded](https://discuss.huggingface.co/t/usage-quota-exceeded/106619) - Community thread discussing common quota issues with ZeroGPU and potential solutions, including the environment variables mentioned in this guide.