nyasukun committed on
Commit
dad89f9
·
1 Parent(s): 209933e

zero gpu tips

Files changed (2):
  1. hf_readme.png
  2. zerogpu.md (+175 lines)
hf_readme.png ADDED
zerogpu.md ADDED
# ZeroGPU Usage Guide for Hugging Face Spaces

## Spaces ZeroGPU Overview

ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:

1. **Free GPU Access**: Enables cost-effective GPU usage for Spaces.
2. **Multi-GPU Support**: Allows Spaces to leverage multiple GPUs concurrently.

Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for deploying AI models by maximizing resource utilization. The system allocates GPUs on demand and releases them when not in use.

### Technical Specifications
- **GPU Type**: NVIDIA H200 slice
- **Available VRAM**: 70GB per workload

### Usage Quotas
- **Regular users**: Limited daily GPU usage quota
- **PRO users**: 5x the daily usage quota (1500 seconds per day) and highest priority in GPU queues

## Version Compatibility

Always check that your stack is compatible with ZeroGPU. Currently supported versions:

- **Gradio**: 4+ (this project uses 5.29.0)
- **PyTorch**: 2.1.2, 2.2.2, 2.4.0, 2.5.1 (2.3.x is not supported due to a PyTorch bug)
- **Python**: 3.10.13

Always verify that the SDK version in README.md is up to date:
```yaml
sdk: gradio
sdk_version: 5.29.0
```
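
For reference, a fuller README front matter might look like this (a minimal sketch; the title, emoji, colors, and app file are placeholders, and only `sdk` and `sdk_version` matter for ZeroGPU compatibility):
```yaml
---
title: ZeroGPU Demo
emoji: ⚡
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
```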

Hugging Face makes it easy to keep the SDK version current: when you view your Space on the Hub, it detects whether a newer SDK version is available and shows an upgrade notification with an "Upgrade" button, so you can update with a single click without editing README.md by hand.

![Gradio SDK upgrade notification on Hugging Face](hf_readme.png)

Also make sure requirements.txt pins compatible versions:
```
transformers>=4.30.0
torch==2.4.0
accelerate>=0.26.0
```
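
If you want to catch an incompatible PyTorch build before it causes confusing runtime errors, one option is a small startup check (a minimal sketch; the supported-version list mirrors the table above and may change over time):
```python
import sys
import torch

# Versions listed above as ZeroGPU-compatible; update if the docs change.
SUPPORTED_TORCH = {"2.1.2", "2.2.2", "2.4.0", "2.5.1"}

base_version = torch.__version__.split("+")[0]  # e.g. "2.4.0+cu121" -> "2.4.0"
if base_version not in SUPPORTED_TORCH:
    print(f"Warning: torch {torch.__version__} may not be supported on ZeroGPU", file=sys.stderr)
```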

## Required Environment Variables

Configure these variables in your Space settings:

### Secret Variables
- `HF_TOKEN`: A valid Hugging Face access token with the appropriate permissions

### Regular Variables
- `ZEROGPU_V2=true`: Enables ZeroGPU v2
- `ZERO_GPU_PATCH_TORCH_DEVICE=1`: Enables device patching for PyTorch

Community discussions indicate that these variables matter for reliable ZeroGPU behavior, especially when a Space is called through the API. The exact mechanism is not fully documented, but setting them helps resolve common issues such as quota-exceeded errors.
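
A quick way to confirm the variables are actually visible to the Space at runtime is to log them at startup (a minimal sketch; it only reports whether the token is set, never its value):
```python
import os
import logging

logger = logging.getLogger(__name__)

# Report whether the ZeroGPU-related variables are present.
logger.info("HF_TOKEN set: %s", bool(os.environ.get("HF_TOKEN")))
logger.info("ZEROGPU_V2=%s", os.environ.get("ZEROGPU_V2"))
logger.info("ZERO_GPU_PATCH_TORCH_DEVICE=%s", os.environ.get("ZERO_GPU_PATCH_TORCH_DEVICE"))
```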

## Using the `spaces.GPU` Decorator

The `@spaces.GPU` decorator is essential for ZeroGPU functionality. It requests GPU allocation when the function is called and releases it upon completion.

Example from our app.py:

```python
@spaces.GPU
def generate_text_local(model_path, prompt, max_new_tokens=512, temperature=0.7, top_p=0.95):
    """Local text generation"""
    try:
        # Use the already initialized model
        if model_path in pipelines:
            model_pipeline = pipelines[model_path]

            # Log GPU usage information
            device_info = next(model_pipeline.model.parameters()).device
            logger.info(f"Running text generation with {model_path} on device: {device_info}")

            outputs = model_pipeline(
                prompt,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                clean_up_tokenization_spaces=True,
            )

            return outputs[0]["generated_text"].replace(prompt, "").strip()
        else:
            return f"Error: Model {model_path} not initialized"
    except Exception as e:
        logger.error(f"Error in text generation with {model_path}: {str(e)}")
        return f"Error: {str(e)}"
```
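
The snippet above assumes a module-level `pipelines` dict that is populated when the Space starts. That loading code is not shown here; a minimal sketch of what it might look like with `transformers` (the model list and dtype are assumptions, not this project's actual configuration):
```python
import torch
from transformers import pipeline

# Hypothetical model list; replace with the models your Space actually serves.
MODEL_PATHS = ["Qwen/Qwen2.5-0.5B-Instruct"]

pipelines = {}
for path in MODEL_PATHS:
    # Load once at startup; ZeroGPU attaches the GPU when a
    # @spaces.GPU-decorated function actually runs.
    pipelines[path] = pipeline(
        "text-generation",
        model=path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
```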

You can also specify custom durations for longer-running functions:

```python
@spaces.GPU(duration=120)  # Set max runtime to 120 seconds
def long_running_function(params):
    # Function code
    ...
```
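
The ZeroGPU documentation also describes a dynamic variant in which `duration` is computed from the call's arguments. A minimal sketch, assuming your installed `spaces` version supports passing a callable (check the current docs before relying on it):
```python
def get_duration(prompt, max_new_tokens=512, **kwargs):
    # Rough heuristic: budget more GPU time for longer generations.
    return 30 if max_new_tokens <= 256 else 90

@spaces.GPU(duration=get_duration)
def generate(prompt, max_new_tokens=512, **kwargs):
    # GPU-bound generation code goes here.
    ...
```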

## Verifying GPU Execution

It's important to confirm that your code is actually running on a GPU. In our app.py, we do this by logging the model's device:

```python
# Log GPU usage information
device_info = next(model_pipeline.model.parameters()).device
logger.info(f"Running text generation with {model_path} on device: {device_info}")
```

You should see a CUDA device (e.g. "cuda:0") in the log output when the function is running on the GPU.

Alternative ways to check:
```python
import torch

print(f"Is CUDA available: {torch.cuda.is_available()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name()}")
```
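
Note that on ZeroGPU the GPU is typically attached only while a `@spaces.GPU`-decorated function is executing, so running these checks at module import time can report no CUDA device even though the Space is configured correctly. A minimal sketch that runs the checks where they are meaningful:
```python
import spaces
import torch

@spaces.GPU
def check_gpu():
    # Inside the decorated function a GPU should be attached.
    available = torch.cuda.is_available()
    return {
        "cuda_available": available,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }
```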

## Parallel Processing with ZeroGPU

ZeroGPU allows efficient parallel processing with multiple models. In our app.py, we use a ThreadPoolExecutor to run multiple models concurrently:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_responses(prompt, max_tokens, temperature, top_p, selected_models):
    # ...
    responses = {}
    futures_to_model = {}

    with ThreadPoolExecutor(max_workers=len(selected_models)) as executor:
        # Submit tasks for each model
        futures = []
        for model_name in selected_models:
            model_path = model_options[model_name]
            future = executor.submit(
                generate_text_local,
                model_path,
                prompt,
                max_new_tokens=max_tokens,
                temperature=temperature,
                top_p=top_p
            )
            futures.append(future)
            futures_to_model[future] = model_name

        # Collect results
        for future in as_completed(futures):
            model_name = futures_to_model[future]
            responses[model_name] = future.result()
    # ...
```

This approach efficiently uses GPU resources by:
1. Creating concurrent tasks for each model
2. Collecting results as they complete
3. Automatically releasing GPU resources when each task finishes
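
For context, here is a minimal sketch of how such a function might be wired into the Gradio UI (the component choices and fixed generation settings are assumptions, not the Space's actual layout):
```python
import gradio as gr

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    selected = gr.CheckboxGroup(choices=list(model_options.keys()), label="Models")
    output = gr.JSON(label="Responses")
    run = gr.Button("Generate")
    # Each click fans out to one @spaces.GPU call per selected model.
    run.click(
        fn=lambda p, m: generate_responses(p, 512, 0.7, 0.95, m),
        inputs=[prompt, selected],
        outputs=output,
    )

demo.launch()
```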

## Best Practices

1. **Always import the spaces module**: `import spaces` at the top of your script
2. **Decorate GPU-intensive functions**: Use `@spaces.GPU` for functions requiring GPU (see the sketch after this list)
3. **Specify appropriate durations**: Set realistic durations to improve queue priority
4. **Add user authentication**: Include the Hugging Face sign-in button in your Space UI
5. **Log device information**: Verify that your code is actually running on the GPU
6. **Handle errors gracefully**: Implement proper error handling around GPU operations
7. **Use parallel processing wisely**: Leverage concurrent execution where appropriate
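
Putting several of these practices together, a minimal sketch of a ZeroGPU-ready app.py (the model and UI below are placeholders, not this project's actual code):
```python
import spaces  # importing spaces early is recommended so it can patch torch for ZeroGPU
import gradio as gr
import torch
from transformers import pipeline

# Hypothetical model; swap in the model your Space actually uses.
pipe = pipeline("text-generation", model="gpt2", torch_dtype=torch.float32)

@spaces.GPU(duration=60)
def generate(prompt: str) -> str:
    try:
        out = pipe(prompt, max_new_tokens=128, do_sample=True)
        return out[0]["generated_text"]
    except Exception as e:  # surface errors instead of crashing the request
        return f"Error: {e}"

demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch()
```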

By following these tips, you can effectively utilize ZeroGPU for your Hugging Face Spaces while maximizing performance and avoiding common quota issues.

## References

1. [Hugging Face Spaces ZeroGPU Documentation](https://huggingface.co/docs/hub/spaces-zerogpu) - Official documentation on Spaces ZeroGPU, including technical specifications, compatibility, and usage guidelines.

2. [Hugging Face Community Discussion: Usage quota exceeded](https://discuss.huggingface.co/t/usage-quota-exceeded/106619) - Community thread discussing common quota issues with ZeroGPU and potential solutions, including the environment variables mentioned in this guide.