Spaces: damienbenveniste/deploy_vLLM

Damien Benveniste committed on Aug 12, 2024

Commit b9fb207 (1 parent: ae23345)

modified

Files changed (1)

app.py CHANGED

@@ -17,7 +17,6 @@ engine = AsyncLLMEngine.from_engine_args(
         max_num_seqs=16,               # Reduced for T4
         gpu_memory_utilization=0.85,   # Slightly increased, adjust if needed
         max_model_len=512,            # Phi-3-mini-4k context length
-        quantization='awq',            # Enable quantization if supported by the model
         enforce_eager=True,            # Disable CUDA graph
         dtype='half',                  # Use half precision
     )
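
The hunk above drops `quantization='awq'` and leaves the other engine arguments unchanged. The commit does not say why. One plausible reason, stated here as an assumption, is that the model checkpoint served by the Space is not AWQ-quantized, in which case vLLM cannot load it with that setting. A minimal sketch of keeping the option switchable instead of hard-coded follows. The helper `build_engine_kwargs` and the placeholder model name are hypothetical and do not appear in `app.py`; the vLLM calls are left as comments so the snippet runs without a GPU.

```python
from typing import Any, Dict, Optional


def build_engine_kwargs(model: str, quantization: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for vllm.AsyncEngineArgs (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "max_num_seqs": 16,              # values copied from the diff
        "gpu_memory_utilization": 0.85,
        "max_model_len": 512,
        "enforce_eager": True,           # disables CUDA graph capture
        "dtype": "half",
    }
    # Pass `quantization` only when the checkpoint is actually quantized,
    # for example quantization="awq" for an AWQ checkpoint.
    if quantization is not None:
        kwargs["quantization"] = quantization
    return kwargs


# Placeholder model name: the diff does not show which model app.py loads.
kwargs = build_engine_kwargs("your-org/your-model")
print(kwargs)

# With vLLM installed, the dict could be passed on like this:
# from vllm import AsyncEngineArgs, AsyncLLMEngine
# engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**kwargs))
```

Calling `build_engine_kwargs(model, quantization="awq")` restores the pre-commit behavior for a checkpoint that is known to be AWQ-quantized.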
