Sanket Rai

sanketrai

https://sanketrai.xyz

AI & ML interests

NLP, CV, RL, Deep Learning, Gen AI, MLOps

sanketrai's activity

upvoted an article 15 days ago

Article

Illustrating Reinforcement Learning from Human Feedback (RLHF)

Dec 9, 2022

• 247

upvoted an article 21 days ago

Article

Mixture of Experts Explained

Dec 11, 2023

• 601

upvoted an article 29 days ago

Article

You could have designed state of the art positional encoding

Nov 25, 2024

• 241

updated a model about 2 months ago

sanketrai/qwen-2.5-coder-7b-sql-create-context-qlora

Updated Mar 19

published a model about 2 months ago

sanketrai/qwen-2.5-coder-7b-sql-create-context-qlora

Updated Mar 19

updated a model 2 months ago

sanketrai/starcoder-1b-hf-stack-v1-lora

Updated Mar 7 • 3

published a model 2 months ago

sanketrai/starcoder-1b-hf-stack-v1-lora

Updated Mar 7 • 3

reacted to macadeliccc's post with 🔥 3 months ago

Post

1389

Save money on your compute bill by using LMCache to share a prefix KV cache between two vLLM instances. By deploying the LMCache backend alongside your vLLM containers, you can share a prefix KV cache between two different containers and models. It is simple to integrate into an existing stack.

Step 1: Pull the Docker image

docker pull apostacyh/vllm:lmcache-0.1.0

Step 2: Start vLLM + LMCache

model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<Your huggingface access token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml

You can add a second vLLM instance, as long as it runs on a separate GPU, by deploying another container:

# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<Your huggingface token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml

This method supports local, remote, or hybrid backends, so whichever vLLM deployment method you already use should work with the LMCache container (excluding BentoML).
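One rough way to check that the two instances above are actually sharing a prefix is to send each of them a prompt that starts with the same long text and compare request times. The sketch below is only an illustration under assumptions that the post does not confirm: that the image serves vLLM's OpenAI-compatible API at /v1/completions on the ports used above, and that curl is installed. The helper names build_payload and query_instance are hypothetical.

```shell
# Hypothetical helper: builds an OpenAI-style completions request body.
# Assumes the model name and prompt contain no characters that need JSON escaping.
build_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "${3:-16}"
}

# Hypothetical helper: sends a prompt to the instance on the given port and
# prints the total request time reported by curl.
# Assumes an OpenAI-compatible /v1/completions endpoint (not confirmed by the post).
query_instance() {
    port=$1
    model=$2
    prompt=$3
    curl -s -o /dev/null -w "port $port: %{time_total}s\n" \
        "http://localhost:$port/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "$model" "$prompt")"
}

# Example usage (run only once both containers are up):
#   prefix="<some long shared context>"
#   query_instance 8000 "$model" "$prefix Question A"
#   query_instance 8001 "$model" "$prefix Question B"
```

If the prefix cache is shared as described, the second request would be expected to spend less time on the common prefix than it would against a cold instance, though timings from a single request are noisy and should be treated as a hint rather than a measurement.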

LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm