# CacheFlow

## Installation

```bash
pip install psutil numpy ray torch
pip install git+https://github.com/huggingface/transformers  # Required for LLaMA.
pip install sentencepiece  # Required for LlamaTokenizer.
pip install flash-attn  # This may take up to 20 mins.
pip install -e .
```
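
After installing, a quick sanity check can confirm that the core dependencies import cleanly (a minimal sketch; `cacheflow` is the package installed by `pip install -e .` above):

```python
# Quick sanity check: verify that the core dependencies import cleanly.
import torch
import ray
import transformers
import sentencepiece
import flash_attn
import cacheflow

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}")
```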

## Test simple server

```bash
ray start --head
python simple_server.py
```

The full list of arguments for `simple_server.py` can be found by running:

```bash
python simple_server.py --help
```

## FastAPI server

Install the following additional dependencies:

```bash
pip install fastapi uvicorn
```

To start the server:

```bash
ray start --head
python -m cacheflow.http_frontend.fastapi_frontend
```

To test the server:

```bash
python -m cacheflow.http_frontend.test_cli_client
```
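
Alternatively, a request can be sent to the server by hand. The sketch below is assumption-heavy: the address, route, and payload fields are placeholders rather than the frontend's actual API; see `cacheflow/http_frontend/test_cli_client.py` for the real request format.

```python
# Hand-rolled client sketch. The address, route, and payload schema below
# are assumptions -- check test_cli_client.py for the actual request format.
import requests

response = requests.post(
    "http://localhost:8000/generate",        # assumed host, port, and route
    json={"prompt": "The future of AI is"},  # assumed payload schema
)
print(response.text)
```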

## Gradio web server

Install the following additional dependencies:

```bash
pip install gradio
```

Start the server:

```bash
python -m cacheflow.http_frontend.fastapi_frontend
# In another terminal
python -m cacheflow.http_frontend.gradio_webserver
```
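
Once both processes are up, Gradio prints a local URL (typically http://127.0.0.1:7860 by default) where the web UI can be opened in a browser.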