`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embeddings