This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
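The default-selection idea in the message above can be sketched roughly as follows. This is an illustrative sketch, not the code from this change: `pickParallel`, `estimateFn`, the candidate range, and the toy memory numbers are all hypothetical, and it assumes the total context is the per-request num_ctx multiplied by the parallel count.

```go
package main

import "fmt"

// estimateFn is a hypothetical estimator: bytes of VRAM needed to load the
// model with the given total context length.
type estimateFn func(numCtx int) uint64

// pickParallel returns the parallel setting and the resulting total context.
// An explicit user setting (userParallel > 0) is honoured as-is. Otherwise
// candidates are tried from maxDefault down to 2, and the first one whose
// estimate fits in freeVRAM wins.
// Assumption: total context = per-request context * parallel.
func pickParallel(userParallel, maxDefault, perRequestCtx int, freeVRAM uint64, estimate estimateFn) (parallel, numCtx int) {
	if userParallel > 0 {
		return userParallel, perRequestCtx * userParallel
	}
	for p := maxDefault; p > 1; p-- {
		if ctx := perRequestCtx * p; estimate(ctx) <= freeVRAM {
			return p, ctx
		}
	}
	// Nothing larger fits: fall back to one request at a time.
	return 1, perRequestCtx
}

func main() {
	// Toy estimator with made-up numbers: 4 GiB of weights plus 1 MiB per context token.
	estimate := func(numCtx int) uint64 {
		return 4<<30 + uint64(numCtx)<<20
	}
	p, ctx := pickParallel(0, 4, 2048, 9<<30, estimate)
	fmt.Println(p, ctx) // prints "2 4096"
}
```

With 9 GiB free in this toy setup, parallel values of 4 and 3 would need 12 GiB and 10 GiB respectively, so the sketch settles on 2. Searching downward from the largest candidate favours throughput when VRAM allows it while still leaving small GPUs with a configuration that loads.
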

| Name | | |
|---|---|---|
| auth.go | ||
| download.go | ||
| fixblobs_test.go | ||
| fixblobs.go | ||
| images.go | ||
| layer.go | ||
| manifest_test.go | ||
| manifest.go | ||
| model.go | ||
| modelpath_test.go | ||
| modelpath.go | ||
| prompt_test.go | ||
| prompt.go | ||
| routes_create_test.go | ||
| routes_delete_test.go | ||
| routes_list_test.go | ||
| routes_test.go | ||
| routes.go | ||
| sched_test.go | ||
| sched.go | ||
| upload.go | ||