Serving Inference
Using Release.ai to serve inference APIs.
Once you have tuned a model, you need a way to serve it to the rest of your apps and ultimately your users. An inference server such as Ollama is one way to achieve this. In this doc we will show you how to serve your models using Release.
Creating and running the Ollama service
Create an Ollama server by selecting Create Application on the Applications page, then clicking Create from template. Select an Execution type of Server, then pick Ollama from the list of available templates. Hit Finish, then click Create Environment. If you want this environment to stay up persistently, select Permanent as the Environment Type and give it a name like production or staging.
The Llm field can be populated with the name of a pre-built Ollama model; you can browse the available models in the Ollama library. Make sure to select a cluster that has GPU resources available in the Advanced Options section.
Now click Create Environment. Once the environment is up and running, you can begin sending queries to it.
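A quick way to confirm the server is reachable is to list the models it has loaded. This is a minimal sketch assuming a placeholder hostname (https://ollama-production.example.com); substitute the hostname shown on your environment page.

```bash
# List the models currently available on the Ollama server.
# The hostname is a placeholder -- use the one from your environment page.
curl https://ollama-production.example.com/api/tags
```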
Querying the Ollama server
You can query your new server using CURL and the hostname from the environment page.
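As a sketch, assuming the environment hostname is https://ollama-production.example.com and the Llm field was set to llama3 (both placeholders), a completion request against Ollama's generate endpoint looks like this:

```bash
# Send a single prompt to the Ollama generate endpoint.
# Hostname and model name are placeholders -- replace them with your own.
curl https://ollama-production.example.com/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```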
Loading a new model
When you tune a new version of a model and convert it to HuggingFace format, you can load it into Ollama by using CURL.
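One way to do this is Ollama's create API. The sketch below assumes the converted weights have already been placed on the Ollama host (for example as a GGUF file at /models/my-tuned-model.gguf); the hostname, path, and model name are placeholders, and the exact request fields can vary between Ollama versions.

```bash
# Register the tuned weights with Ollama under a new model name.
# Path, hostname, and model name are placeholders; field names may
# differ depending on your Ollama version.
curl https://ollama-production.example.com/api/create \
  -d '{
    "name": "my-tuned-model",
    "modelfile": "FROM /models/my-tuned-model.gguf"
  }'
```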
Now you can query it with the same command given above, substituting the new model name.
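For example, reusing the placeholder hostname and model name from the sketch above:

```bash
# Same generate call as before, now targeting the newly loaded model.
curl https://ollama-production.example.com/api/generate \
  -d '{"model": "my-tuned-model", "prompt": "Hello from the tuned model.", "stream": false}'
```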