Serving Inference

Using Release.ai to serve inference APIs.

Once you have tuned a model, you need a way to serve it to the rest of your apps and, ultimately, your users. An inference server such as Ollama is one way to achieve this, and in this doc we will show you how to serve your models with Ollama on Release.

Creating and running the Ollama service

Create an Ollama server by selecting Create Application on the Applications page, then click Create from template. Set the Execution type to Server, then pick Ollama from the list of available templates. Hit Finish, then click Create Environment. If you want this environment to stay up persistently, select Permanent as the Environment Type and give it a name like production or staging.

The Llm field can be populated with the name of a pre-built Ollama model. You can browse available models in the Ollama library.

In the Advanced Options section, make sure to select a cluster that has GPU resources available.

Now click Create Environment. Once the environment is up and running, you can begin sending queries to it.
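
Before wiring the server into your apps, you can do a quick reachability check against the base URL. This is a minimal sketch, assuming OLLAMA_BASE_URL holds the hostname shown on your environment page; a healthy Ollama server typically responds with a short "Ollama is running" message.

# Quick reachability check; substitute the hostname from your environment page.
curl https://${OLLAMA_BASE_URL}/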

Querying the Ollama server

You can query your new server using curl and the hostname from the environment page.

curl https://${OLLAMA_BASE_URL}/api/chat -d '{
  "model": "${YOUR_MODEL_NAME}",
  "stream": false,
  "messages": [
    { "role": "user", "content": "why is the sky blue" }
  ]
}'
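
The response comes back as a JSON object whose message.content field holds the model's reply. As a minimal sketch, assuming you have the jq CLI installed and have substituted your own hostname and model name for the placeholders as above, you can print just the text of the answer:

# Same request as above, with -s to silence progress output and jq to extract only the reply text.
curl -s https://${OLLAMA_BASE_URL}/api/chat -d '{
  "model": "${YOUR_MODEL_NAME}",
  "stream": false,
  "messages": [
    { "role": "user", "content": "why is the sky blue" }
  ]
}' | jq -r '.message.content'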

Loading a new model

When you tune a new version of a model and convert it to Hugging Face format, you can load it into Ollama using curl.

curl https://${OLLAMA_BASE_URL}/api/create -d '{
  "name": "${YOUR_MODEL_NAME}",
  "modelfile": "FROM ${YOUR_MODEL_PATH}/ollama.bin\nTEMPLATE \"[INST] <<SYS>>{{ .System }}<</SYS>>\n{{ .Prompt }} [/INST]\""
}'
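
To confirm the new model was created, you can list the models the server currently knows about using Ollama's /api/tags endpoint; the name you just registered should appear in the returned list.

# List the models available on the server; the newly created name should be included.
curl https://${OLLAMA_BASE_URL}/api/tags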

Now you can query it with the same command shown above, substituting the new model name.
