Serverless AI API
The nature of AI and LLM workloads on already trained models lends itself very naturally to a serverless-style architecture. As a framework for building and deploying serverless applications, Spin provides an interface for you to perform AI inference within Spin applications.
Using Serverless AI From Applications
Configuration
By default, a given component of a Spin application will not have access to any Serverless AI models. Access must be provided explicitly via the Spin application’s manifest (the spin.toml file). For example, an individual component in a Spin application could be given access to the codellama-instruct model by adding the following ai_models configuration inside the specific [component.(name)] section:
# -- snip --
[component.please-send-the-codes]
ai_models = ["codellama-instruct"]
# -- snip --
Spin supports “llama2-chat” and “codellama-instruct” for inferencing and “all-minilm-l6-v2” for generating embeddings.
File Structure
By default, Spin expects the pre-trained model files (for the models configured as per the previous section) to be downloaded by the user and placed in the .spin/ai-models/ directory of the application. For example:
code-generator-rs/.spin/ai-models/llama/codellama-instruct
See the Serverless AI Tutorial documentation for more concrete examples of implementing the Fermyon Serverless AI API in your favorite language.
Embeddings models are slightly more complicated; it is expected that both a tokenizer.json and a model.safetensors are located in the directory named after the model. For example, for the foo-bar-baz model, Spin will look in the .spin/ai-models/foo-bar-baz directory for tokenizer.json and model.safetensors.
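The layout described above can be checked before running an application. The following Python sketch is one possible way to do that; the helper name missing_embedding_files is hypothetical, and the demo builds a throwaway directory rather than touching a real application.

```python
import tempfile
from pathlib import Path

# Files this document says an embeddings model directory must contain.
REQUIRED_EMBEDDING_FILES = ("tokenizer.json", "model.safetensors")


def missing_embedding_files(app_dir, model_name):
    """List the required files absent from <app_dir>/.spin/ai-models/<model_name>."""
    model_dir = Path(app_dir) / ".spin" / "ai-models" / model_name
    return [name for name in REQUIRED_EMBEDDING_FILES
            if not (model_dir / name).is_file()]


# Demo against a temporary directory that contains only the tokenizer.
with tempfile.TemporaryDirectory() as app_dir:
    model_dir = Path(app_dir) / ".spin" / "ai-models" / "foo-bar-baz"
    model_dir.mkdir(parents=True)
    (model_dir / "tokenizer.json").write_text("{}")
    print(missing_embedding_files(app_dir, "foo-bar-baz"))
```

An empty list means both files are in place; anything else names what still needs to be downloaded.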
Serverless AI Interface
The Spin SDK surfaces the Serverless AI interface to a variety of different languages. See the Language Support Overview to see if your specific language is supported.
The set of operations is common across all supporting language SDKs:
| Operation | Parameters | Returns | Behavior | 
|---|---|---|---|
| infer | model (string), prompt (string) | string | Performs inference on a specific model. The first parameter is the name of the model (e.g. llama2-chat, codellama-instruct, or other), passed in as a string. The second parameter is the prompt, passed in as a string. |
| infer_with_options | model (string), prompt (string), params (list) | string | Performs inference on a specific model with explicit inferencing parameters. The first parameter is the name of the model (e.g. llama2-chat, codellama-instruct, or other), passed in as a string. The second parameter is the prompt, passed in as a string. The third parameter is a mix of float and unsigned integer values relating to inferencing, in this order: (1) max-tokens (unsigned 32-bit integer). Note: the backing implementation may return fewer tokens. Default is 100; (2) repeat-penalty (32-bit float). The amount the model should avoid repeating tokens. Default is 1.1; (3) repeat-penalty-last-n-token-count (unsigned 32-bit integer). The number of tokens the model should apply the repeat penalty to. Default is 64; (4) temperature (32-bit float). The randomness with which the next token is selected. Default is 0.8; (5) top-k (unsigned 32-bit integer). The number of possible next tokens the model will choose from. Default is 40; (6) top-p (32-bit float). The probability total of next tokens the model will choose from. Default is 0.9. The result from infer_with_options is a string. |
| generate-embeddings | model (string), prompt (list<string>) | two-dimensional array of float32 | Generates embeddings using a specific model. The first parameter is the name of the model (e.g. all-minilm-l6-v2), passed in as a string. The second parameter is the prompt, passed in as a list of strings. The result from generate-embeddings is a two-dimensional array containing float32 values only. |
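The defaults in the table can be captured in one place so that callers override only what they need. The Python sketch below is a stand-in for illustration, not an SDK type: the dictionary DEFAULT_PARAMS and the helper with_defaults are hypothetical, and the keys use snake_case spellings of the hyphenated names in the table.

```python
# Default inferencing parameters, copied from the table above.
DEFAULT_PARAMS = {
    "max_tokens": 100,
    "repeat_penalty": 1.1,
    "repeat_penalty_last_n_token_count": 64,
    "temperature": 0.8,
    "top_k": 40,
    "top_p": 0.9,
}


def with_defaults(**overrides):
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        # Reject misspelled names instead of silently ignoring them.
        raise ValueError(f"unknown inferencing parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMS, **overrides}


params = with_defaults(max_tokens=400, temperature=0.2)
print(params["max_tokens"], params["top_k"])  # prints: 400 40
```

Rejecting unknown keys is a design choice of this sketch; it catches typos such as topk early rather than falling back to a default without warning.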
The exact detail of calling these operations from your application depends on your language:
Want to go straight to the reference documentation? Find it here.
In Rust, the llm module from the Spin SDK provides the Serverless AI functions. The following snippet is from the Rust code generation example:
use spin_sdk::{
    http::{IntoResponse, Request, Response},
    llm,
};
// -- snip --
fn handle_code(req: Request) -> anyhow::Result<impl IntoResponse> {
    // -- snip --
    let result = llm::infer_with_options(
        llm::InferencingModel::CodellamaInstruct,
        &prompt,
        llm::InferencingParams {
            max_tokens: 400,
            repeat_penalty: 1.1,
            repeat_penalty_last_n_token_count: 64,
            temperature: 0.8,
            top_k: 40,
            top_p: 0.9,
        },
    )?;
    // -- snip --	
}
General Notes
infer_with_options operation:
- The above example takes the model name llm::InferencingModel::CodellamaInstruct as input. From an interface point of view, the model name is technically an alias for a string (to maximize future compatibility as users want to support more and different types of models).
- The second parameter is a prompt (string) from whoever/whatever is making the request to the handle_code() function.
- A third, optional parameter lets you specify inferencing parameters such as max_tokens, repeat_penalty, repeat_penalty_last_n_token_count, temperature, top_k and top_p.
- The return value (the inferencing-result record) contains a text field of type string. Ideally, this would be a stream that would allow streaming inferencing results back to the user, but streaming support is not yet ready for use, so it is left as a possible future backward-incompatible change.
To use Serverless AI functions in TypeScript, the Llm module from the Spin SDK provides two methods: infer and generateEmbeddings. For example:
import { ResponseBuilder, Llm} from "@fermyon/spin-sdk"
export async function handler(req: Request, res: ResponseBuilder) {
    let embeddings = Llm.generateEmbeddings(Llm.EmbeddingModels.AllMiniLmL6V2, ["someString"])
    console.log(embeddings.embeddings)
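    // Example prompt; in a real handler this would typically come from the request.
    const prompt = "What is a good prompt?"
    let result = Llm.infer(Llm.InferencingModels.Llama2Chat, prompt)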
    res.set({"content-type":"text/plain"})
    res.send(result.text)
}
General Notes
infer operation:
- It takes the following arguments: model name, prompt, and an optional third parameter for inferencing options.
- The model name is a string. There are enums for the inbuilt models (llama2-chat and codellama-instruct) in InferencingModels.
- The optional third parameter is an InferencingOptions interface that allows you to specify parameters such as maxTokens, repeatPenalty, repeatPenaltyLastNTokenCount, temperature, topK and topP.
- The return value is an InferenceResult.
generateEmbeddings operation:
- It takes two arguments - model name and list of strings to generate the embeddings for.
- The model name is a string. There are enums for the inbuilt models (AllMiniLmL6V2) in EmbeddingModels.
- The return value is an EmbeddingResult.
In Python, Serverless AI functions are available in the llm module of the spin_sdk package. For example:
from spin_sdk import http
from spin_sdk.http import Request, Response
from spin_sdk import llm
class IncomingHandler(http.IncomingHandler):
    def handle_request(self, request: Request) -> Response:
        prompt = "You are a stand up comedy writer. Tell me a joke."
        result = llm.infer("llama2-chat", prompt)
        # The generated text is plain text, not JSON.
        return Response(200,
                        {"content-type": "text/plain"},
                        bytes(result.text, "utf-8"))
General Notes
infer operation:
- The model name is passed in as a string (as shown above; "llama2-chat").
infer_with_options operation:
- It takes in a model name, prompt text, and optionally a parameter object to control the inferencing.
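The infer_with_options call described above might look like the following sketch. It assumes the Python SDK's llm module exposes an InferencingParams type that accepts snake_case keyword arguments matching the parameter table; that name and signature are assumptions here, so check the reference documentation. To keep the sketch runnable outside a Spin component, the llm module is passed in as an argument and a small stand-in (fake_llm, hypothetical) is used for the demo; inside a component you would pass spin_sdk.llm instead.

```python
from types import SimpleNamespace


def tell_joke(llm, topic):
    """Run inference with explicit options and return the generated text."""
    prompt = f"You are a stand up comedy writer. Tell me a joke about {topic}."
    # Assumed API: a parameter object built from keyword arguments.
    options = llm.InferencingParams(max_tokens=50, temperature=0.5)
    result = llm.infer_with_options("llama2-chat", prompt, options)
    return result.text


# Stand-in for spin_sdk.llm so the sketch runs without the Spin runtime.
# It echoes the model name, one option, and the prompt instead of inferring.
fake_llm = SimpleNamespace(
    InferencingParams=lambda **kwargs: kwargs,
    infer_with_options=lambda model, prompt, options: SimpleNamespace(
        text=f"[{model}, max_tokens={options['max_tokens']}] {prompt}"
    ),
)

print(tell_joke(fake_llm, "compilers"))
```

Passing the module in as a parameter is a choice made for this sketch; it also makes the handler logic easy to unit test without a model.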
Serverless AI functions are available in the github.com/fermyon/spin/sdk/go/v2/llm package. See Go Packages for reference documentation. For example:
package main
import (
	"fmt"
	"net/http"
	spinhttp "github.com/fermyon/spin/sdk/go/v2/http"
	"github.com/fermyon/spin/sdk/go/v2/llm"
)
func init() {
	spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
		result, err := llm.Infer("llama2-chat", "What is a good prompt?", nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Printf("Prompt tokens:    %d\n", result.Usage.PromptTokenCount)
		fmt.Printf("Generated tokens: %d\n", result.Usage.GeneratedTokenCount)
		fmt.Fprintf(w, "%s\n", result.Text)
		embeddings, err := llm.GenerateEmbeddings("all-minilm-l6-v2", []string{"Hello world"})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Printf("Prompt Tokens: %d\n", embeddings.Usage.PromptTokenCount)
		fmt.Printf("%v\n", embeddings.Embeddings)
	})
}
General Notes
infer operation:
- It takes the following arguments: model name, prompt, and an optional third parameter for inferencing options (pass nil if you don’t want to specify it).
- The model name is a string.
- The params argument allows you to specify MaxTokens, RepeatPenalty, RepeatPenaltyLastNTokenCount, Temperature, TopK and TopP.
- It returns a result struct with a Text field that contains the answer and a Usage field that contains metadata about the operation.
generateEmbeddings operation:
- It takes two arguments - model name and list of strings to generate the embeddings for.
- The model name is a string: all-minilm-l6-v2.
- It returns a result struct with an Embeddings field that contains the [][]float32 embeddings and a Usage field that contains metadata about the operation.
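The embeddings result is a two-dimensional array with one vector per input string. A common way to use such vectors is to compare them with cosine similarity. The sketch below is written in Python and is independent of any Spin SDK; the function name and the three-dimensional sample vectors are made up for illustration, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the rows of an embeddings result.
embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

print(cosine_similarity(embeddings[0], embeddings[1]))  # partially aligned
print(cosine_similarity(embeddings[0], embeddings[2]))  # orthogonal
```

Values close to 1 indicate vectors pointing in similar directions, and values near 0 indicate unrelated ones.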
Troubleshooting
Error “Local LLM operations are not supported in this version of Spin”
If you see “Local LLM operations are not supported in this version of Spin”, then your copy of Spin has been built without local LLM support.
The term “version” in the error message refers to how the software you are using built the Spin runtime, not to the numeric version of the runtime itself.
Most Spin builds support local LLMs as described above. However, the models built into Spin do not build on some combinations of platforms (for example, there are known problems with the aarch64/musl combination). This may cause some environments that embed Spin to disable the local LLM feature altogether. (For example, some versions of the containerd-spin-shim did this.) In such cases, you will see the error above.
In such cases, you can:
- See if there is another Spin build available for your platform. All current builds from the Spin GitHub repository or Fermyon Spin installer support local LLMs.
- Use the cloud-gpu plugin and runtime config option to have LLM inferencing serviced in Fermyon Cloud instead of locally.