Model Versioning & Properties

Version management, model metadata, lifecycle states, and quantization options.

Model Versioning

Every model in Xerotier has a semantic version (semver) string. When you upload a model, it is assigned version 1.0.0 by default. You can create new versions of a model, track version history, and control which version is active.
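Semver strings order numerically field by field, which matters when comparing versions such as 1.0.0 and 2.1.3. A minimal ordering sketch (not Xerotier's implementation; it assumes plain MAJOR.MINOR.PATCH strings with no pre-release tags):

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into an orderable tuple."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, so 2.1.3 > 2.0.0 > 1.0.0,
# and 10.2.1 correctly sorts above 9.9.9 (no string comparison pitfalls).
versions = ["1.0.0", "2.1.3", "2.0.0"]
newest = max(versions, key=parse_semver)
```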

Version Fields

| Field | Type | Description |
| --- | --- | --- |
| version | string | Semver version string (e.g., "1.0.0", "2.1.3"). Defaults to "1.0.0". |
| parentVersionId | UUID \| null | Reference to the previous version. Null for the first version. |
| isLatest | boolean | Whether this is the active version. Only one version per model name per project should be marked as latest. |
| versionNotes | string \| null | Description of changes in this version. |

Version Management with xeroctl

Shell
```shell
# List versions for a model
xeroctl models versions list <model-id>

# Create a new version
xeroctl models versions create <model-id> 2.0.0 --notes "Improved accuracy"

# Promote a version to latest
xeroctl models versions promote <model-id> 2.0.0

# Roll back to a previous version
xeroctl models versions rollback <model-id> 1.0.0
```

Endpoints resolve to the latest version of a model by default. When you promote a version, the endpoint automatically picks up the new version on subsequent requests.
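The resolution rule above can be sketched as: given a model's version records, the active one is the single record with isLatest set. A minimal illustration (hypothetical helper, not part of any SDK):

```python
def resolve_latest(versions: list[dict]) -> dict:
    """Return the single version record flagged as latest.

    Raises if the invariant 'exactly one latest version' is violated.
    """
    latest = [v for v in versions if v["isLatest"]]
    if len(latest) != 1:
        raise ValueError("exactly one version must be marked latest")
    return latest[0]

records = [
    {"version": "1.0.0", "isLatest": False},
    {"version": "2.0.0", "isLatest": True},
]
```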

Version Management with the API

curl
```shell
# List versions for a model
curl https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions \
  -H "Authorization: Bearer xero_my-project_abc123"

# Create a new version
curl -X POST https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "version": "2.0.0",
    "versionNotes": "Improved accuracy"
  }'

# Promote a version to latest
curl -X POST https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions/2.0.0/promote \
  -H "Authorization: Bearer xero_my-project_abc123"
```
Python
```python
import requests

headers = {"Authorization": "Bearer xero_my-project_abc123"}
base = "https://api.xerotier.ai/proj_ABC123/v1"
model_id = "MODEL_ID"

# List versions
versions = requests.get(
    f"{base}/models/{model_id}/versions", headers=headers
).json()
for v in versions.get("versions", []):
    print(f"  {v['version']} (latest={v['isLatest']})")

# Create a new version
response = requests.post(
    f"{base}/models/{model_id}/versions",
    headers=headers,
    json={"version": "2.0.0", "versionNotes": "Improved accuracy"},
)
print(f"Created version: {response.json()}")

# Promote a version
requests.post(
    f"{base}/models/{model_id}/versions/2.0.0/promote", headers=headers
)
```
Node.js
```javascript
const headers = {
  "Authorization": "Bearer xero_my-project_abc123",
  "Content-Type": "application/json"
};
const base = "https://api.xerotier.ai/proj_ABC123/v1";
const modelId = "MODEL_ID";

// List versions
const versionsResponse = await fetch(
  `${base}/models/${modelId}/versions`,
  { headers }
);
const versions = await versionsResponse.json();
for (const v of versions.versions || []) {
  console.log(`  ${v.version} (latest=${v.isLatest})`);
}

// Create a new version
const createResponse = await fetch(`${base}/models/${modelId}/versions`, {
  method: "POST",
  headers,
  body: JSON.stringify({ version: "2.0.0", versionNotes: "Improved accuracy" })
});
console.log("Created version:", await createResponse.json());

// Promote a version
await fetch(`${base}/models/${modelId}/versions/2.0.0/promote`, {
  method: "POST",
  headers
});
```

Model Metadata

Each model carries extensive metadata that affects routing, inference behavior, and display in the catalog.

Core Properties

| Field | Type | Description |
| --- | --- | --- |
| name | string | Model display name. |
| format | string | Storage format: safetensors, bin, exl2, or directory. |
| sizeBytes | integer | Total model size in bytes. |
| status | string | Current state: uploading, validating, ready, or error. |
| architecture | string \| null | Model architecture family (e.g., "llama", "qwen", "mistral"). |
| parameterCount | integer \| null | Number of model parameters. |
| contextLength | integer \| null | Maximum context window in tokens. Affects max_tokens auto-clamping at the router. |
| license | string \| null | Model license identifier. |
| isMultimodal | boolean | Whether the model supports image/multimodal input. Default: false. |
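The max_tokens auto-clamping driven by contextLength can be pictured as capping the completion budget so that prompt plus completion fit in the window. The router's exact policy is not documented here, so treat this as an assumed rule:

```python
def clamp_max_tokens(requested: int, prompt_tokens: int, context_length: int) -> int:
    """Cap the completion budget so prompt + completion fit the context window.

    Assumed clamping rule for illustration; the router may differ in details
    (e.g., reserving tokens for special/control tokens).
    """
    available = max(context_length - prompt_tokens, 0)
    return min(requested, available)
```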

Generation Defaults

| Field | Type | Description |
| --- | --- | --- |
| defaultTemperature | double \| null | Model's default temperature if not specified in the request. |
| defaultTopP | double \| null | Model's default top_p if not specified in the request. |
| chatTemplate | string \| null | Jinja2 template for message formatting (from tokenizer_config.json). |
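Applying these defaults amounts to a simple fallback: request-level values win, and the model's defaults fill any gaps. A sketch with a hypothetical helper name:

```python
def effective_sampling(request: dict, model: dict) -> dict:
    """Merge request parameters with model defaults: request values win,
    model defaults fill in anything the request leaves unset."""
    return {
        "temperature": request.get("temperature", model.get("defaultTemperature")),
        "top_p": request.get("top_p", model.get("defaultTopP")),
    }
```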

Model Lifecycle

Models progress through a series of status transitions from upload to availability:

| Status | Description |
| --- | --- |
| uploading | Model files are being uploaded. Not yet available for inference. |
| validating | Upload complete. The system is validating model files, extracting metadata (architecture, parameters, context length), and checking compatibility. |
| ready | Validation passed. The model can be assigned to endpoints and loaded on backends. |
| error | Validation failed. The validationError field contains details about what went wrong. Fix the issue and re-upload or revalidate. |
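The transitions implied by this table can be captured as a small state machine. Note that the error -> validating and ready -> validating edges are assumptions inferred from the revalidation behavior, not documented transitions:

```python
# Allowed status transitions, derived from the lifecycle table above.
# The edges out of "ready" and "error" (back to "validating") model
# revalidation and are an assumption, not documented behavior.
TRANSITIONS = {
    "uploading": {"validating"},
    "validating": {"ready", "error"},
    "ready": {"validating"},
    "error": {"validating"},
}

def can_transition(current: str, target: str) -> bool:
    """Check whether a status change is allowed by the sketch above."""
    return target in TRANSITIONS.get(current, set())
```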

You can trigger revalidation of a model using xeroctl models revalidate <model-id> or the POST /models/:modelId/revalidate API endpoint. This re-checks model files and updates metadata without re-uploading.

Model Loading

When a model is assigned to an endpoint and a request arrives, the router sends a load request to a compatible backend. The backend auto-configures:

  • Context length -- Auto-detected from model config if not specified.
  • Max sequences -- Auto-calculated based on available resources.
  • Quantization -- Selected based on model size vs available GPU VRAM (see Quantization).

Workload Types

Each model can be tagged with a workload type that describes its primary use case:

| Type | Description |
| --- | --- |
| chat | General-purpose conversational models. Default workload type. |
| code | Code generation and completion models. |
| reasoning | Models optimized for chain-of-thought and analytical tasks. |
| embedding | Text embedding models for semantic search and similarity. |
| multilingual | Models with strong multilingual support. |

Workload type is used for filtering in the model catalog and does not affect inference behavior or routing.
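Because the workload type is a catalog filter only, it can be applied client-side over a model list. This sketch assumes the tag is exposed as a workloadType field on each model record (the exact field name is not specified here), with chat as the default:

```python
def filter_by_workload(models: list[dict], workload: str) -> list[dict]:
    """Client-side catalog filter. Assumes a 'workloadType' field on each
    record (field name is an assumption) and the documented 'chat' default."""
    return [m for m in models if m.get("workloadType", "chat") == workload]

catalog = [
    {"name": "coder-7b", "workloadType": "code"},
    {"name": "chatty-8b", "workloadType": "chat"},
    {"name": "embed-small", "workloadType": "embedding"},
]
```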

Quantization

Quantization reduces model size and memory requirements by using lower-precision number formats. Xerotier supports both pre-quantized models and runtime quantization.

Pre-Quantized Models

Some models are distributed with quantization already applied. These models have isPreQuantized: true and include details about the method used:

| Field | Description |
| --- | --- |
| preQuantizationMethod | Method used: compressed-tensors, gptq, or awq. |
| preQuantizationBits | Precision: 4 or 8 bits. |
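preQuantizationBits translates into weight size roughly as parameters × bits / 8. A back-of-envelope helper (it ignores scales, zero-points, and other overhead that real quantized formats carry):

```python
def approx_weight_bytes(parameter_count: int, bits: int) -> int:
    """Approximate weight storage: parameters * bits / 8 bytes.
    Rough estimate only; quantized formats add per-group metadata."""
    return parameter_count * bits // 8

# A 7B-parameter model: ~14 GB at fp16 vs ~3.5 GB at 4-bit
fp16 = approx_weight_bytes(7_000_000_000, 16)
int4 = approx_weight_bytes(7_000_000_000, 4)
```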

Runtime Quantization

If a model does not fit in available GPU memory at full precision, the backend agent can apply runtime quantization automatically. The load acknowledgment includes:

  • appliedQuantization -- The method applied (e.g., "bitsandbytes", "fp8", "bitsandbytes-fp4", "awq", "gptq").
  • quantizationReason -- Why this method was chosen: "native_fits", "pre_quantized", "runtime_quantization", or "cannot_fit".

Runtime quantization is transparent to the user: the model behaves the same way, with a smaller memory footprint at slightly reduced numerical precision.

MoE Model Support

Mixture-of-Experts (MoE) models are supported. MoE-specific metadata fields include numExperts (total experts in the model) and numExpertsPerTok (experts activated per token). MoE models can benefit from tuned kernel configurations, configurable via the backend agent's --enable-moe-config flag.
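The ratio numExpertsPerTok / numExperts gives the fraction of expert parameters that are active per token. A trivial helper to make that concrete (shared layers such as attention and embeddings are always active and are not modeled here):

```python
def active_expert_fraction(num_experts: int, num_experts_per_tok: int) -> float:
    """Fraction of expert parameters active for each token; ignores the
    always-active shared layers (attention, embeddings, router)."""
    return num_experts_per_tok / num_experts

# e.g. a model with numExperts=8 and numExpertsPerTok=2 activates 1/4 of its experts
frac = active_expert_fraction(8, 2)
```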