# Model Versioning & Properties

Version management, model metadata, lifecycle states, and quantization options.

## Model Versioning
Every model in Xerotier has a semantic version (semver) string. When you
upload a model, it is assigned version 1.0.0 by default. You
can create new versions of a model, track version history, and control
which version is active.
### Version Fields

| Field | Type | Description |
|---|---|---|
| `version` | string | Semver version string (e.g., `"1.0.0"`, `"2.1.3"`). Defaults to `"1.0.0"`. |
| `parentVersionId` | UUID \| null | Reference to the previous version. Null for the first version. |
| `isLatest` | boolean | Whether this is the active version. Only one version per model name per project should be marked as latest. |
| `versionNotes` | string \| null | Description of changes in this version. |
### Version Management with xeroctl

```bash
# List versions for a model
xeroctl models versions list <model-id>

# Create a new version
xeroctl models versions create <model-id> 2.0.0 --notes "Improved accuracy"

# Promote a version to latest
xeroctl models versions promote <model-id> 2.0.0

# Rollback to a previous version
xeroctl models versions rollback <model-id> 1.0.0
```
Endpoints resolve to the latest version of a model by default. When you promote a version, the endpoint automatically picks up the new version on subsequent requests.
### Version Management with the API

```bash
# List versions for a model
curl https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions \
  -H "Authorization: Bearer xero_my-project_abc123"

# Create a new version
curl -X POST https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions \
  -H "Authorization: Bearer xero_my-project_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "version": "2.0.0",
    "versionNotes": "Improved accuracy"
  }'

# Promote a version to latest
curl -X POST https://api.xerotier.ai/proj_ABC123/v1/models/MODEL_ID/versions/2.0.0/promote \
  -H "Authorization: Bearer xero_my-project_abc123"
```
```python
import requests

headers = {"Authorization": "Bearer xero_my-project_abc123"}
base = "https://api.xerotier.ai/proj_ABC123/v1"
model_id = "MODEL_ID"

# List versions
versions = requests.get(
    f"{base}/models/{model_id}/versions",
    headers=headers
).json()
for v in versions.get("versions", []):
    print(f"  {v['version']} (latest={v['isLatest']})")

# Create a new version
response = requests.post(
    f"{base}/models/{model_id}/versions",
    headers=headers,
    json={"version": "2.0.0", "versionNotes": "Improved accuracy"}
)
print(f"Created version: {response.json()}")

# Promote a version
requests.post(
    f"{base}/models/{model_id}/versions/2.0.0/promote",
    headers=headers
)
```
```javascript
const headers = {
  "Authorization": "Bearer xero_my-project_abc123",
  "Content-Type": "application/json"
};
const base = "https://api.xerotier.ai/proj_ABC123/v1";
const modelId = "MODEL_ID";

// List versions
const versionsResponse = await fetch(
  `${base}/models/${modelId}/versions`,
  { headers }
);
const versions = await versionsResponse.json();
for (const v of versions.versions || []) {
  console.log(`  ${v.version} (latest=${v.isLatest})`);
}

// Create a new version
const createResponse = await fetch(
  `${base}/models/${modelId}/versions`,
  {
    method: "POST",
    headers,
    body: JSON.stringify({
      version: "2.0.0",
      versionNotes: "Improved accuracy"
    })
  }
);
console.log("Created version:", await createResponse.json());

// Promote a version
await fetch(
  `${base}/models/${modelId}/versions/2.0.0/promote`,
  { method: "POST", headers }
);
```
## Model Metadata

Each model carries extensive metadata that affects routing, inference behavior, and display in the catalog.

### Core Properties

| Field | Type | Description |
|---|---|---|
| `name` | string | Model display name. |
| `format` | string | Storage format: `safetensors`, `bin`, `exl2`, or `directory`. |
| `sizeBytes` | integer | Total model size in bytes. |
| `status` | string | Current state: `uploading`, `validating`, `ready`, or `error`. |
| `architecture` | string \| null | Model architecture family (e.g., `"llama"`, `"qwen"`, `"mistral"`). |
| `parameterCount` | integer \| null | Number of model parameters. |
| `contextLength` | integer \| null | Maximum context window in tokens. Affects `max_tokens` auto-clamping at the router. |
| `license` | string \| null | Model license identifier. |
| `isMultimodal` | boolean | Whether the model supports image/multimodal input. Default: `false`. |
### Generation Defaults

| Field | Type | Description |
|---|---|---|
| `defaultTemperature` | double \| null | Model's default temperature if not specified in the request. |
| `defaultTopP` | double \| null | Model's default `top_p` if not specified in the request. |
| `chatTemplate` | string \| null | Jinja2 template for message formatting (from `tokenizer_config.json`). |
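How these defaults interact with request parameters can be sketched as follows. The precedence shown (request value, then model default, then a generic fallback) follows the table's wording, but the fallback values and function name are assumptions:

```python
def resolve_sampling_params(request: dict, model: dict) -> dict:
    """Pick temperature/top_p: the request value wins, else the model's
    default, else a generic server-wide fallback."""
    fallbacks = {"temperature": 1.0, "top_p": 1.0}
    resolved = {}
    for param, model_field in [("temperature", "defaultTemperature"),
                               ("top_p", "defaultTopP")]:
        if request.get(param) is not None:
            resolved[param] = request[param]
        elif model.get(model_field) is not None:
            resolved[param] = model[model_field]
        else:
            resolved[param] = fallbacks[param]
    return resolved

model = {"defaultTemperature": 0.6, "defaultTopP": None}
print(resolve_sampling_params({"temperature": None, "top_p": 0.9}, model))
# -> {'temperature': 0.6, 'top_p': 0.9}
```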
## Model Lifecycle

Models progress through a series of status transitions from upload to availability:

| Status | Description |
|---|---|
| `uploading` | Model files are being uploaded. Not yet available for inference. |
| `validating` | Upload complete. The system is validating model files, extracting metadata (architecture, parameters, context length), and checking compatibility. |
| `ready` | Validation passed. The model can be assigned to endpoints and loaded on backends. |
| `error` | Validation failed. The `validationError` field contains details about what went wrong. Fix the issue and re-upload or revalidate. |
You can trigger revalidation of a model using `xeroctl models revalidate <model-id>` or the `POST /models/:modelId/revalidate` API endpoint. This re-checks model files and updates metadata without re-uploading.
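After triggering a revalidation, a client typically polls until the model leaves `validating`. A minimal sketch of such a poll loop; how the model object is fetched is left to a callable you supply, since the exact status-polling endpoint is not documented on this page:

```python
import time

TERMINAL_STATUSES = {"ready", "error"}

def wait_for_validation(fetch_model, poll_interval=2.0, timeout=300.0):
    """Poll fetch_model() until the model's status is 'ready' or 'error'.

    fetch_model is any callable returning the model object as a dict,
    e.g. one that GETs the model resource with your project API key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        model = fetch_model()
        if model.get("status") in TERMINAL_STATUSES:
            return model
        time.sleep(poll_interval)
    raise TimeoutError("model did not finish validating in time")
```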
## Model Loading
When a model is assigned to an endpoint and a request arrives, the router sends a load request to a compatible backend. The backend auto-configures:
- Context length -- Auto-detected from model config if not specified.
- Max sequences -- Auto-calculated based on available resources.
- Quantization -- Selected based on model size vs available GPU VRAM (see Quantization).
## Workload Types

Each model can be tagged with a workload type that describes its primary use case:

| Type | Description |
|---|---|
| `chat` | General-purpose conversational models. Default workload type. |
| `code` | Code generation and completion models. |
| `reasoning` | Models optimized for chain-of-thought and analytical tasks. |
| `embedding` | Text embedding models for semantic search and similarity. |
| `multilingual` | Models with strong multilingual support. |
Workload type is used for filtering in the model catalog and does not affect inference behavior or routing.
## Quantization
Quantization reduces model size and memory requirements by using lower-precision number formats. Xerotier supports both pre-quantized models and runtime quantization.
### Pre-Quantized Models

Some models are distributed with quantization already applied. These models have `isPreQuantized: true` and include details about the method used:

| Field | Description |
|---|---|
| `preQuantizationMethod` | Method used: `compressed-tensors`, `gptq`, or `awq`. |
| `preQuantizationBits` | Precision: 4 or 8 bits. |
### Runtime Quantization
If a model does not fit in available GPU memory at full precision, the backend agent can apply runtime quantization automatically. The load acknowledgment includes:
- `appliedQuantization` -- The method applied (e.g., `"bitsandbytes"`, `"fp8"`, `"bitsandbytes-fp4"`, `"awq"`, `"gptq"`).
- `quantizationReason` -- Why this method was chosen: `"native_fits"`, `"pre_quantized"`, `"runtime_quantization"`, or `"cannot_fit"`.
Runtime quantization is transparent to the user. The model functions the same way, but with slightly reduced precision and memory footprint.
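The selection logic could be sketched as below. The reason strings come from the documented enumeration, but the size thresholds and the choice of specific methods at each tier are assumptions, not Xerotier's actual algorithm:

```python
def choose_quantization(model: dict, gpu_free_bytes: int):
    """Illustrative runtime quantization selection.

    Returns (applied_quantization, quantization_reason)."""
    size = model["sizeBytes"]
    if model.get("isPreQuantized"):
        # Already quantized at rest: load as-is.
        return model.get("preQuantizationMethod"), "pre_quantized"
    if size <= gpu_free_bytes:
        return None, "native_fits"
    if size / 2 <= gpu_free_bytes:   # ~8-bit roughly halves the footprint
        return "bitsandbytes", "runtime_quantization"
    if size / 4 <= gpu_free_bytes:   # ~4-bit roughly quarters it
        return "bitsandbytes-fp4", "runtime_quantization"
    return None, "cannot_fit"
```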
## MoE Model Support

Mixture-of-Experts (MoE) models are supported. MoE-specific metadata fields include `numExperts` (total experts in the model) and `numExpertsPerTok` (experts activated per token). MoE models can benefit from tuned kernel configurations, configurable via the backend agent's `--enable-moe-config` flag.