// model catalog
Validated Models.
No fixed catalog. Operators upload or register the model; sharing is per-project. Every model is served by vLLM behind an OpenAI-compatible /v1.
- models
- 27
- families
- 14
- api
- /v1, OpenAI-compatible
Xerotier has no fixed model catalog. The models below were uploaded or registered by their project owners and shared publicly. All run on vLLM behind an OpenAI-compatible API.
Gemma 3 model card Model Page: Gemma Resources and Technical Documentation: [Gemma 3 Technical Report][g3-tech-report] [Responsible Generative AI Toolkit][rai-toolkit] [Gemma on Kaggle][kaggle-gemma] [Gemma on Vertex Model Garden][vertex-mg-gemma3] Terms of Use: [Terms][terms] Authors: Google DeepMind Model Information Summary description and brief definition of inputs and outputs. Description Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140...
Parameters
4.3B
Context
15K
License
Unknown
Architecture
gemma3
Gemma 3 model card Model Page: Gemma Resources and Technical Documentation: [Gemma 3 Technical Report][g3-tech-report] [Responsible Generative AI Toolkit][rai-toolkit] [Gemma on Kaggle][kaggle-gemma] [Gemma on Vertex Model Garden][vertex-mg-gemma3] Terms of Use: [Terms][terms] Authors: Google DeepMind Model Information Summary description and brief definition of inputs and outputs. Description Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140...
Parameters
12.2B
Context
32K
License
Unknown
Architecture
gemma3
Gemma 3 model card Model Page: Gemma Resources and Technical Documentation: [Gemma 3 Technical Report][g3-tech-report] [Responsible Generative AI Toolkit][rai-toolkit] [Gemma on Kaggle][kaggle-gemma] [Gemma on Vertex Model Garden][vertex-mg-gemma3] Terms of Use: [Terms][terms] Authors: Google DeepMind Model Information Summary description and brief definition of inputs and outputs. Description Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140...
Parameters
27.4B
Context
Unknown
License
Unknown
Architecture
gemma3
GLM-4.7-Flash π Join our Discord community. π Check out the GLM-4.7 technical blog , technical report(GLM-4.5) . π Use GLM-4.7-Flash API services on Z.ai API Platform. π One click to GLM-4.7 . Introduction GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency. Performances on Benchmarks | Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B | |--------------------|---------------|-----------------------------|-------------| | AIME 25 | 91.6 | 85.0 | 91.7 | | GPQA | 75.2 | 73.4 | 71.5 | | LCB v6 | 64.0 | 66.0 | 61.0 | | HLE | 14.4 | 9.8 | 10.9 | | SWE-bench Verified | 59.2 | 22.0 | 34.0 | | ΟΒ²-Bench | 79.5 | 49.0 | 47.7 | | BrowseComp | 42.8 |...
Parameters
3.8B
Context
202K
License
MIT
Architecture
glm4_moe_lite
mof-class3-qualified Granite-4.1-8B Model Summary: Granite-4.1-8B is a 8B parameter long-context instruct model finetuned from Granite-4.1-8B-Base* using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. Granite 4.1 models have gone through an improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities. Developers: Granite Team, IBM HF Collection: Granite 4.1 Language Models HF Collection Technical Blog: Granite-4.1 Blog GitHub Repository: ibm-granite/granite-4.1-language-models Website: Granite Docs Release Date: April 29th, 2026 License: Apache 2.0 Supported Languages: English, German, Spanish,...
Parameters
11.6B
Context
49K
License
Unknown
Architecture
granite
mof-class3-qualified Granite-4.0-H-Small π£ Update [10-07-2025]: Added a default system prompt* to the chat template to guide the model towards more professional, accurate, and safe* responses. Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base* using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF)* and tool-calling* capabilities, making them more effective in enterprise applications....
Parameters
11.6B
Context
131K
License
Unknown
Architecture
granitemoehybrid
mof-class3-qualified Granite-4.0-H-Tiny π£ Update [10-07-2025]: Added a default system prompt* to the chat template to guide the model towards more professional, accurate, and safe* responses. Model Summary: Granite-4.0-H-Tiny is a 7B parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base* using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF)* and tool-calling* capabilities, making them more effective in enterprise applications. Developers:...
Parameters
1.8B
Context
131K
License
Unknown
Architecture
granitemoehybrid
Try LFM β’ Docs β’ LEAP β’ Discord LFM2-8B-A1B LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency. We're releasing the weights of our first MoE based on LFM2, with 8.3B total parameters and 1.5B active parameters. LFM2-8B-A1B is the best on-device MoE in terms of both quality (comparable to 3-4B dense models) and speed (faster than Qwen3-1.7B). Code and knowledge capabilities are significantly improved compared to LFM2-2.6B. Quantized variants fit comfortably on high-end phones, tablets, and laptops. Find more information about LFM2-8B-A1B in our blog post. π Model details Due to their small size, we recommend fine-tuning LFM2 models on narrow...
Parameters
1.9B
Context
68K
License
LFM Open License v1.0
Architecture
lfm2_moe
Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 Model Overview Model Architecture: Llama4ForConditionalGeneration Input: Text / Image Output: Text Model Optimizations: Activation quantization: None Weight quantization: INT4 Release Date: 04/25/2025 Version: 1.0 Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5 Model Developers: Red Hat (Neural Magic) Model Optimizations This model was obtained by quantizing weights of Llama-4-Scout-17B-16E-Instruct to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%. The llm-compressor library is used for quantization. Deployment This model can be deployed efficiently on...
Parameters
22.2B
Context
10485K
License
Unknown
Architecture
llama4
Ministral 3 14B Reasoning 2512 The largest model in the Ministral 3 family, Ministral 3 14B offers frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. A powerful and efficient language model with vision capabilities. This model is the reasoning post-trained version, trained for reasoning tasks, making it ideal for math, coding and stem related use cases. The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware. Ministral 3 14B can even be deployed locally, capable of fitting in 32GB of VRAM in BF16, and less than 24GB of RAM/VRAM when quantized. Learn more in our blog post and paper. Key Features Ministral 3 14B consists of two main architectural components: 13.5B Language Model 0.4B Vision...
Parameters
18.1B
Context
262K
License
Unknown
Architecture
mistral3
Ministral 3 3B Reasoning 2512 The smallest model in the Ministral 3 family, Ministral 3 3B is a powerful, efficient tiny language model with vision capabilities. This model is the reasoning post-trained version, trained for reasoning tasks, making it ideal for math, coding and stem related use cases. The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware. Ministral 3 3B can even be deployed locally, fitting in 16GB of VRAM in BF16, and less than 8GB of RAM/VRAM when quantized. Learn more in our blog post and paper. Key Features Ministral 3 3B consists of two main architectural components: 3.4B Language Model 0.4B Vision Encoder The Ministral 3 3B Reasoning model offers the following capabilities: Vision: Enables the model to analyze images...
Parameters
4.7B
Context
99K
License
Unknown
Architecture
mistral3
Devstral Small 2 24B Instruct 2512 Devstral is an agentic LLM for software engineering tasks. Devstral Small 2 excels at using tools to explore codebases, editing multiple files and power software engineering agents. The model achieves remarkable performance on SWE-bench. This model is an Instruct model in FP8, fine-tuned to follow instructions, making it ideal for chat, agentic and instruction based tasks for SWE use cases. For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), we invite companies to reach out to us. Key Features The Devstral Small 2 Instruct model offers the following capabilities: Agentic Coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents....
Parameters
18.1B
Context
82K
License
Unknown
Architecture
mistral3
Granite-Embedding-311M-Multilingual-R2 Model Summary: Granite-Embedding-311M-Multilingual-R2 is a 311M parameter dense embedding model from the Granite Embeddings collection for high-quality multilingual text embeddings. It produces 768-dimensional vectors with a context length of up to 32,768 tokens. The model supports 200+ languages (based on the multilingual pretraining corpus of the underlying encoder), with enhanced support for 52 languages and programming code that receive explicit retrieval-pair and cross-lingual training. All training data uses permissive, enterprise-friendly licenses, plus IBM-collected and IBM-generated datasets. Granite Embedding 311M Multilingual R2 shows strong performance across multilingual information retrieval benchmarks, code retrieval, long-document...
Parameters
610M
Context
8K
License
Unknown
Architecture
modernbert
granite-embedding-reranker-english-r2 Model Summary: granite-embedding-reranker-english-r2_ is a 149M parameter dense cross-encoder model from the Granite Embeddings collection that can be used to generate high quality text embeddings. This model produces embedding vectors of size 768 based on context length of upto 8192 tokens. Compared to most other open-source models, this model was only trained using open-source relevance-pair datasets with permissive, enterprise-friendly license, plus IBM collected and generated datasets. The granite-embedding-reranker-english-r2_ model uses a cross-encoder architecture to compute high-quality relevance scores between queries and documents by jointly encoding their text, enabling precise reranking based on contextual alignment. The model is trained...
Parameters
285M
Context
8K
License
Unknown
Architecture
modernbert
Granite-Embedding-English-R2 Model Summary: Granite-embedding-english-r2 is a 149M parameter dense biencoder embedding model from the Granite Embeddings collection that can be used to generate high quality text embeddings. This model produces embedding vectors of size 768 based on context length of upto 8192 tokens. Compared to most other open-source models, this model was only trained using open-source relevance-pair datasets with permissive, enterprise-friendly license, plus IBM collected and generated datasets. The r2 models show strong performance across standard and IBM-built information retrieval benchmarks (BEIR, ClapNQ), code retrieval (COIR), long-document search benchmarks (MLDR, LongEmbed), conversational multi-turn (MTRAG), table retrieval (NQTables, OTT-QA, AIT-QA,...
Parameters
285M
Context
8K
License
Unknown
Architecture
modernbert
nomic-embed-text-v1.5: Resizable Production Embeddings with Matryoshka Representation Learning Blog | Technical Report | AWS SageMaker | Nomic Platform Exciting Update!: nomic-embed-text-v1.5 is now multimodal! nomic-embed-vision-v1.5 is aligned to the embedding space of nomic-embed-text-v1.5, meaning any text embedding is multimodal! Usage Important: the text prompt must* include a task instruction prefix, instructing the model which task is being performed. For example, if you are implementing a RAG application, you embed your documents as searchdocument: and embed your user queries as searchquery: . Notice: From transformers v5.5.0 and sentence transformers v5.3.0, trustremotecode=True will no longer be necessary. This will only be possible with the text-only series as of now. Task...
Parameters
160M
Context
2K
License
Unknown
Architecture
nomic_bert
Try gpt-oss Β· Guides Β· Model card Β· OpenAI blog Welcome to the gpt-oss series, OpenAIβs open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. Weβre releasing two flavors of these open models: gpt-oss-120b β for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters) gpt-oss-20b β for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters) Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise. [!NOTE] This model card is dedicated to the smaller gpt-oss-20b model. Check out gpt-oss-120b for the larger model....
Parameters
4.3B
Context
Unknown
License
Apache-2.0
Architecture
Unknown
Phi-4 Model Card Phi-4 Technical Report Model Summary | | | |-------------------------|-------------------------------------------------------------------------------| | Developers | Microsoft Research | | Description | phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures | | Architecture | 14B parameters, dense decoder-only Transformer...
Parameters
17.8B
Context
16K
License
MIT
Architecture
phi3
πPhi-4: [mini-reasoning | reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx] Model Summary Phi-4-mini-instruct is a lightweight open model built upon synthetic data and filtered publicly available websites - with a focus on high-quality, reasoning dense data. The model belongs to the Phi-4 model family and supports 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning and direct preference optimization to support precise instruction adherence and robust safety measures. π° Phi-4-mini Microsoft Blog π Phi-4-mini Technical Report π©βπ³ Phi Cookbook π‘ Phi Portal π₯οΈ Try It Azure, Huggingface π Model paper Intended Uses Primary Use Cases The model is intended for broad multilingual commercial and research use. The model...
Parameters
6.1B
Context
131K
License
MIT
Architecture
phi3
Qwen3-0.6B Qwen3 Highlights Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code...
Parameters
781M
Context
40K
License
Apache-2.0
Architecture
qwen3
Qwen3-14B-AWQ Qwen3 Highlights Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code...
Parameters
18.3B
Context
40K
License
Apache-2.0
Architecture
qwen3
Qwen3-8B Qwen3 Highlights Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code...
Parameters
10.9B
Context
40K
License
Apache-2.0
Architecture
qwen3
Qwen3.5-0.8B Qwen Chat [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc. In light of its parameter scale, the intended use cases are prototyping, task-specific fine-tuning, and other research or development purposes. Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency....
Parameters
911M
Context
23K
License
Apache-2.0
Architecture
qwen3_5
Qwen3.6-27B Qwen Chat [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc. Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience. Qwen3.6 Highlights This release delivers substantial upgrades, particularly in Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision. Thinking Preservation:...
Parameters
29.4B
Context
87K
License
Apache-2.0
Architecture
qwen3_5
Qwen3.5-27B-FP8 Qwen Chat [!Note] This repository contains FP8-quantized model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original model. Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with...
Parameters
29.4B
Context
262K
License
Apache-2.0
Architecture
qwen3_5
Qwen3.5-35B-A3B Qwen Chat [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc. [!Tip] For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by Alibaba Cloud Model Studio. In particular, Qwen3.5-Flash is the hosted version corresponding to Qwen3.5-35B-A3B with more production features, e.g., 1M context length by default and official built-in tools. For more information, please refer to the User Guide. Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5...
Parameters
3.7B
Context
138K
License
Apache-2.0
Architecture
qwen3_5_moe
Qwen3.6-35B-A3B Qwen Chat [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc. Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience. Qwen3.6 Highlights This release delivers substantial upgrades, particularly in Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision. Thinking...
Parameters
3.7B
Context
262K
License
Unknown
Architecture
qwen3_5_moe