granite-embedding-311m-multilingual-r2
Model Summary: Granite-Embedding-311M-Multilingual-R2 is a 311M parameter dense embedding model from the Granite Embeddings collection for high-quality multilingual text embeddings. It produces 768-dimensional vectors with a context length of up to 32,768 tokens. The model supports 200+ languages (based on the multilingual pretraining corpus of the underlying encoder), with enhanced support for 52 languages and programming code that receive explicit retrieval-pair and cross-lingual training. Training data consists of datasets released under permissive, enterprise-friendly licenses, together with IBM-collected and IBM-generated datasets.
> Granite Embedding 311M Multilingual R2 shows strong performance across multilingual information retrieval benchmarks, code retrieval, long-document search, multi-turn conversational retrieval, and reasoning retrieval tasks. The multilingual R2 model scores 64.0 on Multilingual MTEB Retrieval (18 tasks), a +11.8 point improvement over granite-embedding-278m-multilingual (52.2), and averages 56.0 across all retrieval benchmarks, a +14.2 point gain over the previous generation. It supports Matryoshka dimension reduction and a 32k-token context, and ships with ONNX and OpenVINO models for production deployment.
The model uses a bi-encoder architecture to generate high-quality embeddings from text inputs such as queries, passages, code, and documents, enabling seamless comparison through cosine similarity. Built using contrastive fine-tuning, knowledge distillation, and model merging, the Granite Embedding 311M Multilingual R2 model is optimized to ensure strong alignment between query and passage embeddings across many languages.
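The Granite Embedding Multilingual R2 release consists of two multilingual embedding models, both based on the ModernBERT architecture: granite-embedding-97m-multilingual-r2 (97M parameters, 384-dimensional embeddings) and granite-embedding-311m-multilingual-r2 (311M parameters, 768-dimensional embeddings).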
The underlying encoder was pretrained on text from 200+ languages, and the model can produce general-purpose embeddings for any of them. In addition, we provide enhanced support for 52 languages and programming code, which receive explicit retrieval-pair and cross-lingual training data and therefore yield higher-quality embeddings on retrieval tasks.
<details>
<summary>Click to expand the list of 52 enhanced-support languages</summary>
Albanian (sq), Arabic (ar), Azerbaijani (az), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Khmer (km), Korean (ko), Latvian (lv), Lithuanian (lt), Malay (ms), Marathi (mr), Norwegian (no), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Tagalog (tl), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi).
Additionally, the models are trained on programming code (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, C++) and support cross-lingual code retrieval.
</details>
Intended Use: The model is designed to produce fixed-length vector representations for a given text, which can be used for text similarity, retrieval, and search applications across multiple languages.
For efficient inference, these models support Flash Attention 2. Installing it is optional but can lead to faster encoding:
pip install flash_attn
Usage with Sentence Transformers:
The model is compatible with the Sentence Transformers library.
First, install the Sentence Transformers library:
pip install sentence_transformers
The model can then be used to encode pairs of text and find the similarity between their representations
from sentence_transformers import SentenceTransformer, util
model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"
# Load the Sentence Transformer model
model = SentenceTransformer(model_path)
input_queries = [
    'What is the tallest mountain in Japan?',  # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?',  # German query
    'ドイツの首都はどこですか?',  # Japanese query
]
input_passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese passage
    "Achy Breaky Heart is a country song written by Don Von Tress. Originally titled Don't Tell My Heart and performed by The Marcy Brothers in 1991.",  # English passage
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland. Die Stadt ist mit rund 3,7 Millionen Einwohnern die bevölkerungsreichste Kommune Deutschlands.",  # German passage
]
# Cross-lingual retrieval: each query should score highest with its matching passage in a different language
query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)
# calculate cosine similarity — expect high scores on the diagonal (EN→JA, DE→EN, JA→DE)
print(util.cos_sim(query_embeddings, passage_embeddings))
# output: tensor([[0.9393, 0.6899, 0.7627],
# [0.6780, 0.9598, 0.7062],
# [0.7818, 0.7342, 0.9172]])
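To turn a similarity matrix like this into retrieval results, one option is to pick the highest-scoring passage for each query. The sketch below reuses the scores printed above as plain Python lists so that it runs without loading the model; `top_match` is a hypothetical helper introduced here for illustration, not part of Sentence Transformers.

```python
# Similarity scores copied from the example output above
# (rows = queries, columns = passages).
scores = [
    [0.9393, 0.6899, 0.7627],
    [0.6780, 0.9598, 0.7062],
    [0.7818, 0.7342, 0.9172],
]


def top_match(row):
    """Return (index, score) of the best-scoring passage for one query."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return best_index, row[best_index]


best = [top_match(row) for row in scores]
for query_index, (passage_index, score) in enumerate(best):
    print(f"query {query_index} -> passage {passage_index} (score {score:.4f})")
```

With these scores, every query selects the passage at its own index, which is the cross-lingual pairing the example is built around.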
Matryoshka Representation Learning:
This model supports Matryoshka Representation Learning (MRL), which allows you to truncate embeddings to smaller dimensions (e.g., 512, 384, 256, 128) with graceful performance degradation. This is useful for reducing storage and memory requirements.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")
# Full 768-dimensional embeddings
full_embeddings = model.encode(["example text"])
print(full_embeddings.shape) # (1, 768)
# Truncated to 256 dimensions — still effective for many retrieval tasks
truncated_embeddings = model.encode(["example text"], truncate_dim=256)
print(truncated_embeddings.shape) # (1, 256)
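If full-size embeddings are already stored, they can also be shortened after the fact. The following is a minimal sketch in plain Python, assuming that Matryoshka truncation keeps the leading components of the vector and that the shortened vector should be rescaled to unit length before it is compared with dot products. The 8-dimensional toy vector and the `truncate_and_normalize` helper are made up for illustration.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid division by zero for an all-zero vector
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a 768-dimensional embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, -0.1, 0.1]
small = truncate_and_normalize(full, 4)
print(len(small))  # 4
```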
Usage with Hugging Face Transformers:
This is a simple example of how to use the granite-embedding-311m-multilingual-r2 model with the Transformers library and PyTorch. For a complete retrieval workflow including passage encoding and cosine similarity, see the Sentence Transformers example above.
First, install the required libraries
pip install transformers torch
The model can then be used to encode text
import torch
from transformers import AutoModel, AutoTokenizer
model_path = "ibm-granite/granite-embedding-311m-multilingual-r2"
# Load the model and tokenizer
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()
input_queries = [
    'What is the tallest mountain in Japan?',  # English query
    'Wer hat das Lied Achy Breaky Heart geschrieben?',  # German query
    'ドイツの首都はどこですか?',  # Japanese query
]
# tokenize inputs
tokenized_queries = tokenizer(input_queries, padding=True, truncation=True, return_tensors='pt')
# encode queries
with torch.no_grad():
    model_output = model(**tokenized_queries)
# Perform pooling. granite-embedding-311m-multilingual-r2 uses CLS Pooling
query_embeddings = model_output[0][:, 0]
# normalize the embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, dim=1)
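The normalization step is what lets a plain dot product serve as cosine similarity: once both vectors have unit length, the two quantities coincide. The framework-free sketch below checks this on toy vectors, which are made up for illustration and stand in for a query and a passage embedding. With PyTorch tensors, the analogous computation would presumably be a matrix product between the normalized query and passage embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from un-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query and a passage embedding.
query = [3.0, 4.0, 0.0]
passage = [4.0, 3.0, 0.0]

score_from_unit_vectors = dot(l2_normalize(query), l2_normalize(passage))
print(round(score_from_unit_vectors, 4))  # 0.96
```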
ONNX and OpenVINO:
Pre-converted ONNX and OpenVINO models are released alongside the PyTorch weights for production deployment. These can be loaded directly via the backend parameter in Sentence Transformers:
from sentence_transformers import SentenceTransformer
# ONNX backend
model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2", backend="onnx")
embeddings = model.encode(["example text"])
# OpenVINO backend
model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2", backend="openvino")
embeddings = model.encode(["example text"])
# OpenVINO INT8 quantized backend (smaller & faster on CPU)
model = SentenceTransformer(
    "ibm-granite/granite-embedding-311m-multilingual-r2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)
embeddings = model.encode(["example text"])
The ONNX model is compatible with any ONNX Runtime backend (CPU, CUDA, TensorRT, DirectML). The OpenVINO model is optimized for Intel hardware including CPUs and integrated GPUs.
vLLM:
The model can be served as an embedding endpoint using vLLM:
vllm serve ibm-granite/granite-embedding-311m-multilingual-r2 --task embed
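Once the server is running, a client can request embeddings over HTTP. The sketch below uses only the Python standard library and rests on two assumptions that are not stated in this card: that the server listens on `http://localhost:8000`, and that it exposes an OpenAI-style `/v1/embeddings` route whose response holds a `data` list of `{"index", "embedding"}` items. The helper names are hypothetical; adjust the URL and payload to match your deployment.

```python
import json
import urllib.request

MODEL = "ibm-granite/granite-embedding-311m-multilingual-r2"


def build_request(texts, base_url="http://localhost:8000"):
    """Build a POST request for the (assumed) OpenAI-style embeddings route."""
    payload = json.dumps({"model": MODEL, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_body):
    """Extract vectors from an OpenAI-style response, ordered by index."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, base_url="http://localhost:8000"):
    """Send texts to a running server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts, base_url)) as response:
        return parse_embeddings(json.load(response))
```

Calling `embed(["example text"])` against a live server should return a list containing one vector; nothing is sent until `embed` is called.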
llama.cpp (GGUF):
The model can be converted to GGUF format for use with llama.cpp:
# Convert to GGUF
python convert_hf_to_gguf.py ibm-granite/granite-embedding-311m-multilingual-r2 \
--outfile granite-embedding-311m-multilingual-r2.gguf
# Generate embeddings
llama-embedding -m granite-embedding-311m-multilingual-r2.gguf -p "example text"
Note: Ollama does not currently support ModernBERT-based models.
Granite-Embedding-311M-Multilingual-R2 ranks in the top three among multilingual embedding models under 500M parameters, averaged across retrieval, code search,
long-document, and reasoning benchmarks, with a +14.2 point average gain over the previous-generation
Granite-Embedding-278M-Multilingual.
Performance on Multilingual MTEB Retrieval, MTEB English Retrieval, MTEB Code Retrieval, long-document search (LongEmbed), and Reasoning as Retrieval (RaR-b) benchmarks. Scores are averages across tasks; higher is better. Throughput (documents per second) is measured on a single NVIDIA H100 GPU with batches of 1024 sequences at 512 tokens.
Granite-Embedding-311M-Multilingual-R2 scores 64.0 on MTEB Multilingual Retrieval, a +11.8 point improvement over its R1 predecessor, while encoding nearly 2,000 documents per second (1,944 docs/s versus 2,185 docs/s for granite-embedding-278m-multilingual).
| Model | Parameters (M) | Embedding Size | MTEB ML Retrieval (18) | MTEB Retrieval (eng, v2) (10) | MTEB (Code, v1) (12) | LongEmbed (6) | RaR-b (17) | AVG | Throughput (docs/s) |
|---|---|---|---|---|---|---|---|---|---|
| granite-embedding-107m-multilingual | 107 | 384 | 48.1 | 47.9 | 40.7 | 34.3 | 17.1 | 37.6 | 3,337 |
| granite-embedding-278m-multilingual | 278 | 768 | 52.2 | 51.5 | 48.5 | 37.7 | 18.9 | 41.8 | 2,185 |
| granite-embedding-97m-multilingual-r2 | 97 | 384 | 59.6 | 50.1 | 60.5 | 65.5 | 24.9 | 52.1 | 2,894 |
| granite-embedding-311m-multilingual-r2 | 311 | 768 | 64.0 | 52.6 | 63.9 | 71.7 | 28.0 | 56.0 | 1,944 |
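The AVG column appears to be the unweighted mean of the five benchmark columns, and the quoted +14.2 gain matches the difference between the rounded averages. The short check below reproduces both numbers from values copied out of the table; it is a reading of the table, not a description of the official evaluation code.

```python
# Per-benchmark scores copied from the table above, in column order:
# MTEB ML Retrieval, MTEB Retrieval (eng, v2), MTEB (Code, v1), LongEmbed, RaR-b
scores = {
    "granite-embedding-278m-multilingual": [52.2, 51.5, 48.5, 37.7, 18.9],
    "granite-embedding-311m-multilingual-r2": [64.0, 52.6, 63.9, 71.7, 28.0],
}

# Unweighted mean of the five columns, rounded to one decimal place.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Gain computed from the rounded averages, as they appear in the table.
gain = round(
    averages["granite-embedding-311m-multilingual-r2"]
    - averages["granite-embedding-278m-multilingual"],
    1,
)
print(averages, gain)
```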
This model supports Matryoshka Embeddings, which allow for reduced embedding dimensions with only a small reduction in performance:
| Model | Embedding Size | MTEB (eng, v2) | MTEB (Code, v1) | ML MTEB Retrieval |
|---|---|---|---|---|
| granite-embedding-311m-multilingual-r2 | 768 | 52.6 | 63.9 | 63.9 |
| | 512 | 52.5 | 63.8 | 63.9 |
| | 384 | 52.1 | 63.7 | 63.8 |
| | 256 | 51.6 | 63.4 | 63.5 |
| | 128 | 50.4 | 62.3 | 62.5 |
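To make the trade-off concrete, the sketch below computes, for each embedding size, the percentage of the 768-dimensional score that is retained on each benchmark, alongside the storage needed per vector. The scores are copied from the table above; the 4-bytes-per-component figure assumes vectors are stored as float32, which is an assumption about the deployment rather than something this card specifies.

```python
# Scores copied from the Matryoshka table above, keyed by embedding size:
# (MTEB eng v2, MTEB Code v1, ML MTEB Retrieval)
matryoshka_scores = {
    768: (52.6, 63.9, 63.9),
    512: (52.5, 63.8, 63.9),
    384: (52.1, 63.7, 63.8),
    256: (51.6, 63.4, 63.5),
    128: (50.4, 62.3, 62.5),
}

full = matryoshka_scores[768]
summary = {}
for dim, row in matryoshka_scores.items():
    # Percentage of the full-size score kept on each benchmark.
    retained = tuple(round(100 * s / f, 1) for s, f in zip(row, full))
    # Storage per vector, assuming 4-byte float32 components.
    summary[dim] = {"retained_pct": retained, "bytes_float32": dim * 4}

for dim, info in summary.items():
    print(dim, info)
```

At 128 dimensions, for example, each vector takes one sixth of the storage while retaining roughly 96 to 98 percent of the full-size scores on these three benchmarks.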
The granite-embedding-311m-multilingual-r2 model is based on the ModernBERT architecture with an expanded multilingual vocabulary:
| Feature | granite-embedding-311m-multilingual-r2 |
|---|---|
| Embedding size | 768 |
| Number of layers | 22 |
| Number of attention heads | 12 |
| Intermediate size | 1152 |
| Activation Function | GeGLU |
| Vocabulary Size | 262,152 |
| Max. Sequence Length | 32,768 |
| Matryoshka Dimensions | 768, 512, 384, 256, 128 |
| # Parameters | ~311M |
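A few derived quantities follow from the table by simple arithmetic. The sketch below computes the per-head width and a rough estimate of how much of the parameter budget the token embeddings occupy, assuming a single vocabulary-by-hidden embedding matrix; that layout is an assumption, and the exact breakdown is not given in this card.

```python
# Values copied from the architecture table above.
hidden_size = 768
num_attention_heads = 12
vocab_size = 262_152
total_params_approx = 311_000_000  # the table gives "~311M"

# Width of each attention head.
head_dim = hidden_size // num_attention_heads

# Size of one vocabulary-by-hidden token embedding matrix (assumed layout).
embedding_params = vocab_size * hidden_size

# Rough share of the parameter budget spent on token embeddings.
embedding_share = embedding_params / total_params_approx

print(head_dim, embedding_params, round(embedding_share, 2))
```

Under that assumption, roughly two thirds of the parameters sit in the vocabulary embeddings, which reflects the large multilingual vocabulary rather than a deeper or wider encoder.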
The Granite Embedding Multilingual R2 model incorporates key enhancements from the ModernBERT architecture, including:
The model was trained using knowledge distillation with multiple teacher models, contrastive fine-tuning, and Matryoshka Representation Learning.
All training data is sourced under permissive, commercial-friendly licenses, making Granite Embedding R2 suitable for unrestricted
enterprise deployment.
Training data comes from four key sources:
For governance, all our data undergoes a data clearance process subject to technical, business, and governance review. This comprehensive process captures critical information about the data, including but not limited to their content description, ownership, intended use, data classification, licensing information, usage restrictions, how the data will be acquired, as well as an assessment of sensitive information (e.g., personal information).
The multilingual tokenizer used by this model is derived from the Gemma 3 tokenizer by Google. The original Gemma 3 tokenizer vocabulary was used as a starting point and further trained on multilingual text and code data to produce the 262K-token vocabulary used in this model. Use of the Gemma tokenizer is subject to the Gemma Terms of Use. The Gemma model family and associated resources are described at ai.google.dev/gemma.
We trained the Granite Embedding Multilingual R2 model using IBM's computing cluster, BlueVela Cluster, which is outfitted with NVIDIA H100 80GB GPUs. This cluster provides a scalable and efficient infrastructure for training our models over multiple GPUs.
Granite Embedding 311M Multilingual R2 leverages both permissively licensed open-source and select proprietary data for enhanced performance. The training data for the base language model was filtered to remove text containing hate, abuse, and profanity, though the effectiveness of such filtering may vary across language families.
Performance varies across languages: higher-resource languages and those in the 52-language enhanced-support set generally achieve better results, while low-resource languages rely on cross-lingual transfer from the pretraining stage and may exhibit lower retrieval quality. Synthetic training data, while effective for improving multilingual coverage, may introduce distributional biases not present in naturally occurring text. Longer texts will be truncated to the 32,768-token context limit.
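For inputs that exceed the context limit, one common workaround is to split the token sequence into windows and embed each window separately. The sketch below shows only the splitting step on a plain list of token ids; `chunk_token_ids` and its `overlap` parameter are hypothetical and not part of this model's tooling. In practice the window size should also leave room for any special tokens the tokenizer adds.

```python
def chunk_token_ids(token_ids, max_len=32_768, overlap=0):
    """Split a token-id sequence into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears in both neighbours.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


# Toy example: 10 "tokens", windows of 4 with 1 token of overlap.
print(chunk_token_ids(list(range(10)), max_len=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

How the per-window embeddings are combined afterwards (for example, by keeping the best-scoring window per document) is an application-level choice that this card does not prescribe.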
@misc{granite-embedding-311m-multilingual-r2,
title={Granite Embedding Multilingual R2 Models},
author={IBM Granite Embedding Team},
year={2026},
}