---
license: mit
license_link: https://huggingface.co/microsoft/phi-4/resolve/main/LICENSE
language:
- en
pipeline_tag: text-generation
tags:
---
| | |
|-------------------------|-------------------------------------------------------------------------------|
| Developers | Microsoft Research |
| Description | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.<br><br>`phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. |
| Architecture | 14B parameters, dense decoder-only Transformer model |
| Inputs | Text, best suited for prompts in the chat format |
| Context length | 16K tokens |
| GPUs | 1920 H100-80G |
| Training time | 21 days |
| Training data | 9.8T tokens |
| Outputs | Generated text in response to input |
| Dates | October 2024 – November 2024 |
| Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data |
| Release date | December 12, 2024 |
| License | MIT |
| | |
|-------------------------------|-------------------------------------------------------------------------|
| Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It can be used in general purpose AI systems and applications (primarily in English) that require:<br><br>1. Memory/compute constrained environments.<br>2. Latency bound scenarios.<br>3. Reasoning and logic. |
| Out-of-Scope Use Cases | Our model is not specifically designed or evaluated for all downstream purposes, thus:<br><br>1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.<br>2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English.<br>3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |
Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources, including filtered publicly available documents, newly created synthetic datasets, and acquired academic books and Q&A datasets.

Multilingual data constitutes about 8% of our overall data. We focus on data quality that could potentially improve the reasoning ability of the model, and we filter the publicly available documents to contain the correct level of knowledge.
We evaluated `phi-4` using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities; the results are summarized in the benchmark table below.

`phi-4` has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories.

Prior to release, `phi-4` followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by `phi-4` in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training, including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks.

Please refer to the technical report for more details on safety alignment.
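The iterative DPO step mentioned above optimizes a preference objective over chosen/rejected response pairs. The following is a minimal, generic sketch of the per-example DPO loss on log-probabilities, not Microsoft's actual training code; the β value of 0.1 is an illustrative assumption.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin).

    All arguments are total log-probabilities of a response under the
    policy (pi_*) or the frozen reference model (ref_*). The margin is
    how much more the policy prefers the chosen response, relative to
    the reference model's preference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    logits = beta * margin
    # -log sigmoid(x) = log(1 + exp(-x))
    return math.log(1.0 + math.exp(-logits))
```

The loss shrinks as the policy assigns relatively higher probability to the chosen response than the reference model does, and grows when the preference is reversed.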
To understand the capabilities, we compare `phi-4` with a set of models over OpenAI’s SimpleEval benchmark. The table below gives a high-level overview of model quality on representative benchmarks; higher numbers indicate better performance:
| Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o |
|---|---|---|---|---|---|---|---|---|
| Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| Math | MGSM<br>MATH | 80.6<br>80.4 | 53.5<br>44.6 | 79.6<br>75.6 | 86.5<br>73.0 | 89.1<br>66.3\* | 87.3<br>80.0 | 90.4<br>74.6 |
| Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9\* | 80.4 | 90.6 |
| Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
\* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
Given the nature of the training data, `phi-4` is best suited for prompts using the chat format, for example (using the model’s ChatML-style special tokens):

```
<|im_start|>system<|im_sep|>
You are a helpful assistant.<|im_end|>
<|im_start|>user<|im_sep|>
How should I explain the Internet?<|im_end|>
<|im_start|>assistant<|im_sep|>
```

Like other language models, `phi-4` can potentially behave in ways that are unfair, unreliable, or offensive. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include accuracy, safety, and fairness in the specific downstream use case, particularly for high-risk scenarios.
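For illustration, the chat format can be assembled programmatically. The helper below is hypothetical (not part of the model’s tooling) and assumes the ChatML-style special tokens `<|im_start|>`, `<|im_sep|>`, and `<|im_end|>`; in practice, the tokenizer’s `apply_chat_template` method in `transformers` should be preferred, since it applies the template shipped with the model.

```python
def format_phi4_chat(messages):
    """Render a list of {"role", "content"} messages into a single
    prompt string, ending with an open assistant turn so the model
    generates the reply. Assumed template; prefer apply_chat_template.
    """
    parts = []
    for msg in messages:
        parts.append(
            f"<|im_start|>{msg['role']}<|im_sep|>{msg['content']}<|im_end|>"
        )
    # Leave the assistant turn open for generation.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)
```

A prompt built this way can then be tokenized and passed to the model for completion-style generation.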