MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone

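MiniCPM-V is a series of efficient end-side multimodal large language models (MLLMs) that accept images, videos and text as input and produce high-quality text output. MiniCPM-o additionally takes audio as input and provides high-quality speech output in an end-to-end fashion. Since February 2024, we have released 7 versions of the model, aiming for strong performance and efficient deployment. The most notable models in the series currently include:

MiniCPM-V 4.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With 8B parameters, this model outperforms GPT-4o-latest, Gemini-2.0 Pro and Qwen2.5-VL 72B in vision-language capabilities, making it the best-performing on-device multimodal model in the open-source community. This version brings new features including efficient high-FPS and long video understanding (up to 96x compression rate for video tokens), controllable hybrid fast/deep thinking, strong handwritten OCR, and complex table/document parsing. It also advances MiniCPM-V's popular features such as trustworthy behavior, multilingual support and end-side deployability.
MiniCPM-o 2.6: ⭐️⭐️⭐️ The most capable model in the MiniCPM-o series. With 8B parameters, this end-to-end model achieves performance comparable to GPT-4o-202405 in vision, speech and multimodal live streaming, making it one of the most versatile and performant models in the open-source community. In the new voice mode, MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices, and also allows for fun capabilities such as emotion/speed/style control, end-to-end voice cloning and role play. Due to its superior token density, MiniCPM-o 2.6 can for the first time support multimodal live streaming on end-side devices such as iPad.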

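News

📌 Pinned

[2025.09.18] 📢📢📢 The MiniCPM-V 4.5 technical report is now released! See here.
[2025.09.01] ⭐️⭐️⭐️ MiniCPM-V 4.5 is now officially supported by llama.cpp, vLLM and LLaMA-Factory. You are welcome to use it directly through these official channels! Support for more frameworks such as Ollama and SGLang is under active development.
[2025.08.26] 🔥🔥🔥 We open-sourced MiniCPM-V 4.5, which outperforms GPT-4o-latest, Gemini-2.0 Pro and Qwen2.5-VL 72B. It advances popular capabilities of MiniCPM-V and brings useful new features. Try it now!
[2025.08.01] ⭐️⭐️⭐️ We open-sourced the MiniCPM-V & o Cookbook! It provides comprehensive guides for diverse user scenarios, paired with our new documentation website for a smoother onboarding experience.
[2025.06.20] ⭐️⭐️⭐️ Our official Ollama repository is released. Try our latest models with one click!
[2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, was accepted as a CVPR 2025 Highlight! The code, dataset and paper are all open-sourced!
[2025.01.24] 📢📢📢 The MiniCPM-o 2.6 technical report is released! See here.
[2025.01.19] 📢 ATTENTION! We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama and vLLM. Until the merge is complete, please use our local forks of llama.cpp, Ollama and vLLM. Using the official repositories before the merge may lead to unexpected issues.
[2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches the top 2 on Hugging Face Trending!
[2025.01.17] We have updated the usage of the MiniCPM-o 2.6 int4 quantized version and resolved the model initialization error. Click here to try it now!
[2025.01.13] 🔥🔥🔥 We open-sourced MiniCPM-o 2.6, which matches GPT-4o-202405 on vision, speech and multimodal live streaming. It advances popular capabilities of MiniCPM-V 2.6 and supports various new features. Try it now!
[2024.08.17] 🚀🚀🚀 MiniCPM-V 2.6 is now fully supported by official llama.cpp! GGUF models of various sizes are available here.
[2024.08.06] 🔥🔥🔥 We open-sourced MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5 and supports real-time video understanding on iPad. Try it now!
[2024.08.03] The MiniCPM-Llama3-V 2.5 technical report is released! See here.
[2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio's official account, is available here. Come and try it out!

Click to view more news.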

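Contents

MiniCPM-V 4.5
- Inference Efficiency
MiniCPM-o 2.6
MiniCPM-V & o Cookbook
Chat with Our Demo on Gradio 🤗
Inference
- Model Zoo
- Multi-turn Conversation
  - Chat with Multiple Images
  - In-context Few-shot Learning
  - Chat with Video
  - Speech and Audio Mode
  - Multimodal Live Streaming
- Inference on Multiple GPUs
- Inference on Mac
- Efficient Inference with llama.cpp, Ollama, vLLM
Fine-tuning
Awesome Work Using MiniCPM-V & MiniCPM-o
FAQs
Limitations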

MiniCPM-V 4.5

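MiniCPM-V 4.5 is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:

🔥 State-of-the-art Vision-Language Capability. MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-latest and Gemini-2.0 Pro, as well as strong open-source models such as Qwen2.5-VL 72B, in vision-language capabilities, making it the most performant MLLM under 30B parameters.
🎬 Efficient High-FPS and Long Video Understanding. Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve a 96x compression rate for video tokens: six 448×448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means the model can perceive significantly more video frames without increasing the LLM inference cost. This efficiently brings state-of-the-art high-FPS (up to 10 FPS) video understanding and long video understanding capabilities on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc.
⚙️ Controllable Hybrid Fast/Deep Thinking. MiniCPM-V 4.5 supports both fast thinking, for efficient frequent usage with competitive performance, and deep thinking, for more complex problem solving. To cover efficiency and performance trade-offs in different user scenarios, this fast/deep thinking mode can be switched in a highly controlled fashion.
💪 Strong OCR, Document Parsing and Others. Based on the LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344×1344), using 4x fewer visual tokens than most MLLMs. The model achieves leading performance on OCRBench, surpassing proprietary models such as GPT-4o-latest and Gemini 2.5. It also achieves state-of-the-art performance for PDF document parsing on OmniDocBench among general MLLMs. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o-latest on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
💫 Easy Usage. MiniCPM-V 4.5 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4, GGUF and AWQ format quantized models in 16 sizes, (3) SGLang and vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with Transformers and LLaMA-Factory, (5) quick local WebUI demo, (6) optimized local iOS app on iPhone and iPad, and (7) online web demo on server. See our Cookbook for full usage!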

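Key Techniques

Architecture: Unified 3D-Resampler for High-density Video Compression. MiniCPM-V 4.5 introduces a 3D-Resampler that overcomes the performance-efficiency trade-off in video understanding. By grouping and jointly compressing up to 6 consecutive video frames into just 64 tokens (the same token count used for a single image in the MiniCPM-V series), MiniCPM-V 4.5 achieves a 96× compression rate for video tokens. This allows the model to process more video frames without additional LLM computational cost, enabling high-FPS video and long video understanding. The architecture supports unified encoding for images, multi-image inputs and videos, ensuring seamless capability and knowledge transfer.
Pre-training: Unified Learning for OCR and Knowledge from Documents. Existing MLLMs learn OCR capability and knowledge from documents through isolated training approaches. We observe that the essential difference between these two training approaches is the visibility of the text in images. By dynamically corrupting text regions in documents with varying noise levels and asking the model to reconstruct the text, the model learns to adaptively and properly switch between accurate text recognition (when text is visible) and multimodal context-based knowledge reasoning (when text is heavily obscured). This eliminates reliance on error-prone document parsers in knowledge learning from documents, and prevents hallucinations from over-augmented OCR data, resulting in top-tier OCR and multimodal knowledge performance with minimal engineering overhead.
Post-training: Hybrid Fast/Deep Thinking with Multimodal RL. MiniCPM-V 4.5 offers a balanced reasoning experience through two switchable modes: fast thinking for efficient daily use and deep thinking for complex tasks. Using a new hybrid reinforcement learning method, the model jointly optimizes both modes, significantly enhancing fast-mode performance without compromising deep-mode capability. Combined with RLPR and RLAIF-V, it generalizes robust reasoning skills from broad multimodal data while effectively reducing hallucinations.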
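The 96× figure quoted for the 3D-Resampler can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes the vision encoder splits each frame into 14×14-pixel patches (an assumption, not stated in this README), and it derives the 256-tokens-per-frame baseline from the "1,536 tokens for 6 frames" figure quoted earlier. The function name and parameters are hypothetical.

```python
def video_token_compression(frames=6, frame_size=448, patch_size=14,
                            tokens_out=64, baseline_tokens_per_frame=256):
    """Return (input patches, ratio vs. raw patches, ratio vs. typical MLLM tokens).

    patch_size=14 is an assumed encoder patch size; baseline_tokens_per_frame
    comes from 1,536 tokens / 6 frames as quoted in the text.
    """
    patches_per_frame = (frame_size // patch_size) ** 2    # 32 * 32 = 1024
    patches_in = frames * patches_per_frame                 # 6 * 1024 = 6144
    baseline_tokens = frames * baseline_tokens_per_frame    # 6 * 256 = 1536
    return patches_in, patches_in / tokens_out, baseline_tokens / tokens_out


patches, vs_patches, vs_baseline = video_token_compression()
print(patches, vs_patches, vs_baseline)  # 6144 96.0 24.0
```

Under these assumptions, 64 output tokens stand in for 6,144 raw patches (96×), which is also 24× fewer tokens than the 1,536-token baseline mentioned for most MLLMs.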

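Evaluation

Inference Efficiency

OpenCompass

Model	Size	Avg Score ↑	Total Inference Time ↓
GLM-4.1V-9B-Thinking	10.3B	76.6	17.5h
MiMo-VL-7B-RL	8.3B	76.4	11h
MiniCPM-V 4.5	8.7B	77.0	7.5h

Video-MME

Model	Size	Avg Score ↑	Total Inference Time ↓	GPU Mem ↓
Qwen2.5-VL-7B-Instruct	8.3B	71.6	3h	60G
GLM-4.1V-9B-Thinking	10.3B	73.6	2.63h	32G
MiniCPM-V 4.5	8.7B	73.5	0.26h	28G

Both Video-MME and OpenCompass were evaluated using 8×A100 GPUs for inference. The reported inference time of Video-MME includes full model-side computation and excludes the external cost of video frame extraction (which depends on the specific frame extraction tool) for fair comparison.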
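To make the Video-MME timing comparison concrete, the short sketch below computes how many times longer each baseline takes than MiniCPM-V 4.5, using only the hours reported in the table above. The helper name is hypothetical.

```python
# Total inference time (hours) on Video-MME, copied from the table above.
video_mme_hours = {
    "Qwen2.5-VL-7B-Instruct": 3.0,
    "GLM-4.1V-9B-Thinking": 2.63,
    "MiniCPM-V 4.5": 0.26,
}


def slowdown_relative_to(hours, target):
    """Return each other model's inference time divided by the target's time."""
    return {name: round(t / hours[target], 1)
            for name, t in hours.items() if name != target}


print(slowdown_relative_to(video_mme_hours, "MiniCPM-V 4.5"))
# {'Qwen2.5-VL-7B-Instruct': 11.5, 'GLM-4.1V-9B-Thinking': 10.1}
```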

Examples

Click to view more cases.

We deploy MiniCPM-V 4.5 on an iPad M4 with the iOS demo. The demo video is the raw screen recording without editing.

MiniCPM-o 2.6

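MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

🔥 Leading Visual Capability. MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding. It also outperforms GPT-4V and Claude 3.5 Sonnet in multi-image and video understanding, and shows promising in-context learning capability.
🎙 State-of-the-art Speech Capability. MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It outperforms GPT-4o-realtime on audio understanding tasks such as ASR and STT translation, and shows state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
🎬 Strong Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries, and supports real-time speech interaction. It outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding.
💪 Strong OCR Capability and Others. Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344×1344). It achieves state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
🚀 Superior Efficiency. In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support multimodal live streaming on end-side devices such as iPad.
💫 Easy Usage. MiniCPM-o 2.6 can be easily used in various ways: (1) llama.cpp support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with LLaMA-Factory, (5) quick local WebUI demo, and (6) online web demo on server.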
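The token density claim above (pixels encoded per visual token) can be checked with a small calculation. This is a minimal sketch using only the numbers quoted in the feature list (a 1344×1344 image, 640 tokens, "75% fewer than most models"); the function name is hypothetical.

```python
def token_density(width, height, num_tokens):
    """Pixels encoded per visual token."""
    return (width * height) / num_tokens


# 1344 x 1344 = 1,806,336 pixels encoded into 640 visual tokens.
density = token_density(1344, 1344, 640)
print(density)  # 2822.4

# "75% fewer tokens than most models" implies a typical budget of 640 / 0.25 tokens.
typical_tokens = 640 / (1 - 0.75)
print(typical_tokens)  # 2560.0
```

A higher density means fewer tokens reach the LLM for the same image, which is where the latency and memory savings described above come from.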

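Model Architecture.

End-to-end Omni-modal Architecture. Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge. The model is trained in a fully end-to-end manner with only cross-entropy (CE) loss.
Omni-modal Live Streaming Mechanism. (1) We change the offline modality encoders/decoders into online ones for streaming inputs/outputs. (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
Configurable Speech Modeling Design. We devise a multimodal system prompt, including a traditional text system prompt and a new audio system prompt to determine the assistant voice. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.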
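The general idea of time-division multiplexing described above can be sketched in a few lines: parallel streams are cut into fixed-size time slices and interleaved into one sequence. This is a toy illustration of the concept only, not the model's actual implementation; the function name, the stream layout, and the assumption that every stream has the same number of units per slice are all simplifications made for this example.

```python
def time_division_multiplex(streams, slice_len):
    """Interleave parallel streams into one ordered list of (modality, chunk) pairs.

    streams: dict mapping a modality name to its list of units (e.g. frames or
    audio chunks), all assumed to be aligned on the same timeline.
    slice_len: number of units each modality contributes per time slice.
    """
    sequence = []
    longest = max((len(units) for units in streams.values()), default=0)
    for start in range(0, longest, slice_len):
        # Within one time slice, emit each modality's chunk in a fixed order.
        for modality, units in streams.items():
            chunk = units[start:start + slice_len]
            if chunk:
                sequence.append((modality, chunk))
    return sequence


streams = {"video": ["v0", "v1", "v2", "v3"], "audio": ["a0", "a1", "a2", "a3"]}
print(time_division_multiplex(streams, slice_len=2))
# [('video', ['v0', 'v1']), ('audio', ['a0', 'a1']), ('video', ['v2', 'v3']), ('audio', ['a2', 'a3'])]
```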

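Evaluation

Click to view visual understanding results.

Click to view audio understanding and speech conversation results.

Click to view multimodal live streaming results.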

Examples

We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recording on an iPad Pro and a Web demo.

Legacy Models

Model	Introduction and Guidance
MiniCPM-V 4.0	Document
MiniCPM-V 2.6	Document
MiniCPM-Llama3-V 2.5	Document
MiniCPM-V 2.0	Document
MiniCPM-V 1.0	Document
OmniLMM-12B	Document

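MiniCPM-V & o Cookbook

Discover comprehensive, ready-to-deploy solutions for the MiniCPM-V and MiniCPM-o model series in our structured cookbook, which helps developers rapidly implement multimodal AI applications with integrated vision, speech, and live-streaming capabilities. Key features include:

Easy Usage Documentation

Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.

Broad User Spectrum

We support a wide range of users, from individuals to enterprises and researchers.

Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup.
Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
Researchers: Leverage advanced frameworks including Transformers, LLaMA-Factory, SWIFT, and Align-anything to enable flexible model development and cutting-edge experimentation.

Versatile Deployment Scenarios

Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.

Web demo: Launch an interactive multimodal AI web demo with FastAPI.
Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF and BNB.
End devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.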

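Chat with Our Demo on Gradio 🤗

We provide online and local demos powered by Hugging Face Gradio, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.

Online Demo

Click here to try out the online demo of MiniCPM-o 2.6 | MiniCPM-V 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0.

Local WebUI Demo

You can easily build your own local WebUI demo using the following commands.

Please ensure that transformers==4.44.2 is installed, as other versions may have compatibility issues.

If you are using an older version of PyTorch, you might encounter the error "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'. In that case, please add self.minicpmo_model.tts.float() during model initialization.

For the real-time voice/video call demo:

Launch the model server: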

pip install -r requirements_o2.6.txt

python web_demos/minicpm-o_2.6/model_server.py

Launch the web server:

# Make sure Node and PNPM are installed.
sudo apt-get update
sudo apt-get install nodejs npm
npm install -g pnpm


cd web_demos/minicpm-o_2.6/web_server
# create ssl cert for https, https is required to request camera and microphone permissions.
bash ./make_ssl_cert.sh  # output key.pem and cert.pem

pnpm install  # install requirements
pnpm run dev  # start server

Open https://localhost:8088/ in your browser and enjoy the real-time voice/video call.

Chatbot demo:

pip install -r requirements_o2.6.txt

python web_demos/minicpm-o_2.6/chatbot_web_demo_o2.6.py

Open http://localhost:8000/ in your browser and enjoy the vision mode chatbot.

Inference

Model Zoo

Model	Device	Memory	Description	Download
MiniCPM-V 4.5	GPU	18 GB	The latest version, strong end-side multimodal performance for single image, multi-image and video understanding.	🤗
MiniCPM-V 4.5 gguf	CPU	8 GB	The gguf version, lower memory usage and faster inference.	🤗
MiniCPM-V 4.5 int4	GPU	9 GB	The int4 quantized version, lower GPU memory usage.	🤗
MiniCPM-V 4.5 AWQ	GPU	9 GB	The int4 quantized version, lower GPU memory usage.	🤗
MiniCPM-o 2.6	GPU	18 GB	The latest version, achieving GPT-4o level performance for vision, speech and multimodal live streaming on end-side devices.	🤗
MiniCPM-o 2.6 gguf	CPU	8 GB	The gguf version, lower memory usage and faster inference.	🤗
MiniCPM-o 2.6 int4	GPU	9 GB	The int4 quantized version, lower GPU memory usage.	🤗

Multi-turn Conversation

If you wish to enable long-thinking mode, provide the argument enable_thinking=True to the chat function.

pip install -r requirements_o2.6.txt

Please refer to the following codes to run.

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(100)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True, # or openbmb/MiniCPM-o-2_6
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True) # or openbmb/MiniCPM-o-2_6

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled.

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking
)
print(answer)

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

You will get the following output:

# round1
The landform in the picture is karst topography. Karst landscapes are characterized by distinctive, jagged limestone hills or mountains with steep, irregular peaks and deep valleys—exactly what you see here These unique formations result from the dissolution of soluble rocks like limestone over millions of years through water erosion.

This scene closely resembles the famous karst landscape of Guilin and Yangshuo in China’s Guangxi Province. The area features dramatic, pointed limestone peaks rising dramatically above serene rivers and lush green forests, creating a breathtaking and iconic natural beauty that attracts millions of visitors each year for its picturesque views.

# round2
When traveling to a karst landscape like this, here are some important tips:

1. Wear comfortable shoes: The terrain can be uneven and hilly.
2. Bring water and snacks for energy during hikes or boat rides.
3. Protect yourself from the sun with sunscreen, hats, and sunglasses—especially since you’ll likely spend time outdoors exploring scenic spots.
4. Respect local customs and nature regulations by not littering or disturbing wildlife.

By following these guidelines, you'll have a safe and enjoyable trip while appreciating the stunning natural beauty of places such as Guilin’s karst mountains.

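Chat with Multiple Images

Click to view Python code running MiniCPM-V-4_5 with multiple images input.

In-context Few-shot Learning

Click to view Python code running MiniCPM-V-4_5 with few-shot input.

Chat with Video

Click to view Python code running MiniCPM-V-4_5 with video input and 3D-Resampler.

Speech and Audio Mode

Model initialization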

import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()

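Mimick

The Mimick task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.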

mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked

# `./assets/input_examples/fast-pace.wav`, 
# `./assets/input_examples/chi-english-1.wav` 
# `./assets/input_examples/exciting-emotion.wav` 
# for different aspects of speech-centric features.

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav', # save the tts result to output_audio_path
)

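General Speech Conversation with Configurable Voices

A general usage scenario of MiniCPM-o-2.6 is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, MiniCPM-o-2.6 sounds more natural and human-like. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner.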

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# round one
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# round two
msgs.append({'role': 'assistant', 'content': res})  # list.append mutates msgs in place and returns None
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)  # msgs now holds the full history plus the new user turn
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
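When carrying history across rounds, keep in mind that Python's list.append mutates the list in place and returns None, so its return value should not be assigned back to the message list. The sketch below shows one way to extend the history safely; extend_history is a hypothetical helper written for this illustration, not part of the model's API, and the placeholder strings stand in for real audio arrays and model replies.

```python
def extend_history(msgs, assistant_reply, next_user_content):
    """Append the assistant reply and the next user turn; return the same list."""
    msgs.append({'role': 'assistant', 'content': assistant_reply})
    msgs.append({'role': 'user', 'content': next_user_content})
    return msgs


# Placeholder content; in practice these are audio arrays and model outputs.
msgs = [{'role': 'user', 'content': ['first question']}]
msgs = extend_history(msgs, 'first answer', ['second question'])
print([m['role'] for m in msgs])  # ['user', 'assistant', 'user']

# The pitfall: append itself returns None.
print([].append(1))  # None
```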

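Speech Conversation as an AI Assistant

An enhanced feature of MiniCPM-o-2.6 is to act as an AI assistant, but with a limited choice of voices. In this mode, MiniCPM-o-2.6 is less human-like and more like a voice assistant, and the model is more instruction-following. For the demo, we suggest using assistant_female_voice, assistant_male_voice, and assistant_default_female_voice. Other voices may work but are not as stable as the default voices.

Please note that assistant_female_voice and assistant_male_voice are more stable but sound like robots, while assistant_default_female_voice is more human-like but not stable: its voice often changes across multiple turns. We suggest you try the stable voices assistant_female_voice and assistant_male_voice.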

ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') 
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question

# round one
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# round two
msgs.append({'role': 'assistant', 'content': res})  # list.append mutates msgs in place and returns None
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)  # msgs now holds the full history plus the new user turn
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)

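Instruction-to-Speech

MiniCPM-o-2.6 can also do Instruction-to-Speech, i.e., voice creation. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/.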

instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.'

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)

Voice Cloning

MiniCPM-o-2.6 can also do zero-shot text-to-speech, i.e., voice cloning. In this mode, the model acts like a TTS model.

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)

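Addressing Various Audio Understanding Tasks

MiniCPM-o-2.6 can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

ASR with ZH (same as AST en2zh): 请仔细听这段音频片段，并将其内容逐字记录。
ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
General Audio Caption: Summarize the main content of the audio.
General Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.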

task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts.
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
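If you switch between these tasks often, the prompts listed above can be kept in a small lookup table. The sketch below is one possible arrangement: the prompt strings are copied from the list above, while the dictionary keys and the build_audio_msgs helper are hypothetical names introduced for this example. The returned list has the same shape as the msgs argument passed to model.chat in the snippet above.

```python
# Prompt strings copied from the task list above; the keys are illustrative.
AUDIO_TASK_PROMPTS = {
    "asr_zh": "请仔细听这段音频片段，并将其内容逐字记录。",
    "asr_en": "Please listen to the audio snippet carefully and transcribe the content.",
    "speaker_analysis": "Based on the speaker's content, speculate on their gender, "
                        "condition, age range, and health status.",
    "audio_caption": "Summarize the main content of the audio.",
    "scene_tagging": "Utilize one keyword to convey the audio's content or the associated scene.",
}


def build_audio_msgs(task, audio_input):
    """Build a single-turn message list: [task prompt + newline, audio array]."""
    prompt = AUDIO_TASK_PROMPTS[task] + "\n"
    return [{'role': 'user', 'content': [prompt, audio_input]}]


# A short list stands in for the array returned by librosa.load(...)[0].
msgs = build_audio_msgs("audio_caption", [0.0, 0.1, -0.1])
print(msgs[0]['content'][0])
```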

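Multimodal Live Streaming

Click to view Python code running MiniCPM-o 2.6 with chat inference.

Click to view Python code running MiniCPM-o 2.6 with streaming inference.

Inference on Multiple GPUs

You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across multiple GPUs. Please refer to this tutorial for detailed instructions on how to load the model and run inference using multiple low-VRAM GPUs.

Inference on Mac

Click to view an example of running MiniCPM-Llama3-V 2.5 on 💻 Mac with MPS (Apple silicon or AMD GPUs).

Efficient Inference with llama.cpp, Ollama, vLLM

See our fork of llama.cpp for more detail. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).

See our fork of Ollama for more detail. This implementation supports smooth inference at 16~18 tokens/s on iPad (test environment: iPad Pro + M4).

vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. For now, you can run MiniCPM-o 2.6 with our fork. Click to view.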

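Fine-tuning

Simple Fine-tuning

We support simple fine-tuning with Hugging Face for MiniCPM-o 2.6, MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0.

Reference Document

With Align-Anything

We support fine-tuning MiniCPM-o 2.6 by the PKU-Alignment Team (both vision and audio, SFT and DPO) with the Align-Anything framework. Align-Anything is a scalable framework that aims to align any-modality large models with human intentions, open-sourcing the datasets, models and benchmarks. Thanks to its concise and modular design, it supports 30+ open-source benchmarks, 40+ models and algorithms including SFT, SimPO, RLHF, etc. It also provides 30+ directly runnable scripts, making it suitable for beginners to quickly get started.

Best Practices: MiniCPM-o 2.6.

With LLaMA-Factory

We support fine-tuning MiniCPM-o 2.6 and MiniCPM-V 2.6 with the LLaMA-Factory framework. LLaMA-Factory provides a solution for flexibly customizing the fine-tuning (Lora/Full/Qlora) of 200+ LLMs without the need for coding, through the built-in web UI LLaMABoard. It supports various training methods like sft/ppo/dpo/kto and advanced algorithms like Galore/BAdam/LLaMA-Pro/Pissa/LongLoRA.

Best Practices: MiniCPM-o 2.6 | MiniCPM-V 2.6.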