KTransformers

🎉 Introduction

KTransformers, pronounced "Quick Transformers", is designed to enhance your 🤗 Transformers experience with advanced kernel optimizations and placement/parallelism strategies.

KTransformers is a flexible, Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI.
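To make the idea of injection concrete, here is a minimal, framework-free sketch of pattern-based module replacement. It is not the KTransformers API: `Linear`, `FastLinear`, `Block`, `Model`, and `inject` are all hypothetical names invented for illustration, and the "optimized" module does the same arithmetic as the original. The only point is that a single call can swap matching submodules while the rest of the model keeps its interface.

```python
import re


class Linear:
    """Toy stand-in for a framework layer (hypothetical): scales its input."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return x * self.weight


class FastLinear(Linear):
    """Hypothetical 'optimized' replacement with the same interface."""

    def __init__(self, original):
        # Reuse the weight of the module being replaced.
        super().__init__(original.weight)


class Block:
    def __init__(self):
        self.children = {"attn": Linear(2.0), "mlp": Linear(3.0)}


class Model:
    def __init__(self):
        self.children = {"layer0": Block(), "layer1": Block()}


def inject(module, pattern, factory, prefix=""):
    """Replace each child whose dotted name fully matches `pattern`.

    Returns the dotted names of the replaced modules.
    """
    replaced = []
    for name, child in list(module.children.items()):
        full_name = f"{prefix}{name}"
        if re.fullmatch(pattern, full_name):
            module.children[name] = factory(child)
            replaced.append(full_name)
        elif hasattr(child, "children"):
            # Recurse into containers that were not replaced themselves.
            replaced += inject(child, pattern, factory, prefix=full_name + ".")
    return replaced


model = Model()
# The "single line": swap every MLP layer for the optimized variant.
replaced = inject(model, r"layer\d+\.mlp", FastLinear)
```

Matching on dotted module names keeps the replacement rule separate from the model definition, so the same model code can run with or without the optimized modules.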

Our vision for KTransformers is to serve as a flexible platform for experimenting with innovative LLM inference optimizations. Please let us know if you need any other features.

🔥 Updates

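February 25, 2025: Support FP8 GPU kernels for DeepSeek-V3 and R1; longer context.
February 10, 2025: Support running DeepSeek-R1 and V3 on a single GPU (24GB VRAM) or multiple GPUs with 382GB of DRAM, with up to a 3~28x speedup. See the detailed tutorial.
August 28, 2024: Support 1M context with the InternLM2.5-7B-Chat-1M model, using 24GB of VRAM and 150GB of DRAM. See the detailed tutorial.
August 28, 2024: Reduce the VRAM required by DeepSeek-V2 from 21GB to 11GB.
August 15, 2024: Update the detailed TUTORIAL for injection and multi-GPU.
August 14, 2024: Support llamafile as a linear backend.
August 12, 2024: Support multiple GPUs; support new models: Mixtral 8*7B and 8*22B; support q2k, q3k, and q5k dequantization on GPU.
August 9, 2024: Support native Windows.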