DFlash is a lightweight block diffusion model designed for speculative decoding. It enables efficient and high-quality parallel drafting.
[Demo video: DFlash_demo.mp4]
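For intuition, the snippet below is a toy, self-contained illustration of the draft-and-verify loop that speculative decoding relies on. It uses stand-in functions over integers rather than real models and greedy token matching for acceptance; it is only an illustration, not DFlash's block-diffusion drafter, which proposes an entire block in parallel.

```python
# Toy illustration of speculative decoding (not DFlash's actual algorithm):
# a cheap drafter proposes a block of tokens, the expensive target model verifies it,
# the longest matching prefix is accepted, and the first mismatch is replaced by the
# target's own token.
def draft_block(prefix: list[int], block_size: int) -> list[int]:
    # Cheap drafter: always guesses "+2" steps.
    return [prefix[-1] + 2 * (i + 1) for i in range(block_size)]

def target_next_token(prefix: list[int]) -> int:
    # "Expensive" target model: +1 after multiples of five, otherwise +2.
    return prefix[-1] + (1 if prefix[-1] % 5 == 0 else 2)

def speculative_decode(prefix: list[int], new_tokens: int, block_size: int = 4) -> list[int]:
    out = list(prefix)
    while len(out) - len(prefix) < new_tokens:
        block = draft_block(out, block_size)
        # In a real system the target scores the whole drafted block in one parallel
        # forward pass; here we just compare token by token.
        for tok in block:
            expected = target_next_token(out)
            out.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token and redraft from here
    return out[: len(prefix) + new_tokens]

print(speculative_decode([0], new_tokens=8))
```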
- Qwen3-4B: https://huggingface.co/z-lab/Qwen3-4B-DFlash-b16
- Qwen3-8B: https://huggingface.co/z-lab/Qwen3-8B-DFlash-b16
- Qwen3-Coder-30B-A3B: https://huggingface.co/z-lab/Qwen3-Coder-30B-A3B-DFlash
- Llama-3.1-8B-Instruct: https://huggingface.co/z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat
- openai/gpt-oss-20b
- openai/gpt-oss-120b
- zai-org/GLM-4.7
- zai-org/GLM-4.7-Flash
- Qwen/Qwen3-Coder-Next
💡 Feel free to open a GitHub issue if you’d like to request support for additional models!
We will also open-source the training recipe soon, so you can train your own DFlash draft model to accelerate any LLM.
Set up the environment and install the dependencies:

```bash
conda create -n dflash python=3.11
conda activate dflash
git clone https://github.com/z-lab/dflash.git
cd dflash
pip install uv
uv pip install -r requirements.txt
# Optionally install flash-attn.
# If unavailable, evaluation falls back to torch.sdpa in the Transformers backend.
# The measured speedup will be lower, but the acceptance length remains comparable.
# uv pip install flash-attn --no-build-isolation
```
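If you want to confirm whether flash-attn will actually be picked up, a quick optional check (not part of the DFlash tooling) is whether the `flash_attn` package is importable:

```python
# Optional sanity check: see whether the flash_attn package is importable.
# If it is not, the Transformers-backend evaluation falls back to torch SDPA as noted above.
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not found; evaluation will fall back to torch.sdpa.")
else:
    print("flash-attn is available.")
```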
To serve a model with DFlash speculative decoding in SGLang, launch the server with the corresponding DFlash draft model:

```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
--model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path z-lab/Qwen3-Coder-30B-A3B-DFlash \
--tp-size 1 \
--dtype bfloat16 \
--attention-backend fa3 \
--mem-fraction-static 0.75 \
--trust-remote-code
```
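Once the server is up, it can be queried through SGLang's OpenAI-compatible endpoint. The snippet below is a minimal sketch assuming SGLang's default port 30000; adjust the host, port, and model name to match your deployment.

```python
# Minimal sketch: send a chat request to the SGLang server's OpenAI-compatible API.
# Assumes the default port 30000; speculative decoding with DFlash happens server-side,
# so the request itself is a standard chat-completions call.
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
        "max_tokens": 256,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```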
You can also run DFlash speculative decoding directly with Hugging Face Transformers:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

model = AutoModel.from_pretrained(
"z-lab/Qwen3-8B-DFlash-b16",
trust_remote_code=True,
dtype="auto",
device_map="cuda:0"
).eval()
target = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-8B",
dtype="auto",
device_map="cuda:0"
).eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt = "How many positive whole-number divisors does 196 have?"
messages = [
{"role": "user", "content": prompt}
]
# Note: this draft model is intended for use with thinking mode disabled
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
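# spec_generate performs speculative decoding: the DFlash draft model proposes blocks of
# tokens in parallel and the target model verifies them; temperature=0.0 selects greedy decoding.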
generate_ids = model.spec_generate(
input_ids=model_inputs["input_ids"],
max_new_tokens=2048,
temperature=0.0,
target=target,
stop_token_ids=[tokenizer.eos_token_id]
)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
```

We provide scripts to reproduce the speedup and acceptance-length metrics reported in the paper. The reported results were measured on NVIDIA H200 or B200 GPUs.
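As a rough back-of-envelope (generic speculative-decoding arithmetic, not the paper's exact accounting): if each verification step accepts `acceptance_length` tokens on average and drafting adds a relative per-step overhead, the expected speedup is roughly the ratio below.

```python
# Back-of-envelope estimate relating acceptance length to end-to-end speedup.
# Generic speculative-decoding arithmetic with illustrative numbers,
# not a formula or result taken from the DFlash paper.
def estimated_speedup(acceptance_length: float, draft_overhead: float) -> float:
    # Each step costs one target verification pass (cost 1) plus the relative
    # drafting cost, and yields `acceptance_length` tokens instead of one.
    return acceptance_length / (1.0 + draft_overhead)

print(f"{estimated_speedup(acceptance_length=4.0, draft_overhead=0.2):.2f}x")  # -> 3.33x
```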
To run the benchmark on the Transformers backend:

```bash
bash run_benchmark.sh
```

To run the benchmark on SGLang:

```bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python benchmark_sglang.py \
--target-model Qwen/Qwen3-8B \
--draft-model z-lab/Qwen3-8B-DFlash-b16 \
--concurrencies 1,4,8,16,32 \
--dataset-name math500 \
--attention-backends fa3,flashinfer \
--tp-size 1 \
--output-md sglang_results.md
```

Huge thanks to @dcw02, @gongy, and the other folks at @modal-labs for the fast, high-quality support in bringing DFlash into SGLang, making it possible to truly accelerate LLM serving in real-world deployments.
If you find DFlash useful for your research or applications, please cite our project:

```bibtex
@misc{chen2026dflash,
title = {DFlash: Block Diffusion for Flash Speculative Decoding},
author = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
year = {2026},
eprint = {2602.06036},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2602.06036}
}
```
