CoVSpec → TMC Extension Survey Report | Top 8 Papers on VLM Speculative Decoding

Topic: Top 8 papers related to Device-Edge VLM Speculative Decoding
Generated: 2026-05-25 10:50 CST
Target journal: IEEE Transactions on Mobile Computing (TMC)
Baseline paper: CoVSpec: Device-Edge VLM Co-Inference via Speculative Decoding

📊 Top 8 Papers Overview

#	Paper	arXiv	Published	Core direction	Speedup
1	DREAM	2505.19201	2025.05	VLM SD + Cross-Attention + Visual Compression	Up to 3.6×
2	SpecVLM	2509.11815	2025.09	VLM SD + Elastic Visual Compressor	2.5–2.9×
3	HiViS	2509.23928	2025.09	Drafter with hidden visual tokens	Significant AAL (average acceptance length) gain
4	MASSV	2505.10526	2025.05	Multi-modal Adaptation for VLM SD	1.46×, AAL +30%
5	Sparrow	2602.15318	2026.02	Video LLM SD + long sequences	2.82× (25K tokens)
6	FastVLM	2510.22641	2025.10	Self-Speculative Decoding (SSD)	1.55–1.85×
7	DSSD	2507.12000	ICML 2025	Device-edge distributed split SD	Much lower communication cost
8	iLLaVA	2412.06263	ICLR 2026	Visual token merging + end-to-end acceleration	2× throughput, 4× shorter prefill

🔍 Paper-by-Paper Details

1. DREAM — Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion

VLM Speculative Decoding 3.6× Speedup arXiv:2505.19201

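📄 Introduction Summary

SD has proven effective for LLMs, but porting it directly to VLMs faces three challenges: (1) the large number of visual tokens makes the drafter expensive to run; (2) a text-only drafter has no visual perception; (3) the token distributions of the drafter and the target are misaligned. DREAM is presented as the first work to address all three systematically, with a three-part design.

🔬 Core Methods

Cross-Attention Feature Injection: a cross-attention mechanism injects the target model's intermediate-layer fused visual-text features into the drafter, giving the drafter implicit visual perception
Entropy-Adaptive Layer Selection: uses attention entropy to automatically pick the most informative intermediate target layer for feature extraction, reducing redundant transfer
Visual Token Compression: compresses visual tokens on the drafter side (pooling/pruning) to cut the drafter's compute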
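
To make the entropy-adaptive selection step concrete, here is a minimal sketch. It assumes that "most informative" means the layer whose attention rows have the lowest mean Shannon entropy; that criterion, the function names, and the toy data are all illustrative assumptions, and DREAM's actual rule may differ.

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)

def select_layer(attn_by_layer):
    """Return the layer index with the lowest mean attention entropy.

    attn_by_layer maps a layer index to a list of attention rows,
    each row summing to 1.
    ASSUMPTION: lowest entropy == most informative. This is a stand-in
    criterion for illustration, not the rule from the DREAM paper.
    """
    mean_entropy = {
        layer: sum(attention_entropy(row) for row in rows) / len(rows)
        for layer, rows in attn_by_layer.items()
    }
    return min(mean_entropy, key=mean_entropy.get)

# Hypothetical attention rows for three candidate layers.
toy = {
    8: [[0.25, 0.25, 0.25, 0.25]],   # uniform attention, entropy ln(4)
    16: [[0.97, 0.01, 0.01, 0.01]],  # sharply peaked attention
    24: [[0.4, 0.3, 0.2, 0.1]],
}
chosen = select_layer(toy)
```

In this toy input the sharply peaked layer is selected; a real system would compute the rows from the target model's attention maps.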

🏗️ Pipeline

Image → Vision Encoder → [Target VLM: Full Forward → Entropy-Adaptive Layer Selection → Cross-Attn Feature Extraction] ↘ [Draft Model: Compressed Visual Tokens + Cross-Attn Fusion → Autoregressive Draft K tokens] → Target Verify (parallel) → Accept/Reject → Next Iteration

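📊 Experimental Results

Validated on LLaVA-1.5/1.6, Pixtral, SmolVLM, and Gemma3
Up to 3.6× speedup (vs. standard autoregressive decoding)
Acceptance length markedly higher than all previous VLM SD baselines
Code is open source: github.com/SAI-Lab-NYU/DREAM

💡 Transferable Ideas

The entropy-adaptive layer-selection idea can be applied directly to CoVSpec's adaptive draft length decision
Cross-attention feature injection can complement decoupled verification-correction
Visual token compression is complementary to CoVSpec's visual token reduction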

2. SpecVLM — Fast Speculative Decoding in Vision-Language Models

VLM Speculative Decoding 2.9× Speedup EAGLE-2 Baseline arXiv:2509.11815

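📄 Introduction Summary

Porting SD directly to VLMs runs into compute/memory blow-up caused by the large number of visual tokens in the prefill stage. SpecVLM optimizes at two levels: (1) it builds a strong EAGLE-2-style baseline (EagleVLM) that achieves 1.5–2.3× speedup; (2) it proposes an elastic visual compressor that compresses visual tokens further.

🔬 Core Methods

EagleVLM Baseline: adapts EAGLE-2's feature-level drafting paradigm to VLMs
Elastic Visual Compressor: adaptively chooses among four compression primitives (Pruning / Pooling / Convolution / Resampler), balancing FLOPs against accuracy per input
Online-Logit Distillation: trains the drafter with on-the-fly teacher logits + penultimate features, avoiding the storage and preprocessing cost of an offline distillation corpus
Training-Time Scaling Effect: finds that longer online training monotonically improves AAL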
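
The online-logit distillation objective can be sketched as a cross-entropy term plus a Smooth L1 term. The sketch below assumes the cross-entropy is taken against the teacher's soft targets and the Smooth L1 term compares feature vectors; the weighting `w_feat` and all function names are hypothetical and not taken from the SpecVLM code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_cross_entropy(teacher_logits, draft_logits):
    """Cross-entropy of the draft distribution against teacher soft targets."""
    teacher_p = [math.exp(v) for v in log_softmax(teacher_logits)]
    draft_logp = log_softmax(draft_logits)
    return -sum(p * lq for p, lq in zip(teacher_p, draft_logp))

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance: quadratic below beta, linear above it."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def distill_loss(teacher_logits, draft_logits, teacher_feat, draft_feat, w_feat=1.0):
    """ASSUMPTION: total loss = logit CE + w_feat * feature Smooth L1."""
    ce = soft_cross_entropy(teacher_logits, draft_logits)
    return ce + w_feat * smooth_l1(draft_feat, teacher_feat)
```

Because the teacher logits are consumed as they are produced, nothing in this loss requires a stored distillation corpus, which is the point the method description makes.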

🏗️ Pipeline

Image → Elastic Visual Compressor (Prune/Pool/Conv/Resample) → Target VLM (Full Forward) → Draft Model (EAGLE-2 style: feature-level autoregressive draft K tokens) → Target Verify → Accept/Reject → Online-Logit Distillation (training loop, CE + Smooth L1)

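📊 Experimental Results

On LLaVA and the MMMU benchmark, reaches 2.5–2.9× end-to-end speedup within 5 epochs
Lossless decoding (preserves the target output distribution)
Consistently effective across resolutions and task difficulty
Code: github.com/haiduo/SpecVLM

💡 Transferable Ideas

The elastic compressor's selection space can be extended to CoVSpec's visual token reduction module
The training scaling effect of online-logit distillation offers a reference for training CoVSpec's drafter
The compression primitives (prune/pool/conv/resample) can serve as candidates for CoVSpec's token reduction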
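
"Lossless" here refers to the usual speculative-sampling property that the emitted tokens follow the target distribution exactly. The generic acceptance rule behind that property can be sketched as follows; this is the standard rule for a single drafted token, not code from SpecVLM, and the function name is illustrative.

```python
import random

def verify_token(token, p_target, q_draft, rng=random):
    """Standard speculative-sampling check for one drafted token.

    p_target and q_draft are probability lists over the same vocabulary,
    and `token` was sampled from q_draft (so q_draft[token] > 0).
    Returns (accepted, token_to_emit).
    """
    # Accept with probability min(1, p/q).
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return True, token
    # Rejected: resample from the normalized residual max(p - q, 0).
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    r, acc = rng.random() * sum(residual), 0.0
    for idx, w in enumerate(residual):
        acc += w
        if r < acc:
            return False, idx
    # Floating-point fallback: return the largest residual entry.
    return False, max(range(len(residual)), key=residual.__getitem__)
```

Summing the accept and resample paths gives min(p, q) + max(p - q, 0) = p for every token, which is why the output distribution matches the target's.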

3. HiViS — Hiding Visual Tokens from the Drafter for Speculative Decoding in VLMs

VLM Speculative Decoding Semantic Fusion arXiv:2509.23928

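📄 Introduction Summary

Visual tokens in VLMs are observed to be highly redundant, and many can be removed without hurting generation quality. HiViS's core insight: the drafter does not need to see the raw visual tokens directly. It obtains visual information indirectly by using the target VLM as a "semantic fuser", so the drafter's prefill sequence length matches that of text-only input.

🔬 Core Methods

Visual Token Hiding: the drafter processes no visual tokens at all and receives only the semantic signal fused by the target VLM
Time-Step-Aware Aligned Training: the drafter autonomously propagates and refines visual-text semantics through step-dependent bias-correction residuals
Target as Semantic Fusion Model: the target VLM acts as the visual-text semantic fuser, giving the drafter implicit visual perception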

🏗️ Pipeline

Image → Target VLM (visual encoding + deep-layer semantic fusion) → [Drafter: Text-only prefill (visual tokens hidden) + Time-Step Bias Correction → Autoregressive Draft] → Target Verify → Accept/Reject

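📊 Experimental Results

Validated across multiple representative VLMs and benchmarks
Significantly improves average acceptance length and speedup ratio
Drafter prefill latency drops substantially (no visual tokens to process)

💡 Transferable Ideas

"Visual Token Hiding" is naturally compatible with CoVSpec's visual token reduction + device-edge split
Time-step-aware training could improve CoVSpec's adaptive draft length decision
The semantic-fusion idea can reduce the amount of visual information that must be sent between device and edge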

4. MASSV — Multimodal Adaptation and Self-Data Distillation for VLM Speculative Decoding

VLM Speculative Decoding Drafter Training arXiv:2505.10526

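📄 Introduction Summary

Applying SD to VLMs poses two fundamental challenges: (1) the small language models that could serve as efficient drafters lack the architectural components to process visual input; (2) the small model's token predictions do not match those of the VLM target, which conditions on visual context. MASSV turns an existing small language model into an effective multimodal drafter in two phases.

🔬 Core Methods

Phase 1, Vision Encoder Connection: connects the target VLM's vision encoder to the small drafter model through a lightweight trainable projector
Phase 2, Self-Distilled Visual Instruction Tuning: fine-tunes by self-distillation on responses generated by the target VLM to align the token prediction distributions
Architecture compatibility: applicable to converting any small language model into a multimodal drafter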

🏗️ Pipeline

Image → Target VLM Vision Encoder ─┬→ Target VLM LLM (Full Forward)
                                   └→ Trainable Projector → Small LM Drafter → Autoregressive Draft → Target Verify → Accept/Reject
Training: Target VLM generates responses → Self-distill to align drafter predictions

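📊 Experimental Results

Validated on the Qwen2.5-VL and Gemma3 model families
Accepted length improves by up to 30%
End-to-end inference speedup of up to 1.46× (visual tasks)
A scalable, architecture-compatible method

💡 Transferable Ideas

The projector-connection idea can inform CoVSpec's device-side drafter design
Self-distillation can reduce the training-data requirements of CoVSpec's drafter
The architecture-compatibility methodology is a useful reference for adapting to heterogeneous device-edge hardware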

5. Sparrow — Text-Anchored Window Attention with Visual-Semantic Glimpsing for Video LLM SD

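Video LLM SD Long Sequences arXiv:2602.15318

📄 Introduction Summary

When SD is applied to Video LLMs, performance collapses severely: KV-cache explosion and context-window mismatch push the drafter into attention dilution and negative visual gain. Sparrow identifies a visual semantic internalization phenomenon: key visual semantics are implicitly encoded into the text hidden states during deep-layer interaction, so the raw visual input becomes structurally redundant in deep-layer reasoning.

🔬 Core Methods

Text-Anchored Window Attention: fully offloads visual computation to the target model via hidden state reuse
Visual-Semantic Glimpsing: trains the drafter with intermediate-layer visual states as a bridge, filtering out low-level visual noise
Multi-Token Prediction: bridges the training-inference distribution shift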

🏗️ Pipeline

Video Frames → Target Vid-LLM (full visual encoding, deep semantic internalization) → [Drafter: Text-Anchored Window Attn (no raw visual tokens) + Visual-Semantic Glimpsing → Multi-Token Draft] → Target Verify → Accept/Reject

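📊 Experimental Results

Average speedup of 2.82×, even with 25K visual tokens
Effectively resolves performance degradation on long sequences
Offers a practical solution for real-time long-video tasks

💡 Transferable Ideas

The "Visual Semantic Internalization" finding can directly guide extending CoVSpec to Video VLMs
Hidden state reuse can reduce the volume of visual data sent between device and edge
Multi-token prediction can improve the efficiency of CoVSpec's parallel branching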

6. FastVLM — Self-Speculative Decoding for Fast Vision-Language Model Inference

Self-Speculative Decoding Imitation Learning arXiv:2510.22641

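📄 Introduction Summary

VLMs suffer from high compute cost and inference latency. FastVLM proposes an imitation-learning-based Self-Speculative Decoding (SSD) framework: a lightweight draft model generates tokens autoregressively, and the full model verifies them non-autoregressively. No separate drafter model is needed; an imitation network gives the draft access to the full model's deeper-layer insight.

🔬 Core Methods

Self-Speculative: the model's own shallow layers act as the drafter, and the full model acts as the verifier
Imitation Network: distills the full model's deep-layer features into the (shallow) draft and refines rejected tokens
Non-Autoregressive Verification: the target verifies draft tokens in parallel; accepted tokens pass through directly, and rejected ones are corrected by the full model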
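
The parallel verification step can be illustrated with a generic greedy prefix-match check. This sketch covers only the accept/correct logic with the correction taken from the full model's own prediction; it does not model the imitation network, and the function name and token values are hypothetical.

```python
def greedy_verify(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target's greedy picks.

    target_tokens[i] is the full model's argmax at position i, computed in
    one parallel pass over the prefix plus draft_tokens[:i]; it therefore
    has len(draft_tokens) + 1 entries.
    Returns the tokens to emit: the accepted prefix plus one target token
    (the correction at the first mismatch, or a bonus token if all match).
    """
    n = 0
    while n < len(draft_tokens) and draft_tokens[n] == target_tokens[n]:
        n += 1
    return draft_tokens[:n] + [target_tokens[n]]

# Third draft token disagrees with the full model, so it is replaced.
emitted = greedy_verify([5, 7, 9], [5, 7, 2, 4])
```

Each round emits at least one token, so decoding always makes progress even when the first draft token is rejected.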

🏗️ Pipeline

Image → Vision Encoder → [Draft: Shallow layers → Autoregressive Draft K tokens] → [Target: Full model → Non-Autoregressive Parallel Verify] → Accept/Reject → Rejected tokens refined via Imitation Network → Next Iteration

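📊 Experimental Results

1.55–1.85× speedup (vs. final layer)
Minimal performance loss
Accepted at IJCNLP-AACL 2025

💡 Transferable Ideas

The self-speculative paradigm can remove the cost of deploying a separate drafter model on the device side in CoVSpec
The imitation-network idea could improve CoVSpec's decoupled verification-correction
Non-autoregressive verification can be combined with parallel branching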

7. DSSD — Distributed Split Speculative Decoding (ICML 2025)

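Edge-Device Collaboration ICML 2025 Communication Optimization arXiv:2507.12000

📄 Introduction Summary

LLM deployment in device-edge systems is constrained by limited resources and communication overhead. Existing approaches either trade accuracy for latency or incur high uplink transmission costs. DSSD proposes distributed split SD: it keeps the SLM-LLM split and also partitions the verification stage between device and edge, replacing multiple uplink transmissions with a single downlink transmission.

🔬 Core Methods

Split Verification: partitions the verification stage between device and edge; the device performs part of the verification and the edge completes the rest
Communication optimization: replaces multiple uplink transmissions of vocabulary distributions with a single downlink transmission, substantially reducing communication latency
SLM-LLM split retained: SLM drafter on the device, LLM target/verifier on the edge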
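
A back-of-the-envelope payload comparison shows why removing vocabulary-distribution uplinks matters. The byte sizes, the vocabulary size, and the size of the downlink result message are all hypothetical assumptions chosen for illustration; the report does not give DSSD's actual message format.

```python
def baseline_uplink_bytes(k, vocab, bytes_per_prob=2, bytes_per_id=4):
    """Uplink payload if each of k drafted tokens ships with its full
    draft distribution over the vocabulary."""
    return k * (bytes_per_id + vocab * bytes_per_prob)

def split_verify_bytes(k, result_bytes, bytes_per_id=4):
    """(uplink, downlink) payload if only token ids go up and a single
    verification result comes back down.

    ASSUMPTION: result_bytes is a free parameter; its real size depends
    on what the verification result contains.
    """
    return k * bytes_per_id, result_bytes

# Hypothetical setting: 5 drafted tokens, 32,000-entry vocabulary, fp16 probs.
base = baseline_uplink_bytes(k=5, vocab=32000)
up, down = split_verify_bytes(k=5, result_bytes=64)
```

Under these assumed sizes the baseline uplink is about 320 KB per round while the split scheme moves well under 1 KB in total; the gap scales with the vocabulary size.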

🏗️ Pipeline

Device (SLM Drafter): Autoregressive Draft K tokens → Uplink: Draft tokens (compact) → Edge
Edge (LLM Target): Split Verification → Downlink: Single verification result (not full distributions) → Device → Accept/Reject → Continue

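📊 Experimental Results

Communication latency drops significantly (multiple uplinks → single downlink)
Inference quality is preserved
Outperforms existing methods
Accepted at ICML 2025

💡 Transferable Ideas

Split verification + communication optimization is the most directly related work for the CoVSpec TMC extension
The uplink-to-downlink communication pattern can be folded into CoVSpec's device-edge communication design
ICML 2025 acceptance shows the direction is recognized by top venues, so the TMC extension is timely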

8. iLLaVA — An Image is Worth Fewer Than 1/3 Input Tokens (ICLR 2026)

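Visual Token Reduction ICLR 2026 End-to-End Acceleration arXiv:2412.06263

📄 Introduction Summary

Existing methods focus only on token reduction in the LLM stage and overlook that the image encoder itself is a major compute bottleneck. iLLaVA is presented as the first to jointly optimize the image encoder and the LLM, and it proposes a token merging strategy that recycles useful information from discarded tokens.

🔬 Core Methods

Joint Encoder-LLM Optimization: performs token reduction in both the encoder and LLM stages
Token Merging with Recycling: recycles useful information from discarded tokens into the retained tokens to limit accuracy loss
End-to-end acceleration: speeds up the encoder stage as well as the LLM stage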
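
The idea of recycling discarded tokens can be sketched as a merge step: keep the highest-scoring tokens and average each dropped token into its most similar kept token. The importance scores, the dot-product similarity, and the running-average update are assumptions made for illustration, not iLLaVA's actual procedure.

```python
def merge_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and fold every dropped token
    into its most similar kept token via a running average.

    tokens: list of feature vectors (lists of floats)
    scores: one importance score per token (hypothetical, e.g. attention)
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])          # preserve original token order
    kept = [list(tokens[i]) for i in kept_idx]
    counts = [1] * keep                      # tokens merged into each slot
    for i in order[keep:]:
        # Dot-product similarity against the original kept tokens.
        sims = [sum(a * b for a, b in zip(tokens[i], tokens[j])) for j in kept_idx]
        j = max(range(keep), key=sims.__getitem__)
        counts[j] += 1
        kept[j] = [k + (t - k) / counts[j] for k, t in zip(kept[j], tokens[i])]
    return kept

# Four 2-d tokens reduced to two; the dropped ones are averaged in.
merged = merge_tokens(
    tokens=[[1.0, 0.0], [0.0, 1.0], [0.8, 0.0], [0.0, 0.6]],
    scores=[0.9, 0.8, 0.1, 0.2],
    keep=2,
)
```

Compared with plain pruning, the kept tokens here carry a trace of what was dropped, which is the intuition behind recycling.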

🏗️ Pipeline

Image → Vision Encoder (Token Merging + Recycling) → Reduced Visual Tokens → LLM → 2× Throughput, 4× Prefill Time Reduction → End-to-End Acceleration

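📊 Experimental Results

Consistent gains on image and video understanding tasks
2× throughput and a 4× reduction in prefill time
The larger model (26B) outperforms the smaller model (8B) in both accuracy and efficiency
Accepted at ICLR 2026

💡 Transferable Ideas

Token merging + recycling can be integrated into CoVSpec's visual token reduction module
The joint encoder+LLM optimization idea can guide a lighter device-side encoder
Encoder-side token reduction can further reduce device→edge communication volume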

🔄 Comparison with CoVSpec

Dimension	CoVSpec	DREAM	SpecVLM	HiViS	MASSV	Sparrow	FastVLM	DSSD	iLLaVA
Device-Edge Split	✅ Core	❌	❌	❌	❌	❌	❌	✅ Core	❌
Visual Token Reduction	✅	✅ Compression	✅ Elastic	✅ Hiding	❌	✅ Hiding	❌	❌	✅ Merging
Drafter Design	Lightweight, device-side	Cross-Attn	EAGLE-2	Text-only	Small LM+Projector	Window Attn	Self (Shallow)	SLM	N/A
Communication Optimization	✅ Core	❌	❌	❌	❌	❌	❌	✅ Split Verify	❌
Multi-Token Draft	✅ Parallel Branching	✅	✅	✅	✅	✅ Multi-Token	✅	✅	N/A
Video/Multi-Frame	❌	❌	❌	❌	❌	✅ Core	❌	❌	✅
Adaptive Mechanism	✅ Margin Gating + Adaptive Length	✅ Entropy-Adaptive	✅ Elastic Selector	✅ Time-Step-Aware	❌	❌	❌	❌	❌

💡 Concrete Ideas CoVSpec Can Borrow

1. Visual token processing optimization

2. Drafter training and alignment

3. Communication efficiency

4. Extension directions

🛤️ Recommended Journal-Extension Roadmap

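Route 1 (safest): CoVSpec + communication theory + multi-user scheduling → TMC

Core idea: building on the current CoVSpec, upgrade the communication scheme from empirical design to an information-theoretic framework. Bring in DSSD's split verification mechanism and analyze the rate-distortion trade-off between device and edge theoretically. Then extend to the multi-user setting (multi-user edge serving) and design a joint scheduling policy (e.g., Lyapunov optimization or restless bandits) that maximizes system utility under latency/energy/throughput constraints.

New contributions: (1) communication-theoretic modeling of device-edge VLM inference; (2) a multi-user joint scheduling algorithm; (3) wireless-aware rate-adaptive SD (draft length adjusted as the channel varies)

Why it is safe: communication theory + multi-user scheduling sits squarely within TMC's core interests, and CoVSpec already has the device-edge foundation, so the increment is reasonable.

Supporting papers: DSSD (ICML 2025) shows that split SD + communication optimization is recognized at top venues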
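
Contribution (3) can be illustrated with a toy model that picks the draft length maximizing expected tokens per second for a given uplink rate. It assumes an i.i.d. per-token acceptance rate `alpha`, for which the expected number of tokens per round is (1 - alpha^(gamma+1)) / (1 - alpha); every timing constant and parameter name below is a hypothetical placeholder, not a CoVSpec measurement.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per round under i.i.d. acceptance rate alpha
    with draft length gamma."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def round_latency(gamma, rate_bps, t_draft=0.01, t_verify=0.05,
                  bits_per_token=32, rtt=0.02):
    """Seconds per draft-verify round (all constants are hypothetical)."""
    uplink = gamma * bits_per_token / rate_bps
    return gamma * t_draft + uplink + t_verify + rtt

def best_draft_length(alpha, rate_bps, max_gamma=16, **kw):
    """Draft length in 1..max_gamma that maximizes tokens per second."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_tokens(alpha, g) / round_latency(g, rate_bps, **kw),
    )

# Heavy per-token payload (1 Mbit) over a slow vs. a fast link.
slow = best_draft_length(0.8, rate_bps=1e6, bits_per_token=1e6)
fast = best_draft_length(0.8, rate_bps=1e9, bits_per_token=1e6)
```

In this toy setting the slow link favors the shortest draft and the fast link a longer one, which is the qualitative behavior a channel-aware policy would exploit.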

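Route 2 (most novel): CoVSpec + Video VLM + long-sequence optimization → TMC

Core idea: extend CoVSpec from single-image VLMs to the Video VLM setting. Exploit the visual semantic internalization phenomenon identified by Sparrow to design temporal-aware visual token reduction on the device side (temporal redundancy far exceeds spatial redundancy). Also introduce adaptive frame scheduling: the device dynamically decides which frames need full encoding and which can be skipped or encoded at low quality.

New contributions: (1) the first systematic device-edge speculative decoding scheme for video VLMs; (2) temporal visual token reduction; (3) joint optimization of adaptive frame scheduling + draft length

Why it is strong: Video VLMs are a 2025-2026 hot topic, and video SD in the device-edge setting is almost unexplored, so novelty is high.

Supporting papers: Sparrow (2602.15318) + iLLaVA (ICLR 2026) show that the video/token-reduction direction is active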
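
One simple form of adaptive frame scheduling is a change-detection rule: fully encode a frame only when it differs enough from the last fully encoded one. The mean-absolute-difference metric, the threshold, and the flat-list frame representation are illustrative assumptions; the route above does not specify a scheduling rule.

```python
def schedule_frames(frames, threshold):
    """Return indices of frames to encode fully.

    A frame is kept when its mean absolute difference from the most
    recently kept frame exceeds `threshold`; the first frame is always kept.
    frames: list of equal-length feature or pixel vectors (hypothetical).
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], ref)) / len(ref)
        if diff > threshold:
            kept.append(i)
    return kept

# Frames 1 and 3 are near-duplicates of their predecessors and are skipped.
keep = schedule_frames(
    [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.05], [0.0, 0.0]],
    threshold=0.2,
)
```

A joint design would tune the threshold together with the draft length, since skipping frames lowers both device compute and uplink volume.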

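Route 3 (technical depth): CoVSpec + unified theoretical framework → TMC / JSAC

Core idea: bring CoVSpec's five modules (visual token reduction, margin-based gating, adaptive draft length, parallel branching, decoupled verification-correction) into one unified optimization framework. For example, build a theory of VLM inference from a Joint Source-Channel Coding (JSCC) perspective: visual tokens are the source, the device-edge link is the channel, and draft/verify is the joint coding. Analyze the trade-offs among modules theoretically (accuracy vs latency vs communication vs energy) and search for the optimal operating point on the Pareto front.

New contributions: (1) the first unified theoretical framework for device-edge VLM inference; (2) Pareto-optimal operating point analysis; (3) quantitative trade-off relationships among modules

Why JSAC fits: if the theory is deep enough, the Semantic Communications special issue of JSAC (IEEE Journal on Selected Areas in Communications) is also an option

Supporting papers: DSSD + SpecVLM's online distillation offer optimization ideas along different dimensions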
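
Contribution (2) rests on identifying non-dominated configurations. A minimal Pareto filter is sketched below, assuming every objective is expressed so that lower is better (for accuracy, use error rate); the sample configurations are made-up numbers, not measurements.

```python
def pareto_front(points):
    """Return the non-dominated points, with every objective minimized.

    A point dominates another if it is no worse in every objective and
    strictly better in at least one.
    """
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (latency, communication, error) triples for four configurations.
configs = [(1, 5, 0.1), (2, 4, 0.1), (2, 6, 0.2), (3, 3, 0.05)]
front = pareto_front(configs)
```

Here the third configuration is dominated by the first and drops out; the remaining three are the candidate operating points among which a utility function would choose.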

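🎯 Research Gaps (Gap Analysis)

Gap	Description	CoVSpec advantage
Device-Edge VLM SD communication theory	None of the VLM SD papers surveyed here models device-edge communication	CoVSpec is the only work here that covers both device-edge and VLM SD
Multi-User VLM Edge Serving	All SD papers surveyed here assume a single-user setting	CoVSpec's margin-gating naturally suits priority scheduling
Video VLM + Device-Edge	Sparrow does video SD without device-edge; CoVSpec has device-edge without video	Large room for direct cross-over innovation
Wireless-Aware Adaptive SD	No surveyed work feeds wireless channel state into SD's draft length / acceptance decisions	CoVSpec's adaptive draft length can be extended to be channel-aware
Energy-Latency Joint Optimization	Existing work focuses only on latency and does not model device energy	CoVSpec's device-edge architecture naturally supports energy modeling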
