CoVSpec → TMC Journal Extension Survey Report

🎯 Overview of the Current CoVSpec Work

CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

Submitted May 2026

Core techniques:

Visual Token Reduction | Margin-based Gating | Adaptive Draft Length | Parallel Branching | Decoupled Verification-Correction

Core objective: reduce VLM inference latency, communication volume, and API cost by using device-edge collaborative speculative decoding for efficient VLM inference.
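The draft-then-verify loop behind this objective can be sketched as follows. This is a minimal greedy-decoding illustration, not CoVSpec's implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target is queried one position at a time for readability, whereas a real target model would score all draft positions in a single parallel forward pass.

```python
from typing import Callable, List, Tuple


def verify_greedy(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    Returns (tokens to append, number of accepted draft tokens). The appended
    tokens always end with one token chosen by the target itself: either the
    correction at the first mismatch or a bonus token after a full accept.
    """
    out: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            out.append(expected)  # correction replaces the rejected draft token
            return out, len(out) - 1
        out.append(token)
        context.append(token)
    out.append(target_next(context))  # bonus token after accepting the whole draft
    return out, len(draft)


# Toy target (hypothetical): always continues with "previous token + 1".
def toy_target(context: List[int]) -> int:
    return context[-1] + 1


tokens, accepted = verify_greedy([1, 2], [3, 4, 9, 10], toy_target)
```

In this toy run two of the four draft tokens are accepted and the third is replaced by the target's correction, so one round yields three tokens. A longer accepted prefix means fewer draft-verify rounds for the same output length.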

🔬 Top 8 Most Relevant Papers

#	Paper	Topic	Date	arXiv
1	HiViS	VLM Speculative Decoding	Sep 2025	2509.23928
2	DREAM	Multimodal SD + Cross-Attention	May 2025	2505.19201
3	SpecVLM	VLM SD + Visual Compression	Sep 2025	2509.11815
4	Sparrow	Video LLM SD	Feb 2026	2602.15318
5	WISV	Wireless-Aware Distributed SD	Apr 2026	2604.17701
6	ProSemComVLM	Edge-Cloud Semantic Communication	Apr 2026	2604.26508
7	edgeVLM	Cloud-Edge Collaborative VLM	Aug 2025	2508.12638
8	FastVLM	Self-Speculative Decoding VLM	Oct 2025	2510.22641

📄 Paper-by-Paper Analysis

Paper 1 | VLM Speculative Decoding | Visual Token Hiding

HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Zhinan Xie, Peisong Wang, Shuang Qiu, Jian Cheng — CAS/ AIRIA/ CityU HK

📎 arXiv:2509.23928 | Sep 2025

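📖 Introduction Summary

Extending speculative decoding from LLMs to VLMs faces two obstacles: (1) a semantic gap between visual tokens and the lightweight drafter; (2) long visual token sequences that slow down drafter inference. The authors observe that visual tokens in VLMs are highly redundant and that the deeper layers have already internalized the key visual semantics. Based on this, they propose HiViS: remove visual tokens from the drafter entirely, use the target VLM as the semantic fusion model, and feed the drafter text embeddings that have absorbed visual semantics through cross-modality self-attention.

🔬 Core Method

(1) Visual Token Hiding: the drafter does not process visual tokens directly and instead uses the target VLM's fused hidden states; (2) Time-step-aware Aligned Training: step-dependent bias-correction residuals let the drafter propagate visual-text semantics on its own during independent drafting; (3) an EAGLE-2-style tree-based drafting structure.

💡 Key Innovations

First VLM SD paradigm in which the drafter needs no visual tokens at all
Identifies and exploits the visual semantic internalization phenomenon
Step-dependent bias correction enables autonomous drafting without feedback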

🔗 Pipeline

Image → Vision Encoder → Visual Tokens → [Target VLM: Multi-layer Cross-Attention Fusion]
↓
Fused Text Hidden States (vision-infused text representations)
↓
[Drafter: Tree-based Drafting w/ Step-dependent Bias Correction]
↓
Candidate Tokens → [Target VLM: Parallel Verification] → Output

📊 Experimental Results

Achieves up to 3.15× speedup on models such as LLaVA-1.5-7B while remaining lossless, and significantly increases the average acceptance length.

Paper 2 | Multimodal SD | Cross-Attention Fusion | Visual Compression

DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding

Yunhai Hu et al. — NYU / UPenn / Cerebras Systems

📎 arXiv:2505.19201 | May 2025 | Code: GitHub

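📖 Introduction Summary

SD is widely used in LLMs but remains underexplored in VLMs. VLMs must fuse visual and textual information, which poses distinct challenges for SD: visual feature extraction and cross-modal fusion increase the draft model's compute load and make semantic alignment harder. DREAM addresses this with a three-part approach.

🔬 Core Method

(1) Cross-Attention Drafting: a cross-attention mechanism injects intermediate-layer features of the target model into the draft model to improve alignment; (2) Entropy-Adaptive Feature Selection: dynamically selects the most informative intermediate-layer features, based on attention entropy, to supervise the draft model; (3) Visual Token Compression: a visual input compression scheme guided by the target's intermediate features, which reduces draft latency.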
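To make the entropy-based selection step concrete, here is a small sketch of picking one layer by attention entropy. It assumes that lower entropy (more focused attention) marks the more informative layer; that criterion, the function names, and the toy attention rows are illustrative assumptions, and DREAM's actual selection rule may differ.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def select_layer(attn_per_layer: List[List[float]]) -> int:
    """Return the index of the layer whose attention has the lowest entropy.

    Assumption for this sketch: focused (low-entropy) attention is treated
    as the most informative signal for supervising the draft model.
    """
    entropies = [attention_entropy(w) for w in attn_per_layer]
    return min(range(len(entropies)), key=entropies.__getitem__)


# Hypothetical attention rows, one per candidate intermediate layer.
layers = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximally diffuse
    [0.70, 0.10, 0.10, 0.10],  # peaked: focused attention
    [0.40, 0.30, 0.20, 0.10],
]
chosen = select_layer(layers)
```

With these toy inputs the peaked second row has the lowest entropy, so index 1 is selected. In practice the entropy would be aggregated over heads and query positions before comparing layers.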

💡 Key Innovations

First to bring cross-attention plus entropy-adaptive selection to VLM SD
Generality validated across LLaVA, Pixtral, SmolVLM, and Gemma3
Up to 3.6× speedup, outperforming all compared SD baselines

🔗 Pipeline

Image → VisEnc → Visual Tokens → [Target VLM: Full Forward → Intermediate Features]
Branch 1: Visual Compressor → Compressed Visual ───→ [Draft Model w/ Cross-Attention]
Branch 2: Entropy-Adaptive Selection → [Draft Model w/ Cross-Attention]
↓
Candidate Tokens
↓
[Target VLM: Verify] → Output

📊 Experimental Results

Evaluated on LLaVA-v1.6-7B/13B, SmolVLM-2B, Pixtral-12B, and Gemma3-12B, with up to 3.6× speedup. On multimodal benchmarks (table recognition, interactive segmentation, chart QA) it significantly outperforms baselines such as EAGLE-2.

Paper 3 | VLM SD | Elastic Visual Compression | Online Distillation

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Haiduo Huang et al. — AMD / XJTU

📎 arXiv:2509.11815 | Sep 2025 | Code: GitHub

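📖 Introduction Summary

VLM inference has two bottlenecks: in the prefill stage, the number of visual tokens grows with resolution and video length; in the decoding stage, tokens are generated one at a time. The authors' latency breakdown of LLaVA-1.6-7B finds that LLM prefill is the largest bottleneck. They propose a dual acceleration strategy that combines speculative decoding with visual token compression.

🔬 Core Method

(1) EagleVLM Baseline: an EAGLE-2-style VLM SD baseline with 1.5–2.3× speedup; (2) Elastic Visual Compressor: adaptively selects among four compression primitives (pruning, pooling, convolution, resampler), balancing FLOPs and accuracy according to input difficulty; (3) Online-Logit Distillation: trains the draft model on on-the-fly teacher logits and penultimate features without an offline distillation dataset, and reveals a training-time scaling effect.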
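The idea of choosing a compression primitive per input can be sketched with two of the four primitives. The selection rule below (prune when importance scores are concentrated, pool otherwise), the `peak_ratio` parameter, and all function names are hypothetical choices for illustration; SpecVLM's compressor learns its selection and also covers convolution and resampler primitives.

```python
from typing import List, Tuple

Token = List[float]


def prune_top_k(tokens: List[Token], scores: List[float], k: int) -> List[Token]:
    """Keep the k highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def avg_pool(tokens: List[Token], window: int) -> List[Token]:
    """Average each run of `window` consecutive tokens into one token."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def elastic_compress(
    tokens: List[Token], scores: List[float], budget: int, peak_ratio: float = 2.0
) -> Tuple[str, List[Token]]:
    """Pick a primitive: prune when importance is concentrated, else pool."""
    if len(tokens) <= budget:
        return "none", tokens
    mean_score = sum(scores) / len(scores)
    if max(scores) >= peak_ratio * mean_score:
        return "prune", prune_top_k(tokens, scores, budget)
    window = -(-len(tokens) // budget)  # ceiling division
    return "pool", avg_pool(tokens, window)


# Hypothetical 2-dimensional visual tokens.
visual = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
mode_a, out_a = elastic_compress(visual, [0.05, 0.7, 0.05, 0.2], budget=2)
mode_b, out_b = elastic_compress(visual, [0.25, 0.25, 0.25, 0.25], budget=2)
```

With a peaked score vector the sketch prunes down to the two most important tokens; with flat scores it falls back to pooling neighbours, which keeps coverage of the whole image at lower resolution.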

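💡 Key Innovations

Elastic compressor unifies four visual compression primitives and selects among them adaptively
Online-logit distillation removes the dependence on offline datasets
Identifies a training-time scaling effect in multimodal SD
2.5–2.9× end-to-end speedup, lossless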

🔗 Pipeline

Image → VisEnc → Visual Tokens → [Elastic Compressor: Prune/Pool/Conv/Resample Selector]
↓
Compressed Visual Tokens
↓
[EagleVLM Draft: 1-layer Decoder + Tree Drafting]
↓
Online Teacher Logits (Penultimate Features + Smooth L1)
↓
Candidate Tokens → [Target VLM: Verify] → Output

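📊 Experimental Results

Achieves 2.5–2.9× end-to-end speedup on the LLaVA benchmark and MMMU, stable across resolutions and task difficulty. Five epochs of online training suffice to reach peak performance, and acceptance length increases monotonically with training time.

Paper 4 | Video LLM SD | Visual Offloading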

Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Libo Zhang et al. — NUDT

📎 arXiv:2602.15318 | Feb 2026 | Code: GitHub

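📖 Introduction Summary

SD suffers a severe performance collapse in Video LLMs: long videos produce tens of thousands of visual tokens (25k+), which causes KV cache explosion in the drafter, attention dilution, and negative visual gain. The authors identify the visual semantic internalization phenomenon: key visual semantics are already implicitly encoded into the text hidden states in deep layers, so raw visual input is structurally redundant for deep-layer inference.

🔬 Core Method

(1) HSR-VATA (Hidden State Reuse + Visually-Aware Text-Anchored Window Attention): offloads visual computation entirely to the target model, with the drafter reusing only the target's text hidden states; (2) IVSB (Intermediate-layer Visual State Bridging): trains the drafter on visual states extracted from the target's intermediate layers after multi-stage fusion; (3) Multi-token Prediction: bridges the distribution shift between training and inference.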

💡 Key Innovations

First work to apply a lightweight drafter to Video LLMs
Reveals attention dilution and negative visual gain in long-video SD
Still achieves 2.82× speedup at 25k visual tokens

🔗 Pipeline

Video Frames → VisEnc → 25k+ Visual Tokens → [Target Vid-LLM: Deep Multi-Layer Fusion]
↓
Visual Semantic Internalization → Text Hidden States (carrying visual semantics)
↓
[Draft: HSR-VATA → Window Attention on Text Anchors]
↓
[IVSB Training: Intermediate Visual States → Draft Supervision]
↓
Multi-token Candidates → [Target: Verify] → Output

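📊 Experimental Results

Achieves an average 2.82× speedup in ultra-long video scenarios with 25k visual tokens, effectively resolving the performance degradation on long sequences. Compared with MSD and ViSpec, Sparrow significantly improves the average accepted length on long-video tasks.

Paper 5 | Wireless-Aware SD | Semantic Verification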

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

Zixuan Liu, Zhiyong Chen et al. — SJTU

📎 arXiv:2604.17701 | Apr 2026 | Submitted to IEEE Trans

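📖 Introduction Summary

Distributed device-edge SD has clear advantages, but conventional token-level verification causes many rejections under wireless channel fluctuations, which sharply reduces the accepted length. WISV proposes a channel-aware semantic acceptance policy that goes beyond strict token matching and incorporates channel state information (CSI) into the verification decision.

🔬 Core Method

(1) Lightweight Decision Head: a lightweight decision head integrated into the edge-side target LLM combines high-dimensional hidden representations with instantaneous CSI to evaluate speculative tokens dynamically; (2) Dual Communication Protocols: full-hidden upload and mismatch-first selective-hidden upload, which trade off verification fidelity against communication overhead; (3) validation on a hardware testbed with Jetson AGX Orin + A40.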

💡 Key Innovations

First to incorporate wireless channel state information (CSI) into the SD verification decision
Semantic acceptance goes beyond token-level exact matching
Hardware testbed validation (Jetson + A40)
60.8% longer accepted length, 37.3% fewer interaction rounds

🔗 Pipeline

Device (Drafter 1B) ↔ Edge Server (Target 8B)
Device → Edge: Generate speculative tokens
Edge: CSI Estimation
Edge: Hidden State Extraction
Edge: Decision Head (CSI + Hidden → Accept/Reject)
Edge: Protocol Select: Full / Selective Upload
Edge → Device: Accepted tokens + Corrections
Device: Continue drafting
Edge: Final output

📊 Experimental Results

With a 1B drafter and an 8B target: 60.8% longer accepted length, 37.3% fewer interaction rounds, 31.4% lower end-to-end latency, and an accuracy drop below 1%.

Paper 6 | Edge-Cloud Semantic Comm | Progressive Compression

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models (ProSemComVLM)

Cyril Shih-Huan Hsu et al. — UvA

📎 arXiv:2604.26508 | Apr 2026 | Under Review

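📖 Introduction Summary

VLM deployment faces a dilemma: full-edge inference lacks compute, while full-cloud inference is bandwidth-limited. Existing edge-cloud collaboration schemes transmit fixed-size representations and do not adapt to dynamic network conditions. The paper proposes a Meta AutoEncoder for progressive, adaptive compression of visual tokens, with plug-and-play deployment.

🔬 Core Method

(1) Meta AutoEncoder: compresses visual tokens into an adaptive, progressively refinable representation; (2) Progressive Transmission: supports flexible transmission at different information levels, with controllable communication cost and semantic fidelity; (3) End-to-End System: a full-pipeline implementation on an NXP i.MX95 embedded platform plus a GPU server, validated under a 1 Mbps constraint.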
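How a sender might pick a refinement level under a bandwidth constraint can be sketched as below. The per-level sizes, the latency budget, and the `pick_level` helper are hypothetical; only the 1 Mbps uplink figure comes from the text above, and the paper's actual level-selection policy is not reproduced here.

```python
from typing import List


def pick_level(level_bytes: List[int], bandwidth_bps: float, budget_s: float) -> int:
    """Return the highest refinement level whose cumulative upload fits the budget.

    level_bytes[k] is the size of the k-th refinement increment; sending
    level k requires all increments 0..k. Returns -1 if even level 0 does
    not fit within the time budget.
    """
    budget_bytes = bandwidth_bps * budget_s / 8.0
    sent = 0
    best = -1
    for k, size in enumerate(level_bytes):
        sent += size
        if sent > budget_bytes:
            break
        best = k
    return best


# Hypothetical increment sizes for three refinement levels (bytes).
levels = [20_000, 40_000, 80_000]
level = pick_level(levels, bandwidth_bps=1_000_000, budget_s=0.5)
```

At 1 Mbps a 0.5 s budget allows 62,500 bytes, so the coarse level plus one refinement fit while the third increment does not. Because the representation is progressive, the receiver can start inference on what has arrived and refine if more increments follow.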

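💡 Key Innovations

First application of progressive semantic communication to edge-cloud VLM inference
Plug-and-play, with no additional fine-tuning
Validated on real embedded hardware over a bandwidth-limited network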

🔗 Pipeline

Edge (NXP i.MX95) ↔ Cloud (GPU Server)
Edge: Image → VisEnc → Visual Tokens
Edge: Meta AutoEncoder: Progressive Compression
Edge → Cloud: Level-k Semantic Representation (adaptive based on channel BW)
Cloud: Decode + Reshape
Cloud: VLM Inference
Cloud → Edge: Response

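📊 Experimental Results

Under a 1 Mbps uplink, ProSemComVLM significantly reduces network latency compared with full-edge and full-cloud schemes and maintains high semantic consistency even at high compression ratios. Validated in a real deployment on NXP i.MX95 + GPU server.

Paper 7 | Cloud-Edge VLM | Context Transfer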

edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer

Chen Qian et al. — Tsinghua

📎 arXiv:2508.12638 | Aug 2025

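📖 Introduction Summary

Existing cloud-edge collaborative VLMs (partitioned LVLMs, or task offloading between large and small models) cannot adapt to cloud latency fluctuations and overlook the value of delayed but accurate LVLM responses. The paper proposes the Context Transfer paradigm: LVLM outputs that arrive late are used as historical context, giving the edge SVLM real-time guidance.

🔬 Core Method

(1) Context Transfer Paradigm: delayed LVLM outputs serve as the SVLM's historical context; (2) Context Replacement Module: refines the historical text input; (3) Visual Focus Module: strengthens visual grounding consistency. Validated on real-time inference tasks in autonomous driving and human-computer interaction.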

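💡 Key Innovations

A new cloud-edge collaboration paradigm: delayed responses become historical context
Dual-module design: Context Replacement + Visual Focus
Complementary to CoVSpec's decoupled verification-correction and worth borrowing from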

🔗 Pipeline

Edge (SVLM Real-time) ↔ Cloud (LVLM Delayed)
Edge: Image → Fast Inference (low latency) → Output: preliminary response
Cloud: Image → Deep Analysis (high latency)
Cloud: Delayed output: high-quality response
Cloud → Edge: Context Transfer
Edge: Context Replacement: refine history
Edge: Visual Focus: grounding consistency
Edge: Refined real-time output

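📊 Experimental Results

Validated on 4 datasets across 3 real-time VLM inference tasks. The Context Transfer paradigm significantly improves real-time inference quality while tolerating cloud latency fluctuations.

Paper 8 | Self-Speculative Decoding | Imitation Learning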

FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference

Divya Jyoti Bajpai, Manjesh Kumar Hanawal — IIT Bombay

📎 arXiv:2510.22641 | Oct 2025 | IJCNLP-AACL 2025

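📖 Introduction Summary

Proposes a Self-Speculative Decoding (SSD) framework based on imitation learning. A lightweight draft model generates tokens autoregressively and the full model verifies them non-autoregressively. An imitation network integrates deep-layer insights from the full model into the draft model.

🔬 Core Method

(1) Self-Speculative Decoding: no separate drafter is needed; a lightweight version is derived from the target model itself; (2) Imitation Network: distills the full model's deep-layer representations into the draft model; (3) the full model's performance is kept intact, and only the draft model is trained. 1.55–1.85× speedup with minimal performance loss.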

💡 Key Innovations

Brings imitation learning to SSD for VLMs
No additional drafter architecture design needed
Accepted at IJCNLP-AACL 2025

🔗 Pipeline

Image + Text → [FastVLM]
├─ Draft Model (Lightweight, Autoregressive) → Candidate Tokens
│    ↑ Imitation Network: Deep Insights from Full Model
├─ Full Model (Non-autoregressive Verification)
├─ Accepted → Keep | Rejected → Full Model Correction → Guide Draft
└─ Output

📊 Experimental Results

1.55–1.85× speedup, with minimal performance loss relative to the final-layer approach.

📊 Key Comparison of the Top 8 Papers

Paper	Approach	Visual Token Handling	Distributed/Edge	Speedup / Latency Gain	Relevance to CoVSpec
HiViS	Visual Token Hiding + Step Bias	Fully removed	❌	3.15×	⭐⭐⭐⭐ Complementary visual reduction
DREAM	Cross-Attention + Entropy Selection	Compressed	❌	3.6×	⭐⭐⭐ Cross-attention mechanism
SpecVLM	Elastic Compressor + Online Distill	Adaptive multi-primitive compression	❌	2.9×	⭐⭐⭐⭐⭐ Elastic approach + online distill
Sparrow	Visual Offloading + Window Attention	Fully offloaded	❌	2.82×	⭐⭐⭐⭐ Video extension direction
WISV	Wireless-CSI Semantic Verify	N/A (LLM)	✅ Device-Edge	31.4% lower latency	⭐⭐⭐⭐⭐ Directly complementary (wireless-aware)
ProSemComVLM	Progressive Semantic Compression	Progressive compressed transmission	✅ Edge-Cloud	Significantly lower network latency	⭐⭐⭐⭐ Communication optimization
edgeVLM	Context Transfer (delayed → context)	Retained	✅ Cloud-Edge	Improved real-time quality	⭐⭐⭐⭐ Complementary to decoupled verification
FastVLM	Self-SD + Imitation Learning	Retained	❌	1.85×	⭐⭐ Self-SD idea

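💡 Ideas CoVSpec Can Borrow

1. Upgrading Visual Token Handling

From HiViS + Sparrow: identify and exploit the visual semantic internalization phenomenon, in which the deep layers of a VLM have already encoded key visual semantics into the text hidden states. CoVSpec's current visual token reduction could be pushed further, to removing visual tokens from the drafter entirely and relying on the target VLM's fused hidden states.

From SpecVLM: the elastic compressor idea. Rather than fixing one compression method, adaptively select a combination of pruning/pooling/conv/resampler according to the difficulty of the input image or video. CoVSpec could extend margin-based gating into multi-primitive elastic gating.

2. Device-Edge Communication Optimization

From WISV: bring wireless channel state (CSI) into the speculative verification decision. CoVSpec's margin-based gating could add a channel-aware acceptance threshold that relaxes the acceptance criterion when the channel is poor (semantic acceptance) and tightens it when the channel is good (exact match).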
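A channel-aware threshold of this kind could look like the sketch below. It assumes the gate compares the gap between the top-1 and top-2 draft probabilities against a threshold, and that the threshold is interpolated linearly in SNR. Both the margin definition and every constant (`lo`, `hi`, the 0 to 20 dB range) are illustrative assumptions rather than CoVSpec's or WISV's actual design.

```python
from typing import List


def margin(probs: List[float]) -> float:
    """Gap between the two largest probabilities of a draft distribution."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def acceptance_threshold(
    snr_db: float,
    lo: float = 0.2,
    hi: float = 0.6,
    snr_min: float = 0.0,
    snr_max: float = 20.0,
) -> float:
    """Interpolate the margin threshold between lo (poor channel) and hi (good channel)."""
    frac = (snr_db - snr_min) / (snr_max - snr_min)
    frac = min(1.0, max(0.0, frac))  # clamp outside the assumed SNR range
    return lo + frac * (hi - lo)


def accept_locally(probs: List[float], snr_db: float) -> bool:
    """Keep the draft token without edge verification if its margin clears the threshold."""
    return margin(probs) >= acceptance_threshold(snr_db)


# Hypothetical draft distribution with a margin of about 0.35.
draft_probs = [0.6, 0.25, 0.15]
poor_channel = accept_locally(draft_probs, snr_db=0.0)
good_channel = accept_locally(draft_probs, snr_db=20.0)
```

The same draft token is kept locally on a poor channel (threshold 0.2) but sent for verification on a good one (threshold 0.6), which is the relax-when-poor, tighten-when-good behaviour described above. A learned decision head, as in WISV, would replace the linear rule.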

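From ProSemComVLM: progressive semantic communication. CoVSpec's device→edge transmission could be designed as a progressive draft upload: first send a low-precision draft for fast verification, then decide from the feedback whether to send a higher-precision supplement.

3. Multi-User Extension

From edgeVLM + WISV: treat delayed responses as historical context. In a multi-user setting, where the edge server serves several devices at once, CoVSpec's parallel branching could be extended to multi-user parallel drafts with a shared KV cache. WISV's selective upload protocol could be used for communication scheduling across users.

4. Video VLM Extension

From Sparrow: the long-sequence challenge of Video LLMs is naturally complementary to CoVSpec's adaptive draft length. CoVSpec could add a video mode that adjusts the drafter strategy by visual sequence length (keep visual tokens for short sequences, offload for long ones).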

🛤️ Recommended Journal Extension Routes

🚀 Route 1 (safest, direct increment): Wireless-Aware CoVSpec for Multi-User Edge VLM Serving

Core idea: add to the CoVSpec framework (1) wireless-channel-aware margin-based gating (borrowing WISV's CSI-aware semantic verification), (2) multi-user parallel branching with a shared edge KV cache, and (3) progressive draft transmission (borrowing from ProSemComVLM).

Increment over the original CoVSpec: Wireless-aware gating + Multi-user serving + Progressive communication = 3 substantive new dimensions

Target venues: IEEE TMC / IEEE TCCN / IEEE JSAC (ML for Communications)

Feasibility: ⭐⭐⭐⭐⭐. Both WISV and ProSemComVLM have been submitted to IEEE Transactions journals, so they can be cited directly as baselines.

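🎯 Route 2 (strong novelty): Video-Capable CoVSpec with Elastic Visual Token Reduction

Core idea: extend CoVSpec from image VLMs to video VLMs by (1) introducing elastic visual compression (borrowing from SpecVLM) that adaptively selects among pruning/pooling/conv/resampler, (2) adding visual offloading for long-video scenarios (borrowing from Sparrow), and (3) coupling adaptive draft length to the number of video frames.

Increment over the original CoVSpec: Video VLM support + Elastic compression + Visual offloading = a new application scenario + a new combination of methods

Target venues: IEEE TMM / IEEE TCSVT / ACM TOMM

Feasibility: ⭐⭐⭐⭐. Video papers such as Sparrow (Feb 2026) and Tango (Apr 2026) can serve as comparisons; SD for video VLMs is still a largely unexplored direction.

🔬 Route 3 (high impact, system-level): End-to-End Device-Edge VLM Inference System with Learning-Based Optimization

Core idea: build a complete device-edge VLM inference system that (1) uses RL or other learning-based methods to jointly optimize the visual token reduction rate, draft length, branching width, and communication schedule (CoVSpec's current parameters are heuristic), (2) introduces online distillation (borrowing from SpecVLM), and (3) is validated on a real hardware testbed (borrowing from WISV's Jetson + A40 and ProSemComVLM's NXP + server setups).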
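Before reaching for RL, a simple analytic baseline for one of these knobs, the draft length, is the standard speculative decoding cost model with an i.i.d. per-token acceptance rate α: a round with draft length γ yields (1 − α^(γ+1)) / (1 − α) tokens in expectation. The sketch below picks the γ that maximizes expected tokens per unit cost. The `round_overhead` term for device-edge communication, the cost units (one target forward pass = 1), and all function names are assumptions for illustration, not part of CoVSpec.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per draft-verify round with i.i.d. acceptance rate alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, draft_cost: float, round_overhead: float = 0.0) -> float:
    """Expected tokens per unit cost, relative to one target pass per token.

    draft_cost is the cost of one draft step and round_overhead is a fixed
    per-round cost (for example a network round trip), both measured in
    units of one target forward pass.
    """
    return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1.0 + round_overhead)


def best_draft_length(
    alpha: float, draft_cost: float, round_overhead: float = 0.0, max_gamma: int = 16
) -> int:
    """Draft length in 1..max_gamma with the highest modelled speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: speedup(alpha, g, draft_cost, round_overhead),
    )


# Hypothetical operating points: draft step costs 10% of a target pass.
gamma_easy = best_draft_length(alpha=0.8, draft_cost=0.1)
gamma_hard = best_draft_length(alpha=0.5, draft_cost=0.1)
gamma_slow_link = best_draft_length(alpha=0.5, draft_cost=0.1, round_overhead=1.0)
```

Under this model a higher acceptance rate or a larger per-round overhead both push the best draft length up, which is the qualitative behaviour a learned policy would need to reproduce and then improve on by reacting to per-input difficulty and channel state.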

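Increment over the original CoVSpec: Learning-based joint optimization + Online adaptation + Hardware testbed = a system-level improvement

Target venues: IEEE TMC / ACM MobiCom / IEEE INFOCOM

Feasibility: ⭐⭐⭐⭐. Requires hardware deployment, but WISV and ProSemComVLM already demonstrate that it is feasible.

✅ Final Recommendation

Recommended first: a combination of Route 1 and Route 2:

Wireless-aware multi-user CoVSpec (Route 1): the safest option with the greatest differentiation; WISV and ProSemComVLM provide clear baselines and methods to borrow.
Add elastic visual token reduction (part of Route 2): upgrade CoVSpec's single visual reduction method to adaptive multi-primitive compression, a system-level improvement that TMC reviewers are likely to appreciate.
Target journal: IEEE TMC (Transactions on Mobile Computing), the closest match to CoVSpec's device-edge co-inference positioning.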