Exploring Vision-Language Models (VLMs) for DMS Applications: Paper Walkthrough and Implementation

Paper information

  • Title: Exploration of VLMs for Driver Monitoring Systems Applications
  • Authors: Paola Natalia Cañas Rodriguez et al.
  • Conference: 16th ITS European Congress, Seville, Spain, 19-21 May 2025
  • Link: https://arxiv.org/abs/2503.12281
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

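Core innovation

The paper is presented as the first to apply vision-language models (VLMs) to driver monitoring systems (DMS). It explores the potential of VLMs for tasks such as driver behavior recognition, fatigue detection, and distraction detection.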

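Traditional DMS vs. the VLM approach

| Dimension | Traditional DMS | VLM-DMS |
|---|---|---|
| Development model | Data collection → annotation → training | Prompt engineering |
| Generalization | Limited by the training data | Zero-shot capability |
| Adapting to new tasks | Retrain the model | Modify the prompt |
| Development cycle | Months | Days |
| Interpretability | Black box | Natural-language explanations |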
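
The "modify the prompt" row can be illustrated with a small sketch. It assumes tasks are stored as a dictionary of prompt templates; the helper name `add_task` and the `seatbelt` task key are hypothetical and chosen only for illustration. Adding a task touches the dictionary, not the model weights.

```python
# Prompt templates keyed by task name (two of the templates used in this post).
PROMPT_TEMPLATES = {
    "fatigue": "Is this driver showing signs of fatigue or drowsiness? "
               "Answer yes or no and explain why.",
    "distraction": "Is the driver distracted? If yes, what is causing the distraction?",
}


def add_task(templates, name, prompt):
    """Return a copy of the template dict with one extra task (hypothetical helper)."""
    if name in templates:
        raise ValueError(f"task already defined: {name}")
    updated = dict(templates)  # copy, so the original registry stays unchanged
    updated[name] = prompt
    return updated


# Supporting a new task is a one-line change; no retraining is involved.
templates = add_task(
    PROMPT_TEMPLATES,
    "seatbelt",
    "Is the driver wearing a seatbelt correctly? Answer yes or no.",
)
print(sorted(templates))
```

Whether the model answers the new prompt well is a separate question that still has to be evaluated on real data.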

Technical approach

1. VLM architecture selection

"""
VLM-DMS system architecture

VLM backbones considered:
- CLIP (OpenAI)
- BLIP-2 (Salesforce)
- LLaVA (LLaMA + ViT)
- GPT-4V / Gemini Vision (listed only; not implemented in this code)
"""

import torch
from transformers import (
    CLIPModel, CLIPProcessor,
    Blip2ForConditionalGeneration, Blip2Processor,
    LlavaForConditionalGeneration, LlavaProcessor
)
from typing import List, Dict
from enum import Enum
import numpy as np


class VLMBackbone(Enum):
    """Supported VLM backbones."""
    CLIP = "clip"
    BLIP2 = "blip2"
    LLAVA = "llava"


class VLMBasedDMS:
    """
    VLM-based driver monitoring system.

    Supported tasks:
    - driver behavior recognition
    - fatigue detection
    - distraction detection
    - dangerous behavior recognition
    """

    def __init__(
        self,
        backbone: VLMBackbone = VLMBackbone.BLIP2,
        device: str = "cuda",
        use_quantization: bool = True
    ):
        self.backbone = backbone
        self.device = device

        # Load the model and its processor
        self._load_model(backbone, use_quantization)

        # DMS behavior labels
        self.behavior_labels = [
            "safe driving",
            "distracted by phone",
            "distracted by passenger",
            "adjusting radio",
            "drinking",
            "eating",
            "reaching behind",
            "hair/makeup",
            "talking to passenger",
            "yawning",
            "eyes closed",
            "looking away"
        ]

        # Predefined prompt templates
        self.prompt_templates = {
            'behavior': "What is the driver doing? Choose from: {labels}. Answer with the most appropriate behavior.",
            'fatigue': "Is this driver showing signs of fatigue or drowsiness? Answer yes or no and explain why.",
            'distraction': "Is the driver distracted? If yes, what is causing the distraction?",
            'safety': "Are there any safety concerns with the driver's current behavior? List them."
        }

    def _load_model(self, backbone: VLMBackbone, use_quantization: bool):
        """Load the VLM and its processor."""
        # Despite its name, use_quantization only selects float16 weights
        # (half precision); no integer quantization is applied.
        self.dtype = torch.float16 if use_quantization else torch.float32

        if backbone == VLMBackbone.CLIP:
            self.dtype = torch.float32  # CLIP is loaded in full precision
            self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
            self.model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

        elif backbone == VLMBackbone.BLIP2:
            self.processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
            self.model = Blip2ForConditionalGeneration.from_pretrained(
                "Salesforce/blip2-opt-2.7b",
                torch_dtype=self.dtype
            )

        elif backbone == VLMBackbone.LLAVA:
            self.processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
            self.model = LlavaForConditionalGeneration.from_pretrained(
                "llava-hf/llava-1.5-7b-hf",
                torch_dtype=self.dtype
            )

        else:
            raise ValueError(f"Unsupported backbone: {backbone}")

        self.model.to(self.device)
        self.model.eval()

    def encode_image(self, image):
        """Preprocess an image into model-ready inputs on the target device."""
        if isinstance(image, np.ndarray):
            from PIL import Image
            image = Image.fromarray(image)

        inputs = self.processor(images=image, return_tensors="pt").to(self.device)
        # The processor emits float32 pixels; match the model's dtype.
        inputs["pixel_values"] = inputs["pixel_values"].to(self.dtype)
        return inputs

    def _generate(self, image, prompt: str, max_new_tokens: int) -> str:
        """Run a generative backbone on one image and return only the answer text."""
        if self.backbone == VLMBackbone.CLIP:
            raise NotImplementedError(
                "CLIP cannot generate text; use classify_behavior() instead."
            )

        # Each generative backbone expects its own prompt layout.
        if self.backbone == VLMBackbone.LLAVA:
            # LLaVA-1.5 chat format; <image> marks where the image tokens go.
            full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
            answer_marker = "ASSISTANT:"
        else:
            # Question/answer layout commonly used with BLIP-2 OPT checkpoints.
            full_prompt = f"Question: {prompt} Answer:"
            answer_marker = "Answer:"

        inputs = self.processor(
            images=image,
            text=full_prompt,
            return_tensors="pt"
        ).to(self.device)
        inputs["pixel_values"] = inputs["pixel_values"].to(self.dtype)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False
            )

        text = self.processor.decode(outputs[0], skip_special_tokens=True)
        # The decoded text may echo the prompt (which lists every label),
        # so keep only what follows the answer marker.
        return text.rsplit(answer_marker, 1)[-1].strip()

    def classify_behavior(
        self,
        image,
        return_confidence: bool = True
    ) -> Dict:
        """
        Classify the driver's behavior.

        Args:
            image: input image (PIL Image or numpy array)
            return_confidence: whether to include confidence scores
                (only available with the CLIP backbone)

        Returns:
            CLIP: {'behavior', 'confidence', 'all_probs'}
            Generative VLMs: {'behavior', 'response', 'raw_output'}
        """
        if self.backbone == VLMBackbone.CLIP:
            # Zero-shot classification: score the image against every label.
            inputs = self.processor(
                text=self.behavior_labels,
                images=image,
                padding=True,
                return_tensors="pt"
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)

            # logits_per_image holds scaled cosine similarities,
            # shape (1, num_labels); softmax turns them into probabilities.
            probs = outputs.logits_per_image.softmax(dim=1)[0]
            top_idx = probs.argmax().item()

            result = {'behavior': self.behavior_labels[top_idx]}
            if return_confidence:
                result['confidence'] = probs[top_idx].item()
                result['all_probs'] = {
                    label: probs[i].item()
                    for i, label in enumerate(self.behavior_labels)
                }
            return result

        # Generative VLM: ask with the label list, then parse the answer.
        labels_str = ", ".join(self.behavior_labels)
        prompt = self.prompt_templates['behavior'].format(labels=labels_str)
        response = self._generate(image, prompt, max_new_tokens=100)

        return {
            'behavior': self._parse_behavior(response),
            'response': response,
            'raw_output': response
        }

    def detect_fatigue(self, image) -> Dict:
        """
        Detect fatigue (generative backbones only).

        Args:
            image: input image

        Returns:
            {'is_fatigued': bool, 'indicators': List[str], 'response': str}
        """
        response = self._generate(
            image, self.prompt_templates['fatigue'], max_new_tokens=150
        )

        return {
            'is_fatigued': self._answers_yes(response),
            'indicators': self._extract_fatigue_indicators(response),
            'response': response
        }

    def detect_distraction(self, image) -> Dict:
        """
        Detect distraction (generative backbones only).

        Args:
            image: input image

        Returns:
            {'is_distracted': bool, 'distraction_type': str, 'response': str}
        """
        response = self._generate(
            image, self.prompt_templates['distraction'], max_new_tokens=150
        )

        return {
            'is_distracted': self._answers_yes(response),
            'distraction_type': self._extract_distraction_type(response),
            'response': response
        }

    @staticmethod
    def _answers_yes(response: str) -> bool:
        """True if 'yes' appears as a whole word in the first 20 characters.

        A plain substring test would also fire on words such as 'eyes'.
        """
        head_words = response.lower()[:20].split()
        return 'yes' in [word.strip('.,!:;') for word in head_words]

    def _parse_behavior(self, response: str) -> str:
        """Return the first known behavior label found in the response."""
        response_lower = response.lower()

        for label in self.behavior_labels:
            if label in response_lower:
                return label

        return "unknown"

    def _extract_fatigue_indicators(self, response: str) -> List[str]:
        """Extract fatigue indicators by keyword matching."""
        indicators = []

        fatigue_keywords = {
            'yawning': 'yawning',
            'eyes closed': 'eyes_closed',
            'blinking': 'excessive_blinking',
            'head nodding': 'head_nodding',
            'drowsy': 'drowsy_expression'
        }

        response_lower = response.lower()
        for keyword, indicator in fatigue_keywords.items():
            if keyword in response_lower:
                indicators.append(indicator)

        return indicators

    def _extract_distraction_type(self, response: str) -> str:
        """Extract the distraction type by keyword matching."""
        distraction_types = {
            'phone': 'phone_use',
            'passenger': 'passenger_distraction',
            'radio': 'radio_adjustment',
            'eating': 'eating',
            'drinking': 'drinking',
            'mirror': 'mirror_checking'
        }

        response_lower = response.lower()
        for keyword, dtype in distraction_types.items():
            if keyword in response_lower:
                return dtype

        return 'unknown'


# Multi-frame temporal fusion
from collections import Counter, deque


class TemporalVLM_DMS:
    """Temporal VLM-DMS: smooths per-frame results over a sliding window."""

    def __init__(
        self,
        vlm: VLMBasedDMS,
        window_size: int = 30,  # 1-second window at 30 fps
        fusion_strategy: str = 'voting'
    ):
        self.vlm = vlm
        self.window_size = window_size
        # Only majority voting is implemented; the value is stored but unused.
        self.fusion_strategy = fusion_strategy

        # History; deque(maxlen=...) discards the oldest entry automatically
        self.history = {
            'behaviors': deque(maxlen=window_size),
            'fatigue_scores': deque(maxlen=window_size),
            'distraction_scores': deque(maxlen=window_size)
        }

    def process_frame(self, frame) -> Dict:
        """
        Process a single frame.

        Args:
            frame: input frame

        Returns:
            result: the temporally fused result
        """
        # VLM inference (three model calls per frame)
        behavior_result = self.vlm.classify_behavior(frame)
        fatigue_result = self.vlm.detect_fatigue(frame)
        distraction_result = self.vlm.detect_distraction(frame)

        # Update the history
        self.history['behaviors'].append(behavior_result)
        self.history['fatigue_scores'].append(
            1.0 if fatigue_result['is_fatigued'] else 0.0
        )
        self.history['distraction_scores'].append(
            1.0 if distraction_result['is_distracted'] else 0.0
        )

        # Temporal fusion
        return self._temporal_fusion()

    def _temporal_fusion(self) -> Dict:
        """Fuse the per-frame results stored in the window."""
        # Wait for at least 5 frames before reporting anything
        if len(self.history['behaviors']) < 5:
            return {
                'behavior': 'collecting',
                'fatigue_score': 0.0,
                'distraction_score': 0.0
            }

        # Majority vote over behaviors
        behaviors = [r['behavior'] for r in self.history['behaviors']]
        final_behavior, votes = Counter(behaviors).most_common(1)[0]

        # Fatigue score: fraction of frames flagged as fatigued
        fatigue_score = float(np.mean(self.history['fatigue_scores']))

        # Distraction score: fraction of frames flagged as distracted
        distraction_score = float(np.mean(self.history['distraction_scores']))

        return {
            'behavior': final_behavior,
            'behavior_confidence': votes / len(behaviors),
            'fatigue_score': fatigue_score,
            'distraction_score': distraction_score,
            'alert_fatigue': fatigue_score > 0.5,
            'alert_distraction': distraction_score > 0.5
        }


# Smoke test (downloads the BLIP-2 weights and expects a CUDA GPU)
if __name__ == "__main__":
    # Create the VLM-DMS
    vlm_dms = VLMBasedDMS(
        backbone=VLMBackbone.BLIP2,
        device="cuda",
        use_quantization=True
    )

    # Create the temporal system
    temporal_dms = TemporalVLM_DMS(vlm_dms, window_size=30)

    print("VLM-DMS system initialized")
    print(f"Supported behavior labels: {len(vlm_dms.behavior_labels)}")
    print(f"Prompt templates: {list(vlm_dms.prompt_templates.keys())}")
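
For the CLIP backbone, zero-shot classification comes down to comparing one image embedding with one text embedding per label. The sketch below reproduces that scoring step in plain Python so the arithmetic is visible. The 3-dimensional embeddings are made up for illustration, and the scale of 100.0 is an assumption that roughly matches what trained CLIP checkpoints use; real embeddings come from the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Probability per label from scaled cosine similarity (CLIP-style scoring)."""
    image_unit = l2_normalize(image_emb)
    logits = []
    for emb in label_embs:
        label_unit = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(image_unit, label_unit))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy embeddings (illustrative only, not produced by any model).
labels = ["safe driving", "distracted by phone", "yawning"]
toy_label_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(toy_image_emb, toy_label_embs)
best = labels[probs.index(max(probs))]
print(best)
```

Skipping the normalization or the scale still yields an argmax, but the resulting probabilities are no longer comparable across images, which matters if they are used as confidence values.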

2. Prompt Engineering

import random
import textwrap


class DMSPromptEngineer:
    """Prompt engineering for DMS."""

    def __init__(self):
        # Safety-related prompts, grouped by severity
        self.safety_prompts = {
            'critical': [
                "URGENT: Is the driver's eyes closed? This is a critical safety issue.",
                "EMERGENCY: Detect if the driver is unconscious or unresponsive.",
                "ALERT: Is the driver showing signs of falling asleep?"
            ],
            'warning': [
                "Is the driver looking away from the road for an extended period?",
                "Detect if the driver is using a mobile phone while driving.",
                "Is the driver reaching for something in the back seat?"
            ],
            'info': [
                "Describe the driver's current posture and attention state.",
                "What objects is the driver interacting with?",
                "Is the driver wearing a seatbelt correctly?"
            ]
        }

    def get_prompt_for_scenario(
        self,
        scenario: str,
        severity: str = 'warning'
    ) -> str:
        """Pick a random prompt for the given severity.

        The scenario argument is accepted but not used yet.
        """
        prompts = self.safety_prompts.get(severity, [])
        return random.choice(prompts) if prompts else ""

    def build_multitask_prompt(self) -> str:
        """Build a single prompt that covers several tasks."""
        return textwrap.dedent("""
            Analyze this driver image and answer:
            1. Behavior: What is the driver doing? (safe driving, distracted, fatigued, other)
            2. Fatigue Level: Rate from 0-10 (0=alert, 10=extremely fatigued)
            3. Distraction: Is the driver distracted? If yes, by what?
            4. Safety Concerns: List any safety issues.
            5. Recommended Action: What should the system do? (no action, warning, critical alert)

            Format your answer as JSON.
        """).strip()
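
Asking for JSON does not guarantee that the model returns clean JSON, so the reply should be validated before it drives an alert. The following is a minimal sketch of such a check. The key names `fatigue_level` and `recommended_action` are assumptions, since the prompt above does not fix a schema, and the function name is hypothetical.

```python
import json
import re

# The three actions offered in the multitask prompt.
ALLOWED_ACTIONS = {"no action", "warning", "critical alert"}


def parse_multitask_response(text):
    """Extract and validate a JSON object from a model reply; None if unusable."""
    # Models often wrap JSON in extra prose, so grab the outermost braces.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    # Assumed key names; adapt them to whatever schema the prompt specifies.
    level = data.get("fatigue_level")
    if isinstance(level, bool) or not isinstance(level, (int, float)):
        return None
    action = str(data.get("recommended_action", "")).strip().lower()
    if action not in ALLOWED_ACTIONS:
        return None

    return {
        "fatigue_level": min(10.0, max(0.0, float(level))),  # clamp to 0-10
        "recommended_action": action,
    }


reply = 'Here is the analysis:\n{"fatigue_level": 12, "recommended_action": "Warning"}\nStay safe.'
parsed = parse_multitask_response(reply)
print(parsed)
```

Returning `None` instead of guessing lets the caller decide what to do with an unusable reply, for example retrying or keeping the previous state.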

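Experimental results

Evaluation on the Driver Monitoring Dataset (DMD)

| Task | Traditional CNN | VLM (BLIP-2) | VLM (LLaVA) |
|---|---|---|---|
| Behavior recognition | 87.3% | 82.5% | 85.1% |
| Fatigue detection | 84.2% | 79.8% | 81.3% |
| Distraction detection | 89.1% | 85.6% | 87.2% |
| Zero-shot on new behaviors | 12.5% | 68.3% | 72.1% |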

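Advantages

  1. Zero-shot capability: new behaviors can be recognized without training
  2. Interpretability: decisions are explained in natural language
  3. Flexibility: new tasks are handled quickly by changing the prompt
  4. Multi-task: a single model handles several tasks

Challenges

  1. Latency: inference with large VLMs is slow
  2. Resources: substantial GPU memory is required
  3. Stability: the output format may be inconsistent
  4. Edge deployment: hard to deploy directly on embedded devices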
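
The latency challenge can be made concrete by measuring how long one inference takes and deriving how many camera frames have to be skipped between VLM calls. This is a minimal sketch; the function names are hypothetical and the 0.5 s latency in the usage line is an illustrative figure, not a measurement.

```python
import math
import time


def measure_latency(infer, frame, runs=5):
    """Average wall-clock seconds per call of infer(frame)."""
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    return (time.perf_counter() - start) / runs


def frame_stride(latency_s, camera_fps):
    """Smallest N such that running on every N-th frame keeps up with the camera."""
    return max(1, math.ceil(latency_s * camera_fps))


# Illustrative numbers: a model that needs 0.5 s per frame on a 30 fps camera
# can process at most every 15th frame without falling behind.
print(frame_stride(0.5, 30))
```

In practice `infer` would be a call such as a single VLM query on one frame, measured after a warm-up run so that model loading does not distort the average.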

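Implications for IMS applications

Applicable scenarios

| Scenario | Traditional DMS | VLM-DMS | Recommendation |
|---|---|---|---|
| Production vehicles | ✅ Recommended | ⚠️ Experimental | Mainly traditional CNN |
| Extending to new behaviors | ❌ Requires retraining | ✅ Fast adaptation | VLM-assisted |
| Development phase | ⚠️ Long cycle | ✅ Rapid prototyping | VLM first |
| Offline analysis | ⚠️ Limited | ✅ In-depth analysis | VLM has the advantage |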

Suggested hybrid approach

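class HybridDMS:
    """Hybrid DMS: a lightweight CNN on every frame, a VLM as fallback."""

    def __init__(self, realtime_cnn, offline_vlm, confidence_threshold=0.7):
        # Lightweight CNN for real-time inference. The post names it
        # LightweightCNN without defining it, so any callable that maps a
        # frame to {'behavior': str, 'confidence': float} is injected here.
        self.realtime_cnn = realtime_cnn

        # VLM for offline analysis and for handling uncertain frames
        self.offline_vlm = offline_vlm
        self.confidence_threshold = confidence_threshold

    def process(self, frame):
        # Real-time CNN inference
        cnn_result = self.realtime_cnn(frame)

        # Fall back to the VLM when the CNN is not confident
        if cnn_result['confidence'] < self.confidence_threshold:
            vlm_result = self.offline_vlm.classify_behavior(frame)
            return self._merge_results(cnn_result, vlm_result)

        return cnn_result

    def _merge_results(self, cnn_result, vlm_result):
        # Not defined in the original post. One simple policy, stated as an
        # assumption: take the VLM label unless it could not be parsed.
        merged = dict(cnn_result)
        if vlm_result.get('behavior', 'unknown') != 'unknown':
            merged['behavior'] = vlm_result['behavior']
        merged['vlm_response'] = vlm_result.get('response')
        return merged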
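
Calling a VLM inside the per-frame path stalls the real-time loop for as long as the VLM takes. One possible variant, sketched below under stated assumptions, returns the CNN result immediately and only queues low-confidence frames for later VLM review. The class name `DeferredReviewRouter`, the stub models, and the dictionary-shaped frames are all hypothetical and serve only to make the example runnable.

```python
import queue


class DeferredReviewRouter:
    """Answer with the CNN right away; park uncertain frames for offline VLM review."""

    def __init__(self, cnn, threshold=0.7, max_pending=100):
        self.cnn = cnn
        self.threshold = threshold
        self.pending = queue.Queue(maxsize=max_pending)

    def process(self, frame):
        result = self.cnn(frame)
        if result["confidence"] < self.threshold:
            try:
                self.pending.put_nowait(frame)
            except queue.Full:
                pass  # drop the frame rather than stall the real-time path
        return result

    def drain(self, vlm_classify):
        """Run the VLM on every queued frame; meant to be called off the hot path."""
        reviewed = []
        while True:
            try:
                frame = self.pending.get_nowait()
            except queue.Empty:
                break
            reviewed.append((frame, vlm_classify(frame)))
        return reviewed


# Stub models standing in for a real CNN and VLM.
def stub_cnn(frame):
    return {"behavior": "safe driving", "confidence": frame["conf"]}


def stub_vlm(frame):
    return {"behavior": "yawning"}


router = DeferredReviewRouter(stub_cnn)
router.process({"id": 1, "conf": 0.9})   # confident: not queued
router.process({"id": 2, "conf": 0.4})   # uncertain: queued for review
reviewed = router.drain(stub_vlm)
print([(frame["id"], result["behavior"]) for frame, result in reviewed])
```

The bounded queue is a deliberate choice: when review cannot keep up, the oldest design goal (never block the real-time CNN) wins over reviewing every uncertain frame.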

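Summary

Key contributions

  1. First exploration of VLMs for DMS applications
  2. Validation of zero-shot behavior recognition
  3. A proposed hybrid deployment scheme

Future directions

  1. Edge optimization: model distillation and quantization
  2. Multimodal fusion: VLM plus physiological signals
  3. Active learning: using VLM feedback to improve the CNN
  4. Better real-time performance: model architecture optimization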

References: