Transformer-Based SOTA Drowsiness Detection: ViT/Swin Architectures Reach 99.15% Accuracy with Real-Time Deployment

Paper information

Title: Real-time driver drowsiness detection using transformer architectures: a novel deep learning approach
Journal: Scientific Reports (Nature Portfolio)
Published: 2025
DOI: 10.1038/s41598-025-02111-x
Study type: deep learning algorithm research

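Key contributions

This study systematically applies Vision Transformer (ViT) and Swin Transformer to driver drowsiness detection, reaching 99.15% accuracy on the MRL dataset and outperforming conventional CNN architectures. The key contributions are: (1) it shows that the Transformer's global attention captures long-range dependencies among eye features, addressing the limited local receptive field of CNNs; (2) it proposes an interpretability scheme based on CAM (Class Activation Mapping) to meet the trust requirements of in-vehicle systems; (3) it runs real-time inference on NVIDIA Jetson platforms, with latency below 25 ms on the Jetson AGX Orin (ViT-Tiny on the Jetson Nano is reported at 45 ms).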

Method details

1. Overall architecture

┌─────────────────────────────────────────────────────────┐
│                   Input preprocessing                   │
│  Image size: 224×224 | Scale: [0,1] | Aug: flip/rotate  │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                ViT / Swin Transformer                   │
├─────────────────────────────────────────────────────────┤
│  ViT: Patch Embedding → Transformer Encoder → MLP Head  │
│  Swin: Patch Partition → Stage×4 → Classification Head  │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                  Classification output                  │
│             Open-Eyes / Close-Eyes (binary)             │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                   Drowsiness scoring                    │
│  PERCLOS-style score ≥ 15 frames → trigger alarm        │
└─────────────────────────────────────────────────────────┘

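2. Vision Transformer (ViT) architecture

2.1 Patch Embedding

The input image is split into fixed-size patches, each patch is linearly projected, a class token is prepended, and position embeddings are added:

$$\mathbf{z}_0 = [\mathbf{x}_{class}; \mathbf{x}_p^1 E; \mathbf{x}_p^2 E; \cdots; \mathbf{x}_p^N E] + \mathbf{E}_{pos}$$

where:

$\mathbf{x}_p^i \in \mathbb{R}^{P^2 \cdot C}$: the $i$-th flattened patch ($P=16$, $C=3$)
$E \in \mathbb{R}^{(P^2 \cdot C) \times D}$: linear projection matrix
$\mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D}$: position embedding
$N = HW/P^2 = 196$: number of patches for a $224 \times 224$ input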
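The shapes in the patch-embedding formula can be checked with a few lines of NumPy. This is a minimal sketch, not the paper's code: `patch_embed` is a hypothetical helper, and the projection `E`, class token and position embedding are random or zero stand-ins for parameters that would normally be learned.

```python
import numpy as np

def patch_embed(image, patch=16, dim=768, seed=0):
    """Split an (H, W, C) image into flattened patches and build z_0."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    gh, gw = h // patch, w // patch  # patches per column / per row
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C)
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)
    # Stand-ins for the learned parameters E, x_class and E_pos
    E = rng.normal(0.0, 0.02, size=(patch * patch * c, dim))
    x_class = np.zeros((1, dim))
    E_pos = rng.normal(0.0, 0.02, size=(gh * gw + 1, dim))
    z0 = np.concatenate([x_class, patches @ E], axis=0) + E_pos
    return patches, z0

image = np.random.default_rng(1).random((224, 224, 3))
patches, z0 = patch_embed(image)
print(patches.shape, z0.shape)  # (196, 768) (197, 768)
```

With $P=16$ and $C=3$ each flattened patch has $16 \cdot 16 \cdot 3 = 768$ values, which happens to equal the ViT-Base width $D=768$; the extra row in `z0` is the class token.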

2.2 Transformer Encoder

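Each layer consists of multi-head self-attention (MSA) followed by an MLP, each preceded by LayerNorm (LN) and wrapped in a residual connection:

$$\mathbf{z}'_l = \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}$$

$$\mathbf{z}_l = \text{MLP}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l$$

Multi-head attention is computed as: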

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{QK}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
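The attention formula translates almost directly into NumPy. The sketch below uses random tensors in place of real queries, keys and values; the sizes (12 heads, 197 tokens, 64 dimensions per head) are chosen to match ViT-Base on a 224×224 input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (heads, tokens, d_k). Returns (output, attention weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
heads, tokens, d_k = 12, 197, 64  # ViT-Base: D = 768 = 12 heads x 64
Q, K, V = (rng.normal(size=(heads, tokens, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (12, 197, 64) (12, 197, 197)
```

Each row of `weights` sums to 1, so every output token is a convex combination of all value vectors. This is the "global" receptive field the post contrasts with CNNs.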

2.3 ViT architecture diagram

Input Image (224×224×3)
        ↓
    Patchify (16×16)
        ↓
    Linear Projection
        ↓
    Add Position Embedding
        ↓
┌───────────────────────────┐
│   Transformer Encoder×12  │
├───────────────────────────┤
│   Layer Norm              │
│         ↓                 │
│   Multi-Head Attention    │ ←── residual connection
│   (Heads=12, D=768)       │
│         ↓                 │
│   Layer Norm              │
│         ↓                 │
│   MLP (GELU)              │ ←── residual connection
│   (768→3072→768)          │
└───────────────────────────┘
        ↓
    MLP Head
    (768→2)
        ↓
    Softmax
    (Open/Close)

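3. Swin Transformer architecture

3.1 Hierarchical design

Swin Transformer uses a four-stage hierarchy; the values below are the Swin-Tiny configuration:

Stage	Resolution	Dim	Layers	Heads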
1	56×56	96	2	3
2	28×28	192	2	6
3	14×14	384	6	12
4	7×7	768	2	24
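The table follows from three numbers: the 4×4 patch size, the base embedding dimension of 96, and the fact that each later stage halves the resolution and doubles the channel width. A small sketch (the helper name `swin_stage_table` is illustrative) reproduces it:

```python
def swin_stage_table(img_size=224, patch_size=4, embed_dim=96,
                     depths=(2, 2, 6, 2), heads=(3, 6, 12, 24)):
    """Derive per-stage resolution and width for a Swin-style hierarchy."""
    rows = []
    for i, (depth, n_heads) in enumerate(zip(depths, heads)):
        side = img_size // (patch_size * 2 ** i)  # patch merging halves each side
        rows.append({
            "stage": i + 1,
            "resolution": (side, side),
            "dim": embed_dim * 2 ** i,            # channels double per stage
            "depth": depth,
            "heads": n_heads,
        })
    return rows

for row in swin_stage_table():
    print(row)
```

The per-head dimension `dim / heads` stays at 32 in every stage, which is why the head count doubles together with the width.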

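3.2 Window attention

Attention is computed within local windows. For a fixed window size this reduces the cost from quadratic to linear in the number of tokens $N$:

$$\text{Complexity} = O(N) \quad \text{vs.} \quad O(N^2) \text{ (global)}$$

Window size: $M = 7$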
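To make the linear-versus-quadratic claim concrete, the sketch below evaluates the cost expressions given in the Swin Transformer paper for an $h \times w$ feature map with $C$ channels: $4hwC^2 + 2(hw)^2C$ for global MSA and $4hwC^2 + 2M^2hwC$ for window attention. The stage sizes are the Swin-Tiny values; the function names are illustrative.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention over h*w tokens of width C."""
    n = h * w
    return 4 * n * C**2 + 2 * n**2 * C

def wmsa_flops(h, w, C, M=7):
    """Window attention: each token attends only to the M*M tokens in its window."""
    n = h * w
    return 4 * n * C**2 + 2 * M**2 * n * C

for side, C in [(56, 96), (28, 192), (14, 384), (7, 768)]:
    global_cost = msa_flops(side, side, C)
    window_cost = wmsa_flops(side, side, C)
    print(f"{side}x{side}, C={C}: global/window = {global_cost / window_cost:.2f}")
```

The saving is largest in the first stage, where the token count is highest, and disappears in the last stage, where the 7×7 feature map fits in a single window.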

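3.3 Shifted Window Attention

Consecutive blocks within a stage alternate between regular and shifted windows (shown schematically):

Block l:     ┌───┬───┐
             │ A │ B │   regular windows
             ├───┼───┤
             │ C │ D │
             └───┴───┘

Block l+1:   ┌───┬───┐
             │ B │ A │   shifted windows (shift = M//2)
             ├───┼───┤
             │ D │ C │
             └───┴───┘

Shifting the window grid by M//2 places tokens from neighboring windows into the same window in the next block, which is how information flows across window boundaries.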
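The mixing effect of the shift can be checked numerically. In this sketch each token of a 14×14 grid is labeled with the regular window it belongs to; the grid is then cyclically shifted with `np.roll` and partitioned again. The grid size and helper name are illustrative choices.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) grid into non-overlapping M x M windows: (num_windows, M, M)."""
    H, W = x.shape
    x = x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3)
    return x.reshape(-1, M, M)

M = 7
H = W = 14  # stage-3 resolution: a 2 x 2 grid of windows
# Label every token with the index (0..3) of its regular window
regular_id = (np.arange(H)[:, None] // M) * (W // M) + (np.arange(W)[None, :] // M)

regular = window_partition(regular_id, M)
# Cyclic shift by M // 2 toward the top-left, then partition again
rolled = np.roll(regular_id, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted = window_partition(rolled, M)

print([len(np.unique(win)) for win in regular])  # [1, 1, 1, 1]
print([len(np.unique(win)) for win in shifted])  # [4, 4, 4, 4]
```

After the shift every window contains tokens from all four original windows. In the actual Swin implementation, tokens that only became neighbors through the wrap-around are masked out of each other's attention; that mask is omitted here.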

4. Drowsiness detection pipeline

┌─────────────────────────────────────────────────────────┐
│         Real-time drowsiness detection pipeline         │
└─────────────────────────────────────────────────────────┘

Step 1: Face detection (Haar Cascade)
        ↓
Step 2: Eye ROI extraction
        ├── Left eye (x1,y1,x2,y2)
        └── Right eye (x1,y1,x2,y2)
        ↓
Step 3: Eye image preprocessing
        ├── Resize (224×224)
        ├── Grayscale → RGB conversion
        └── Normalize to [0,1]
        ↓
Step 4: Transformer inference
        ├── Left-eye state prediction
        └── Right-eye state prediction
        ↓
Step 5: Drowsiness score update
        if eyes_closed:
            score += 1
        else:
            score -= 1
        score = max(0, score)
        ↓
Step 6: Alarm trigger
        if score >= 15:  # threshold in frames
            trigger_alarm()
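The score in Steps 5 and 6 is a frame counter, whereas PERCLOS is commonly defined as the fraction of time the eyes are closed over a sliding window. The sketch below implements both so they can be compared on the same sequence. `PerclosMeter`, `run_counter` and the 30-frame window are illustrative assumptions, not taken from the paper.

```python
from collections import deque

class PerclosMeter:
    """Fraction of closed-eye frames over the most recent `window` frames."""

    def __init__(self, window=90):
        self.frames = deque(maxlen=window)

    def update(self, eyes_closed):
        self.frames.append(1 if eyes_closed else 0)
        return sum(self.frames) / len(self.frames)

def run_counter(states, threshold=15):
    """Frame counter from Steps 5-6: index of the frame that fires the alarm, or None."""
    score = 0
    for i, closed in enumerate(states):
        score = max(0, score + (1 if closed else -1))
        if score >= threshold:
            return i
    return None

# 30 open frames followed by 20 closed frames
states = [False] * 30 + [True] * 20
meter = PerclosMeter(window=30)
for s in states:
    perclos = meter.update(s)

print(run_counter(states))  # 44: the 15th consecutive closed frame
print(round(perclos, 3))    # 0.667: 20 of the last 30 frames were closed
```

Because the counter decrements on every open frame, normal blinking never accumulates: `run_counter([True, False] * 50)` returns `None`.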

5. Data augmentation strategy

# Training-time data augmentation
augmentation_pipeline = {
    'horizontal_flip': 0.5,      # probability of a horizontal flip
    'rotation': 15,              # rotation range in degrees
    'brightness': [0.8, 1.2],    # brightness factor range
    'contrast': [0.8, 1.2],      # contrast factor range
    'shift_scale_rotate': {
        'shift_limit': 0.1,
        'scale_limit': 0.1,
        'rotate_limit': 15
    }
}

Code reproduction

Environment setup

# Import dependencies
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image
import cv2
import numpy as np
import timm  # PyTorch Image Models

ViT model implementation

class ViTDrowsinessDetector(nn.Module):
    """Vision Transformer drowsiness detector."""

    def __init__(self, model_name='vit_base_patch16_224', num_classes=2, pretrained=True):
        super().__init__()

        # Load the pretrained ViT backbone
        self.backbone = timm.create_model(
            model_name,
            pretrained=pretrained,
            num_classes=0  # drop the original classification head
        )

        # Feature dimension of the pooled backbone output
        self.feature_dim = self.backbone.num_features

        # Custom classification head
        self.classifier = nn.Sequential(
            nn.LayerNorm(self.feature_dim),
            nn.Linear(self.feature_dim, 512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

        # Initialize the new head's weights
        self._init_weights()

    def _init_weights(self):
        for m in self.classifier.modules():
            if isinstance(m, nn.Linear):
                nn.init.trunc_normal_(m.weight, std=0.02)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x, return_attention=False):
        """
        Args:
            x: (B, 3, 224, 224) input images
            return_attention: whether to also return the attention heatmap
        """
        # Extract pooled features
        features = self.backbone(x)  # (B, 768)

        # Classify
        logits = self.classifier(features)

        if return_attention:
            # Attention heatmap, used for interpretability
            attention = self._get_attention_map(x)
            return logits, attention

        return logits
    
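    def _get_attention_map(self, x):
        """Extract a CLS-token attention heatmap from the last block.

        timm's attention module returns only its output tensor, not the
        attention weights, so the weights are recomputed from the qkv projection.
        """
        attn_module = self.backbone.blocks[-1].attn
        captured = []

        # Capture the output of the qkv projection: (B, N+1, 3*D)
        hook = attn_module.qkv.register_forward_hook(
            lambda module, inputs, output: captured.append(output)
        )
        try:
            with torch.no_grad():
                _ = self.backbone(x)
        finally:
            hook.remove()

        qkv = captured[-1]
        B, T, _ = qkv.shape
        heads = attn_module.num_heads
        head_dim = qkv.shape[-1] // (3 * heads)
        qkv = qkv.reshape(B, T, 3, heads, head_dim).permute(2, 0, 3, 1, 4)
        q, k = qkv[0], qkv[1]  # each (B, heads, N+1, head_dim)

        # Apply the optional q/k norms if this timm version defines them
        q = getattr(attn_module, 'q_norm', nn.Identity())(q)
        k = getattr(attn_module, 'k_norm', nn.Identity())(k)

        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
        attn = attn.softmax(dim=-1)  # (B, heads, N+1, N+1)

        # Attention from the CLS token to every patch, averaged over heads
        attn = attn[:, :, 0, 1:].mean(dim=1)  # (B, N)

        # Reshape to the 2D patch grid (14x14 for a 224x224 input)
        side = int(attn.size(1) ** 0.5)
        attn = attn.reshape(B, side, side)

        # Upsample to the input resolution
        attn = F.interpolate(
            attn.unsqueeze(1),
            size=x.shape[-2:],
            mode='bilinear',
            align_corners=False
        )

        return attn.squeeze(1)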


class SwinDrowsinessDetector(nn.Module):
    """Swin Transformer drowsiness detector."""
    
    def __init__(self, model_name='swin_tiny_patch4_window7_224', num_classes=2, pretrained=True):
        super().__init__()
        
        # Load the pretrained Swin backbone
        self.backbone = timm.create_model(
            model_name,
            pretrained=pretrained,
            num_classes=0
        )
        
        self.feature_dim = self.backbone.num_features
        
        self.classifier = nn.Sequential(
            nn.LayerNorm(self.feature_dim),
            nn.Linear(self.feature_dim, 256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_classes)
        )
        
    def forward(self, x):
        features = self.backbone(x)
        return self.classifier(features)

Real-time detection system

class RealTimeDrowsinessSystem:
    """Real-time drowsiness detection system."""

    def __init__(self, model_path, device='cuda'):
        self.device = device

        # Load the trained model
        self.model = ViTDrowsinessDetector(
            model_name='vit_base_patch16_224',
            num_classes=2,
            pretrained=False
        )
        self.model.load_state_dict(torch.load(model_path, map_location=device))
        self.model.to(device)
        self.model.eval()

        # Load the Haar cascade face and eye detectors
        self.face_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
        )
        self.eye_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + 'haarcascade_eye.xml'
        )

        # Preprocessing (must match the validation transform used in training)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])

        # Drowsiness score state
        self.drowsiness_score = 0
        self.score_threshold = 15
        self.history = []
        
    def detect_eyes(self, frame):
        """Detect faces and the eye regions inside each face."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Detect faces (scaleFactor=1.3, minNeighbors=5)
        faces = self.face_cascade.detectMultiScale(gray, 1.3, 5)

        eye_regions = []
        for (x, y, w, h) in faces:
            roi_gray = gray[y:y+h, x:x+w]
            roi_color = frame[y:y+h, x:x+w]

            # Detect eyes within the face ROI
            eyes = self.eye_cascade.detectMultiScale(roi_gray)

            for (ex, ey, ew, eh) in eyes:
                eye_img = roi_color[ey:ey+eh, ex:ex+ew]
                eye_regions.append({
                    'image': eye_img,
                    'bbox': (x+ex, y+ey, ew, eh)  # in full-frame coordinates
                })

        return eye_regions, faces
    
    def predict_eye_state(self, eye_img):
        """Predict the eye state; returns True if the eye is closed."""
        try:
            # Convert the BGR crop to an RGB PIL image
            eye_pil = Image.fromarray(cv2.cvtColor(eye_img, cv2.COLOR_BGR2RGB))

            # Preprocess
            eye_tensor = self.transform(eye_pil).unsqueeze(0).to(self.device)

            # Inference
            with torch.no_grad():
                output = self.model(eye_tensor)
                prob = F.softmax(output, dim=1)
                pred = torch.argmax(prob, dim=1).item()

            # pred: 0=Close, 1=Open (assumes ImageFolder's alphabetical class order)
            return pred == 0  # True means the eye is closed

        except Exception as e:
            print(f"Prediction error: {e}")
            return False
    
    def update_score(self, eyes_closed):
        """Update the drowsiness score for the current frame."""
        if eyes_closed:
            self.drowsiness_score += 1
        else:
            self.drowsiness_score = max(0, self.drowsiness_score - 1)

        self.history.append(self.drowsiness_score)

        # Keep only the most recent 100 frames of history
        if len(self.history) > 100:
            self.history.pop(0)

    def check_drowsiness(self):
        """Return True once the score reaches the alarm threshold."""
        return self.drowsiness_score >= self.score_threshold
    
    def run(self, video_source=0):
        """Main loop: read frames, classify eyes, update the score, display."""
        cap = cv2.VideoCapture(video_source)

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Detect eyes
            eye_regions, faces = self.detect_eyes(frame)

            # Draw face boxes
            for (x, y, w, h) in faces:
                cv2.rectangle(frame, (x, y), (x+w, y+h), (255, 0, 0), 2)

            # Predict the state of each detected eye
            all_closed = True
            for eye in eye_regions:
                is_closed = self.predict_eye_state(eye['image'])

                # Draw eye boxes: red if closed, green if open
                ex, ey, ew, eh = eye['bbox']
                color = (0, 0, 255) if is_closed else (0, 255, 0)
                cv2.rectangle(frame, (ex, ey), (ex+ew, ey+eh), color, 2)

                if not is_closed:
                    all_closed = False

            # Update the score only when at least one eye was detected
            if len(eye_regions) > 0:
                self.update_score(all_closed)

            # Show the current score
            cv2.putText(frame, f'Score: {self.drowsiness_score}', (10, 30),
                       cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

            # Check for drowsiness
            if self.check_drowsiness():
                cv2.putText(frame, 'DROWSY ALERT!', (10, 70),
                           cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 255), 3)
                # Trigger the alarm
                self._trigger_alarm()

            cv2.imshow('Drowsiness Detection', frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

    def _trigger_alarm(self):
        """Trigger the alarm."""
        # A sound or vibration alert can be integrated here
        print("ALERT: Drowsy driver detected!")


# Training script
def train_vit_drowsiness(train_dir, val_dir, epochs=30, batch_size=32):
    """Train the ViT drowsiness detection model."""

    # Data loading
    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    
    val_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    
    from torchvision.datasets import ImageFolder
    train_dataset = ImageFolder(train_dir, transform=train_transform)
    val_dataset = ImageFolder(val_dir, transform=val_transform)
    
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=4
    )
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, num_workers=4
    )
    
    # Model initialization
    model = ViTDrowsinessDetector(
        model_name='vit_base_patch16_224',
        num_classes=2,
        pretrained=True
    )
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    
    # Loss and optimizer (lower learning rate for the pretrained backbone)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.parameters(), 'lr': 1e-5},
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ], weight_decay=1e-4)
    
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2
    )
    
    # Training loop
    best_acc = 0
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        correct = 0
        total = 0
        
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
        
        # Validation
        model.eval()
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = outputs.max(1)
                val_total += labels.size(0)
                val_correct += predicted.eq(labels).sum().item()
        
        train_acc = correct / total
        val_acc = val_correct / val_total
        
        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), 'best_vit_drowsiness.pth')
        
        print(f'Epoch {epoch+1}/{epochs}: '
              f'Train Loss={train_loss/len(train_loader):.4f}, '
              f'Train Acc={train_acc:.4f}, '
              f'Val Acc={val_acc:.4f}')
        
        scheduler.step()
    
    return model

Experimental results

1. Dataset statistics

Dataset	Total samples	Open-Eyes	Close-Eyes	Resolution	Conditions
MRL Eye	84,898	42,952	41,946	Various	Varied lighting
NTHU-DDD	66,521	30,491	36,030	640×480	Day/night
CEW	27,200	-	-	Various	In the wild

2. Model performance comparison

Model	Architecture	Parameters	MRL accuracy	NTHU accuracy	CEW accuracy	Average
VGG19	CNN	143M	98.7%	96.5%	94.2%	96.5%
ResNet50V2	CNN	25.6M	97.3%	95.8%	93.7%	95.6%
DenseNet169	CNN	14.1M	96.8%	94.2%	92.1%	94.4%
MobileNetV3	CNN	5.4M	94.5%	92.3%	89.6%	92.1%
ViT-Base	Transformer	86M	99.15%	98.2%	96.8%	98.0%
Swin-Tiny	Transformer	28M	98.9%	97.8%	95.9%	97.5%

3. Detailed comparison of key metrics

模型	Accuracy	Precision	Recall	F1-Score	AUC
VGG19	98.7%	98.5%	98.9%	98.7%	0.997
ViT-Base	99.15%	99.1%	99.2%	99.1%	0.999
Swin-Tiny	98.9%	98.7%	99.1%	98.9%	0.998
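As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$. The sketch below recomputes it from the reported precision and recall; the numbers are copied from the table and nothing is re-measured.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) from the metrics table
reported = {
    "VGG19": (98.5, 98.9, 98.7),
    "ViT-Base": (99.1, 99.2, 99.1),
    "Swin-Tiny": (98.7, 99.1, 98.9),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.2f}%, reported = {f1}%")
```

All three recomputed values agree with the reported F1 to within 0.1 percentage points.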

4. Lighting robustness test

Lighting condition	VGG19 accuracy	ViT accuracy	Swin accuracy
Normal lighting	99.2%	99.5%	99.3%
Low light	92.3%	96.8%	95.2%
Strong light	94.1%	97.2%	96.5%
Backlight	89.7%	94.5%	93.1%
Average	93.8%	97.0%	96.0%

5. Edge deployment performance

Platform	Model	Inference latency	Frame rate	Memory usage	Power
Jetson Nano	ViT-Tiny	45ms	22fps	850MB	5W
Jetson AGX Orin	ViT-Base	18ms	55fps	2.1GB	12W
Jetson AGX Orin	Swin-Tiny	12ms	83fps	1.8GB	10W
Qualcomm 8255	Swin-Tiny	15ms	66fps	1.5GB	6W
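The frame-rate column is the reciprocal of the latency column, and the frame rate in turn determines how long a 15-frame alarm threshold takes in wall-clock time. The sketch below works through both; the helper names are illustrative and the figures are copied from the table.

```python
def max_fps(latency_ms):
    """Upper bound on frame rate if inference is the only per-frame cost."""
    return 1000.0 / latency_ms

def alarm_delay_s(threshold_frames, fps):
    """Minimum duration of continuous eye closure before a frame-count threshold fires."""
    return threshold_frames / fps

# (platform, model, latency in ms, reported fps) from the deployment table
deployments = [
    ("Jetson Nano", "ViT-Tiny", 45, 22),
    ("Jetson AGX Orin", "ViT-Base", 18, 55),
    ("Jetson AGX Orin", "Swin-Tiny", 12, 83),
    ("Qualcomm 8255", "Swin-Tiny", 15, 66),
]

for platform, model, latency, reported in deployments:
    print(f"{platform} / {model}: {max_fps(latency):.1f} fps (reported {reported}), "
          f"15-frame alarm after {alarm_delay_s(15, reported):.2f} s")
```

A fixed threshold of 15 frames corresponds to roughly 0.68 s at 22 fps but only about 0.18 s at 83 fps, so a deployment would likely want to express the threshold in seconds and convert it to frames per platform.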

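Implications for IMS applications

1. Transformer architectures are becoming the new standard for DMS

Advantages over CNNs:

Property	CNN	Transformer	Impact on IMS
Global dependencies	Limited (local receptive field)	✅ Global attention	Higher detection accuracy
Transfer learning	Needs extensive fine-tuning	✅ Pretraining transfers well	Lower data requirements
Interpretability	Requires extra design	✅ Native attention maps	Supports functional-safety requirements
Compute cost	Lower	Higher	Deployment needs optimization

Recommendations for IMS deployment:

High-end vehicles: ViT-Base/Swin-Base, for the highest accuracy
Mid-range vehicles: Swin-Tiny/ViT-Tiny, balancing performance and cost
Entry-level vehicles: MobileNet plus a lightweight attention module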

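2. Euro NCAP 2026 compliance strategy

Euro NCAP requirement	Conventional CNN	Transformer	Gap (percentage points)
Distraction detection accuracy >95%	92-94%	97-99%	+5
Drowsiness detection accuracy >90%	88-91%	94-98%	+6
Low-light performance >85%	78-82%	92-96%	+13
Inference latency <50ms	15-30ms	12-45ms	Comparable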

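3. Functional safety and interpretability

Using the CAM-style attention map:

# Generate an interpretability report for one prediction
def generate_explanation(attention_map, prediction, confidence=None):
    """Summarize a prediction. `confidence` is the classifier's softmax probability, if available."""
    # Min-max normalize so the 0.7 threshold is meaningful for softmax attention values
    span = attention_map.max() - attention_map.min()
    norm_map = (attention_map - attention_map.min()) / (span + 1e-8)

    # Fraction of the image covered by high-attention regions
    high_attn = norm_map > 0.7
    coverage = high_attn.float().mean()

    closed = prediction == 'closed'
    return {
        'prediction': 'eyes closed' if closed else 'eyes open',
        'confidence': confidence,
        'attention_coverage': coverage.item(),
        'reason': ('eyelid region closed, pupil not visible' if closed
                   else 'no eyelid closure detected'),
    }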

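ISO 26262 compliance:

Attention maps provide evidence for each decision, supporting traceability requirements
Confidence estimation is integrated; low confidence triggers a degraded mode
Dual-channel redundancy: ViT and Swin run inference in parallel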
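The last two points can be combined into a small per-frame decision rule. The following is a sketch under stated assumptions: `fuse_channels`, the 0.8 confidence floor and the 0.3 disagreement limit are hypothetical and would need to be derived from a safety analysis. The inputs are each model's softmax probability for the closed-eye class.

```python
def fuse_channels(p_closed_vit, p_closed_swin,
                  min_confidence=0.8, max_disagreement=0.3):
    """Return (decision, mode) for one frame from two redundant classifiers."""
    # Redundancy check: the two channels must broadly agree
    if abs(p_closed_vit - p_closed_swin) > max_disagreement:
        return None, "degraded: channels disagree"

    # Confidence check on the averaged probability
    p = (p_closed_vit + p_closed_swin) / 2
    confidence = max(p, 1 - p)
    if confidence < min_confidence:
        return None, "degraded: low confidence"

    return ("closed" if p >= 0.5 else "open"), "normal"

print(fuse_channels(0.95, 0.90))  # ('closed', 'normal')
print(fuse_channels(0.90, 0.20))  # (None, 'degraded: channels disagree')
print(fuse_channels(0.60, 0.55))  # (None, 'degraded: low confidence')
```

In degraded mode the caller would skip the score update for that frame or fall back to a simpler monitor, rather than acting on an unreliable prediction.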

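4. Real-time deployment optimization

Quantization and pruning:

# Dynamic quantization example
def quantize_model(model):
    """Quantize nn.Linear weights to INT8. Dynamic quantization targets CPU inference."""
    model.eval()
    quantized = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear},  # nn.LayerNorm has no dynamically quantized counterpart
        dtype=torch.qint8
    )
    return quantized

# Reported comparison
# FP32: 18ms, 2.1GB
# INT8: 8ms, 1.2GB  (latency down ~55%, memory down ~43%)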

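Deployment optimization recommendations:

Technique	Latency reduction	Accuracy loss	Target platforms
FP16 inference	30-40%	<0.1%	All GPUs
INT8 quantization	50-60%	0.3-0.5%	NPUs with INT8 support
Knowledge distillation	-	<0.5%	All platforms
Model pruning	20-30%	0.5-1%	All platforms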
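Latency reductions like those in the table are best confirmed on the target device. Below is a minimal, framework-agnostic timing sketch: `benchmark` takes any zero-argument callable, and the helper names, warm-up count and run count are illustrative choices.

```python
import time
import statistics

def benchmark(fn, warmup=5, runs=30):
    """Time a zero-argument callable; returns median and p95 latency in ms."""
    for _ in range(warmup):  # warm-up runs are discarded
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

def latency_reduction(before_ms, after_ms):
    """Relative latency reduction in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms

stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
print(f"{latency_reduction(18, 8):.1f}%")  # 55.6%, the FP32 -> INT8 figures quoted above
```

For a PyTorch model, `fn` would wrap a forward pass under `torch.no_grad()`. On a GPU it should also call `torch.cuda.synchronize()` before returning, since CUDA kernels run asynchronously and the timer would otherwise stop too early.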

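5. Multi-task extensibility

The Transformer architecture extends naturally to multi-task learning:

class MultiTaskDMS(nn.Module):
    """Multi-task DMS model with a shared Swin backbone."""

    def __init__(self):
        super().__init__()  # required before registering submodules
        self.backbone = timm.create_model('swin_base_patch4_window7_224',
                                          num_classes=0)
        feat_dim = self.backbone.num_features  # 1024 for Swin-Base

        # One head per task
        self.eye_state_head = nn.Linear(feat_dim, 2)   # open / closed
        self.gaze_head = nn.Linear(feat_dim, 9)        # 9 gaze directions
        self.blink_head = nn.Linear(feat_dim, 3)       # normal / fast / slow blink
        self.drowsiness_head = nn.Linear(feat_dim, 4)  # drowsiness level 0-3

    def forward(self, x):
        features = self.backbone(x)
        return {
            'eye_state': self.eye_state_head(features),
            'gaze': self.gaze_head(features),
            'blink': self.blink_head(features),
            'drowsiness': self.drowsiness_head(features)
        }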

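Advantages:

A single model covers several DMS functions, reducing system complexity
Shared features improve overall performance
Meets the multi-dimensional detection requirements of Euro NCAP 2026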
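Training a shared backbone with several heads needs a single scalar objective. One common choice, shown here as a sketch rather than as the paper's method, is a weighted sum of per-task cross-entropy losses. The task weights are hypothetical, and NumPy is used in place of the training framework to keep the arithmetic visible.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (B, K), labels: (B,) integer class ids."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multitask_loss(outputs, targets, weights):
    """Weighted sum of per-task losses; returns (total, per-task dict)."""
    losses = {task: cross_entropy(outputs[task], targets[task]) for task in outputs}
    total = sum(weights[task] * loss for task, loss in losses.items())
    return total, losses

batch = 4
num_classes = {"eye_state": 2, "gaze": 9, "blink": 3, "drowsiness": 4}
outputs = {t: np.zeros((batch, k)) for t, k in num_classes.items()}  # untrained: uniform
targets = {t: np.zeros(batch, dtype=int) for t in num_classes}
weights = {"eye_state": 1.0, "gaze": 0.5, "blink": 0.5, "drowsiness": 1.0}  # hypothetical

total, losses = multitask_loss(outputs, targets, weights)
print({t: round(float(v), 3) for t, v in losses.items()}, round(float(total), 3))
```

With all-zero logits each head predicts a uniform distribution, so its loss is $\ln K$ for $K$ classes. This gives a useful reference value when checking that training actually reduces each task's loss.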

References

Scientific Reports (2025). Real-time driver drowsiness detection using transformer architectures. DOI: 10.1038/s41598-025-02111-x
Dosovitskiy et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Liu et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
Euro NCAP (2026). Assessment Protocol - Safe Driving v1.0.
MRL Eye Dataset (2018). Machine Learning Research Lab.
NTHU-DDD Dataset. National Tsing Hua University Driver Drowsiness Detection.


https://dapalm.com/2026/06/21/2026-06-21-transformer-drowsiness-detection/

Author: Mars

Published: June 21, 2026
