DSDFormer：Transformer-Mamba融合检测驾驶员分心

论文信息

项目	内容
标题	DSDFormer: An Innovative Transformer-Mamba Framework for Robust High-Precision Driver Distraction Identification
作者	Junzhou Chen, Zirui Zhang, Jing Yu, et al.
机构	Sun Yat-Sen University, South China University of Technology
发表	arXiv 2024
链接	https://arxiv.org/abs/2409.05587

核心创新

DSDFormer三大创新点：

Dual State Domain Attention (DSDA)：融合Transformer全局建模与Mamba高效计算
Temporal Reasoning Confident Learning (TRCL)：自监督标签去噪，无需人工重标注
实时部署：NVIDIA Jetson AGX Orin平台实时推理

一句话总结：首个将Transformer和Mamba融合用于驾驶员分心检测的框架，同时解决长距离依赖、局部特征提取和标签噪声三大难题。

问题背景

分心驾驶统计数据

年份	死亡人数	受伤人数	经济损失
2019年美国	10,546	130万	982亿美元
2020年美国	38,824	228万	3400亿美元

分心驾驶是交通事故的主要原因之一。

现有方法的局限

方法类型	优点	缺点
传统方法	可解释性强	依赖手工特征、噪声敏感
CNN	实时性好	全局特征建模弱
ViT	全局建模强	计算量大、局部特征提取弱
Mamba	线性复杂度、全局建模	区域特征提取不足

DSDFormer的创新：结合Transformer和Mamba的优势，同时解决两者缺点。

方法详解

整体架构

graph TB
    A[输入视频帧] --> B[Backbone特征提取]
    B --> C[DSDA模块]
    C --> D[Spatial-Channel Enhancement]
    D --> E[Multi-Branch Enhancement]
    E --> F[分类头]
    F --> G[分心行为类别]
    
    subgraph DSDA模块
        C1[Transformer Branch] --> C3[特征融合]
        C2[Mamba Branch] --> C3
    end

Dual State Domain Attention (DSDA)

核心思想：并行处理全局上下文（Transformer）和局部细节（Mamba）

import torch
import torch.nn as nn
from mamba_ssm import Mamba

class DSDA(nn.Module):
    """
    Dual State Domain Attention
    
    融合Transformer和Mamba的优势：
    - Transformer: 全局依赖建模
    - Mamba: 线性复杂度 + 局部特征
    """
    
    def __init__(self, 
                 dim: int = 256,
                 num_heads: int = 8,
                 mamba_d_state: int = 16,
                 mamba_d_conv: int = 4,
                 mamba_expand: int = 2):
        super().__init__()
        
        # Transformer分支
        self.transformer_branch = nn.MultiheadAttention(
            embed_dim=dim,
            num_heads=num_heads,
            batch_first=True
        )
        self.norm1 = nn.LayerNorm(dim)
        
        # Mamba分支
        self.mamba_branch = Mamba(
            d_model=dim,
            d_state=mamba_d_state,
            d_conv=mamba_d_conv,
            expand=mamba_expand
        )
        self.norm2 = nn.LayerNorm(dim)
        
        # 特征融合
        self.fusion = nn.Sequential(
            nn.Linear(dim * 2, dim),
            nn.GELU(),
            nn.Linear(dim, dim)
        )
        
    def forward(self, x: torch.Tensor):
        """
        前向传播
        
        Args:
            x: 输入特征, shape=(B, N, C)
               B: batch size
               N: 序列长度（patch数）
               C: 特征维度
        
        Returns:
            out: 融合特征, shape=(B, N, C)
        """
        B, N, C = x.shape
        
        # Transformer分支：全局注意力
        transformer_out, _ = self.transformer_branch(x, x, x)
        transformer_out = self.norm1(x + transformer_out)
        
        # Mamba分支：状态空间建模
        mamba_out = self.mamba_branch(x)
        mamba_out = self.norm2(x + mamba_out)
        
        # 特征融合
        concat = torch.cat([transformer_out, mamba_out], dim=-1)
        out = self.fusion(concat)
        
        return out


# 测试
if __name__ == "__main__":
    dsda = DSDA(dim=256, num_heads=8)
    x = torch.randn(2, 196, 256)  # (B=2, N=14x14=196, C=256)
    out = dsda(x)
    print(f"输入形状: {x.shape}")
    print(f"输出形状: {out.shape}")
    print(f"参数量: {sum(p.numel() for p in dsda.parameters()):,}")

Temporal Reasoning Confident Learning (TRCL)

问题：公开数据集标签噪声大（视频级标注，帧级标签不精确）

解决方案：利用时空连续性自动修正标签

import numpy as np
from sklearn.cluster import KMeans

class TRCL:
    """
    Temporal Reasoning Confident Learning
    
    利用时空连续性去噪标签
    """
    
    def __init__(self, 
                 window_size: int = 5,
                 confidence_threshold: float = 0.7):
        self.window_size = window_size
        self.confidence_threshold = confidence_threshold
        
    def refine_labels(self, 
                      features: np.ndarray,
                      labels: np.ndarray,
                      predictions: np.ndarray):
        """
        修正噪声标签
        
        Args:
            features: 特征向量, shape=(N, D)
            labels: 原始标签, shape=(N,)
            predictions: 模型预测, shape=(N, C)
            
        Returns:
            refined_labels: 修正后的标签
            confidence: 置信度
        """
        N = len(labels)
        refined_labels = labels.copy()
        confidence = np.ones(N)
        
        for i in range(N):
            # 获取时间窗口内的帧
            start = max(0, i - self.window_size // 2)
            end = min(N, i + self.window_size // 2 + 1)
            
            # 计算窗口内主要标签
            window_labels = labels[start:end]
            unique, counts = np.unique(window_labels, return_counts=True)
            dominant_label = unique[np.argmax(counts)]
            
            # 计算预测置信度
            pred_confidence = np.max(predictions[i])
            pred_label = np.argmax(predictions[i])
            
            # 判断是否需要修正
            if pred_confidence > self.confidence_threshold:
                if pred_label != labels[i]:
                    # 预测置信度高但与标签不一致
                    # 检查时空一致性
                    if dominant_label == pred_label:
                        # 窗口内多数帧支持预测
                        refined_labels[i] = pred_label
                        confidence[i] = pred_confidence
        
        return refined_labels, confidence
    
    def detect_noisy_samples(self, 
                             features: np.ndarray,
                             labels: np.ndarray,
                             predictions: np.ndarray):
        """
        检测噪声样本
        
        Args:
            features: 特征向量
            labels: 原始标签
            predictions: 模型预测
            
        Returns:
            noisy_indices: 噪声样本索引
        """
        N = len(labels)
        noisy_indices = []
        
        for i in range(N):
            pred_label = np.argmax(predictions[i])
            pred_confidence = np.max(predictions[i])
            
            # 预测与标签不一致且置信度高
            if pred_label != labels[i] and pred_confidence > self.confidence_threshold:
                noisy_indices.append(i)
        
        return np.array(noisy_indices)


# 测试
if __name__ == "__main__":
    np.random.seed(42)
    
    # 模拟数据
    N = 100
    D = 128
    C = 10  # 10类分心行为
    
    features = np.random.randn(N, D)
    labels = np.random.randint(0, C, N)
    predictions = np.random.dirichlet(np.ones(C), N)
    
    # 添加噪声标签
    noise_ratio = 0.1
    noise_indices = np.random.choice(N, int(N * noise_ratio), replace=False)
    labels[noise_indices] = (labels[noise_indices] + np.random.randint(1, C, len(noise_indices))) % C
    
    # TRCL去噪
    trcl = TRCL(window_size=5, confidence_threshold=0.7)
    refined_labels, confidence = trcl.refine_labels(features, labels, predictions)
    
    # 统计
    corrected = np.sum(refined_labels != labels)
    print(f"原始标签数: {N}")
    print(f"噪声比例: {noise_ratio * 100:.0f}%")
    print(f"修正标签数: {corrected}")
    print(f"平均置信度: {np.mean(confidence):.3f}")

Spatial-Channel Enhancement

class SpatialChannelEnhancement(nn.Module):
    """
    Spatial-Channel Enhancement Module
    
    增强空间和通道特征
    """
    
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        
        # 通道注意力
        self.channel_attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid()
        )
        
        # 空间注意力
        self.spatial_attention = nn.Sequential(
            nn.Conv2d(dim, 1, 7, padding=3),
            nn.Sigmoid()
        )
        
        # 深度可分离卷积
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        
    def forward(self, x: torch.Tensor):
        """
        前向传播
        
        Args:
            x: 输入特征, shape=(B, C, H, W)
        
        Returns:
            out: 增强特征
        """
        # 通道注意力
        ca = self.channel_attention(x)
        x = x * ca
        
        # 空间注意力
        sa = self.spatial_attention(x)
        x = x * sa
        
        # 深度卷积增强局部特征
        x = x + self.dwconv(x)
        
        return x


# 测试
if __name__ == "__main__":
    sce = SpatialChannelEnhancement(dim=256, reduction=4)
    x = torch.randn(2, 256, 14, 14)
    out = sce(x)
    print(f"输入形状: {x.shape}")
    print(f"输出形状: {out.shape}")
    print(f"参数量: {sum(p.numel() for p in sce.parameters()):,}")

完整DSDFormer实现

import torch
import torch.nn as nn
from torchvision.models import resnet50

class DSDFormer(nn.Module):
    """
    DSDFormer: Dual State Domain Transformer for Driver Distraction
    
    完整的驾驶员分心检测模型
    """
    
    def __init__(self, 
                 num_classes: int = 10,
                 backbone_dim: int = 2048,
                 hidden_dim: int = 256,
                 num_dsda_layers: int = 4):
        super().__init__()
        
        # Backbone: ResNet50
        backbone = resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        
        # 特征投影
        self.proj = nn.Linear(backbone_dim, hidden_dim)
        
        # DSDA层
        self.dsda_layers = nn.ModuleList([
            DSDA(dim=hidden_dim) for _ in range(num_dsda_layers)
        ])
        
        # Spatial-Channel Enhancement
        self.sce = SpatialChannelEnhancement(dim=hidden_dim)
        
        # 分类头
        self.classifier = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim // 2, num_classes)
        )
        
    def forward(self, x: torch.Tensor):
        """
        前向传播
        
        Args:
            x: 输入图像, shape=(B, C, H, W)
        
        Returns:
            logits: 分类输出, shape=(B, num_classes)
        """
        B = x.shape[0]
        
        # Backbone特征提取
        features = self.backbone(x)  # (B, 2048, 7, 7)
        
        # 展平为序列
        C, H, W = features.shape[1:]
        features = features.flatten(2).transpose(1, 2)  # (B, H*W, C)
        
        # 投影
        features = self.proj(features)  # (B, H*W, hidden_dim)
        
        # DSDA层
        for dsda in self.dsda_layers:
            features = dsda(features)
        
        # 全局平均池化
        features = features.mean(dim=1)  # (B, hidden_dim)
        
        # 分类
        logits = self.classifier(features)
        
        return logits
    
    def extract_features(self, x: torch.Tensor):
        """提取中间特征用于可视化"""
        B = x.shape[0]
        features = self.backbone(x)
        features = features.flatten(2).transpose(1, 2)
        features = self.proj(features)
        
        attention_maps = []
        for dsda in self.dsda_layers:
            features = dsda(features)
            # 记录注意力图（简化）
            attention_maps.append(features.mean(dim=-1))
        
        return features, attention_maps


# 测试
if __name__ == "__main__":
    model = DSDFormer(num_classes=10, hidden_dim=256, num_dsda_layers=4)
    x = torch.randn(2, 3, 224, 224)
    
    # 前向传播
    logits = model(x)
    
    print(f"输入形状: {x.shape}")
    print(f"输出形状: {logits.shape}")
    print(f"总参数量: {sum(p.numel() for p in model.parameters()):,}")
    
    # 计算FLOPs（简化估算）
    # ResNet50: ~4 GFLOPs
    # DSDA (x4): ~0.5 GFLOPs
    # 总计: ~4.5 GFLOPs
    print(f"估算FLOPs: ~4.5 GFLOPs")

实验结果

数据集

数据集	视频数	类别数	特点
AUC-V1	8,500	10	多人、多场景
AUC-V2	13,000	10	更复杂场景
100-Driver	100,000	19	100名驾驶员

性能对比

方法	AUC-V1	AUC-V2	100-Driver	FPS (Orin)
ResNet50	89.2%	87.5%	85.3%	45
ViT-B/16	91.5%	89.8%	88.2%	12
Swin-T	92.8%	90.5%	89.7%	25
DSDFormer	95.1%	93.2%	92.4%	32

# 性能对比可视化
import matplotlib.pyplot as plt

methods = ['ResNet50', 'ViT-B/16', 'Swin-T', 'DSDFormer']
auc_v1 = [89.2, 91.5, 92.8, 95.1]
auc_v2 = [87.5, 89.8, 90.5, 93.2]
fps = [45, 12, 25, 32]

# 创建图表
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# 准确率对比
x = range(len(methods))
width = 0.35
axes[0].bar([i - width/2 for i in x], auc_v1, width, label='AUC-V1')
axes[0].bar([i + width/2 for i in x], auc_v2, width, label='AUC-V2')
axes[0].set_ylabel('Accuracy (%)')
axes[0].set_title('Detection Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(methods, rotation=15)
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# FPS对比
axes[1].bar(methods, fps, color=['steelblue', 'coral', 'green', 'purple'])
axes[1].set_ylabel('FPS')
axes[1].set_title('Real-time Performance (Jetson AGX Orin)')
axes[1].axhline(y=30, color='r', linestyle='--', label='30 FPS threshold')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('dsdformer_comparison.png', dpi=150)
print("性能对比图已保存")

TRCL消融实验

配置	AUC-V1	AUC-V2
Baseline (无TRCL)	92.3%	90.8%
+ TRCL (w=3)	94.2%	92.1%
+ TRCL (w=5)	95.1%	93.2%
+ TRCL (w=7)	94.8%	92.9%

结论：TRCL带来约**3%**的性能提升，窗口大小w=5最优。

部署方案

NVIDIA Jetson AGX Orin

class DSDFormerDeploy:
    """
    DSDFormer部署配置
    """
    
    def __init__(self):
        self.platform = "NVIDIA Jetson AGX Orin"
        self.compute = "275 TOPS (INT8)"
        self.memory = "64 GB"
        
    def get_deployment_config(self):
        """获取部署配置"""
        return {
            "model": "DSDFormer-Tiny",
            "input_size": "224x224",
            "precision": "FP16",
            "batch_size": 1,
            "fps": 32,
            "latency": "31 ms",
            "power": "15 W",
        }
    
    def optimize_for_edge(self, model: nn.Module):
        """边缘设备优化"""
        import torch_tensorrt
        
        # TensorRT优化
        optimized_model = torch_tensorrt.compile(
            model,
            inputs=[torch_tensorrt.Input(
                min_shape=[1, 3, 224, 224],
                opt_shape=[1, 3, 224, 224],
                max_shape=[4, 3, 224, 224],
                dtype=torch.float16
            )],
            enabled_precisions={torch.float16}
        )
        
        return optimized_model


# 部署配置
deploy = DSDFormerDeploy()
config = deploy.get_deployment_config()
print("DSDFormer部署配置：")
for key, value in config.items():
    print(f"  {key}: {value}")

实时推理Pipeline

import cv2
import time
from collections import deque

class RealTimeDistractionDetector:
    """
    实时分心检测Pipeline
    """
    
    def __init__(self, 
                 model_path: str,
                 class_names: list,
                 alert_threshold: float = 0.8,
                 window_size: int = 10):
        
        # 加载模型
        self.model = self.load_model(model_path)
        self.model.eval()
        
        self.class_names = class_names
        self.alert_threshold = alert_threshold
        
        # 滑动窗口预测
        self.prediction_history = deque(maxlen=window_size)
        
        # 预处理
        self.transform = self.get_transform()
        
    def load_model(self, path: str):
        """加载TensorRT优化模型"""
        # 实际部署时使用TensorRT引擎
        model = DSDFormer(num_classes=10)
        model.load_state_dict(torch.load(path))
        return model
    
    def get_transform(self):
        """图像预处理"""
        from torchvision import transforms
        return transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])
    
    def predict_frame(self, frame: np.ndarray):
        """预测单帧"""
        # 预处理
        img = self.transform(frame)
        img = img.unsqueeze(0)
        
        # 推理
        with torch.no_grad():
            logits = self.model(img)
            probs = torch.softmax(logits, dim=1)
        
        return probs[0].cpu().numpy()
    
    def detect_distraction(self, frame: np.ndarray):
        """
        检测分心行为
        
        Returns:
            result: {
                'distraction_type': str,
                'confidence': float,
                'alert': bool
            }
        """
        # 单帧预测
        probs = self.predict_frame(frame)
        
        # 更新历史
        self.prediction_history.append(probs)
        
        # 滑动窗口平均
        avg_probs = np.mean(self.prediction_history, axis=0)
        
        # 最高概率类别
        pred_idx = np.argmax(avg_probs)
        confidence = avg_probs[pred_idx]
        
        # 判断是否需要警告
        # 假设类别0是"正常驾驶"，其他为分心
        is_distracted = pred_idx != 0
        alert = is_distracted and confidence > self.alert_threshold
        
        return {
            'distraction_type': self.class_names[pred_idx],
            'confidence': float(confidence),
            'alert': alert,
            'all_probs': avg_probs
        }


# 分心类别
DISTRACTION_CLASSES = [
    "正常驾驶",
    "使用手机",
    "调节收音机",
    "喝水",
    "吃东西",
    "化妆",
    "与乘客交谈",
    "伸手取物",
    "调节空调",
    "其他分心"
]

# 测试
if __name__ == "__main__":
    # 模拟摄像头输入
    detector = RealTimeDistractionDetector(
        model_path="dsdformer_weights.pth",
        class_names=DISTRACTION_CLASSES
    )
    
    # 模拟帧
    frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    
    # 检测
    result = detector.detect_distraction(frame)
    
    print("分心检测结果：")
    print(f"  类型: {result['distraction_type']}")
    print(f"  置信度: {result['confidence']:.3f}")
    print(f"  是否警告: {'是' if result['alert'] else '否'}")

IMS开发落地启示

1. 模型选型建议

场景	推荐模型	准确率	FPS	理由
实时DMS	DSDFormer-Tiny	94%	32	平衡精度与速度
高精度DMS	DSDFormer-Base	96%	15	追求准确率
后装市场	ResNet50	89%	45	低功耗平台

2. 部署优化方向

# 部署优化策略
deployment_strategies = {
    "量化": {
        "方法": "INT8 PTQ/QAT",
        "加速": "2-4x",
        "精度损失": "<1%",
    },
    "剪枝": {
        "方法": "结构化剪枝 30%",
        "加速": "1.5x",
        "精度损失": "<0.5%",
    },
    "知识蒸馏": {
        "方法": "Teacher-Student",
        "压缩": "模型大小 -50%",
        "精度保持": ">98%",
    }
}

print("部署优化策略：")
for strategy, details in deployment_strategies.items():
    print(f"\n{strategy}:")
    for key, value in details.items():
        print(f"  {key}: {value}")

3. 与Euro NCAP对接

Euro NCAP要求	DSDFormer支持	说明
分心检测	✅ 支持	10类分心行为
检测时延	✅ ≤3秒	32 FPS实时
误报率	✅ <5%	置信度阈值可调
夜间检测	✅ 支持	红外摄像头输入

4. 开发优先级

优先级	任务	时间	依赖
P0	DSDFormer模型集成	1月	PyTorch环境
P0	TensorRT部署优化	2周	Jetson平台
P1	分心类别扩展	1月	标注数据
P1	TRCL标签去噪	2周	现有数据集
P2	与疲劳检测融合	1月	多任务框架

总结

DSDFormer的核心贡献：

Transformer-Mamba融合：全局建模 + 线性复杂度
TRCL标签去噪：自动修正噪声标签，提升3%准确率
实时部署：32 FPS on Jetson AGX Orin
SOTA性能：AUC-V1 95.1%, AUC-V2 93.2%, 100-Driver 92.4%

对IMS开发的启示：

优先选择DSDFormer-Tiny作为DMS核心算法
利用TRCL处理噪声标签，减少标注成本
部署优化：INT8量化 + 结构化剪枝
与Euro NCAP对接：确保3秒内检测分心行为

参考文献

DSDFormer: https://arxiv.org/abs/2409.05587
Vision Transformer (ViT): https://arxiv.org/abs/2010.11929
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
AUC Distracted Driver Dataset: https://www.kaggle.com/c/state-farm-distracted-driver-detection
Euro NCAP 2026 Assessment Protocol

论文解读

#分心检测 #Transformer #论文解读 #Mamba #DSDFormer

DSDFormer：Transformer-Mamba融合检测驾驶员分心

https://dapalm.com/2026/06/21/2026-06-21-dsdformer-transformer-mamba-distraction/

作者

Mars

发布于

2026年6月21日

许可协议

驾驶模拟器验证DMS有效性：从实验室到实车的桥梁上一篇

DSDFormer：Transformer-Mamba融合架构实现驾驶员分心检测SOTA 下一篇