EyeCue: A Breakthrough in Cognitive Distraction Detection | IJCAI 2026 Paper Walkthrough and Code Reproduction

Paper information

Title: EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

Authors: Lang Zhang, JinYi Yoon, Matthew Corbett, Abhijit Sarkar, Bo Ji

Affiliations: Virginia Tech, Inha University, Army Cyber Institute at West Point

Venue: IJCAI 2026 (International Joint Conference on Artificial Intelligence)

Link: arXiv:2605.07859

Code: github.com/langzhang2000/EyeCue

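Core innovation

EyeCue is presented as the first framework that detects driver cognitive distraction by combining eye tracking with egocentric video.

Three key advances

Non-invasive detection: no EEG or physiological sensors, only a camera plus eye tracking
Cross-modal fusion: a cross-attention mechanism between the gaze signal and the video scene
Large-scale dataset: the CogDrive dataset (3,662 samples covering multiple driving scenarios)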

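Problem definition: why cognitive distraction is uniquely hard

The three distraction types compared

Distraction type	Definition	Detection difficulty	Conventional approach
Manual	Hands leave the wheel (e.g. picking up a phone)	⭐ Easy	Camera-based hand pose detection
Visual	Eyes leave the road (e.g. looking at the navigation screen)	⭐⭐ Moderate	Gaze tracking + ROI check
Cognitive	Mind wanders (e.g. thinking about work)	⭐⭐⭐⭐⭐ Very hard	Requires modeling the interaction between attention and scene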

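Why cognitive distraction is hard to see


graph LR
    A[Driver] --> B{Cognitive state}
    B -->|Normal| C[Eyes on road<br/>Attention focused]
    B -->|Cognitively distracted| D[Eyes on road<br/>Mind wandering]

    C --> E[✓ Handled by conventional DMS]
    D --> F[✗ Missed by conventional DMS<br/>Detected by EyeCue]

    style D fill:#ff6b6b
    style F fill:#4ecdc4

Key insight: cognitive distraction does not show up in "where the driver looks", but in "what the driver looks at and how".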

EyeCue architecture in detail

Overall framework


graph TB
    subgraph Input
        V[Egocentric Video]
        G[Gaze Sequence]
    end

    subgraph Encoders
        VE[VideoEncoder<br/>TimeSformer]
        GE[GazeEncoder<br/>Transformer]
    end

    subgraph Fusion
        GPS[Gaze-guided<br/>Patch Selection]
        CA[Cross-Attention<br/>Semantic Fusion]
    end

    subgraph Output
        CH[ClassificationHead]
        R[Distracted / Normal]
    end

    V --> VE
    G --> GE
    VE --> GPS
    GE --> GPS
    GPS --> CA
    VE --> CA
    GE --> CA
    CA --> CH --> R

Core modules in detail

1. VideoEncoder

import torch
import torch.nn as nn
from transformers import TimesformerModel

class VideoEncoder(nn.Module):
    """
    TimeSformer video encoder.

    Input:  (B, T, C, H, W) - B clips of T frames each
    Output: (B, T×196, D) patch tokens and a (B, 1, D) CLS token

    Notes:
    - Uses the pretrained TimeSformer-base checkpoint (Kinetics-600)
    - Each frame is split into 14×14 = 196 patches
    - Feature dimension D = 768
    """

    def __init__(self, model_name="facebook/timesformer-base-finetuned-k600"):
        super().__init__()
        self.timesformer = TimesformerModel.from_pretrained(model_name)
        self.hidden_dim = 768

    def forward(self, video_clip):
        """
        Args:
            video_clip: (B, T, C, H, W) video clip

        Returns:
            video_tokens: (B, T×196, D) video patch tokens
            cls_token: (B, 1, D) CLS token
        """
        # TimeSformer applies its space-time attention internally
        outputs = self.timesformer(video_clip)

        # Split the CLS token from the patch tokens
        last_hidden_state = outputs.last_hidden_state  # (B, T×196+1, D)
        cls_token = last_hidden_state[:, 0:1, :]       # (B, 1, D)
        video_tokens = last_hidden_state[:, 1:, :]     # (B, T×196, D)

        return video_tokens, cls_token

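Why TimeSformer?

Property	TimeSformer	3D CNN (e.g. I3D)	VideoMAE
Temporal modeling	Divided space-time attention	3D convolution	Masked reconstruction
Pretraining data	Kinetics-600	Kinetics-400	Large-scale unlabeled video
Patch-level access	✅ Direct	❌ No patch concept	✅ Direct
Deployment difficulty	⭐⭐ Moderate	⭐ Easy	⭐⭐⭐ Hard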
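
The "patch-level access" row is easier to reason about with the token count worked out by hand. The following is a minimal sketch, assuming a 16×16 patch size (inferred from 224 / 14, not stated in the post); `token_layout` is a hypothetical helper name, not part of the EyeCue code.

```python
def token_layout(num_frames, image_size=224, patch_size=16):
    """Return (grid, patches_per_frame, patch_tokens, tokens_including_cls)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches along one side
    per_frame = grid * grid                  # patches in one frame
    patch_tokens = num_frames * per_frame    # patch tokens in the whole clip
    return grid, per_frame, patch_tokens, patch_tokens + 1  # +1 for CLS


grid, per_frame, patch_tokens, total = token_layout(num_frames=8)
print(grid, per_frame, patch_tokens, total)  # 14 196 1568 1569
```

For an 8-frame clip, the gaze-guided selection described later therefore picks 8 tokens out of 1,568.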

2. GazeEncoder

class GazeEncoder(nn.Module):
    """
    Gaze sequence encoder.

    Input:  (B, T, 2) - (x, y) gaze coordinates for T frames
    Output: (B, 1, D) global gaze feature and (B, T, D) per-frame gaze tokens

    Key ideas:
    - A learnable CLS token aggregates the global gaze pattern
    - Positional embeddings + self-attention capture temporal dependencies
    """

    def __init__(self, d_model=768, nhead=8, num_layers=2):
        super().__init__()

        # Linear projection: (x, y) → D-dimensional feature
        self.gaze_embed = nn.Linear(2, d_model)

        # Learnable CLS token
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

        # Positional embedding: supports at most 16 frames + 1 CLS token
        self.pos_embed = nn.Parameter(torch.randn(1, 17, d_model))

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=2048,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, gaze_coords):
        """
        Args:
            gaze_coords: (B, T, 2) normalized gaze coordinates, T <= 16

        Returns:
            cls_out: (B, 1, D) global gaze feature
            gaze_tokens: (B, T, D) per-frame gaze features
        """
        B, T, _ = gaze_coords.shape

        # Project into the model dimension
        gaze_embed = self.gaze_embed(gaze_coords)  # (B, T, D)

        # Prepend the CLS token
        cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, D)
        x = torch.cat([cls_tokens, gaze_embed], dim=1)  # (B, T+1, D)

        # Add positional embeddings
        x = x + self.pos_embed[:, :T+1, :]

        # Transformer encoding
        x = self.transformer(x)  # (B, T+1, D)

        cls_out = x[:, 0:1, :]    # (B, 1, D)
        gaze_tokens = x[:, 1:, :] # (B, T, D)

        return cls_out, gaze_tokens

3. Gaze-guided Patch Selection

This is EyeCue's core innovation.

class GazeGuidedPatchSelector(nn.Module):
    """
    Gaze-guided selection of video patch tokens.

    Idea:
    - Map each frame's gaze coordinate (x, y) to the patch token it falls on
    - Extract the features of the region the driver is fixating

    Implementation:
    - Each frame is split into a 14×14 grid
    - Gaze coordinates are normalized to [0, 1]
    - The matching patch index is computed per frame
    """

    def __init__(self, patch_grid_size=14):
        super().__init__()
        self.grid_size = patch_grid_size

    def forward(self, video_tokens, gaze_coords):
        """
        Args:
            video_tokens: (B, T×196, D) all patch tokens
            gaze_coords: (B, T, 2) gaze coordinates in [0, 1]

        Returns:
            selected_patches: (B, T, D) features of the gazed patches
        """
        B, T, _ = gaze_coords.shape
        D = video_tokens.shape[-1]

        # Map gaze to grid cells. Clamp so that a coordinate of exactly 1.0
        # (or slightly out-of-range tracker output) stays inside the grid
        # instead of producing index 14.
        cells = (gaze_coords * self.grid_size).long().clamp(0, self.grid_size - 1)
        patch_x, patch_y = cells[..., 0], cells[..., 1]  # (B, T) each

        # Spatial index inside one frame: y * 14 + x
        spatial_idx = patch_y * self.grid_size + patch_x  # (B, T)

        # Assumption: the Hugging Face TimeSformer returns patch tokens in
        # patch-major order (time is the innermost axis), so the flat index
        # is spatial_idx * T + t. An encoder that emits frame-major tokens
        # would need t * 196 + spatial_idx instead.
        t_idx = torch.arange(T, device=video_tokens.device).unsqueeze(0)  # (1, T)
        patch_indices = spatial_idx * T + t_idx  # (B, T)

        # Gather the selected tokens without Python loops
        gather_idx = patch_indices.unsqueeze(-1).expand(B, T, D)
        selected_patches = torch.gather(video_tokens, dim=1, index=gather_idx)

        return selected_patches
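
The index arithmetic behind the selector is easy to get wrong, because the flat token index depends on how the encoder orders its tokens. The sketch below is a NumPy illustration under stated assumptions, not the authors' implementation: `gaze_to_patch`, `flat_token_index`, and the `layout` argument are hypothetical names, and "patch_major" (time innermost) reflects how the Hugging Face TimeSformer appears to order its output, which should be verified against the encoder actually used.

```python
import numpy as np


def gaze_to_patch(gaze, grid=14):
    """Map normalized gaze (T, 2) in [0, 1] to integer (col, row) grid cells."""
    cells = np.floor(np.asarray(gaze, dtype=float) * grid).astype(int)
    # A gaze value of exactly 1.0 would give index `grid`; clip it back in range.
    return np.clip(cells, 0, grid - 1)


def flat_token_index(cells, layout="frame_major", grid=14):
    """Flat index of the gazed patch token for every frame t."""
    num_frames = cells.shape[0]
    t = np.arange(num_frames)
    spatial = cells[:, 1] * grid + cells[:, 0]  # row * grid + col
    if layout == "frame_major":    # tokens ordered frame by frame
        return t * grid * grid + spatial
    if layout == "patch_major":    # tokens ordered patch by patch, time innermost
        return spatial * num_frames + t
    raise ValueError(f"unknown layout: {layout}")


def select_gazed_tokens(tokens, gaze, layout="frame_major", grid=14):
    """tokens: (T*grid*grid, D) -> (T, D) tokens under the gaze point."""
    return tokens[flat_token_index(gaze_to_patch(gaze, grid), layout, grid)]


gaze = np.full((8, 2), 0.5)              # driver looks at the image center
cells = gaze_to_patch(gaze)              # every row is (7, 7)
print(flat_token_index(cells, "frame_major")[2])  # 2*196 + 7*14 + 7 = 497
print(flat_token_index(cells, "patch_major")[2])  # (7*14 + 7)*8 + 2 = 842
```

The two layouts point at different tokens for the same gaze sample, so a quick check against the encoder's real output order is worth doing before training.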

4. Cross-Attention Semantic Fusion

class CrossAttentionBlock(nn.Module):
    """
    Cross-modal cross-attention.

    Query: gaze features (which need to "understand" the scene)
    Key/Value: video patch features (the scene content)

    Purpose: let the gaze features actively query the visual context
    """

    def __init__(self, d_model=768, nhead=8):
        super().__init__()

        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            batch_first=True
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2048),
            nn.GELU(),
            nn.Linear(2048, d_model)
        )

    def forward(self, query, key_value):
        """
        Args:
            query: (B, T, D) gaze features
            key_value: (B, L, D) video features

        Returns:
            out: (B, T, D) fused features
            attn_weights: (B, T, L) attention weights averaged over heads
        """
        # Cross-attention
        attn_out, attn_weights = self.cross_attn(
            query=query,
            key=key_value,
            value=key_value
        )

        # Residual connection + LayerNorm
        query = self.norm1(query + attn_out)

        # Feed-forward network
        query = self.norm2(query + self.ffn(query))

        return query, attn_weights


class SemanticFusion(nn.Module):
    """
    Stack of cross-attention fusion layers.
    """

    def __init__(self, d_model=768, nhead=8, num_layers=3):
        super().__init__()

        self.layers = nn.ModuleList([
            CrossAttentionBlock(d_model, nhead)
            for _ in range(num_layers)
        ])

    def forward(self, gaze_tokens, video_tokens):
        """
        Args:
            gaze_tokens: (B, T, D)
            video_tokens: (B, T×196, D)

        Returns:
            fused: (B, 1, D) fused global feature
        """
        x = gaze_tokens

        for layer in self.layers:
            x, _ = layer(query=x, key_value=video_tokens)

        # Use the feature at the last time step as the global representation
        fused = x[:, -1:, :]  # (B, 1, D)

        return fused
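
To see what the cross-attention step computes, here is a single-head version in NumPy. It is a sketch only: it omits the learned query/key/value projections, multiple heads, residual connections, and feed-forward layers of the PyTorch block, and the toy dimensions are arbitrary.

```python
import numpy as np


def cross_attention(query, key_value):
    """Single-head scaled dot-product attention.

    query:     (T, D) gaze tokens
    key_value: (L, D) video patch tokens, used as both keys and values
    returns:   (T, D) fused tokens and (T, L) attention weights
    """
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)              # (T, L)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ key_value, weights


rng = np.random.default_rng(0)
T, L, D = 8, 8 * 196, 16
gaze_tokens = rng.normal(size=(T, D))
video_tokens = rng.normal(size=(L, D))

fused, weights = cross_attention(gaze_tokens, video_tokens)
print(fused.shape, weights.shape)  # (8, 16) (8, 1568)
```

Each of the 8 gaze tokens ends up as a weighted average of all 1,568 patch tokens, with the weights showing which parts of the scene that gaze sample attends to.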

Putting the full model together

class EyeCue(nn.Module):
    """
    EyeCue: complete cognitive distraction detection model.

    Inputs:
        - video_clip: (B, T, C, H, W) egocentric video
        - gaze_coords: (B, T, 2) gaze coordinate sequence

    Output:
        - logits: (B, 2) class scores [normal, distracted]
    """

    def __init__(self, num_frames=8):
        super().__init__()

        # Clip length T; stored for reference only in this sketch
        self.num_frames = num_frames

        # Video encoder
        self.video_encoder = VideoEncoder()

        # Gaze encoder
        self.gaze_encoder = GazeEncoder(d_model=768, nhead=8, num_layers=2)

        # Patch selector
        self.patch_selector = GazeGuidedPatchSelector(patch_grid_size=14)

        # Cross-modal fusion
        self.semantic_fusion = SemanticFusion(d_model=768, nhead=8, num_layers=3)

        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(768, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 2)
        )

    def forward(self, video_clip, gaze_coords):
        """
        Args:
            video_clip: (B, T, 3, 224, 224)
            gaze_coords: (B, T, 2) normalized coordinates in [0, 1]

        Returns:
            logits: (B, 2)
        """
        # 1. Encode the video
        video_tokens, video_cls = self.video_encoder(video_clip)
        # video_tokens: (B, T×196, 768)

        # 2. Encode the gaze sequence
        gaze_cls, gaze_tokens = self.gaze_encoder(gaze_coords)
        # gaze_tokens: (B, T, 768)

        # 3. Select the patches under the gaze point
        selected_patches = self.patch_selector(video_tokens, gaze_coords)
        # selected_patches: (B, T, 768)

        # 4. Cross-modal fusion
        # Assumption: the post does not say how the selected patches enter the
        # fusion step (the original code computed them but never used them).
        # Here they are added to the gaze tokens to form the query.
        query = gaze_tokens + selected_patches
        fused = self.semantic_fusion(query, video_tokens)
        # fused: (B, 1, 768)

        # 5. Classify
        logits = self.classifier(fused)

        return logits


# ==================== Usage example ====================

if __name__ == "__main__":
    # Build the model
    model = EyeCue(num_frames=8)

    # Dummy inputs
    B, T = 2, 8
    video = torch.randn(B, T, 3, 224, 224)
    gaze = torch.rand(B, T, 2)  # normalized gaze coordinates

    # Forward pass
    logits = model(video, gaze)

    print(f"Input video: {video.shape}")
    print(f"Input gaze: {gaze.shape}")
    print(f"Output logits: {logits.shape}")

    # Prediction
    pred = torch.argmax(logits, dim=1)
    print(f"Prediction: {pred}")
    print("0 = normal driving, 1 = cognitively distracted")

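The CogDrive dataset in detail

Dataset statistics

Metric	Value
Total samples	3,662
Normal samples	1,831 (50%)
Distracted samples	1,831 (50%)
Video resolution	224×224
Frame rate	30 fps
Gaze sampling rate	60 Hz

Data sources

Dataset	Scenario	Samples
DR(eye)VE	Highway / urban roads	588
BDD-A	Diverse driving scenes	1,200
DADA-2000	Accident scenes	924
TrafficGaze	Complex traffic	950

Scene distribution


pie title CogDrive scene distribution
    "Urban roads" : 1200
    "Highway" : 900
    "Residential" : 800
    "Complex traffic" : 762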

Data loading code

import os
import torch
from torch.utils.data import Dataset
import cv2
import numpy as np

class CogDriveDataset(Dataset):
    """
    CogDrive dataset loader.

    Directory layout:
        all_video_raw_resize/
            ├── sample_001.mp4
            ├── sample_002.mp4
            └── ...
        all_gaze_coordinate/
            ├── sample_001.npy
            ├── sample_002.npy
            └── ...
        labels.txt
    """

    def __init__(self, 
                 video_dir,
                 gaze_dir,
                 label_file,
                 num_frames=8,
                 transform=None):
        super().__init__()

        self.video_dir = video_dir
        self.gaze_dir = gaze_dir
        self.num_frames = num_frames
        self.transform = transform  # not applied in this sketch

        # Load labels: one "video_name,label" pair per line
        self.samples = []
        with open(label_file, 'r') as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                parts = line.strip().split(',')
                video_name = parts[0]
                label = int(parts[1])
                self.samples.append((video_name, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_name, label = self.samples[idx]

        # Load the video
        video_path = os.path.join(self.video_dir, f"{video_name}.mp4")
        frames = self._load_video(video_path)

        # Load the gaze data
        gaze_path = os.path.join(self.gaze_dir, f"{video_name}.npy")
        gaze_coords = np.load(gaze_path)

        # Sample a fixed number of frames
        frames, gaze_coords = self._sample_frames(frames, gaze_coords)

        # Convert to tensors
        frames = torch.from_numpy(frames).float() / 255.0
        frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) → (T, C, H, W)

        gaze_coords = torch.from_numpy(gaze_coords).float()

        return frames, gaze_coords, label

    def _load_video(self, video_path):
        """Load every frame of a video as RGB."""
        cap = cv2.VideoCapture(video_path)
        frames = []

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            frame = cv2.resize(frame, (224, 224))
            # OpenCV decodes to BGR; pretrained video models expect RGB
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame)

        cap.release()
        if not frames:
            raise RuntimeError(f"No frames could be read from {video_path}")
        return np.array(frames)

    def _sample_frames(self, frames, gaze_coords):
        """Uniformly sample a fixed number of frames."""
        total_frames = len(frames)

        if total_frames >= self.num_frames:
            indices = np.linspace(0, total_frames-1, self.num_frames, dtype=int)
        else:
            # Too few frames: repeat the last one
            indices = list(range(total_frames))
            indices += [total_frames-1] * (self.num_frames - total_frames)

        frames = frames[indices]
        # Assumes the .npy file holds one gaze row per video frame
        gaze_coords = gaze_coords[indices]

        return frames, gaze_coords


# Usage example
if __name__ == "__main__":
    dataset = CogDriveDataset(
        video_dir="/path/to/all_video_raw_resize",
        gaze_dir="/path/to/all_gaze_coordinate",
        label_file="/path/to/labels.txt",
        num_frames=8
    )

    print(f"Dataset size: {len(dataset)}")

    # Fetch one sample
    frames, gaze, label = dataset[0]
    print(f"Video frames: {frames.shape}")  # (8, 3, 224, 224)
    print(f"Gaze data: {gaze.shape}")  # (8, 2)
    print(f"Label: {label}")
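
The statistics table lists 30 fps video and 60 Hz gaze, while the loader indexes the gaze array with frame indices. That only works if the `.npy` files already hold one gaze row per frame, which the post does not state. If they instead hold raw 60 Hz samples, the gaze would first need to be brought to the frame rate. A minimal sketch of one way to do that (averaging the samples that fall inside each frame's time window) follows; `gaze_per_frame` is a hypothetical helper, not part of the released code.

```python
import numpy as np


def gaze_per_frame(gaze_samples, gaze_hz=60.0, video_fps=30.0, num_frames=None):
    """Average the gaze samples that fall inside each video frame's time window.

    gaze_samples: (N, 2) raw gaze samples recorded at `gaze_hz`
    returns:      (num_frames, 2) one gaze point per frame (NaN if no sample)
    """
    gaze_samples = np.asarray(gaze_samples, dtype=float)
    n = len(gaze_samples)
    if num_frames is None:
        num_frames = int(np.floor(n * video_fps / gaze_hz))
    # Index of the video frame that is on screen when each sample is taken
    frame_of_sample = np.floor(np.arange(n) * video_fps / gaze_hz).astype(int)
    out = np.full((num_frames, 2), np.nan)
    for f in range(num_frames):
        in_frame = gaze_samples[frame_of_sample == f]
        if len(in_frame):
            out[f] = in_frame.mean(axis=0)
    return out


raw = [[0.0, 0.0], [0.2, 0.4], [1.0, 1.0], [0.8, 0.6]]  # four samples at 60 Hz
print(gaze_per_frame(raw))  # two frames: (0.1, 0.2) and (0.9, 0.8)
```

Averaging smooths tracker jitter but also blurs saccades; taking the sample closest to the frame timestamp is an equally reasonable alternative.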

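Experimental results and comparison

Main results

Method	Input modality	Accuracy	F1-Score
EyeCue (this paper)	Video + gaze	74.38%	0.742
TimeSformer	Video only	67.21%	0.668
Gaze-Only	Gaze only	61.53%	0.612
DCDD	Image + gaze	66.42%	0.659
VideoMAE	Video only	65.88%	0.653
ViViT	Video only	64.92%	0.645

Performance by scene

Scene	Accuracy	Notes
Highway	76.2%	Relatively uniform scenes
Urban roads	73.8%	High scene complexity
Residential	71.5%	Low-speed scenes
Complex traffic	70.3%	Many objects in the scene
Average (unweighted)	72.95%	5.9-point spread between best and worst scene

Ablation study


graph LR
    A[Baseline: 67.21%] --> B[+ Gaze encoder: 69.45%]
    B --> C[+ Patch selection: 71.82%]
    C --> D[+ Cross-attention: 74.38%]

    style D fill:#4ecdc4

Component	Accuracy	Gain (percentage points)
TimeSformer (baseline)	67.21%	-
+ GazeEncoder	69.45%	+2.24
+ Patch Selection	71.82%	+2.37
+ Cross-Attention	74.38%	+2.56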
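
The gain column can be reproduced directly from the accuracy column. The short sketch below recomputes it; note that the gains are differences in percentage points, not relative improvements. The numbers are the ones from the table, and `incremental_gains` is a hypothetical helper name.

```python
ablation = [
    ("TimeSformer (baseline)", 67.21),
    ("+ GazeEncoder", 69.45),
    ("+ Patch Selection", 71.82),
    ("+ Cross-Attention", 74.38),
]


def incremental_gains(rows):
    """Return (name, gain over the previous row) in percentage points."""
    return [
        (name, round(acc - prev_acc, 2))
        for (_, prev_acc), (name, acc) in zip(rows, rows[1:])
    ]


gains = incremental_gains(ablation)
total_gain = round(ablation[-1][1] - ablation[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f} points")
print(f"Total over baseline: +{total_gain:.2f} points")  # +7.17 points
```

Each component contributes between 2.2 and 2.6 points, so no single module accounts for the bulk of the 7.17-point improvement over the video-only baseline.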

Deployment guide

Hardware requirements

Item	Minimum	Recommended
GPU	GTX 1080 (8GB)	RTX 3080 (10GB)
CPU	4 cores	8 cores
RAM	16GB	32GB
Storage	50GB SSD	100GB NVMe

Installation

# 1. Clone the code
git clone https://github.com/langzhang2000/EyeCue.git
cd EyeCue

# 2. Create a virtual environment
conda create -n eyecue python=3.9
conda activate eyecue

# 3. Install PyTorch (CUDA 11.8)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118

# 4. Install dependencies
pip install -r requirements.txt

# 5. Download the dataset
# Download the CogDrive dataset from Google Drive
# Link: https://drive.google.com/drive/folders/1m3Xh8aVtOiX9IGyhLH9TlP4nX6gSrWi3

Training command

python new_train.py \
  --train_list data/train.txt \
  --val_list data/val.txt \
  --batch_size 4 \
  --epochs 15 \
  --lr 1e-5 \
  --clip_len 8 \
  --save_dir checkpoints

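Inference code

import torch
from models import EyeCue

# Load the model (map_location lets a GPU-trained checkpoint load on CPU)
model = EyeCue(num_frames=8)
state_dict = torch.load("checkpoints/best_model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Inference
def detect_cognitive_distraction(video_path, gaze_data):
    """
    Detect cognitive distraction.

    Args:
        video_path: path to the video file
        gaze_data: (T, 2) array of gaze coordinates

    Returns:
        is_distracted: bool, True if P(distracted) > 0.5
        confidence: float, predicted probability of the distracted class
    """
    # load_video is not defined in this post; it is assumed to be a
    # user-supplied helper returning a float tensor of shape (T, 3, 224, 224)
    frames = load_video(video_path)
    gaze = torch.from_numpy(gaze_data).float()

    # Forward pass
    with torch.no_grad():
        logits = model(frames.unsqueeze(0), gaze.unsqueeze(0))

    # Interpret the result
    prob = torch.softmax(logits, dim=1)
    confidence = prob[0, 1].item()
    is_distracted = confidence > 0.5  # plain Python bool, not a tensor

    return is_distracted, confidence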
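
One detail worth spelling out is the difference between the probability of the distracted class and the confidence of the decision. For a clip classified as normal, the former is low while the latter is high. The following pure-Python sketch illustrates this for two-class logits ordered as [normal, distracted]; `decide` is a hypothetical helper, not part of the EyeCue repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Return (is_distracted, p_distracted, confidence) for [normal, distracted] logits."""
    p_normal, p_distracted = softmax(logits)
    is_distracted = p_distracted > threshold
    confidence = max(p_normal, p_distracted)  # probability of the predicted class
    return is_distracted, p_distracted, confidence


is_distracted, p_distracted, confidence = decide([2.0, 0.0])
print(is_distracted, round(p_distracted, 4), round(confidence, 4))  # False 0.1192 0.8808
```

In a deployed DMS the threshold would normally be tuned on validation data rather than fixed at 0.5, since false alarms and missed detections carry different costs.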

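Implications for IMS development

Direct value for IMS

1. Roadmap for bringing cognitive distraction detection to production

Phase	Timeline	Goal	Key techniques
Phase 1	Q3 2026	Prototype validation	Reproduce EyeCue, evaluate performance
Phase 2	Q4 2026	Engineering	Model compression, edge deployment
Phase 3	Q1 2027	Product integration	Fusion with the existing DMS

2. Hardware selection

Hardware	Current setup	EyeCue requirement	Gap analysis
IR camera	✅ OV2311 (2MP)	RGB camera	An RGB camera must be added
Eye tracking	⚠️ To be verified	High-accuracy gaze	Verify the accuracy of the current eye tracker
Processor	✅ QCS8255 (26 TOPS)	GPU inference	NPU port needs optimization

3. Technical challenges and mitigations

Challenge	Impact	Mitigation
Large model size	Cannot be deployed on embedded hardware	Knowledge distillation + quantization
High gaze accuracy requirement	Existing devices may fall short	Gaze data augmentation
Real-time requirement	Inference latency must be <100ms	Model pruning + TensorRT (NVIDIA targets) or SNPE (Qualcomm targets)
Cross-scene generalization	Chinese road scenes differ	Fine-tune on data collected in China

Technical roadmap


graph TB
    A[Paper reproduction<br/>2026 Q3] --> B[Model compression<br/>2026 Q4]
    B --> C[NPU port<br/>2026 Q4]
    C --> D[Data collection<br/>2027 Q1]
    D --> E[Scene fine-tuning<br/>2027 Q1]
    E --> F[Product integration<br/>2027 Q2]

    style A fill:#4ecdc4
    style F fill:#ff6b6b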

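Code reproduction priorities

High priority (start immediately)

Set up the training environment
- Download the CogDrive dataset
- Verify the training pipeline
- Evaluate baseline performance
Model compression experiments
- Replace TimeSformer with MobileViT
- Quantize to INT8
- Pruning and sparsification
NPU deployment validation
- ONNX export
- Qualcomm SNPE conversion
- Inference latency testing

Medium priority (Q4 2026)

Data collection for Chinese driving scenes
- Record real driving scenes
- Annotate gaze data
- Collect cognitive distraction labels
Model fine-tuning
- Fine-tune on data collected in China
- Optimize for specific scenes
- Improve cross-scene generalization

Low priority (to be scheduled)

Multimodal fusion extensions
- Fuse vehicle CAN data
- Fuse physiological signals (optional)
- Improve detection robustness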
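
As a rough illustration of what "quantize to INT8" involves, the sketch below applies symmetric per-tensor quantization to a random weight matrix in NumPy. It is a conceptual example only and does not reflect the SNPE toolchain or any specific quantization API; the 768×768 shape simply mirrors one projection matrix of the model, and the function names are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 -> (int8 values, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(768, 768)).astype(np.float32)

q, scale = quantize_int8(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())

print(q.dtype, w.nbytes // q.nbytes)  # int8 4
print(max_err <= scale / 2 + 1e-6)    # True: rounding error is at most half a step
```

Storage drops by 4×, and the worst-case rounding error per weight is half a quantization step. Whether accuracy survives is an empirical question that needs a calibration set and re-evaluation on CogDrive.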

Key implementation details

Loss function

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """
    Focal Loss for handling class imbalance

    FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

    Args:
        alpha: weighting factor that balances positive and negative samples
        gamma: focusing parameter that down-weights easy samples
    """

    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        """
        Args:
            logits: (B, 2) model outputs
            targets: (B,) labels in {0, 1}
        """
        ce_loss = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # predicted probability of the true class

        # Focal term
        focal_term = (1 - pt) ** self.gamma

        # Alpha balancing
        alpha_t = torch.where(targets == 1, self.alpha, 1 - self.alpha)

        loss = alpha_t * focal_term * ce_loss

        return loss.mean()
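
A small numeric example shows what the focusing parameter does. The sketch below evaluates the formula from the docstring for a single sample in plain Python, comparing an easy positive (p_t = 0.9) with a harder one (p_t = 0.6); the probabilities are illustrative values, not results from the paper.

```python
import math


def focal_loss(p_t, is_positive, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one sample."""
    alpha_t = alpha if is_positive else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, is_positive=True)
hard = focal_loss(0.6, is_positive=True)

# Plain cross-entropy ratio between the same two samples, for comparison
ce_ratio = math.log(0.6) / math.log(0.9)

print(f"cross-entropy ratio hard/easy: {ce_ratio:.2f}")
print(f"focal loss ratio hard/easy:    {hard / easy:.2f}")
```

With gamma = 2 the focal term multiplies the hard/easy ratio by (0.4 / 0.1)² = 16, so the harder sample dominates the gradient far more than under plain cross-entropy.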

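Data augmentation

import numpy as np
import albumentations as A
from albumentations.pytorch import ToTensorV2

def get_training_augmentation():
    """
    Image augmentation pipeline used during training.

    Note: the geometric transforms below move the image content. If they are
    used, the gaze coordinates must be transformed the same way, and the same
    random parameters should be shared by all frames of a clip.
    """
    return A.Compose([
        # Geometric transforms
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(
            shift_limit=0.1,
            scale_limit=0.1,
            rotate_limit=15,
            p=0.5
        ),

        # Color transforms
        A.OneOf([
            A.RandomBrightnessContrast(
                brightness_limit=0.2,
                contrast_limit=0.2,
                p=1
            ),
            A.HueSaturationValue(
                hue_shift_limit=20,
                sat_shift_limit=30,
                val_shift_limit=20,
                p=1
            ),
        ], p=0.5),

        # Additive Gaussian noise (newer Albumentations releases replace
        # var_limit with std_range; adjust to the installed version)
        A.GaussNoise(var_limit=(10, 50), p=0.3),

        # Blur
        A.OneOf([
            A.MotionBlur(blur_limit=5, p=1),
            A.GaussianBlur(blur_limit=5, p=1),
        ], p=0.3),

        # Normalization
        A.Normalize(mean=[0.485, 0.456, 0.406], 
                    std=[0.229, 0.224, 0.225]),

        ToTensorV2()
    ])


def get_gaze_augmentation():
    """
    Gaze data augmentation.

    Simulates eye-tracker error and jitter.
    """
    def augment_gaze(gaze_coords, noise_std=0.02):
        """
        Args:
            gaze_coords: (T, 2) gaze coordinates
            noise_std: standard deviation of the Gaussian noise
        """
        noise = np.random.normal(0, noise_std, gaze_coords.shape)
        gaze_coords = gaze_coords + noise

        # Clip to the valid range
        gaze_coords = np.clip(gaze_coords, 0, 1)

        return gaze_coords

    return augment_gaze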
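
Geometric augmentation needs extra care in this setting: flipping or shifting the frames moves the scene, and unless the gaze coordinates are transformed the same way they no longer point at what the driver was looking at. Below is a minimal NumPy sketch for the horizontal-flip case, assuming frames of shape (T, H, W, C) and gaze normalized to [0, 1]; `hflip_clip_and_gaze` is a hypothetical helper, not part of the released code.

```python
import numpy as np


def hflip_clip_and_gaze(frames, gaze):
    """Flip a clip left-right and mirror the gaze x coordinate to match.

    frames: (T, H, W, C) video clip
    gaze:   (T, 2) normalized (x, y) gaze coordinates in [0, 1]
    """
    flipped_frames = frames[:, :, ::-1, :].copy()  # reverse the width axis
    flipped_gaze = np.array(gaze, dtype=float)     # copy; leave the input untouched
    flipped_gaze[:, 0] = 1.0 - flipped_gaze[:, 0]  # mirror x, keep y
    return flipped_frames, flipped_gaze


frames = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)
gaze = np.array([[0.25, 0.5], [0.75, 0.1]])

flipped_frames, flipped_gaze = hflip_clip_and_gaze(frames, gaze)
print(flipped_frames[0, 0, :, 0])  # [2 1 0]
print(flipped_gaze[:, 0])          # [0.75 0.25]
```

The whole clip is flipped with a single decision, so all frames stay consistent with one another; the same principle applies to shifts, scales, and rotations.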

Evaluation metrics

import torch
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    confusion_matrix,
    roc_auc_score
)

def evaluate_model(model, dataloader, device):
    """
    Evaluate the model on a full set of metrics.

    Returns:
        metrics: dictionary containing all evaluation metrics
    """
    model.eval()

    all_preds = []
    all_labels = []
    all_probs = []

    with torch.no_grad():
        for videos, gazes, labels in dataloader:
            videos = videos.to(device)
            gazes = gazes.to(device)

            # Forward pass
            logits = model(videos, gazes)
            probs = torch.softmax(logits, dim=1)
            preds = torch.argmax(logits, dim=1)

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.numpy())
            all_probs.extend(probs[:, 1].cpu().numpy())

    # Compute the metrics
    metrics = {
        'accuracy': accuracy_score(all_labels, all_preds),
        'precision': precision_score(all_labels, all_preds),
        'recall': recall_score(all_labels, all_preds),
        'f1': f1_score(all_labels, all_preds),
        'auc_roc': roc_auc_score(all_labels, all_probs),
        'confusion_matrix': confusion_matrix(all_labels, all_preds)
    }

    return metrics

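Summary and outlook

Core contributions

A non-invasive framework that, per the paper, is the first to combine eye tracking with egocentric video for cognitive distraction detection
- No EEG or other physiological sensors
- Only a camera plus eye tracking
A cross-modal fusion mechanism
- Gaze-guided patch selection
- Cross-attention semantic fusion
A large-scale, multi-scene dataset
- CogDrive: 3,662 samples
- Covers multiple driving scenarios

Value for IMS

Dimension	Value
Technical advance	Addresses the hard problem of cognitive distraction detection
Product competitiveness	Goes beyond the Euro NCAP 2026 requirements
Differentiation	Multimodal fusion

Next steps

Immediately: download the paper and code, set up the experiment environment
This week: verify model performance, assess deployment feasibility
This month: define the engineering roadmap, start model compression experiments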

Generated by the OpenClaw AI research assistant | 2026-06-04

https://dapalm.com/2026/06/04/2026-06-04-EyeCue-Cognitive-Distraction-IJCAI2026/

Author: Mars

Published: June 4, 2026
