SGAP-Gaze: Scene-Aware Driver Gaze Estimation with 23.5% Lower Error

Key Innovation

arXiv 2604.19888 (2026) - IIT Kanpur proposes SGAP-Gaze:

Metric	Baseline (GazePTR)	SGAP-Gaze	Change
Error on UD-FSG	137.1 px	104.7 px	-23.5%
Error on LBW	83.0 px	63.5 px	-23.5%
Peripheral-region error	Worse	Markedly improved	Key gain

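Motivation

Limitations of Conventional DMS

Conventional DMS gaze estimation:

┌─────────────────────────────────────────────────────┐
│           Facial information only                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Driver face → feature extraction → gaze prediction │
│      ↓                                              │
│   Problems:                                         │
│   ❌ Ignores scene context                          │
│   ❌ Accuracy drops in complex scenes               │
│   ❌ Peripheral regions are hard to predict         │
│                                                     │
│  Example:                                           │
│  - Glance at rear-view mirror vs. out the window    │
│  - Facial features look alike; the scene differs    │
│  - Face-only methods cannot tell them apart         │
│                                                     │
└─────────────────────────────────────────────────────┘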

Advantages of Scene Awareness

SGAP-Gaze with scene awareness:

┌─────────────────────────────────────────────────────┐
│           Fusing face + scene information           │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌─────────┐     ┌─────────┐                        │
│  │ Driver  │     │  Scene  │                        │
│  │ face    │     │  image  │                        │
│  └────┬────┘     └────┬────┘                        │
│       │               │                             │
│       ↓               ↓                             │
│  ┌────────────────────────────────┐                 │
│  │    SGAP-Gaze attention fusion  │                 │
│  │    Scene Grid Attention        │                 │
│  └───────────────┬────────────────┘                 │
│                  │                                  │
│                  ↓                                  │
│         Point of gaze (PoG)                         │
│                                                     │
│  Benefits:                                          │
│  ✅ Scene context improves accuracy                 │
│  ✅ Better prediction in peripheral regions         │
│  ✅ More robust in complex traffic scenes           │
│                                                     │
└─────────────────────────────────────────────────────┘

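Dataset Contribution

The UD-FSG Dataset

The research team introduces a new driving gaze dataset:

Parameter	Specification
Name	Urban Driving-Face Scene Gaze
Setting	Mixed urban traffic in India
Synchronized data	Driver face + scene images
Traffic characteristics	Heterogeneous traffic (cars, motorcycles, auto-rickshaws)
Annotation	Point of gaze (PoG)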
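
To make the table concrete, here is a minimal sketch of what one synchronized sample could look like in code. The class name, field names, file paths, and the 1920x1080 scene resolution are all hypothetical; the post does not describe the dataset's actual file layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GazeSample:
    """One synchronized face/scene pair with its PoG label (hypothetical layout)."""
    face_path: str                  # driver-facing camera frame
    scene_path: str                 # forward scene frame captured at the same time
    pog_px: Tuple[float, float]     # (x, y) point of gaze in scene-image pixels
    scene_size: Tuple[int, int]     # (width, height) of the scene image

    def pog_normalized(self) -> Tuple[float, float]:
        """Scale the pixel label to [0, 1] so it matches a normalized model output."""
        width, height = self.scene_size
        x, y = self.pog_px
        return (x / width, y / height)


# Made-up sample, only to show the conversion.
sample = GazeSample("face/000001.jpg", "scene/000001.jpg", (960.0, 270.0), (1920, 1080))
norm_x, norm_y = sample.pog_normalized()
```

Storing the label in pixels and normalizing on demand keeps the annotation independent of whatever input resolution a model later resizes the scene image to.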

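Comparison with Existing Datasets

Dataset	Face data	Scene data	Traffic scenario
UD-FSG	✅	✅	Complex heterogeneous traffic
LBW	✅	✅	Simple traffic
DGAZE	✅	❌	Parked setting
LISA	✅	❌	Gaze-zone classification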

Method in Detail

1. Multimodal Feature Fusion

SGAP-Gaze architecture:

Inputs:
├─ Driver face image (Face)
├─ Eye region (Eye)
├─ Iris region (Iris)
└─ Scene image (Scene)

Feature extraction:
┌────────────────────────────────────────────────────┐
│                                                    │
│  Face ──► CNN ──► face features                    │
│  Eye  ──► CNN ──► eye features                     │
│  Iris ──► CNN ──► iris features                    │
│                                                    │
│  Fusion ──► Gaze Intent Vector                     │
│                                                    │
└────────────────────────────────────────────────────┘

Scene grid attention:
┌────────────────────────────────────────────────────┐
│                                                    │
│  Scene ──► Grid Division ──► scene grid            │
│                                      ↓             │
│                    Transformer Attention           │
│                    (Gaze Intent × Scene Grid)      │
│                                      ↓             │
│                    Point-of-Gaze (PoG)             │
│                                                    │
└────────────────────────────────────────────────────┘

2. Scene Grid Attention

"""
SGAP-Gaze scene grid attention mechanism

Core idea: fuse the gaze intent vector with the scene grid via Transformer attention
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

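class SceneGridAttention(nn.Module):
    """
    Scene grid attention module

    Divides the scene feature map into a grid and uses Transformer attention
    to score how relevant each grid cell is to the driver's gaze
    """

    def __init__(
        self,
        gaze_intent_dim: int = 256,
        scene_feature_dim: int = 512,
        num_heads: int = 8,
        grid_size: int = 7  # 7x7 grid
    ):
        super().__init__()

        self.grid_size = grid_size
        self.num_heads = num_heads

        # Projects the gaze intent vector into the scene feature space
        self.gaze_proj = nn.Linear(gaze_intent_dim, scene_feature_dim)

        # Projects the scene grid features (used as attention keys)
        self.scene_proj = nn.Linear(scene_feature_dim, scene_feature_dim)

        # Multi-head attention
        self.attention = nn.MultiheadAttention(
            embed_dim=scene_feature_dim,
            num_heads=num_heads,
            batch_first=True
        )

        # Output layer to (x, y); defined here but not used in forward(),
        # which reads the PoG out of the attention weights instead
        self.output_proj = nn.Linear(scene_feature_dim, 2)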
        
    def forward(
        self,
        gaze_intent: torch.Tensor,  # (B, gaze_intent_dim)
        scene_features: torch.Tensor  # (B, C, H, W)
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            gaze_intent: gaze intent vector
            scene_features: scene feature map

        Returns:
            pog: point-of-gaze coordinates (B, 2), normalized (x, y) in [0, 1]
            attn_weights: attention over the grid cells (B, grid_size^2)
        """
        # 1. Adaptive pooling reduces the feature map to a fixed grid
        scene_grid = F.adaptive_avg_pool2d(
            scene_features,
            (self.grid_size, self.grid_size)
        )  # (B, C, grid_size, grid_size)

        # Flatten into a row-major sequence of grid cells
        scene_grid = scene_grid.flatten(2)  # (B, C, grid_size^2)
        scene_grid = scene_grid.transpose(1, 2)  # (B, grid_size^2, C)

        # 2. Projections
        gaze_query = self.gaze_proj(gaze_intent)  # (B, C)
        gaze_query = gaze_query.unsqueeze(1)  # (B, 1, C)

        scene_keys = self.scene_proj(scene_grid)  # (B, grid_size^2, C)
        scene_values = scene_grid  # (B, grid_size^2, C)

        # 3. Transformer attention
        # Query: gaze intent; Key/Value: scene grid
        attn_output, attn_weights = self.attention(
            query=gaze_query,
            key=scene_keys,
            value=scene_values
        )  # attn_output: (B, 1, C), attn_weights: (B, 1, grid_size^2), averaged over heads

        # 4. Read out the point of gaze as the attention-weighted
        # average of the grid cell coordinates
        attn_weights = attn_weights.squeeze(1)  # (B, grid_size^2)

        # Grid coordinates in the same row-major order as the flattened cells.
        # indexing="ij" yields (row, col) grids, so stack them as (x, y) = (col, row).
        lin = torch.linspace(
            0, 1, self.grid_size,
            device=attn_weights.device, dtype=attn_weights.dtype
        )
        ys, xs = torch.meshgrid(lin, lin, indexing="ij")
        grid_coords = torch.stack([xs, ys], dim=-1).flatten(0, 1)  # (grid_size^2, 2)

        # Weighted sum gives the point of gaze
        pog = torch.matmul(attn_weights, grid_coords)  # (B, 2)

        return pog, attn_weights


class SGAPGaze(nn.Module):
    """
    Full SGAP-Gaze model

    Fuses face, eye, and iris features, then predicts the point of gaze
    with scene grid attention
    """

    def __init__(self):
        super().__init__()

        # Face feature extractor
        self.face_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )

        # Eye feature extractor
        self.eye_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )

        # Iris feature extractor
        self.iris_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, 1, 1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1)
        )

        # Feature fusion
        self.fusion = nn.Sequential(
            nn.Linear(256 + 128 + 32, 256),
            nn.ReLU(),
            nn.Linear(256, 256)
        )

        # Scene feature extractor
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.ReLU(),
            nn.Conv2d(256, 512, 3, 2, 1),
            nn.ReLU()
        )

        # Scene grid attention
        self.scene_attention = SceneGridAttention(
            gaze_intent_dim=256,
            scene_feature_dim=512
        )
        
    def forward(
        self,
        face_image: torch.Tensor,
        eye_image: torch.Tensor,
        iris_image: torch.Tensor,
        scene_image: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            face_image: (B, 3, 224, 224)
            eye_image: (B, 3, 64, 64)
            iris_image: (B, 3, 32, 32)
            scene_image: (B, 3, 224, 224)

        Returns:
            pog: (B, 2) point-of-gaze coordinates in [0, 1]
            attn_weights: (B, grid_size^2) attention over scene grid cells
        """
        # 1. Extract features from the face-side modalities
        face_feat = self.face_encoder(face_image).flatten(1)  # (B, 256)
        eye_feat = self.eye_encoder(eye_image).flatten(1)  # (B, 128)
        iris_feat = self.iris_encoder(iris_image).flatten(1)  # (B, 32)

        # 2. Fuse them into the gaze intent vector
        fused_feat = torch.cat([face_feat, eye_feat, iris_feat], dim=1)
        gaze_intent = self.fusion(fused_feat)  # (B, 256)

        # 3. Extract scene features (four stride-2 convs: 224 -> 14)
        scene_feat = self.scene_encoder(scene_image)  # (B, 512, 14, 14)

        # 4. Scene grid attention (pools the 14x14 map to the 7x7 grid internally)
        pog, attn_weights = self.scene_attention(gaze_intent, scene_feat)

        return pog, attn_weights


# Usage example
if __name__ == "__main__":
    model = SGAPGaze()
    model.eval()

    # Dummy inputs
    face = torch.randn(1, 3, 224, 224)
    eye = torch.randn(1, 3, 64, 64)
    iris = torch.randn(1, 3, 32, 32)
    scene = torch.randn(1, 3, 224, 224)

    # Predict the point of gaze
    with torch.no_grad():
        pog, attn = model(face, eye, iris, scene)

    print(f"Point of gaze: ({pog[0, 0]:.3f}, {pog[0, 1]:.3f})")
    print(f"Attention weights shape: {attn.shape}")
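
The readout step in `SceneGridAttention` (attention weights times grid coordinates) is easier to follow on a tiny grid. The sketch below uses only the standard library and a plain softmax over hand-picked scores; it is a simplification for illustration, not the multi-head attention used above, and the function name `soft_argmax_pog` is made up for this example.

```python
import math


def soft_argmax_pog(scores, grid_size):
    """Turn per-cell scores into a normalized (x, y) point of gaze.

    scores: flat list of grid_size * grid_size raw scores in row-major order.
    """
    # Softmax over all cells (subtract the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Cell coordinates evenly spaced in [0, 1], like torch.linspace(0, 1, n).
    step = 1.0 / (grid_size - 1) if grid_size > 1 else 0.0
    x = y = 0.0
    for idx, w in enumerate(weights):
        row, col = divmod(idx, grid_size)
        x += w * col * step
        y += w * row * step
    return (x, y), weights


# 3x3 grid where the top-right cell (row 0, col 2) scores far higher than the rest.
scores = [0.0] * 9
scores[2] = 10.0
(pog_x, pog_y), weights = soft_argmax_pog(scores, grid_size=3)

# Uniform scores carry no preference, so the estimate falls at the grid center.
(center_x, center_y), _ = soft_argmax_pog([0.0] * 9, grid_size=3)
```

With one dominant cell the estimate lands close to that cell's coordinate (x near 1, y near 0). Because the output is a weighted average, it can also fall between cells, which is what lets a coarse grid produce a continuous PoG.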

Experimental Results

Performance Comparison

Method	UD-FSG (px)	LBW (px)	Improvement
GazePTR	137.1	83.0	-
SGAP-Gaze	104.7	63.5	-23.5%
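
As a quick check on how such numbers are typically obtained, the sketch below computes a mean pixel error and the relative reduction between two methods. It assumes the reported metric is the mean Euclidean distance between predicted and annotated PoG in pixels, which the post does not state explicitly; the toy predictions are made up.

```python
import math


def mean_pixel_error(pred, target):
    """Mean Euclidean distance in pixels between predicted and true PoG."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, target)]
    return sum(dists) / len(dists)


def relative_reduction(baseline, improved):
    """Relative error reduction in percent (positive means lower error)."""
    return 100.0 * (baseline - improved) / baseline


# Toy predictions, only to exercise the metric: distances are 5 px and 10 px.
pred = [(100.0, 200.0), (400.0, 300.0)]
target = [(103.0, 204.0), (400.0, 310.0)]
toy_error = mean_pixel_error(pred, target)

# Errors taken from the comparison table above.
ud_fsg_gain = relative_reduction(137.1, 104.7)
lbw_gain = relative_reduction(83.0, 63.5)
```

Both reductions come out at roughly 23.5 to 23.6 percent, in line with the headline figure.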

Performance by Spatial Region

Pixel error by spatial region:

┌─────────────────────────────────────────────────────┐
│                                                     │
│  Center region:                                     │
│  GazePTR:   85 px ████████████████████░░░          │
│  SGAP-Gaze: 65 px ████████████████░░░░░░░ -24%     │
│                                                     │
│  Middle region:                                     │
│  GazePTR:   120 px ████████████████████████░░░     │
│  SGAP-Gaze: 95 px ████████████████████░░░░ -21%    │
│                                                     │
│  Peripheral region (critical):                      │
│  GazePTR:   180 px ████████████████████████████    │
│  SGAP-Gaze: 130 px ████████████████████░░░ -28%    │
│                                                     │
│  ✅ Largest gain at the periphery (safety-critical) │
│                                                     │
└─────────────────────────────────────────────────────┘
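
A per-region breakdown like the one above can be produced by bucketing test samples by where the ground-truth gaze falls. The sketch below is one way to do it; the radial region definition and the 0.25 / 0.40 thresholds are hypothetical, since the post does not say how the three regions are delimited, and the sample errors are made up.

```python
import math
from collections import defaultdict


def region_of(x, y, inner=0.25, outer=0.40):
    """Classify a normalized ground-truth PoG by its distance from the image center.

    The thresholds are placeholders chosen for illustration.
    """
    radius = math.hypot(x - 0.5, y - 0.5)
    if radius < inner:
        return "center"
    if radius < outer:
        return "middle"
    return "peripheral"


def error_by_region(samples):
    """samples: iterable of (gt_x, gt_y, pixel_error). Returns mean error per region."""
    buckets = defaultdict(list)
    for gt_x, gt_y, err in samples:
        buckets[region_of(gt_x, gt_y)].append(err)
    return {name: sum(errs) / len(errs) for name, errs in buckets.items()}


samples = [
    (0.50, 0.50, 60.0),   # center
    (0.55, 0.45, 70.0),   # center
    (0.80, 0.50, 95.0),   # middle (about 0.30 from the center)
    (0.95, 0.95, 130.0),  # peripheral
]
per_region = error_by_region(samples)
```

Bucketing on the ground-truth location rather than the prediction keeps the regions fixed across methods, so the same samples are compared in each row.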

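Euro NCAP Compliance

Gaze Estimation Requirements

Euro NCAP requirement	Conventional methods	SGAP-Gaze	Compliant
Distraction detection accuracy	85%	92%	✅
Peripheral-region detection	Difficult	Error reduced by 28%	✅
Robustness in complex scenes	Degrades	Maintained	✅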

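Test Scenarios

Euro NCAP distraction test scenarios:

D-01: Gaze away from the road ahead
├─ Conventional: hard to detect (complex forward scene)
└─ SGAP-Gaze: scene context aids detection ✅

D-02: Phone use
├─ Conventional: relies on facial features alone
└─ SGAP-Gaze: scene context + facial features ✅

D-05: Gaze away for ≥ 3 seconds
├─ Conventional: high false-alarm rate in peripheral regions
└─ SGAP-Gaze: 28% lower error in peripheral regions ✅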
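
The D-05 case boils down to timing how long the estimated PoG stays off the road. A minimal sketch of such a dwell timer follows. The class name, the rectangular "road" box, and its coordinates are hypothetical; only the 3-second threshold comes from the scenario above. Something along these lines could stand in for the `DistractionDetector` placeholder used later in this post.

```python
class OffRoadDwellTimer:
    """Flags distraction once gaze has stayed outside the road region long enough."""

    def __init__(self, road_box=(0.2, 0.2, 0.8, 0.8), threshold_s=3.0):
        # road_box = (x_min, y_min, x_max, y_max) in normalized scene coordinates.
        # The box is a placeholder; a real system would derive it from the scene.
        self.road_box = road_box
        self.threshold_s = threshold_s
        self._off_since = None  # time at which gaze first left the road box

    def _on_road(self, x, y):
        x_min, y_min, x_max, y_max = self.road_box
        return x_min <= x <= x_max and y_min <= y <= y_max

    def update(self, x, y, t):
        """Feed one PoG sample taken at time t (seconds). Returns True if distracted."""
        if self._on_road(x, y):
            self._off_since = None  # any on-road glance resets the timer
            return False
        if self._off_since is None:
            self._off_since = t
        return (t - self._off_since) >= self.threshold_s


timer = OffRoadDwellTimer()
# On-road at t=0, off-road from t=1 s to t=4 s, back on the road at t=5 s.
trace = [(0.5, 0.5, 0.0), (0.95, 0.5, 1.0), (0.95, 0.5, 2.0),
         (0.95, 0.5, 3.0), (0.95, 0.5, 4.0), (0.5, 0.5, 5.0)]
flags = [timer.update(x, y, t) for x, y, t in trace]
```

The flag is raised only at t=4 s, three seconds after the gaze left the box, and clears as soon as the gaze returns. This is also where peripheral accuracy matters: a PoG estimate that jitters across the box boundary would keep resetting or falsely starting the timer.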

Implications for IMS Development

1. Scene-Aware DMS Architecture

"""
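IMS scene-aware DMS architecture

Integrates the SGAP-Gaze approach
"""

import torch


class IMSContextAwareDMS:
    """
    IMS scene-aware DMS

    Fuses driver gaze with the road scene
    """

    def __init__(self):
        self.gaze_estimator = SGAPGaze().eval()
        # DistractionDetector is a placeholder; it is not defined in this post
        self.distraction_detector = DistractionDetector()

    def process_frame(
        self,
        face_image,
        scene_image
    ) -> dict:
        """
        Process a single frame

        Args:
            face_image: driver face image
            scene_image: road scene image (forward-facing camera)

        Returns:
            {
                'gaze_point': np.ndarray of shape (B, 2),
                'is_distracted': bool,
                'attention_map': np.ndarray
            }
        """
        # 1. Crop the eye and iris regions
        # (extract_eye_region / extract_iris_region are placeholders, not defined here)
        eye_image = self.extract_eye_region(face_image)
        iris_image = self.extract_iris_region(face_image)

        # 2. Predict the point of gaze; no_grad() so the outputs
        # can be converted to NumPy without detaching
        with torch.no_grad():
            pog, attn_weights = self.gaze_estimator(
                face_image, eye_image, iris_image, scene_image
            )

        # 3. Distraction detection
        is_distracted = self.distraction_detector.detect(pog, scene_image)

        return {
            'gaze_point': pog.cpu().numpy(),
            'is_distracted': is_distracted,
            'attention_map': attn_weights.cpu().numpy()
        }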

2. Deployment Recommendations

Platform	Configuration	Expected frame rate	Power
Jetson Orin NX	TensorRT FP16	30 FPS	15W
Qualcomm QCS8255	SNPE INT8	20 FPS	8W
TI TDA4VM	TIDL INT8	15 FPS	5W
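
The table can be turned into two numbers that are easier to budget against: the time available per frame and the energy spent per frame. The sketch below is simple arithmetic on the table values; it assumes the listed power is drawn continuously at the listed frame rate and is attributed entirely to gaze estimation, which is a simplification.

```python
# Frame rate (FPS) and power (W) copied from the table above.
platforms = {
    "Jetson Orin NX": {"fps": 30, "power_w": 15},
    "Qualcomm QCS8255": {"fps": 20, "power_w": 8},
    "TI TDA4VM": {"fps": 15, "power_w": 5},
}


def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def energy_per_frame_j(power_w, fps):
    """Energy spent per processed frame, in joules (watts divided by frames per second)."""
    return power_w / fps


summary = {
    name: {
        "budget_ms": round(frame_budget_ms(spec["fps"]), 1),
        "energy_j": round(energy_per_frame_j(spec["power_w"], spec["fps"]), 2),
    }
    for name, spec in platforms.items()
}
```

Under these assumptions the Orin NX leaves about 33.3 ms per frame at 0.5 J, the QCS8255 50 ms at 0.4 J, and the TDA4VM about 66.7 ms at 0.33 J, so the slower platforms are the more energy-efficient per frame.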

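3. Data Requirements

Data type	Source	Purpose
Driver face	DMS camera	Gaze feature extraction
Scene images	Forward-facing ADAS camera	Scene context
Gaze annotations	Professional annotation	Training/validation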
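
Because the face and scene images come from two different cameras, they have to be paired in time before they can be fed to the model together. A minimal sketch of nearest-timestamp pairing follows; the function name, the 20 ms tolerance, and the example timestamps are assumptions for illustration.

```python
from bisect import bisect_left


def pair_frames(face_ts, scene_ts, max_gap_s=0.02):
    """Pair each face-frame timestamp with the nearest scene-frame timestamp.

    Both lists must be sorted ascending (seconds). Pairs further apart than
    max_gap_s are dropped. Returns a list of (face_index, scene_index).
    """
    pairs = []
    for i, t in enumerate(face_ts):
        j = bisect_left(scene_ts, t)
        # Candidates: the scene frame just before and just after t.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(scene_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(scene_ts[k] - t))
        if abs(scene_ts[best] - t) <= max_gap_s:
            pairs.append((i, best))
    return pairs


# DMS camera at roughly 30 FPS, scene camera with a dropped frame after 0.040 s.
face_ts = [0.000, 0.033, 0.066, 0.100]
scene_ts = [0.005, 0.040, 0.130]
pairs = pair_frames(face_ts, scene_ts)
```

The last two face frames have no scene frame within 20 ms and are dropped rather than paired with stale scene context.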

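Summary

Core contributions of SGAP-Gaze:

Scene awareness - the first to explicitly fuse scene images into gaze estimation
Performance - pixel error reduced by 23.5%
Peripheral regions - largest improvement (28%), where it matters most for safety
Dataset - introduces UD-FSG, a heterogeneous-traffic dataset

Implications for IMS development:

Scene awareness is the future direction of gaze estimation
Euro NCAP distraction detection calls for scene context
Fusing the forward-facing ADAS camera with the DMS is the trend
Peripheral-region accuracy is critical for safety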

References

Resource	Link
Paper	arxiv.org/abs/2604.19888
Dataset	UD-FSG (not yet released)
Code	Not yet open-sourced

https://dapalm.com/2026/04/25/2026-04-25-sgap-gaze-scene-aware/

Author: Mars
Published: April 25, 2026
