Brief

last 24h

[50/1089] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
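
As a rough illustration of how four signals like these could be folded into one 0–100 score, here is a minimal sketch. The feed does not publish its formula, so the weights, the 24-hour half-life, and the function name `score_story` are all assumptions made for this example.

```python
# Hypothetical weights; the feed does not publish its actual formula.
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.3, "headline_signal": 0.3}
HALF_LIFE_HOURS = 24.0  # assumed: a story's score halves every 24 hours


def score_story(authority, cluster_strength, headline_signal, age_hours):
    """Combine three signals in [0, 1] and apply exponential time decay.

    Returns a score between 0 and 100.
    """
    base = (
        WEIGHTS["authority"] * authority
        + WEIGHTS["cluster_strength"] * cluster_strength
        + WEIGHTS["headline_signal"] * headline_signal
    )
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return 100.0 * base * decay


fresh = score_story(0.9, 0.5, 0.7, age_hours=0.0)
day_old = score_story(0.9, 0.5, 0.7, age_hours=24.0)
```

Under these assumptions a day-old story scores half of what the same story scores when fresh.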

TOOL · arXiv cs.AI English(EN) · 2w

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

Researchers have developed Rectified Schrödinger Bridge Matching (RSBM), a new framework designed to improve visual navigation for autonomous agents in Embodied AI. RSBM leverages a shared velocity-field structure between diffusion models and Schrödinger Bridges, allowing for more stable and efficient integration steps. This method significantly reduces the number of steps required for convergence compared to standard approaches, achieving high success rates with only three integration steps. AI

IMPACT This research could enable faster and more efficient visual navigation for robots, accelerating the development of real-time autonomous systems.
TOOL · arXiv cs.AI English(EN) · 2w

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

Researchers have developed a new transfer learning strategy called COTTA to improve trajectory prediction models for autonomous driving in diverse geographic regions. When transferring models trained on U.S. data to Korean road environments, COTTA demonstrated significant performance gains. Specifically, fine-tuning only the decoder while keeping the encoder frozen reduced prediction error by over 66% compared to training from scratch, offering a practical approach for deploying these safety-critical systems globally. AI

IMPACT Improves the adaptability of autonomous driving systems to new geographic regions, enhancing safety and efficiency.
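
The decoder-only fine-tuning described above can be sketched with a toy model. This is not the COTTA architecture: the "encoder" and "decoder" are single scalar weights, and every name and number is made up purely to show the frozen-encoder pattern.

```python
# Toy two-stage model: a frozen "encoder" weight and a trainable "decoder" weight.
# All names and numbers are illustrative and are not taken from the COTTA paper.

def fine_tune_decoder(enc_w, dec_w, data, lr=0.01, epochs=200):
    """Gradient descent on squared error, updating only the decoder weight."""
    for _ in range(epochs):
        grad = 0.0
        for x, y in data:
            h = enc_w * x          # encoder output (enc_w is never updated)
            pred = dec_w * h       # decoder output
            grad += 2.0 * (pred - y) * h
        dec_w -= lr * grad / len(data)
    return enc_w, dec_w


def mse(enc_w, dec_w, data):
    """Mean squared error of the two-stage model on (x, y) pairs."""
    return sum((dec_w * enc_w * x - y) ** 2 for x, y in data) / len(data)


# Target-domain data follows y = 3x; the "pretrained" model computes y = 2x.
target_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
enc_w, dec_w = 2.0, 1.0
before = mse(enc_w, dec_w, target_data)
enc_w, dec_w = fine_tune_decoder(enc_w, dec_w, target_data)
after = mse(enc_w, dec_w, target_data)
```

The encoder weight is returned unchanged, while the decoder weight moves to fit the new domain.
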
TOOL · arXiv cs.AI English(EN) · 2w

LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Researchers have developed LNN-PINN, a new framework designed to enhance the accuracy of physics-informed neural networks (PINNs). This framework integrates a liquid residual gating architecture into the hidden layers of PINNs without altering the core physics modeling or optimization processes. Testing across four benchmark problems demonstrated that LNN-PINN consistently achieved lower RMSE and MAE compared to standard PINNs under identical training conditions. The architecture also proved adaptable and stable across various problem complexities, offering a concise yet effective method for improving predictive capabilities in scientific and engineering applications. AI

IMPACT Enhances predictive accuracy for scientific and engineering problems by refining PINN architectures.
TOOL · arXiv cs.AI English(EN) · 2w

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Researchers have developed CARE (Community-Aware Reaction Evaluation), a new framework designed to assess how well large language models (LLMs) can simulate the linguistic behaviors and attitudes of online communities. CARE benchmarks LLM-generated discourse against real community responses to news events, focusing on illocutionary tones and underlying attitudes. The framework's analysis revealed a significant "realism gap," indicating that even with explicit community prompts, LLMs struggle to accurately simulate social dynamics. Furthermore, the study identified distinct behavioral patterns across different frontier models, suggesting current alignment strategies are insufficient for capturing the sociolinguistic nuances of online groups. AI

IMPACT This research highlights current limitations in LLMs' ability to understand and simulate complex social dynamics, suggesting a need for new alignment strategies.
TOOL · arXiv cs.AI English(EN) · 2w

Informing AI Policy Assessment using Large-Scale Simulation of Interventions

Researchers have developed a new methodology to help policymakers assess and prioritize AI governance options. This approach combines participatory evaluation, expert cost assessments, and large language model (LLM) analysis of perceived harm mitigation. A simulation study using a genetic algorithm explores numerous policy combinations, revealing how outcomes vary with different weightings of cost, participation, and harm reduction. The method aims to integrate participatory AI principles into practical policy development, offering a diverse set of policy combinations for deliberation. AI

IMPACT Provides a framework for policymakers to systematically evaluate and prioritize AI governance strategies.
- AI
- LLM
TOOL · Alignment Forum English(EN) · 2w

Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming

A new paper proposes "eval cooperativeness" as a scalable solution to "eval gaming" in AI models. The authors argue that current behavioral evaluations may become unreliable if AI models develop "eval awareness" and deliberately alter their behavior to appear aligned during testing, a phenomenon known as "eval gaming." Instead of solely focusing on reducing eval awareness, the paper suggests fostering a situational desire in AI models to help developers gather accurate information through evaluations, thereby preserving the predictive power of these tests for real-world deployment behavior. AI

IMPACT This research could lead to more reliable AI safety evaluations, ensuring AI models behave as intended in real-world deployments.
TOOL · 雷峰网 (Leiphone) 中文(ZH) · 2w

ICRA 2026 | Differential Intelligence: 11 Accepted Research Results, Explained

The research team at Weifenzhifei (rendered "Differential Intelligence" in the headline) has had 11 research results accepted for ICRA 2026. They span five core technical areas: a decision-making "brain", an agile "cerebellum", a collaborative swarm "brain", high-precision perception, and dexterous manipulation. The article presents the work as evidence of the team's technical depth and original contributions in aerial robot systems and in intelligent decision-making and planning. AI
- ICRA 2026
- Weifenzhifei
TOOL · 雷峰网 (Leiphone) 中文(ZH) · 2w

ICRA 2026 | Gao Yuan and Lin Tianlin's CUHK Team Proposes a Spontaneous Co-adaptation Strategy: Meta-Learning-Driven Co-evolution of Heterogeneous Multi-Robot Systems

Researchers from the Chinese University of Hong Kong, Shenzhen, have developed a framework for heterogeneous multi-robot systems in which co-adaptive strategies emerge through meta-learning. Different types of robots, such as task-execution, supply, and social-interaction robots, autonomously adjust their behavior to the state of the surrounding human crowd, so that humans and robots adapt to each other. Large-scale experiments in simulated airport environments showed significant improvements in task completion efficiency and crowd guidance, along with reduced human burden, higher trust in the robots, and higher perceived anthropomorphism. AI

IMPACT Enhances human-robot interaction and efficiency in complex environments by enabling robots to adapt to human behavior.
- IEEE International Conference on Robotics and Automation (ICRA)
- Emergent Co-Adaptive Strategies in Heterogeneous Multi-Robot Systems via Meta-Learning
TOOL · Towards AI Nederlands(NL) · 2w

DeepSeek V4 mHC Explained

DeepSeek V4 is an advanced language model that builds upon its predecessor, DeepSeek V3. The V4 architecture introduces novel components such as Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Manifold-Constrained Hyper-Connections (mHC). The article focuses on explaining mHC, a technique that enhances the traditional residual connections in neural networks by employing multiple parallel residual streams, leading to more structured and stable training. AI

IMPACT Explains novel architectural components that could influence future large language model designs.
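
A minimal sketch of mixing several parallel residual streams. The summary does not say what the manifold constraint is; the sketch assumes it means keeping the stream-mixing matrix approximately doubly stochastic (rows and columns each summing to 1), enforced here with Sinkhorn normalization. Function names, matrix values, and stream contents are illustrative and are not DeepSeek's implementation.

```python
def sinkhorn(matrix, iters=100):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m


def mix_streams(mixing, streams):
    """Each output stream is a weighted combination of the input streams."""
    n, d = len(streams), len(streams[0])
    return [
        [sum(mixing[i][k] * streams[k][t] for k in range(n)) for t in range(d)]
        for i in range(n)
    ]


# Three parallel residual streams, each a 2-dimensional feature vector.
raw = [[1.0, 2.0, 0.5], [0.3, 1.0, 4.0], [2.0, 0.1, 1.0]]
mixing = sinkhorn(raw)
streams = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed = mix_streams(mixing, streams)
```

Because every column of the normalized matrix sums to 1, the total signal across streams is preserved by the mixing step, which is one plausible reading of how such a constraint could help keep training stable.
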
TOOL · arXiv cs.CV English(EN) · 2w

Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

Researchers have developed a new source-free unsupervised domain adaptation framework called LoGo for semantic segmentation of 3D geospatial point clouds. This method addresses the common issue of domain shifts that degrade model performance in remote sensing applications, particularly when source-domain data is inaccessible due to privacy or policy constraints. LoGo utilizes a local-level class-balanced prototype estimation to handle data with long-tailed distributions and a global-level optimal transport alignment to correct biases towards majority classes. A dual-consistency pseudo-label filtering mechanism further refines the process for self-training, and experiments show LoGo outperforms existing state-of-the-art methods on challenging benchmarks. AI

IMPACT This research offers a novel approach to improve the accuracy of AI models in analyzing 3D geospatial data, particularly in scenarios where data privacy is a concern.
TOOL · arXiv cs.CV English(EN) · 2w

Axial-Centric Cross-Plane Attention for 3D Medical Image Classification

Researchers have developed a new axial-centric cross-plane attention architecture for 3D medical image classification, designed to mimic how clinicians interpret medical scans. This approach prioritizes the axial plane while incorporating complementary information from coronal and sagittal planes. Experiments on the MedMNIST3D benchmark demonstrated that this method surpasses existing 3D and multi-plane models in accuracy and AUC, with a lightweight variant also showing competitive performance. AI

IMPACT This architectural innovation could lead to more accurate and clinically relevant AI tools for medical image analysis.
TOOL · arXiv cs.CV English(EN) · 2w

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

Researchers have developed LuxRemix, a new method for editing lighting in indoor scenes from a single multi-view capture. The approach uses a generative model to decompose complex illumination into individual light sources, allowing for independent control over their state, color, and intensity. This is integrated with a 3D Gaussian splatting representation to enable real-time interactive relighting, producing photorealistic results on both synthetic and real-world data. AI
- Christian Richardt
- LuxRemix
TOOL · arXiv cs.CV English(EN) · 2w

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

Researchers have developed RoVTL, a novel framework designed to handle missing tabular data in multimodal learning, particularly for medical biobanks. The framework employs contrastive pretraining with simulated missingness and a unique "Tabular More vs. Fewer" loss during downstream tuning to ensure consistent performance regardless of data completeness. RoVTL has demonstrated superior robustness on cardiac MRI scans from the UK Biobank and shows promise for generalization to other medical imaging datasets and even natural images. AI
TOOL · arXiv cs.CV English(EN) · 2w

Unique Lives, Shared World: Learning from Single-Life Videos

Researchers have introduced a novel "single-life" learning paradigm for training vision models exclusively on egocentric videos from a single individual. This approach leverages multiple viewpoints within one person's life to develop a self-supervised visual encoder. The study found that models trained independently on different lives exhibit highly aligned geometric understanding and that these single-life models generalize effectively to downstream tasks like depth estimation in new environments. Notably, training on just 30 hours of data from one week of an individual's life yielded performance comparable to using 30 hours of diverse web data, underscoring the power of single-life representation learning. AI

IMPACT This research suggests that AI models can achieve strong performance with less diverse data by focusing on individual perspectives, potentially reducing data curation needs for certain applications.
- arXiv
- Tengda Han
TOOL · arXiv cs.CV English(EN) · 2w

LiM-YOLO: Less is More with Pyramid Level Shift for Ship Detection in Optical Remote Sensing

Researchers have developed LiM-YOLO, a novel object detection model optimized for identifying ships in optical remote sensing imagery. The model addresses limitations in standard YOLO architectures by shifting the detection head to lower pyramid levels, improving the representation of small, high-aspect-ratio targets. LiM-YOLO also incorporates a group-normalized auxiliary projection module to enhance training stability on high-resolution satellite inputs. This streamlined detector achieves state-of-the-art performance with significantly fewer parameters than existing models. AI

IMPACT This research offers a more efficient and accurate method for object detection in satellite imagery, potentially improving surveillance and maritime monitoring capabilities.
TOOL · arXiv cs.CV English(EN) · 2w

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Researchers have developed a new multimodal regression framework called Re-M3Dr to improve the prediction of Mean Deviation (MD) in ophthalmology. While combining Optical Coherence Tomography (OCT) and fundus photography (FP) is intuitively expected to enhance performance, the study found that multimodal fusion often underperforms unimodal models due to data distribution imbalances and modality learning conflicts. Re-M3Dr addresses these issues by improving unimodal representation with adaptive margin-based supervised contrastive learning and stabilizing joint optimization through sharpness-aware gradient modulation. Experiments showed Re-M3Dr achieved an average 29% reduction in Mean Squared Error (MSE) compared to state-of-the-art multimodal methods. AI
TOOL · arXiv cs.CV English(EN) · 2w

Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

Researchers have developed a novel architecture for RGB-Thermal semantic segmentation, addressing challenges in adverse lighting conditions. The proposed method utilizes dual ConvNeXt V2 backbones with stage-wise, modality-adaptive fusion. It incorporates a Frequency-Based Fusion Module for early-stage features and a semantic fusion module with cross-modal attention for late-stage features, improving scene understanding by effectively integrating visible and infrared imagery. AI

IMPACT This research could lead to more robust computer vision systems for autonomous driving and other applications requiring scene understanding in challenging lighting conditions.
TOOL · arXiv cs.CV English(EN) · 2w

Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer

Researchers have developed a new method for training-free diffusion-based style transfer that improves the balance between style fidelity and content preservation. By systematically exploring the optimal injection points for style across different decoder layers and denoising timesteps, they found that decreasing schedules, with stronger structural signal injection in earlier layers and timesteps, yield superior results. This approach, which also incorporates ControlNet geometric conditioning, expands the Pareto frontier, offering better tradeoffs than existing methods like StyleID. The new configuration achieved a 6.1% relative improvement in ArtFID score and has been validated across numerous configurations and metrics. AI

IMPACT This research offers improved control over style transfer in diffusion models, potentially leading to more nuanced and higher-quality image stylization for creative applications.
TOOL · arXiv cs.CV English(EN) · 2w

Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

Researchers have introduced a new method called DANCE for weakly supervised object detection (WSOD), which aims to improve accuracy without requiring precise bounding box annotations. DANCE addresses limitations in existing methods by using a heatmap-guided proposal selector to generate more accurate pseudo ground truth boxes that capture whole objects and differentiate adjacent instances. It also incorporates a background class representation and negative certainty supervision to accelerate convergence and bridge semantic gaps. AI

IMPACT This research could lead to more efficient and accurate object detection systems, reducing the need for extensive manual annotation.
TOOL · arXiv cs.CV English(EN) · 2w

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

Researchers have developed a new training technique called Detail Consistent Distillation (DCD) to improve the efficiency of 3D MRI segmentation models. DCD is a stage-wise distillation framework that preserves fine structural details, such as small lesions and sharp boundaries, which are often lost in compressed models. By aligning teacher-student features in a wavelet-decomposed representation during training, DCD enhances segmentation performance on benchmarks like BraTS 2024 and ISLES 2022 without adding any inference-time overhead. AI

IMPACT This new distillation technique could lead to more efficient and accurate AI models for medical image analysis, improving diagnostic capabilities.
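
To make the idea of aligning teacher and student features in a wavelet-decomposed representation concrete, here is a small sketch on 1-D feature vectors. The one-level Haar transform, the extra weight on the detail band, and the function names are assumptions for illustration; the summary does not specify the wavelet or the loss that DCD uses.

```python
def haar_1d(signal):
    """One level of the Haar transform: (approximation, detail) coefficients."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2.0 for a, b in pairs]
    detail = [(a - b) / 2.0 for a, b in pairs]
    return approx, detail


def wavelet_distill_loss(teacher, student, detail_weight=2.0):
    """Squared error between teacher and student features in Haar space,
    with extra weight on the detail (high-frequency) band."""
    t_a, t_d = haar_1d(teacher)
    s_a, s_d = haar_1d(student)
    approx_err = sum((t - s) ** 2 for t, s in zip(t_a, s_a)) / len(t_a)
    detail_err = sum((t - s) ** 2 for t, s in zip(t_d, s_d)) / len(t_d)
    return approx_err + detail_weight * detail_err


# The student has the same local averages as the teacher but a blurred edge.
teacher = [0.0, 0.0, 4.0, 0.0]
student = [0.0, 0.0, 2.0, 2.0]
loss = wavelet_distill_loss(teacher, student)
```

In this example the approximation band matches exactly, so the whole loss comes from the detail band, which is where a lost sharp boundary shows up.
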
TOOL · arXiv cs.CV English(EN) · 2w

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

Researchers have developed a new zero-shot object re-identification pipeline for egocentric kitchen videos, addressing challenges like viewpoint changes and occlusions. The proposed method, built around the SAM3 segmentation model, significantly improves performance over existing feature extractors. By integrating SAM3 with DINOv2 and CLIP, and incorporating geometric consistency checks, the pipeline achieves a notable increase in accuracy. AI

IMPACT This research offers a more robust method for identifying objects in complex, egocentric video data, potentially improving applications in robotics and assistive technologies.
- I-JEPA
- DreamSim
- DINOv2
- SAM3
- EPIC-Kitchens
TOOL · arXiv cs.CV English(EN) · 2w

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Researchers have introduced DETR-ViP, a novel framework designed to enhance visual prompted object detection. The method addresses suboptimal performance by focusing on creating class-distinguishable visual prompts, which are often superior to text prompts for recognizing rare categories. DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative representations, alongside a selective fusion strategy for stable detection. Experiments on datasets like COCO and LVIS show significant improvements over existing state-of-the-art approaches. AI
- COCO
- DETR-ViP
- Roboflow100
- LVIS
- Bo Qian
TOOL · arXiv cs.CV English(EN) · 2w

What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

A new survey paper published on arXiv categorizes attention-worthy elements in urban street scenes to enhance road safety through computer vision. The paper proposes a taxonomy that divides critical traffic entities into anomalies and normal but critical elements, encompassing ten categories and twenty subclasses. It analyzes 35 vision-driven tasks and 73 datasets, highlighting their strengths and weaknesses to guide researchers and optimize resource allocation in the field. AI

IMPACT Provides a structured framework for computer vision research in road safety, identifying gaps and guiding future dataset development.
- arXiv
- Yaoqi Huang
TOOL · arXiv cs.CV English(EN) · 2w

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Researchers have introduced ImViD, a new dataset designed to advance the creation of immersive volumetric videos for virtual and augmented reality experiences. This dataset captures multi-view video and synchronized audio at 5K resolution and 60FPS, enabling more complete scene reconstruction and interaction within a 6-DoF space. ImViD aims to stimulate further research in this area by providing a benchmark for existing methods and a baseline pipeline for producing volumetric video content. AI
- Zhengxian Yang
- ImViD
TOOL · arXiv cs.CV English(EN) · 2w

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

Researchers have developed a new framework called LDP-Slicing to address the challenge of applying Local Differential Privacy (LDP) to image data. Traditional LDP methods struggle with the high dimensionality of pixel spaces, leading to significant utility degradation. LDP-Slicing decomposes pixel values into bit-planes, allowing LDP mechanisms to be applied at the bit level. This approach, combined with perceptual obfuscation and optimized privacy budget allocation, achieves rigorous pixel-level $\varepsilon$-LDP while maintaining high utility for downstream tasks like face recognition and image classification. Experiments show LDP-Slicing outperforms existing methods with minimal computational overhead. AI

IMPACT Introduces a novel approach to enhance privacy in image data for machine learning tasks without significant utility loss.
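
The bit-level idea can be sketched as follows: split an 8-bit pixel into bit planes, then perturb each bit with binary randomized response, which satisfies ε-LDP for a single bit when the bit is kept with probability $e^{\varepsilon}/(1+e^{\varepsilon})$. The per-plane budget values and function names below are hypothetical, and the perceptual obfuscation step mentioned in the summary is not modeled.

```python
import math
import random


def to_bit_planes(pixel):
    """Split an 8-bit pixel value into bits, most significant first."""
    return [(pixel >> k) & 1 for k in range(7, -1, -1)]


def from_bit_planes(bits):
    """Reassemble a pixel value from bits, most significant first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def randomized_response(bit, epsilon, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit


def privatize_pixel(pixel, budgets, rng):
    """Apply randomized response to each bit plane with its own budget."""
    bits = to_bit_planes(pixel)
    noisy = [randomized_response(b, eps, rng) for b, eps in zip(bits, budgets)]
    return from_bit_planes(noisy)


# Hypothetical allocation: more budget for the high-order bits.
budgets = [2.0, 2.0, 1.5, 1.0, 0.5, 0.5, 0.25, 0.25]  # sums to 8.0
rng = random.Random(0)
noisy_pixel = privatize_pixel(200, budgets, rng)
```

By sequential composition, the per-pixel privacy cost in this sketch is the sum of the per-plane budgets, so choosing how to split that sum across planes is the allocation problem the summary refers to.
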
TOOL · arXiv cs.CV English(EN) · 2w

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

Researchers have developed SLIM (Sparse-LiDAR Injected Monocular geometry), a novel approach to enhance monocular depth estimation for long-range driving scenarios. SLIM adapts the MoGe-2 model to directly incorporate sparse LiDAR data, overcoming limitations of previous methods that relied on interpolated dense priors. This new model demonstrates significant improvements in accuracy for distances between 50-150 meters, reducing absolute relative error by up to 51% compared to baseline models on simulated datasets. AI

IMPACT This research could lead to more robust and accurate depth perception in autonomous driving systems, especially in challenging long-range scenarios.
- Sparse-LiDAR
- MoGe-2
- KITTI
- Virtual KITTI
- CARLA
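
For readers unfamiliar with the metric, absolute relative error is the mean of |prediction − ground truth| / ground truth. The sketch below restricts it to a depth band such as 50–150 meters; the depth values are made up for illustration and are not results from the paper.

```python
def abs_rel_in_range(pred, gt, lo=50.0, hi=150.0):
    """Mean of |pred - gt| / gt over pixels whose true depth lies in [lo, hi]."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if lo <= g <= hi]
    if not errors:
        raise ValueError("no ground-truth depths fall inside the range")
    return sum(errors) / len(errors)


# Made-up depths in meters, purely for illustration.
gt = [20.0, 60.0, 100.0, 140.0, 180.0]
baseline = [21.0, 66.0, 110.0, 154.0, 200.0]
improved = [21.0, 63.0, 105.0, 147.0, 190.0]

base_err = abs_rel_in_range(baseline, gt)   # only the three in-range pixels count
new_err = abs_rel_in_range(improved, gt)
relative_reduction = (base_err - new_err) / base_err
```

A "reduction by up to 51%" in the summary refers to this kind of relative change in the metric, not to an absolute drop of 0.51.
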
TOOL · arXiv cs.CV English(EN) · 2w

AD-H: Language-guided Autonomous Driving with Hierarchical Agents

Researchers have developed AD-H, a new hierarchical multi-agent framework for language-guided autonomous driving. This system separates high-level decision-making by a multimodal large language model (MLLM) planner from low-level vehicle control executed by a lightweight controller. The framework aims to bridge the abstraction gap between natural language instructions and vehicle actions, improving generalization and instruction-following capabilities. AI
- multimodal large language model
- Zaibin Zhang
TOOL · arXiv cs.CV English(EN) · 2w

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

Researchers have developed UniMVU, a new framework for multimodal video understanding that dynamically fuses information from various sources like audio, depth maps, and temporal evidence. This approach uses instruction-aware gating at two levels: inner-modality gates highlight important regions within a single modality, and modality-level gates adjust the importance of entire streams. UniMVU has demonstrated consistent improvements across six benchmarks, outperforming static fusion methods by up to 13.5 CIDEr points, and its gating mechanism aligns with human judgment of modality relevance. AI

IMPACT This framework could improve how AI models process and understand complex video data with multiple synchronized streams.
- UniMVU
- arXiv
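
A minimal sketch of the modality-level half of the gating described above: each modality gets a relevance score conditioned on the instruction, a softmax turns the scores into gates, and the fused feature is the gate-weighted sum. The scores, feature vectors, and function names are hypothetical, and the inner-modality gates are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, relevance):
    """Weight each modality's feature vector by a softmax gate and sum them.

    features:  dict mapping modality name -> feature vector (equal lengths)
    relevance: dict mapping modality name -> instruction-conditioned score
    """
    names = list(features)
    gates = softmax([relevance[n] for n in names])
    dim = len(features[names[0]])
    fused = [
        sum(g * features[n][i] for g, n in zip(gates, names)) for i in range(dim)
    ]
    return dict(zip(names, gates)), fused


# Hypothetical scores for an instruction such as "what sound is playing?"
features = {"rgb": [1.0, 0.0], "audio": [0.0, 1.0], "depth": [1.0, 1.0]}
relevance = {"rgb": 0.0, "audio": 2.0, "depth": 0.0}
gates, fused = gated_fusion(features, relevance)
```

A static fusion scheme would use the same weights for every instruction; here the audio stream dominates only because the instruction made it relevant.
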
TOOL · arXiv cs.CV English(EN) · 2w

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Researchers have developed InHabit, a novel system for generating large-scale, photorealistic 3D datasets of humans interacting with environments. This system leverages image foundation models to propose actions and insert humans into 3D scenes, then refines these insertions into physically plausible SMPL-X bodies. The resulting dataset, InHabitants, comprises 78,000 samples across approximately 800 scenes and has demonstrated improvements in 3D human-scene reconstruction and contact estimation tasks. AI
TOOL · arXiv cs.CV English(EN) · 2w

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

Researchers have developed MedVCR, a new framework for medical video diagnosis that aims to improve accuracy by incorporating clinical reasoning and counterfactual analysis. Unlike previous methods that focused solely on visual appearance, MedVCR synthesizes potential pathological states and learns diagnostic knowledge from clinical rules. This approach has shown performance gains of 2.6% to 10.2% over existing methods in settings like colposcopy and colonoscopy. AI

IMPACT This framework could lead to more accurate and reliable medical diagnoses from video data by integrating clinical reasoning.
- arXiv
- MedVCR
TOOL · arXiv cs.CV English(EN) · 2w

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

Researchers have introduced RoMo, a large-scale dataset designed to advance human motion generation in AI. This dataset addresses limitations in existing motion data by combining high-fidelity sequences with in-the-wild collections. RoMo features a taxonomy-aware filtering pipeline to ensure quality, detailed annotations, and a hierarchical semantic structure for fine-grained evaluation. Models trained on RoMo have demonstrated state-of-the-art performance in fidelity, diversity, and understanding complex text prompts. AI
- Motion Toolbox
TOOL · arXiv cs.CV Bahasa(ID) · 2w

RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields

Researchers have developed RadarSim, a novel differentiable renderer that simulates Doppler radar range images by utilizing the high angular resolution of RGB cameras. This approach initializes a neural field from camera data, enabling the generation of sharper geometry and more detailed radar range frames compared to radar-only reconstruction methods. The system was validated using a new dataset of calibrated radar-camera recordings, demonstrating its effectiveness in improving radar simulation. AI
- RGB cameras
- RadarSim
TOOL · arXiv cs.CV English(EN) · 2w

Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis

Researchers have developed a new method called Dimensional Distribution Emotion State (DDES) to analyze the emotional content of artworks. This approach uses a continuous bi-dimensional emotion space, leveraging valence and arousal, to improve the training of deep learning models for visual emotion analysis. The goal is to assist museum curators in designing emotion-based exhibitions by predicting the emotional response evoked by art, thereby reducing the need for manual annotation and potential curator bias. AI

IMPACT This research could enable more data-driven approaches to curating art exhibitions, potentially increasing visitor engagement and accessibility.
TOOL · arXiv cs.CV English(EN) · 2w

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

Researchers have developed DuoGesture, a novel dual-stream approach for generating co-speech gestures that separates semantic and rhythmic components. This method uses a Semantic Variational Information Bottleneck to manage the interplay between semantic gestures and rhythmic beat motion, and Motion-Grounded Semantic Conditioning to better align semantic gestures with speech. An Inertial Beat Prior module further refines rhythmic consistency by incorporating biomechanical principles. Evaluations indicate DuoGesture surpasses existing holistic models in both objective and subjective measures. AI

IMPACT Introduces a novel approach to AI-driven gesture generation, potentially improving human-computer interaction and virtual character animation.
- DuoGesture
TOOL · arXiv cs.CV English(EN) · 2w

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

Researchers have developed a new framework called Contrastive Curriculum for Robust Dataset Distillation (C$^2$R) to improve the robustness of distilled datasets. Unlike previous methods that treated all adversarial perturbations equally, C$^2$R prioritizes samples with the smallest robust margins and explicitly widens the separation between decision boundaries. This approach leads to better accuracy-robustness trade-offs, achieving superior robust accuracy across various datasets and attacks. AI
TOOL · arXiv cs.CV English(EN) · 2w

EgoExo-WM: Unlocking Exo Video for Ego World Models

Researchers have developed EgoExo-WM, a novel method to enhance egocentric world models by leveraging abundant exocentric video data. This approach extracts structured body pose from exocentric videos and transforms them into an egocentric perspective, informed by human kinematics. Training egocentric world models with this converted data significantly improves prediction quality and downstream planning performance, enabling applications in robot planning and augmented-reality guidance. AI

IMPACT Enables training of more robust egocentric world models by leveraging readily available exocentric video data.
- Danny Tran
- EgoExo-WM
TOOL · arXiv cs.CV English(EN) · 2w

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

Researchers have developed RadJEPA, a novel self-supervised learning framework for medical image analysis, specifically for chest X-rays. Unlike previous methods that rely on paired image-text data, RadJEPA learns from approximately 840,000 unlabeled X-ray images by predicting masked regions from a visible context. This approach aims to overcome limitations of clinical narrative bias and data availability. Evaluations show RadJEPA matches or surpasses existing image-only and vision-language baselines in radiology report generation, disease classification, and semantic segmentation tasks. AI

IMPACT This research could enable more robust medical image analysis by reducing reliance on labeled data, potentially improving diagnostic tools.
- MedLLaVA
- RadJEPA
- Anas Khan
- Vicuna-7B
- Qwen-2.5
- BLIP-2
- Phi-4
TOOL · arXiv cs.CV English(EN) · 2w

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Researchers have introduced the min-Sliced Transport Plan (min-STP) framework to address the computational cost of Optimal Transport (OT) in computer vision tasks. This new approach optimizes a one-dimensional projection to create a conditional transport plan, significantly reducing computation time. The study also investigates the transferability of these optimized slicers to new distribution pairs, finding that they remain effective even with slight data perturbations, enabling efficient transfer across related tasks and amortized training for applications like point cloud alignment and generative modeling. AI

IMPACT This research could accelerate the application of Optimal Transport in areas like point cloud alignment and generative modeling by reducing computational overhead.
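
The core trick behind sliced transport plans is that optimal transport in one dimension reduces to sorting. The sketch below projects two equal-size 2-D point sets onto a direction, matches points by rank, and keeps the direction whose matching is cheapest in the original space. It swaps the learned slicer for a plain grid search over angles and assumes uniform weights; the function names are illustrative and this is not the paper's algorithm.

```python
import math


def sliced_plan(xs, ys, theta):
    """Match points by rank after projecting both sets onto direction theta."""
    dx, dy = math.cos(theta), math.sin(theta)

    def proj(p):
        return p[0] * dx + p[1] * dy

    xi = sorted(range(len(xs)), key=lambda i: proj(xs[i]))
    yi = sorted(range(len(ys)), key=lambda j: proj(ys[j]))
    return list(zip(xi, yi))


def plan_cost(xs, ys, plan):
    """Mean squared Euclidean distance of the matching in the original space."""
    return sum(
        (xs[i][0] - ys[j][0]) ** 2 + (xs[i][1] - ys[j][1]) ** 2 for i, j in plan
    ) / len(plan)


def min_sliced_plan(xs, ys, n_angles=180):
    """Grid-search the slicing angle whose induced plan has the lowest cost."""
    best = None
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        plan = sliced_plan(xs, ys, theta)
        cost = plan_cost(xs, ys, plan)
        if best is None or cost < best[0]:
            best = (cost, theta, plan)
    return best


xs = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
ys = [(0.5, 0.0), (1.5, 2.0), (2.5, 4.0)]  # xs shifted by 0.5 along the x-axis
cost, theta, plan = min_sliced_plan(xs, ys)
```

Each candidate plan costs only a sort, and a good slicing direction found for one pair of point sets is a natural starting point for a slightly perturbed pair, which is the transferability the summary describes.
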
TOOL · arXiv cs.CV English(EN) · 2w

Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification

A new study published on arXiv explores the intuitive decision-making processes of machine learning researchers when selecting source datasets for transfer learning in medical image classification. The research, conducted via a survey, reveals that practitioners' choices are influenced by task dependency, community norms, dataset characteristics, and perceived similarity, rather than solely by systematic principles. Notably, the study found a disconnect between similarity ratings and expected performance, and a general lack of consideration for ethical and fairness implications in dataset selection. AI

IMPACT Highlights a gap in systematic source dataset selection for medical imaging transfer learning, suggesting a need for better tools and frameworks to improve generalizability and patient outcomes.
- arXiv
- Amelia Jiménez-Sánchez
TOOL · arXiv cs.CV English(EN) · 2w

ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking

Researchers have developed ISTASTrack, a novel hybrid tracking system that combines artificial neural networks (ANNs) with spiking neural networks (SNNs) for RGB-event visual object tracking. This system utilizes a transformer-based architecture with specialized ISTA adapters to facilitate bidirectional feature interaction between the RGB and event data streams. The approach aims to effectively fuse information from these heterogeneous sources, leading to state-of-the-art performance and high energy efficiency on benchmark datasets. AI

IMPACT Introduces a novel hybrid ANN-SNN architecture for improved visual tracking efficiency and performance.
TOOL · arXiv cs.CV English(EN) · 2w

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

Researchers have developed GeoSolver, a new framework designed to improve the reasoning capabilities of Vision-Language Models (VLMs) in remote sensing. The system utilizes a novel process reward model (PRM) trained on a large-scale dataset called Geo-PRM-2M, which provides fine-grained feedback on the visual faithfulness of intermediate reasoning steps. By integrating this PRM with a reinforcement learning algorithm, GeoSolver-9B achieves state-of-the-art performance on remote sensing benchmarks and demonstrates robust test-time scaling capabilities, enhancing both its own performance and that of general-purpose VLMs. AI

IMPACT Enhances VLM reasoning in remote sensing, potentially improving accuracy and reliability in geospatial analysis.
TOOL · arXiv cs.CV English(EN) · 2w

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

A new research paper explores the challenges in synthesizing amyloid PET scans from structural MRI data for Alzheimer's disease diagnosis. The study posits that the inconsistency in model performance stems from a fundamental biological ambiguity: MRI reflects neurodegeneration while PET measures amyloid pathology, which can be temporally decoupled. This leads to ambiguous one-to-many mappings between MRI patterns and amyloid states, making the synthesis task intrinsically ill-posed. The research demonstrates that while unambiguous mappings can be learned in isolation, performance degrades when data ambiguity is present. Integrating multimodal inputs, such as plasma biomarkers, can resolve this ambiguity, improve performance, and restore stability, suggesting that multimodal integration is key for progress rather than solely architectural complexity. AI

IMPACT Highlights the need for multimodal data integration in AI models for medical diagnostics, moving beyond architectural complexity to address inherent data ambiguities.
TOOL · arXiv cs.CV English(EN) · 2w

Spectral Principal Paths: A Spectral Perspective on Linear Representation Formation in LLMs

Researchers have introduced the Spectral Principal Path (SPP) framework to explain how linear representations form in large language models (LLMs). This framework is based on the Input-Space Linearity Hypothesis, which suggests that concept-aligned directions originate in the input space and are maintained through network layers. The SPP framework provides theoretical stability guarantees and identifies conditions like spectral gap and context incoherence that preserve these directions, offering potential implications for AI fairness and transparency. AI

IMPACT Provides a theoretical framework for understanding and potentially controlling concept alignment in LLMs, impacting AI fairness and transparency.
TOOL · arXiv cs.CV English(EN) · 2w

Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Researchers have developed a novel Cross-Task Attention Bridge (CTAB) module to enhance multi-task learning for autonomous driving perception. This bidirectional module facilitates feature exchange between 3D detection and segmentation tasks within a shared Bird's-Eye-View (BEV) space. By allowing detection cues to refine segmentation and segmentation context to anchor detection, CTAB aims to improve overall perception accuracy. Experiments on the nuScenes dataset demonstrated CTAB's effectiveness in enhancing segmentation while maintaining competitive 3D detection performance. AI
- nuScenes
- Ozgur Erkent Dr.
TOOL · arXiv cs.CV English(EN) · 2w

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

Researchers have developed a new multi-modal classification framework that effectively fuses satellite and street-level imagery for building inspection. Utilizing a Perceiver IO architecture and a shared DINOv2 backbone, the system can process a variable number of street-level views without padding and simultaneously predict multiple roof element and material classes. A novel RGB-M masking strategy, which incorporates the building footprint mask as a fourth input channel, demonstrated superior performance over hard cropping, leading to significant per-class gains for street-visible attributes. AI

IMPACT Introduces a flexible architecture for multi-modal data fusion in computer vision tasks, potentially improving accuracy in real-world applications like urban planning and infrastructure assessment.
- Perceiver IO
- DINOv2
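The RGB-M strategy described above can be illustrated with a small sketch. It assumes images stored as nested lists of (r, g, b) tuples and interprets "hard cropping" as zeroing every pixel outside the footprint; both are assumptions for illustration, and the helper names are hypothetical rather than the paper's.

```python
def to_rgbm(image, mask):
    """Append the footprint mask to every RGB pixel as a fourth channel."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same height and width")
    return [
        [(r, g, b, float(m)) for (r, g, b), m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

def hard_crop(image, mask):
    """Zero out every pixel outside the footprint, discarding the context."""
    return [
        [(r, g, b) if m else (0.0, 0.0, 0.0) for (r, g, b), m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Toy 2x2 image with a diagonal building footprint.
image = [[(0.2, 0.4, 0.6), (0.9, 0.8, 0.7)],
         [(0.1, 0.1, 0.1), (0.5, 0.5, 0.5)]]
footprint = [[1, 0],
             [0, 1]]

rgbm = to_rgbm(image, footprint)       # surroundings kept, footprint marked
cropped = hard_crop(image, footprint)  # surroundings erased
```

With the mask as an extra channel, the model still sees the pixels around the building and can decide for itself how much to rely on them, whereas hard cropping removes that context before the network sees the image.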
TOOL · arXiv cs.CV English(EN) · 2w

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

Researchers have identified a key issue in feature distillation for Vision Transformers (ViTs), particularly when compressing models. They discovered that while individual images are compressible, the overall dataset exhibits a complex structure with rotating low-rank subspaces. This 'encoding mismatch' means that standard distillation methods fail because the token-level energy distribution across channels doesn't align with the teacher model's architecture. To address this, the paper proposes two simple fixes: 'Lift,' which adds a lightweight projector at inference, and 'WideLast,' which widens the student's final block. These methods significantly improve the performance of compressed ViTs, as demonstrated on ImageNet-1K. AI

IMPACT Offers new techniques to improve the efficiency and performance of Vision Transformer models, crucial for deployment on resource-constrained devices.
TOOL · arXiv cs.CV English(EN) · 2w

Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset

Researchers have developed a new pretraining method called Anomaly-Guided Self-Supervised Pretraining (AGSSP) to improve metallic surface defect detection. This approach uses anomaly maps to guide the model's learning, helping it distinguish subtle defects from complex backgrounds. AGSSP involves a two-stage process: first, pretraining the backbone by distilling knowledge from anomaly maps, and second, pretraining the detector with pseudo-defect boxes derived from these maps. Experiments show AGSSP significantly boosts performance, with improvements of up to 10% in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to models pretrained on natural image datasets. AI
- ImageNet
- AGSSP
- Chuni Liu
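One simple way a pseudo-defect box could be derived from an anomaly map is to threshold the map and take the bounding box of the pixels that survive. The summary does not say how AGSSP does this, so the sketch below is an assumption for illustration only; the function name, the threshold value, and the single-box simplification are all hypothetical.

```python
def pseudo_box(anomaly_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around all pixels >= threshold.

    Returns None when no pixel reaches the threshold.
    """
    coords = [
        (row_idx, col_idx)
        for row_idx, row in enumerate(anomaly_map)
        for col_idx, value in enumerate(row)
        if value >= threshold
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(cols), min(rows), max(cols), max(rows))

# Toy 3x4 anomaly map with a high-scoring 2x2 region.
anomaly_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.6, 0.1],
]
box = pseudo_box(anomaly_map)
```

A real pipeline would likely separate connected regions so that each defect gets its own box; this version draws one box around everything above the threshold.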
TOOL · arXiv cs.CV English(EN) · 2w

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Researchers have developed CRoFT, a novel fine-tuning framework designed to enhance the generalization capabilities of vision-language pre-trained models (VL-PTMs) when encountering out-of-distribution (OOD) data. The method concurrently optimizes for improved generalization to covariate shifts and effective detection of unseen classes, addressing a critical gap in current fine-tuning practices. By minimizing the gradient magnitude of energy scores on training data, CRoFT promotes domain-consistent Hessians of classification loss, a key indicator for OOD generalization. AI

IMPACT Enhances AI model robustness to unseen data, potentially improving real-world deployment reliability.
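The summary mentions energy scores without defining them. The sketch below uses the definition common in energy-based OOD detection, the negative temperature-scaled log-sum-exp of the logits, and assumes that this is the quantity meant. It does not reproduce CRoFT's gradient-magnitude regularizer, and `energy_score` is a hypothetical helper name.

```python
import math

def energy_score(logits, temperature=1.0):
    """Energy E = -T * log(sum_i exp(z_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp

confident = energy_score([10.0, 0.0, 0.0])  # one class clearly dominates
flat = energy_score([0.0, 0.0, 0.0])        # no class stands out
```

Under this definition, inputs with a confidently predicted class receive lower energy than inputs with flat logits, which is why thresholding the score is a common way to flag unseen classes.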
TOOL · arXiv cs.CV English(EN) · 2w

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

Researchers have developed a new method for efficiently fine-tuning 3D foundation models, addressing the challenges posed by variations in texture, geometry, camera motion, and lighting. The approach generates synthetic datasets with controlled variations, then fine-tunes LoRA adapters on these datasets to extract distinct, approximately disentangled subspaces for each variation type. Integrating these subspaces results in a reduced LoRA subspace that improves prediction accuracy on downstream tasks, demonstrating generalization to real-world datasets. AI
TOOL · arXiv cs.CV English(EN) · 2w

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

Researchers have developed GS-CLIP, a novel framework for zero-shot 3D anomaly detection. This CLIP-based approach addresses limitations of existing methods, which struggle with geometric detail loss and incomplete visual understanding. GS-CLIP employs a two-stage learning process that generates text prompts with 3D geometric priors and utilizes a synergistic view representation learning architecture. This architecture processes rendered and depth images in parallel, fusing their features for enhanced anomaly detection. AI

IMPACT Introduces a new zero-shot method for detecting anomalies in 3D data without training on the target object categories, potentially improving applications in manufacturing and medical imaging.
- Zehao Deng
- GS-CLIP