DFG FOR “Learning to Sense” – Repositories

L2S Repositories and Data Sets

RAWDet-7

https://github.com/Mishalfatima/RawDet-7

Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions.
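As a quick orientation, here is a minimal sketch of how such data might be loaded, assuming COCO-format JSON annotations and RAW files readable with rawpy; the annotation path and directory layout below are hypothetical, so consult the repository for the actual structure.

import rawpy
from pycocotools.coco import COCO

# Hypothetical paths; see the RAWDet-7 repository for the real layout.
coco = COCO("annotations/instances_train.json")
img_info = coco.loadImgs(coco.getImgIds()[0])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_info["id"]))

# Read the unprocessed sensor mosaic instead of an ISP-processed RGB image.
with rawpy.imread("images/" + img_info["file_name"]) as raw:
    bayer = raw.raw_image_visible.copy()

print(bayer.shape, [coco.cats[a["category_id"]]["name"] for a in anns])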

Further reading:
Mishal Fatima*, Shashank Agnihotri*, Kanchana Vaishnavi Gandikota, Michael Moeller, Margret Keuper: RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images. arXiv preprint, arXiv:2602.03760 (2026).

RobustSpring

https://spring-benchmark.org/

Learning-based vision systems for scene understanding must not only be accurate, but also robust to degradations that arise in real-world sensing conditions, such as blur, noise, compression artifacts, and adverse weather. To support research in this direction, we introduce RobustSpring, a benchmark dataset for evaluating robustness to image corruptions in optical flow, scene flow, and stereo. Based on the Spring benchmark, RobustSpring provides 20 corrupted versions of stereo video data, with corruptions integrated consistently in time, stereo, and depth where applicable. In total, the dataset contains 40,000 frames, or 20,000 stereo frame pairs, and enables standardized robustness evaluation alongside accuracy on the same benchmark. This makes it possible to systematically study the stability and real-world applicability of dense matching models under challenging visual conditions.
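To make the notion of temporally and stereo-consistent corruptions concrete, here is a toy sketch (not the benchmark's generation code): a single blur kernel is applied to every frame of both views, so the degradation neither flickers over time nor differs between the left and right camera.

import numpy as np
from scipy.ndimage import uniform_filter

def corrupt_stereo_sequence(left_frames, right_frames, size=7):
    # One fixed box-blur kernel, shared across all frames and both views,
    # blurring spatially only (not across color channels).
    blur = lambda f: uniform_filter(f, size=(size, size, 1))
    return [blur(f) for f in left_frames], [blur(f) for f in right_frames]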

Dataset:
https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-5047

Further reading:
Victor Oei*, Jenny Schmalfuss*, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andreas Bulling, Andres Bruhn: RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo. In The Fourteenth International Conference on Learning Representations (ICLR) (2026).

GeoDiv

https://abhipsabasu.github.io/geodiv/samples.html

Learning-based vision systems are increasingly used to generate visual content, yet their outputs often lack geographical diversity and can reinforce region-specific socio-economic stereotypes. In the spirit of Learning to Sense, this raises the broader question of how well modern generative models capture the diversity of the visual world across different countries and contexts, and how such behavior can be measured systematically. To support research in this direction, we introduce GeoDiv, a framework and dataset for measuring geographical diversity in text-to-image models. GeoDiv comprises 160,000 synthetic images generated by four open-source diffusion models across 10 common entities and 16 countries, together with structured annotations and interpretable diversity scores along socio-economic and visual dimensions. This enables systematic analysis of geographical bias and diversity in generative vision systems, and supports the study of how learning-based models represent the world beyond conventional performance metrics.
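As an illustration of what an interpretable diversity score can look like, the following sketch computes the normalized Shannon entropy of one annotated attribute across generated images; the attribute values are made up, and the paper defines the actual scores along socio-economic and visual dimensions.

from collections import Counter
import math

def diversity_score(attribute_values):
    # Normalized entropy: 0 means all images share one value, 1 means the
    # observed values are uniformly distributed.
    counts = Counter(attribute_values)
    n = sum(counts.values())
    entropy = -sum(c / n * math.log(c / n) for c in counts.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy

# Hypothetical house-material annotations for images of one entity/country:
print(diversity_score(["brick", "brick", "mud", "concrete"]))  # ~0.95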

Dataset:
https://drive.google.com/drive/folders/1QtXcxCzPq8iteq1FehFDjNmLMKZhaLUE

Further reading:
Abhipsa Basu*, Mohana Singh*, Shashank Agnihotri, Margret Keuper, R. Venkatesh Babu: GeoDiv: Framework for Measuring Geographical Diversity in Text-to-Image Models. In The Fourteenth International Conference on Learning Representations (ICLR) (2026).

InverseTHzSim

https://github.com/memamsaleh/InverseThzSim

InverseTHzSim is a differentiable electromagnetic wave simulator at mm-wave and terahertz frequencies that enables reconstruction-free estimation of object parameters directly from raw measurements.
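As a toy illustration of this idea (not the actual EM simulator), the sketch below defines a differentiable forward model with a made-up attenuation law that maps a single object parameter to simulated raw data, and then recovers that parameter by gradient descent on the measurement misfit, with no intermediate image reconstruction.

import torch

def forward_model(thickness_mm, freqs_ghz):
    # Toy frequency-dependent attenuation; the real simulator models
    # electromagnetic wave propagation at 235-270 GHz.
    alpha = 0.5 * freqs_ghz / 235.0
    return torch.exp(-alpha * thickness_mm)

freqs = torch.linspace(235.0, 270.0, 64)
measured = forward_model(torch.tensor(2.0), freqs)   # "raw data" from a 2 mm layer
thickness = torch.tensor(0.5, requires_grad=True)    # initial guess
opt = torch.optim.Adam([thickness], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = (forward_model(thickness, freqs) - measured).pow(2).mean()
    loss.backward()
    opt.step()
print(thickness.item())  # converges toward 2.0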

Further reading:
Mohamed Saleh, Matthias Kahl, Peter Haring Bolívar, Andreas Kolb: Direct Detection of Object Parameters From Raw Data in MIMO-SAR Imaging at 235–270 GHz via Inverse Simulation. In IEEE Journal of Microwaves, vol. 6, no. 1, pp. 126-136 (Jan. 2026).

NAG

https://github.com/jp-schneider/nag

Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications ranging from autonomous driving to creative editing. The most successful approaches today trade editability against the scene complexity they can support: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D but break down when multiple objects occlude and interact. Scene graph models, in contrast, use annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are difficult to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation in which every graph node is a view-dependent neural atlas, enabling both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset, improving PSNR by 5 dB over existing methods, and enable environmental editing at high resolution and visual quality, creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. The method also generalizes beyond driving scenes: on the DAVIS video dataset, with its diverse set of human- and animal-centric scenes, it outperforms recent matting and video editing baselines by more than 7 dB PSNR.
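The following schematic sketch (our reading of the idea, not the released implementation) illustrates the compositing: each graph node carries an atlas function predicting RGBA at queried pixel coordinates, and nodes are blended back-to-front according to their 3D ordering, so appearance edits happen per node in 2D while ordering and position are edited in 3D.

import torch

def composite(nodes, coords):
    # nodes: list of (depth, atlas_fn), where atlas_fn maps pixel
    # coordinates of shape (N, 2) to RGBA of shape (N, 4).
    out = torch.zeros(coords.shape[0], 3)
    for _, atlas_fn in sorted(nodes, key=lambda n: -n[0]):  # far to near
        rgba = atlas_fn(coords)
        alpha = rgba[:, 3:4]
        out = (1 - alpha) * out + alpha * rgba[:, :3]       # over-compositing
    return out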

Project Page:
https://princeton-computational-imaging.github.io/nag/

Further Reading:
Jan Philipp Schneider, Pratik Singh Bisht, Ilya Chugunov, Andreas Kolb, Michael Moeller, Felix Heide: Neural Atlas Graphs for Dynamic Scene Decomposition and Editing. arXiv preprint, arXiv:2509.16336 (2025).

LEECH

https://zenodo.org/records/17021089

A Hirudo T.S. (leech) sample was imaged using a Fourier Ptychographic Microscope (FPM) under green LED illumination. The optical system comprises a 10× objective lens with a numerical aperture of 0.3, which allows imaging a sufficiently wide field of view of the sample. Using the FPM setup, a sequence of low-resolution intensity images was captured by sequentially illuminating the sample from different angles using an LED array. For each illumination angle, multiple exposures were taken and merged to form High Dynamic Range (HDR) images, allowing better preservation of both dark and bright regions in the sample. The HDR dataset was captured and processed by John Meshreki [1] at the CSE Optics Lab, ZESS, University of Siegen.
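For readers unfamiliar with exposure bracketing, here is a generic HDR merging sketch (a weighted average in linear units, not necessarily the exact procedure used for this dataset): each exposure is normalized by its exposure time, and saturated or near-black pixels receive low weight.

import numpy as np

def merge_hdr(frames, exposure_times):
    # frames: (K, H, W) intensity images in [0, 1]; exposure_times: (K,).
    frames = np.asarray(frames, dtype=np.float64)
    times = np.asarray(exposure_times, dtype=np.float64)[:, None, None]
    weights = np.clip(1.0 - np.abs(2.0 * frames - 1.0), 1e-4, None)  # hat weights
    return (weights * frames / times).sum(axis=0) / weights.sum(axis=0)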

Further reading:
[1] John Meshreki, Syed Muhammad Kazim, Ivo Ihrke: Optical system characterization in Fourier ptychographic microscopy. In Opt. Continuum 3, pp. 2218-2231 (2024).

WEAR

https://mariusbock.github.io/wear/

Research has shown the complementarity of camera- and inertial-based data for modeling human activities, yet datasets with both egocentric video and inertial sensor data remain scarce. We introduce WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). Synchronized inertial (acceleration) and camera (egocentric video) data were collected from 22 participants performing a total of 18 different workout activities at 11 different outdoor locations. WEAR provides a challenging prediction scenario in changing outdoor environments, with a sensor placement in line with recent trends in real-world applications. Benchmark results show that, with this sensor placement, each modality offers complementary strengths and weaknesses in prediction performance. Further, in light of the recent success of single-stage Temporal Action Localization (TAL) models, we demonstrate their versatility: they can be trained not only on visual data but also on raw inertial data, and can fuse both modalities by simple concatenation, as sketched below.
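A minimal sketch of that fusion strategy, with made-up feature dimensions (see the repository for the actual feature extraction and TAL pipelines):

import torch

visual_feats = torch.randn(1, 128, 2048)   # (batch, time steps, video features)
inertial_feats = torch.randn(1, 128, 192)  # (batch, time steps, IMU features)
fused = torch.cat([visual_feats, inertial_feats], dim=-1)  # (1, 128, 2240)
# `fused` can then be fed to an off-the-shelf single-stage TAL model.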

Further reading:
Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller: WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 4, Article No.: 175, pp. 1-21 (2024).

AWESOME

https://github.com/jp-schneider/awesome

Image segmentation has greatly advanced over the past ten years. Yet, even the most recent techniques face difficulties producing good results in challenging situations, e.g., if training data are scarce, out-of-distribution examples need to be segmented, or if objects are occluded. In such situations, the inclusion of (geometric) constraints can improve the segmentation quality significantly. We study the constraint that the segmented region be convex. Unlike prior work that encourages such a property with computationally expensive penalties on segmentation masks represented explicitly on a pixel grid, our work is the first to consider an implicit representation. Specifically, we represent the segmentation as a parameterized function that maps spatial coordinates to the likelihood of a pixel belonging to the foreground or background. This enables us to provably ensure the convexity of the segmented regions with the help of input convex neural networks, as sketched below. Numerical experiments demonstrate how to encourage explicit and implicit representations to match in order to benefit from the convexity constraints in several challenging segmentation scenarios.
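A minimal sketch of the convexity mechanism, assuming a standard input convex neural network (this is the generic ICNN construction, not the paper's exact architecture): hidden-to-hidden weights are kept non-negative and activations are convex and non-decreasing, so the network output is provably convex in the pixel coordinates, and any sublevel set taken as the segmented region is therefore convex.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.in0 = nn.Linear(2, hidden)                  # input layer
        self.in1 = nn.Linear(2, hidden)                  # skip from the input
        self.hid = nn.Linear(hidden, hidden, bias=False)
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, xy):
        z = F.relu(self.in0(xy))
        # Clamping enforces non-negative hidden-to-hidden weights, which
        # together with the convex, non-decreasing ReLU preserves convexity.
        z = F.relu(F.linear(z, self.hid.weight.clamp(min=0)) + self.in1(xy))
        return F.linear(z, self.out.weight.clamp(min=0))

# The sublevel set {(x, y) : f(x, y) <= tau} of this convex network, used
# as the foreground, is guaranteed to be a convex region.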

Further Reading:
Jan Philipp Schneider, Mishal Fatima, Jovita Lukasik, Andreas Kolb, Margret Keuper, Michael Moeller: Implicit representations for constrained image segmentation. In International Conference on Machine Learning (ICML), pp. 43765-43790 (2024).
Jan Philipp Schneider, Mishal Fatima, Jovita Lukasik, Andreas Kolb, Margret Keuper, Michael Moeller: Implicit Representations for Image Segmentation. In UniReps Workshop: Unifying Representations in Neural Models (2023).

————————————————
* equal contributions