Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. A recent research direction for improving few-shot classifiers involves augmenting the labeled samples with synthetic images created by state-of-the-art text-to-image generation models. Following this trend, we propose Diversified In-domain Synthesis with Efficient Fine-tuning (DISEF), a novel approach that addresses the generalization challenge in few-shot learning using synthetic data. DISEF consists of two main components. First, we propose a novel text-to-image augmentation pipeline that leverages the real samples and the rich semantics provided by an advanced captioning model to promote in-domain sample diversity for better generalization. Second, we emphasize the importance of effective model fine-tuning in few-shot recognition, proposing to use Low-Rank Adaptation (LoRA) to jointly adapt the text and image encoders of a Vision-Language Model. We validate our method on ten different benchmarks, consistently outperforming baselines and establishing a new state of the art for few-shot classification.
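As a rough illustration of the fine-tuning component, the sketch below adds LoRA adapters jointly to the text and image encoders of a CLIP-style Vision-Language Model using the Hugging Face `peft` library; the rank, scaling, and target-module choices are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: joint LoRA adaptation of both CLIP encoders.
# Assumes the `transformers` and `peft` packages; "q_proj"/"v_proj"
# match the attention projections in the Hugging Face CLIP code, but
# the hyperparameters here are illustrative, not the paper's.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# The same projection names appear in both the text and vision towers,
# so a single config adapts the two encoders jointly.
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable
```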
Preprint
Collaborative Neural Painting
Nicola Dall’Asen, Willi Menapace, Elia Peruzzo, Enver Sangineto, Yiming Wang, and 1 more author
The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative art painting generation between humans and machines. Given any number of user-input brushstrokes as the context, or just the desired object class, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, allowing users to make modifications at any phase until completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes both editing and composition operations easy. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism that models the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP, and we adopt diffusion models as the underlying learning process for their effectiveness and diversity in generative tasks. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research.
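As a toy illustration of the stroke representation, the snippet below encodes a painting as a sequence of parameter vectors and applies a random context mask of the kind an interactive completion scheme might use; the stroke parametrization and mask ratio are hypothetical, not taken from the paper.

```python
# Toy sketch of a stroke-sequence representation with a completion mask.
# The 8-dimensional parametrization (e.g. position, size, angle, color)
# and the 30% context ratio are illustrative assumptions.
import torch

num_strokes, stroke_dim = 64, 8
strokes = torch.rand(num_strokes, stroke_dim)  # one parameter vector per stroke

# Mark a random subset of strokes as user-provided context; the model
# would be asked to generate the remaining, masked-out strokes.
is_context = torch.rand(num_strokes) < 0.3
context_strokes = strokes[is_context]
strokes_to_complete = strokes[~is_context]
print(f"{is_context.sum().item()} context strokes, "
      f"{(~is_context).sum().item()} strokes to complete")
```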
ICCV
Object-aware Gaze Target Detection
Francesco Tonini, Nicola Dall’Asen, Cigdem Beyan, and Elisa Ricci
In International Conference on Computer Vision (ICCV), 2023
Gaze target detection aims to predict the image location where a person is looking and the probability that their gaze falls out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location; however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed head/object, resulting in a comprehensive, explainable gaze analysis composed of: the gaze target area, the gaze pixel point, and the class and image location of the gazed object. Evaluated on in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection, and an 11-13% improvement in average precision for the classification and localization of the gazed objects.
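As a schematic of the association step only (names and shapes are hypothetical, not the paper's implementation), once a Transformer produces per-head association scores over the detected objects, the gazed object can be read off as the highest-scoring association:

```python
# Hypothetical sketch: reading gazed objects off head-to-object scores.
import torch

num_heads, num_objects = 3, 10
scores = torch.randn(num_heads, num_objects)  # association logits per head
assoc = scores.softmax(dim=-1)                # normalized head-object weights
gazed_object = assoc.argmax(dim=-1)           # gazed-object index per head
confidence = assoc.max(dim=-1).values         # strength of each association
```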
ICIAP
Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations
*Anil Osman Tur, *Nicola Dall’Asen, Cigdem Beyan, and Elisa Ricci
In 22nd International Conference on Image Analysis and Processing (ICIAP), 2023
This paper aims to address the unsupervised video anomaly detection (VAD) problem, which involves classifying each frame in a video as normal or abnormal, without any access to labels. To accomplish this, the proposed method employs conditional diffusion models, where the input data is the spatiotemporal features extracted from a pre-trained network, and the condition is the features extracted from compact motion representations that summarize a given video segment in terms of its motion and appearance. Our method utilizes a data-driven threshold and considers a high reconstruction error as an indicator of anomalous events. This study is the first to utilize compact motion representations for VAD, and the experiments conducted on two large-scale VAD benchmarks demonstrate that they supply relevant information to the diffusion model and consequently improve VAD performance w.r.t. the prior art. Importantly, our method exhibits better generalization performance across different datasets, notably outperforming both the state-of-the-art and baseline methods.
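To make the scoring idea concrete, here is a minimal, hypothetical sketch of labeling frames by thresholding reconstruction error with a data-driven threshold; the quantile choice and feature shapes are illustrative assumptions, and the diffusion-model reconstruction itself is mocked with noise.

```python
# Hypothetical sketch: per-frame anomaly flags from reconstruction error,
# with a threshold derived from the error distribution itself (no labels).
import numpy as np

def anomaly_flags(features, reconstructions, quantile=0.95):
    """Flag frames whose MSE exceeds a data-driven quantile threshold."""
    errors = np.mean((features - reconstructions) ** 2, axis=1)
    threshold = np.quantile(errors, quantile)  # data-driven, label-free
    return errors > threshold, errors

# Mocked inputs: pre-extracted spatiotemporal features and their
# reconstructions (here simulated by adding noise).
features = np.random.randn(100, 512)
reconstructions = features + 0.1 * np.random.randn(100, 512)
flags, errors = anomaly_flags(features, reconstructions)
print(f"{flags.sum()} of {len(flags)} frames flagged as anomalous")
```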
ICIP
Exploring Diffusion Models for Unsupervised Video Anomaly Detection
Anil Osman Tur, Nicola Dall’Asen, Cigdem Beyan, and Elisa Ricci
In International Conference on Image Processing (ICIP), 2023
This paper investigates the performance of diffusion models for video anomaly detection (VAD) in the most challenging, yet most practical, scenario in which no data annotations are used. Since abnormal events are sparse, diverse, contextual, and often ambiguous, detecting them precisely is a very ambitious task. To this end, we rely only on information-rich spatio-temporal data and the reconstruction power of diffusion models, using a high reconstruction error to decide abnormality. Experiments performed on two large-scale video anomaly detection datasets demonstrate the consistent improvement of the proposed method over state-of-the-art generative models, while in some cases our method achieves better scores than more complex models. This is the first study to use a diffusion model for VAD and to examine the influence of its parameters, providing guidance for VAD in surveillance scenarios.
2022
ICIAP
Graph-based Generative Face Anonymisation with Pose Preservation
Nicola Dall’Asen, Yiming Wang, Hao Tang, Luca Zanella, and Elisa Ricci
In 21st International Conference on Image Analysis and Processing (ICIAP), 2022
We propose AnonyGAN, a GAN-based solution for face anonymisation which replaces the visual information corresponding to a source identity with a condition identity provided as any single image. With the goal of maintaining the geometric attributes of the source face, i.e., the facial pose and expression, and of promoting more natural face generation, we propose to exploit a bipartite graph to explicitly model, through a deep model, the relations between the facial landmarks of the source identity and those of the condition identity. We further propose a landmark attention model to relax the manual selection of facial landmarks, allowing the network to weight the landmarks for the best visual naturalness and pose preservation. Finally, to facilitate appearance learning, we propose a hybrid training strategy to address the challenge caused by the lack of direct pixel-level supervision. We evaluate our method and its variants on two public datasets, CelebA and LFW, in terms of visual naturalness and facial pose preservation, as well as its impact on face detection and re-identification. We show that AnonyGAN significantly outperforms state-of-the-art methods in terms of visual naturalness, face detection, and pose preservation.
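As a minimal illustration of the bipartite idea (a plain cross-attention stand-in, not AnonyGAN's actual module), each source landmark can attend over all condition landmarks, so that the network weights condition landmarks instead of relying on a manual selection:

```python
# Hypothetical sketch: bipartite attention between source and condition
# landmark embeddings; 68 landmarks and 64-dim embeddings are assumptions.
import torch
import torch.nn.functional as F

def landmark_attention(src, cond):
    """src: (N, d) source landmark embeddings; cond: (M, d) condition ones."""
    d = src.shape[-1]
    weights = F.softmax(src @ cond.T / d ** 0.5, dim=-1)  # (N, M) bipartite weights
    return weights @ cond                                 # aggregated condition features

src = torch.randn(68, 64)   # 68 points is a common facial-landmark convention
cond = torch.randn(68, 64)
fused = landmark_attention(src, cond)
```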
Ital-IA
Responsible AI at the edge: towards privacy-preserving smart cities
Luca Zanella, Yiming Wang, Nicola Dall’Asen, Alberto Ancilotto, Francesco Paissan, and 4 more authors
In Ital-IA 2022 Convegno del Laboratorio nazionale CINI-AIIS, Feb 2022
With the massive amount of data produced by ambient environmental sensors, many AI-based solutions are emerging to support new smart city applications. However, these data may contain sensitive personal information, calling for responsible AI solutions. FBK proposes a privacy-preserving subsystem with a set of technological components that enable responsible AI and prevent unauthorised usage of personal data in data storage and during data transmission in the context of smart cities. We demonstrate the proposed solution within the EU project MARVEL, where both video and audio anonymisation components are deployed at the edge, enabled by a model compression component for complexity reduction. We discuss each component’s technical challenges, current progress, and future directions.
2021
BalkanCom
MARVEL: Multimodal Extreme Scale Data Analytics for Smart Cities Environments
Nicola Dall’Asen, et al.
In 2021 International Balkan Conference on Communications and Networking (BalkanCom), Sep 2021
A Smart City based on data acquisition, handling, and intelligent analysis requires efficient design and implementation of the respective AI technologies and of the underlying infrastructure for seamlessly analyzing large amounts of data in real time. The EU project MARVEL will research solutions that can improve the integration of multiple data sources in a Smart City environment, harnessing the advantages rooted in multimodal perception of the surrounding environment.