TixelJobs

liris via Indeed

Human behavior understanding in videos using multimodal foundation models

Bron, ARA, FR · Posted 2mo ago
NLP / LLM · Mid Level · #python #pytorch #tensorflow #gpt #computer-vision


About the Role

Human behavior understanding in videos using multimodal foundation models

Ref. ABG-135841
Master 2 / Engineering internship
Duration: 4 months
Monthly net salary: internship gratuity (gratification de stage)

18/02/2026

LIRIS
Workplace
Bron, Auvergne-Rhône-Alpes, France
Scientific fields
  • Computer science
Keywords
computer vision, automatic video analysis, human behavior understanding, deep learning
Application deadline
05/03/2027

Recruiting institution

Website:

https://liris.cnrs.fr/liris

The Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS) is a joint research unit (UMR 5205) of the CNRS, INSA Lyon, Université Claude Bernard Lyon 1, Université Lumière Lyon 2, and the École Centrale de Lyon.

Description

Context of the study:

Human behavior understanding is a key task in several application fields, from assisted living and disease diagnosis in healthcare to industrial problems such as task training and completion assessment. Deep neural networks and, more recently, multimodal foundation models have brought a new level of performance to research problems in video understanding (e.g., DINOv3, VideoLLaMA, InternVideo2). However, the performance of such methods on behavior understanding tasks, such as emotion recognition, is still limited compared to generic scene understanding (Lian et al., 2024).

This internship will study and evaluate the latest multimodal foundation models as building blocks of a pipeline for human behavior understanding. We will focus on methods capable of recognizing emotions and gestures in long videos and explore their performance outside datasets with controlled conditions (i.e., in the wild).


Tasks:

  • Review the state of the art on multimodal video understanding methods applicable to behavior understanding, identifying their limitations in characterizing the target behavioral aspects.
  • Propose a spatio-temporal deep neural pipeline that can detect the target behavioral events in space and time.
  • Write a research article to share the developed work with the computer vision community, accompanied by an open-source repository to foster reproducible research.
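To make the temporal side of the detection task concrete, here is a toy sketch of turning per-frame confidence scores (as a model for the target behavior would produce) into temporally localized events. The function name, threshold, and minimum-length heuristic are illustrative assumptions, not the method the internship will develop.

```python
import numpy as np


def temporal_events(scores, threshold=0.5, min_len=3):
    """Group runs of consecutive above-threshold frame scores into
    (start, end) event intervals, dropping runs shorter than `min_len`."""
    events = []
    start = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                      # event opens
        elif s < threshold and start is not None:
            if t - start >= min_len:       # keep only long-enough runs
                events.append((start, t))
            start = None                   # event closes
    if start is not None and len(scores) - start >= min_len:
        events.append((start, len(scores)))  # event still open at clip end
    return events


# Example: two behavioral events separated by a low-confidence gap.
scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.9, 0.8, 0.9, 0.2])
print(temporal_events(scores))  # [(2, 5), (6, 10)]
```

A full pipeline would combine such temporal grouping with spatial localization (e.g., person detections) to pin events down in both space and time.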

Related bibliographic references:

  • Zheng Lian, et al., GPT-4V with emotion: A zero-shot benchmark for Generalized Emotion Recognition, Information Fusion, Volume 108, 2024, 102367, ISSN 1566-2535, https://doi.org/10.1016/j.inffus.2024.102367.
  • Boqiang Zhang, et al., VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding, 2025, https://arxiv.org/abs/2501.13106
  • Yi Wang, et al. InternVideo2: Scaling Foundation Models for Multimodal Video Understanding. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXV. Springer-Verlag, Berlin, Heidelberg, 396–416. https://doi.org/10.1007/978-3-031-73013-9_23
  • Oriane Siméoni, et al., DINOv3, 2025, https://arxiv.org/abs/2508.10104

Profile

Profile of the candidate:

We are looking for a motivated candidate with a strong background in computer science or applied mathematics.

  • The candidate must currently be enrolled in a Master 1 or 2 program, or be in one of the final years of an engineering school (Bac+4 or Bac+5 in France).
  • Experience in image processing, computer vision, and/or machine learning will be a plus.

If the internship leads to an international publication, we may explore opportunities to continue the research with a PhD on a similar topic.

Language: French or English

Expected skills:

  • Mastery of the Python language
  • The OpenCV library
  • Versioning tools (Git)

The following skills would be considered as a plus:

  • The PyTorch or TensorFlow frameworks
  • Docker-like containerization tools and platforms

Duration: 4-6 months

Expected internship period: Late April-October, with an imposed summer break

Start date

01/05/2026
