Video Understanding

Understanding, searching and organizing dynamic video content

Automatic understanding and interpretation of video content is a key enabler for a range of practical applications, such as organizing and searching home videos or content-aware video advertising. For example, interpreting videos of “making a birthday cake” or “planting a tree” could provide an effective means of advertising products in local grocery stores or garden centers.

The goal of this work is to automatically generate annotations of complex dynamic events in video. We wish to handle events involving multiple people interacting with each other, with objects and with the scene, for example people at a party in a house. We aim to generate structured annotations that go beyond simple tags. Examples include full text sentences describing the video content, as well as bounding boxes or segmentations that localize the described objects and people in space and time. Such annotations will in turn open up the possibility of organizing and searching video content using well-developed technology from, e.g., text search and natural language processing.

We build on the considerable progress in visual object, scene and human action recognition achieved over the last ten years, as well as on recent advances in large-scale machine learning that enable optimizing complex structured models on massive data sets. In particular, we develop structured video representations of people, objects and scenes that describe their complex interactions as well as their long-term evolution in the dynamic scene.

To this end, we investigate different models of the video stream, including (i) designing explicit representations of scene geometry together with scene entities and their interactions, and (ii) directly learning mid-level representations of spatio-temporal video content using dictionary learning or convolutional neural networks.
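As an illustration of the dictionary-learning route in (ii), the sketch below encodes a descriptor as a sparse combination of dictionary atoms using matching pursuit, the greedy step at the heart of many sparse-coding pipelines. The atoms and the input descriptor are toy values chosen for the example, not an actual learned representation.

```python
import math

def normalize(v):
    """Scale a vector to unit length so atom correlations are comparable."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def matching_pursuit(x, atoms, n_iter=2):
    """Greedy sparse coding: repeatedly pick the unit-norm atom most
    correlated with the residual and subtract its contribution."""
    residual = list(x)
    code = [0.0] * len(atoms)
    for _ in range(n_iter):
        dots = [sum(a_i * r_i for a_i, r_i in zip(a, residual)) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(dots[i]))
        code[k] += dots[k]
        residual = [r_i - dots[k] * a_i for a_i, r_i in zip(atoms[k], residual)]
    return code

# toy 3-D dictionary of unit-norm atoms (hypothetical values)
atoms = [normalize([1, 0, 0]), normalize([0, 1, 0]), normalize([1, 1, 0])]
code = matching_pursuit([2.0, 0.5, 0.0], atoms)
```

In a real system the atoms would themselves be learned by alternating such sparse coding with dictionary updates, and the descriptors would come from spatio-temporal video features.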

To train the developed models, we design weakly-supervised learning methods that make use of videos together with readily available associated metadata such as text, speech or depth (in the case of 3D videos).
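A minimal sketch of the weakly-supervised idea, in the spirit of multiple-instance learning: only a video-level label is available, a linear model scores every candidate region, and the training signal flows through the maximum-scoring region, so per-region annotation is never needed. The data, the perceptron-style update and all values below are illustrative, not the method of any specific paper above.

```python
def video_score(w, regions):
    """Score each candidate region with a linear model and max-pool:
    the video is positive if its best region is positive."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in regions]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return scores[best], best

def train(videos, labels, lr=0.1, epochs=50):
    """Perceptron-style updates applied only to the max-scoring region,
    using only video-level labels in {+1, -1}."""
    dim = len(videos[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for regions, y in zip(videos, labels):
            s, best = video_score(w, regions)
            pred = 1 if s > 0 else -1
            if pred != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, regions[best])]
    return w

# toy data: each video is a bag of 2-D region descriptors (hypothetical)
videos = [[[1.0, 0.0], [0.2, 0.1]],      # contains the target concept
          [[-1.0, 0.2], [-0.5, -0.3]]]   # background only
labels = [1, -1]
w = train(videos, labels)
```

The max-pooling step is what lets supervision at the whole-video level localize the responsible region as a by-product.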

To enable access to the massive amounts of available video data, we also develop representations that allow for efficient extraction and indexing. We believe such models and methods are required for a breakthrough in the automatic understanding of dynamic video scenes.

  • Ivan Laptev
I am affiliated with Inria Paris – Rocquencourt, a unit of the French National Institute for Research in Computer Science and ...

Former members:
  • Karteek Alahari Inria Grenoble - Rhône-Alpes
  • Léon Bottou Microsoft Research - Redmond
  • Yang Hua Microsoft Research - Inria Joint Centre (PhD)
  • Maxime Oquab Microsoft Research - Inria Joint Centre (PhD)
  • Cordelia Schmid Inria Grenoble - Rhône-Alpes

2018

PhD thesis

  • Maxime Oquab. Convolutional neural networks: towards less supervision for visual recognition. Computer Science [cs]. Ecole Normale Supérieure (ENS); ED 386 : École doctorale de sciences mathématiques de Paris centre, UPMC, 2018. English.
    Full text: https://hal.inria.fr/tel-01803967/file/Oquab%20PhD%20Thesis.pdf

2016

Conference papers

  • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid. Weakly-Supervised Semantic Segmentation using Motion Cues. ECCV – European Conference on Computer Vision, Oct 2016, Amsterdam, Netherlands. pp.388-404, ⟨10.1007/978-3-319-46493-0_24⟩.
    Full text: https://hal.archives-ouvertes.fr/hal-01292794/file/mcnn.pdf
  • Vadim Kantorov, Maxime Oquab, Minsu Cho, Ivan Laptev. ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization. ECCV – European Conference on Computer Vision, Oct 2016, Amsterdam, Netherlands. pp.350-365, ⟨10.1007/978-3-319-46454-1_22⟩.
    Full text: https://hal.inria.fr/hal-01421772/file/contextlocnet_eccv2016.pdf
  • Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien. Unsupervised Learning from Narrated Instruction Videos. CVPR – 29th IEEE Conference on Computer Vision and Pattern Recognition, Jun 2016, Las Vegas, United States.
    Preprint: https://arxiv.org/pdf/1506.09215

2015

Conference papers

  • Guilhem Chéron, Ivan Laptev, Cordelia Schmid. P-CNN: Pose-based CNN Features for Action Recognition. ICCV – IEEE International Conference on Computer Vision, Dec 2015, Santiago, Chile. pp.3218-3226, ⟨10.1109/ICCV.2015.368⟩.
    Full text: https://hal.inria.fr/hal-01187690/file/P-CNN_cheronICCV15.pdf
  • Yang Hua, Karteek Alahari, Cordelia Schmid. Online Object Tracking with Proposal Selection. ICCV – IEEE International Conference on Computer Vision, Dec 2015, Santiago, Chile. pp.3092-3100, ⟨10.1109/ICCV.2015.354⟩.
    Full text: https://hal.inria.fr/hal-01207196/file/paper.pdf
  • Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic. Is object localization for free? – Weakly-supervised learning with convolutional neural networks. CVPR – IEEE Conference on Computer Vision and Pattern Recognition, Jun 2015, Boston, United States.
    Full text: https://hal.inria.fr/hal-01015140/file/Oquab15.pdf
  • Visesh Chari, Simon Lacoste-Julien, Ivan Laptev, Josef Sivic. On Pairwise Cost for Multi-Object Network Flow Tracking. CVPR – 28th IEEE Conference on Computer Vision and Pattern Recognition, Jun 2015, Boston, United States.
    Preprint: https://arxiv.org/pdf/1408.3304

2014

Journal article

  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. Activity representation with motion hierarchies. International Journal of Computer Vision, Springer Verlag, 2014, 107 (3), pp.219-238. ⟨10.1007/s11263-013-0677-1⟩.
    Full text: https://hal.inria.fr/hal-00908581/file/tracklets_journal.pdf

Conference papers

  • Danila Potapov, Matthijs Douze, Zaid Harchaoui, Cordelia Schmid. Category-specific video summarization. ECCV – European Conference on Computer Vision, Sep 2014, Zurich, Switzerland. pp.540-555, ⟨10.1007/978-3-319-10599-4_35⟩.
    Full text: https://hal.inria.fr/hal-01022967/file/video_summarization.pdf
  • Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. CVPR – IEEE Conference on Computer Vision and Pattern Recognition, Jun 2014, Columbus, OH, United States.
    Full text: https://hal.inria.fr/hal-00911179/file/oquab14.pdf
  • Anoop Cherian, Julien Mairal, Karteek Alahari, Cordelia Schmid. Mixing Body-Part Sequences for Human Pose Estimation. CVPR – IEEE Conference on Computer Vision and Pattern Recognition, Jun 2014, Columbus, OH, United States. pp.2361-2368, ⟨10.1109/CVPR.2014.302⟩.
    Full text: https://hal.inria.fr/hal-00978643/file/posecvpr2014.pdf
  • Yang Hua, Karteek Alahari, Cordelia Schmid. Occlusion and Motion Reasoning for Long-term Tracking. ECCV – European Conference on Computer Vision, Sep 2014, Zurich, Switzerland. pp.172-187, ⟨10.1007/978-3-319-10599-4_12⟩.
    Full text: https://hal.inria.fr/hal-01020149/file/tracking.pdf

2013

Journal article

  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. Temporal Localization of Actions with Actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2013, 35 (11), pp.2782-2795. ⟨10.1109/TPAMI.2013.65⟩.
    Full text: https://hal.inria.fr/hal-00804627/file/ASM_TPAMI_Gaidon.pdf

2012

Conference paper

  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. Recognizing activities with cluster-trees of tracklets. BMVC – British Machine Vision Conference, Sep 2012, Guildford, United Kingdom. pp.30.1-30.13, ⟨10.5244/C.26.30⟩.
    Full text: https://hal.inria.fr/hal-00722955/file/gaidon_tracklets_bmvc2012.pdf

Report

  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. Temporal Localization of Actions with Actoms. [Research Report] RR-7930, INRIA, 2012.
    Full text: https://hal.inria.fr/hal-00687312/file/RR-7930.pdf

2011

Conference papers

  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. Actom Sequence Models for Efficient Action Detection. CVPR – IEEE Conference on Computer Vision and Pattern Recognition, Jun 2011, Colorado Springs, United States. pp.3201-3208, ⟨10.1109/CVPR.2011.5995646⟩.
    Full text: https://hal.inria.fr/inria-00575217/file/1513.pdf
  • Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid. A time series kernel for action recognition. BMVC – British Machine Vision Conference, Aug 2011, Dundee, United Kingdom. pp.63.1-63.11, ⟨10.5244/C.25.63⟩.
    Full text: https://hal.inria.fr/inria-00613089/file/kernel_time_series.pdf

2009

Conference paper

  • Adrien Gaidon, Marcin Marszalek, Cordelia Schmid. Mining visual actions from movies. BMVC – British Machine Vision Conference, British Machine Vision Association, Sep 2009, London, United Kingdom. pp.125.1-125.11, ⟨10.5244/C.23.125⟩.
    Full text: https://hal.inria.fr/inria-00440973/file/gaidon_mining_actions_bmvc2009.pdf