Context: 4D modeling offers exciting new prospects for shape and performance capture in virtual and mixed reality contexts. It is the process of retrieving geometry and appearance information from multiple temporal frames of a scene or subject, observed from one or several color or depth cameras. Leveraging temporal analysis opens the possibility of geometric and/or texture detail refinement, as exhibited for example in dynamic fusion approaches, or in the Morpheo team work “High Resolution 3D Shape Texture from Multiple Videos”. It generally requires solving 3D sequence tracking/alignment, with techniques such as non-rigid ICP, in order to accumulate information in a time-independent, pose-corrected representation. Many challenges remain before these goals can be achieved in practice: information representation, computation time, full-body scope of the temporal accumulation, treatment of everyday and loose clothing and wearable accessories, sequence geometry realignment under fast motions, and efficient and robust estimation, all required for applicability in a virtual/mixed reality context.
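As a minimal illustration of the alignment principle mentioned above, the sketch below implements rigid point-to-point ICP in NumPy. The non-rigid variants used for 4D sequence alignment build on the same closest-point / best-transform alternation, but add per-vertex or patch deformation parameters and smoothness regularization; the function name and brute-force matching here are illustrative only.

```python
# Minimal rigid point-to-point ICP sketch (illustrative only): non-rigid
# variants used for 4D sequence alignment add per-vertex deformation
# parameters and regularization on top of this basic alternation.
import numpy as np

def icp_rigid(source, target, n_iters=20):
    """Align source (N,3) to target (M,3) with a rigid transform (R, t)."""
    R, t = np.eye(3), np.zeros(3)
    src = source.copy()
    for _ in range(n_iters):
        # 1. Closest-point correspondences (brute force for clarity).
        d2 = ((src[:, None, :] - target[None, :, :]) ** 2).sum(-1)
        matched = target[d2.argmin(axis=1)]
        # 2. Best rigid transform via the Kabsch / SVD solution.
        mu_s, mu_m = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_m)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T
        t_step = mu_m - R_step @ mu_s
        # 3. Apply the incremental transform and accumulate it.
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```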
Objective: We propose to explore how hybrid representations, together with pre-learned low- or mid-level geometry, motion and texture characteristics and statistics, could contribute both to efficiently encoding the set of configurations likely to be observed in the field and to accelerating their estimation through discriminative approaches. Recent learning techniques such as random forests and convolutional neural networks can be used efficiently toward this goal. Highly detailed representations can be acquired with specific setups (e.g. the Kinovis platform at INRIA Grenoble), which makes it possible to produce the training data these techniques require with high precision; the encoded knowledge can then be exploited in more practical everyday setups with fewer, lower-resolution video cameras. We typically envision breakthroughs in the 4D modelling domain that would allow a virtual avatar of a person to be built rapidly in an interactive environment equipped with a few cameras/depth sensors, such that both its geometry and appearance are gradually estimated and refined while remaining immediately usable for mixed reality augmentation and telepresence.
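To make the discriminative idea concrete, here is a hedged sketch (not the project's actual pipeline) of how pre-learned statistics might be exploited: a random forest is trained offline on high-quality captures to map coarse local shape descriptors to fine-scale displacements, and is then reused at capture time on lower-resolution data. Feature dimensions and the random training arrays are placeholders.

```python
# Illustrative sketch: random forest regressing fine per-vertex displacement
# from coarse local descriptors, trained offline on detailed captures.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: one row per surface point.
# X_train: coarse descriptors (e.g. local normals, curvature, neighborhood shape)
# y_train: fine-scale displacement along the normal, measured on the detailed scan
X_train = np.random.rand(10000, 16)
y_train = np.random.rand(10000)

forest = RandomForestRegressor(n_estimators=100, max_depth=12, n_jobs=-1)
forest.fit(X_train, y_train)

# At capture time, descriptors computed on the low-resolution reconstruction
# are fed to the forest to predict plausible fine detail.
X_lowres = np.random.rand(500, 16)
detail = forest.predict(X_lowres)   # (500,) predicted displacements
```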
Animation Synthesis
Context: Recent 4D shape capture systems have the ability to model dynamic scenes in a precise and realistic way, hence providing a meaningful way to produce content for VAR applications. Beyond the intrinsic properties of the 4D models, e.g. their precision, which can still be improved, an interesting issue is how to recombine recorded 4D models in order to produce new animations, subject to constraints such as interactive control or motion transfer. In addition to broadening the creative possibilities offered by 4D models, this also reduces the need for exhaustive acquisitions when large datasets or complex dynamic scenes (e.g. a crowd) are targeted.
Objective: While some preliminary works have already investigated concatenation of, or motion transfer between, 4D models, there are still many directions to explore:
. How to recombine or transfer information in a hierarchical way, between subjects and between their body parts.
. How to design interactive systems where animations are produced in real time.
. How to model the motion and appearance information of dynamic shapes in order to enable interpolation (e.g. pose spaces for motion; see the interpolation sketch below).
. How to create complex scenes with interacting shapes, e.g. crowds
. How to preserve individual style when creating or reproducing a given motion
We propose to explore some of the above issues, with a particular emphasis on interactive setups for VR or AR experiences with HMDs, and to take advantage of the Kinovis platform to produce data for that purpose.
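As a minimal sketch of the interpolation point above, assuming poses are stored as per-joint unit quaternions (one common "pose space"; the representation actually used for captured 4D models may differ), two poses can be blended by spherical linear interpolation:

```python
# Minimal pose-interpolation sketch, assuming a per-joint quaternion pose space.
import numpy as np

def slerp(q0, q1, u):
    """Spherical linear interpolation between two unit quaternions, u in [0, 1]."""
    dot = np.dot(q0, q1)
    if dot < 0.0:                     # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                  # nearly identical: fall back to lerp
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)

def interpolate_pose(pose_a, pose_b, u):
    """Blend two poses given as (n_joints, 4) quaternion arrays."""
    return np.stack([slerp(qa, qb, u) for qa, qb in zip(pose_a, pose_b)])
```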
People can easily learn how to change a flat tire of a car or assemble an IKEA shelf by observing other people doing the same task, for example by watching a narrated instruction video. In addition, they can easily perform the same task in a new environment, for example at their home. This involves advanced visual intelligence abilities such as recognizing objects and their function, as well as interpreting sequences of human actions that achieve a specific task. However, there is currently no artificial system with a similar cognitive visual competence. The scientific objectives are to develop models, representations and learning algorithms for (i) automatic understanding of task-driven complex human activities from videos narrated with natural language, in order to (ii) give people instructions in a new environment via an augmented reality device such as the Microsoft HoloLens. Solving these challenges will enable virtual assistants that may guide a child through simple games to improve his/her manipulation and language skills; help an elderly person to achieve everyday tasks; or facilitate the training of a new worker for highly specialized machinery maintenance. HoloLens will provide an ideal physical embodiment of such a personal assistant. In addition, the developed visual intelligence capabilities will prove useful for constructing smart assistive robots that automatically learn new skills just by observing people. This project is expected to make a step towards these applications by developing new models and algorithms that capture complex person-object interactions from Internet instruction data and generalize those interactions to new, previously unseen environments observed from a first-person view through the HoloLens device. The project is organized into two topics.
Learning from instruction videos
The objective is to develop video representations and learning algorithms for understanding task-driven complex human activities from videos narrated with natural language. The goal is, given a set of instruction videos depicting a particular task, such as “changing a car tire”, to learn a model of this task comprising the individual steps, the corresponding object manipulations and their language descriptions (e.g. “jack up the car”, “undo the bolts”, “remove the flat tire”, etc.). The resulting representations will be generic so that they can be applied to give instructions in a previously unseen environment. In particular, this subproject will (i) investigate joint models of language and video based on distributed and recurrent language representations, (ii) develop efficient large-scale weakly-supervised learning techniques that can be applied to thousands of tasks and will enable learning representations shared among similar manipulation actions collected from the Internet, and (iii) investigate models (e.g. in the form of causal graphs) for representing and learning the complex relationships between the sub-goals needed to achieve a given task.
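One plausible form of such a joint model of language and video, sketched below under stated assumptions rather than as the project's final design, is a two-tower embedding: clip features and narration features are projected into a shared space and trained with a max-margin ranking loss so that a clip and its narration score higher than mismatched pairs. All dimensions and the placeholder features are assumptions.

```python
# Hedged sketch of a joint video-text embedding trained with a ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=300, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T                      # (B, B) similarity matrix

def ranking_loss(sim, margin=0.2):
    """Matched pairs lie on the diagonal; push mismatched pairs below them
    by a margin (one direction shown for brevity)."""
    pos = sim.diag().unsqueeze(1)           # (B, 1) positive scores
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost = (margin + sim - pos).clamp(min=0).masked_fill(eye, 0.0)
    return cost.mean()

# Example step with placeholder features (batch of 8 clip/narration pairs).
model = JointEmbedding()
sim = model(torch.randn(8, 2048), torch.randn(8, 300))
loss = ranking_loss(sim)
loss.backward()
```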
Modeling and recognizing hand-object interactions
The goal of this part of the project is to recognize sequences of object manipulations so as to suggest next steps and to correct a person performing an activity. Towards this goal, we plan to model the poses of hands and objects from egocentric videos depicting hand-object interactions. The non-rigid 3D hand poses and the 3D poses of known objects will be estimated jointly while respecting the constraints of the physical forces involved in the interaction. We propose to address hand-object pose estimation within a CNN framework. A new loss function will be defined to penalize pose errors of hands and objects individually, as well as to penalize physically implausible hand-object interactions. To train the model, we will generate a large number of synthetic egocentric videos with objects manipulated by hands. The method will be evaluated on real videos with object interactions.
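The sketch below shows one assumed structure for such a composite loss, not the project's final formulation: individual pose terms for the hand and the object, plus a physical-plausibility term that penalizes hand joints predicted inside the object. It assumes signed distances of the predicted joints to the object surface are available and differentiable.

```python
# Hedged sketch of a composite hand-object loss (assumed structure).
import torch
import torch.nn.functional as F

def hand_object_loss(pred_hand_joints, gt_hand_joints,
                     pred_obj_pose, gt_obj_pose,
                     signed_dists, w_hand=1.0, w_obj=1.0, w_phys=0.1):
    """
    pred/gt_hand_joints: (B, J, 3) 3D hand joint positions
    pred/gt_obj_pose:    (B, D) object pose parameters
    signed_dists:        (B, J) signed distance of each predicted joint to the
                         object surface (negative = penetration)
    """
    loss_hand = F.mse_loss(pred_hand_joints, gt_hand_joints)   # hand pose error
    loss_obj = F.mse_loss(pred_obj_pose, gt_obj_pose)          # object pose error
    # Penalize only penetrations: positive (outside) distances cost nothing.
    loss_phys = (-signed_dists).clamp(min=0).mean()
    return w_hand * loss_hand + w_obj * loss_obj + w_phys * loss_phys
```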