Topic: The goal of this PhD is to better understand videos by exploiting associated textual data. For still images, some success has been achieved in learning correspondences between objects and textual keywords. In video, it has been demonstrated that transcripts aligned with the video can be a very useful source of weak supervision for learning the appearance of characters and human actions.
Existing work does not attempt to use text as a form of supervision for learning spatio-temporal constraints between scenes, humans, objects and their interactions in video. In addition, text is typically treated only as a supervisory signal for visual learning; the opposite direction, where visual information helps disambiguate the interpretation of text, is not considered. In this PhD, we propose to go beyond the state of the art and turn textual annotations into a more complete and accurate supervisory signal for the different stages of the scene/object/human action interpretation process. In particular, we want to develop spatio-temporal correspondences between videos and the available text annotations, and exploit these correspondences as constraints for learning actions in videos.
* Master's degree (preferably in Computer Science or Applied Mathematics; Electrical Engineering will also be considered)
* Solid programming skills; the project involves programming in C
* Solid mathematics knowledge (especially linear algebra and
* Creative and highly motivated
* Fluent in English, both written and spoken
* Prior knowledge in the areas of computer vision, machine learning or data mining is a plus (ideally a Master's thesis in a related field)
Duration: 3-4 years
Start date: September 2012.
Location: INRIA Grenoble, France. Grenoble lies in the French Alps and offers ideal conditions for skiing, hiking, climbing, etc.
Contact: Cordelia Schmid, email@example.com
Please send applications via email, including:
* a complete CV
* graduation marks
* topic of your Master's thesis
* the names and email addresses of two references (including your Master's thesis supervisor)
M. Guillaumin, T. Mensink, J. Verbeek and C. Schmid. International Journal of Computer Vision, 2012.
M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision
O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In International Conference on Computer Vision, 2009.