Open PhD Position: Learning Representations of Large-Scale Multi-Relational Data. Application to Link Prediction in Knowledge Bases.

Supervision : Antoine Bordes and Yves Grandvalet, CNRS – Université de Technologie de Compiègne.

Dates : the position starts between November 1st, 2012 and January 1st, 2013 (earlier or later start dates are negotiable).

Context :

A PhD studentship is available as part of the French ANR-funded project EVEREST on “lEarning high-leVEl REpresentations of large Sparse Tensors”, carried out by the Heudiasyc laboratory at the Université de Technologie de Compiègne in partnership with Xerox Research Center Europe (Grenoble, France). See https://www.hds.utc.fr/everest for more details on the project.

The student will be based in the Heudiasyc laboratory in Compiègne (France) and will join the DI team headed by Yves Grandvalet. He/she will be supervised by Antoine Bordes (https://www.hds.utc.fr/~bordesan) and Yves Grandvalet (https://www.hds.utc.fr/~grandval). Heudiasyc is a joint laboratory of the Université de Technologie de Compiègne (UTC) and the French National Centre for Scientific Research (CNRS). In 2011, it was rated A+ (the highest rating) by the French research evaluation agency (AERES).
Heudiasyc fosters interdisciplinary research on information science and technology, including machine learning, uncertain reasoning, operations research, robotics and knowledge management. In 2011, Heudiasyc was awarded a laboratory of excellence (LabEx) project on the “Control of Technological Systems of Systems”. The project will also include a collaboration with Xerox Research Center Europe, through interactions with Guillaume Bouchard (http://www.xrce.xerox.com/).

The studentship is funded by the ANR project for 3 years (currently 1850€ per month, gross salary) and will start between November 1st, 2012 and January 1st, 2013.

Requirements :

The PhD candidate should have, or expect to obtain, an MSc or equivalent in computer science or mathematics. The following qualities are desirable:
* strong interest in machine learning or statistics
* excellent record of academic and/or professional achievement
* strong mathematical skills
* strong programming skills
* good written and spoken communication skills in French or English

The ideal candidate should be able to conduct theoretical research, but also to implement and test models on very large datasets.

Project description :

Huge amounts of structured and relational data are available in many domains of engineering, industry and research, ranging from the Semantic Web and bioinformatics to recommender systems. As a result, knowledge bases (KBs) such as Freebase, WordNet or GeneOntology have become essential tools for storing, manipulating and accessing information. However, they are also incomplete, imprecise and far too large to be used as efficiently and broadly as they could be. Hence, there is a need for methods able to summarize, complete or merge these large databases. This is the motivation of the project.

The data of these KBs is naturally represented as a so-called multi-relational graph, in which nodes correspond to entities and the different types of edges between nodes correspond to the different types of relations. The first phase of the project will consist in developing and evaluating an approach based on energy-based learning [4] for deriving high-level representations of such multi-relational graphs. By high-level, we mean that these representations should make it possible to condense the original databases, to complete them by filling in missing values, and to ease their matching and merging. Energy-based models could provide a new direction for dealing with multi-relational data, and will be compared with traditional low-rank methods [5] and Bayesian approaches [2,6]. They have already shown some promising preliminary results [1]. Another goal of the thesis will be to bridge energy-based learning in this context and tensor factorization [3].
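
As a purely illustrative sketch (not part of the project's code), a multi-relational graph can be stored as a list of (head, relation, tail) triples, which is equivalent to a sparse binary 3-way tensor whose non-zero entries are the observed facts. The toy triples and the helper names `build_index` and `to_sparse_tensor` below are hypothetical and chosen only for this example.

```python
# Toy knowledge base: each fact is a (head, relation, tail) triple.
# The facts themselves are made up for illustration.
triples = [
    ("cat", "is_a", "mammal"),
    ("dog", "is_a", "mammal"),
    ("cat", "has_part", "tail"),
]


def build_index(triples):
    """Map every entity and relation name to an integer id."""
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    relations = sorted({r for _, r, _ in triples})
    ent_id = {e: i for i, e in enumerate(entities)}
    rel_id = {r: k for k, r in enumerate(relations)}
    return ent_id, rel_id


def to_sparse_tensor(triples, ent_id, rel_id):
    """Return the coordinates (i, k, j) of the non-zero entries of the
    binary tensor X, where X[i, k, j] = 1 iff relation k holds
    between entity i and entity j."""
    return {(ent_id[h], rel_id[r], ent_id[t]) for h, r, t in triples}


ent_id, rel_id = build_index(triples)
tensor = to_sparse_tensor(triples, ent_id, rel_id)
print(len(ent_id), "entities,", len(rel_id), "relations,", len(tensor), "facts")
```

Only the observed entries are stored, which is what keeps this representation usable when the number of entities is very large and almost all entries of the tensor are zero.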

In a second phase, these new representations will be applied to link prediction, i.e. uncovering relationships that probably exist in a multi-relational graph but have not been observed, on benchmark data and on real-world data provided by Xerox.
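
To make the link prediction setting concrete, here is a minimal sketch of how candidate links could be ranked once an energy function is available, with lower energy meaning a more plausible triple. Everything in it is an assumption made for illustration: the embeddings are random and untrained, the energy is a generic distance between relation-specific rescalings of two entity vectors, and the names `init_embeddings`, `energy` and `rank_tails` are hypothetical. It is not the model that the project will develop.

```python
import random


def init_embeddings(n_entities, n_relations, dim, seed=0):
    """Random (untrained) embeddings: one vector per entity and one pair of
    diagonal operators (left, right) per relation."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    entities = [vec() for _ in range(n_entities)]
    relations = [(vec(), vec()) for _ in range(n_relations)]
    return entities, relations


def energy(entities, relations, h, r, t):
    """L1 distance between the rescaled head and tail embeddings.
    Lower energy is interpreted as a more plausible triple (h, r, t)."""
    left, right = relations[r]
    return sum(
        abs(a * x - b * y)
        for a, b, x, y in zip(left, right, entities[h], entities[t])
    )


def rank_tails(entities, relations, h, r, observed):
    """Rank all unobserved tail candidates for (h, r, ?) by increasing energy."""
    candidates = [t for t in range(len(entities)) if (h, r, t) not in observed]
    return sorted(candidates, key=lambda t: energy(entities, relations, h, r, t))


entities, relations = init_embeddings(n_entities=4, n_relations=2, dim=3)
observed = {(0, 1, 2)}  # facts already present in the knowledge base
print(rank_tails(entities, relations, h=0, r=1, observed=observed))
```

In practice the embeddings would first be trained so that observed triples receive lower energy than unobserved ones; the ranking step itself stays the same.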

References :

[1] Bordes, A., J. Weston, R. Collobert, and Y. Bengio. “Learning Structured Embeddings of Knowledge Bases.” Proceedings of the International Conference on Artificial Intelligence (AAAI). AAAI Press, 2011.
[2] Kemp, C., J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda. “Learning Systems of Concepts with an Infinite Relational Model.” Proceedings of the International Conference on Artificial Intelligence (AAAI). AAAI Press, 2006.
[3] Kolda, T.G., and B.W. Bader. “Tensor Decompositions and Applications.” SIAM Review, 2008.
[4] LeCun, Y., S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. “A Tutorial on Energy-Based Learning.” In G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, and B. Taskar (eds.), Predicting Structured Data. MIT Press, 2006.
[5] Nickel, M., V. Tresp, and H.-P. Kriegel. “A Three-Way Model for Collective Learning on Multi-Relational Data.” Proceedings of the International Conference on Machine Learning (ICML). Bellevue, WA: Omnipress, 2011.
[6] Sutskever, I., R. Salakhutdinov, and J.B. Tenenbaum. “Modelling Relational Data using Bayesian Clustered Tensor Factorization.” Advances in Neural Information Processing Systems (NIPS). Vancouver, BC, 2010.

Contact and application :

Applicants should send (preferably as a single PDF file):
* a CV
* a brief statement of research interests
* references (with email and phone number)
* their academic transcript
* a sample of their strongest publications or coursework (e.g. Master's thesis)

Applications and inquiries should be directed to:
* Antoine Bordes – antoine.bordes@hds.utc.fr
* Yves Grandvalet – yves.grandvalet@hds.utc.fr