* Large Scale Text Categorization
* Grenoble University, Grenoble, France
* Position open from 1 September 2011 (earlier or later start dates are negotiable)
Several categorization problems rely on category systems comprising several thousands of categories, in which new elements have to be placed. For example, DMOZ (http://dmoz.org/), which is supposed to be the largest repository of the web, contains more than 590,000 categories, in which new web pages are categorized by a team of worldwide volunteers, each being in charge of a sub-portion of the system. A similar situation is faced with systems aiming at assigning concepts or tags, from a given ontology/thesaurus or sets of tags, to documents or parts of documents. PubMed, for example, contains more than 16 million references, the abstracts of which are indexed with concepts from MeSH (http://www.nlm.nih.gov/pubs/factsheets/mesh.html), a thesaurus comprising more than 150,000 concepts. Similarly, social networks à la Flickr make use of tags which are manually assigned by users, but which can be automatically “propagated” to ensure consistence over the entire network. In all these examples, the number of categories involved requires, for both maintaining and browsing the category system, a hierarchical organization of the categories.
Several approaches have been investigated for deploying categorizers in large-scale category systems, as big-bang approaches, in which one tries to directly categorize documents into the leaf categories, without making use of the category hierarchy, or top-down approaches, in which the hierarchy is exploited to divide the overall classification problem into smaller ones. Studies reported in  and  suggested that the direct deployment of a single classifier onto the complete set of categories (big-bang approaches) may lead to training and classification time which are unacceptable. On the contrary, top-down approaches, by simplifying the overall problem, lead to acceptable classification times. However they tend to propagate errors made at higher levels of the hierarchy. More recently, alternative approaches have been proposed, aiming at locally propagating tags to close neighbors according to the tags already present and the topology of the network , or attempting to find efficient representations of large scale collections (either of documents, as the semantic hashing approach proposed in , or of the category system, as the conditional probability trees defined in ).
We want to study these different approaches, and in particular devise methods to obtain the optimal subset of categories to be considered at any given point in a tree or cascade of categorizers. This is related to the two following questions:
1. What happens on the generalization error of the classifier, when we replace a category by its children?
2. What is the optimal (according to the generalization error) subset of categories to retain, and how to efficiently represent them and deploy categorizers?
We want to investigate these points with several types of classifiers (as discriminative, large margin classifiers, or maximum likelihood classifiers), and develop theoretically well founded methods which are proved effective on very large data collections, as the ones currently used in the Large Scale Hierarchial Text Categorization Challenge of the PASCAL network of excellence (http://lshtc.iit.demokritos.gr:10000/LSHTC2_CFP).
*Keywords: classification, large set of categories, conditional probability trees
 T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy, SIGKDD Explorations, 2005.
 O. Madani, W. Greiner, D. Kempe, M. R. Salavatipou. Recall systems: Efficient learning and use of category indices, International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
 S. Bao, B. Yang, B. Fei, S. Xu, Z. Su, Y. Yu. Boosting social annotations using propagation, Proceeding of the 17th ACM conference on Information and knowledge management (CIKM ’08), 2008.
 R. Salakhutdinov, G. Hinton. Semantic hashing, International Journal of Approximate Reasoning, 2009.
 A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, A. Strehl. Conditional Probability Tree Estimation Analysis and Algorithms, 2009 (arXiv:0903.4217v2).
The candidate for the PhD position should have a Master’s degree in computer science or applied mathematics with knowledge of machine learning concepts. The ideal candidate should be able to conduct theoretical research, but also implement and test models on very large datasets.
Applicants should send (preferably as a single PDF file):
* a CV
* a brief statement of research interests
* references (with email and phone number)
* their academic transcript
* a sample of strongest publications or course work (e.g. Master thesis)
Applications and inquiries should be directed to:
Cecile Amblard – cecile.amblard(at)imag.fr
Eric Gaussier – eric.gaussier(at)imag.fr
3 years (starting in Sept. 2011)
The PhD is fully financed through a French Research National Agency grant.