Hierarchies are becoming ever more popular for organizing documents, particularly on the Web, where Web directories are a typical example. Along with their widespread use comes the need to automatically classify new documents into the categories of the hierarchy. As the hierarchy grows and the number of documents to be classified increases, a number of interesting problems arise. In particular, this is one of the rare situations where data sparsity remains an issue despite the vastness of available data. The reasons are the simultaneous increase in the number of classes and their hierarchical organization. The latter leads to a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence between the classes poses both challenges and opportunities for learning methods.
Research on large-scale classification has so far focused on situations involving a large number of documents and/or a large number of features, with a limited number of categories. This is not the case in hierarchical category systems such as DMOZ or the International Patent Classification, where, in addition to the large number of documents and features, there is a large number of categories, on the order of tens or hundreds of thousands. To approach this problem, either existing large-scale classifiers can be extended or new methods need to be developed. The goal of this workshop is to discuss and assess some of these strategies, covering all or part of the issues mentioned above.
- Eric Gaussier, LIG, Grenoble, France
- George Paliouras, NCSR "Demokritos", Athens, Greece
- Aris Kosmopoulos, NCSR "Demokritos" and AUEB, Athens, Greece
- Sujeevan Aseervatham, LIG, Grenoble, France