Challenge description

There are two main goals for this challenge:

  • to test and compare different text visualization methods, ideas and algorithms on a common data-set, and
  • to contribute to the Pascal dissemination and promotion activities by using data about scientific publications from Pascal’s EPrints server.

The challenge is split into two rounds. The task in the first round is very general. A more precise task, together with precise evaluation criteria, will be decided at the workshop in Venice, based on the submissions and ideas from the first round.

The data is available in the raw EPrints format. In addition, we offer preprocessed versions of the raw data in several standard formats. We also processed the data into a “baseline” visualization which can be viewed with the Document Atlas software. In this way we have split the visualization pipeline into several steps, so that participants can work only on the parts that interest them. They can either:

  • process the provided data to generate a better input for Document Atlas,
  • visualize the provided processed baseline data in a different and nicer way than Document Atlas, or
  • visualize the raw or pre-processed data in a novel way,

with the goal of:

  • discovering main areas covered by the papers,
  • discovering how areas develop through time,
  • helping researchers with recommendations on which papers to read,
  • helping to find the right reviewers for new papers.

Rules

Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three visualizations based on any of the provided data. If a participant submits more than three solutions, the last three will be used for evaluation.

Organizers

  • Blaz Fortuna, IJS, Slovenia
  • Marko Grobelnik, IJS, Slovenia
  • Steve R. Gunn, University of Southampton, UK

Description

The data-set contains approx. 720 publications uploaded to the Pascal EPrints server. The data-set was provided by Steve Gunn.

The following fields are available for each publication from the EPrints server:

  • Date — date of publication
  • Title — short text description
  • Authors’ names — normalized and consolidated names (one person has just one surface form)
  • Authors’ institutions — normalized and consolidated names (one institution has just one surface form)
  • Abstract — on average, 10-20 lines of text

The documents are sorted by date of publication, with the earliest paper at the top of the file in all the formats.

We also provide processed data which can be transformed into input for the Document Atlas visualization tool. You can substitute some parts of the input (document positions, landscape matrix, keywords) with your own results and use Document Atlas just as a front end (GUI) for your visualization. All the necessary software can be downloaded from this page.

Downloads — data

The data-set is available in the following formats:
(for format descriptions check the bottom of this page!)

Raw format
Description: One big file with all the information included.
Download: XML

 

Bag-Of-Words format for publications
Description: Preprocessed raw data-set with one TFIDF vector for each publication.
Download: MATLAB-bow, text
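
As an illustration of how the text format relates to TFIDF vectors, here is a minimal Python sketch. It assumes the text format described at the bottom of this page (first word of each line is the document's name) and a common weighting, tf * log(N / df) with unit-length normalisation. The exact weighting used to produce the challenge files is not stated here, so treat the formula, the function names and the sample lines as assumptions.

```python
import math
from collections import Counter

def parse_text_format(lines):
    """Parse lines of the text format: the first token is the document's
    name, the remaining tokens are its words."""
    docs = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        docs[tokens[0]] = tokens[1:]
    return docs

def tfidf_vectors(docs):
    """Return {doc_name: {word: weight}} with weight = tf * log(N / df),
    scaled to unit length. Assumed weighting, not necessarily the one
    behind the provided files."""
    n_docs = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # count each word once per document
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        vec = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors[name] = vec
    return vectors

# Hypothetical sample lines, not taken from the data-set.
sample = [
    "paper1 kernel methods for text kernel",
    "paper2 text visualization methods",
]
vectors = tfidf_vectors(parse_text_format(sample))
```

Words that occur in every document (here "text" and "methods") receive weight zero under this weighting, which is why they contribute nothing to the similarity between documents.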

 

Bag-Of-Words format for authors
Description: Preprocessed raw data-set with one TFIDF vector for each author.
Download: MATLAB-bow, text

 

Bag-Of-Words format for institutions
Description: Preprocessed raw data-set with one TFIDF vector for each institution.
Download: MATLAB-bow, text

 

Graph format for words
Description: Preprocessed raw data-set where for each word there is a vertex in the graph; two words are connected if they appear in the same publication (title and abstract).
Download: MATLAB-graph, Pajek, GraphML
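
The construction of the word graph can be sketched as follows in Python. The function name is hypothetical, and counting the number of shared publications as an edge weight is an assumption; the description above only says that two words are connected.

```python
from itertools import combinations

def cooccurrence_edges(docs):
    """docs: iterable of word lists (title plus abstract of one publication).
    Returns {(word_a, word_b): number of publications containing both},
    with each pair stored once in alphabetical order."""
    edges = {}
    for words in docs:
        # set() removes repeated words, sorted() fixes the pair order
        for a, b in combinations(sorted(set(words)), 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Hypothetical toy input.
edges = cooccurrence_edges([
    ["kernel", "text", "kernel"],
    ["text", "graph"],
])
```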

 

Graph format for authors
Description: Preprocessed raw data-set where for each author there is a vertex in the graph; two authors are connected if they have co-authored a publication.
Download: MATLAB-graph, Pajek, GraphML

 

Graph format for institutions
Description: Preprocessed raw data-set where for each institution there is a vertex in the graph; two institutions are connected if an employee of one has co-authored a publication with an employee of the other.
Download: MATLAB-graph, Pajek, GraphML
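
A minimal Python sketch of how an institution graph can be derived from co-authorship is shown below. It assumes that every author has exactly one institution and that co-authors from the same institution produce no edge; neither assumption is confirmed by the description above, and the names used are hypothetical.

```python
from itertools import combinations

def institution_edges(publications, affiliation):
    """publications: iterable of author-name lists, one per publication.
    affiliation: {author: institution} (assumes one institution per author).
    Returns the set of undirected institution pairs, each stored once."""
    edges = set()
    for authors in publications:
        # institutions represented on this publication, without duplicates
        insts = sorted({affiliation[a] for a in authors if a in affiliation})
        edges.update(combinations(insts, 2))
    return edges

# Hypothetical toy input.
affiliation = {"A": "IJS", "B": "Southampton", "C": "IJS", "D": "IJS"}
edges = institution_edges([["A", "B"], ["B", "C"], ["A", "D"]], affiliation)
```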

 

Processed data for Document Atlas
Description: This is the processed version of the Bag-Of-Words files for publications and for authors. It can be used to generate an input file for Document Atlas using the Txt2VizMap utility. For detailed information, check the bottom of the page.
Download: publications, authors

 

Downloads — software

Description: This software bundle includes Document Atlas and the Txt2VizMap utility, which creates the .VizMap files that Document Atlas reads as input. For more details about this visualization, check the paper on Document Atlas. Please note that .NET Framework 2.0 is needed in order to run Document Atlas.
Download: software

 

Format descriptions

Here are details about the formats in which the data above is made available:

  • XML — easy to read XML format with list of publications.
  • MATLAB-bow — sparse term-document matrix (matrix with TFIDF vectors of documents as columns) which can be loaded with Matlab. It includes a map of publications/authors/institutions to column numbers and of words to row numbers.
  • text — one big text file where each line represents one document, with the first word in the line being the document’s name.
  • MATLAB-graph — sparse adjacency matrix which can be loaded with Matlab. It includes a map of words/authors/institutions to vertex numbers.
  • Pajek — format appropriate for use in the “Pajek” network analysis package.
  • GraphML — standard XML format for storing graphs, to be available soon.
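
For participants who build their own graphs, a minimal Python sketch of writing a graph in Pajek's .net layout follows. It uses the basic layout of a "*Vertices" section followed by an "*Edges" section with 1-based vertex numbers. Whether the files offered on this page also carry edge weights or use directed arcs is not stated, so this is an illustration and not a description of those files; the function name is hypothetical.

```python
def to_pajek(labels, edges):
    """labels: list of vertex labels.
    edges: iterable of (i, j) pairs of 0-based indices into labels.
    Returns the graph as text in Pajek .net layout (undirected, unweighted)."""
    lines = [f"*Vertices {len(labels)}"]
    for number, label in enumerate(labels, start=1):
        lines.append(f'{number} "{label}"')  # Pajek numbers vertices from 1
    lines.append("*Edges")
    for i, j in edges:
        lines.append(f"{i + 1} {j + 1}")
    return "\n".join(lines) + "\n"

# Hypothetical toy graph.
net_text = to_pajek(["kernel", "text", "graph"], [(0, 1), (1, 2)])
```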

More on Matlab formats:
Files for Matlab are stored in a format which can easily be transformed into a Matlab sparse matrix using the spconvert command.
Example which loads the term-document matrix with papers:
% read the file into a variable named papers
load papers.dat
% convert the loaded rows into a sparse matrix
X = spconvert(papers);
In files with maps, the word in the i-th line of the file is the label of the i-th element (row or column in a matrix, or vertex in a graph).
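
For readers without Matlab, the same loading step can be sketched in plain Python. This assumes that each line of the .dat file holds "row column value" with 1-based indices, which is the layout spconvert expects; entries with the same position are summed, as spconvert does, and a line with value 0 only fixes the matrix size. The function names are hypothetical.

```python
def load_triplets(lines):
    """Parse 'row col value' lines with 1-based indices.
    Returns (entries, n_rows, n_cols), where entries maps 0-based
    (row, col) pairs to values. Duplicate positions are summed."""
    entries = {}
    n_rows = n_cols = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        # int(float(...)) also accepts indices written like 1.0000000e+00
        i, j, v = int(float(parts[0])), int(float(parts[1])), float(parts[2])
        n_rows, n_cols = max(n_rows, i), max(n_cols, j)
        if v != 0.0:
            key = (i - 1, j - 1)
            entries[key] = entries.get(key, 0.0) + v
    return entries, n_rows, n_cols

def load_labels(lines):
    """Map file: the i-th line is the label of the i-th row, column or vertex."""
    return [line.strip() for line in lines]

# Hypothetical toy content standing in for papers.dat and a map file.
entries, n_rows, n_cols = load_triplets(["1 1 0.5", "2 3 1.5", "2 3 0.5", "4 5 0"])
labels = load_labels(["kernel\n", "text\n"])
```

With real files, the lines would come from something like open("papers.dat"), and labels[i] would name row or column i of the loaded matrix.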

Document Atlas:
A visualization in Document Atlas consists of several parts. To generate input for it using the Txt2VizMap utility, each part must be provided in a separate file. Here is the list of parts together with descriptions of their file formats:

  • documents — one big text file where each line represents one document, with the first word in the line being the document’s name.
  • documents’ positions — text file containing the documents’ positions. The first line of the file holds a list of x coordinates and the second line holds a list of y coordinates. The coordinates should be provided in the same order as the documents in the documents file. All the positions should be normalised so that 0<=x<=1 and 0<=y<=1.
  • landscapes — matrix containing the height points of the landscape. One line of the text file holds one row of the matrix. The top row of the matrix corresponds to the first line in the file. The top left corner of the matrix corresponds to the top left corner of the screen and the bottom right corner of the matrix corresponds to the bottom right corner of the screen.
  • keywords — list of keywords that appear on the landscape in Document Atlas. One line of the text file corresponds to one keyword, with the keyword being the first element in the line, followed by its x and y coordinates.

Note that only the file with documents and the file with their positions have to be provided. If landscapes are not provided, they are calculated using the standard Document Atlas method. The same holds for the keywords.
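
A minimal Python sketch of producing the positions file from your own 2-D layout is given below. It rescales each axis to the range 0..1 and writes the x coordinates on the first line and the y coordinates on the second. Separating the values with single spaces and printing six decimals are assumptions, since the exact separator and precision are not specified above; the function names are hypothetical.

```python
def normalise(values):
    """Rescale values linearly so that the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # all equal: place them in the middle
    return [(v - lo) / (hi - lo) for v in values]

def positions_text(xs, ys):
    """xs, ys: coordinates in the same order as the lines of the documents file.
    Returns the two-line content of a positions file."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_line = " ".join(f"{x:.6f}" for x in normalise(xs))
    y_line = " ".join(f"{y:.6f}" for y in normalise(ys))
    return x_line + "\n" + y_line + "\n"

# Hypothetical layout of three documents.
text = positions_text([2, 4, 6], [10, 10, 30])
```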

Several landscapes can be provided, and the user can choose between them inside Document Atlas. The name of a landscape shown in the program corresponds to its file name. A series of landscapes can be used, for example, to bring a time component into the visualization.
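
One way to build such a time series is sketched below in Python: one landscape per publication year, where the height of a grid cell is simply the number of documents falling into it. This is not the standard Document Atlas method. The grid size, the space-separated layout, the use of raw counts as heights and the assumption that y = 0 is the top of the screen are all illustrative choices, and the function names are hypothetical.

```python
def density_landscape(points, size=4):
    """points: iterable of (x, y) with 0 <= x <= 1 and 0 <= y <= 1.
    Returns a size x size matrix of document counts; row 0 is the top row."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(int(x * size), size - 1)  # x = 1 falls into the last column
        row = min(int(y * size), size - 1)  # assumes y = 0 is the top of the screen
        grid[row][col] += 1
    return grid

def landscape_text(grid):
    """One line of text per matrix row, top row first."""
    return "\n".join(" ".join(str(h) for h in row) for row in grid) + "\n"

def landscapes_by_year(docs, size=4):
    """docs: iterable of (year, x, y). Returns {year: landscape file content}."""
    by_year = {}
    for year, x, y in docs:
        by_year.setdefault(year, []).append((x, y))
    return {year: landscape_text(density_landscape(pts, size))
            for year, pts in sorted(by_year.items())}

# Hypothetical documents: (year, x, y).
series = landscapes_by_year(
    [(2005, 0.1, 0.1), (2005, 0.9, 0.9), (2006, 1.0, 0.0)], size=2)
```

Each value of the returned dictionary could be saved to its own file, named so that the year is visible in the landscape list inside Document Atlas.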

Example: the following command (entered as a single line) creates a Document Atlas input file from the processed data available on this page:
Txt2VizMap.exe -inlndoc:papers.txt
-ipoint:txt_papers.points.dat
-ils:txt_papers.Landscape1.dat;txt_papers.Landscape2.dat
-ikeywd:txt_papers.keywd.dat -o:txt_papers.VizMap
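
When the command is generated from a script, the argument list can be assembled as in the Python sketch below. The option names are copied from the example above; the helper itself is hypothetical, and leaving out the landscape and keyword options relies on the note above that these two inputs are optional.

```python
def txt2vizmap_args(documents, points, output,
                    landscapes=(), keywords=None, exe="Txt2VizMap.exe"):
    """Build the Txt2VizMap argument list, using the option names shown
    in the example command. Landscape files are joined with ';'."""
    args = [exe, f"-inlndoc:{documents}", f"-ipoint:{points}"]
    if landscapes:
        args.append("-ils:" + ";".join(landscapes))
    if keywords:
        args.append(f"-ikeywd:{keywords}")
    args.append(f"-o:{output}")
    return args

args = txt2vizmap_args(
    "papers.txt", "txt_papers.points.dat", "txt_papers.VizMap",
    landscapes=["txt_papers.Landscape1.dat", "txt_papers.Landscape2.dat"],
    keywords="txt_papers.keywd.dat",
)
```

The resulting list can be passed to subprocess.run(args) on a machine where Txt2VizMap.exe is available.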

Evaluation criteria

For evaluation of the submitted results in the second round, we plan to use 4 criteria:

  • Usability of visualization — The goal is to assess usability of particular visualization in different practical contexts.
  • Innovativeness — The goal is to estimate how innovative the ideas used for visualization are.
  • Aesthetics of the image — Here we are aiming to identify the “nicest” images from the challenge.
  • General Pascal-researchers’ voting over the web about “who likes what”.

Since all the criteria are subjective, we will hire experts to judge the quality. Each criterion will produce a separate ranking of the results.