There are two main goals for this challenge:
- to test and compare different text visualization methods, ideas and algorithms on a common data-set, and
- to contribute to the Pascal dissemination and promotion activities by using data about scientific publications from Pascal’s EPrints server.
The challenge is split into two rounds. The task in the first round is deliberately general; a more precise task, together with precise evaluation criteria, will be decided at the workshop in Venice, based on the submissions and ideas from the first round.
The data is available in the raw ePrints format. In addition, we offer the raw data preprocessed into several standard formats. We have also processed the data into a “baseline” visualization which can be viewed with the Document Atlas software. In this way the visualization pipeline is split into several steps, allowing participants to work only on the parts that interest them. They can either:
- process the provided data to generate a better input for Document Atlas,
- visualize the provided processed baseline data differently from, and more attractively than, Document Atlas, or
- visualize the raw or pre-processed data in a novel way,
with the goal of:
- discovering main areas covered by the papers,
- discovering how those areas develop through time,
- helping researchers with recommendations on which papers to read,
- helping to find the right reviewers for new papers.
Anyone is allowed to participate. A participant may be either a single person or a group; a single person can participate in at most two groups. A participant is allowed to submit at most three visualizations made from any of the provided data. If a participant submits more than three solutions, only the last three will be evaluated.
Organizers
- Blaz Fortuna, IJS, Slovenia
- Marko Grobelnik, IJS, Slovenia
- Steve R. Gunn, University of Southampton, UK
The data-set consists of approximately 720 publications uploaded to the Pascal EPrints server; it was provided by Steve Gunn.
The following fields are available for each publication from the EPrints server:
- Date — date of publication
- Title — short text description
- Authors’ names — normalized and consolidated names (one person has just one surface form)
- Authors’ institutions — normalized and consolidated names (one institution has just one surface form)
- Abstract — on average, 10-20 lines of text
The documents are sorted by date of publication, with the earliest paper at the top of the file in all formats.
We also provide processed data which can be turned into an input file for the Document Atlas visualization tool. You can substitute some parts of the input (document positions, landscape matrix, keywords) with your own results and use Document Atlas purely as a front end (GUI) for your visualization. All the necessary software can be downloaded from this page.
Downloads — data
The data-set is available in the following formats:
(for format descriptions, see the bottom of this page)
|Description:||One big file with all the information included.|
|Bag-Of-Words format for publications|
|Description:||Preprocessed raw data-set with one TFIDF vector for each publication.|
|Bag-Of-Words format for authors|
|Description:||Preprocessed raw data-set with one TFIDF vector for each author.|
|Bag-Of-Words format for institutions|
|Description:||Preprocessed raw data-set with one TFIDF vector for each institution.|
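To make the bag-of-words files concrete, here is a minimal sketch of how a TFIDF vector is computed (in Python, on a toy corpus; it illustrates the weighting scheme only and is not the exact preprocessing used for the challenge data):

```python
import math
from collections import Counter

# Toy corpus standing in for publication titles/abstracts (not the challenge data).
docs = [
    "kernel methods for pattern analysis",
    "pattern recognition with neural networks",
    "kernel principal component analysis",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs.
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Return a {word: weight} map with TF * log(N / DF) weighting."""
    tf = Counter(doc)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

vec = tfidf(tokenized[0])
# Words occurring in many documents (e.g. "kernel", in 2 of 3) get a lower
# weight than words occurring in a single document (e.g. "methods").
```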
|Graph format for words|
|Description:||Preprocessed raw data-set where each word is a vertex in the graph; two words are connected if they appear together in the same publication (title and abstract).|
|Download:||MATLAB-graph, Pajek, GraphML|
|Graph format for authors|
|Description:||Preprocessed raw data-set where each author is a vertex in the graph; two authors are connected if they have co-authored a publication.|
|Download:||MATLAB-graph, Pajek, GraphML|
|Graph format for institutions|
|Description:||Preprocessed raw data-set where each institution is a vertex in the graph; two institutions are connected if an employee of one has co-authored a publication with an employee of the other.|
|Download:||MATLAB-graph, Pajek, GraphML|
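The author graph above can also be rebuilt from the raw data in a few lines. A minimal Python sketch, using a hypothetical list of per-publication author lists (the real lists come from the ePrints XML dump):

```python
from itertools import combinations

# Hypothetical per-publication author lists; the actual names would be
# read from the XML dump of the ePrints server.
publications = [
    ["A. Smith", "B. Jones"],
    ["B. Jones", "C. Lee", "A. Smith"],
    ["C. Lee"],
]

# Two authors are connected if they co-authored at least one publication.
edges = set()
for authors in publications:
    for a, b in combinations(sorted(set(authors)), 2):
        edges.add((a, b))

# `edges` now holds the undirected co-authorship graph as vertex pairs,
# ready to be written out in Pajek or GraphML form.
```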
|Processed data for Document Atlas|
|Description:||This is the processed version of the Bag-of-Words files for publications and for authors. It can be used to generate an input file for Document Atlas using the Txt2VizMap utility. For detailed information, check the bottom of the page.|
Downloads — software
|Description:||This software bundle includes the Txt2VizMap utility for creating .VizMap files, which Document Atlas reads as input, together with Document Atlas itself. For more details about this visualization, see the paper on Document Atlas. Please note that the .NET Framework 2.0 is required to run Document Atlas.|
Here are details about the formats in which the above data is made available:
- XML — easy to read XML format with list of publications.
- MATLAB-bow — sparse term-document matrix (a matrix with the TFIDF vectors of documents as columns) which can be loaded with Matlab. It includes maps of publications/authors/institutions to column numbers and of words to row numbers.
- text — one big text file where each line represents one document, with the first word on the line being the document’s name.
- MATLAB-graph — sparse adjacency matrix which can be loaded with Matlab. It includes a map of words/authors/institutions to vertex numbers.
- Pajek — format suitable for use with the “Pajek” network analysis package.
- GraphML — standard XML format for storing graphs, to be available soon.
More on Matlab formats:
Files for Matlab are stored in a format which can easily be converted to a Matlab sparse matrix using the spconvert command.
Example which loads the term-document matrix for papers:
X = spconvert(papers);
In files with maps, the word on the i-th line of the file corresponds to the label of the i-th element (row or column in a matrix, or vertex in a graph).
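The same triplet files can also be read outside Matlab. A minimal Python sketch, assuming each line holds a 1-based `row column value` triplet as spconvert expects (the file name used here is hypothetical):

```python
def load_spconvert(path):
    """Read a Matlab spconvert-style triplet file into a dict keyed by
    (row, col) with 0-based indices, plus the matrix dimensions."""
    matrix = {}
    n_rows = n_cols = 0
    with open(path) as f:
        for line in f:
            r, c, v = line.split()
            i, j = int(r) - 1, int(c) - 1   # spconvert indices are 1-based
            matrix[(i, j)] = float(v)
            n_rows = max(n_rows, i + 1)
            n_cols = max(n_cols, j + 1)
    return matrix, (n_rows, n_cols)
```

From here the dict of entries can be handed to any sparse-matrix library if one is available.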
A visualization in Document Atlas consists of several parts; to generate input for it using the Txt2VizMap utility, each part must be provided in a separate file. Here is the list of parts, together with a description of their file formats:
- documents — one big text file where each line represents one document, with the first word on the line being the document’s name.
- documents’ positions — text file containing the documents’ positions. The first line of the file holds the list of x coordinates and the second line holds the list of y coordinates. The coordinates must be given in the same order as the documents in the file above. All positions should be normalised so that 0<=x<=1 and 0<=y<=1.
- landscapes — matrix containing the height points of the landscape. Each line of the text file holds one row of the matrix, with the first line corresponding to the top row. The top left corner of the matrix corresponds to the top left corner of the screen, and the bottom right corner of the matrix to the bottom right corner of the screen.
- keywords — list of keywords that appear on the landscape in Document Atlas. Each line of the text file corresponds to one keyword: the word is the first element on the line, followed by its x and y coordinates.
Note that only the documents file and the positions file are mandatory. If landscapes are not provided, they are calculated using the standard Document Atlas method; the same holds for the keywords.
Several landscapes can be provided, and the user can choose between them inside Document Atlas. The landscape name shown in the program corresponds to its file name. A series of landscapes can be used, for example, to bring a time component into the visualization.
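As an illustration of the positions file format, the following Python sketch writes normalised document positions in the two-line layout described above (the coordinates and the file name are made up for the example):

```python
# Hypothetical 2-D document positions, e.g. from your own projection method.
positions = [(0.12, 0.80), (0.55, 0.43), (0.90, 0.05)]

xs = [p[0] for p in positions]
ys = [p[1] for p in positions]

def normalise(vals):
    """Rescale values into [0, 1], as Document Atlas requires."""
    lo, hi = min(vals), max(vals)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in vals]

with open("positions.txt", "w") as f:
    # First line: all x coordinates; second line: all y coordinates,
    # in the same order as the documents file.
    f.write(" ".join(f"{x:.6f}" for x in normalise(xs)) + "\n")
    f.write(" ".join(f"{y:.6f}" for y in normalise(ys)) + "\n")
```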
Example: the following command line creates a Document Atlas input file from the processed data available on this page:
For the evaluation of the submitted results in the second round, we plan to use four criteria:
- Usability of visualization — The goal is to assess the usability of each visualization in different practical contexts.
- Innovativeness — The goal is to estimate how innovative the ideas used for the visualization are.
- Aesthetics of the image — Here we are aiming to identify the “nicest” images from the challenge.
- General voting by Pascal researchers over the web about “who likes what”.
Since all the criteria are subjective, we will engage experts to judge the quality. Each criterion will generate a separate ranking of the results.