Web Spam Challenge 2008: Feature vectors

As in previous editions, we provide a set of pre-computed features extracted from the contents and links in the collection. These features may be used by the participating teams in addition to any automatic technique they choose to use.

The feature vectors are available for download as comma-separated text files.

Features in Matlab format

You can also download the features in Matlab format. Each feature set is divided into two files: a small one for the hosts with labels for training, and a large one for the unlabeled hosts for computing the predictions.

Features in ARFF format

You can also download the features in ARFF format for Weka. Each feature set is divided into two files: a small one for the hosts with labels for training, and a large one for the unlabeled hosts for computing the predictions.
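As a starting point, here is a minimal Python sketch for inspecting either format. The filenames are hypothetical placeholders rather than the actual names of the distributed files, and scipy's ARFF reader does not handle string attributes, so it may only be suitable for the purely numeric feature sets:

# Minimal sketch for inspecting the feature files in Python.
# The filenames below are hypothetical placeholders; substitute the
# .mat or .arff files you actually downloaded.
from scipy.io import loadmat
from scipy.io.arff import loadarff

# Matlab format: loadmat returns a dict mapping variable names to arrays.
mat = loadmat("features_train.mat")                   # hypothetical filename
print([k for k in mat if not k.startswith("__")])     # inspect the variable names

# ARFF format: loadarff returns (records, metadata).
records, meta = loadarff("features_train.arff")       # hypothetical filename
print(meta.names())                                   # attribute names from the ARFF header
print(len(records), "rows")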

If you have any comments or suggestions about these files, or if you find any problems with them, please contact ludovic [dot] denoyer [at] lip6 [dot] fr

Submitting predictions

The predictions you submit for the challenge can use all of the labels in SET 1 that we are providing as training data. Predictions are expected to be produced by fully automatic systems. We do not allow participants to manually modify the predictions obtained by their systems.

Data format

To submit your predictions, create a comma-separated plain-text file containing one line per host with the hostname, your predicted label (“nonspam” or “spam”), and the probability of the host being spam according to your model. If your model does not produce such probabilities, use 0.0 for nonspam and 1.0 for spam.


#hostname,prediction,probability_spam
alex.crystalmark.co.uk,spam,0.9000
alexchamberlin.mysite.wanadoo-members.co.uk,nonspam,0.1750
alexwy.20six.co.uk,nonspam,0.0520
alfieplastering.mysite.wanadoo-members.co.uk,spam,0.7890
...

Note that you do not need to provide predictions for the labeled elements of the dataset. The evaluation of the participating entries will be based on labels assigned to (a subset of) the elements that are currently unlabeled.
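For instance, a minimal Python sketch of writing such a file; the dictionary of predicted probabilities and the output filename are hypothetical, standing in for whatever your own system produces:

# Minimal sketch for writing predictions in the required format.
# `spam_probability` is a hypothetical mapping: hostname -> P(spam) from your model.
def write_predictions(spam_probability, path="team_predictions.txt", threshold=0.5):
    with open(path, "w") as out:
        out.write("#hostname,prediction,probability_spam\n")
        for host in sorted(spam_probability):
            p = spam_probability[host]
            label = "spam" if p >= threshold else "nonspam"
            out.write("%s,%s,%.4f\n" % (host, label, p))

write_predictions({"alex.crystalmark.co.uk": 0.9})    # toy example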

Submitting your predictions

It is good to include standard validation results in the abstract accompanying your submission (e.g., the accuracy obtained by holding out part of the data as a test set and/or by cross-validation), but remember that the predictions you submit can (and perhaps should) be obtained using all of the labels you have as the training set.

Something similar holds for any TrustRank/BadRank/neighborhood-based feature you want to use. For validation, some of the labels can be held out while generating such a feature, but for the predictions you submit for the challenge, you can use all of the labels you have when generating the feature.
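As an illustration, a sketch of that protocol in Python; propagate_feature is a hypothetical stand-in for whatever TrustRank/BadRank-style computation you actually run over the host graph:

# Sketch of the hold-out protocol for label-based features.
import random

def propagate_feature(visible_labels):
    # Hypothetical placeholder for a TrustRank/BadRank/neighborhood
    # computation over the host graph; here it just echoes its input.
    return dict(visible_labels)

def validation_feature(labels, holdout_fraction=0.2, seed=0):
    # Hold out part of the labels, compute the feature from the rest,
    # and measure validation performance on the held-out hosts only.
    hosts = sorted(labels)
    held_out = set(random.Random(seed).sample(hosts, int(holdout_fraction * len(hosts))))
    visible = {h: y for h, y in labels.items() if h not in held_out}
    return propagate_feature(visible), held_out

def submission_feature(labels):
    # For the final submitted predictions, all available labels may be used.
    return propagate_feature(labels)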

A maximum of two sets of predictions per participant team will be allowed. If you are submitting two sets of predictions, please submit each of them as a separate entry.

Submissions of predictions to the Web Spam Challenge must be accompanied by a 1-page PDF abstract (template) containing a high-level description of the system used for generating such predictions and the data sources used by your system.

Submit your predictions using EasyChair, including the 1-page PDF abstract (template) and attaching a .txt file with the predictions in the format described above.

Evaluation Metrics

Test set and ground truth

After data collection for the Web Spam Challenge, the labeled data was randomly split into two sets: two thirds for training and one third for testing. The training set was released along with labels, content, links, and some pre-computed feature vectors. Each test sample was evaluated by one or more judges. These judgments will be used to compute a spamicity score for each host by taking the average of the assessments, using the following weights:

  • NONSPAM counts as 0.0
  • BORDERLINE counts as 0.5
  • SPAM counts as 1.0

Judgments labeled as CAN’T CLASSIFY will be dropped from the spamicity score calculation. Ground truth will be produced by marking samples with a spamicity score > 0.5 as SPAM and those with a score < 0.5 as NONSPAM. Samples with no judgments, or with a spamicity score exactly equal to 0.5, will not be considered in the test set.
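For concreteness, a minimal Python sketch of this computation (the judgment strings are assumed to match the labels listed above):

# Minimal sketch of the spamicity score and ground-truth label for one host.
WEIGHTS = {"NONSPAM": 0.0, "BORDERLINE": 0.5, "SPAM": 1.0}

def ground_truth(judgments):
    # CAN'T CLASSIFY judgments carry no weight and are dropped.
    usable = [WEIGHTS[j] for j in judgments if j in WEIGHTS]
    if not usable:
        return None                     # no judgments: excluded from the test set
    spamicity = sum(usable) / len(usable)
    if spamicity > 0.5:
        return "SPAM"
    if spamicity < 0.5:
        return "NONSPAM"
    return None                         # exactly 0.5: excluded from the test set

print(ground_truth(["SPAM", "BORDERLINE", "CAN'T CLASSIFY"]))   # SPAM (spamicity 0.75)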

The test labels are here: http://chato.cl/webspam/datasets/uk2007/labels/webspam-uk2007-testlabels.tar.gz

Evaluation of the predicted spamicity

Submitted predictions are triples: hostname, prediction, probability_spam (see the format above). The probability_spam field is a real number corresponding to the predicted spamicity as defined above. We will be using the Area Under the ROC Curve (AUC) as the evaluation metric; it measures how well the predicted spamicity separates spam hosts from nonspam hosts. An easy way of calculating this (and also of obtaining a precision-recall curve) is to use the perf program, e.g.:

% cat team_predicted_spamicity.txt \
  | sed 's/nonspam/0/g' | sed 's/spam/1/g' \
  | grep -v '^#' | awk -F, '{print $2,$3}' | perf -PRF -AUC -plot pr


Example output (the precision-recall plot points, followed by the F-measure at the 0.5 threshold and the AUC):

0.3333 1.0000
1.0000 0.7500
1.0000 0.6000
1.0000 0.5000

PRF    0.85714   pred_thresh  0.500000
ROC    0.88889
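If you prefer not to depend on perf, the same AUC can be computed in Python. A minimal sketch using scikit-learn, assuming you have already joined the ground-truth labels with your predicted probabilities by hostname (the numbers below are illustrative):

# Minimal AUC sketch; y_true and y_score are illustrative parallel lists
# obtained by joining the ground-truth labels with the prediction file on hostname.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 0, 1]                    # 1 = spam, 0 = nonspam (ground truth)
y_score = [0.900, 0.175, 0.052, 0.789]   # probability_spam from the submission

print("AUC = %.5f" % roc_auc_score(y_true, y_score))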

Ranking and tie breaking

Using the predicted spamicity scores, entries will be sorted in decreasing order of AUC, and the team with the highest AUC will be ranked first. If two consecutively ranked submissions differ by less than one percentage point (0.01) in their AUC scores, a tie will be declared for that rank.

If the first two ranks produce a tie, it will be resolved as follows: the test set will be randomly partitioned into five disjoint subsets of 20% each, AUC will be computed on each subset, and the submission with the lower AUC variance across the subsets will be declared the winner.
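A sketch of that tie-breaking computation, under the assumption that each random 20% subset contains both spam and nonspam hosts (otherwise the per-subset AUC is undefined):

# Sketch of the tie-breaking rule: variance of AUC over five disjoint
# random partitions of the test set.
import random
from statistics import pvariance
from sklearn.metrics import roc_auc_score

def auc_variance(y_true, y_score, folds=5, seed=0):
    idx = list(range(len(y_true)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]   # five disjoint ~20% subsets
    aucs = [roc_auc_score([y_true[i] for i in p], [y_score[i] for i in p])
            for p in parts]
    return pvariance(aucs)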