- Conditions of participation: Anybody who complies with the rules of the challenge is welcome to participate. There are two modes of participation:
  - As a data donor by making an entry in the Repository.
  - As a problem solver by submitting results on at least one of the proposed tasks.
- Dissemination: The challenge results will be presented at a NIPS 2008 conference workshop, December 12, 2008. To present at the workshop, abstracts must be submitted before October 24, 2008 to firstname.lastname@example.org. Participants are not required to attend the workshop and the workshop is open to non-challenge participants. The proceedings of the competition will be published by the Journal of Machine Learning Research (JMLR).
- Anonymity: Participants who do not submit a paper to the workshop can elect to remain anonymous. Their results will be published, but their names will remain confidential.
- Tasks: A number of datasets on which tasks have been defined are available; see the Task page. More tasks will be added from time to time if new data are donated. Donated data show up immediately in the Repository, but they become part of the challenge only after being reviewed by the organizers, who then add them to the Task page. To be informed of task updates, ask to be added to our mailing list by sending an email to email@example.com.
- Milestone and final results: Results must be submitted between the start and the termination of the challenge. The challenge starts on September 15, 2008 and is scheduled to terminate on November 19, 2008 (previously November 12). Each participating team is allowed to submit one set of results per task for the final evaluation. If more than one set of results is submitted, the last one will be taken into account. The results of the final evaluation will be publicly released at the workshop. In addition, each participating team may optionally submit one set of results before October 15, 2008 to be part of the milestone evaluation, whose results will be released publicly but anonymously.
- Submission method: The results on each task must be sent to the designated contact person; see the Task page. In case of problems, send an email to firstname.lastname@example.org.
- Evaluation and rewards: To compete for the prizes, participants must submit a 6-page paper describing their donated dataset (if they entered as a donor) or their task-solving method(s) and result(s) (if they entered as a solver), before November 21, 2008, to email@example.com (a sample paper and a LaTeX style file are provided). Challenge participants must append their fact sheet to their paper; see the template provided in LaTeX (sample paper appendix), MS Word, or Acrobat formats. Each participant may submit several papers if they address or propose distinct problems. The contributions of the participants will be evaluated by the organizers on the basis of their challenge performance results, the post-challenge tests (see reproducibility), AND the paper, using the following criteria: Performance in challenge and general Usefulness, Novelty and Originality, Sanity, Insight, Reproducibility, and Clarity of presentation. Data donors may provide solutions to their own problems; however, such contributions will not count towards winning the prizes. Close collaborators of data donors who have access to information that may give them an unfair advantage should disclose this fact to the organizers. The best papers will be selected for presentation at the workshop and several Prizes will be awarded.
- Reproducibility: Participation is not conditional on delivering your code or publishing your methods. However, we will ask the top-ranking participants to cooperate voluntarily in reproducing their results. This will include filling out a fact sheet about their methods (get the template in LaTeX, MS Word, or Acrobat formats) and possibly participating in post-challenge tests and sending us their code, including the source code. The outcome of our attempt to reproduce your results will be published and will add credibility to your results.
The Pot-luck challenge datasets are a selection of the Repository datasets. Presently, we propose the following tasks:
- CYTO: Causal Protein-Signaling Networks in human T cells. The task is to learn a protein signaling network from multicolor flow cytometry data, recording the molecular activity of 11 proteins. There are on average 800 samples per experimental condition, corresponding to various perturbations of the system (manipulations). The authors used a Bayesian network approach and demonstrated that they recover most of the known signaling network structure, while discovering some new hypothetical regulations (causal relationships). The tasks suggested to the challenge participants include reproducing the results of the paper and finding a method to assess the confidence of the causal relationships uncovered. The evaluation is via submitted papers.
- LOCANET: LOcal CAusal NETwork. We group under the name LOCANET a number of tasks that consist of finding the local causal structure around a given target variable (depth 3 network). The following datasets lend themselves to performing such a task: REGED, CINA, SIDO, MARTI. The results are evaluated by the organizers upon submission of the local structure in a designated format. In addition, the toy dataset LUCAS can be used for self-evaluation.
- PROMO: Simple causal effects in time series. The task is to identify which promotions affect sales. This is an artificial dataset of about 1000 promotion variables and 100 product sales. The goal is to predict a 1000×100 boolean influence matrix, indicating for each (i,j) element whether the ith promotion has a causal influence on the sales of the jth product. Data is provided as time series, with a daily value for each variable for three years (i.e., 1095 days). The ground truth for the influence matrix is provided, so the participants can self-evaluate their results and submit a paper to compete for the prizes.
- SIGNET: Abscisic Acid Signaling Network. The objective is to determine the set of 43 boolean rules that describe the interactions of the nodes within a plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations of the true rules, using an asynchronous update scheme. This is an artificial dataset inspired by a real biological system. The results are evaluated by the contact person upon submission of the results in a designated format.
- TIED: Target Information Equivalent Dataset. This is an artificial simulated dataset constructed to illustrate that there may be many minimal sets of features with optimal predictivity (i.e., Markov boundaries) and likewise many sets of features that are statistically indistinguishable from the set of direct causes and direct effects of the target. The tasks suggested include determining all statistically indistinguishable sets of direct causes and effects, or Markov boundaries of the target variable, and predicting the target variable on test data. The results are evaluated by the contact person upon submission of the results in a designated format.
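Because the PROMO ground truth is provided, self-evaluation can be as simple as counting agreements between a predicted influence matrix and the true one. The sketch below shows one possible way to do this in Python. The function name `score_influence_matrix`, the nested-list representation of the matrices, and the choice of precision, recall, and F1 as scores are assumptions made for illustration; they are not an evaluation procedure prescribed by the organizers.

```python
def score_influence_matrix(predicted, truth):
    """Compare two boolean matrices of equal shape.

    Rows stand for promotions and columns for products, so entry (i, j)
    says whether promotion i is claimed to influence sales of product j.
    The metrics returned here are an illustrative choice, not an official one.
    """
    if len(predicted) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, truth)
    ):
        raise ValueError("predicted and truth must have the same shape")

    tp = fp = fn = tn = 0
    for p_row, t_row in zip(predicted, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1      # influence correctly detected
            elif p:
                fp += 1      # influence claimed but absent
            elif t:
                fn += 1      # true influence missed
            else:
                tn += 1      # absence correctly reported

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Tiny hypothetical 2x2 example; the real matrix would be 1000x100.
    predicted = [[True, False], [True, True]]
    truth = [[True, True], [False, True]]
    print(score_influence_matrix(predicted, truth))
```

Reporting precision and recall separately can be more informative than plain accuracy when only a small fraction of promotion-product pairs are truly causal.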
Note that the participants are ultimately judged on their paper(s). For each dataset, the tasks proposed are only suggestions. The participants are invited to use the data in a creative way and propose their own task(s).
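For readers unfamiliar with the kind of data SIGNET contains, the following Python sketch illustrates what a boolean simulation with an asynchronous update scheme can look like. Everything in it is hypothetical: the three-node network, its rules, and the particular scheme (each node updated once per round, in a random order) are assumptions chosen for illustration and are not taken from the SIGNET dataset or its 43 rules.

```python
import random

# Hypothetical three-node network; these rules are illustrative only and
# are NOT the rules of the SIGNET dataset.
RULES = {
    "A": lambda s: s["A"],                  # input node keeps its value
    "B": lambda s: s["A"] and not s["C"],   # activated by A, inhibited by C
    "C": lambda s: s["B"],                  # follows B
}


def simulate_async(rules, initial_state, n_rounds, seed=0):
    """Simulate a boolean network asynchronously.

    In each round every node is updated exactly once, in a random order,
    and each update sees the values already changed earlier in the round.
    Returns the list of states, starting with the initial state.
    """
    rng = random.Random(seed)
    state = dict(initial_state)
    trajectory = [dict(state)]
    for _ in range(n_rounds):
        order = list(rules)
        rng.shuffle(order)
        for node in order:
            state[node] = bool(rules[node](state))
        trajectory.append(dict(state))
    return trajectory


if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    for step, snapshot in enumerate(simulate_async(RULES, start, n_rounds=4)):
        print(step, snapshot)
```

Running the simulation with different seeds gives different trajectories from the same rules, which is why a dataset of this kind contains many separate simulations.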
New: tasks proposed by participants
October 30: The following tasks proposed by participants are now included in the challenge.
- CauseEffectPairs: Distinguishing between cause and effect. The data set consists of 8 N×2 matrices, each representing a cause-effect pair, and the task is to identify which variable is the cause and which one is the effect. The origin of the data is hidden from the participants but known to the organizers. The data sets are chosen such that we expect common agreement on which one is the cause and which one is the effect. Even though part of the statistical dependences may also be due to hidden common causes, common sense tells us that there is a significant cause-effect relation.
- STEMMATOLOGY: Computer-assisted stemmatology. Stemmatology (a.k.a. stemmatics) studies relations among different variants of a document that have been gradually built from an original text by copying and modifying earlier versions. The aim of such study is to reconstruct the family tree (causal graph) of the variants. We provide a dataset to evaluate methods for computer-assisted stemmatology. The ground truth is provided, as are evaluation criteria to allow the ranking of the results of different methods. We hope this will facilitate the development of novel approaches, including but not restricted to hierarchical clustering, graphical modeling, link analysis, phylogenetics, string-matching, etc.
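As a starting point for the string-matching and hierarchical-clustering approaches mentioned for STEMMATOLOGY, one might first compute pairwise distances between the text variants. The Python sketch below does this with a word-level Levenshtein (edit) distance. The variant names and texts are invented for illustration, and the choice of word-level edit distance is an assumption; it is not the evaluation criterion supplied with the dataset.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def distance_matrix(variants):
    """Return sorted variant names and their pairwise word-level distances."""
    names = sorted(variants)
    tokens = {name: variants[name].split() for name in names}
    matrix = [[edit_distance(tokens[row], tokens[col]) for col in names]
              for row in names]
    return names, matrix


if __name__ == "__main__":
    # Hypothetical variants of a short text, not taken from the dataset.
    variants = {
        "X": "the quick brown fox jumps",
        "Y": "the quick brown fox leaps",
        "Z": "a quick fox leaps",
    }
    names, matrix = distance_matrix(variants)
    for name, row in zip(names, matrix):
        print(name, row)
```

A distance matrix like this can then be fed to a clustering or tree-building method of the participant's choice; variants separated by few edits are candidates for being close in the family tree.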
Another way to participate in the challenge is to donate data. You will first need to Register. Then you will need to fill out a form to Deposit your data. Our repository does not actually store data; it points to a web page YOUR_DATA.html, which you maintain, and from which your data is accessible. If you do not have a web server allowing you to maintain a web page for your data, you may use the UCI Machine Learning Repository, which physically archives data, or contact us at firstname.lastname@example.org.
Your entry can be edited after submission, but we recommend that you prepare your submission in a text file before filling out the form.
Tips to fill out your submission form:
- Contact name/URL: Select a person who will be available to answer questions and evaluate results.
- Resource type: Choose “data” (alternatively, you can submit a generative model of data, in which case choose “model”).
- Resource name: A short easy-to-memorize acronym.
- Resource URL: This is the web page you maintain (YOUR_DATA.html).
- Title: A title describing your dataset.
- Authors: A comma-separated list of authors.
- Key facts: Data dimensions (number of variables, number of entries), variable types, missing data, etc.
- Keywords: A comma-separated list of keywords.
- Abstract: A brief description of your dataset and the task to be solved, including:
  - Data description.
  - Task description.
  - Result format.
  - Result submission method.
  - Evaluation metrics.
  - Suggestion of other tasks.
If you provide a web page YOUR_DATA.html, more details can be given there.
- Supplements 1, 2, and 3: Use these fields to provide a direct pointer to a zip archive to download data, a published paper or a report on the data, some slides, etc.