Defying the Curse of Dimensionality: Competitive Seizure Prediction with Kaggle

Gavin Gray

November 28th 2014

What does a Kaggle competition look like?

The view from GitHub

Commits by time.
Punch card graph of hours when we were working.

The view from the data

Graph showing example pre-ictal samples from the raw data (Kaggle Inc, 2014).
Stand-in example spectrogram (Queiroz et al., 2009).
Coloured by class.
Coloured by hour.
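The raw data are just time series per electrode, so a spectrogram like the stand-in figure takes only a few lines to compute. A minimal sketch on a synthetic channel; the sampling rate and window sizes here are illustrative, not the competition's:

```python
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(0)
fs = 400                      # Hz, illustrative sampling rate
t = np.arange(0, 10, 1 / fs)  # 10 seconds -> 4000 samples
# synthetic stand-in channel: an 8 Hz rhythm plus noise
x = np.sin(2 * np.pi * 8 * t) + 0.5 * rng.standard_normal(t.size)

# 256-sample windows with 50% overlap
f, times, Sxx = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
# Sxx has one row per frequency bin and one column per time window
```

Each column of `Sxx` is the power spectrum of one window, which is what the class- and hour-coloured plots above are built from.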

The view from Kaggle

Final leaderboard results (Kaggle Inc, 2014).

What can you work on?

Historical competitions

A sample from 143 completed competitions:

  • Heritage Health Prize
  • Merck Molecular Activity Challenge
  • Observing Dark Worlds
  • The Marinexplore and Cornell University Whale Detection Challenge
  • Africa Soil Property Prediction Challenge
  • CONNECTOMICS (that is the whole name)
  • Many, many corporate competitions...


You're free to use anything to get the job done. We used:

  • Matlab
  • Scikit-learn
  • Git
  • Various other Python packages
  • HDF5 files
  • MongoDB


It's possible to quickly try things out to see if they'll work.


A comprehensive list of features can be found in the repository. Useful extractions were:

  • cln,csp,dwn_feat_pib_ratioBB_:
    • cln - Cleaned
    • csp - Common Spatial Patterns (transformation)
    • dwn - Downsampled
    • pib - Power in band
    • ratioBB - ratio of power to broadband power
  • cln,ica,dwn_feat_mvar-PDC_:
    • ica - Independent Component Analysis (transformation)
    • mvar - coefficients of fitted Multivariate-AutoRegressive model
    • PDC - Partial Directed Coherence for MVAR
  • And approximately 850 other options...
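As an illustration of the pib/ratioBB idea above, power in a band expressed as a fraction of broadband power might be computed roughly like this; a hypothetical reimplementation on a synthetic channel, not our actual feature code:

```python
import numpy as np
from scipy.signal import welch

def band_power_ratio(x, fs, band, broadband=(1, 47)):
    """Fraction of broadband power falling in `band` (illustrative
    reimplementation of the pib/ratioBB feature, band edges assumed)."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), 1024))
    in_band = (f >= band[0]) & (f <= band[1])
    in_broad = (f >= broadband[0]) & (f <= broadband[1])
    # uniform frequency grid, so the bin width cancels in the ratio
    return psd[in_band].sum() / psd[in_broad].sum()

rng = np.random.default_rng(0)
fs = 400
t = np.arange(0, 4, 1 / fs)
# synthetic channel dominated by a 10 Hz rhythm
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)
alpha = band_power_ratio(x, fs, band=(8, 12))
```

Because the signal is dominated by a 10 Hz component, nearly all of the broadband power lands in the 8–12 Hz band, so the ratio comes out close to 1.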

Machine learning

An incomplete list of what we tried:

  • Random Forests
    • Random forest classifiers
    • Totally random tree embedding
    • Extra-tree feature selection
  • Support Vector Machines
    • Various different kernels
  • Logistic Regression
  • Adaboost
  • Platt scaling
  • Univariate feature selection
  • Restricted Boltzmann machine
  • Recursive feature elimination
  • ...
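Two items from the list above, random forests and Platt scaling, combine naturally: scikit-learn's CalibratedClassifierCV wraps a classifier and sigmoid-fits its scores into probabilities, which matters when the metric is ROC AUC on probabilistic outputs. A sketch on synthetic imbalanced data; all parameters are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# imbalanced stand-in data: ~10% positive class, like pre-ictal segments
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# random forest with its scores Platt-scaled via cross-validation
forest = RandomForestClassifier(n_estimators=200, random_state=0)
clf = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
```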

Organising the project

  • Teamwork with git experience
  • TDD
  • Code documentation

Tips and tricks in seizure prediction

Our process

Our data flow chart.

Model averaging
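In its simplest form, model averaging is just the unweighted mean of each model's predicted probabilities. A sketch with two stand-in models on synthetic data, not our actual ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [RandomForestClassifier(n_estimators=100, random_state=1),
          LogisticRegression(max_iter=1000)]

preds = []
for m in models:
    m.fit(X_tr, y_tr)
    preds.append(m.predict_proba(X_te)[:, 1])

# simple unweighted mean of the per-model probabilities
averaged = np.mean(preds, axis=0)
auc = roc_auc_score(y_te, averaged)
```

Weighted averages (e.g. weights tuned on a validation split) are the obvious next step, at the cost of another layer of fitting.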

Genetic algorithms

Michael Hill's method came up on GitHub two days ago (Hill, 2014):

...population size of 30 and runs for 10 generations. The population is initialised with random feature masks consisting of roughly 55% features activated and the other 45% masked away. The fitness function is simply a CV ROC AUC score.
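A toy version of that loop might look like the following. The population and generation counts are shrunk from Hill's 30 and 10 to keep the sketch fast, and the data, classifier, and genetic operators are illustrative, not his:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

def fitness(mask):
    """CV ROC AUC of a classifier trained on the masked feature set."""
    if not mask.any():
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=3, scoring="roc_auc").mean()

# initial population: ~55% of features active, per Hill's description
pop = [rng.random(X.shape[1]) < 0.55 for _ in range(12)]

for generation in range(5):
    pop = sorted(pop, key=fitness, reverse=True)
    parents = pop[:6]                            # keep the fittest half
    children = []
    while len(parents) + len(children) < 12:
        i, j = rng.choice(len(parents), size=2, replace=False)
        cross = rng.random(X.shape[1]) < 0.5     # uniform crossover
        child = np.where(cross, parents[i], parents[j])
        child ^= rng.random(X.shape[1]) < 0.05   # mutation: flip ~5% of bits
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
```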

His model:

The default selected classifier for submission is linear regression.
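Linear regression may look like an odd choice of classifier, but AUC scoring depends only on how the predictions rank, so least-squares scores fit to 0/1 labels can rank perfectly well. A sketch on synthetic data, not Hill's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# plain least squares on the 0/1 labels; AUC only cares about ranking,
# so the scores need not be valid probabilities
reg = LinearRegression().fit(X_tr, y_tr)
scores = reg.predict(X_te)
auc = roc_auc_score(y_te, scores)
```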


Our repository can be found at:

Competitive Data Science



Pros:

  • Get to try new things
  • Learn new skills
  • Working break from your PhD - you get immediate feedback
  • Might discover something useful


Cons:

  • Can quickly absorb time
  • You have to have a good team
  • Models people create are not necessarily useful:
    • Netflix challenge
    • Engineered ensemble models are over-complicated


The next competitions coming up are:

  • BCI Challenge @ NER 2015 - $1,000
  • Helping Santa's Helpers - $20,000
  • Click-Through Rate Prediction - $15,000



Hill M (2014) GitHub: michaelHills/seizure-prediction. Available from:

Kaggle Inc (2014) American epilepsy society seizure prediction challenge. Available from:

Queiroz CM, Gorter JA, Silva FHL da, et al. (2009) Dynamics of evoked local field potentials in the hippocampus of epileptic rats with spontaneous seizures. Journal of Neurophysiology, 101(3), 1588–1597. Available from: (accessed 28 November 2014).