Part 1: A Gentle Start

Note

This tutorial part is also available for download as an IPython notebook: [ipynb]

The purpose of this first tutorial part is to make you familiar with a few basic properties and building blocks of PyMVPA. Let’s have a slow start and compute a cross-validation analysis.

Virtually every Python script starts with some import statements that load functionality provided elsewhere. We start by importing some little helpers (including all of PyMVPA) that we are going to use throughout this tutorial, and whose purpose we will see shortly.

>>> from mvpa2.tutorial_suite import *

Getting the data

As a first step, we will load the fMRI dataset of the first subject from the classic study by Haxby et al. (2001). For the sake of simplicity we are using a helper function that loads and pre-processes the data in a way similar to what the original authors did. Later on we will get back to this point and look at what was done in greater detail, but for now it is as simple as:

>>> ds = get_haxby2001_data()

What we get as ds is a PyMVPA dataset that contains the fMRI data, plus a lot of additional information which we will investigate later on. In the original study the authors split the dataset in half (into odd and even runs), and computed a pattern of activation for each stimulus category in each half. Hence, the dataset consists of 16 patterns, which are called samples in PyMVPA (one for each of the eight categories in each half of the experiment). The number of samples in a dataset is equivalent to its length, and can be queried by:

>>> print len(ds)
16

Most datasets in PyMVPA are represented as a two-dimensional array, where the first axis is the samples axis, and the second axis represents the features of the samples. In the Haxby study the authors used a region of interest (ROI) in the ventral temporal cortex. For subject 1 this ROI comprises 577 voxels. Since the analysis was done on the voxel activation patterns, those voxels are the actual features of this dataset, and hence we have 577 of them.

>>> print ds.nfeatures
577

We can also access this information via the shape property of the dataset:

>>> print ds.shape
(16, 577)
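
The underlying samples-by-features array itself is accessible through the dataset's samples attribute. As a quick sanity check (a small sketch; its shape simply mirrors the numbers above):

>>> print ds.samples.shape
(16, 577)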

Besides the data itself, the most important ingredients for a classification analysis are the so-called labels or targets assigned to the samples: they define the model that a classifier should learn, and serve as target values for assessing prediction accuracy. The dataset stores these targets in its collection of sample attributes (hence the collection name sa), and they can be accessed by their attribute name, either through the collection or via a shortcut.

>>> print ds.sa.targets
['bottle' 'cat' 'chair' 'face' 'house' 'scissors' 'scrambledpix' 'shoe'
 'bottle' 'cat' 'chair' 'face' 'house' 'scissors' 'scrambledpix' 'shoe']
>>> print ds.targets
['bottle' 'cat' 'chair' 'face' 'house' 'scissors' 'scrambledpix' 'shoe'
 'bottle' 'cat' 'chair' 'face' 'house' 'scissors' 'scrambledpix' 'shoe']

As can be seen, PyMVPA can handle literal labels, so there is no need to recode them into numerical values.
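
For a quick overview of the distinct labels, the targets can be inspected directly with NumPy – a minimal sketch using only what we have seen above:

>>> print np.unique(ds.sa.targets)
['bottle' 'cat' 'chair' 'face' 'house' 'scissors' 'scrambledpix' 'shoe']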

Exercise

Besides the collection of sample attributes sa, each dataset has two additional collections: fa for feature attributes and a for general dataset attributes. All these collections are actually instances of a Python dict. Investigate what additional attributes are stored in this particular dataset.
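
If you need a starting point for this exercise, one possible sketch is to list the attribute names stored in each collection (output omitted here, since the exact names depend on how the dataset was created):

>>> print ds.sa.keys()
>>> print ds.fa.keys()
>>> print ds.a.keys()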

Dealing With A Classifier

All that we are missing for a first attempt at a classification analysis of this dataset is a classifier. This time we will not use a magic function to help us, but will create the classifier ourselves. The original study employed a so-called 1-nearest-neighbor classifier, using correlation as a distance measure. In PyMVPA this type of classifier is provided by the kNN class, which makes it possible to specify the desired parameters.

>>> clf = kNN(k=1, dfx=one_minus_correlation, voting='majority')

A k-Nearest-Neighbor classifier performs classification based on the similarity of a sample to each sample in a training dataset. The value of k specifies the number of nearest neighbors used to derive a prediction, dfx sets the distance measure that determines the neighbors, and voting selects a strategy for choosing a single label from the set of targets assigned to these neighbors.

Exercise

Access the built-in help to inspect the kNN class regarding additional configuration options.
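
A possible starting point is Python's built-in help system (in an IPython session, kNN? works as well):

>>> help(kNN)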

Now that we have a classifier instance, it can easily be trained by passing the dataset to its train() method.

>>> clf.train(ds)

A trained classifier can subsequently be used to predict labels for unlabeled samples. Its performance can be assessed by comparing these predictions to the target labels.

>>> predictions = clf.predict(ds.samples)
>>> np.mean(predictions == ds.sa.targets)
1.0

We see that the classifier performs remarkably well on our dataset – it does not make even a single prediction error. However, most of the time we are not particularly interested in the prediction accuracy of a classifier on the same dataset it was trained on.

Exercise

Think about why this particular classifier will always perform error-free classification of the training data – regardless of the actual dataset content. If the reason is not immediately obvious, take a look at chapter 13.3 in The Elements of Statistical Learning. Investigate how the accuracy varies with different values of k. Why is that?
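
For the second part of this exercise, one possible sketch is to loop over a few values of k, train a fresh classifier on the full dataset, and recompute the training-set accuracy as before (the temporary name clf_k is ours; output omitted, since it depends on the data):

>>> for k in [1, 3, 5, 9]:
...     clf_k = kNN(k=k, dfx=one_minus_correlation, voting='majority')
...     clf_k.train(ds)
...     print k, np.mean(clf_k.predict(ds.samples) == ds.sa.targets)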

Instead, we are interested in how well the classifier generalizes to new, unseen data, so that we could, in principle, use it to label unlabeled data. Because we only have a single dataset, it needs to be split into (at least) two parts to achieve this. In the original study, Haxby and colleagues split the dataset into patterns of activation from odd versus even-numbered runs. Our dataset has this information in the runtype sample attribute:

>>> print ds.sa.runtype
['even' 'even' 'even' 'even' 'even' 'even' 'even' 'even' 'odd' 'odd' 'odd'
 'odd' 'odd' 'odd' 'odd' 'odd']

Using this attribute we can now easily split the dataset in two. PyMVPA datasets can be sliced in much the same way as NumPy's ndarray. The following calls select the subset of samples (i.e. rows in the dataset) where the value of the runtype attribute is either the string 'even' or 'odd'.

>>> ds_split1 = ds[ds.sa.runtype == 'odd']
>>> len(ds_split1)
8
>>> ds_split2 = ds[ds.sa.runtype == 'even']
>>> len(ds_split2)
8

Now we could repeat the steps above: call train() with one dataset half and predict() with the other, and compute the prediction accuracy manually. However, a more convenient way is to let the classifier do this for us. Many objects in PyMVPA support a post-processing step that can compute something from the actual results. The example below computes the mismatch error between the classifier predictions and the target values stored in our dataset. To make this work, we no longer call the classifier's predict() method, but "call" the classifier directly with the test dataset. This is a very common usage pattern in PyMVPA that we shall see a lot over the course of this tutorial. Again, please note that we are computing an error now, hence lower values represent more accurate classification.

>>> clf.set_postproc(BinaryFxNode(mean_mismatch_error, 'targets'))
>>> clf.train(ds_split2)
>>> err = clf(ds_split1)
>>> print np.asscalar(err)
0.125

In this case, our choice of which half of the dataset is used for training and which half for testing was completely arbitrary, hence we also estimate the transfer error after swapping the roles:

>>> clf.train(ds_split1)
>>> err = clf(ds_split2)
>>> print np.asscalar(err)
0.0

We see that on average the classifier error is really low, and we achieve an accuracy level comparable to the results reported in the original study.
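
To make the "on average" statement explicit, we can simply average the two transfer errors obtained above:

>>> print np.mean([0.125, 0.0])
0.0625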

Cross-validation

What we have just done manually was to split the dataset into combinations of training and testing datasets based on a specific sample attribute – in this case, the information whether a pattern of activation (i.e. a sample) came from even or odd runs. We ran the classification analysis on each split to estimate the performance of the classifier model. In general, this approach is called cross-validation: it involves splitting the dataset into multiple pairs of subsets, choosing sample groups by some criterion, and estimating classifier performance by training on the first dataset in a split and testing against the second dataset from the same split.
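
Written out as code, this manual cross-validation boils down to a small loop over the two train/test assignments used above – a sketch reusing the dataset halves and the classifier with its error post-processing still attached (the errors list and the ordering of the pairs are ours):

>>> errors = []
>>> for train, test in ((ds_split2, ds_split1), (ds_split1, ds_split2)):
...     clf.train(train)
...     errors.append(np.asscalar(clf(test)))
>>> print np.mean(errors)
0.0625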

PyMVPA provides a way to run complete cross-validation procedures fully automatically, without the need for manual splitting of a dataset. Using the CrossValidation class, a cross-validation is set up by specifying what measure should be computed on each dataset split and how dataset splits shall be generated. The measure that is usually computed is the transfer error that we already looked at in the previous section. The second element, a generator for datasets, is another very common tool in PyMVPA. The following example uses HalfPartitioner, a generator that, when called with a dataset, marks all samples with respect to their association with the first or second half of the dataset. This happens based on the values of a specified sample attribute – in this case runtype – much like the manual dataset splitting we performed earlier. HalfPartitioner makes sure to subsequently assign samples to both halves, i.e. samples from the first half in the first generated dataset will be in the second half of the second generated dataset. With these two techniques we can easily replicate our manual cross-validation – reusing our existing classifier, but without the custom post-processing step.

>>> # disable post-processing again
>>> clf.set_postproc(None)
>>> # dataset generator
>>> hpart = HalfPartitioner(attr='runtype')
>>> # complete cross-validation facility
>>> cv = CrossValidation(clf, hpart)

Exercise

Try calling the hpart object with our dataset. What happens? Now try passing the dataset to its generate() method. What happens now? Make yourself familiar with the concept of a Python generator. Investigate what the code snippet list(xrange(5)) does, and try to adapt it to the HalfPartitioner.
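
As a hint for the last part of the exercise, the list(xrange(5)) idea transfers directly to the partitioner's generate() method – a minimal sketch:

>>> splits = list(hpart.generate(ds))
>>> print len(splits)
2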

Once the cv object is created, it can be called with a dataset, just like we did with the classifier before. It will internally perform all dataset partitioning, split each generated dataset into training and testing sets (based on the partitions), and train and test the classifier repeatedly. Finally it will return the results of all cross-validation folds.

>>> cv_results = cv(ds)
>>> np.mean(cv_results)
0.0625

Actually, the cross-validation results are returned as another dataset that has one sample per fold and a single feature holding the computed transfer error.

>>> len(cv_results)
2
>>> cv_results.samples
array([[ 0.   ],
       [ 0.125]])

This could be the end of a very simple introduction to cross-validation with PyMVPA. However, since we cheated a bit at the beginning, we still don't know how to import data other than the single subject from the Haxby study. This is the topic of the next chapter.

References

Haxby et al. (2001)
Classic MVPA study. Its subject 1 serves as the example dataset in this tutorial part.
Hastie et al. (2009)
Comprehensive reference of statistical learning methods.
