This content refers to the previous stable release of PyMVPA.
Please visit www.pymvpa.org for the most recent version of PyMVPA and its documentation.
datasets.splitters
Module: datasets.splitters
Collection of dataset splitters.
Module Description
Splitters split the provided dataset in various ways to
simplify cross-validation analysis, implement boosting of the
estimates, or sample the null space via permutation testing.
Most of the splitters currently split two ways. By convention,
CrossValidatedTransferError and SplitClassifier use the first part
for training and the second part for testing.
Brief Description of Available Splitters
- NoneSplitter - just return full dataset as the desired part (training/testing)
- OddEvenSplitter - 2 splits: (odd samples,even samples) and (even, odd)
- HalfSplitter - 2 splits: (first half, second half) and (second, first)
- NFoldSplitter - splits for N-Fold cross validation.
Module Organization
Classes
-
class mvpa.datasets.splitters.CustomSplitter(splitrule, **kwargs)
Bases: mvpa.datasets.splitters.Splitter
Split a dataset using an arbitrary custom rule.
The splitter is configured by passing a custom splitting rule (splitrule)
to its constructor. Such a rule is basically a sequence of split
definitions. Every single element in this sequence results in exactly one
split generated by the Splitter. Each element is another sequence of
sequences of sample ids for each dataset that shall be generated in the
split.
Example:
Generate two splits. In the first split the second dataset
contains all samples with sample attributes corresponding to
either 0, 1 or 2. The first dataset of the first split contains
all samples which are not split into the second dataset.
The second split yields three datasets. The first contains all samples
corresponding to sample attributes 1 and 2, the second dataset
contains only samples with attribute 3 and the last dataset
contains the samples with attributes 5 and 6.
CustomSplitter([(None, [0, 1, 2]), ([1,2], [3], [5, 6])])
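One way to read such a rule is sketched below in plain Python, without PyMVPA. The helper `apply_splitrule` is hypothetical, and its handling of `None` (all samples not claimed by the other entries of the same split) is an assumption based on the description above. It returns sample indices rather than Dataset objects.

```python
def apply_splitrule(attr, splitrule):
    """Yield, for every split definition, one list of sample indices per dataset.

    attr      -- sequence with one attribute value per sample
    splitrule -- sequence of split definitions, as passed to CustomSplitter
    (hypothetical helper, not part of PyMVPA)
    """
    for split in splitrule:
        # Attribute values explicitly claimed by any dataset of this split.
        claimed = set(v for spec in split if spec is not None for v in spec)
        datasets = []
        for spec in split:
            if spec is None:
                # Assumption: None collects everything not claimed elsewhere.
                ids = [i for i, a in enumerate(attr) if a not in claimed]
            else:
                ids = [i for i, a in enumerate(attr) if a in spec]
            datasets.append(ids)
        yield datasets


attr = [0, 1, 2, 3, 4, 5, 6]  # one attribute value per sample
rule = [(None, [0, 1, 2]), ([1, 2], [3], [5, 6])]
for split in apply_splitrule(attr, rule):
    print(split)
```

Because each sample's attribute value equals its index in this toy input, the printed index lists can be compared directly with the attribute values named in the example.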
See also
Please refer to the documentation of the base class for more information:
Splitter
Cheap init.
Parameters: |
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
-
class mvpa.datasets.splitters.HalfSplitter(**kwargs)
Bases: mvpa.datasets.splitters.Splitter
Split a dataset into two halves of the sample attribute.
The splitter yields two splits: first (1st half, 2nd half) and second
(2nd half, 1st half).
See also
Please refer to the documentation of the base class for more information:
Splitter
Cheap init.
Parameters: |
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
-
class mvpa.datasets.splitters.NFoldSplitter(cvtype=1, **kwargs)
Bases: mvpa.datasets.splitters.Splitter
Generic N-fold data splitter.
Provides N-fold splitting. Given a dataset with N chunks and
cvtype=1 (the default), it generates N splits, where
each chunk in turn is taken out (with replacement) for
cross-validation. For example, if there are 4 chunks, the splits for
cvtype=1 are:
[[1, 2, 3], [0]]
[[0, 2, 3], [1]]
[[0, 1, 3], [2]]
[[0, 1, 2], [3]]
If cvtype>1, then all possible combinations of cvtype number of
chunks are taken out for testing, so for cvtype=2 in previous
example:
[[2, 3], [0, 1]]
[[1, 3], [0, 2]]
[[1, 2], [0, 3]]
[[0, 3], [1, 2]]
[[0, 2], [1, 3]]
[[0, 1], [2, 3]]
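The two listings above can be reproduced with a short plain-Python sketch built on `itertools.combinations`. The function name `nfold_chunk_splits` is hypothetical and this is not the library's own code. It works on chunk ids only and assumes the chunks are considered in sorted order.

```python
from itertools import combinations


def nfold_chunk_splits(chunks, cvtype=1):
    """Yield [training_chunks, testing_chunks] for N-(cvtype) cross-validation.

    Hypothetical helper for illustration; operates on chunk ids, not Datasets.
    """
    unique = sorted(set(chunks))
    # Every combination of `cvtype` chunks is taken out for testing once.
    for test in combinations(unique, cvtype):
        train = [c for c in unique if c not in test]
        yield [train, list(test)]


for split in nfold_chunk_splits([0, 1, 2, 3], cvtype=2):
    print(split)
```

With 4 chunks this gives 4 splits for cvtype=1 and 6 splits for cvtype=2, i.e. the number of ways to choose cvtype chunks out of N.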
See also
Please refer to the documentation of the base class for more information:
Splitter
Initialize the N-fold splitter.
Parameters: |
- cvtype (int) – Type of cross-validation: N-(cvtype)
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
-
class mvpa.datasets.splitters.NGroupSplitter(ngroups=4, **kwargs)
Bases: mvpa.datasets.splitters.Splitter
Split a dataset into N-groups of the sample attribute.
For example, NGroupSplitter(2) is the same as the HalfSplitter and
yields two splits: first (1st half, 2nd half) and second (2nd half,
1st half).
See also
Please refer to the documentation of the base class for more information:
Splitter
Initialize the N-group splitter.
Parameters: |
- ngroups (int) – Number of groups to split the attribute into.
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
-
class mvpa.datasets.splitters.NoneSplitter(mode='second', **kwargs)
Bases: mvpa.datasets.splitters.Splitter
This is a dataset splitter that does not split. It simply returns
the full dataset that it is called with.
By default (mode='second') the passed dataset is returned as the second
element of the 2-tuple and the first element of that tuple is None.
See also
Please refer to the documentation of the base class for more information:
Splitter
Cheap init – nothing special
Parameters: |
- mode – Either ‘first’ or ‘second’ (default) – which output dataset
would actually contain the samples
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
-
class mvpa.datasets.splitters.OddEvenSplitter(usevalues=False, **kwargs)
Bases: mvpa.datasets.splitters.Splitter
Split a dataset into odd and even values of the sample attribute.
The splitter yields two splits: first (odd, even) and second (even, odd).
See also
Please refer to the documentation of the base class for more information:
Splitter
Cheap init.
Parameters: |
- usevalues (bool) – If True the values of the attribute used for splitting will be
used to determine odd and even samples. If False odd and even
chunks are defined by the order of attribute values, i.e. the first
unique attribute is odd and the second is even, even though the
corresponding values might indicate the opposite (e.g. in case
of [2,3]).
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
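The effect of usevalues on the [2,3] case can be illustrated with a small plain-Python sketch. The helper `odd_even_groups` is hypothetical and not part of PyMVPA. It assumes that "order of attribute values" means the sorted unique values.

```python
def odd_even_groups(attr_values, usevalues=False):
    """Return (odd_group, even_group) of unique attribute values.

    Hypothetical helper for illustration only.
    """
    unique = sorted(set(attr_values))  # assumption: sorted unique values
    if usevalues:
        # Parity of the value itself decides the group.
        odd = [v for v in unique if v % 2 == 1]
        even = [v for v in unique if v % 2 == 0]
    else:
        # Position decides: 1st, 3rd, ... are "odd"; 2nd, 4th, ... are "even".
        odd = unique[0::2]
        even = unique[1::2]
    return odd, even


print(odd_even_groups([2, 2, 3, 3]))                  # grouped by position
print(odd_even_groups([2, 2, 3, 3], usevalues=True))  # grouped by value parity
```

With usevalues=False the value 2 lands in the odd group because it comes first; with usevalues=True it lands in the even group.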
-
class mvpa.datasets.splitters.Splitter(nperlabel='all', nrunspersplit=1, permute=False, count=None, strategy='equidistant', discard_boundary=None, attr='chunks', reverse=False)
Bases: object
Base class of dataset splitters.
Each splitter should be initialized with all its necessary parameters. The
final splitting is done by running the splitter object on a Dataset
via __call__(). This method has to be implemented as a generator, i.e. it
has to produce every possible split with a yield statement.
Each split has to be returned as a sequence of Datasets. The properties
of the split datasets may vary between implementations. It is possible
to declare a sequence element as ‘None’.
Please note that even if only one Dataset is returned, it has to be
an element in a sequence and not just the Dataset object!
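The calling convention can be sketched with a toy class that does not depend on PyMVPA. `ToyNoneSplitter` is a hypothetical stand-in modelled loosely on the NoneSplitter described above; a plain list plays the role of the Dataset.

```python
class ToyNoneSplitter(object):
    """Toy splitter following the generator protocol (not a PyMVPA class)."""

    def __init__(self, mode='second'):
        if mode not in ('first', 'second'):
            raise ValueError("mode must be 'first' or 'second'")
        self.mode = mode

    def __call__(self, dataset):
        # Each split is yielded as a sequence, even though only one
        # element carries data; the other element is None.
        if self.mode == 'second':
            yield (None, dataset)
        else:
            yield (dataset, None)


splitter = ToyNoneSplitter()
for training, testing in splitter([10, 20, 30]):
    print(training, testing)
```

Because __call__ is a generator, callers simply iterate over the splitter's result and unpack each split.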
Initialize splitter base.
Parameters: |
- nperlabel (int or str (or list of them) or float) – Number of dataset samples per label to be included in each
split. If given as a float, it must be in [0,1] range and would
mean the ratio of selected samples per each label.
Two special strings are recognized: ‘all’ uses all available
samples (default) and ‘equal’ uses the maximum number of samples
that can be provided by all of the classes. This value might be
provided as a sequence whose length matches the number of datasets
per split and indicates the configuration for the respective dataset
in each split.
- nrunspersplit (int) – Number of times samples for each split are chosen. This
is mostly useful if a subset of the available samples
is used in each split and the subset is randomly
selected for each run (see the nperlabel argument).
- permute (bool) – If set to True, the labels of each generated dataset
will be permuted on a per-chunk basis.
- count (None or int) – Desired number of splits to be output. It is limited by the
number of splits possible for a given splitter
(e.g. OddEvenSplitter can have only up to 2 splits). If None,
all splits are output (default).
- strategy (str) – If count is not None, the following strategies are possible:
first
First count splits are chosen
random
Random (without replacement) count splits are chosen
equidistant
Splits which are equidistant from each other
- discard_boundary (None or int or sequence of int) – If not None, how many samples on the boundaries between
parts of the split to discard in the training part.
If int, then discarded in all parts. If a sequence, numbers
to discard are given per part of the split.
E.g. if splitter splits only into (training, testing)
parts, then `discard_boundary`=(2,0) would instruct to discard
2 samples from training which are on the boundary with testing.
- attr (str) – Sample attribute used to determine splits.
- reverse (bool) – If True, the order of datasets in the split is reversed, e.g.
instead of (training, testing), (testing, training) will be output.
|
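How count and strategy might interact is sketched below in plain Python. The function `select_splits` is hypothetical. The index formula used for 'equidistant' and the choice to keep randomly selected splits in their original order are assumptions, not the library's documented behavior.

```python
import random


def select_splits(splits, count=None, strategy='equidistant', rng=None):
    """Pick `count` splits out of all possible ones (hypothetical helper)."""
    splits = list(splits)
    if count is None or count >= len(splits):
        return splits  # count is limited by the number of possible splits
    if count < 1:
        raise ValueError("count must be a positive integer or None")
    if strategy == 'first':
        return splits[:count]
    if strategy == 'random':
        rng = rng or random.Random()
        # Sampling without replacement; order preservation is an assumption.
        indices = sorted(rng.sample(range(len(splits)), count))
        return [splits[i] for i in indices]
    if strategy == 'equidistant':
        # Assumption: evenly spaced indices starting at the first split.
        step = float(len(splits)) / count
        return [splits[int(step * i)] for i in range(count)]
    raise ValueError("unknown strategy %r" % (strategy,))


all_splits = ['s0', 's1', 's2', 's3', 's4', 's5']
print(select_splits(all_splits, count=3, strategy='first'))
print(select_splits(all_splits, count=3, strategy='equidistant'))
```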
-
setNPerLabel(value)
Set the number of samples per label in the split datasets.
‘equal’ sets sample size to highest possible number of samples that
can be provided by each class. ‘all’ uses all available samples
(default).
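The accepted nperlabel values can be illustrated by resolving them to a per-label sample count. The helper `resolve_nperlabel` is hypothetical and written in plain Python. Flooring the float ratio is an assumption, and neither the per-dataset sequence form nor a check that an integer request does not exceed the available samples is covered.

```python
from collections import Counter


def resolve_nperlabel(labels, nperlabel='all'):
    """Map each label to the number of samples to select (hypothetical helper)."""
    counts = Counter(labels)
    if nperlabel == 'all':
        # Use every available sample of each label.
        return dict(counts)
    if nperlabel == 'equal':
        # Largest number of samples that every label can provide.
        n = min(counts.values())
        return {label: n for label in counts}
    if isinstance(nperlabel, float):
        if not 0.0 <= nperlabel <= 1.0:
            raise ValueError("a float nperlabel must be in the [0,1] range")
        # Assumption: the ratio is applied per label and rounded down.
        return {label: int(nperlabel * n) for label, n in counts.items()}
    if isinstance(nperlabel, int):
        return {label: nperlabel for label in counts}
    raise ValueError("unsupported nperlabel value %r" % (nperlabel,))


labels = ['a'] * 5 + ['b'] * 3
print(resolve_nperlabel(labels, 'equal'))
print(resolve_nperlabel(labels, 0.5))
```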
-
splitDataset(dataset, specs)
Split a dataset by separating the samples where the configured
sample attribute matches an element of specs.
Parameters: |
- dataset (Dataset) – The source dataset.
- specs (sequence of sequences) – Contains ids of a sample attribute that shall be split into
another dataset.
|
Returns: | Tuple of split datasets.
|
-
splitcfg(dataset)
- Return splitcfg for a given dataset
-
strategy