Generic storage class for datasets with multiple attributes.
A dataset consists of four pieces. The core is a two-dimensional array that has variables (so-called features) in its columns and the associated observations (so-called samples) in its rows. In addition, a dataset may have any number of attributes for features and samples. Unsurprisingly, these are called ‘feature attributes’ and ‘sample attributes’. Each attribute is a vector of any datatype that contains one value per item (feature or sample). Both types of attributes are organized in their respective collections, accessible via the sa (sample attribute) and fa (feature attribute) attributes. Finally, a dataset itself may have any number of additional attributes (e.g. a mapper) that are stored in their own collection, accessible via the a attribute (see examples below).
Attributes
sa : Collection
fa : Collection
a : Collection
Notes
Any dataset might have a mapper attached that is stored as a dataset attribute called mapper.
Examples
The simplest way to create a dataset is from a 2D array.
>>> import numpy as np
>>> from mvpa2.datasets import *
>>> samples = np.arange(12).reshape((4,3))
>>> ds = AttrDataset(samples)
>>> ds.nsamples
4
>>> ds.nfeatures
3
>>> ds.samples
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
The above dataset can only be used for unsupervised machine-learning algorithms, since it doesn’t have any targets associated with its samples. However, creating a labeled dataset is equally simple.
>>> ds_labeled = dataset_wizard(samples, targets=range(4))
Both the labeled and the unlabeled dataset share the same samples array. No copying is performed.
>>> ds.samples is ds_labeled.samples
True
If the data should not be shared the samples array has to be copied beforehand.
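The sharing behaviour described above can be illustrated with plain NumPy arrays. This is a minimal sketch that does not use the dataset classes; `shared` and `private` are hypothetical names standing in for the array a dataset would hold without and with a prior copy.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

shared = samples          # same object, as when no copy is made
private = samples.copy()  # independent copy made beforehand

# Modify the original array in place.
samples[0, 0] = 99

print(shared[0, 0])   # 99, the change is visible through the shared array
print(private[0, 0])  # 0, the copy is unaffected
```

Passing `samples.copy()` instead of `samples` when creating a dataset should therefore keep later in-place changes from leaking between datasets.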
The targets are available from the samples attributes collection, but also via the convenience property targets.
>>> ds_labeled.sa.targets is ds_labeled.targets
True
If desired, any number of additional attributes can be added. Regardless of their original sequence type, they are converted into arrays.
>>> ds_labeled.sa['lovesme'] = [0,0,1,0]
>>> ds_labeled.sa.lovesme
array([0, 0, 1, 0])
An alternative way to create datasets with arbitrary attributes is to pass the attribute collections to the constructor itself, which also checks that the given attributes have an appropriate size:
>>> fancyds = AttrDataset(samples, sa={'targets': range(4),
... 'lovesme': [0,0,1,0]})
>>> fancyds.sa.lovesme
array([0, 0, 1, 0])
Exactly the same logic applies to feature attributes as well.
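The two behaviours mentioned above (conversion of any sequence into an array, and a check that there is one value per sample) can be sketched with NumPy alone. The helper `as_sample_attribute` is hypothetical and only illustrates the idea; it is not part of the library, and the error type it raises is an assumption.

```python
import numpy as np

def as_sample_attribute(value, nsamples):
    """Convert a sequence to an array and check it has one value per sample.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(value)
    if len(arr) != nsamples:
        raise ValueError(
            "attribute has %d values, expected %d" % (len(arr), nsamples))
    return arr

samples = np.arange(12).reshape((4, 3))
nsamples = samples.shape[0]

lovesme = as_sample_attribute([0, 0, 1, 0], nsamples)  # list -> ndarray
targets = as_sample_attribute(range(4), nsamples)      # range -> ndarray
```

For feature attributes the same check would compare against the number of columns, `samples.shape[1]`.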
Datasets can be sliced (selecting a subset of samples and/or features) similar to arrays. Selection is possible using boolean selection masks, index sequences or slicing arguments. The following calls for samples selection all result in the same dataset:
>>> sel1 = ds[np.array([False, True, True, False])]
>>> sel2 = ds[[1,2]]
>>> sel3 = ds[1:3]
>>> np.all(sel1.samples == sel2.samples)
True
>>> np.all(sel2.samples == sel3.samples)
True
During selection data is only copied if necessary. If the slicing syntax is used the resulting dataset will share the samples with the original dataset.
>>> sel1.samples.base is ds.samples
False
>>> sel2.samples.base is ds.samples
False
>>> sel3.samples.base is ds.samples
True
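The copy-versus-view distinction comes from NumPy itself and can be checked directly on an array. The sketch below uses `np.shares_memory` rather than the `base` attribute, since `base` may point at an earlier owner of the memory when the array is itself a view (for example after `reshape`).

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))

by_mask = samples[np.array([False, True, True, False])]  # boolean mask: copy
by_index = samples[[1, 2]]                               # index list: copy
by_slice = samples[1:3]                                  # slice: view

print(np.shares_memory(by_mask, samples))   # False
print(np.shares_memory(by_index, samples))  # False
print(np.shares_memory(by_slice, samples))  # True
```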
Feature selection uses a very similar syntax; features are simply represented on the second axis of the samples array. Plain feature selection is achieved by keeping all samples and selecting a subset of features (all syntax variants for sample selection are also supported for feature selection).
>>> fsel = ds[:, 1:3]
>>> fsel.samples
array([[ 1, 2],
[ 4, 5],
[ 7, 8],
[10, 11]])
It is also possible to select a subset of samples and features simultaneously. When the slicing syntax is used, no copying is performed.
>>> fsel = ds[:3, 1:3]
>>> fsel.samples
array([[1, 2],
[4, 5],
[7, 8]])
>>> fsel.samples.base is ds.samples
True
Please note that simultaneous selection of samples and features does not always behave like indexing the underlying array.
>>> ds[[0,1,2], [1,2]].samples
array([[1, 2],
[4, 5],
[7, 8]])
In contrast, the call ‘ds.samples[[0,1,2], [1,2]]’ would fail, because NumPy tries to pair the two index lists element by element and their lengths differ. In AttrDatasets, selection of samples and features is always applied individually and independently to each axis.
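For comparison, the per-axis selection can be reproduced on a bare NumPy array in two ways. This is a sketch of equivalent array operations, not of how the dataset class implements selection internally.

```python
import numpy as np

samples = np.arange(12).reshape((4, 3))
rows, cols = [0, 1, 2], [1, 2]

# Apply the selections one axis at a time ...
stepwise = samples[rows][:, cols]
# ... or build an open mesh so NumPy indexes the cross product.
meshed = samples[np.ix_(rows, cols)]

# Plain fancy indexing pairs the two lists element by element and fails,
# because shapes (3,) and (2,) cannot be broadcast together.
try:
    samples[rows, cols]
    paired_ok = True
except IndexError:
    paired_ok = False

print(stepwise.tolist())  # [[1, 2], [4, 5], [7, 8]]
print(paired_ok)          # False
```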
A Dataset might have an arbitrary number of attributes for samples, features, or the dataset as a whole. However, only the data samples themselves are required.
Parameters
samples : ndarray
sa : SampleAttributesCollection
fa : FeatureAttributesCollection
a : DatasetAttributesCollection
chunks
targets
Lookup collection that contains an attribute of a given name.
Collections are searched in the following order: sample attributes, feature attributes, dataset attributes. The first collection containing a matching attribute is returned.
Parameters
attr : str
Returns
Collection
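The search order can be sketched with plain dictionaries standing in for the three collections. The function `lookup_collection` is a hypothetical illustration, and raising `KeyError` when nothing matches is an assumption; the text above does not say what happens in that case.

```python
def lookup_collection(name, sa, fa, a):
    """Return the first collection that contains `name`.

    Hypothetical stand-in: sa, fa and a are plain dicts here.
    """
    # Order matters: sample attributes, then feature attributes,
    # then dataset attributes.
    for collection in (sa, fa, a):
        if name in collection:
            return collection
    raise KeyError(name)  # assumed behaviour for a missing attribute

sa = {'targets': [0, 1, 2, 3]}
fa = {'roi': [1, 1, 2]}
a = {'mapper': None, 'roi': 'dataset-level value'}

found = lookup_collection('roi', sa, fa, a)
```

Because `fa` is searched before `a`, the name `roi` resolves to the feature attribute collection even though a dataset attribute of the same name exists.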
Create a dataset from segmented, per-channel timeseries.
Channels are assumed to contain multiple, equally spaced acquisition timepoints. The dataset will contain additional feature attributes associating each feature with a specific channel and timepoint.
Parameters
samples : ndarray
t0 : float
dt : float
channelids : list
targets, chunks :
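How each flattened feature could be associated with a channel and a timepoint is sketched below with NumPy. Several things here are assumptions not stated above: that `samples` has the shape (samples, channels, timepoints), that timepoints are computed as `t0 + dt * index`, and the names `fa_channel` and `fa_timepoint`, which are purely illustrative.

```python
import numpy as np

nsamples, nchannels, ntimepoints = 2, 3, 4
# Assumed layout: samples x channels x timepoints.
samples = np.arange(nsamples * nchannels * ntimepoints).reshape(
    (nsamples, nchannels, ntimepoints))
t0, dt = 0.0, 0.5
channelids = ['Cz', 'Pz', 'Oz']  # example channel names

# Equally spaced acquisition times starting at t0.
timepoints = t0 + dt * np.arange(ntimepoints)

# Flatten channels and timepoints into one feature axis (C order).
flat = samples.reshape((nsamples, -1))

# One value per feature: the channel varies slowly, the timepoint quickly.
fa_channel = np.repeat(channelids, ntimepoints)
fa_timepoint = np.tile(timepoints, nchannels)
```

With this layout, feature 5 belongs to channel `'Pz'` at time 0.5, and `flat[0, 5]` equals `samples[0, 1, 1]`.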
Convenience method to create dataset.
Datasets can be created from N-dimensional samples. Data arrays with more than two dimensions are flattened, preserving the first axis (which separates the samples) and concatenating all other axes into the second axis. Optionally, it is possible to specify targets and chunk attributes for all samples, and to mask the input data (selecting only elements corresponding to non-zero mask elements).
Parameters
samples : ndarray
targets : scalar or ndarray, optional
chunks : scalar or ndarray, optional
mask : ndarray, optional
mapper : Mapper instance, optional
flatten : None or bool, optional
space : str, optional
Returns
instance : Dataset
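The flattening and masking described above correspond to simple array operations. The following is a NumPy-only sketch of the resulting shapes, not the method's actual implementation; the variable names are illustrative.

```python
import numpy as np

# Two samples, each a 3x4 block of values.
samples = np.arange(24).reshape((2, 3, 4))

# Flattening keeps the first axis and merges all remaining axes.
flat = samples.reshape((samples.shape[0], -1))
print(flat.shape)  # (2, 12)

# A mask with the shape of a single sample; non-zero entries are kept.
mask = np.zeros((3, 4), dtype=int)
mask[0, 1] = 1
mask[2, 3] = 1

masked = samples[:, mask != 0]
print(masked.tolist())  # [[1, 11], [13, 23]]
```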
Return an attribute from a collection.
A collection can be specified, but can also be auto-detected.
Parameters
name : str
Returns
(attr, collection)
Feed this dataset through a trained mapper (forward).
Parameters
mapper : Mapper
Returns
Dataset
Used to verify whether the dataset is in the same state as when some other operation was performed, for example whether a classifier was trained on the same dataset as the one in question.
Provide the first element of the samples array.
Notes
Introduced to provide compatibility with numpy.asscalar. See numpy.ndarray.item for more information.
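The behaviour of the underlying `numpy.ndarray.item` call can be checked on a bare array. This sketch shows only NumPy's behaviour; whether the dataset method adds handling of its own is not covered here.

```python
import numpy as np

single = np.array([[7]])
value = single.item()  # converts the only element to a Python scalar
print(type(value).__name__, value)  # int 7

# Without arguments, item() only works for arrays holding one element.
try:
    np.arange(4).item()
    multi_ok = True
except ValueError:
    multi_ok = False
```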
Set an attribute in a collection.
Parameters
name : str
value : array