Version 0.20.0

September 2018

This release packs in a mountain of bug fixes, features and enhancements for the Scikit-learn library, and improvements to the documentation and examples. Thanks to our contributors!

Warning

Version 0.20 is the last version of scikit-learn to support Python 2.7 and Python 3.4. Scikit-learn 0.21 will require Python 3.5 or higher.

Highlights

We have tried to improve our support for common data-science use-cases including missing values, categorical variables, heterogeneous data, and features/targets with unusual distributions. Missing values in features, represented by NaNs, are now accepted in column-wise preprocessing such as scalers. Each feature is fitted disregarding NaNs, and data containing NaNs can be transformed. The new impute module provides estimators for imputing (filling in) missing values.
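As a minimal sketch of the new NaN handling (the toy data below is illustrative, not from the release notes; MinMaxScaler stands in for any of the NaN-tolerant column-wise preprocessors):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    # Toy feature matrix with one missing value (illustrative only).
    X = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])

    # Column-wise preprocessors such as MinMaxScaler now fit each feature
    # disregarding NaNs, and pass NaNs through when transforming.
    scaler = MinMaxScaler().fit(X)
    print(scaler.transform(X))          # the NaN is preserved in the output

    # The new impute module fills in missing values, here with the column mean.
    print(SimpleImputer(strategy="mean").fit_transform(X))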

ColumnTransformer handles the case where different features or columns of a pandas.DataFrame need different preprocessing. String or pandas Categorical columns can now be encoded with OneHotEncoder or OrdinalEncoder.
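A minimal sketch of these two features together, with made-up data (the column names and parameter choices are ours, not from the release notes):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # A small heterogeneous DataFrame: one string column, one numeric column.
    df = pd.DataFrame({"city": ["London", "Paris", "London"],
                       "temperature": [21.0, 25.0, 19.0]})

    # Route each column to its own preprocessing: one-hot encode the strings,
    # standardize the numbers.
    ct = ColumnTransformer([
        ("onehot", OneHotEncoder(sparse=False), ["city"]),
        ("scale", StandardScaler(), ["temperature"]),
    ])
    print(ct.fit_transform(df))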

TransformedTargetRegressor helps when the regression target needs to be transformed to be modeled. PowerTransformer and KBinsDiscretizer join QuantileTransformer as non-linear transformations.
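Sketched below with synthetic data; the linear regressor and the log/exp transform pair are arbitrary illustrative choices:

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer

    rng = np.random.RandomState(0)
    X = rng.uniform(1, 10, size=(100, 1))
    y = np.exp(0.3 * X.ravel() + rng.normal(scale=0.1, size=100))  # skewed target

    # Fit on a log-transformed target; predictions are mapped back automatically.
    model = TransformedTargetRegressor(regressor=LinearRegression(),
                                       func=np.log, inverse_func=np.exp)
    model.fit(X, y)
    print(model.predict(X[:3]))

    # Non-linear feature transformations, now alongside QuantileTransformer.
    print(PowerTransformer(method="box-cox").fit_transform(X)[:3])
    print(KBinsDiscretizer(n_bins=3, encode="ordinal").fit_transform(X)[:3])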

Beyond this, we have added sample_weight support to several estimators (including KMeans, BayesianRidge and KernelDensity) and improved stopping criteria in others (including MLPRegressor, GradientBoostingRegressor and SGDRegressor).
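For example, sample weights can now be passed to fit in KMeans, and gradient boosting exposes validation-based early stopping; the data, weights, and parameter values below are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import GradientBoostingRegressor

    X = np.array([[0.0], [0.2], [10.0], [10.2]])
    weights = np.array([1.0, 1.0, 5.0, 5.0])

    # KMeans (like BayesianRidge and KernelDensity) now accepts sample_weight.
    km = KMeans(n_clusters=2, random_state=0).fit(X, sample_weight=weights)
    print(km.cluster_centers_)

    # GradientBoostingRegressor can now stop early when the score on a held-out
    # validation fraction stops improving.
    rng = np.random.RandomState(0)
    X_reg = rng.uniform(size=(200, 3))
    y_reg = np.dot(X_reg, np.array([1.0, 2.0, 3.0])) + rng.normal(scale=0.1, size=200)
    gbr = GradientBoostingRegressor(n_estimators=500, validation_fraction=0.1,
                                    n_iter_no_change=5, random_state=0)
    gbr.fit(X_reg, y_reg)
    print(gbr.n_estimators_)  # number of fitted stages, typically well below 500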

This release is also the first to be accompanied by a Glossary of Common Terms and API Elements developed by Joel Nothman. The glossary is a reference resource to help users and contributors become familiar with the terminology and conventions used in Scikit-learn.

Sorry if your contribution didn’t make it into the highlights. There’s a lot here…

Changed models

The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.

Details are listed in the changelog below.

(While we are trying to better inform users by providing this information, we cannot assure that this list is complete.)

Known Major Bugs

  • #11924: LogisticRegressionCV with solver=’lbfgs’ and multi_class=’multinomial’ may be non-deterministic or otherwise broken on macOS. This appears to be the case on Travis CI servers, but has not been confirmed on personal MacBooks! This issue has been present in previous releases.

Changelog

Support for Python 3.3 has been officially dropped.

sklearn.cluster

sklearn.compose

sklearn.covariance

sklearn.datasets

sklearn.decomposition

sklearn.discriminant_analysis

  • Efficiency Memory usage improvement for _class_means and _class_cov in discriminant_analysis. #10898 by Nanxin Chen.

sklearn.dummy

sklearn.ensemble

sklearn.feature_extraction

sklearn.feature_selection

sklearn.gaussian_process

sklearn.impute

sklearn.isotonic

sklearn.linear_model

sklearn.manifold

  • Efficiency Speed improvements for both ‘exact’ and ‘barnes_hut’ methods in manifold.TSNE. #10593 and #10610 by Tom Dupre la Tour.
  • Feature Support sparse input in manifold.Isomap.fit. #8554 by Leland McInnes.
  • Feature manifold.t_sne.trustworthiness accepts metrics other than Euclidean. #9775 by William de Vazelhes.
  • Fix Fixed a bug in manifold.spectral_embedding where the normalization of the spectrum was using a division instead of a multiplication. #8129 by Jan Margeta, Guillaume Lemaitre, and Devansh D..
  • API Change Deprecate precomputed parameter in function manifold.t_sne.trustworthiness. Instead, the new parameter metric should be used with any compatible metric including ‘precomputed’, in which case the input matrix X should be a matrix of pairwise distances or squared distances. #9775 by William de Vazelhes.
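As a rough usage sketch of the new metric parameter (the import path follows the manifold.t_sne module referenced above; the data is random and for illustration only):

    import numpy as np
    from sklearn.manifold import TSNE
    from sklearn.manifold.t_sne import trustworthiness

    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    X_embedded = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Instead of the deprecated precomputed flag, pass any supported metric;
    # with metric='precomputed', X should be a matrix of pairwise distances.
    print(trustworthiness(X, X_embedded, n_neighbors=5, metric="manhattan"))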

sklearn.metrics

sklearn.mixture

sklearn.model_selection

sklearn.multioutput

sklearn.naive_bayes

sklearn.neighbors

sklearn.neural_network

sklearn.pipeline

sklearn.preprocessing

sklearn.svm

sklearn.tree

  • Enhancement Although private (and hence not assured any API stability), tree._criterion.ClassificationCriterion and tree._criterion.RegressionCriterion may now be cimported and extended. #10325 by Camil Staps.
  • Fix Fixed a bug in tree.BaseDecisionTree with splitter=”best” where split threshold could become infinite when values in X were near infinite. #10536 by Jonathan Ohayon.
  • Fix Fixed a bug in tree.MAE to ensure sample weights are being used during the calculation of tree MAE impurity. Previous behaviour could cause suboptimal splits to be chosen since the impurity calculation considered all samples to be of equal weight importance. #11464 by John Stott.
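A small sketch of the MAE fix above, with made-up data and weights (the exact split chosen naturally depends on the data):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 0.0, 10.0, 10.0])
    # Sample weights now correctly enter the MAE impurity, so up-weighted
    # samples influence which split is chosen.
    weights = np.array([1.0, 1.0, 3.0, 3.0])

    tree = DecisionTreeRegressor(criterion="mae", max_depth=1, random_state=0)
    tree.fit(X, y, sample_weight=weights)
    print(tree.predict(X))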

sklearn.utils

Multiple modules

Miscellaneous

Changes to estimator checks

These changes mostly affect library developers.