Scikit-learn has a tree-based implementation of kernel density estimation in `KernelDensity`. You may check out this post by the original author. The tree-based implementation outperforms the naive implementation `gaussian_kde` in scipy, making it suitable for large datasets.
Yet as revealed in #25623 and #26658, `KernelDensity` consistently gave incorrect results on data with non-unit variance. In this blog post, we will walk through the underlying issue and the corresponding fix.
Problem Reproduction
Let us begin with a simple example to reveal the problem. We draw samples from the normal distribution with different scales (i.e., standard deviations) 1, 0.1, and 10, and perform kernel density estimation on each.
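A minimal sketch of such a reproduction follows (the sample size, grid, and default parameters here are our own assumptions, not necessarily those of the original experiment; plotting is omitted):

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
results = {}
for scale in (1.0, 0.1, 10.0):
    X = rng.normal(0.0, scale, size=1000)
    grid = np.linspace(-4 * scale, 4 * scale, 200)

    # scipy: bw_method is a dimensionless factor, scaled by the data std internally
    scipy_pdf = gaussian_kde(X)(grid)

    # scikit-learn: bandwidth is used as-is; the default of 1.0 ignores the scale
    kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(X[:, None])
    sklearn_pdf = np.exp(kde.score_samples(grid[:, None]))

    results[scale] = (scipy_pdf, sklearn_pdf)
```

Plotting `scipy_pdf` and `sklearn_pdf` against the true density for each scale reproduces the discrepancy described below.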
As we can see from the plots, scikit-learn's results were far from the underlying distribution for `scale=0.1` and `scale=10`, while scipy gave reasonable estimates.
Problem Analysis
We will start with the univariate case. By definition, the kernel density estimate of a sample $x_1, \ldots, x_n$ is

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ is the kernel function and $h$ is the bandwidth. Rules of thumb such as Scott's rule and Silverman's rule choose the bandwidth proportional to the standard deviation $\hat{\sigma}$ of the data. However, scikit-learn used the `bandwidth` parameter directly as $h$ without scaling it by $\hat{\sigma}$. Hence for `scale=0.1`, the chosen bandwidth would be too large and over-smooth the estimate, and for `scale=10`, the chosen bandwidth would be too small, making the estimate overly sensitive to noise.
Moreover, in scipy's context, the `bw_method` parameter (equivalent to scikit-learn's `bandwidth` parameter) does not directly give the bandwidth $h$: it is a dimensionless factor that scipy multiplies by the sample standard deviation internally, so the effective bandwidth is $h = \texttt{bw\_method} \cdot \hat{\sigma}$.
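To illustrate this relationship, the following sketch (our own, not part of the fix) recovers scipy's effective bandwidth via the `factor` attribute of `gaussian_kde` and feeds it to scikit-learn, at which point the two estimates agree:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(0.0, 10.0, size=500)

kde_scipy = gaussian_kde(X)  # Scott's rule by default
# Effective bandwidth: scipy multiplies its factor by the sample std
h = kde_scipy.factor * X.std(ddof=1)

grid = np.linspace(-40, 40, 101)
kde_sklearn = KernelDensity(kernel="gaussian", bandwidth=h).fit(X[:, None])
dens_sklearn = np.exp(kde_sklearn.score_samples(grid[:, None]))
dens_scipy = kde_scipy(grid)
# With the scaling applied manually, the two estimates coincide
```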
What about multivariate data? The formula is similar:

$$\hat{f}_{\mathbf{H}}(\mathbf{x}) = \frac{1}{n\,|\mathbf{H}|^{1/2}} \sum_{i=1}^{n} K\!\left(\mathbf{H}^{-1/2}(\mathbf{x} - \mathbf{x}_i)\right),$$

except that this time we have a bandwidth matrix $\mathbf{H}$. Similar to the univariate case, a good bandwidth matrix should be chosen proportional to the sample covariance matrix $\hat{\boldsymbol{\Sigma}}$ of the data.
Proposed Solution
The issue is not hard to solve, but to first summarize our analysis: we want the `bandwidth` parameter to be properly scaled before estimation because:
- We want to respect the rules of thumb, e.g., Scott's rule and Silverman's rule.
- We want to be consistent with other scientific Python libraries like `scipy` and `statsmodels` to avoid confusing users.
- We want to make the API easy to use. In particular, there is no clear reason to make users handle data variance manually when choosing a bandwidth, especially in the multivariate case, where users would even need to choose a bandwidth vector if the scaling were not performed internally.
Let $h$ be the `bandwidth` parameter (i.e., before scaling); then we can rewrite the formula as

$$\hat{f}(\mathbf{x}) = \frac{1}{n\,h^{d}\,|\hat{\boldsymbol{\Sigma}}|^{1/2}} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{L}^{-1}(\mathbf{x} - \mathbf{x}_i)}{h}\right),$$

where $\hat{\boldsymbol{\Sigma}} = \mathbf{L}\mathbf{L}^{\top}$ is the Cholesky decomposition of the sample covariance matrix, so that $\mathbf{H} = h^{2}\hat{\boldsymbol{\Sigma}}$.
Compared with the previous (incorrect) implementation

$$\hat{f}(\mathbf{x}) = \frac{1}{n\,h^{d}} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right),$$

it suffices to (1) estimate the covariance matrix of the input data (`np.cov`), (2) factor it with Cholesky decomposition (`scipy.linalg.cholesky`), and (3) whiten the data with triangular solves (`scipy.linalg.solve_triangular`).
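As a rough illustration of this idea (our own sketch, not the actual scikit-learn implementation; `scaled_kde` is a hypothetical helper), the three steps can be put together as follows:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from sklearn.neighbors import KernelDensity

def scaled_kde(X, X_query, bandwidth=1.0):
    """Sketch: KDE with the bandwidth scaled by the data covariance.

    Whitens the data with the Cholesky factor of the sample covariance,
    runs an ordinary KDE in the whitened space, and corrects the density
    by the Jacobian of the transformation, 1 / sqrt(det(Sigma)).
    """
    cov = np.cov(X, rowvar=False)          # (1) sample covariance
    L = cholesky(cov, lower=True)          # (2) Sigma = L @ L.T
    # (3) whiten: solve L @ Z.T = X.T, i.e. Z = X @ inv(L).T
    Z = solve_triangular(L, X.T, lower=True).T
    Z_query = solve_triangular(L, X_query.T, lower=True).T
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(Z)
    log_det = np.sum(np.log(np.diag(L)))   # log sqrt(det(Sigma))
    return np.exp(kde.score_samples(Z_query) - log_det)
```

With a Gaussian kernel this makes the effective kernel covariance $h^{2}\hat{\boldsymbol{\Sigma}}$, which is exactly what `scipy.stats.gaussian_kde` does with `bw_method=h`.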
You may also want to check out the actual fix in #27971.
Future Work
- There is another potential issue (#27186) in the scikit-learn tree-based implementation, which is not yet confirmed since it has not caused any unexpected behavior in practice.
- The `sample_weight` support in kernel density estimation (originally implemented in #10803) is unexpectedly slow. The tree-based implementation in scikit-learn was meant to be faster than the naive implementation in `scipy`, but with `sample_weight` it is several to tens of times slower, even allowing some tolerance.