inconsistent behavior when using custom grid #15

blasern · 2019-02-18T19:07:18Z

FFTKDE seems to change the grid internally when providing a custom grid in 2d. Results do not seem to match with the results from NaiveKDE or TreeKDE. See below for an example

# imports
import numpy as np
import matplotlib.pyplot as plt
import KDEpy
%matplotlib inline

# Create bimodal 2D data
data = np.vstack((np.random.randn(2**8, 2), 
                  np.random.randn(2**8, 2) + (0, 5)))

# Create 2D grid
grid_size = 20
min_X = np.min(data, axis=0)
max_X = np.max(data, axis=0)
grid_margins = tuple(np.linspace(mn, mx, grid_size)
                     for mn, mx in zip(min_X, max_X))
grid = np.stack(np.meshgrid(*grid_margins), -1).reshape(-1, len(grid_margins))

# density estimates
y1 = KDEpy.FFTKDE(bw=0.2).fit(data).evaluate(grid)
y2 = KDEpy.TreeKDE(bw=0.2).fit(data).evaluate(grid)
y3 = KDEpy.NaiveKDE(bw=0.2).fit(data).evaluate(grid)

# plots
fig, axes = plt.subplots(2, 2,  figsize = (10, 10))
axes[0, 0].set_title('Data')
axes[0, 0].scatter(data[:, 0], data[:, 1])
axes[0, 0].set_title('FFTKDE')
axes[0, 1].scatter(grid[:, 0], grid[:, 1], c=y1)
axes[0, 0].set_title('TreeKDE')
axes[1, 0].scatter(grid[:, 0], grid[:, 1], c=y2)
axes[0, 0].set_title('Naive')
axes[1, 1].scatter(grid[:, 0], grid[:, 1], c=y3)

The text was updated successfully, but these errors were encountered:

tommyod · 2019-02-18T20:30:01Z

Thanks for raising the Issue @blasern . I agree that something is wrong here. I hope to look into this bug very soon, this weekend or the next. I will report my findings here and patch it up with a regression test.

tommyod · 2019-02-19T06:38:38Z

I believe I've found the problem.

Notice first that the following code produces incorrect results

# imports
import numpy as np
import matplotlib.pyplot as plt
import KDEpy

# Create bimodal 2D data
data = np.array([[-1, 1], [0, 1], [1, 1], [-1, -1], [0, -1], [1, -1]])

# Create 2D grid
grid_x = np.linspace(-2, 2, 2 ** 5)
grid_y = np.linspace(-2, 2, 2 ** 4)
grid = np.stack(np.meshgrid(grid_x, grid_y), -1).reshape(-1, 2)

# density estimates
y1 = KDEpy.FFTKDE(bw=0.2).fit(data).evaluate(grid)
y2 = KDEpy.TreeKDE(bw=0.2).fit(data).evaluate(grid)
y3 = KDEpy.NaiveKDE(bw=0.2).fit(data).evaluate(grid)

# plots
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0, 0].set_title("Data")
axes[0, 0].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[0, 1].set_title("FFTKDE")
axes[0, 1].scatter(grid[:, 0], grid[:, 1], c=y1)
axes[0, 1].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[1, 0].set_title("TreeKDE")
axes[1, 0].scatter(grid[:, 0], grid[:, 1], c=y2)
axes[1, 0].scatter(data[:, 0], data[:, 1], marker="x", color="red")

axes[1, 1].set_title("Naive")
axes[1, 1].scatter(grid[:, 0], grid[:, 1], c=y3)
axes[1, 1].scatter(data[:, 0], data[:, 1], marker="x", color="red")

print(np.mean(np.abs(y2 - y3)))  # 2.3899e-05
print(np.mean(np.abs(y1 - y3)))  # 0.0945866
print(np.mean(np.abs(y1 - y2)))  # 0.0945963

Changing grid creation slightly produces another error

grid = np.stack(np.meshgrid(grid_y, grid_x), -1).reshape(-1, 2)

Everything works with the following grid

grid = np.stack(np.meshgrid(grid_y, grid_x), -1).reshape(-1, 2)
grid[:, [0, 1]] = grid[:, [1, 0]] # Swap indices

Explanation

The last code works because the grid values are sorted in order of dimensionality. I.e. it's sorted by x_1, then by x_2 in this case. It looks like the following (simplified version):

x_1	x_2
-1	-1
-1	0
-1	1
0	-1
0	0
0	1
1	-1
1	0
1	1

It's required that the grid has this structure to achieve fast linear binning. If I remember correctly: the algorithm must know where to place a data point in the grid in constant time, so it must be able to compute the index immediately. This requires the above grid structure. The (somewhat cryptic) lines of code which perform this is found here:

KDEpy/KDEpy/cutils.pyx

Line 241 in 9e15675

# (1) To compute the index of this corner, and the value, we must

Solution

We should verify that user-defined grids obey the above structure, i.e. that they are sorted by x_1, x_2, ... ,x_D. With N grid points in each of the D dimensions, there are N^D grid points in total. I believe sorting would require O(N^D log (N^D)) operations. Querying whether or not the grid is sorted by the above key requires O(N^D * D) operations, since for each of the N^D grid points D operations are needed.

I propose to add such a check to user-specified grids. I will also perform some more testing. Thoughts on this @blasern ?

blasern · 2019-02-19T13:16:38Z

I still find it slightly inconvenient that I cannot choose the grid I want. As I've said before, I may not even want to use an axis-aligned grid.

On the other hand, as long as this requirement is documented and you check the input grids (perhaps with an option to turn this check off, for speed) I think that's fine.

tommyod · 2019-02-19T13:24:49Z

Great. 👍

I agree that it's inconvenient to have to use an axis-aligned grid, the discussion about non-axis aligned (yet equally spaced) fast algorithms could continue at Issue #6 . Arbitrary grids are possible using the other algorithms.

Thanks again for your valuable input. I'll leave this issue open until I've patched the software.

tommyod self-assigned this Feb 18, 2019

tommyod mentioned this issue Feb 19, 2019

[WIP] Function to check user-specified grid #16

Merged

tommyod closed this as completed in #16 Feb 23, 2019

gabrieldernbach mentioned this issue Oct 25, 2022

Grid must be sorted error #118

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inconsistent behavior when using custom grid #15

inconsistent behavior when using custom grid #15

blasern commented Feb 18, 2019

tommyod commented Feb 18, 2019

tommyod commented Feb 19, 2019

blasern commented Feb 19, 2019

tommyod commented Feb 19, 2019

inconsistent behavior when using custom grid #15

inconsistent behavior when using custom grid #15

Comments

blasern commented Feb 18, 2019

tommyod commented Feb 18, 2019

tommyod commented Feb 19, 2019

Notice first that the following code produces incorrect results

Changing grid creation slightly produces another error

Everything works with the following grid

Explanation

Solution

blasern commented Feb 19, 2019

tommyod commented Feb 19, 2019