-
Notifications
You must be signed in to change notification settings - Fork 4
/
README
89 lines (71 loc) · 4.05 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
M-tree data structure to perform k-NN searches
=============================================
[![Build Status](https://travis-ci.org/tburette/mtree.svg?branch=master)](https://travis-ci.org/tburette/mtree)
This is an implementation of the M-tree, a data structure to find the element(s) the most similar to a given element.
The M-tree is a tree based implementation of the concept of metric space
( http://en.wikipedia.org/wiki/Metric_space ), it is similar to b-tree.
Implementation based on the paper
'M-tree: An Efficient Access Method for Similarity Search in Metric Spaces'
To use the M-tree you only need to pass two things to it:
- a set of objects to store.
- a distance function `d(x, y)` that returns a number establishing
how similar two objects are.
Usage:
======
>>> def d_int(x, y): # define a distance function for numbers
... return abs(x - y)
...
>>> tree = MTree(d_int, max_node_size=4) # create an empty M-tree
>>> tree.add(1) # add object 1 to the tree
>>> tree.add_all([5, 9]) # add objects 5 and 9
>>> tree.search(10) # search the object closest to 10. Will return 9
>>> [9]
>>> tree.search(9, 2) # search the two objects closest to 9.
>>> [5, 9]
The size of nodes (optional argument `max_node_size`) has a large influence on
the number of calls of the distance function (`d`).
The objects you insert in the tree can be anything as long as the
distance function you provide is able to handle them correctly.
The distance function (`d`) must be provided when the tree is created.
It takes as a parameter two objects and return a number telling how
similar the two objects are. The smaller the number, the more similar the
objects are. The number returned can be an integer, float,... Technically
anything that behaves like a number (<, <=, >,...).
The distance function MUST respect the following properties:
- d always return the same value given the same parameters
- Non negativity: forall x, y: d(x, y) >= 0
d must never return a negative value. If the value your function returns
can be negative but has a lower bound (e.g. never returns anything lower
than -100) you can fix this by systematically increasing the value of
all the number returned (e.g. return value +100).
- Symmetry: forall x, y: d(x, y) = d(y, x)
The same value must be returned no matter what the order of the parameters
are.
- Identity: forall x, y: d(x, y) = 0 means that x = y
- Triangle inequality: forall x, y, z d(x, z) <= d(x, y) + d(y, z)
The distance from one point to a second is always smaller or equal to the
the distance from one point to an intermediary + the distance from the
intermediary to the second point.
Here is an analogy to help understand this property. Imagine a road
going directly between two towns. It never turns, it is a perfectly
straight line. This is obviously the shortest way to get between the two
town. Now imagine we pick a position anywhere we want. If we go from
one town to the other by passing trough this position, it is impossible to
have travelled less than by following the straight road.
If the distance function violates one of these rule, the M-tree may
return erroneous results.
If the same object is inserted multiple times, it will be considered as
different object by the tree.
Notes
=====
This implementation is memory only. The tree is not stored on disk.
This may be a problem if the objects you store are large (pictures, sound,...)
Although the tree itself resides in memory you can store the objects it contains on disk (or online,...). For example the objects you pass to the tree could
be path to files; the d function would load the files from disk to perform the
comparisons.
To maintain good performance while minimizing memory usage, a good trade-off
is to store in the objects only the path to the actual data as well as the key
features that define the data. The distance function (d) can then compare
the objects using the features without the need for disk access
That way, searches are fast (no disk access) while keeping data on disk.
Compatible Python 2 and 3.