Command-line interface for NGT with Quantization for indexing high-dimensional data
qbg - proximity search for high dimensional data with quantization
$ qbg command [option] index [data]
Note:
When the environment variable POSIXLY_CORERECT is set on some platforms such as Cygwin, you should specifiy options before the command as follows.
$ qbg [option] command index [data]
qbg handles two types of graphs with quantization: Quantized Graph (QG) and Quantized Blob Graph (QBG).
command for the quantized graph is one of:
command for the quantized blob graph is one of:
Make and initialize a QG directory for the quantized graph in the specified NGT index directory, and insert the data in the NGT index into the QG index.
$ qbg create-qg [-P number_of_extended_dimensions] [-Q number_of_subvector_dimensions] index
index
Specify the name of the directory for the existing index such as ANNG or ONNG to be quantized. The index only with L2 distance and normalized cosine similarity distance can be quantized. You should build the ANNG or ONNG with normalized cosine similarity in order to use cosine similarity for the quantized graph.
-P number_of_extended_dimensions
Specify the number of the extended dimensions. The number should be greater than or equal to the number of the genuine dimensions, and also should be a multiple of 4. When this option is not specified, the smallest multiple of 4 that is greater than the dimension is set to the number of the extended dimensions.
-Q number_of_subvector_dimension
Specify the number of the subvector dimensions. The number should be less than or equal to the the number of the extended dimensions, and also should be a divisor of the number of the extended dimensions. When this option is not specified, one is set to the number of the subvector dimensions.
Quantize the objects of the specified index and build a quantized graph into the index.
$ qbg build-qg [-o number_of_objects_for_quantization] [-E max_number_of_edges] [-M number_of_trials] index
index
Specify the name of the directory for the existing index such as ANNG or ONNG to be quantized. The index only with L2 distance and normalized cosine similarity distance can be quantized. You should build the ANNG or ONNG with normalized cosine similarity in order to use cosine similarity for the quantized graph.
-o number_of_objects_for_quantization
Specify the number of object for quantization and optimization. The number should be less than or equal to the number of the registered objects.
-E max_number_of_edges
Specify the maximum number of edges to build a qunatized graph. Since every 16 objects that are associated with edges of each node are processed, the number should be a multiple of 16.
-M number_of_trials
Specify the number of trials to optimize the subvector quantization.
Search the index using the specified query data.
$ qbg search-qg [-n number_of_search_objects] [-e search_range_coefficient] [-p result_expansion]
[-r search_radius] index query_data
index
Specify the path of the existing quantized index.
query_data
Specify the path of the file containing query data. This file shall consist of one item of query data per line and each dimensional element of that data item shall be delimited by a space or tab. Each search shall be sequentially performed when providing multiple queries.
-n number_of_search_objects (default: 20)
Specify the number of search objects.
-e search_range_coefficient (default = 0.02)
Specify the magnification coefficient (epsilon) of the search range. A larger value means higher accuracy but slower searching, while a smaller value means a drop in accuracy but faster searching. While it is desirable to adjust this value within the range of 0 - 0.1, a negative value (> -1.0) may also be specified.
-p result_expansion (default = 3.0)
Specify the expansion ratio of the number of approximate inner search objects to the number of search objects. For example, when the ratio is 10 and the number of search objects is 20, the number of the approximate search objects is set to 200 inside the search processing. A larger value brings higher accuracy but slower searching.
Make and initialize a QBG directory for the quantized blob graph.
$ qbg create [-d number_of_dimension] [-P number_of_extended_dimensions] [-O object_type] [-D distance_function] [-C cluster_data_type] index
index
Specify the name of the directory for QBG.
-d number_of_dimensions
Specify the number of dimensions of registration data.
-P number_of_extended_dimensions
Specify the number of the extended dimensions. The number should be greater than or equal to the number of the genuine dimensions, and also should be a multiple of 4. When this option is not specified, the smallest multiple of 4 that is greater than the dimension is set to the number of the extended dimensions.
-O object_type
Specify the data object type.
- c: 1 byte unsigned integer
- f: 4 byte floating point number (default)
-D distance_function
Specify the distance function as follows.
- 2: L2 distance (default)
- c: Cosine similarity
-C cluster_data_type
Specify the cluster data type.
- pq4: 4 bit product quantization (default)
- sq8: 1 byte scalar quantization
-N number_of_subvectors
Specify the number of subvectors that should be a divisor of the number of the extended dimensions.
Append the specified data to the specified index.
$ qbg append index registration_data
Quantize the objects of the specified index and build a quantized graph into the index.
$ qbg build [-o number_of_objects_for_quantization] [-E max_number_of_edges] [-M number_of_trials] [-P rotation] index
index
Specify the name of the directory for the existing index such as ANNG or ONNG to be quantized. The index only with L2 distance and normalized cosine similarity distance can be quantized. You should build the ANNG or ONNG with normalized cosine similarity in order to use cosine similarity for the quantized graph.
-o number_of_objects_for_quantization
Specify the number of object for quantization and optimization. The number should be less than or equal to the number of the registered objects.
-P rotation
Specify the transform matrix type for the inserted and query object to optimize the subvector quantization.
- r: Rotation matrix.
- R: Rotation and repositioning matrix.
- p: Repositioning matrix.
- n: No matrix.
-M number_of_trials
Specify the number of trials to optimize the subvector quantization.
Search the index using the specified query data.
$ qbg search [-n number_of_search_objects] [-e search_range_coefficient] [-p result_expansion]
index query_data
index
Specify the path of the existing quantized index.
query_data
Specify the path of the file containing query data. This file shall consist of one item of query data per line and each dimensional element of that data item shall be delimited by a space or tab. Each search shall be sequentially performed when providing multiple queries.
-n number_of_search_objects (default: 20)
Specify the number of search objects.
-e search_range_coefficient (default = 0.02)
Specify the magnification coefficient (epsilon) of the search range. A larger value means higher accuracy but slower searching, while a smaller value means a drop in accuracy but faster searching. While it is desirable to adjust this value within the range of 0 - 0.1, a negative value (> -1.0) may also be specified.
-B blob_search_range_coefficient (default = 0.0)
Specify the magnification coefficient (epsilon) of the search range for the quantized blob graph.
-N number_of_explored_nodes (default = 256)
Specify the number of the explored nodes in the graph. When the number of the explored nodes reached the specified number, the search is terminated.
-p result_expansion (default = 0.0)
Specify the expansion ratio of the number of approximate inner search objects to the number of search objects. For example, when the ratio is 10 and the number of search objects is 20, the number of the approximate search objects is set to 200 inside the search processing. A larger value brings higher accuracy but slower searching.
$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean.tsv
$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean_query.tsv
$ head -1 sift-128-euclidean_query.tsv > query.tsv
Build an ANNG for 128-dimensional, floating point data:
$ ngt create -d 128 -o f -D 2 anng sift-128-euclidean.tsv
Data loading time=15.4804 (sec) 15480.4 (msec)
# of objects=1000000
Processed 100000 objects. time= 4.26452 (sec)
...
Processed 1000000 objects. time= 7.06745 (sec)
Index creation time=63.3504 (sec) 63350.4 (msec)
Create and initialize the quantized graph:
$ qbg create-qg anng
creating...
appending...
Build the quantized graph:
$ qbg build-qg anng
optimizing...
building the inverted index...
building the quantized graph...
Search k nearest neighbors with the quantized graph:
$ qbg search-qg -n 20 -e 0.02 anng query.tsv
Query No.1
Rank ID Distance
1 932086 232.871
2 934877 234.715
3 561814 243.99
...
20 2177 276.781
Query Time= 0.0005034 (sec), 0.5034 (msec)
Average Query Time= 0.0005034 (sec), 0.5034 (msec), (0.0005034/1)
Build an ANNG having more edges for higher performance:
$ ngt create -d 128 -o f -D 2 -E 40 anng-40 sift-128-euclidean.tsv
Build an ONNG:
$ ngt reconstruct-graph -m S -E 64 -o 64 -i 120 anng-40 onng-40
Create and initialize the quantized graph:
$ qbg create-qg onng-40
Build the quantized graph:
$ qbg build-qg onng-40
Search k nearest neighbors with the quantized graph:
$ qbg search -n 20 -e 0.02 onng-40 query.tsv
$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean.tsv
$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean_query.tsv
$ head -1 sift-128-euclidean_query.tsv > query.tsv
Create and initialize the quantized blob graph:
$ qbg create -d 128 -D 2 -N 128 qbg-index
Append objects:
$ qbg append qbg-index sift-128-euclidean.tsv
Build the quantized graph:
$ qbg build qbg-index
Search k nearest neighbors with the quantized blob graph:
$ qbg search -n 20 -e 0.02 qbg-index query.tsv