Feature/hnswlib (#24)
* #9 fix for method typo

* #3 docs for ODFE index configuration and hyper-parameters

* #9 reqs freeze

* #11 optimized params to reach 700k vectors indexed (still not 1M)

* #16 indexer for hnswlib, stores randomly generated vectors into binary index on disk

* #16 I/O for binary vector format from Yandex (image dataset)

* #16 hnswlib indexer for big-ann

* #16 vector data visualizer (tensorboard)

* #17 NSW graph visualization

* #17 NSW graph implementation

* #17 pca and t-sne

* #17 viz code (fbin->tsv)

* #17 sharding

* #17 sharding algorithm, first two steps

* #17 sharding algorithm, first two steps

* added toml config

* added IDE path

---------

Co-authored-by: dmitry.kan <[email protected]>
DmitryKey and dmitry.kan authored Aug 28, 2024
1 parent e45f938 commit 5a25783
Showing 23 changed files with 13,975 additions and 122 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -128,3 +128,6 @@ dmypy.json

# Pyre type checker
.pyre/

# IDE
.idea
78 changes: 66 additions & 12 deletions README.md
@@ -33,6 +33,7 @@ Also, if you are interested in Vector Databases and Neural Search Frameworks, th

Tech stack:
- Hugging Face
- Solr / Elasticsearch / OpenSearch
- streamlit
- Python 3.8 (upgraded recently)
@@ -45,17 +46,42 @@ If you encounter issues with the above installation, consider installing full li

`pip install -r requirements_freeze.txt`

# Let's install bert-as-service components

`pip install bert-serving-server`

`pip install bert-serving-client`

# Download a pre-trained BERT model
Download a pre-trained BERT model into the `bert-model/` directory in this project. I have chosen [uncased_L-12_H-768_A-12.zip](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) for this experiment. Unzip it.

# Now let's start the BERT service

`bash start_bert_server.sh`

# Run a sample BERT client

`python src/bert_client.py`

This computes vectors for 3 sample sentences:

Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474 0.32404128 -0.82704437 ... -0.3711958 -0.39250174
-0.31721866]
[ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
-0.11345179]
[ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366 -0.39310536
0.07640187]]
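
For reference, here is a minimal sketch of what such a client call looks like with `bert-serving-client` (it assumes the server from the previous step is running locally with the default ports; the sentences are the three samples above):

```python
from bert_serving.client import BertClient

# Connect to the bert-as-service server started by start_bert_server.sh
# (assumes default host and ports).
bc = BertClient(ip="localhost")

sentences = ["First do it", "then do it right", "then do it better"]
vectors = bc.encode(sentences)  # numpy array of shape (3, 768) for this BERT model

print("Bert vectors for sentences", sentences, ":", vectors)
```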

The steps so far set the stage for our further experiments with indexing in the preferred search engine.

# Dataset
This is by far the key ingredient of every experiment. You want to find an interesting
collection of texts that is suitable for semantic-level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia,
which I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into the `data/dbpedia` directory in bz2 format.
You don't need to extract this file to disk: the provided code reads directly from the compressed file.
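
Reading straight from the bz2 archive is a one-liner with the standard library; a minimal sketch (the parsing of the .ttl lines is left out, and the path matches the location above):

```python
import bz2

# Stream the compressed DBPedia abstracts without unpacking them to disk.
with bz2.open("data/dbpedia/long_abstracts_en.ttl.bz2", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # peek at the first few .ttl lines only
            break
        print(line.rstrip())
```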

# Data preprocessing and Indexing in Solr
Before running preprocessing / indexing, you need to configure the vector plugin, which allows indexing and querying the vector data.
You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring/releases

After the plugin's jar has been added, configure it in the solrconfig.xml like so:

@@ -94,16 +120,42 @@ We know how many abstracts there are:
bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

# Data preprocessing and Indexing in Elasticsearch
This project implements several ways to index vector data:
* `src/index_dbpedia_abstracts_elastic.py` vanilla Elasticsearch: using `dense_vector` data type
* `src/index_dbpedia_abstracts_elastiknn.py` elastiknn plugin: implements its own data type. I used `elastiknn_dense_float_vector`

Each indexer relies on a ready-made Elasticsearch mapping file, which can be found in the `es_conf/` directory:
* `es_conf/vector_settings.json` is used for vanilla vector search
* `es_conf/elastiknn_settings.json` is used for KNN vector search implemented with the [elastiknn plugin](https://github.com/alexklibisz/elastiknn).

To configure elastiknn, please refer to its excellent documentation.
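
For orientation, a minimal sketch of the vanilla `dense_vector` path is shown below; it assumes a local Elasticsearch and the 7.x-style `elasticsearch-py` client, and the index and field names are illustrative rather than the exact ones from `es_conf/vector_settings.json`:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create a toy index with a 768-dimensional dense_vector field.
es.indices.create(
    index="abstracts_demo",
    body={
        "mappings": {
            "properties": {
                "abstract": {"type": "text"},
                "vector": {"type": "dense_vector", "dims": 768},
            }
        }
    },
)

# Index a single document; in the real indexer the vector comes from BERT/SBERT.
es.index(
    index="abstracts_demo",
    id="1",
    body={"abstract": "A sample DBPedia abstract", "vector": [0.0] * 768},
)
```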

# Data preprocessing and Indexing in ODFE (Open Distro for Elasticsearch)

* `src/index_dbpedia_abstracts_opendistro.py` ODFE: uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

It is important to understand that, unlike the vanilla or elastiknn implementations (Java), nmslib implements HNSW graphs in C++, and therefore
ODFE uses off-heap memory to build this data structure. To achieve optimal indexing and search performance, you need to consider
the following hyper-parameters:

* number of shards and number of replicas: https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/#primary-and-replica-shards
* KNN space type: `cosinesimil`, `hammingbit`, `l1`, `l2`
* refresh interval
* number of segments in the Lucene index
* `circuit_breaker_limit` -- a cluster-level setting controlling the portion of RAM used for off-heap graphs

The recommended formula for computing the RAM used for storing the graphs:

RAM(vector_dimension) = 1.1 * (4 * vector_dimension + 8 * M) bytes / vector

In this project we compute vectors with 768 dimensions. For 1M vectors and M=16 we will need:

RAM(768) = 1.1 * (4 * 768 + 8 * 16) * 1,000,000 bytes ~= 3.52 GB (3.28 GiB)
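
A quick way to sanity-check the estimate (a sketch; the constants come directly from the formula above):

```python
def hnsw_graph_ram_bytes(num_vectors: int, dim: int, m: int = 16) -> float:
    """Estimated off-heap RAM for the HNSW graphs, per the rule of thumb above."""
    return 1.1 * (4 * dim + 8 * m) * num_vectors

bytes_needed = hnsw_graph_ram_bytes(1_000_000, 768, m=16)
print(f"{bytes_needed / 10**9:.2f} GB / {bytes_needed / 2**30:.2f} GiB")  # ~3.52 GB / ~3.28 GiB
```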

Replicas will double the amount of RAM needed for your cluster.

# Data preprocessing and Indexing in GSI APU
In order to use the GSI APU solution, a user needs to produce two files (a minimal sketch of producing them follows the list):
* a numpy 2D array with vectors of the desired dimension (768 in my case)
* a pickle file with document ids matching the document ids of the said vectors in Elasticsearch
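
The sketch below writes those two files (the file names, the array shape and the random vectors are purely illustrative; in the project the vectors come from SBERT and the ids from the Elasticsearch index):

```python
import pickle

import numpy as np

# Stand-ins for the real data: N 768-dimensional vectors and their Elasticsearch doc ids.
vectors = np.random.rand(1000, 768).astype(np.float32)
doc_ids = [str(i) for i in range(vectors.shape[0])]

np.save("vectors.npy", vectors)          # 2D numpy array with the embeddings
with open("doc_ids.pkl", "wb") as f:
    pickle.dump(doc_ids, f)              # pickle with the matching document ids
```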
@@ -112,9 +164,10 @@ After these data files get uploaded to the GSI server, the same data gets indexe
Since I ran into indexing performance issues with the bert-as-service solution,
I decided to take the SBERT approach from Hugging Face to prepare the numpy and pickle files.
This allowed me to index into Elasticsearch freely at any time, without waiting for days.
You can do this on the DBPedia data with the script mentioned below, which allows choosing between:

* `EmbeddingModel.HUGGING_FACE_SENTENCE` (SBERT; see the sketch below)
* `EmbeddingModel.BERT_UNCASED_768` (bert-as-service)
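
For the SBERT option, the embedding step boils down to something like the sketch below (it assumes the `sentence-transformers` package; the model name is an illustrative 768-dimensional choice, not necessarily the one used in the project):

```python
from sentence_transformers import SentenceTransformer

# Any 768-dimensional sentence embedding model will do for this illustration.
model = SentenceTransformer("bert-base-nli-mean-tokens")

abstracts = ["First do it", "then do it right", "then do it better"]
embeddings = model.encode(abstracts)  # numpy array of shape (3, 768)
print(embeddings.shape)
```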

To generate the numpy and pickle files, use the following script: `src/create_gsi_files.py`.
This script produces two files:
@@ -130,7 +183,8 @@ Running the BERT search demo
===
There are two streamlit demos for running BERT search
for Solr and Elasticsearch. Each demo compares against BM25-based search.
The following assumes that you have bert-as-service up and running (if not, launch it with `bash start_bert_server.sh`)
and either Elasticsearch or Solr running with an index containing a field with embeddings.

To run a demo, execute the following on the command line from the project root:

Binary file added data/viz/nsw-graphs/image.png
5 changes: 2 additions & 3 deletions es_conf/opendistro_settings.json
@@ -1,12 +1,11 @@
{
  "settings": {
-   "number_of_shards": 1,
+   "number_of_shards": 2,
    "number_of_replicas": 0,
    "index": {
      "knn": true,
      "knn.space_type": "cosinesimil",
-     "refresh_interval": "-1",
-     "merge.scheduler.max_thread_count": 1
+     "refresh_interval": "60s"
    }
  },
  "mappings": {
Binary file removed img/bert.jpeg
Binary file removed img/solr.png
135 changes: 135 additions & 0 deletions pyproject.toml
@@ -0,0 +1,135 @@
[project]
dynamic = ["version"]
name = "bert-solr-search"
description = "Compute and search dense vectors using Apache Solr, Elasticsearch, OpenSearch and various vector search implementations"
keywords = ["vector search", "embeddings", "apache solr", "elasticsearch", "opensearch", "elastiknn"]

requires-python = ">= 3.8"

authors = [
{name = "Dmitry Kan"},
]
maintainers = [
{name = "Dmitry Kan"}
]

readme = "README.md"

license = {file = "LICENSE"}

dependencies = [
"absl-py==0.12.0",
"altair==4.1.0",
"appnope==0.1.2",
"argon2-cffi==20.1.0",
"astor==0.8.1",
"async-generator==1.10",
"attrs==21.2.0",
"backcall==0.2.0",
"base58==2.1.0",
"bert-serving-client==1.10.0",
"bert-serving-server==1.10.0",
"bleach==3.3.0",
"blinker==1.4",
"Brotli==1.0.9",
"cached-property==1.5.2",
"cachetools==4.2.2",
"certifi==2020.12.5",
"cffi==1.14.5",
"chardet==4.0.0",
"click==7.1.2",
"decorator==5.0.7",
"defusedxml==0.7.1",
"entrypoints==0.3",
"Flask==1.1.2",
"Flask-Compress==1.9.0",
"Flask-Cors==3.0.10",
"Flask-JSON==0.3.4",
"gast==0.2.2",
"gitdb==4.0.7",
"GitPython==3.1.14",
"google-pasta==0.2.0",
"GPUtil==1.4.0",
"grpcio==1.37.1",
"h5py==3.2.1",
"idna==2.10",
"importlib-metadata==4.0.1",
"ipykernel==5.5.4",
"ipython==7.23.1",
"ipython-genutils==0.2.0",
"ipywidgets==7.6.3",
"itsdangerous==1.1.0",
"jedi==0.18.0",
"Jinja2==2.11.3",
"joblib==1.0.1",
"jsonschema==3.2.0",
"jupyter-client==6.1.12",
"jupyter-core==4.7.1",
"jupyterlab-pygments==0.1.2",
"jupyterlab-widgets==1.0.0",
"Keras-Applications==1.0.8",
"Keras-Preprocessing==1.1.2",
"Markdown==3.3.4",
"MarkupSafe==1.1.1",
"matplotlib-inline==0.1.2",
"mistune==0.8.4",
"nbclient==0.5.3",
"nbconvert==6.0.7",
"nbformat==5.1.3",
"nest-asyncio==1.5.1",
"notebook==6.3.0",
"numpy==1.18.5",
"opt-einsum==3.3.0",
"packaging==20.9",
"pandas==1.2.4",
"pandocfilters==1.4.3",
"parso==0.8.2",
"pexpect==4.8.0",
"pickleshare==0.7.5",
"Pillow==8.2.0",
"plotly==4.9.0",
"prometheus-client==0.10.1",
"prompt-toolkit==3.0.18",
"protobuf==3.16.0",
"ptyprocess==0.7.0",
"pyarrow==4.0.0",
"pycparser==2.20",
"pydeck==0.6.2",
"Pygments==2.9.0",
"pyparsing==2.4.7",
"pyrsistent==0.17.3",
"python-dateutil==2.8.1",
"pytz==2021.1",
"pyzmq==22.0.3",
"requests==2.25.1",
"retrying==1.3.3",
"scikit-learn==0.24.2",
"scipy==1.6.3",
"Send2Trash==1.5.0",
"six==1.16.0",
"sklearn==0.0",
"smmap==4.0.0",
"streamlit==0.81.1",
"tensorboard==1.15.0",
"tensorflow==1.15.0",
"tensorflow-estimator==1.15.1",
"termcolor==1.1.0",
"terminado==0.9.4",
"testpath==0.4.4",
"threadpoolctl==2.1.0",
"toml==0.10.2",
"toolz==0.11.1",
"tornado==6.1",
"traitlets==5.0.5",
"typing-extensions==3.10.0.0",
"tzlocal==2.1",
"urllib3==1.26.4",
"validators==0.18.2",
"wcwidth==0.2.5",
"webencodings==0.5.1",
"Werkzeug==1.0.1",
"widgetsnbextension==3.5.1",
"wrapt==1.12.1",
"zipp==3.4.1",

]
