Feature/hnswlib (#24)
* #9 fix for method typo

* #3 docs for ODFE index configuration and hyper-parameters

* #9 reqs freeze

* #11 optimized params to reach 700k vectors indexed (still not 1M)

* #16 indexer for hnswlib, stores randomly generated vectors into binary index on disk

* #16 I/O for binary vector format from Yandex (image dataset)

* #16 hnswlib indexer for big-ann

* #16 vector data visualizer (tensorboard)

* #17 NSW graph visualization

* #17 NSW graph implementation

* #17 pca and t-sne

* #17 viz code (fbin->tsv)

* #17 sharding

* #17 sharding algorithm, first two steps

* #17 sharding algorithm, first two steps

* added toml config

* added IDE path

---------

Co-authored-by: dmitry.kan <[email protected]>
DmitryKey and dmitry.kan authored Aug 28, 2024
1 parent e45f938 commit 5a25783
Showing 23 changed files with 13,975 additions and 122 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -128,3 +128,6 @@ dmypy.json

# Pyre type checker
.pyre/

# IDE
.idea
78 changes: 66 additions & 12 deletions README.md
@@ -33,6 +33,7 @@ Also, if you are interested in Vector Databases and Neural Search Frameworks, th

Tech stack:
- Hugging Face
- Solr / Elasticsearch / OpenSearch
- streamlit
- Python 3.8 (upgraded recently)
@@ -45,17 +46,42 @@ If you encounter issues with the above installation, consider installing full li

`pip install -r requirements_freeze.txt`

# Let's install bert-as-service components

`pip install bert-serving-server`

`pip install bert-serving-client`

# Download a pre-trained BERT model
Download a pre-trained BERT model into the `bert-model/` directory in this project. I have chosen [uncased_L-12_H-768_A-12.zip](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) for this experiment. Unzip it.

# Now let's start the BERT service

`bash start_bert_server.sh`

# Run a sample BERT client

`python src/bert_client.py`

This computes vectors for 3 sample sentences:

Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474 0.32404128 -0.82704437 ... -0.3711958 -0.39250174
-0.31721866]
[ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
-0.11345179]
[ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366 -0.39310536
0.07640187]]
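
For reference, here is a minimal sketch of what such a client call looks like with `bert-serving-client` (it assumes the server from the previous step is running locally with the default ports; the sentences are the three samples above):

```python
from bert_serving.client import BertClient

# Connect to the bert-as-service server started by start_bert_server.sh
# (assumes default host and ports).
bc = BertClient(ip="localhost")

sentences = ["First do it", "then do it right", "then do it better"]
vectors = bc.encode(sentences)  # numpy array of shape (3, 768) for this BERT model

print("Bert vectors for sentences", sentences, ":", vectors)
```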

The steps so far set the stage for our further experiments with indexing in the preferred search engine.

# Dataset
This is by far the key ingredient of every experiment. You want to find an interesting
collection of texts that is suitable for semantic-level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia,
which I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into the `data/dbpedia` directory in bz2 format.
You don't need to extract this file to disk: the provided code reads directly from the compressed file.
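
Reading straight from the bz2 archive is a one-liner with the standard library; a minimal sketch (the parsing of the .ttl lines is left out, and the path matches the location above):

```python
import bz2

# Stream the compressed DBPedia abstracts without unpacking them to disk.
with bz2.open("data/dbpedia/long_abstracts_en.ttl.bz2", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # peek at the first few .ttl lines only
            break
        print(line.rstrip())
```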

# Data preprocessing and Indexing in Solr
Before running preprocessing / indexing, you need to configure the vector plugin, which allows indexing and querying the vector data.
You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring/releases

After the plugin's jar has been added, configure it in the solrconfig.xml like so:

@@ -94,16 +120,42 @@ We know how many abstracts there are:
bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

# Data preprocessing and Indexing in Elasticsearch
This project implements several ways to index vector data:
* `src/index_dbpedia_abstracts_elastic.py` vanilla Elasticsearch: using `dense_vector` data type
* `src/index_dbpedia_abstracts_elastiknn.py` elastiknn plugin: implements its own data type. I used `elastiknn_dense_float_vector`

Each indexer relies on a ready-made Elasticsearch mapping file, which can be found in the `es_conf/` directory:
* `es_conf/vector_settings.json` is used for vanilla vector search
* `es_conf/elastiknn_settings.json` is used for KNN vector search implemented with the [elastiknn plugin](https://github.com/alexklibisz/elastiknn).

To configure elastiknn, please refer to its excellent documentation.
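
For orientation, a minimal sketch of the vanilla `dense_vector` path is shown below; it assumes a local Elasticsearch and the 7.x-style `elasticsearch-py` client, and the index and field names are illustrative rather than the exact ones from `es_conf/vector_settings.json`:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create a toy index with a 768-dimensional dense_vector field.
es.indices.create(
    index="abstracts_demo",
    body={
        "mappings": {
            "properties": {
                "abstract": {"type": "text"},
                "vector": {"type": "dense_vector", "dims": 768},
            }
        }
    },
)

# Index a single document; in the real indexer the vector comes from BERT/SBERT.
es.index(
    index="abstracts_demo",
    id="1",
    body={"abstract": "A sample DBPedia abstract", "vector": [0.0] * 768},
)
```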

# Data preprocessing and Indexing in ODFE (Open Distro for Elasticsearch)

* `src/index_dbpedia_abstracts_opendistro.py` ODFE: uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

It is important to understand that, unlike the vanilla or elastiknn implementations (Java), nmslib implements HNSW graphs in C++, and therefore
ODFE uses off-heap memory to build this data structure. To achieve optimal indexing and search performance, you need to consider
the following hyper-parameters:

* number of shards and number of replicas: https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/#primary-and-replica-shards
* KNN space type: `cosinesimil`, `hammingbit`, `l1`, `l2`
* refresh interval
* number of segments in the Lucene index
* `circuit_breaker_limit` -- a cluster-level setting controlling the portion of RAM used for off-heap graphs

The recommended formula for computing the RAM used for storing the graphs:

RAM(vector_dimension) = 1.1 * (4 * vector_dimension + 8 * M) bytes / vector

In this project we compute vectors with 768 dimensions. For 1M vectors and M=16 we will need:

RAM(768) = 1.1 * (4 * 768 + 8 * 16) * 1,000,000 bytes ~= 3.52 GB (3.28 GiB)
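
A quick way to sanity-check the estimate (a sketch; the constants come directly from the formula above):

```python
def hnsw_graph_ram_bytes(num_vectors: int, dim: int, m: int = 16) -> float:
    """Estimated off-heap RAM for the HNSW graphs, per the rule of thumb above."""
    return 1.1 * (4 * dim + 8 * m) * num_vectors

bytes_needed = hnsw_graph_ram_bytes(1_000_000, 768, m=16)
print(f"{bytes_needed / 10**9:.2f} GB / {bytes_needed / 2**30:.2f} GiB")  # ~3.52 GB / ~3.28 GiB
```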

Replicas will double the amount of RAM needed for your cluster.

# Data preprocessing and Indexing in GSI APU
In order to use the GSI APU solution, a user needs to produce two files (a minimal sketch of producing them follows the list):
* a numpy 2D array with vectors of the desired dimension (768 in my case)
* a pickle file with document ids matching the document ids of the said vectors in Elasticsearch
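
The sketch below writes those two files (the file names, the array shape and the random vectors are purely illustrative; in the project the vectors come from SBERT and the ids from the Elasticsearch index):

```python
import pickle

import numpy as np

# Stand-ins for the real data: N 768-dimensional vectors and their Elasticsearch doc ids.
vectors = np.random.rand(1000, 768).astype(np.float32)
doc_ids = [str(i) for i in range(vectors.shape[0])]

np.save("vectors.npy", vectors)          # 2D numpy array with the embeddings
with open("doc_ids.pkl", "wb") as f:
    pickle.dump(doc_ids, f)              # pickle with the matching document ids
```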
@@ -112,9 +164,10 @@ After these data files get uploaded to the GSI server, the same data gets indexe
Since I ran into indexing performance issues with the bert-as-service solution,
I decided to take the SBERT approach from Hugging Face to prepare the numpy and pickle files.
This allowed me to index into Elasticsearch freely at any time, without waiting for days.
You can do this on the DBPedia data with the script mentioned below, which allows choosing between:

* `EmbeddingModel.HUGGING_FACE_SENTENCE` (SBERT; see the sketch below)
* `EmbeddingModel.BERT_UNCASED_768` (bert-as-service)
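
For the SBERT option, the embedding step boils down to something like the sketch below (it assumes the `sentence-transformers` package; the model name is an illustrative 768-dimensional choice, not necessarily the one used in the project):

```python
from sentence_transformers import SentenceTransformer

# Any 768-dimensional sentence embedding model will do for this illustration.
model = SentenceTransformer("bert-base-nli-mean-tokens")

abstracts = ["First do it", "then do it right", "then do it better"]
embeddings = model.encode(abstracts)  # numpy array of shape (3, 768)
print(embeddings.shape)
```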

To generate the numpy and pickle files, use the following script: `src/create_gsi_files.py`.
This script produces two files:
@@ -130,7 +183,8 @@ Running the BERT search demo
===
There are two streamlit demos for running BERT search
for Solr and Elasticsearch. Each demo compares against BM25-based search.
The following assumes that you have bert-as-service up and running (if not, launch it with `bash start_bert_server.sh`)
and either Elasticsearch or Solr running with an index containing a field with embeddings.

To run a demo, execute the following on the command line from the project root:

Binary file added data/viz/nsw-graphs/image.png
5 changes: 2 additions & 3 deletions es_conf/opendistro_settings.json
@@ -1,12 +1,11 @@
{
  "settings": {
-   "number_of_shards": 1,
+   "number_of_shards": 2,
    "number_of_replicas": 0,
    "index": {
      "knn": true,
      "knn.space_type": "cosinesimil",
-     "refresh_interval": "-1",
-     "merge.scheduler.max_thread_count": 1
+     "refresh_interval": "60s"
    }
  },
  "mappings": {
Binary file removed img/bert.jpeg
Binary file removed img/solr.png
135 changes: 135 additions & 0 deletions pyproject.toml
@@ -0,0 +1,135 @@
[project]
dynamic = ["version"]
name = "bert-solr-search"
description = "Compute and search dense vectors using Apache Solr, Elasticsearch, OpenSearch and various vector search implementations"
keywords = ["vector search", "embeddings", "apache solr", "elasticsearch", "opensearch", "elastiknn"]

requires-python = ">= 3.8"

authors = [
{name = "Dmitry Kan"},
]
maintainers = [
{name = "Dmitry Kan"}
]

readme = "README.md"

license = {file = "LICENSE"}

dependencies = [
"absl-py==0.12.0",
"altair==4.1.0",
"appnope==0.1.2",
"argon2-cffi==20.1.0",
"astor==0.8.1",
"async-generator==1.10",
"attrs==21.2.0",
"backcall==0.2.0",
"base58==2.1.0",
"bert-serving-client==1.10.0",
"bert-serving-server==1.10.0",
"bleach==3.3.0",
"blinker==1.4",
"Brotli==1.0.9",
"cached-property==1.5.2",
"cachetools==4.2.2",
"certifi==2020.12.5",
"cffi==1.14.5",
"chardet==4.0.0",
"click==7.1.2",
"decorator==5.0.7",
"defusedxml==0.7.1",
"entrypoints==0.3",
"Flask==1.1.2",
"Flask-Compress==1.9.0",
"Flask-Cors==3.0.10",
"Flask-JSON==0.3.4",
"gast==0.2.2",
"gitdb==4.0.7",
"GitPython==3.1.14",
"google-pasta==0.2.0",
"GPUtil==1.4.0",
"grpcio==1.37.1",
"h5py==3.2.1",
"idna==2.10",
"importlib-metadata==4.0.1",
"ipykernel==5.5.4",
"ipython==7.23.1",
"ipython-genutils==0.2.0",
"ipywidgets==7.6.3",
"itsdangerous==1.1.0",
"jedi==0.18.0",
"Jinja2==2.11.3",
"joblib==1.0.1",
"jsonschema==3.2.0",
"jupyter-client==6.1.12",
"jupyter-core==4.7.1",
"jupyterlab-pygments==0.1.2",
"jupyterlab-widgets==1.0.0",
"Keras-Applications==1.0.8",
"Keras-Preprocessing==1.1.2",
"Markdown==3.3.4",
"MarkupSafe==1.1.1",
"matplotlib-inline==0.1.2",
"mistune==0.8.4",
"nbclient==0.5.3",
"nbconvert==6.0.7",
"nbformat==5.1.3",
"nest-asyncio==1.5.1",
"notebook==6.3.0",
"numpy==1.18.5",
"opt-einsum==3.3.0",
"packaging==20.9",
"pandas==1.2.4",
"pandocfilters==1.4.3",
"parso==0.8.2",
"pexpect==4.8.0",
"pickleshare==0.7.5",
"Pillow==8.2.0",
"plotly==4.9.0",
"prometheus-client==0.10.1",
"prompt-toolkit==3.0.18",
"protobuf==3.16.0",
"ptyprocess==0.7.0",
"pyarrow==4.0.0",
"pycparser==2.20",
"pydeck==0.6.2",
"Pygments==2.9.0",
"pyparsing==2.4.7",
"pyrsistent==0.17.3",
"python-dateutil==2.8.1",
"pytz==2021.1",
"pyzmq==22.0.3",
"requests==2.25.1",
"retrying==1.3.3",
"scikit-learn==0.24.2",
"scipy==1.6.3",
"Send2Trash==1.5.0",
"six==1.16.0",
"sklearn==0.0",
"smmap==4.0.0",
"streamlit==0.81.1",
"tensorboard==1.15.0",
"tensorflow==1.15.0",
"tensorflow-estimator==1.15.1",
"termcolor==1.1.0",
"terminado==0.9.4",
"testpath==0.4.4",
"threadpoolctl==2.1.0",
"toml==0.10.2",
"toolz==0.11.1",
"tornado==6.1",
"traitlets==5.0.5",
"typing-extensions==3.10.0.0",
"tzlocal==2.1",
"urllib3==1.26.4",
"validators==0.18.2",
"wcwidth==0.2.5",
"webencodings==0.5.1",
"Werkzeug==1.0.1",
"widgetsnbextension==3.5.1",
"wrapt==1.12.1",
"zipp==3.4.1",

]
