A search algorithm that compares a query sentence against a preset database of emails and returns the ones closest in meaning.
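Under the hood, both search modes embed the query and every email as dense vectors and rank emails by cosine similarity. The sketch below illustrates that core idea with plain numpy; the vectors would come from the BERT or GloVe encoders described later, and the function name here is hypothetical.

```python
import numpy as np

def rank_by_cosine(query_vec, email_matrix, top_n=4):
    """Return the indices of the top_n emails most similar to the query.

    query_vec    -- 1-D array: the embedded query sentence
    email_matrix -- 2-D array: one embedded email per row
    """
    # Normalize so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = email_matrix / np.linalg.norm(email_matrix, axis=1, keepdims=True)
    scores = m @ q
    # Best matches first
    return np.argsort(scores)[::-1][:top_n]
```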
- dataset is a global variable that holds all the data the model will use, so update it first:
dataset = set_dataset(<your_dataset>)
Note that the dataset variable should be an array of strings.
- Now, load and preprocess the data by running:
unprep, prep = load_dataset_and_preprocess(dataset)
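For illustration only, a minimal setup could look like this; the example emails are placeholders, and set_dataset / load_dataset_and_preprocess are the project functions shown above.

```python
# Placeholder emails -- replace with your own array of strings
dataset = set_dataset([
    "The meeting is moved to 3pm tomorrow, please confirm.",
    "Your invoice for March is attached to this email.",
    "Are we still on for lunch on Friday?",
])
unprep, prep = load_dataset_and_preprocess(dataset)
```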
- For BERT queries, you will have to run the following (repeat with prep in place of unprep to build prep_index and prep_indx_to_email for the pre-processed variants):
unprep_index = build_bert_embeddings_index(unprep)
unprep_indx_to_email = build_reference(unprep)
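As a rough sketch of what the BERT indexing step amounts to: the bert-embedding package returns per-token vectors, which can be mean-pooled into one vector per email. This is illustrative only, not necessarily the project's exact internals.

```python
import numpy as np
from bert_embedding import BertEmbedding

def embed_emails_with_bert(emails):
    """Illustrative: one vector per email via mean-pooled token embeddings."""
    bert = BertEmbedding()
    # Each result is a (tokens, list-of-token-vectors) pair
    results = bert(emails)
    return np.vstack([np.mean(vectors, axis=0) for _, vectors in results])
```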
- For GloVe queries, you will have to run the following (repeat with unprep in place of prep to build unprep_embeddings for the normal-data variants):
embeddings_dict = init_glove_embeddings()
prep_embeddings, prep_sentence = build_glove_index(prep)
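On the GloVe side, a sentence is typically reduced to a single vector by averaging its word vectors. A minimal sketch, assuming pre-trained vectors in the standard glove.6B.50d.txt text format (the file path is an assumption):

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Parse the standard GloVe text format: a word followed by floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

def sentence_vector(sentence, embeddings, dim=50):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```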
- Whenever the dataset changes, you will have to rerun the entire pipeline, since all the embedded data is pre-processed and stored to files (see the caching sketch below).
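The caching involved might look roughly like the following; the file name and structure are assumptions for illustration, not the project's exact layout.

```python
import os
import pickle

CACHE_FILE = "embeddings.pkl"  # hypothetical cache file name

def cached_embeddings(emails, embed_fn):
    """Load embeddings from disk if cached, otherwise compute and store them."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    embeddings = embed_fn(emails)
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```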
- To execute a query, call the appropriate search function, depending on whether you want the search based on GloVe embeddings or BERT embeddings. Both give accurate results, so the choice is left to the user (see the usage sketch after this list).
- BERT Query (Normal data):
bert_search(query, unprep_index, unprep, unprep_indx_to_email, top_n=4, is_preprocess=False)
- BERT Query (Pre-processed data):
bert_search(query, prep_index, prep, prep_indx_to_email, top_n=4, is_preprocess=True)
- GloVe Query (Normal data):
glove_search(query, unprep_embeddings, unprep, unprep_indx_to_email, top_n=4, is_preprocess=False)
- GloVe Query (Pre-processed data):
glove_search(query, prep_embeddings, prep, prep_indx_to_email, top_n=4, is_preprocess=True)
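Putting it together, a query session might look like the following. The query string is an example, and the calls use the signatures documented above; what the functions return (e.g. the matching emails or their indices) is defined by the project.

```python
query = "When is the next team meeting?"  # example query

# BERT search over the raw (non-pre-processed) emails
bert_results = bert_search(query, unprep_index, unprep, unprep_indx_to_email,
                           top_n=4, is_preprocess=False)

# GloVe search over the pre-processed emails
glove_results = glove_search(query, prep_embeddings, prep, prep_indx_to_email,
                             top_n=4, is_preprocess=True)
```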
These libraries are installed each time the program is run. It is also recommended to enable the GPU on your Google Colab instance to speed up the pre-processing.
- datasets
- We use the datasets library to fetch our dataset. If you have a custom dataset, you can omit this.
- nltk
- numpy
- pickle
- mxnet (Optional)
- It is recommended to use a GPU to speed up the pre-processing of your dataset; you can skip this if you wish.
- Note that the CUDA build of mxnet is specific to NVIDIA graphics cards.
- gluonnlp (Optional; it is a dependency for mxnet)
- bert-embedding
- re (regex)
- pdoc3 (for documentation)
- Fork the repository to your own account
- Clone the repository to your local system and make any changes you wish:
git clone https://github.com/<your_username>/Search-Engine
- For each new feature, create a new branch named after the feature:
git checkout -b <feature_name>
- Comment the code properly wherever needed
- While coding, follow the style rules enforced by flake8 for cleaner and better code quality.
- You can install flake8 on your system with:
pip install flake8
- You can run it as:
flake8 path/to/your/code
- Push the changes to your fork
- Create a pull request
- Resolve conflicts as required
- datasets
pip install datasets
- nltk
pip install nltk
- numpy
pip install numpy
- pickle (part of the Python standard library; pickle5 backports pickle protocol 5 to Python < 3.8)
pip install pickle5
- mxnet (the cu101 build below targets CUDA 10.1; use plain mxnet for CPU-only machines)
pip install mxnet-cu101
- gluonnlp
pip install gluonnlp
- bert-embedding
pip install bert-embedding --no-deps
- re (regex)
pip install regex