Search Engine 🔍

A search engine that performs text retrieval operations on a large collection of 8.8 million documents, available here. The project is split into two main stages:

Document Indexing, which builds the data structures and mechanisms required for efficient retrieval

Query Execution, which uses those data structures and the queries provided by a user to retrieve the most relevant documents in the collection

Performance 🚀

The following plots show the performance of the Search Engine on both conjunctive and disjunctive queries, using TFIDF as the scoring function and specific parameter configurations:

CONJUNCTIVE	DISJUNCTIVE
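
To make the two query modes concrete, here is a minimal in-memory sketch. It is not the project's implementation: the class and method names are hypothetical, the index is a plain map rather than an on-disk structure, and the TFIDF variant used, (1 + log10(tf)) * log10(N / df), is one common choice assumed for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy sketch of conjunctive vs. disjunctive TFIDF retrieval (hypothetical names). */
final class QueryModesSketch {

    /** Assumed TFIDF variant: (1 + log10(tf)) * log10(N / df). */
    static double tfidf(int tf, int df, int numDocs) {
        if (tf <= 0 || df <= 0) return 0.0;
        return (1.0 + Math.log10(tf)) * Math.log10((double) numDocs / df);
    }

    /**
     * index maps term -> (docId -> term frequency).
     * Returns docId -> accumulated score for the documents that match the query.
     */
    static Map<Integer, Double> score(Map<String, Map<Integer, Integer>> index,
                                      List<String> queryTerms, int numDocs, boolean conjunctive) {
        Map<Integer, Double> scores = new TreeMap<>();
        Map<Integer, Integer> matched = new HashMap<>(); // docId -> number of query terms found
        Set<String> distinct = new LinkedHashSet<>(queryTerms);
        for (String term : distinct) {
            Map<Integer, Integer> postings = index.getOrDefault(term, Map.of());
            for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                double w = tfidf(p.getValue(), postings.size(), numDocs);
                scores.merge(p.getKey(), w, Double::sum);
                matched.merge(p.getKey(), 1, Integer::sum);
            }
        }
        if (conjunctive) {
            // Conjunctive: keep only documents that contain every query term.
            scores.keySet().removeIf(doc -> matched.get(doc) < distinct.size());
        }
        return scores;
    }
}
```

With a disjunctive query every document containing at least one term is scored, so the candidate set is larger than for the conjunctive query, which only keeps documents containing all terms.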

Project Structure and Modules 📁

The Search Engine is composed of the following main modules:

Common, which contains bean classes and managers used by other modules

Preprocessing, which cleans, tokenizes, and stems document and query text and removes stopwords

Indexing, which indexes the collection, saves the main data structures to disk, and merges them

Query processing, which processes queries using different Document Processors and Scoring Functions
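
As an illustration of what the Preprocessing module does, the sketch below lowercases text, strips punctuation, tokenizes on whitespace, and removes stopwords. The class name, the cleaning regex, and the stopword set are assumptions made for this example. Stemming is left out because a real stemmer (for example a Porter stemmer) would need an external library or a longer implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical preprocessing sketch: cleaning, tokenization, stopword removal. */
final class PreprocessSketch {

    static List<String> preprocess(String text, Set<String> stopwords) {
        // Lowercase and replace anything that is not a letter, digit, or whitespace.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
        return Arrays.stream(cleaned.trim().split("\\s+"))
                .filter(tok -> !tok.isEmpty())           // empty input yields one empty token
                .filter(tok -> !stopwords.contains(tok)) // drop stopwords
                .collect(Collectors.toList());
    }
}
```

The same function would be applied to both documents at indexing time and queries at search time, so that terms match.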

How to configure and compile modules

Indexing module

The Indexing module can be configured through the config.properties file, which supports the following properties:

Option	Description
stopwords	Choose the stopword list to use for removal
preprocessing.remove.stopwords	Enable or disable stopword removal
preprocessing.enable.stemming	Enable stemming
invertedIndex.useCompression	Enable compression of docids and frequencies
memory.threshold	Set the memory threshold above which the current block is written to disk
skipblocks.maxLen	Set the maximum length of a Skip Block
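
The following sketch shows one way such properties could be read with java.util.Properties. The holder class, the value types, and every default value are assumptions made for illustration. Only the property keys come from the table above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the indexing options; not the project's actual class. */
final class IndexerConfigSketch {
    final boolean removeStopwords;
    final boolean enableStemming;
    final boolean useCompression;
    final double memoryThreshold; // unit is not stated in the README; treated as a plain number
    final int skipBlockMaxLen;

    private IndexerConfigSketch(Properties p) {
        // All defaults below are placeholders chosen for this sketch.
        removeStopwords = Boolean.parseBoolean(p.getProperty("preprocessing.remove.stopwords", "true"));
        enableStemming = Boolean.parseBoolean(p.getProperty("preprocessing.enable.stemming", "true"));
        useCompression = Boolean.parseBoolean(p.getProperty("invertedIndex.useCompression", "true"));
        memoryThreshold = Double.parseDouble(p.getProperty("memory.threshold", "80"));
        skipBlockMaxLen = Integer.parseInt(p.getProperty("skipblocks.maxLen", "512"));
    }

    /** Parses the text of a .properties file. */
    static IndexerConfigSketch parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new IndexerConfigSketch(p);
    }
}
```

For example, `IndexerConfigSketch.parse("invertedIndex.useCompression=false\nskipblocks.maxLen=128\n")` would disable compression and set the skip block length to 128, leaving the other options at the sketch's placeholder defaults.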

Query processing module

Configuration properties

The Query processing module can be configured through the config.properties file, which supports the following properties.

General properties:

Setting	Description
query.parameters.nResults	Set the number of documents to be retrieved from the corpus

Document processor and scoring function specific properties:

Setting	Description
scoring.MaxScore.threshold	Set MaxScore threshold
scoring.BM25.k1	Set parameter k1 for BM25 scoring function
scoring.BM25.B	Set parameter B for BM25 scoring function
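
To show where k1 and B enter the computation, here is a sketch of one common form of the BM25 term weight. The exact formula used by the project is not given in this README, so the variant below (log10 IDF, no (k1 + 1) factor in the numerator) and all names are assumptions.

```java
/** Hypothetical BM25 sketch; the project's exact variant may differ. */
final class Bm25Sketch {

    /**
     * Weight of one term in one document:
     *   log10(N / df) * tf / (k1 * ((1 - b) + b * docLen / avgDocLen) + tf)
     * k1 controls term-frequency saturation, b controls length normalization.
     */
    static double score(int tf, int df, int numDocs,
                        int docLen, double avgDocLen, double k1, double b) {
        if (tf <= 0 || df <= 0) return 0.0;
        double idf = Math.log10((double) numDocs / df);
        double norm = k1 * ((1.0 - b) + b * docLen / avgDocLen);
        return idf * tf / (norm + tf);
    }
}
```

With b = 0 the document length has no effect, while with b = 1 the term frequency is fully normalized by the ratio of the document length to the average length. Larger k1 values make the score saturate more slowly as the term frequency grows.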

Performance properties:

Setting	Description
performance.iterators.useCache	Enable cache for Skip Blocks inside an iterator
performance.iterators.cache.size	Set the cache size for Skip Blocks inside an iterator
performance.iterators.useThreads	Enable threads for Skip Blocks inside an iterator
performance.iterators.threads.howMany	Set the number of threads for Skip Blocks inside an iterator
performance.iteratorFactory.cache.enabled	Enable cache for Posting List Iterators
performance.iteratorFactory.cache.size	Set the cache size for Posting List Iterators
performance.iteratorFactory.threads.enabled	Enable threads for Posting List Iterators
performance.iteratorFactory.threads.howMany	Set the number of threads for Posting List Iterators
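
The cache size properties suggest a bounded cache. The README does not say which eviction policy is used, so the sketch below assumes least-recently-used (LRU) eviction and builds it on java.util.LinkedHashMap. The class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache with LRU eviction (the eviction policy is an assumption). */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order follows recency of access
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evict the least recently used entry when over capacity.
        return size() > capacity;
    }
}
```

A cache like this could hold, for example, recently read skip blocks keyed by their position, with the capacity taken from the corresponding cache.size property. It is not thread-safe, so it would need external synchronization if threads are enabled.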

Execution Path

Run the code from the /java folder.

Specify the absolute path of the collection to be indexed in config.properties, under the data.collection.path entry.

Compiling properties

Alternatively, the Query processing module accepts command-line options that override the properties in the config.properties file. The available options are the following:

Option	Description
--results	Set the number of documents to be returned by the query
--scoring	Choose the scoring function: TFIDF or BM25
--queryType	Choose the query type: disjunctive or conjunctive
--processingType	Choose the document processor: TAAT, DAAT, or MaxScore
--stopWords	Enable stopword removal
--wordStemming	Enable word stemming
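
The override behaviour can be sketched as follows. The README does not specify how option values are passed, so this example assumes "--name value" pairs. The class, the method names, and the fallback default of 10 are hypothetical. Only the --results option and the query.parameters.nResults key are taken from the tables above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch of command-line options overriding config.properties values. */
final class OptionOverrideSketch {

    /** Parses "--name value" pairs; this pair syntax is an assumption. */
    static Map<String, String> parseOptions(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Options must come in name/value pairs");
        }
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected an option, got: " + args[i]);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    /** The command-line value wins; otherwise fall back to the properties file. */
    static int resolveResults(Properties config, Map<String, String> opts) {
        String fromFile = config.getProperty("query.parameters.nResults", "10"); // 10 is a placeholder
        return Integer.parseInt(opts.getOrDefault("results", fromFile));
    }
}
```

With query.parameters.nResults=20 in the file, passing `--results 5` would yield 5, while omitting the option would yield 20.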
