kawadi (કવાડિ in Gujarati; "axe" in English) is a versatile tool that serves as a weapon and is used to cut, shape and split wood.
kawadi is a collection of small tools that I find myself using often. Currently it contains a text search that looks for a string inside another string.
Text search in kawadi uses a sliding-window technique to search for a word or phrase in a larger text. The step size of the sliding window is 1, and the window size is the number of tokens in the word/phrase of interest.
For example, suppose the text we want to search in is "The big brown fox jumped over the lazy dog" and the phrase we want to find is "brown fox":

```
text = "The big brown fox jumped over the lazy dog"
interested_word = "brown fox"
window_size = len(interested_word.split())  # len(["brown", "fox"]) -> 2
slides = sliding_window(text, window_size)
# -> ['The', 'big'], ['big', 'brown'], ['brown', 'fox'], ['fox', 'jumped'],
#    ['jumped', 'over'], ['over', 'the'], ['the', 'lazy'], ['lazy', 'dog']

for slide in slides:
    if score(" ".join(slide), interested_word) >= threshold:
        select(slide)
```
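The sliding-window step above can be sketched as a small generator (a minimal illustration, not kawadi's actual implementation):

```python
def sliding_window(text: str, window_size: int, step: int = 1):
    """Yield token windows of `window_size`, moving `step` tokens at a time."""
    tokens = text.split()
    for i in range(0, len(tokens) - window_size + 1, step):
        yield tokens[i : i + window_size]

text = "The big brown fox jumped over the lazy dog"
print(list(sliding_window(text, 2)))
# -> [['The', 'big'], ['big', 'brown'], ['brown', 'fox'], ..., ['lazy', 'dog']]
```

With a step size of 1, a text of nine tokens produces eight two-token windows.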
Currently, three similarity scores are calculated and averaged to produce the final score: Cosine, Jaro-Winkler and normalized Levenshtein similarities.
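The averaging of similarity scores can be sketched as below. This is a simplified illustration, not kawadi's code: it averages only a token-level cosine similarity and a normalized Levenshtein similarity, omitting Jaro-Winkler for brevity.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between token-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_levenshtein_sim(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def combined_score(a: str, b: str) -> float:
    # kawadi also averages in Jaro-Winkler; omitted here for brevity.
    return (cosine_sim(a, b) + normalized_levenshtein_sim(a, b)) / 2
```

Identical strings score 1.0; completely unrelated strings score near 0.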
- Creating labeled datasets for Named Entity Recognition.
- Quick search-and-replace in large amounts of text.
Regular expressions are tricky; it's possible to make them dynamic, but that is very difficult, so for the use case I had in mind regular expressions were not a viable option. I tried creating a dataset with regular expressions for a data science project I was working on, but it became really complex very quickly, so it was not scalable. Also, as a novice, the time I needed to write these regular expressions was much too high.
However, I would like to point out that this project is only viable for small- or medium-scale dataset creation; for big data I would use something like Elasticsearch instead.
```python
from kawadi.text_search import SearchInText

search = SearchInText()

text_to_find = "String distance algorithm"
text_to_search = """SIFT4 is a general purpose string distance algorithm inspired by JaroWinkler and
Longest Common Subsequence. It was developed to produce a distance measure that matches as close as
possible to the human perception of string distance. Hence it takes into account elements like character
substitution, character distance, longest common subsequence etc. It was developed using experimental testing,
and without theoretical background."""

result = search.find_in_text(text_to_find, text_to_search)
print(result)
```
```python
[
    {
        "sim_score": 1.0,
        "searched_text": "string distance algorithm",
        "to_find": "string distance algorithm",
        "start": 27,
        "end": 52,
    }
]
```
If the text that needs to be searched is big, `SearchInText` can utilize multiprocessing to speed up the search.
```python
from kawadi.text_search import SearchInText

search = SearchInText(multiprocessing=True, max_workers=8)
```
It's often the case that the provided string similarity score is not enough for your use case. For this very case, you can add your own score calculation.
```python
from kawadi.text_search import SearchInText

def my_custom_func(**kwargs) -> float:
    slide_of_text: str = kwargs["slide_of_text"]
    text_to_find: str = kwargs["text_to_find"]
    # Here you can do preprocessing if you like,
    # or use character-based n-gram string matching scores.
    score: float = ...  # your score calculation
    return score

search = SearchInText(search_threshold=0.9, custom_score_func=my_custom_func)
```
This custom score function will have access to two things: `slide_of_text` for every slide in the text (from the example above, "The big", "big brown" and so on) and `text_to_find`.
Note: The return type of this custom function should be the same as the type of `search_threshold`, as in the example above.
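As a concrete example, a hypothetical custom scorer could wrap Python's built-in `difflib.SequenceMatcher` ratio (this scorer is an illustration, not part of kawadi; it only assumes the `slide_of_text`/`text_to_find` keyword interface shown above):

```python
from difflib import SequenceMatcher

def ratio_score(**kwargs) -> float:
    """Hypothetical custom scorer using difflib's Ratcliff/Obershelp ratio."""
    slide_of_text: str = kwargs["slide_of_text"]
    text_to_find: str = kwargs["text_to_find"]
    # Lowercase both sides so the match is case-insensitive.
    return SequenceMatcher(None, slide_of_text.lower(), text_to_find.lower()).ratio()
```

It returns a float between 0.0 and 1.0, so it can be paired with a float `search_threshold`.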
Stable Release: `pip install kawadi`

Development Head: `pip install git+https://github.com/jdvala/kawadi.git`
See CONTRIBUTING.md for information related to developing the code.
Free software: MIT license