This is a project that provides some basic tools for Turkish NLP applications.
This project is in maintenance mode.
For most NLP tasks, Zemberek will probably not provide state-of-the-art results compared with modern NLP tools. However, it can still be useful for preprocessing Turkish text (tokenization, sentence segmentation, lemmatization, etc.) and for creating a baseline for some applications.
The morphological analysis and generation tool Zemberek2 is no longer maintained. Zemberek-NLP was developed almost from scratch and shares almost no code with Zemberek2. Zemberek2 had several shortcomings, such as overly strict parsing, incompatible formatting, a weak dictionary, complex code, slowness, and no disambiguation. Hopefully Zemberek-NLP addresses all of those issues.
Morphological analysis is used for finding the meaningful parts (morphemes) of a word, such as root words and suffixes. For example, "kalemlerimden" (from my pencils) is analyzed as follows in Zemberek:
[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl
[kalem:Noun] → lemma and root POS.
kalem:Noun → Stem. This may differ from the lemma.
ler:A3pl → Plural suffix `A3pl` with form `ler`
im:P1sg → First person singular possessive suffix `P1sg` with form `im`
den:Abl → Ablative suffix `Abl` (from) with form `den`
Usually the actual letters of a morpheme, such as `den` in the example above, are called the surface form, and the suffix name that represents it, such as `Abl`, is called the lexical form.
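The notation above is regular enough to be split apart with plain string handling. The following is a minimal sketch, not part of Zemberek: `Morpheme` and `parse_analysis` are hypothetical names, and the parser only assumes the `[lemma:POS] stem:POS+surface:Name+...` layout shown in the example. It does not treat derivation boundaries specially.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morpheme:
    surface: Optional[str]  # actual letters, e.g. "den"; None if the morpheme adds no letters
    lexical: str            # morpheme name, e.g. "Abl"


def parse_analysis(text: str) -> Tuple[str, str, List[Morpheme]]:
    """Split '[lemma:POS] stem:POS+surface:Name+...' into lemma, POS and morphemes."""
    head, _, body = text.partition("] ")
    lemma, _, pos = head.lstrip("[").partition(":")
    morphemes = []
    for part in body.split("+"):
        surface, sep, lexical = part.partition(":")
        if sep:
            morphemes.append(Morpheme(surface, lexical))
        else:
            # No colon: a morpheme such as A3sg that has no surface form.
            morphemes.append(Morpheme(None, part))
    return lemma, pos, morphemes


lemma, pos, morphemes = parse_analysis(
    "[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl"
)
# Joining the surface forms gives back the analyzed word.
surface_word = "".join(m.surface or "" for m in morphemes)
```

The first element of `morphemes` is the stem with its part of speech; the remaining elements are the suffixes in order.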
Finding morphemes computationally is not easy for Turkish. Such a system requires knowledge of complex phonetic rules and hand-crafted suffix sequence rules (morphotactics).
Many Turkish words are highly ambiguous. A single word can have 2 to 10 correct analyses with different stems and suffixes, depending on the context. For example, the word "yarın" can be interpreted as follows:
[yar:Noun] yar:Noun+A3sg+ın:Gen (cliff's)
[yar:Noun] yar:Noun+A3sg+ın:P2sg (your cliff)
[yarı:Noun] yarı:Noun+A3sg+n:P2sg (your half)
[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg (your half, root is adjective)
[yarın:Adv] yarın:Adv (during tomorrow)
[yarın:Noun,Time] yarın:Noun+A3sg (tomorrow) This is the most common.
[yarmak:Verb] yar:Verb+Imp+ın:A2pl (split!)
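To see how much a stemmer or lemmatizer would have to choose between, the analyses listed above can be grouped by lemma. This is only an illustrative sketch over the strings copied from the list; `lemma_of` is a hypothetical helper, not a Zemberek function.

```python
from collections import defaultdict

# The seven analyses of "yarın" listed above, copied as plain strings.
ANALYSES = [
    "[yar:Noun] yar:Noun+A3sg+ın:Gen",
    "[yar:Noun] yar:Noun+A3sg+ın:P2sg",
    "[yarı:Noun] yarı:Noun+A3sg+n:P2sg",
    "[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg",
    "[yarın:Adv] yarın:Adv",
    "[yarın:Noun,Time] yarın:Noun+A3sg",
    "[yarmak:Verb] yar:Verb+Imp+ın:A2pl",
]


def lemma_of(analysis: str) -> str:
    # The lemma is the text between "[" and the first ":".
    return analysis[1:analysis.index(":")]


by_lemma = defaultdict(list)
for analysis in ANALYSES:
    by_lemma[lemma_of(analysis)].append(analysis)

# Distinct lemmas in the order they first appear in the list.
candidate_lemmas = list(by_lemma)
```

A single surface word yields four different candidate lemmas here, so picking the right one depends on the surrounding words.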
To resolve ambiguity, a simple machine learning mechanism is trained with hand-tagged sentences. It uses context words and their analyses to determine the correct result.
Like all data-driven statistical systems, the disambiguation mechanism may produce wrong results. Moreover, morphological disambiguation is a hard problem: algorithms need a lot of training data to generate good models, and generating that data is a time-consuming process.
However, as of 0.14.0, the performance of the disambiguation mechanism is much improved. If possible, switch to 0.14.0 or a higher version. We expect to improve the quality in later versions by adding more data.
There are two ways to use Zemberek-NLP with Python. One is to access it natively from Python with jpype. @ozturkberkay's Git repository Zemberek-Python-Examples provides working Zemberek examples.
The second alternative is to use Zemberek-NLP's own gRPC server. For this you need the full Zemberek jar file, which you run with:
java -jar zemberek-full.jar StartGrpcServer
After that, you can access it with the provided Python files as explained here.
Yes, it is trivial to access stems and lemmas from the parse result. However, correct stemming requires good disambiguation.
After version 0.11.0, simple spelling functionality is available in the normalization module.
Currently Zemberek-NLP does not offer deasciifier functionality directly, but the TurkishMorphology class can be configured to ignore diacritics during analysis. Several applications available on the internet use Deniz Yuret's deasciifier algorithm.
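To illustrate what comparing words while ignoring diacritics means, here is a small sketch that folds Turkish letters to their closest ASCII counterparts. It is an assumption-laden illustration only: it is neither Zemberek's implementation nor the deasciifier algorithm (which works in the opposite direction and restores diacritics), and the function names are hypothetical.

```python
# Map Turkish-specific letters to their closest ASCII counterparts.
_ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(text: str) -> str:
    """Return the text with Turkish-specific letters replaced by ASCII ones."""
    return text.translate(_ASCII_MAP)


def same_ignoring_diacritics(a: str, b: str) -> bool:
    """Compare two words after folding both to ASCII."""
    return strip_diacritics(a) == strip_diacritics(b)


# An ASCII-typed word matches its correctly spelled form once both are folded.
matches = same_ignoring_diacritics("yarin", "yarın")
```

Folding in this way loses information, which is why restoring the correct diacritics is a separate and harder task.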
Yes. Use the lang-id module for this. There are also alternatives such as language-detector. Keep in mind that this module is meant for detecting the language of text with a reasonable character count (usually more than 20 characters). It is usually not suitable for detecting the language of individual words.
Zemberek2 code was written completely in Turkish. This was one of the points that made it attractive to newcomers. However, we wanted Zemberek-NLP to be used in the global NLP community and academia, and therefore used English in the code. It did not quite work out that way, but we are sticking to that decision.
But feel free to use Turkish in the issues section.
We do not have extensions for external applications for now, but it is easy to write a Turkish stemmer or lemmatizer. (There is already a Lucene-Solr Turkish Analysis project available that uses different NLP tools.)
There is a LibreOffice spell checker extension available.
It is possible in theory, but we have not tried it. The library is more suitable for server or desktop usage in its current state.
Most Turkish morphological parsing tools use an FST (finite state transducer). Oflazer, Sak and Çöltekin use this approach. An FST greatly simplifies the parser and is very fast. However, we did not go that route because:
- Good FST tools were not available for Java.
- Some FST tools were too low level.
- You cannot modify the search graph at run-time if you use an FST tool.
Zemberek takes a different approach and uses a graph that is created programmatically. It is slower, but building the graph in code gives more flexibility.
Yes. As long as you abide by the distribution requirements of the Apache 2.0 license, you can use Zemberek source or binaries even in closed-source commercial projects.
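The flexibility of a graph built in code can be shown with a toy example. The sketch below is not Zemberek's data structure: `SuffixGraph` and the state names are hypothetical, and only the morpheme names (`A3pl`, `P1sg`, `Abl`) come from the analysis example earlier on this page. It shows a suffix-sequence graph being extended at run time, which is the point made in the list above.

```python
from collections import defaultdict
from typing import Dict, List


class SuffixGraph:
    """A toy directed graph: states are nodes, morpheme names label the edges."""

    def __init__(self) -> None:
        # state -> {morpheme name: next state}
        self._edges: Dict[str, Dict[str, str]] = defaultdict(dict)

    def add_transition(self, source: str, morpheme: str, target: str) -> None:
        self._edges[source][morpheme] = target

    def accepts(self, start: str, morphemes: List[str]) -> bool:
        """Return True if the morpheme sequence can be walked from `start`."""
        state = start
        for morpheme in morphemes:
            targets = self._edges.get(state, {})
            if morpheme not in targets:
                return False
            state = targets[morpheme]
        return True


graph = SuffixGraph()
graph.add_transition("noun_root", "A3pl", "plural")
graph.add_transition("plural", "P1sg", "possessive")
graph.add_transition("possessive", "Abl", "case")

# There is no plural -> Abl edge yet, so this sequence is rejected.
before = graph.accepts("noun_root", ["A3pl", "Abl"])
# Extend the graph at run time, then try the same sequence again.
graph.add_transition("plural", "Abl", "case")
after = graph.accepts("noun_root", ["A3pl", "Abl"])
```

Here `before` is `False` and `after` is `True`: the same graph object accepts a new suffix sequence after one edge is added, without rebuilding anything.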
There are many tools for Turkish NLP available. Some are:
- Kemal Oflazer's command line parser
- Olcay Taner Yıldız's NLP Toolkit
- Haşim Sak's morphological parser and disambiguator.
- Çağrı Çöltekin's TRmorph.
- Ali Ok's trnltk-java
- ITU Turkish NLP pipeline
- TS Corpus provides a variety of Turkish linguistic corpora and online NLP tools.
- Harun Reşit Zafer's nuve
- Deniz Yuret's deasciifier and disambiguator.
- ODTÜ-Sabancı Treebank
- Weka, OpenNLP, NLTK, Stanford NLP and many recent neural-network-based tools (TensorFlow, PyTorch, etc.) can be trained for Turkish.
There are many books available on Turkish grammar. There is also slightly outdated documentation, written from the perspective of Zemberek developers, available here.
Zemberek means the mainspring of a watch in Turkish. Etymologically, it comes from the Persian word "zanbūrak زنبورك", meaning "little bee". Long ago @mdakin picked this word because it sounds funny/interesting.
They are Danger Mouse and Penfold from the animated series Danger Mouse (Tehlikeli Fare).