Ahmet A. Akın edited this page Feb 16, 2016 · 36 revisions

Welcome to the zemberek-nlp wiki!

FAQ

What is the state of the project?

I am no longer working on the project regularly (nor is my brother Mehmet). However, I will continue to add features and fix bugs, without a fixed timeline.

What is the difference between Zemberek2 and Zemberek-nlp?

Zemberek2 is no longer maintained. Zemberek-nlp was developed almost from scratch and shares almost no code with Zemberek2. Zemberek2 had several shortcomings: overly strict parsing, incompatible formatting, a weak dictionary, complex code, slow performance, and no disambiguation. We aimed to fix all of these issues with Zemberek-nlp and succeeded with some of them. Unfortunately, the complexity of the code remains a major issue and has hindered development.

Why does disambiguation not work well?

The current disambiguation mechanism uses an HMM-based system with two language models. However, the models are trained on a relatively small and noisy corpus, so disambiguation performance is sub-par. We tested a separate Perceptron-based system that works much better, but we have not incorporated it into Zemberek. Even if we do, performance will probably not exceed 80%. We had some ideas about automatically generating training sets and targeting the most ambiguous words, but we could not work on them.
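
The idea of picking one analysis per word with an HMM can be illustrated with a toy Viterbi search. This is only a minimal sketch under stated assumptions, not Zemberek's actual implementation: the class name ToyDisambiguator, the coarse tags, the single bigram table (the real system uses two language models) and all scores are hypothetical.

```java
import java.util.*;

// Hypothetical toy class for illustration; not part of Zemberek.
class ToyDisambiguator {

    // Assumed bigram log-probabilities between analysis tags, keyed as "prev next".
    private final Map<String, Double> bigramLogProb = new HashMap<>();
    // Score used for tag pairs that were never seen in training.
    private final double unseenLogProb;

    ToyDisambiguator(double unseenLogProb) {
        this.unseenLogProb = unseenLogProb;
    }

    void addBigram(String prev, String next, double logProb) {
        bigramLogProb.put(prev + " " + next, logProb);
    }

    private double score(String prev, String next) {
        return bigramLogProb.getOrDefault(prev + " " + next, unseenLogProb);
    }

    // Viterbi search: chooses one candidate per word so that the sum of
    // bigram scores along the sentence is maximal.
    // Assumes every word has at least one candidate analysis.
    List<String> best(List<List<String>> candidates) {
        int n = candidates.size();
        if (n == 0) {
            return new ArrayList<>();
        }
        double[][] best = new double[n][];
        int[][] back = new int[n][];
        for (int i = 0; i < n; i++) {
            List<String> current = candidates.get(i);
            best[i] = new double[current.size()];
            back[i] = new int[current.size()];
            for (int j = 0; j < current.size(); j++) {
                if (i == 0) {
                    // The first word is scored against the sentence-start marker.
                    best[i][j] = score("<s>", current.get(j));
                    back[i][j] = -1;
                    continue;
                }
                List<String> previous = candidates.get(i - 1);
                best[i][j] = Double.NEGATIVE_INFINITY;
                for (int k = 0; k < previous.size(); k++) {
                    double s = best[i - 1][k] + score(previous.get(k), current.get(j));
                    if (s > best[i][j]) {
                        best[i][j] = s;
                        back[i][j] = k;
                    }
                }
            }
        }
        // Pick the best final candidate, then follow the back pointers.
        int idx = 0;
        for (int j = 1; j < best[n - 1].length; j++) {
            if (best[n - 1][j] > best[n - 1][idx]) {
                idx = j;
            }
        }
        String[] path = new String[n];
        for (int i = n - 1; i >= 0; i--) {
            path[i] = candidates.get(i).get(idx);
            idx = back[i][idx];
        }
        return Arrays.asList(path);
    }
}
```

For example, with the assumed scores (<s>, Pron) = -0.5, (Pron, Noun) = -2.0 and (Pron, Verb) = -1.0, the candidate lists [Pron] and [Noun, Verb] resolve to [Pron, Verb]. With models trained on a small, noisy corpus, many tag pairs fall back to the unseen score, which is one way to see why such a system performs poorly.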

Can I use it as a stemmer or lemmatizer?

Yes, it is trivial to access stems and lemmas from the parse result. However, correct stemming requires good disambiguation.

Can I add a new dictionary item programmatically?

Yes. https://github.com/ahmetaa/turkish-nlp-examples/blob/master/src/main/java/morphology/AddNewDictionaryItem.java

Can I generate words?

Yes. https://github.com/ahmetaa/turkish-nlp-examples/blob/master/src/main/java/morphology/ChangeStem.java

Why is the code in English?

The Zemberek2 code was written entirely in Turkish. That was one of the things that made it attractive to newcomers. However, we wanted Zemberek-nlp to be used by the global NLP community and academia, so we used English in the code. It did not work out that way, but we are sticking to that decision.

What about LibreOffice or Lucene-Solr extensions?

We do not currently plan to write extensions for external applications. However, it is trivial to write a Turkish stemmer or lemmatizer. For LibreOffice, updating tr-spell with a really large corpus is perhaps a better idea.

Is morphological parsing overrated?

For some NLP tasks, I believe so. When you have a lot of data, the importance of morphological parsing accuracy diminishes. Often, tools like Zemberek are only used for lemmatization and stemming. Sometimes no deterministic morphology tool is required at all. For example, recent work on NER for Turkish shows that systems can achieve excellent results without any advanced morphological parsing. Statistical unsupervised morpheme tools can work quite well for speech recognition systems. However, the sparsity of Turkish is still a problem, and advanced morphology is still used in some tasks such as machine translation, dependency parsing and morphological language models. That said, some see the recent advances in neural networks as the dawn of unsupervised methods in which tools like deterministic morphological parsers have little importance.

Why don't you use an FST?

Most Turkish morphological parsing tools use an FST (finite-state transducer). Oflazer, Sak and Coltekin use this approach. An FST greatly simplifies the parser and is very fast. However, we did not go that route because:

  • Good FST tools were not available for Java.
  • Some FST tools were too low-level.
  • You cannot modify the search graph at run-time if you use an FST tool.

Instead, we created the graph programmatically. However, our design turned out to be complex and inadequate for some exceptional cases.
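
To make the run-time modification point concrete, here is a minimal sketch of a search graph that is built in code and can be extended after construction. It is not Zemberek's design: the class ToySuffixGraph, the state names and the fixed surface strings are hypothetical, and the sketch deliberately ignores vowel harmony and other phonological rules.

```java
import java.util.*;

// Hypothetical toy class for illustration; not part of Zemberek.
class ToySuffixGraph {

    // An edge consumes a fixed surface string and moves to a target state.
    private static final class Edge {
        final String surface;
        final String target;

        Edge(String surface, String target) {
            this.surface = surface;
            this.target = target;
        }
    }

    private final Map<String, List<Edge>> edges = new HashMap<>();
    private final Set<String> accepting = new HashSet<>();

    // Edges can be added at any time, including after parsing has started.
    void addEdge(String from, String surface, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(surface, to));
    }

    void markAccepting(String state) {
        accepting.add(state);
    }

    // Returns every state path from ROOT that consumes the whole word
    // and ends in an accepting state.
    List<String> parse(String word) {
        List<String> results = new ArrayList<>();
        walk("ROOT", word, "ROOT", results);
        return results;
    }

    private void walk(String state, String rest, String path, List<String> results) {
        if (rest.isEmpty() && accepting.contains(state)) {
            results.add(path);
        }
        for (Edge e : edges.getOrDefault(state, Collections.emptyList())) {
            // Edges with an empty surface are skipped so the search always terminates.
            if (!e.surface.isEmpty() && rest.startsWith(e.surface)) {
                walk(e.target, rest.substring(e.surface.length()), path + "+" + e.target, results);
            }
        }
    }
}
```

In this sketch, a graph that knows only the stem "ev" and the suffix "ler" returns no result for "kitaplar". After adding an edge for the stem "kitap" and one for the suffix "lar" at run-time, the same graph instance parses it as ROOT+Noun+Plural.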

What are the alternatives?

Many tools are available for Turkish NLP. Some of them are:

  • Kemal Oflazer's command line parser.
  • Haşim Sak's morphological parser and disambiguator.
  • Tr-morph project.
  • ITU Turkish NLP pipeline.
  • Deniz Yüret's deasciifier and disambiguator.
  • Odtü-Sabancı Tree-bank.
  • Ali Ok's parser.
  • Weka and Open-NLP tools can be trained for Turkish.