Add PMI to textstat_collocation? #14

koheiw · 2018-12-16T22:28:14Z

A post on SO brings me back to my old idea of adding PMI to textstat_collocation(). It is not as good as lambda, but it is very fast to run. PMI would be computed as

PMI = log( P("a b c") / (P("a") * P("b") * P("c")) )

where P("x") is the probability of "x" in the corpus.
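
A minimal sketch of that computation in Python, assuming plain maximum-likelihood estimates from a tokenized corpus. The function name `pmi` and the toy tokens are illustrative only and are not part of quanteda. The sketch works on the log scale, as is conventional for PMI.

```python
import math
from collections import Counter

def pmi(tokens, ngram):
    """Log ratio of the n-gram probability to the product of its unigram probabilities."""
    ngram = tuple(ngram)
    n = len(ngram)
    unigrams = Counter(tokens)
    # Count every contiguous n-token window in the corpus.
    windows = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if windows[ngram] == 0:
        return float("-inf")  # unseen sequence: no evidence of association
    p_ngram = windows[ngram] / sum(windows.values())
    # Probability of the sequence if the words were independent.
    p_indep = math.prod(unigrams[w] / len(tokens) for w in ngram)
    return math.log(p_ngram / p_indep)

# Hypothetical toy corpus, for illustration only.
tokens = "new york city is big and new york city is loud".split()
score = pmi(tokens, ("new", "york", "city"))
```

Only two counting passes are needed, one for unigrams and one for n-token windows, which is why this is cheap compared with fitting a log-linear model.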


kbenoit · 2018-12-17T07:14:19Z

Yes, we have that in https://github.com/kbenoit/quanteda.collocationsdev, where the idea is to use it for comparison in our paper on collocations (still under development). We had this in but took it out while we prove that the log-linear approach is superior. Once we work that out (soon, I hope!) we should definitely consider restoring some of the other measures.

This is all standard stuff, e.g. https://nlp.stanford.edu/fsnlp/promo/colloc.pdf. However, this does not work for sizes > 2, since in the PMI example the marginal probabilities (in the denominator) need to account for P(a, b), P(b, c), and P(a, c) as well. That's our angle with lambda.
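
One way to read the size > 2 problem, sketched in Python with made-up probabilities (all values below are hypothetical, and this is not the lambda statistic): if "a b" is a fixed pair and "c" is unrelated to it, a baseline built from unigram probabilities alone still gives "a b c" a high score, while a baseline that accounts for P(a, b) gives zero.

```python
import math

# Hypothetical probabilities: "a b" is a fixed pair, "c" is unrelated to it.
p_a, p_b, p_c = 0.01, 0.01, 0.1
p_ab = 0.01          # "b" always follows "a"
p_abc = p_ab * p_c   # "c" follows "a b" only as often as chance predicts

# Baseline that assumes all three words are independent.
pmi_unigram = math.log(p_abc / (p_a * p_b * p_c))

# Baseline that already accounts for the "a b" pair.
pmi_given_pair = math.log(p_abc / (p_ab * p_c))
```

Here `pmi_unigram` equals log(100) even though the third word contributes nothing, so all of the apparent three-word association is inherited from the pair.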

Suggest we keep this separate as is and kick ourselves (myself) to flesh out the collocations paper, where we can sort all this out (with Jouni's input of course, as planned).

koheiw self-assigned this Dec 16, 2018

kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020
