MATTR calculation defaults to wrong window length if provided window length exceeds document length #60
Comments
Thanks for pointing this out, I'll fix it ASAP.
Hello, thanks for responding! I've thought about this some more, and the fix I suggested may also be inadequate. One could come across a case (as I have recently) with a very wide range of document lengths. In practical terms, the calculation could default to a window size of 1 or 2, which would render MATTR and MSTTR meaningless as well. I wonder if it would make sense to write it so that any document with fewer tokens than the window width simply doesn't get a MATTR/MSTTR, rather than one based on a window of the minimum document length. I would appreciate your thoughts, and I'd be happy to assist if possible. Thanks for a great set of tools.
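To illustrate why a very small window makes the measure meaningless, here is a minimal base-R sketch. `mattr_sketch` is a hypothetical helper written for this illustration, not quanteda's implementation, and it assumes the document is already a plain character vector of tokens.

```r
# Hypothetical helper (not part of quanteda): MATTR over a character
# vector of tokens, computed as the mean TTR of every full window.
mattr_sketch <- function(tokens, window) {
  n <- length(tokens)
  # No full window fits: there is nothing to average.
  if (window < 1 || n < window) return(NA_real_)
  starts <- seq_len(n - window + 1)
  ttrs <- vapply(starts, function(i) {
    w <- tokens[i:(i + window - 1)]
    length(unique(w)) / window  # type-token ratio of this window
  }, numeric(1))
  mean(ttrs)
}

words <- c("shrimp", "stew", "shrimp", "salad", "shrimp", "and",
           "potatoes", "shrimp", "burger", "shrimp", "sandwich")

mattr_sketch(words, 1)  # 1: a one-token window always has TTR 1
mattr_sketch(words, 5)  # below 1: repeated tokens now fall inside a window
```

With a window of 1 every document scores exactly 1 regardless of its content, so the value carries no information about lexical diversity.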
That's a good idea: set a minimum document length below which a document has an NA returned for a moving average measure.
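A rough sketch of that idea in base R, under the assumption that each document is a character vector of tokens and that the minimum length is simply the window width. `mattr_or_na` is a hypothetical name used only for this illustration and is not a quanteda function.

```r
# Hypothetical helper (not part of quanteda): per-document MATTR that
# keeps the requested window fixed and returns NA for short documents.
mattr_or_na <- function(docs, window = 100L) {
  stopifnot(window >= 1)
  vapply(docs, function(tokens) {
    n <- length(tokens)
    # Too short for even one full window: report NA instead of
    # shrinking the window for the whole corpus.
    if (n < window) return(NA_real_)
    ttrs <- vapply(seq_len(n - window + 1L), function(i) {
      length(unique(tokens[i:(i + window - 1L)])) / window
    }, numeric(1))
    mean(ttrs)
  }, numeric(1))
}

docs <- list(
  text1 = c("fish", "sticks"),
  text2 = c("you", "can", "barbecue", "it", "boil", "it",
            "broil", "it", "bake", "it")
)

mattr_or_na(docs, window = 5L)  # text1 is NA; text2 gets a real value
```

The design choice here is that one short document never changes the window used for the others, so scores stay comparable across the corpus.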
I thought I would share how I ended up doing it for my project, in case it's helpful. In the function that calculates MATTR, I simply check whether the dfm is empty and return NA if it is.

    library(quanteda)  # provides ntoken(), tokens_ngrams(), tokens(), dfm()

    compute_mattr <- function(x, MATTR_window = 100L, min_window = 5L) {
        if (MATTR_window < 1)
            stop("MATTR_window must be positive")
        # fall back to min_window if any document is shorter than the window
        if (any(ntoken(x) < MATTR_window)) {
            MATTR_window <- min_window
            warning("MATTR_window exceeds some documents' token lengths, resetting to minimum window size: ",
                    min_window, call. = FALSE)
        }
        if (any(ntoken(x) < min_window)) {
            warning("min_window exceeds some documents' token lengths, these documents will return NA",
                    call. = FALSE)
        }
        # each n-gram is one moving window of MATTR_window tokens
        x <- tokens_ngrams(x, n = MATTR_window, concatenator = " ")
        # check whether the dfm is empty and return NA, else go on as previously
        check_dfm <- function(y) {
            txdfm <- dfm(tokens(y))
            if (sum(txdfm) == 0) return(NA)
            quanteda.textstats::textstat_lexdiv(txdfm, "TTR")[["TTR"]]
        }
        temp <- lapply(as.list(x), check_dfm)
        # average the per-window TTRs within each document
        result <- unlist(lapply(temp, mean))
        return(result)
    }

    txt <- c("fish sticks",
             "Anyway, like I was sayin', shrimp is the fruit of the sea. You can
             barbecue it, boil it, broil it, bake it, saute it.",
             "There's shrimp-kabobs,
             shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's
             pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup,
             shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp
             sandwich.")
    toks <- tokens(txt)
    compute_mattr(toks, MATTR_window = 35, min_window = 5)
    #>     text1     text2     text3
    #>        NA 0.9057471 0.8574074
    #> Warning messages:
    #> 1: MATTR_window exceeds some documents' token lengths, resetting to minimum window size: 5
    #> 2: min_window exceeds some documents' token lengths, these documents will return NA

I worried that these checks would slow the function down on large corpora, but in my (limited) tests it seems fine. The other alternative is to allow ... Just thought I'd put this here in case it's helpful.
Hello,

Apologies if my issue is based on a misunderstanding.

When I use `textstat_lexdiv` to calculate MATTR and a document is shorter than the `MATTR_window` specified as an argument to the function, the function throws an error. This is because the function (`compute_mattr`) checks for this case and resets the `MATTR_window` value to the longest document in the corpus. Using this value in the `tokens_ngrams` function down the line creates a list with empty entries, which trips up the calculation of the TTR and causes the error.

I believe the window should be set to the shortest document in the corpus. As MATTR is calculated by averaging the TTRs of a moving window across the document, it seems reasonable for that window to be the length of the shortest document. An alternative would be rewriting it so it returns `NA` for the documents that are too short to calculate this value.

Reproducible Example
More Details
I'm including the original function below, with suggested fix in comments
Again, if I've misunderstood any conceptual issue (which may well be the case, as the same process is applied to MSTTR), apologies; I'm new to these text diversity measures. If not, I'm happy to do a pull request if that saves you some time!
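The empty entries described in this report can be pictured with a small base-R stand-in for n-gram formation. `ngrams_sketch` is a hypothetical helper for illustration only; it does not reproduce quanteda's `tokens_ngrams` internals, only the general point that a document shorter than `n` has no window to form.

```r
# Hypothetical base-R stand-in for n-gram formation (not quanteda code).
ngrams_sketch <- function(tokens, n) {
  n_windows <- length(tokens) - n + 1L
  # A document shorter than n has no complete window at all.
  if (n_windows < 1L) return(character(0))
  vapply(seq_len(n_windows), function(i) {
    paste(tokens[i:(i + n - 1L)], collapse = " ")
  }, character(1))
}

ngrams_sketch(c("fish", "sticks"), n = 5L)               # character(0)
ngrams_sketch(c("shrimp", "is", "the", "fruit"), n = 3L)  # two 3-grams
```

Any TTR computed from the first result has nothing to work with, which is why the short documents need either a smaller window or an explicit `NA`.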