as.data.frame() for textstat_simil drops zeros by default #10
You're right. I seem to remember a discussion of this issue when we revised the similarity computations, but we don't seem to have caught this. Here's a rewrite of the as.data.frame() method:
library("quanteda")
## Package version: 2.1.2.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
# example
dfmat <- dfm(c("a b c c", "c d d", "d d e"))
sim <- textstat_simil(dfmat, method = "cosine")
sim
## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.365   0
## text2 0.365 1.000 0.8
## text3     0 0.800 1.0
as.data.frame(sim)
##   document1 document2    cosine
## 1     text1     text2 0.3651484
## 2     text2     text3 0.8000000
as.data.frame.textstat_proxy <- function(x, diag = FALSE, upper = FALSE) {
    # form pairs
    df <- as.data.frame(as.matrix(x))
    vals <- utils::stack(df)$values
    docs <- rownames(df)
    docpairs <- expand.grid(docs, docs)
    colnames(docpairs) <- c("document1", "document2")
    result <- data.frame(docpairs, vals)
    colnames(result)[3] <- x@method
    # handle diagonal
    if (!diag) {
        result <- result[result$document1 != result$document2, ]
    }
    # handle upper
    if (!upper) {
        result <- result[!duplicated(t(apply(result[, 1:2], 1, sort))), ]
    }
    result
}
as.data.frame(sim)
##   document1 document2    cosine
## 2     text2     text1 0.3651484
## 3     text3     text1 0.0000000
## 6     text3     text2 0.8000000
@koheiw is it worth adding an option to the existing function or even replacing it?
I am a bit surprised to see zero in
Ah right, so we do have the zeroes, as I thought, but we lose them in the conversion to data.frame in: and to list in: I can't see any reason why we would want to drop the zeros for similarity when coercing to these objects. So I think the solution is to modify the
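To make the effect concrete, here is a minimal base R sketch of how zeros can disappear in a conversion like this. It is not quanteda's actual code: it assumes the loss comes from enumerating only the non-zero cells, as a sparse triplet-style conversion would, and the names `sim_mat`, `sparse_df` and `dense_df` are made up for illustration.

```r
# Cosine similarities from the example above, as a plain dense matrix.
docs <- paste0("text", 1:3)
sim_mat <- matrix(
  c(1.0000000, 0.3651484, 0.0,
    0.3651484, 1.0000000, 0.8,
    0.0000000, 0.8000000, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(docs, docs)
)

# Helper (hypothetical): turn a set of (row, col) positions into a long data frame.
to_long <- function(mat, idx) {
  data.frame(
    document1 = colnames(mat)[idx[, "col"]],
    document2 = rownames(mat)[idx[, "row"]],
    cosine = mat[idx]
  )
}

# Sparse-style: only non-zero cells below the diagonal are listed,
# so the text1/text3 pair silently drops out.
sparse_df <- to_long(sim_mat, which(sim_mat != 0 & lower.tri(sim_mat), arr.ind = TRUE))

# Dense-style: every cell below the diagonal is listed by position,
# so the zero similarity is kept as an explicit row.
dense_df <- to_long(sim_mat, which(lower.tri(sim_mat), arr.ind = TRUE))

sparse_df
dense_df
```

The only difference between the two calls is whether the cell values take part in selecting the pairs, which is why the dense version is a safer default when a zero is itself an observation.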
Note that this affects zeroes in textstat_dist() too:
library("quanteda")
## Package version: 2.1.2.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
dfmat <- dfm(c("c d d", "c d d", "d d e"))
dis <- textstat_dist(dfmat, method = "euclidean")
dis
## textstat_dist object; method = "euclidean"
##       text1 text2 text3
## text1  0     0     1.41
## text2  0     0     1.41
## text3  1.41  1.41  0
as.data.frame(dis)
##   document1 document2 euclidean
## 1     text1     text3  1.414214
## 2     text2     text3  1.414214
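For distances the dropped rows are arguably the most informative ones, since a zero means two documents are identical. As a rough workaround sketch that does not depend on quanteda at all, the same numbers can be reproduced with base R's `dist()` and converted by position so that zero distances are kept. The `counts` matrix is typed in by hand from the three example documents, and `dist_df` is an illustrative name, not part of any package.

```r
# Term counts for "c d d", "c d d" and "d d e" (columns: c, d, e).
counts <- rbind(
  text1 = c(c = 1, d = 2, e = 0),
  text2 = c(c = 1, d = 2, e = 0),
  text3 = c(c = 0, d = 2, e = 1)
)

# Base R Euclidean distances, expanded to a full symmetric matrix.
dist_mat <- as.matrix(dist(counts, method = "euclidean"))

# Select pairs by position (below the diagonal), never by value,
# so the zero distance between the identical documents survives.
idx <- which(lower.tri(dist_mat), arr.ind = TRUE)
dist_df <- data.frame(
  document1 = colnames(dist_mat)[idx[, "col"]],
  document2 = rownames(dist_mat)[idx[, "row"]],
  euclidean = dist_mat[idx]
)
dist_df
```

This yields one row per unordered document pair, including the text1/text2 pair at distance zero.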
While the textstat_simil output object contains all values, applying as.data.frame() drops all zeros. It took me a moment to notice that this is where I have been dropping observations, and I couldn't find it mentioned in the documentation. I understand this behaviour helps reduce the often considerable size of the data frame, but in some cases the user may find it useful to keep all pairwise observations. I would therefore suggest including an argument akin to
as.data.frame(x, upper = TRUE)