as.data.frame() for textstat_simil drops zeros by default #10

Open
michalovadek opened this issue Nov 9, 2020 · 4 comments

@michalovadek

While the textstat_simil output object contains all values, applying as.data.frame() drops all zeros. It took me a moment to notice that this is where I had been losing observations, and I couldn't find the behaviour mentioned in the documentation. I understand that it helps reduce the often considerable size of the data frame, but in some cases the user may find it useful to keep all pairwise observations. I would therefore suggest adding an argument along the lines of as.data.frame(x, upper = TRUE).

@kbenoit
Contributor

kbenoit commented Nov 10, 2020

You're right. I seem to remember a discussion of this issue when we revised the similarity computations, but we don't seem to have caught this.

Here's a rewrite of the as.data.frame() function for the similarity objects:

library("quanteda")
## Package version: 2.1.2.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

# example
dfmat <- dfm(c("a b c c", "c d d", "d d e"))
sim <- textstat_simil(dfmat, method = "cosine")
sim
## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.365     0
## text2 0.365 1.000   0.8
## text3     0 0.800   1.0

as.data.frame(sim)
##   document1 document2    cosine
## 1     text1     text2 0.3651484
## 2     text2     text3 0.8000000

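as.data.frame.textstat_proxy <- function(x, diag = FALSE, upper = FALSE, ...) {
  # `...` is accepted only for consistency with the as.data.frame() generic

  # form all document pairs: stack() reads the matrix column by column,
  # which matches expand.grid() varying document1 fastest
  df <- as.data.frame(as.matrix(x))
  vals <- utils::stack(df)$values
  docs <- rownames(df)
  docpairs <- expand.grid(docs, docs)
  colnames(docpairs) <- c("document1", "document2")
  result <- data.frame(docpairs, vals)
  colnames(result)[3] <- x@method

  # drop the diagonal (each document paired with itself) unless requested
  if (!diag) {
    result <- result[result$document1 != result$document2, ]
  }

  # keep each unordered pair only once unless both triangles are requested
  if (!upper) {
    result <- result[!duplicated(t(apply(result[, 1:2], 1, sort))), ]
  }

  result
}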

as.data.frame(sim)
##   document1 document2    cosine
## 2     text2     text1 0.3651484
## 3     text3     text1 0.0000000
## 6     text3     text2 0.8000000

@koheiw is it worth adding an option to the existing function or even replacing it?

@koheiw
Collaborator

koheiw commented Nov 10, 2020

I am a bit surprised to see zeros in sim. I thought the resulting matrices were sparse, but they are not. The most natural approach for me is to expose drop0 to users of textstat_simil() so that they can decide whether to keep zeros in the matrix and the data.frame.

> require(quanteda)
> dfmat <- dfm(c("a b c c", "c d d", "d d e"))
> sim <- textstat_simil(dfmat, method = "cosine")
> sim@x
[1] 1.0000000 0.3651484 1.0000000 0.0000000 0.8000000 1.0000000
> 
> sim2 <- proxyC::simil(dfmat, method = "cosine", drop0 = TRUE)
> sim2@x
[1] 1.0000000 0.3651484 1.0000000 0.8000000 1.0000000
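
The difference between a stored zero and a dropped one can be sketched with the Matrix package directly. This is only an illustration: the matrix below is built by hand from the upper-triangle values shown above, and whether proxyC's drop0 argument relies on Matrix::drop0() internally is an assumption, not something confirmed in this thread.

```r
library(Matrix)

# Hand-built sparse matrix holding the upper triangle of the cosine example.
# The 0 for the (text1, text3) pair is stored explicitly in the x slot.
m <- sparseMatrix(
  i = c(1, 1, 2, 1, 2, 3),
  j = c(1, 2, 2, 3, 3, 3),
  x = c(1, 0.3651484, 1, 0, 0.8, 1),
  dims = c(3, 3)
)

# Number of stored values before and after removing explicit zeros
n_stored <- length(m@x)
n_after_drop <- length(drop0(m)@x)

n_stored
n_after_drop
```

Once the explicit zero is removed, nothing in the stored values distinguishes "similarity is exactly zero" from "pair not computed", which is why any later conversion that walks only the stored entries loses that pair.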

@kbenoit
Contributor

kbenoit commented Nov 11, 2020

Ah right, so we do have the zeroes, as I thought, but we lose them in the conversion to data.frame in:
https://github.com/quanteda/quanteda/blob/96478db9ae452d0354a2617f5de5b4f6bfb89b78/R/textstat_simil.R#L427

and to list in:
https://github.com/quanteda/quanteda/blob/96478db9ae452d0354a2617f5de5b4f6bfb89b78/R/textstat_simil.R#L395

I can't see any reason why we would want to drop the zeros for similarity when coercing to these objects. So I think the solution is to modify the as.list() and as.data.frame() methods so that they do not obliterate the zeroes in the conversion to triplet.
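
To make the failure mode concrete, here is a rough base R sketch of the two strategies. It is not the code in textstat_simil.R: pairs_nonzero() and pairs_all() are hypothetical helpers written for this illustration, and the matrix is typed in by hand from the cosine example earlier in the thread.

```r
# Toy similarity matrix using the cosine values from the example above
docs <- c("text1", "text2", "text3")
m <- matrix(c(1,         0.3651484, 0,
              0.3651484, 1,         0.8,
              0,         0.8,       1),
            nrow = 3, byrow = TRUE, dimnames = list(docs, docs))

# Shared step: turn a logical cell mask into a long data frame of pairs
mask_to_pairs <- function(m, mask) {
  idx <- which(mask, arr.ind = TRUE)
  data.frame(document1 = rownames(m)[idx[, "row"]],
             document2 = colnames(m)[idx[, "col"]],
             value = m[idx],
             stringsAsFactors = FALSE)
}

# Hypothetical helper: triplet-style conversion, records nonzero cells only
pairs_nonzero <- function(m) mask_to_pairs(m, m != 0 & lower.tri(m))

# Hypothetical helper: enumerates every lower-triangle cell, zero or not
pairs_all <- function(m) mask_to_pairs(m, lower.tri(m))

pairs_nonzero(m)
pairs_all(m)
```

The only difference between the two helpers is the `m != 0` condition in the mask, which is where the (text3, text1) pair disappears.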

@kbenoit
Contributor

kbenoit commented Nov 11, 2020

Note that this affects zeroes in textstat_dist() as well, since it's a single function for both simil and dist variants:

library("quanteda")

dfmat <- dfm(c("c d d", "c d d", "d d e"))
dis <- textstat_dist(dfmat, method = "euclidean")
dis
## textstat_dist object; method = "euclidean"
##       text1 text2 text3
## text1     0     0  1.41
## text2     0     0  1.41
## text3  1.41  1.41     0
as.data.frame(dis)
##   document1 document2 euclidean
## 1     text1     text3  1.414214
## 2     text2     text3  1.414214
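
For distances the dropped zero is arguably the most informative pair, since it marks identical documents. A small base R sketch of that point, assuming hand-typed feature counts for the three texts above and using stats::dist() rather than textstat_dist():

```r
# Feature counts for "c d d", "c d d", "d d e" (typed by hand, columns c, d, e)
counts <- matrix(c(1, 2, 0,
                   1, 2, 0,
                   0, 2, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("text1", "text2", "text3"), c("c", "d", "e")))

# Full Euclidean distance matrix
d <- as.matrix(dist(counts, method = "euclidean"))

# Enumerate every lower-triangle pair, including zero distances
idx <- which(lower.tri(d), arr.ind = TRUE)
pairs <- data.frame(document1 = rownames(d)[idx[, "row"]],
                    document2 = colnames(d)[idx[, "col"]],
                    euclidean = d[idx],
                    stringsAsFactors = FALSE)

# The zero-distance rows are exactly the identical documents
identical_docs <- pairs[pairs$euclidean == 0, ]

pairs
identical_docs
```

With the zeros kept, the text1/text2 pair is available for something like duplicate detection; with the current coercion it is silently absent.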

@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020