as.data.frame() for textstat_simil drops zeros by default #10

michalovadek · 2020-11-09T15:15:27Z

While the textstat_simil output object contains all values, applying as.data.frame() to it drops every pair whose value is zero. It took me a moment to notice that this is where I had been losing observations, and I couldn't find the behaviour mentioned in the documentation. I understand that it helps reduce the often considerable size of the data frame, but in some cases the user may find it useful to keep all pairwise observations. I would therefore suggest adding an argument akin to as.data.frame(x, upper = TRUE).

kbenoit · 2020-11-10T17:33:13Z

You're right. I seem to remember a discussion of this issue when we revised the similarity computations, but we don't seem to have caught this.

Here's a rewrite of the as.data.frame() function for the similarity objects:

library("quanteda")
## Package version: 2.1.2.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

# example
dfmat <- dfm(c("a b c c", "c d d", "d d e"))
sim <- textstat_simil(dfmat, method = "cosine")
sim
## textstat_simil object; method = "cosine"
##       text1 text2 text3
## text1 1.000 0.365     0
## text2 0.365 1.000   0.8
## text3     0 0.800   1.0

as.data.frame(sim)
##   document1 document2    cosine
## 1     text1     text2 0.3651484
## 2     text2     text3 0.8000000

as.data.frame.textstat_proxy <- function(x, diag = FALSE, upper = FALSE) {
  # form pairs
  df <- as.data.frame(as.matrix(x))
  vals <- utils::stack(df)$values
  docs <- rownames(df)
  docpairs <- expand.grid(docs, docs)
  colnames(docpairs) <- c("document1", "document2")
  result <- data.frame(docpairs, vals)
  colnames(result)[3] <- x@method

  # handle diagonal
  if (!diag) {
    result <- result[result$document1 != result$document2, ]
  }

  # handle upper
  if (!upper) {
    result <- result[!duplicated(t(apply(result[, 1:2], 1, sort))), ]
  }

  result
}

as.data.frame(sim)
##   document1 document2    cosine
## 2     text2     text1 0.3651484
## 3     text3     text1 0.0000000
## 6     text3     text2 0.8000000

@koheiw is it worth adding an option to the existing function or even replacing it?

koheiw · 2020-11-10T22:06:20Z

I am a bit surprised to see zeros in sim. I thought the resulting matrices were sparse, but they are not. The most natural approach for me is to expose drop0 to users of textstat_simil() so that they can decide whether to keep zeros in the matrix and data.frame.

> require(quanteda)
> dfmat <- dfm(c("a b c c", "c d d", "d d e"))
> sim <- textstat_simil(dfmat, method = "cosine")
> sim@x
[1] 1.0000000 0.3651484 1.0000000 0.0000000 0.8000000 1.0000000
> 
> sim2 <-proxyC::simil(dfmat, method = "cosine", drop0 = TRUE)
> sim2@x
[1] 1.0000000 0.3651484 1.0000000 0.8000000 1.0000000
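
For readers unfamiliar with explicitly stored zeros, the difference between the two @x vectors can be reproduced with the Matrix package alone. This is only a minimal sketch: the matrix is hand-built from the values in the example as a stand-in for the textstat_simil result, and it is not created by quanteda or proxyC.

```r
library(Matrix)

# Hand-built stand-in for the upper triangle of the similarity matrix.
# The text1/text3 entry is stored explicitly even though its value is 0.
m <- sparseMatrix(
  i = c(1, 1, 2, 1, 2, 3),
  j = c(1, 2, 2, 3, 3, 3),
  x = c(1, 0.3651484, 1, 0, 0.8, 1),
  dims = c(3, 3)
)

stored_before <- length(m@x)         # the explicit zero counts as a stored entry
stored_after  <- length(drop0(m)@x)  # drop0() removes stored entries equal to 0

c(before = stored_before, after = stored_after)
```

Whether a zero survives therefore depends on whether something like drop0() has been applied along the way, not on the sparse class itself.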

kbenoit · 2020-11-11T07:38:28Z

Ah right, so we do have the zeroes, as I thought, but we lose them in the conversion to data.frame in:
https://github.com/quanteda/quanteda/blob/96478db9ae452d0354a2617f5de5b4f6bfb89b78/R/textstat_simil.R#L427

and to list in:
https://github.com/quanteda/quanteda/blob/96478db9ae452d0354a2617f5de5b4f6bfb89b78/R/textstat_simil.R#L395

I can't see any reason why we would want to drop the zeros for similarity when coercing to these objects. So I think the solution is to modify the as.list() and as.data.frame() methods to not obliterate the zeroes through the conversion to triplet.
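
One way to build the pairs without passing through a zero-dropping intermediate form is to index the dense matrix directly. The following is only a sketch on a hand-built base R matrix standing in for as.matrix(sim); pairs_keep_zeros() is a hypothetical helper name, not a quanteda function, and the real methods may well end up looking different.

```r
# Hand-built stand-in for as.matrix(sim) from the cosine example
docs <- paste0("text", 1:3)
m <- matrix(
  c(1.000, 0.365, 0.0,
    0.365, 1.000, 0.8,
    0.000, 0.800, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs, docs)
)

# Hypothetical helper: one row per unordered document pair, zeros included
pairs_keep_zeros <- function(m, name = "value") {
  # positions strictly above the diagonal, as (row, col) indices
  idx <- which(upper.tri(m), arr.ind = TRUE)
  out <- data.frame(
    document1 = rownames(m)[idx[, "row"]],
    document2 = colnames(m)[idx[, "col"]],
    stringsAsFactors = FALSE
  )
  out[[name]] <- m[idx]  # matrix indexing picks one value per (row, col) pair
  out
}

res <- pairs_keep_zeros(m, "cosine")
res
```

Because the values are read by position rather than from the stored non-zero entries, the text1/text3 pair is kept with a value of 0.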

kbenoit · 2020-11-11T07:43:05Z

Note that this affects zeroes in textstat_dist() as well, since it's a single function for both simil and dist variants:

library("quanteda")
## Package version: 2.1.2.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

dfmat <- dfm(c("c d d", "c d d", "d d e"))
dis <- textstat_dist(dfmat, method = "euclidean")
dis
## textstat_dist object; method = "euclidean"
##       text1 text2 text3
## text1     0     0  1.41
## text2     0     0  1.41
## text3  1.41  1.41     0
as.data.frame(dis)
##   document1 document2 euclidean
## 1     text1     text3  1.414214
## 2     text2     text3  1.414214

kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020
