This package computes finite-sample tail bounds for the likelihood ratio test (LRT) statistic under multinomial sampling. These tail bounds yield conservative p-values and critical values.
They are useful for inference when the sample size is comparable to, or even smaller than, the alphabet size, a regime in which the standard chi-squared asymptotics (Wilks' theorem) may not hold.
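For example, tailProbBound turns an observed LRT statistic into a conservative p-value, and criticalValue inverts the bound at a given level. A minimal sketch, assuming placeholder numbers; the single-sample call to tailProbBound here mirrors the two-sample call used at the end of this page:

```r
library(multChernoff)

k <- 30  # alphabet size (placeholder)
n <- 20  # sample size, here smaller than the alphabet (placeholder)

# conservative p-value: an upper bound on P(LRT statistic >= 40)
p.value <- tailProbBound(40, k, n)

# conservative critical value t.crit with P(LRT statistic > t.crit) <= 0.05
t.crit <- criticalValue(k, n, p = 0.05)
```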
You can install multChernoff from GitHub with:
devtools::install_github("richardkwo/multChernoff")
In the following examples, we use the finite-sample critical value computed by criticalValue to construct convex confidence regions for the underlying probability vector. We use the package CVXR to formulate and solve the convex programs. CVXR can be installed with:
install.packages("CVXR")
Naturalist Alexander Steven Corbet spent two years trapping butterflies in Peninsular Malaysia. Here are the data he collected:
Times captured (j) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of species (n_j) | 118 | 74 | 44 | 24 | 29 | 22 | 20 | 19 | 20 | 15 | 12 | 14 | 6 | 12 | 6 |
Here we ask: what percentage of butterflies in Malaya belonged to species that Corbet had not seen? That is, we want to estimate the total proportion of butterflies from the unseen species. Based on the sample, the MLE of this proportion is clearly zero. Instead, we seek an upper bound that holds with 95% confidence.
Let k = 435 + 1, where 435 is the number of species observed by Corbet and the extra cell pools all unseen species. The corresponding observed distribution is the distribution from the table above, with a zero entry appended at the end. In the following program, we maximize the last entry of the underlying distribution subject to a bound on the LRT statistic, which equals twice the sample size times the Kullback-Leibler divergence KL(observed distribution || true distribution).
See the paper at the bottom of this page for details.
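In symbols, writing $\hat p$ for the observed distribution and $t_\alpha$ for the critical value, the program below is a restatement of

$$
\begin{aligned}
\max_{p \in \mathbb{R}^k} \quad & p_k \\
\text{subject to} \quad & p \ge 0, \quad \sum_{i=1}^{k} p_i = 1, \\
& 2n \sum_{i=1}^{k} \hat p_i \log\!\left(\frac{\hat p_i}{p_i}\right) \le t_\alpha,
\end{aligned}
$$

with the convention $0 \log 0 = 0$. The last constraint is convex in $p$ because $-\log$ is convex.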
library(multChernoff)
library(CVXR)
# Corbet butterfly data
# https://en.wikipedia.org/wiki/Unseen_species_problem
corbet.butterfly <- data.frame(j=1:15, n_j=c(118,74,44,24,29,22,20,19,20,15,12,14,6,12,6))
# one count per observed species: a species captured j times contributes an entry j
n.butterfly <- rep(corbet.butterfly$j, corbet.butterfly$n_j)
alpha <- 0.05
n.observed <- c(n.butterfly, 0) # append a zero count for the pooled unseen species
n <- sum(n.observed)
k <- length(n.observed)
p.observed <- n.observed / n
# critical value
t.alpha <- criticalValue(k, n, p=alpha, verbose = TRUE)
cat(sprintf("critical value = %f\n", t.alpha))
We get the printout critical value = 962.402970. This value is then used in the following convex program.
# convex program
p <- Variable(k)
obj <- p[k]
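# constraints: p lies on the probability simplex, plus the LRT/KL tail bound;
# by the convention 0 * log 0 = 0, the zero cell of p.observed contributes
# nothing to the KL term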
constr <- list(p >= 0,
               sum(p) == 1,
               2 * n * sum(p.observed * (log(p.observed) - log(p))) <= t.alpha)
prob <- Problem(Maximize(obj), constr)
result <- solve(prob, verbose=FALSE) # you should try other solvers
# result
print(result$status)
unseen <- result$value
p.maximizer <- c(result$getValue(p))
cat(sprintf("unseen <= %f\n", unseen))
With the MOSEK solver, we find that, with 95% confidence, the proportion of butterflies from unseen species is at most 21.1%.
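As an optional sanity check (a sketch reusing the objects created above), the KL constraint should be essentially active at the maximizer:

```r
# evaluate the LRT statistic at the maximizer; it should be close to t.alpha
pos <- p.observed > 0  # skip the zero cell (0 * log 0 = 0)
stat <- 2 * n * sum(p.observed[pos] * log(p.observed[pos] / p.maximizer[pos]))
cat(sprintf("2n * KL at maximizer = %f vs critical value = %f\n", stat, t.alpha))
```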
Consider the following data from the Lipid Research Coronary Primary Prevention Trial.
 | X=0, Y=0 | X=0, Y=1 | X=1, Y=0 | X=1, Y=1 |
---|---|---|---|---|
Z=0 (placebo group) | 158 | 14 | 0 | 0 |
Z=1 (treatment group) | 52 | 12 | 23 | 78 |
Here X is the actual treatment received and Y is the outcome. The binary instrumental variable (IV) model imposes constraints on the underlying probabilities P(X, Y | Z=0) and P(X, Y | Z=1), which can be expressed as the generalized instrumental inequalities. At the same time, these probabilities lie in a convex confidence region characterized by the LRT critical value; here the LRT is performed on two independent multinomial samples (the Z=0 arm and the Z=1 arm), and a critical value of this kind can be computed with criticalValue(c(k1, k2), c(n1, n2), p=0.05). The true P(X, Y | Z=0) and P(X, Y | Z=1) are then constrained to the intersection of these two regions; if the intersection is empty, the IV model is rejected.
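In symbols, with $\hat p_z$ the empirical distribution and $n_z$ the sample size of the $Z = z$ arm, the confidence region consists of all pairs $(p_0, p_1)$ of distributions satisfying

$$
2 n_0\, \mathrm{KL}(\hat p_0 \,\|\, p_0) + 2 n_1\, \mathrm{KL}(\hat p_1 \,\|\, p_1) \le t_\alpha,
$$

which is exactly the KL <= t.alpha constraint in the script below.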
Further, this intersected region implies upper and lower bounds on the average causal effect (ACE) through the Balke-Pearl bounds. The following script carries out this analysis.
library(CVXR)
library(multChernoff)
# each row is a level of Z
data <- matrix(c(158,14,0,0,52,12,23,78), nrow=2, byrow = TRUE)
n.z <- apply(data, 1, sum)
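# dividing the 2 x 4 matrix by the length-2 vector n.z scales row z by n.z[z]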
p.empirical <- data / n.z
# get critical value ----
alpha <- 0.05
t.alpha <- criticalValue(rep(4, 2), n.z, p=alpha, verbose = TRUE)
cat(sprintf("critical value = %f\n", t.alpha))
# make objective -----
p <- Variable(2, 4)
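# g(x, y): CVXR expression for the Balke-Pearl-type upper bound on the
# counterfactual probability P(Y_x = y); rows of p index Z, and columns index
# (X=0,Y=0), (X=0,Y=1), (X=1,Y=0), (X=1,Y=1)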
g <- function(x, y) {
  a <- rep(1, 4)
  b <- rep(0, 4)
  d <- rep(0, 4)
  if (x==0 && y==0) {
    a[2] <- 0
    b[1] <- b[3] <- 1
    d[1] <- d[4] <- 1
  } else if (x==0 && y==1) {
    a[1] <- 0
    b[2] <- b[3] <- 1
    d[2] <- d[4] <- 1
  } else if (x==1 && y==0) {
    a[4] <- 0
    b[1] <- b[3] <- 1
    d[2] <- d[3] <- 1
  } else {
    a[3] <- 0
    b[1] <- b[4] <- 1
    d[2] <- d[4] <- 1
  }
  A <- min_entries(p %*% a)
  H <- p %*% replicate(2, b)
  I <- p %*% replicate(2, d)
  B <- min_entries(H + t(I))
  return(min_elemwise(A, B))
}
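# g.empirical(x, y, p): numeric plug-in counterpart of g(), evaluated at a
# given probability matrix p (used below for the naive plug-in ACE bounds)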
g.empirical <- function(x, y, p) {
  a <- rep(1, 4)
  b <- rep(0, 4)
  d <- rep(0, 4)
  if (x==0 && y==0) {
    a[2] <- 0
    b[1] <- b[3] <- 1
    d[1] <- d[4] <- 1
  } else if (x==0 && y==1) {
    a[1] <- 0
    b[2] <- b[3] <- 1
    d[2] <- d[4] <- 1
  } else if (x==1 && y==0) {
    a[4] <- 0
    b[1] <- b[3] <- 1
    d[2] <- d[3] <- 1
  } else {
    a[3] <- 0
    b[1] <- b[4] <- 1
    d[2] <- d[4] <- 1
  }
  A <- min(p %*% a)
  H <- p %*% replicate(2, b)
  I <- p %*% replicate(2, d)
  B <- min(H + t(I))
  return(min(A, B))
}
# constraints ------
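# two-sample LRT statistic: sum over arms z of 2 * n_z * KL(empirical_z || p_z);
# sum_entries(..., axis=1) takes one row sum per level of Z, and zero cells
# contribute 0 * log 0 = 0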
KL <- sum(sum_entries(p.empirical * (log(p.empirical) - log(p)), axis=1) * 2 * n.z)
IV.fun.1 <- function(x) {
  if (x==0) {
    max_entries(p[,1]) + max_entries(p[,2])
  } else {
    max_entries(p[,3]) + max_entries(p[,4])
  }
}
IV.fun.2 <- function(x, y) {
  if (x==0 && y==0) {
    a <- max_entries(p[,1])
    b <- min_entries(p[,1] + p[,3])
    d <- min_entries(p[,1] + p[,4])
  } else if (x==0 && y==1) {
    a <- max_entries(p[,2])
    b <- min_entries(p[,2] + p[,4])
    d <- min_entries(p[,2] + p[,3])
  } else if (x==1 && y==0) {
    a <- max_entries(p[,3])
    b <- min_entries(p[,1] + p[,3])
    d <- min_entries(p[,2] + p[,3])
  } else {
    a <- max_entries(p[,4])
    b <- min_entries(p[,2] + p[,4])
    d <- min_entries(p[,1] + p[,4])
  }
  return(a - b - d)
}
IV.fun.3 <- function() {
  min_entries(p[,2] + p[,4]) + min_entries(p[,2] + p[,3]) +
    min_entries(p[,1] + p[,3]) + min_entries(p[,1] + p[,4])
}
constr <- list(p >= 0,
               sum_entries(p, axis=1) == 1,
               KL <= t.alpha,
               IV.fun.1(0) <= 1, IV.fun.1(1) <= 1,
               IV.fun.2(0,0) <= 0, IV.fun.2(0,1) <= 0, IV.fun.2(1,0) <= 0, IV.fun.2(1,1) <= 0,
               IV.fun.3() >= 1)
# ACE lower bound ----
obj <- 1 - g(1,0) - g(0,1)
prob <- Problem(Minimize(obj), constr)
result <- solve(prob)
print(result$status)
ACE.lb <- result$value
p.lb <- result$getValue(p)
# ACE upper bound ----
obj <- g(0,0) + g(1,1) - 1
prob <- Problem(Maximize(obj), constr)
result <- solve(prob, solver="ECOS")
print(result$status)
ACE.ub <- result$value
p.ub <- result$getValue(p)
cat(sprintf("\n%.3f <= ACE <= %.3f (naive plugin gives [%.3f, %.3f])\n",
            ACE.lb, ACE.ub,
            1 - g.empirical(1,0,p.empirical) - g.empirical(0,1,p.empirical),
            g.empirical(0,0,p.empirical) + g.empirical(1,1,p.empirical) - 1))
# p-value ------
prob.pval <- Problem(Minimize(KL), list(
  p >= 0,
  sum_entries(p, axis=1) == 1,
  IV.fun.1(0) <= 1, IV.fun.1(1) <= 1,
  IV.fun.2(0,0) <= 0, IV.fun.2(0,1) <= 0, IV.fun.2(1,0) <= 0, IV.fun.2(1,1) <= 0,
  IV.fun.3() >= 1
))
result.pval <- solve(prob.pval)
print(result.pval$status)
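# minimized LRT statistic; clip at zero to guard against tiny numerical error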
KL.ml <- max(result.pval$value, 0)
p.mle <- result.pval$getValue(p)
p.value <- tailProbBound(KL.ml, rep(4, 2), n.z, verbose = TRUE)
cat(sprintf("\np-value (conservative) = %g\n", p.value))
Running the script, we get:
0.151 <= ACE <= 0.907 (naive plugin gives [0.391, 0.779])
p-value (conservative) = 1
If you find this useful, consider citing the following article.
F. Richard Guo and Thomas S. Richardson, "Chernoff-Type Concentration of Empirical Probabilities in Relative Entropy," in IEEE Transactions on Information Theory, vol. 67, no. 1, pp. 549-558, Jan. 2021.
An open-access preprint is also available.