Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using our own aic and bic instead of GLM's versions. #13

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

codetalker7
Copy link
Member

This PR is to fix revert back to our own implementations of aic and bic functions (from the older code, till commit e66244f) instead of using the GLM versions of those functions. The formula for computing the AIC of a frequentist model is
$$\text{aic} = 2\cdot p - 2\cdot\text{loglikelihood}$$
where $$p$$ is the number of parameters of the model, and $$\text{loglikelihood}$$ is the log-likelihood of the model.

The formula for computing the BIC of a frequentist model is
$$\text{bic} = \log(n)\cdot p - 2\cdot \text{loglikelihood}$$
where $$n$$ is the number of datapoints.

Instead of $p$, GLM uses $k$, consumed degrees of freedom.

We have added a property ndims to the FrequentistRegression struct; ndims is a tuple containing the dimensions of the dataset on which the model was trained. So, ndims[1] is the number of points in the dataset, and ndims[2] is the number of parameters in the model.

In the older code (till commit e66244f), the number of parameters for the frequentist linear regression model was size(X, 2) + 1; for other models, it was just size(X, 2). We have kept this in mind and made the relevant changes in the file (for instance, look at src/frequentist/linear_regression.jl line 60).

dimensions of the dataset on which the model was trained.
versions, we give our own implementation). Also fixed the number of
parameters (`ndims[2]`) for frequentist linear regression.
@sourish-cmi
Copy link
Collaborator

This looks like a simple PR, but the test is failing. Any thought?

@ShouvikGhosh2048 @ayushpatnaikgit

@codetalker7
Copy link
Member Author

This PR is quite old; I believe the errors in the tests are old as well (for instance, predictors not being defined), and the condition a < b failing for Bayesian Linear Regression with Uniform prior. The first one has been resolved, but I believe the second one still exists for some choice of random seed.

@sourish-cmi
Copy link
Collaborator

  • predictors not being defined is being fixed.

  • the condition a < b failing for Bayesian Linear Regression with Uniform prior. In this error, I just checked. If someone chooses h>0 as an argument, then by construction a < b cannot happen Bayesian Linear Regression with Uniform prior.

It would be better to create fresh PR.

On a different thought:

Shall we make a PR to GLM? What do you think?

@mousum-github : shall we make a small PR to GLM?

Mousum da, The AIC and BIC of Julia-GLM do not match R's GLM. Hence we created our own code. But I think we should do a PR to GLM.

Will you be able to help @codetalker7 and @ShouvikGhosh2048 to develop this PR for GLM?

@sourish-cmi
Copy link
Collaborator

@codetalker7 @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github @ajaynshah

I thoroughly tested aic and bic of GLM. I believe the aic and bic of GLM is rolled back. If I call GLM's aic and bic - it throws the error.

Now only StatsBase.aic(model) and StatsBase.bic(model) is working, and the result exactly matches with that of R.

So I think we should not implement our own aic and bic. Rather we should integrate StatsBase.aic(model) and StatsBase.bic(model) with CRRao.

Here is the detailed test:

Linear Regression in R

> library(datasets)
> mod1 = lm(mpg~hp+disp,mtcars)
> AIC(mod1)
[1] 168.6186
> BIC(mod1)
[1] 174.4815

Linear Regression in Julia

using RDatasets, GLM, StatsBase
mtcars = dataset("datasets","mtcars");
mod1 = lm(@formula(MPG~HP+Disp),mtcars);
StatsBase.aic(mod1)
Julia> 168.61856413576285
StatsBase.bic(mod1)
Julia> 174.48150774696177

Clearly, aic and bic from StatsBase match exactly with R for linear regression.

Logistic Regression in R

> turnout=read.csv(file = "turnout.csv")
> mod2 = glm(Vote ~ Age + Race + Income,turnout,family=binomial(link="logit"))
> AIC(mod2)
[1] 2113.593
> BIC(mod2)
[1] 2135.997

Logistic Regression in Julia

turnout = dataset("Zelig", "turnout");
mod2 = glm(@formula(Vote ~ Age + Race + Income),turnout,Binomial(), LogitLink());
StatsBase.aic(mod2)
Julia> 2113.592978817039
StatsBase.bic(mod2)
Julia> 2135.9965886552072

Poisson Regression with R

> library(MASS)
> mod3 = glm(Days ~ Eth+Sex+Age+Lrn, quine, family = poisson(link = "log"))
> AIC(mod3)
[1] 2299.184
> BIC(mod3)
[1] 2320.069

Poisson Regression with Julia

quine = dataset("MASS", "quine")
mod3 = glm(@formula(Days ~ Eth+Sex+Age+Lrn), quine, Poisson(), LogLink());
StatsBase.aic(mod3)
Julia> 2299.1836302853603
StatsBase.bic(mod3)
Julia> 2320.0688766373187

Clearly, aic and bic from StatsBase match exactly with R for all linear, logistic and Poisson regression.

@ajaynshah
Copy link
Member

ajaynshah commented Sep 23, 2022 via email

@sourish-cmi
Copy link
Collaborator

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github

So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao" ?

@mousum-github
Copy link
Member

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github

So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao"?

For all regression type models, there is a base function for AIC and BIC. While calling the AIC or BIC, we need to pass the model. All the necessary values like log-likelihood, n, p etc are extracted from the given model and computed AIC, BIC.

@sourish-cmi
Copy link
Collaborator

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github
So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao"?

For all regression type models, there is a base function for AIC and BIC. While calling the AIC or BIC, we need to pass the model. All the necessary values like log-likelihood, n, p etc are extracted from the given model and computed AIC, BIC.

Yes - I think we can use it for Bayesian models as well. We have to ensure we return log-likelihood, n, p in our Bayesian models

@mousum-github
Copy link
Member

mousum-github commented Sep 23, 2022

Need to check whether the Bayesian model is derived from StatisticalModel or not

@sourish-cmi
Copy link
Collaborator

All Bayesian models are completely developed by ourselves in CRRao. So we can return logLikelihood, n and p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants