Using our own `aic` and `bic` instead of GLM's versions. #13

codetalker7 · 2022-07-03T16:09:05Z

This PR is to fix revert back to our own implementations of aic and bic functions (from the older code, till commit e66244f) instead of using the GLM versions of those functions. The formula for computing the AIC of a frequentist model is
$$\text{aic} = 2\cdot p - 2\cdot\text{loglikelihood}$$
where $$p$$ is the number of parameters of the model, and $$\text{loglikelihood}$$ is the log-likelihood of the model.

The formula for computing the BIC of a frequentist model is
$$\text{bic} = \log(n)\cdot p - 2\cdot \text{loglikelihood}$$
where $$n$$ is the number of datapoints.

Instead of $p$, GLM uses $k$, consumed degrees of freedom.

We have added a property ndims to the FrequentistRegression struct; ndims is a tuple containing the dimensions of the dataset on which the model was trained. So, ndims[1] is the number of points in the dataset, and ndims[2] is the number of parameters in the model.

In the older code (till commit e66244f), the number of parameters for the frequentist linear regression model was size(X, 2) + 1; for other models, it was just size(X, 2). We have kept this in mind and made the relevant changes in the file (for instance, look at src/frequentist/linear_regression.jl line 60).

dimensions of the dataset on which the model was trained.

versions, we give our own implementation). Also fixed the number of parameters (`ndims[2]`) for frequentist linear regression.

sourish-cmi · 2022-09-18T06:59:01Z

This looks like a simple PR, but the test is failing. Any thought?

@ShouvikGhosh2048 @ayushpatnaikgit

codetalker7 · 2022-09-18T07:05:51Z

This PR is quite old; I believe the errors in the tests are old as well (for instance, predictors not being defined), and the condition a < b failing for Bayesian Linear Regression with Uniform prior. The first one has been resolved, but I believe the second one still exists for some choice of random seed.

sourish-cmi · 2022-09-19T02:32:53Z

predictors not being defined is being fixed.
the condition a < b failing for Bayesian Linear Regression with Uniform prior. In this error, I just checked. If someone chooses h>0 as an argument, then by construction a < b cannot happen Bayesian Linear Regression with Uniform prior.

It would be better to create fresh PR.

On a different thought:

Shall we make a PR to GLM? What do you think?

@mousum-github : shall we make a small PR to GLM?

Mousum da, The AIC and BIC of Julia-GLM do not match R's GLM. Hence we created our own code. But I think we should do a PR to GLM.

Will you be able to help @codetalker7 and @ShouvikGhosh2048 to develop this PR for GLM?

sourish-cmi · 2022-09-22T17:40:09Z

@codetalker7 @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github @ajaynshah

I thoroughly tested aic and bic of GLM. I believe the aic and bic of GLM is rolled back. If I call GLM's aic and bic - it throws the error.

Now only StatsBase.aic(model) and StatsBase.bic(model) is working, and the result exactly matches with that of R.

So I think we should not implement our own aic and bic. Rather we should integrate StatsBase.aic(model) and StatsBase.bic(model) with CRRao.

Here is the detailed test:

Linear Regression in R

> library(datasets)
> mod1 = lm(mpg~hp+disp,mtcars)
> AIC(mod1)
[1] 168.6186
> BIC(mod1)
[1] 174.4815

Linear Regression in Julia

using RDatasets, GLM, StatsBase
mtcars = dataset("datasets","mtcars");
mod1 = lm(@formula(MPG~HP+Disp),mtcars);
StatsBase.aic(mod1)
Julia> 168.61856413576285
StatsBase.bic(mod1)
Julia> 174.48150774696177

Clearly, aic and bic from StatsBase match exactly with R for linear regression.

Logistic Regression in R

> turnout=read.csv(file = "turnout.csv")
> mod2 = glm(Vote ~ Age + Race + Income,turnout,family=binomial(link="logit"))
> AIC(mod2)
[1] 2113.593
> BIC(mod2)
[1] 2135.997

Logistic Regression in Julia

turnout = dataset("Zelig", "turnout");
mod2 = glm(@formula(Vote ~ Age + Race + Income),turnout,Binomial(), LogitLink());
StatsBase.aic(mod2)
Julia> 2113.592978817039
StatsBase.bic(mod2)
Julia> 2135.9965886552072

Poisson Regression with R

> library(MASS)
> mod3 = glm(Days ~ Eth+Sex+Age+Lrn, quine, family = poisson(link = "log"))
> AIC(mod3)
[1] 2299.184
> BIC(mod3)
[1] 2320.069

Poisson Regression with Julia

quine = dataset("MASS", "quine")
mod3 = glm(@formula(Days ~ Eth+Sex+Age+Lrn), quine, Poisson(), LogLink());
StatsBase.aic(mod3)
Julia> 2299.1836302853603
StatsBase.bic(mod3)
Julia> 2320.0688766373187

Clearly, aic and bic from StatsBase match exactly with R for all linear, logistic and Poisson regression.

ajaynshah · 2022-09-23T02:33:29Z

Given that the formula for aic involves only the logl and n and k, it makes sense to place this at a highly general place.

…

On Thu, 22 Sept 2022 at 17:40, Sourish ***@***.***> wrote: @codetalker7 <https://github.com/codetalker7> @ShouvikGhosh2048 <https://github.com/ShouvikGhosh2048> @ayushpatnaikgit <https://github.com/ayushpatnaikgit> @mousum-github <https://github.com/mousum-github> @ajaynshah <https://github.com/ajaynshah> I thoroughly tested aic and bic of GLM. I believe the aic and bic of GLM is rolled back. If I call GLM's aic and bic - it throws the error. Now only StatsBase.aic(model) and StatsBase.bic(model) is working, and *the result exactly matches with that of R*. So I think we should not implement our own aic and bic. Rather we should integrate StatsBase.aic(model) and StatsBase.bic(model) with CRRao. Here is the detailed test: *Linear Regression in R* > library(datasets)> mod1 = lm(mpg~hp+disp,mtcars)> AIC(mod1) [1] 168.6186> BIC(mod1) [1] 174.4815 *Linear Regression in Julia* using RDatasets, GLM, StatsBase mtcars = dataset("datasets","mtcars"); mod1 = ***@***.***(MPG~HP+Disp),mtcars); StatsBase.aic(mod1) Julia> 168.61856413576285 StatsBase.bic(mod1) Julia> 174.48150774696177 Clearly, aic and bic from StatsBase match exactly with R for linear regression. *Logistic Regression in R* > turnout=read.csv(file = "turnout.csv")> mod2 = glm(Vote ~ Age + Race + Income,turnout,family=binomial(link="logit"))> AIC(mod2) [1] 2113.593> BIC(mod2) [1] 2135.997 *Logistic Regression in Julia* turnout = dataset("Zelig", "turnout"); mod2 = ***@***.***(Vote ~ Age + Race + Income),turnout,Binomial(), LogitLink()); StatsBase.aic(mod2) Julia> 2113.592978817039 StatsBase.bic(mod2) Julia> 2135.9965886552072 *Poisson Regression with R* > library(MASS)> mod3 = glm(Days ~ Eth+Sex+Age+Lrn, quine, family = poisson(link = "log"))> AIC(mod3) [1] 2299.184> BIC(mod3) [1] 2320.069 *Poisson Regression with Julia* quine = dataset("MASS", "quine") mod3 = ***@***.***(Days ~ Eth+Sex+Age+Lrn), quine, Poisson(), LogLink()); StatsBase.aic(mod3) Julia> 2299.1836302853603 StatsBase.bic(mod3) Julia> 2320.0688766373187 Clearly, aic and bic from StatsBase match exactly with R for all linear, logistic and Poisson regression. — Reply to this email directly, view it on GitHub <#13 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACO2TJAIRDDNTSOFFAI2FK3V7SKYJANCNFSM52Q5AYQA> . You are receiving this because you were mentioned.Message ID: <xKDR/CRRao. ***@***.***>

-- Ajay Shah ***@***.*** http://www.mayin.org/ajayshah

sourish-cmi · 2022-09-23T05:11:02Z

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github

So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao" ?

mousum-github · 2022-09-23T06:06:41Z

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github

So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao"?

For all regression type models, there is a base function for AIC and BIC. While calling the AIC or BIC, we need to pass the model. All the necessary values like log-likelihood, n, p etc are extracted from the given model and computed AIC, BIC.

sourish-cmi · 2022-09-23T10:29:54Z

@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github
So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao"?

For all regression type models, there is a base function for AIC and BIC. While calling the AIC or BIC, we need to pass the model. All the necessary values like log-likelihood, n, p etc are extracted from the given model and computed AIC, BIC.

Yes - I think we can use it for Bayesian models as well. We have to ensure we return log-likelihood, n, p in our Bayesian models

mousum-github · 2022-09-23T12:47:50Z

Need to check whether the Bayesian model is derived from StatisticalModel or not

sourish-cmi · 2022-09-23T15:49:01Z

All Bayesian models are completely developed by ourselves in CRRao. So we can return logLikelihood, n and p

codetalker7 added 2 commits July 3, 2022 21:01

Added ndims parameter to frequentist model; this represents the

f54df71

dimensions of the dataset on which the model was trained.

Modifying code for aic and bic functions (instead of using GLM's

843cf3c

versions, we give our own implementation). Also fixed the number of parameters (`ndims[2]`) for frequentist linear regression.

codetalker7 mentioned this pull request Sep 19, 2022

@fitmodel errors on sanction dataset with seed 123. #3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using our own `aic` and `bic` instead of GLM's versions. #13

Using our own `aic` and `bic` instead of GLM's versions. #13

codetalker7 commented Jul 3, 2022

sourish-cmi commented Sep 18, 2022

codetalker7 commented Sep 18, 2022

sourish-cmi commented Sep 19, 2022

sourish-cmi commented Sep 22, 2022

ajaynshah commented Sep 23, 2022 via email

sourish-cmi commented Sep 23, 2022

mousum-github commented Sep 23, 2022

sourish-cmi commented Sep 23, 2022

mousum-github commented Sep 23, 2022 •

edited

Loading

sourish-cmi commented Sep 23, 2022

Using our own aic and bic instead of GLM's versions. #13

Are you sure you want to change the base?

Using our own aic and bic instead of GLM's versions. #13

Conversation

codetalker7 commented Jul 3, 2022

sourish-cmi commented Sep 18, 2022

codetalker7 commented Sep 18, 2022

sourish-cmi commented Sep 19, 2022

sourish-cmi commented Sep 22, 2022

ajaynshah commented Sep 23, 2022 via email

sourish-cmi commented Sep 23, 2022

mousum-github commented Sep 23, 2022

sourish-cmi commented Sep 23, 2022

mousum-github commented Sep 23, 2022 • edited Loading

sourish-cmi commented Sep 23, 2022

Using our own `aic` and `bic` instead of GLM's versions. #13

Using our own `aic` and `bic` instead of GLM's versions. #13

mousum-github commented Sep 23, 2022 •

edited

Loading