-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using our own aic
and bic
instead of GLM's versions.
#13
base: main
Are you sure you want to change the base?
Conversation
dimensions of the dataset on which the model was trained.
versions, we give our own implementation). Also fixed the number of parameters (`ndims[2]`) for frequentist linear regression.
This looks like a simple PR, but the test is failing. Any thought? |
This PR is quite old; I believe the errors in the tests are old as well (for instance, |
It would be better to create fresh PR. On a different thought: Shall we make a PR to GLM? What do you think? @mousum-github : shall we make a small PR to GLM? Mousum da, The AIC and BIC of Julia-GLM do not match R's GLM. Hence we created our own code. But I think we should do a PR to GLM. Will you be able to help @codetalker7 and @ShouvikGhosh2048 to develop this PR for GLM? |
@codetalker7 @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github @ajaynshah I thoroughly tested Now only So I think we should not implement our own Here is the detailed test: Linear Regression in > library(datasets)
> mod1 = lm(mpg~hp+disp,mtcars)
> AIC(mod1)
[1] 168.6186
> BIC(mod1)
[1] 174.4815 Linear Regression in using RDatasets, GLM, StatsBase
mtcars = dataset("datasets","mtcars");
mod1 = lm(@formula(MPG~HP+Disp),mtcars);
StatsBase.aic(mod1)
Julia> 168.61856413576285
StatsBase.bic(mod1)
Julia> 174.48150774696177 Clearly, aic and bic from Logistic Regression in > turnout=read.csv(file = "turnout.csv")
> mod2 = glm(Vote ~ Age + Race + Income,turnout,family=binomial(link="logit"))
> AIC(mod2)
[1] 2113.593
> BIC(mod2)
[1] 2135.997 Logistic Regression in turnout = dataset("Zelig", "turnout");
mod2 = glm(@formula(Vote ~ Age + Race + Income),turnout,Binomial(), LogitLink());
StatsBase.aic(mod2)
Julia> 2113.592978817039
StatsBase.bic(mod2)
Julia> 2135.9965886552072
Poisson Regression with > library(MASS)
> mod3 = glm(Days ~ Eth+Sex+Age+Lrn, quine, family = poisson(link = "log"))
> AIC(mod3)
[1] 2299.184
> BIC(mod3)
[1] 2320.069 Poisson Regression with quine = dataset("MASS", "quine")
mod3 = glm(@formula(Days ~ Eth+Sex+Age+Lrn), quine, Poisson(), LogLink());
StatsBase.aic(mod3)
Julia> 2299.1836302853603
StatsBase.bic(mod3)
Julia> 2320.0688766373187 Clearly, aic and bic from |
Given that the formula for aic involves only the logl and n and k, it makes
sense to place this at a highly general place.
…On Thu, 22 Sept 2022 at 17:40, Sourish ***@***.***> wrote:
@codetalker7 <https://github.com/codetalker7> @ShouvikGhosh2048
<https://github.com/ShouvikGhosh2048> @ayushpatnaikgit
<https://github.com/ayushpatnaikgit> @mousum-github
<https://github.com/mousum-github> @ajaynshah
<https://github.com/ajaynshah>
I thoroughly tested aic and bic of GLM. I believe the aic and bic of GLM
is rolled back. If I call GLM's aic and bic - it throws the error.
Now only StatsBase.aic(model) and StatsBase.bic(model) is working, and *the
result exactly matches with that of R*.
So I think we should not implement our own aic and bic. Rather we should
integrate StatsBase.aic(model) and StatsBase.bic(model) with CRRao.
Here is the detailed test:
*Linear Regression in R*
> library(datasets)> mod1 = lm(mpg~hp+disp,mtcars)> AIC(mod1)
[1] 168.6186> BIC(mod1)
[1] 174.4815
*Linear Regression in Julia*
using RDatasets, GLM, StatsBase
mtcars = dataset("datasets","mtcars");
mod1 = ***@***.***(MPG~HP+Disp),mtcars);
StatsBase.aic(mod1)
Julia> 168.61856413576285
StatsBase.bic(mod1)
Julia> 174.48150774696177
Clearly, aic and bic from StatsBase match exactly with R for linear
regression.
*Logistic Regression in R*
> turnout=read.csv(file = "turnout.csv")> mod2 = glm(Vote ~ Age + Race + Income,turnout,family=binomial(link="logit"))> AIC(mod2)
[1] 2113.593> BIC(mod2)
[1] 2135.997
*Logistic Regression in Julia*
turnout = dataset("Zelig", "turnout");
mod2 = ***@***.***(Vote ~ Age + Race + Income),turnout,Binomial(), LogitLink());
StatsBase.aic(mod2)
Julia> 2113.592978817039
StatsBase.bic(mod2)
Julia> 2135.9965886552072
*Poisson Regression with R*
> library(MASS)> mod3 = glm(Days ~ Eth+Sex+Age+Lrn, quine, family = poisson(link = "log"))> AIC(mod3)
[1] 2299.184> BIC(mod3)
[1] 2320.069
*Poisson Regression with Julia*
quine = dataset("MASS", "quine")
mod3 = ***@***.***(Days ~ Eth+Sex+Age+Lrn), quine, Poisson(), LogLink());
StatsBase.aic(mod3)
Julia> 2299.1836302853603
StatsBase.bic(mod3)
Julia> 2320.0688766373187
Clearly, aic and bic from StatsBase match exactly with R for all linear,
logistic and Poisson regression.
—
Reply to this email directly, view it on GitHub
<#13 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACO2TJAIRDDNTSOFFAI2FK3V7SKYJANCNFSM52Q5AYQA>
.
You are receiving this because you were mentioned.Message ID: <xKDR/CRRao.
***@***.***>
--
Ajay Shah
***@***.***
http://www.mayin.org/ajayshah
|
@codetalker7 @ajaynshah @ShouvikGhosh2048 @ayushpatnaikgit @mousum-github So shall we just rename this PR as "Integrating aic and bic of StatsBase with CRRao" ? |
For all regression type models, there is a base function for |
Yes - I think we can use it for Bayesian models as well. We have to ensure we return |
Need to check whether the Bayesian model is derived from |
All Bayesian models are completely developed by ourselves in |
This PR is to fix revert back to our own implementations of
$$\text{aic} = 2\cdot p - 2\cdot\text{loglikelihood}$$ $$p$$ is the number of parameters of the model, and $$\text{loglikelihood}$$ is the log-likelihood of the model.
aic
andbic
functions (from the older code, till commit e66244f) instead of using the GLM versions of those functions. The formula for computing the AIC of a frequentist model iswhere
The formula for computing the BIC of a frequentist model is
$$\text{bic} = \log(n)\cdot p - 2\cdot \text{loglikelihood}$$ $$n$$ is the number of datapoints.
where
Instead of$p$ , GLM uses $k$ , consumed degrees of freedom.
We have added a property
ndims
to theFrequentistRegression
struct;ndims
is a tuple containing the dimensions of the dataset on which the model was trained. So,ndims[1]
is the number of points in the dataset, andndims[2]
is the number of parameters in the model.In the older code (till commit e66244f), the number of parameters for the frequentist linear regression model was
size(X, 2) + 1
; for other models, it was justsize(X, 2)
. We have kept this in mind and made the relevant changes in the file (for instance, look atsrc/frequentist/linear_regression.jl
line 60).