redatam #672

Open
2 of 21 tasks
pachadotdev opened this issue Nov 8, 2024 · 19 comments

Comments

@pachadotdev

pachadotdev commented Nov 8, 2024

Submitting Author Name: Mauricio Pacha Vargas Sepulveda
Submitting Author Github Handle: @pachadotdev
Other Package Authors Github handles: (comma separated, delete if none) @litalbarkai
Repository: https://github.com/litalbarkai/open-redatam/tree/main/rpkg
Submission type: Pre-submission
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: redatam
Type: Package
Title: Import 'REDATAM' Files 
Version: 2.0.3
Authors@R: c(
    person(
        given = "Mauricio",
        family = "Vargas Sepulveda",
        role = c("aut", "cre"),
        email = "[email protected]",
        comment = c(ORCID = "0000-0003-1017-7574")),
    person(
        given = "Lital",
        family = "Barkai",
        role = "aut"),
    person(
        given = "Arseny",
        family = "Kapoulkine",
        role = "ctb",
        comment = "'pugixml' C++ library"),
    person(
        family = "Republic of Ecuador",
        role = "dtc",
        comment = "Galapagos census data")
    )
Imports:
    data.table,
    janitor,
    stringi
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Depends: R(>= 3.5.0)
Description: Import 'REDATAM' formats into R via the 'Open REDATAM' C++ library
    <https://github.com/litalbarkai/open-redatam> based on De Grande (2016)
    <https://www.jstor.org/stable/24890658>.
License: Apache License (>= 2)
URL: https://github.com/litalbarkai/open-redatam
BugReports: https://github.com/litalbarkai/open-redatam/issues
RoxygenNote: 7.3.2
Encoding: UTF-8
NeedsCompilation: yes
VignetteBuilder: knitr
LinkingTo: cpp11
Config/testthat/edition: 3

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check one or more appropriate boxes below):

    Data Lifecycle Packages

    • data retrieval
    • data extraction
    • data munging
    • data deposition
      • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis

    Statistical Packages

    • Bayesian and Monte Carlo Routines
    • Dimensionality Reduction, Clustering, and Unsupervised Learning
    • Machine Learning
    • Regression and Supervised Learning
    • Exploratory Data Analysis (EDA) and Summary Statistics
    • Spatial Analyses
    • Time Series Analyses
    • Probability Distributions
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

REDATAM is a closed-source format for census and survey data. This package is an "archeological" version of the haven package: it reads this specific format, which is widely used in Latin America by different government statistical offices. With this package, I have been able to convert census data from the 1990s, which is no longer possible with the Redatam software on Windows because of multiple hardware changes in the last 30 years. The package also reads recent census data (2017-2020) correctly.

The target audience is sociologists, political scientists, and economists who need census data and an easy way to read it in R (or Python) to fit regression models or run other kinds of analysis.

No. There is a "redatamx" that reads a newer format.

Yes.

  • Any other questions or issues we should be aware of?:

This package was removed from CRAN for asking about a specific CLANG-ASAN error that took me a long time to replicate. I also asked about the error here: https://stackoverflow.com/questions/79171799/addresssanitizer-error-alloc-dealloc-mismatch-operator-new-vs-free-in-r-packa

@emilyriederer

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

Error (500). The editorcheck service is currently unavailable

@emilyriederer

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

Error (500). The editorcheck service is currently unavailable

@mpadge
Member

mpadge commented Nov 11, 2024

@pachadotdev and @emilyriederer Sorry for any inconvenience caused by these errors. Our check system hasn't yet been properly configured to handle packages in sub-directories. I'll let you know here when we've updated, and you can call checks again.

@emilyriederer

Hey @pachadotdev ! This seems like some very cool software and an important goal. Could you please elaborate on how you see this package interacting with the redatamx package you mention and the litalbarkai/open-redatam package it is forked from? As a general principle, we're unable to consider forks for submission. Would it be possible to contribute this code to the original source repo or restructure it?

@pachadotdev
Author

pachadotdev commented Nov 12, 2024

redatamx is a new package made by ECLAC. It is focused on the new "Redatam X" format, and I have no part in it.

redatam (retired from CRAN, I hope to get it back there soon) is more focused on data "archeology," and I already have a group of users from Latin America who need demographic data for the period 1990-2020, which is the span of years when the formats DIC (Redatam versions 1 to 5) and DICX (Redatam 6 and ongoing) were in use.

@litalbarkai wrote the C++ parts, then I focused on the R and Python code and I made some refactors to make it work with C++ 11 and very minimal dependencies (i.e., pugixml instead of building/installing Apache Xerces), but it is a collaborative project and Lital is a co-author. I also wrote the article that we sent to the journal, where I was 100% focused on the "human writing" and not the "code writing", and Lital is the lead singer for the C++ parts.

We have two repos and send each other PRs to keep it neat. Could we use branches? Yes, but I am a boomer.

The alternative to this package is to use old hardware and a point-and-click tool on Windows 98/XP, which is why I keep my old ThinkPad X200 and an external DVD reader. It is not feasible to read old census data with modern hardware, which is a problem derived from it being in a closed-source format. Even worse, some recent census data comes with an installer that does not work on Windows 10+, and I was able to extract the data only by using Wine on my main modern laptop.

@emilyriederer

Hi @pachadotdev - thanks for your patience as we discussed internally.

This looks like a great project that is undoubtedly extremely useful and well within the rOpenSci scope. However, we feel overall we need to adhere to the rOpenSci policy of not reviewing forks due to the complexities that this could cause in downstream transparency, maintainability, etc. For example, changes in one project are less likely to be fully tested for the other; one part of the project could change its license; etc.

I think either a solution where the full codebase is contained in one repo, or one that limits yours to the current R package and uses the C++ project as a dependency, could work.

If you might be interested in restructuring, I'd happily hold this and plan for a full review of the new project. Otherwise, it may be best to close for now.

@pachadotdev
Author

Thinking outside the box, should https://github.com/litalbarkai/open-redatam/ be the URL?

I think @litalbarkai could comment on that solution.

I cannot just create a "copy and paste" in my own repository; that would hide Lital's great contributions. I have also contributed to the C++ codebase and I prefer to keep track of all the changes I made.

This is only a fork in technical terms, but it is not a fork in terms of a derived project.

About this: "changes in one project are less likely to be fully tested for the other; one part of the project could change its license; etc.":

  1. Everything I send gets a PR.
  2. Changes are usually discussed over email.
  3. Not really, the license was discussed and Lital is a co-author. There is an article under review now that I can email.

@emilyriederer

Thanks @pachadotdev for the replies. Just to clarify: I understand that you and @litalbarkai have established a good working model that sounds very effective for your purpose. I understand the complexities here, and the concern isn't so much the operating model of this specific package but the general concerns that illustrate why we have the current policy.

@pachadotdev
Author

Hi @mpadge @emilyriederer,
I found that adding https://github.com/litalbarkai/open-redatam/tree/main/rpkg to the form would be sufficient for the request.
I just did that.

@emilyriederer

Hi @pachadotdev ! Thanks for the update.

We do review submissions on branches, but that case assumes that packages post-review will ultimately be hosted on the default GitHub branch as the sole and primary instance. If both you and @litalbarkai are happy with that, this might work. Otherwise, I can run this back by the team to discuss.

@pachadotdev
Author

@litalbarkai are you OK with sending your repo for submission? I was considering merging my fork into your repo, but all the tutorials were MacGyver level and I gave up.

@mpadge
Member

mpadge commented Dec 3, 2024

@pachadotdev @litalbarkai This StackOverflow chunk can be used to merge two repos. I just tried it for your repos, and everything worked perfectly. You can also easily transfer issues across, if you need. Because the two repos are in different orgs, you might need to create a dummy repo in origin-org, transfer all issues to that, transfer dummy repo to destination-org, then move issues from there to destination-repo.
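The StackOverflow chunk itself is not reproduced in this thread. As a rough sketch only, a merge of one repository into another is usually done by adding the second repository as a remote, fetching it, and merging its branch. The helper names `merge_commands` and `run_all`, the remote name `fork`, and the placeholder URL below are all hypothetical and are not taken from the linked answer.

```python
import subprocess


def merge_commands(remote_name, remote_url, branch="main"):
    """Build the git commands (as argv lists) that merge another
    repository's branch into the repository you are currently in.

    Hypothetical helper for illustration; it only builds the commands.
    """
    return [
        # Register the other repository under a local remote name.
        ["git", "remote", "add", remote_name, remote_url],
        # Download its branches and history.
        ["git", "fetch", remote_name],
        # Merge its branch; the flag also permits histories that share
        # no common ancestor, and is harmless when they do share one.
        ["git", "merge", "--allow-unrelated-histories",
         f"{remote_name}/{branch}"],
    ]


def run_all(commands, cwd="."):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URL: replace with the repository to merge in.
    for argv in merge_commands("fork", "https://example.com/owner/repo.git"):
        print(" ".join(argv))
```

Printing the commands first, rather than calling `run_all` directly, makes it easy to review them before anything in the destination repository changes.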

@pachadotdev
Author

Can I keep my fork just in case?

Using https://github.com/litalbarkai/open-redatam/tree/main/rpkg seems to be OK (that is the original repo).

@pachadotdev
Author

Or, for a dummy/meta repo, use https://github.com/orgs/open-redatam/repositories? I just created that.

@emilyriederer

Per discussion here, we're moving this to "hold" state while Capybara proceeds
