-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make ETL run in Google Big Query #65
Comments
The most straightforward way to do this might be through the BigQuery DBAPI: https://googleapis.dev/python/bigquery/latest/dbapi.html since it will likely allow for significant code reuse without having to write a lot of BigQuery "Standard SQL". However, I don't know the details of the ETL or the DBAPI well enough though to know how much additional work is required to make this happen. It may be just as much work to write the ETL from scratch to accommodate BigQuery |
Stanford team has done some work with this ETL and GBQ. Stay tuned for more details. |
I'm going to add myself to that list of Stanford collaborators. I have worked on the ETL in the past, have the converted MIMIC-OMOP data in BigQuery, and have been working on ETL+modeling tools |
Thanks @spfohl, we're looking to coordinate a single mapping, preferably within this repository. The plan is then for the contributors to submit a well-described version of the output dataset to PhysioNet, to (1) allow the specific version used in a study to be clearly cited and (2) to avoid users having to build the OMOP version themselves. We are in the process of identifying a technical lead for the work who can take responsibility for managing the development process (i.e. overseeing development work, code review, testing framework, etc). It would be good to chat if you have thoughts on this! |
I'll sync with the others working with the BigQuery pipeline and see where I can best contribute and then follow up. I don't currently have the bandwidth to take on a leadership role here, but I am highly interested and motivated in broadly improving the quality and usability of this ETL since I am involved several on-going research efforts that would benefit from that |
Sounds good, thanks @spfohl. We'll post updates as things develop. We've found it difficult to decide how best to manage different multiple SQL dialects for other projects. If whoever becomes lead decides that BigQuery syntax is best, then maybe we just port this whole repo. It's an interesting thought that there may be multiple MIMIC to OMOP mappings already out there and being used. If so, it would be an interesting study to explore how the choice of mapping contributes to the output of an analysis. |
I worked with @jdposada and @PriyaDesai70 on a proof of concept to see how long it would take us to convert one SQL script to BigQuery syntax. It took about 90 minutes to convert the procedure_occurence script, but I imagine that further tables would be much faster. There are just a few simple patterns that need to substituted, and some like SELECT DISTINCT ON that required some more complicated logic. See here |
Current code is Postgres SQL flavor specific.
(e.g., ::integer) in code.
To run on other platforms, notes how to do port this is needed.
The text was updated successfully, but these errors were encountered: