make ETL run in Google Big Query #65

vojtechhuser · 2020-05-05T19:22:47Z

The current code is specific to the PostgreSQL dialect (e.g., it uses ::integer casts).

To run it on other platforms, notes on how to port it are needed.
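
As a rough illustration of what such porting notes could cover, here is a minimal Python sketch that rewrites simple Postgres `::type` casts into `CAST(... AS ...)`, which BigQuery Standard SQL accepts. The type mapping, the regular expression, and the function name `rewrite_casts` are assumptions made for illustration and are not part of this repository. The sketch only handles casts applied to a plain identifier, number, or quoted literal; parenthesized expressions such as `(a + b)::integer` would still need manual review.

```python
import re

# Hypothetical, partial mapping from Postgres type names to BigQuery types.
PG_TO_BQ_TYPES = {
    "integer": "INT64",
    "int": "INT64",
    "bigint": "INT64",
    "text": "STRING",
    "varchar": "STRING",
    "double precision": "FLOAT64",
}

# A simple operand (quoted literal, identifier or dotted column, or number)
# followed by "::" and a type name.
CAST_RE = re.compile(
    r"(?P<expr>'[^']*'|[A-Za-z_][\w.]*|\d+(?:\.\d+)?)"
    r"::(?P<type>double precision|[A-Za-z_]+)",
    re.IGNORECASE,
)


def rewrite_casts(sql):
    """Replace simple `expr::type` casts with `CAST(expr AS bq_type)`."""

    def repl(match):
        bq_type = PG_TO_BQ_TYPES.get(match.group("type").lower())
        if bq_type is None:
            # Unknown type: leave the cast untouched for manual review.
            return match.group(0)
        return f"CAST({match.group('expr')} AS {bq_type})"

    return CAST_RE.sub(repl, sql)


ported = rewrite_casts("SELECT subject_id::integer FROM patients")
```

A regex pass like this is only a first step; anything it leaves untouched still needs to be checked by hand.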

spfohl · 2020-05-05T20:07:32Z

The most straightforward way to do this might be through the BigQuery DBAPI: https://googleapis.dev/python/bigquery/latest/dbapi.html since it will likely allow for significant code reuse without having to write a lot of BigQuery "Standard SQL". However, I don't know the details of the ETL or the DBAPI well enough to know how much additional work is required to make this happen. It may be just as much work to write the ETL from scratch to accommodate BigQuery.
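
To make the DBAPI idea a bit more concrete, here is a minimal sketch of a runner that executes a list of SQL statements through any Python DB-API 2.0 connection. The helper name `run_statements` and the `person` table are hypothetical. An in-memory `sqlite3` connection stands in so the sketch runs locally; with BigQuery, the connection would presumably come from `google.cloud.bigquery.dbapi.connect()`, which needs credentials and is not exercised here. Note that DB-API only standardizes the Python calls, so the SQL text itself still has to be valid for whichever backend executes it.

```python
import sqlite3


def run_statements(conn, statements):
    """Execute each SQL statement in order on a DB-API 2.0 connection."""
    cursor = conn.cursor()
    for sql in statements:
        cursor.execute(sql)
    conn.commit()
    return cursor


# Stand-in connection (assumption: any DB-API 2.0 connection would work here).
conn = sqlite3.connect(":memory:")
cursor = run_statements(
    conn,
    [
        "CREATE TABLE person (person_id INTEGER)",
        "INSERT INTO person VALUES (1), (2)",
    ],
)

# Check that both rows were loaded.
cursor.execute("SELECT COUNT(*) FROM person")
count = cursor.fetchone()[0]
```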

vojtechhuser · 2020-05-20T16:16:54Z

The Stanford team has done some work with this ETL and GBQ. Stay tuned for more details.

spfohl · 2020-05-20T16:27:38Z

I'm going to add myself to that list of Stanford collaborators. I have worked on the ETL in the past, have the converted MIMIC-OMOP data in BigQuery, and have been working on ETL+modeling tools.

tompollard · 2020-05-20T16:38:24Z

Thanks @spfohl, we're looking to coordinate a single mapping, preferably within this repository.

The plan is then for the contributors to submit a well-described version of the output dataset to PhysioNet, to (1) allow the specific version used in a study to be clearly cited and (2) to avoid users having to build the OMOP version themselves.

We are in the process of identifying a technical lead for the work who can take responsibility for managing the development process (i.e. overseeing development work, code review, testing framework, etc). It would be good to chat if you have thoughts on this!

spfohl · 2020-05-20T17:21:32Z

I'll sync with the others working with the BigQuery pipeline, see where I can best contribute, and then follow up. I don't currently have the bandwidth to take on a leadership role here, but I am highly interested in and motivated by broadly improving the quality and usability of this ETL, since I am involved in several ongoing research efforts that would benefit from that.

tompollard · 2020-05-20T17:48:51Z

Sounds good, thanks @spfohl. We'll post updates as things develop.

We've found it difficult to decide how best to manage multiple SQL dialects for other projects. If whoever becomes lead decides that BigQuery syntax is best, then maybe we just port this whole repo.

It's an interesting thought that there may be multiple MIMIC to OMOP mappings already out there and being used. If so, it would be an interesting study to explore how the choice of mapping contributes to the output of an analysis.

spfohl · 2020-05-28T22:40:49Z

I worked with @jdposada and @PriyaDesai70 on a proof of concept to see how long it would take us to convert one SQL script to BigQuery syntax. It took about 90 minutes to convert the procedure_occurence script, but I imagine that further tables would be much faster. There are just a few simple patterns that need to be substituted, and some, like SELECT DISTINCT ON, that required more complicated logic. See here
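
For readers wondering what the SELECT DISTINCT ON rewrite can look like, here is a minimal sketch of one common pattern: rank rows within each group using `ROW_NUMBER()` and keep the first one. This is not necessarily the logic used in the proof of concept; the `events` table and its columns are made up for illustration. The rewritten query is run against an in-memory SQLite database (window functions need SQLite 3.25 or newer) purely so the example is runnable without a BigQuery project.

```python
import sqlite3

# Postgres-only form, shown for comparison and not executed here:
# keep the earliest row per subject_id.
POSTGRES_SQL = """
SELECT DISTINCT ON (subject_id) subject_id, charttime, value
FROM events
ORDER BY subject_id, charttime
"""

# Window-function form: number the rows per subject_id by charttime
# and keep only the first row of each group.
PORTABLE_SQL = """
SELECT subject_id, charttime, value
FROM (
    SELECT subject_id, charttime, value,
           ROW_NUMBER() OVER (
               PARTITION BY subject_id ORDER BY charttime
           ) AS rn
    FROM events
) AS ranked
WHERE rn = 1
ORDER BY subject_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (subject_id INTEGER, charttime TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "2020-01-02", 5.0),
        (1, "2020-01-01", 7.0),
        (2, "2020-03-01", 9.0),
    ],
)
rows = conn.execute(PORTABLE_SQL).fetchall()
```

One caveat: when several rows tie on the ordering column, which row gets `rn = 1` is arbitrary, so adding a tie-breaker column to the `ORDER BY` keeps the output deterministic.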
