Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example of interacting with a remote catalog #13722

Merged
merged 5 commits into from
Dec 19, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 10, 2024

Which issue does this PR close?

Rationale for this change

Many of the Catalog APIs in DataFusion are not async and thus it is not clear how to interface with remote catalogs (e.g. catalogs that require remote network access)

This pattern is possble with the current APIs but it is not obvious. Let's document how to do so

What changes are included in this PR?

  1. Add example of how to use a remote catalog with DataFusion

Are these changes tested?

By CI

Are there any user-facing changes?

A new example

@github-actions github-actions bot added the core Core DataFusion crate label Dec 10, 2024
@alamb
Copy link
Contributor Author

alamb commented Dec 10, 2024

FYI @westonpace

@alamb alamb marked this pull request as ready for review December 10, 2024 18:55
"Error during planning: table 'datafusion.remote_schema.remote_table' not found"
);

// Instead, to use a remote catalog, we must use lower level APIs on
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here on down is the API dance required. I believe @westonpace offered to help make a nicer API for this (I am certainly interested)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I'll take a stab tomorrow, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, tomorrow turned into next week but here we go: #13800

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, tomorrow turned into next week but here we go: #13800

LOL this is the story of my life the last 2 weeks

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still playing with this but wanted to give a 👍

datafusion-examples/examples/remote_catalog.rs Outdated Show resolved Hide resolved
remote_catalog_interface: Arc<RemoteCatalogInterface>,
/// Local cache of tables that have been preloaded from the remote
/// catalog
tables: Mutex<Vec<Arc<dyn TableProvider>>>,
Copy link
Member

@westonpace westonpace Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just an example so it doesn't matter but I'm curious why Vec and not HashMap which would be my default thought (e.g. is this to keep the example simple or am I missing some interesting knowledge).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was likely being pedantic here and trying to avoid storing the table name twice (once in the Catalog and once on the Table). I think perhaps that is just more confusing -- I'll change this to use HashMap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 69723c4 -- it is a nice simplification. Thank you for the suggestion

datafusion/catalog/src/catalog.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor Author

alamb commented Dec 16, 2024

Thank you @jonahgao

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this merges before #13800 then I'll update this example to refer to the helpers introduced there.

@alamb alamb merged commit 71f157f into apache:main Dec 19, 2024
25 checks passed
@alamb
Copy link
Contributor Author

alamb commented Dec 19, 2024

Thanks @westonpace -- I was procrastinating on merging this example as I wanted to use it as an excuse to consolidate some of the examples down. that wa sunecessary

@alamb alamb deleted the alamb/remote_catalog_example branch December 19, 2024 11:45
@alamb
Copy link
Contributor Author

alamb commented Dec 19, 2024

I feel bad that adding yet another example makes the "too many examples" problem even worse, so I filed a few tickets to try and make it better:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
catalog Related to the catalog crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide an example of using a remote catalog
4 participants