Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added database existence checks and clear_tables function #593

Merged
merged 4 commits into from
Apr 16, 2024

Conversation

robthew
Copy link

@robthew robthew commented Apr 9, 2024

This uses the python library 'sqlalchemy_utils', so you will need to perform an install: pip install sqlalchemy_utils

@robthew
Copy link
Author

robthew commented Apr 10, 2024

I added a parameter to the RelationalDBStore init for creating the database.

The creation step has been updated to include a 'drop database' step. This insures that the database used in the test is always up-to-date with the latest schema.

@rpiazza
Copy link
Contributor

rpiazza commented Apr 10, 2024

LGTM

@chisholm - can you take a quick look?

@chisholm
Copy link
Contributor

I think "creating" and "instantiating" a database seems awfully similar. Having those two parameters together could be confusing. "create_db" is also missing from the data store docstring.

One of the design principles I think we should keep in mind is to allow the source/sink to be usable by themselves, not just in concert with a store. After all, we're supposed to be able to mix 'n match sources and sinks. With current code, it looks like you can have your database automatically created if you use a store, but not if you use a sink. I think whatever the store can do as far as modifying/mutating database operations (creation, insertion, etc) should also be doable with the sink.

You also might want to be careful with dropping an existing database. Perhaps there should be a drop-if-exists option which defaults to False, rather than unconditionally deleting it. The most common use case would probably be "make sure it's there", not "blow it away if it's there".

So what would be a better constructor signature than create and instantiate true/false... hmm. We want to allow some finer control over what gets automatically created. The choices are "database + tables", "tables only", "nothing"? And another setting regarding what to do if they already exist, with choices "drop" and "nothing"? Maybe we need a three-valued (maybe enum-valued?) parameter describing what to create, and a separate drop_if_exists option which applies to everything to be created (both databases and tables)?

@rpiazza
Copy link
Contributor

rpiazza commented Apr 12, 2024

@robthew - any thoughts on Andy's comments?

@robthew
Copy link
Author

robthew commented Apr 12, 2024

It seems like we could end up with a lot of duplicated code in the store/sink/source. What's the use case for needing database creation in the source? In the sink we are inserting data, so the database needs to be created if it doesn't exist already, but when would you need to go through a database creation step when using a source by itself? It seems like you couldn't use a source and get data without first putting data in with a sink.

@chisholm
Copy link
Contributor

I don't think mutable operations (creating, inserting, etc) should be in the source, only the sink/store. The relational data source docstring says as much:

Initialize this source. Only one of stix_object_classes and metadata
should be given: if the latter is given, assume table schemas are
already created. Instances of this class do not create the actual
database tables; see the source/sink for that.

Oops... should say "store/sink", not "source/sink" :-P That's my typo!

There's no need for duplication between store/sink if the store can delegate to the sink. That's how it is now, for schema (in the postgres sense)/table creation. The sink has code for that; the store does not.

There is some duplication with regard to table schema creation (the SQLAlchemy table objects and metadata) so that the source/sink are usable independently as well as paired together in a store, without having to duplicate the table schema creation work. So, relational db store can create that once and reuse for both its source and sink, but source/sink also need to be able to create that on their own when necessary. But that's factored out to a separate reusable module so the duplication is just a couple lines of code, not that big a deal.

@robthew
Copy link
Author

robthew commented Apr 15, 2024

I adjusted the database creation options. The 'instantiate_database' option is True by default, and will create the database if it doesn't exist, and create the schemas and tables. The 'force_recreate' is False by default and will drop an existing database and recreate it. This should only be needed when changes have been made to the schema.

On a separate note - Postgres doesn't like dashes in database names. Should the 'stix-data-sink' name changed to use underscores?

@rpiazza
Copy link
Contributor

rpiazza commented Apr 15, 2024

LGTM

@chisholm
Copy link
Contributor

I didn't actually run into the hyphen problem with database names; pgadmin4 lets you create/drop the database without problems. But I guess if you were hand-writing SQL, you'd need quotes. Sure, underscores are a bit simpler to use.

@rpiazza rpiazza merged commit 316edb3 into oasis-open:relational-data-sink Apr 16, 2024
1 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants