Clean bifurcation of contract mgmt versus connection functionality #402

Open
ormu5 opened this issue Aug 30, 2024 · 2 comments

ormu5 commented Aug 30, 2024

First: thanks for the effort on this tool! It is coming in handy for me on a current project as we embrace the spec in earnest.

It seems like the optional feature-level installation via pip is intended to pick and choose which data sources to make available for testing connections. This makes sense, but I wonder if the central contract mgmt and this connectivity functionality can be further delineated without too much trouble.

I noticed this when doing pip install datacontract-cli (no features specified) and then trying to run datacontract:

(venv) √ contracts 15:38:27 % datacontract lint customer.yaml 
Traceback (most recent call last):
  File "dir/bin/datacontract", line 5, in <module>
    from datacontract.cli import app
  File "dir/lib/python3.11/site-packages/datacontract/cli.py", line 15, in <module>
    from datacontract import web
  File "dir/lib/python3.11/site-packages/datacontract/web.py", line 7, in <module>
    from datacontract.data_contract import DataContract, ExportFormat
  File "dir/lib/python3.11/site-packages/datacontract/data_contract.py", line 16, in <module>
    from datacontract.engines.soda.check_soda_execute import check_soda_execute
  File "dir/lib/python3.11/site-packages/datacontract/engines/soda/check_soda_execute.py", line 11, in <module>
    from datacontract.engines.soda.connections.duckdb import get_duckdb_connection
  File "dir/lib/python3.11/site-packages/datacontract/engines/soda/connections/duckdb.py", line 3, in <module>
    from deltalake import DeltaTable
ModuleNotFoundError: No module named 'deltalake'

It looks like the above may be fixed in the next release, based on release notes. But for me, I then manually ran pip install deltalake and that got me past this error to this one:

(venv) √ contracts 15:39:10 % datacontract lint customer.yaml
Traceback (most recent call last):
  File "dir/bin/datacontract", line 5, in <module>
    from datacontract.cli import app
  File "dir/lib/python3.11/site-packages/datacontract/cli.py", line 15, in <module>
    from datacontract import web
  File "dir/lib/python3.11/site-packages/datacontract/web.py", line 7, in <module>
    from datacontract.data_contract import DataContract, ExportFormat
  File "dir/lib/python3.11/site-packages/datacontract/data_contract.py", line 16, in <module>
    from datacontract.engines.soda.check_soda_execute import check_soda_execute
  File "dir/lib/python3.11/site-packages/datacontract/engines/soda/check_soda_execute.py", line 12, in <module>
    from datacontract.engines.soda.connections.kafka import create_spark_session, read_kafka_topic
  File "dir/lib/python3.11/site-packages/datacontract/engines/soda/connections/kafka.py", line 3, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

'check_soda_execute' imports the connection-related modules under 'soda' (duckdb, kafka) at module level, so their third-party dependencies are required even for a command like lint. This seems to violate the implied segregation of contract mgmt and connectivity functionality.

If I pass no source-centric feature flags, it seems like no sources/connections config should be needed to run the tool. This would also assist greatly in running a lighter-weight version of datacontract-cli as a centralized service, where it would only be performing contract management. We are currently planning on running it as a centralized service, anyway; it's just pretty meaty with connection-related application code it will never use.

Perhaps 'check_soda_execute' should truly be a [graceful] check or else conditionally called based on command line flag, availability of 'server' information in the contract, etc.
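
To make the "graceful check" idea concrete, here is a minimal sketch of deferring the connection imports until a server type actually needs them. It is not how datacontract-cli is implemented: OPTIONAL_MODULES, load_optional, and connection_module_for are hypothetical names, the keys are borrowed from the connection file names in the tracebacks, and the extras name in the install hint is an assumption.

```python
import importlib

# Hypothetical mapping from a connection name to the third-party module it
# needs. The module names come from the tracebacks above; the keys and their
# reuse as pip extras names are assumptions for illustration only.
OPTIONAL_MODULES = {
    "duckdb": "deltalake",
    "kafka": "pyspark.sql",
}


def load_optional(module_name, extra_hint):
    """Import a module only when it is actually needed.

    Raises a RuntimeError carrying an install hint, instead of letting a
    bare ModuleNotFoundError surface at CLI start-up.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            f"'{module_name}' is not installed; try: "
            f"pip install 'datacontract-cli[{extra_hint}]'"
        ) from exc


def connection_module_for(connection_name):
    """Resolve the dependency for one connection lazily, at call time."""
    module_name = OPTIONAL_MODULES[connection_name]
    return load_optional(module_name, extra_hint=connection_name)
```

With something along these lines, a command such as lint would never reach connection_module_for, so a base install would only fail when a contract's server section asks for a connection whose dependency is missing.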

Thanks again.

jochenchrist (Contributor) commented Sep 1, 2024

I agree.

Maybe the first step is to have a separate test case in GitHub Actions that validates that the contract management commands run without extras.

To be specific, these commands should be tested:

  • datacontract init
  • datacontract lint
  • datacontract import --format sql
  • datacontract import --format json
  • datacontract export --format sql
  • datacontract test (with server type local, format json)
  • datacontract diff
  • datacontract changelog
  • datacontract breaking
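
A rough sketch of what such a check could look like as a script run in a job that installs only the base package. The command list mirrors the one above with file arguments left out, and BASE_COMMANDS and find_missing_module_failures are hypothetical names. It only flags the failure mode from this issue (a ModuleNotFoundError on stderr), so a command that fails for another reason, such as a missing argument, is not reported.

```python
import subprocess

# Commands from the list above. File arguments and other options are
# deliberately omitted; a real workflow would need to supply them.
BASE_COMMANDS = [
    ["datacontract", "init"],
    ["datacontract", "lint"],
    ["datacontract", "import", "--format", "sql"],
    ["datacontract", "import", "--format", "json"],
    ["datacontract", "export", "--format", "sql"],
    ["datacontract", "test"],
    ["datacontract", "diff"],
    ["datacontract", "changelog"],
    ["datacontract", "breaking"],
]


def find_missing_module_failures(commands, run=subprocess.run):
    """Run each command and collect those that hit a missing module.

    Returns a dict mapping the command line to the last stderr line,
    or to a short note if the executable itself could not be found.
    """
    failures = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            result = run(argv, capture_output=True, text=True)
        except FileNotFoundError:
            failures[key] = "executable not found"
            continue
        if "ModuleNotFoundError" in result.stderr:
            failures[key] = result.stderr.strip().splitlines()[-1]
    return failures
```

Calling find_missing_module_failures(BASE_COMMANDS) after a plain pip install datacontract-cli and failing the job when the result is non-empty would catch regressions like the deltalake and pyspark imports reported above.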

ormu5 (Author) commented Sep 5, 2024

This looks right (full disclosure: I'm still ramping up on your tool).
The purist in me was thinking only datacontract test --examples should be a part of this test case and what it represents (i.e., contract mgmt functionality), but I think I see why you're saying what you're saying (plus I'm sure I'm also projecting). Today the 'local' testing seems implicit to the base install as I look at pyproject.toml (along with sql/json import/export), and that's lightweight enough and, of course, local.
I think this makes sense.
