Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Artifact Manager #151

Merged
merged 41 commits into from
Jan 2, 2024
Merged

Artifact Manager #151

merged 41 commits into from
Jan 2, 2024

Conversation

SebS94
Copy link
Contributor

@SebS94 SebS94 commented Oct 23, 2023

Description

This PR introduces a basic artifact manager functionality based on squirrel stores.

Fixes # issue

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

@github-actions
Copy link

github-actions bot commented Oct 23, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@SebS94
Copy link
Contributor Author

SebS94 commented Oct 23, 2023

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Oct 23, 2023
Copy link
Contributor

@AlirezaSohofi AlirezaSohofi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG in general. I would suggest to add timestamp as well. We can then add search by time functionality.

The default collection is a bit odd. I assume that the typical usage will be informally assuming some semantic or schema consistency for each collection and organize artifact by that, then default because like whatever. I think there is also a risk that people forget to specify the collection and incorrectly put artifacts in default collection, which can be hard to notice and hard to fix.

I think a nice extension is for artifact registry would be collections with explicit schema definition, say in json-schema format. The benefits:

  • add validation logic
  • amenable to build analytic layers on top
  • better search capabilities, e.g. with SQL

squirrel/artifact_manager/base.py Outdated Show resolved Hide resolved
squirrel/artifact_manager/fs.py Outdated Show resolved Hide resolved
manager.log_object(obj2, artifact_name, collection)
assert manager.get_artifact(artifact_name, collection) == obj2
assert manager.get_artifact(artifact_name, collection, 2) == obj1
assert manager.get_artifact(artifact_name, collection, 1) == obj
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manager.get_artifact(artifact_name, collection, 4)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would just fail, right?

raise NotImplementedError

@abstractmethod
def get_artifact_source(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would not implement artifact logic here and keep cohesion high, but let a user convert this manager to a catalog and let him/her retreive the source from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the source description of squirrel would be a nice vehicle to expose for pre-processing and filtering artifacts before fetching/downloading them based on meta-data?

raise NotImplementedError

@abstractmethod
def log_object(self, obj: Any, name: str, collection: Optional[str] = None) -> Source:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to retrieve objects again? can we restrict it to safe object types? e.g. safetensors, numpy, primitive types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to log_artifact - can be retrieved via get_artifact.
In general everything that can be serialised via the backend serialiser (currently messagepack, later deltalake) can be logged. What would be the motivation for restricting this?

@SebS94
Copy link
Contributor Author

SebS94 commented Nov 7, 2023

The default collection is a bit odd. I assume that the typical usage will be informally assuming some semantic or schema consistency for each collection and organize artifact by that, then default because like whatever. I think there is also a risk that people forget to specify the collection and incorrectly put artifacts in default collection, which can be hard to notice and hard to fix.

The collection attribute is intended as an active collection to which objects are logged if no other target is provided. The assumption here is that this is set e.g. to the job id at the beginning. I renamed the attribute to make this more explicit.

I think a nice extension is for artifact registry would be collections with explicit schema definition, say in json-schema format.

Absolutely agree that this will be a useful extension, however, for me this is part of the semantic layer which exposes functionality for defining a schema and retrieving/filtering objects based on it.

I would suggest to add timestamp as well. We can then add search by time functionality.

This seems useful but will keep it out of the basic logger as it unnecessarily complicates the interface.
What I think makes sense here is to provide a minimal schema containing basic attributes (e.g. author, timestamp) as part of the semantic layer.

@AlirezaSohofi
Copy link
Contributor

This seems useful but will keep it out of the basic logger as it unnecessarily complicates the interface.
What I think makes sense here is to provide a minimal schema containing basic attributes (e.g. author, timestamp) as part of the semantic layer.

This should not introduce any change to the interface. The minimal schema you mentioned was actually aligned with what I had in mind. Whether it's part of artifact itself or semantic layer is up for discussion and involves trade-offs.

Copy link
Contributor

@adrianloy adrianloy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, generally LGTM so far.

squirrel/artifact_manager/base.py Outdated Show resolved Hide resolved
squirrel/artifact_manager/base.py Outdated Show resolved Hide resolved

@abstractmethod
def log_artifact(self, obj: Any, name: str, collection: Optional[str] = None) -> Source:
"""Log an arbitrary python object"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if docstring has the serializer we use. Also, are we sure we need this? How are different python versions and pickle versions affecting this? I heard stories that people really dont like pickle for serializing due to version dependencies occurring. Having to figure out which versions were used during logging in able to load a year old artifact seems really painful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialiser depends on the backend - WandB has it's own, for SquirrelFileStores it's actually explicitly chosen when initialising the Manager.

squirrel/artifact_manager/base.py Outdated Show resolved Hide resolved
squirrel/artifact_manager/base.py Show resolved Hide resolved
squirrel/artifact_manager/fs.py Outdated Show resolved Hide resolved
Copy link

This is PR is marked as stale as it has been inactive for 30 days. It will be closed in 7 days.

@SebS94
Copy link
Contributor Author

SebS94 commented Dec 15, 2023

@AlirezaSohofi @ThomasWollmann @adrianloy I updated the PR with your comments and also added a basic version of the WandB manager.
Unfortunately, the WandB artifact API is actually somewhat incompatible with what we envisioned. They manage versioned artifacts within collections ("artifact types" in their terminology). Each artifact is actually not atomic but a folder of its own that can contain arbitrary files.

This goes against what we discussed where each artifact is an individually versioned file or serialised python value. I now somewhat abused the WandB artifact to force each committed value into a separate WandB artifact. While I don't think it's a huge blocker right now, I'd be interested in your input on whether or not we should relax our assumption here and allow multiple files per artifact (not necessarily an issue) or stick with this.

Also let me know if you any additional comments regarding the overall API.

@SebS94 SebS94 force-pushed the seb-artifact-manager branch from d1bc2a5 to 5af0566 Compare December 15, 2023 17:08
@SebS94 SebS94 requested a review from adrianloy December 18, 2023 12:08
@adrianloy adrianloy marked this pull request as ready for review December 18, 2023 13:20
@adrianloy
Copy link
Contributor

I would follow their approach. I think wandb integration should be first citizen, as this is likely how we will use it the most. I thought the old artifact manager also coul dbe used for "log everything of this local folder as a single artifact" and I think in general that makes sense. So I think relaxing of our assumptions is fine.

@ThomasWollmann ThomasWollmann removed their request for review December 18, 2023 13:28
@SebS94 SebS94 force-pushed the seb-artifact-manager branch from 2635599 to ce7cb3a Compare December 22, 2023 13:36
@mg515 mg515 self-requested a review December 22, 2023 15:18
mg515
mg515 previously approved these changes Dec 22, 2023
Copy link
Contributor

@mg515 mg515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go from my side as a first version, further implementation can go into follow-up PRs. Looking forward to use the new artifact manager woop woop

@SebS94 SebS94 merged commit 0ed5733 into main Jan 2, 2024
4 checks passed
@SebS94 SebS94 deleted the seb-artifact-manager branch January 2, 2024 08:55
@github-actions github-actions bot locked and limited conversation to collaborators Jan 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants