Upcoming BC breaks #269
norberttech
announced in
Announcements
Hey,
Since Flow has not had a stable release yet, this is not really a true BC break from a SemVer point of view. However, some parts of the project API will change drastically in the upcoming days. Changes in the core repo will affect pretty much all filesystem-based adapters.
The good news is that those changes are most likely the final ones before the first stable release.
If by any chance you are using Flow in your system, please lock your dependency to a specific commit and schedule an upgrade, since it is going to bring some pretty important features!
The change is driven by two features that are currently under development:
/partitioned_path/partition_1=a/partition_2=b/file.csv
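To illustrate what a partitioned path like the one above encodes, here is a minimal sketch that pulls the `key=value` directory segments out of a path. The function name `partitions_from_path` is hypothetical and is not part of Flow; how Flow will actually read partitions may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Flow): collects key=value directory
 * segments of a path into an ordered map of partition name => value.
 *
 * @return array<string, string>
 */
function partitions_from_path(string $path): array
{
    $partitions = [];

    // Only directory segments are inspected; the file name itself is skipped.
    foreach (explode('/', trim(dirname($path), '/')) as $segment) {
        if (preg_match('/^([^=]+)=([^=]+)$/', $segment, $matches) === 1) {
            $partitions[$matches[1]] = $matches[2];
        }
    }

    return $partitions;
}

$partitions = partitions_from_path('/partitioned_path/partition_1=a/partition_2=b/file.csv');
```

With partitions expressed this way, a reader can skip whole directories whose values do not match a filter, without opening any file inside them.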
Currently, all Loaders/Extractors that read from or write to files (remote or local) operate on top of the FileStream abstraction, which holds the URI of the local or remote file together with all options (for remote files) needed to access an external filesystem through streamWrapper.
It was made this way since we wanted to keep Loaders/Extractors as simple as possible, almost completely detached from Flow.
Because of that, by design, every Loader/Extractor had to deal with a given data source on its own.
For example, HTTP-based data sources would expect Extractors to get an HTTP client as a dependency.
The same rule applies to databases.
Those are of course very important data sources, but Flow wasn't created for them. Flow was created to process files.
In order to make it truly scalable, Flow needs to provide some of the optimization techniques known from databases when working with a filesystem, for example partitioning.
So, to make things clear: Flow is a file-oriented data processing library that additionally supports some other data sources, like the HTTP protocol or databases. Files are the most important Flow data sources/sinks; all others are considered supporting.
What is going to change
(all code examples might still change before final release)
1. Added Filesystem
From now on, Filesystem (built on top of flysystem) is going to be part of the Flow core.
The goal of the filesystem will be to scan given paths (remote/local) and open streams (so extractors/loaders can use fread/fwrite/fputcsv functions on them).
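As a rough illustration of why plain streams are enough for extractors/loaders, the sketch below writes and reads CSV rows through an ordinary PHP stream resource. It uses `php://memory` as a stand-in for whatever stream the Filesystem would open; the helper names `write_csv_rows` and `read_csv_rows` are hypothetical and not part of Flow.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical loader-side helper: writes rows as CSV into any writable stream.
 *
 * @param resource $stream
 * @param list<list<string>> $rows
 */
function write_csv_rows($stream, array $rows): void
{
    foreach ($rows as $row) {
        // Separator, enclosure and escape are passed explicitly.
        fputcsv($stream, $row, ',', '"', '\\');
    }
}

/**
 * Hypothetical extractor-side helper: reads all CSV rows from a readable stream.
 *
 * @param resource $stream
 * @return list<list<string>>
 */
function read_csv_rows($stream): array
{
    $rows = [];

    // A length of 0 means the line length is not limited.
    while (($row = fgetcsv($stream, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }

    return $rows;
}

// php://memory stands in for a stream opened on a local or remote file.
$stream = fopen('php://memory', 'w+b');
write_csv_rows($stream, [['id', 'name'], ['1', 'flow']]);
rewind($stream);
$rows = read_csv_rows($stream);
fclose($stream);
```

Neither helper knows where the bytes come from, which is the point: the code that opens the stream can change without touching the code that reads or writes rows.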
2. Added Flow Context
Each DataFrame instance will initialize its own FlowContext, which will take Configuration.
$context is going to be passed into all Extractors/Transformers/Loaders, which will give them access to the following things:
In the future, FlowContext will also provide access to Logs and Telemetry.
3. Changed Extractor/Transformer/Loader API
This one is
4. Replaced FileStream, LocalFile, RemoteFile with Path
Path is a new concept in Flow. It is coupled with Filesystem and, as the name suggests, represents a filesystem path :) Existing or not, absolute or glob pattern, local or remote.
From now on, instead of FileStream, Loaders/Extractors should expect just a Path and work with it using the Filesystem passed in through FlowContext. (Yes, this means BC breaks in all adapters :( )
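A minimal sketch of the idea, assuming PHP 8: an extractor holds only a Path and asks the Filesystem from the context for a stream. Every class, interface, and method below is a hypothetical stand-in written for this example and is NOT the real Flow API, which might still change before the final release.

```php
<?php

declare(strict_types=1);

// All types below are hypothetical stand-ins, not the real Flow API.

final class Path
{
    public function __construct(private string $uri) {}

    public function uri(): string { return $this->uri; }
}

interface Filesystem
{
    /** @return resource */
    public function open(Path $path, string $mode);
}

final class LocalFilesystem implements Filesystem
{
    public function open(Path $path, string $mode)
    {
        $stream = fopen($path->uri(), $mode);

        if ($stream === false) {
            throw new RuntimeException('Cannot open: ' . $path->uri());
        }

        return $stream;
    }
}

final class FlowContext
{
    public function __construct(private Filesystem $filesystem) {}

    public function filesystem(): Filesystem { return $this->filesystem; }
}

final class LinesExtractor
{
    // The extractor knows only the Path, not how to reach it.
    public function __construct(private Path $path) {}

    /** @return Generator<int, string> */
    public function extract(FlowContext $context): Generator
    {
        $stream = $context->filesystem()->open($this->path, 'rb');

        try {
            while (($line = fgets($stream)) !== false) {
                yield rtrim($line, "\r\n");
            }
        } finally {
            fclose($stream);
        }
    }
}

$file = tempnam(sys_get_temp_dir(), 'flow');
file_put_contents($file, "a\nb\n");

$context = new FlowContext(new LocalFilesystem());
$lines = iterator_to_array((new LinesExtractor(new Path($file)))->extract($context));

unlink($file);
```

Swapping `LocalFilesystem` for a remote implementation would not require any change in the extractor, which is the kind of decoupling the change is aiming for.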
There are some additional things that Path makes possible:
glob pattern with external filesystems
5. Deprecation of flow-php/etl-adapter-streams
Since Filesystem is going to be part of the core repo, it does not make sense to keep the implementation separate: it introduces a huge maintenance effort, while there is no better alternative than flysystem anyway. This means that flysystem is pretty much a Flow dependency, so it can be bundled into the core library.