v1.0.0
This release is not backwards compatible.
Additions:
- Added SampleByKey, which provides a way to sample tuples based on certain fields.
- Added Coalesce, which returns the first non-null value from a list of arguments, like SQL's COALESCE.
- Added BagGroup, which performs an in-memory group operation on a bag.
- Added ReservoirSample, which takes a fixed-size random sample using reservoir sampling
- Added the In filter function, which behaves like SQL's IN
- Added EmptyBagToNullFields, which enables multi-relation left joins using COGROUP
- Sessionize now accepts long values for timestamps, in addition to string representations of time.
- BagConcat can now operate on a bag of bags, in addition to a tuple of bags
- Added TransposeTupleToBag, which creates a bag of key-value pairs from a tuple
- SessionCount now implements the Accumulator interface
- DistinctBy now implements the Accumulator interface
- Tests now use PigUnit from Maven instead of a checked-in JAR
- Added many more test cases to improve coverage
- Improved documentation
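
The SQL-like semantics that Coalesce and In are described as following can be sketched in plain Java. This is not the DataFu implementation and does not use the Pig UDF API: the class name CoalesceSketch and both method signatures are hypothetical, and treating a null value as matching nothing in in() is an assumption made for this sketch.

```java
class CoalesceSketch {
    // Returns the first non-null argument, or null if every argument is null
    // (the behaviour of SQL's COALESCE).
    @SafeVarargs
    static <T> T coalesce(T... values) {
        for (T v : values) {
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    // Returns true when value equals any candidate (the basic behaviour of SQL's IN).
    // Assumption for this sketch: a null value never matches.
    static boolean in(Object value, Object... candidates) {
        if (value == null) {
            return false;
        }
        for (Object c : candidates) {
            if (value.equals(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(coalesce(null, "b", "c"));
        System.out.println(in(3, 1, 2, 3));
    }
}
```

In a Pig script the corresponding UDFs would be applied to fields of a relation rather than to Java arguments.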
Changes:
- Moved WeightedSample to datafu.pig.sampling
- Tests now run against Pig 0.11.1.
- Renamed package datafu.pig.numbers to datafu.pig.random
- Renamed package datafu.pig.bag.sets to datafu.pig.sets
- Renamed TimeCount to SessionCount, moved to datafu.pig.sessions
- Renamed ASSERT to Assert
- Merged MD5Base64 into the MD5 implementation; a constructor argument selects the output encoding (hex or Base64), with hex as the default
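
To illustrate the difference between the two MD5 output encodings, here is a minimal standalone Java sketch. It is not the DataFu UDF: the class name Md5EncodingSketch and the boolean base64 parameter are hypothetical stand-ins for the constructor argument, and it assumes Java 8 or later for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class Md5EncodingSketch {
    // Hashes the input with MD5 and encodes the 16-byte digest as either
    // Base64 or lowercase hex (hex being the default described above).
    static String md5(String input, boolean base64) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
        if (base64) {
            return Base64.getEncoder().encodeToString(digest);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex digits per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(md5("hello", false)); // hex encoding
        System.out.println(md5("hello", true));  // Base64 encoding
    }
}
```

Both outputs encode the same 16-byte digest; the hex form is 32 characters long and the padded Base64 form is 24.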
Removals:
- Removed ApplyQuantiles
- Removed AliasBagFields, since the same result can now be achieved with a nested foreach
Fixes:
- Quantile now outputs schemas consistent with StreamingQuantile
- The necessary fastutil classes are now packaged in the datafu JAR, so the fastutil JAR is no longer needed as a dependency
- Non-deterministic UDFs are now marked as such