-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-2147] Added lookback time fetch in partitioned filesource #4044
base: master
Are you sure you want to change the base?
Conversation
/** | ||
* DEFAULT LOOKBACK TIME KEY property | ||
*/ | ||
public static final String DEFAULT_COPY_LOOKBACK_TIME_KEY = "copy.lookbackTime"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the datapacks MP, this config is used to set another config (configX), which is used at various places in gobblin to find the lookback period. Please use that configX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The property is "gobblin.copy.recursive.lookback.time", isn't it will be confusing to use other DatasetFinder config in other finder class ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not 100% clear here...
is the issue that:
- the same prop name is already in use elsewhere so we should avoid conflicting?
- that a config w/ similar semantics already exists, and we might intentionally borrow it here for uniformity?
- or something else...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in other templates copy.lookbackTime is assigned to "gobblin.copy.recursive.lookback.time", but in templates where this Filebasedsource are used there is no such property so even if we use the recursive property key the value will be empty
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4044 +/- ##
============================================
+ Coverage 38.79% 41.14% +2.35%
- Complexity 1599 2204 +605
============================================
Files 388 480 +92
Lines 15998 20360 +4362
Branches 1585 2355 +770
============================================
+ Hits 6207 8378 +2171
- Misses 9293 11090 +1797
- Partials 498 892 +394 ☔ View full report in Codecov by Sentry. |
@@ -129,8 +134,15 @@ public List<FileInfo> getFilesToProcess(long minWatermark, int maxFilesToReturn) | |||
new FileInfo(fileStatus.getPath().toString(), fileStatus.getLen(), date.getMillis(), partitionPath)); | |||
} | |||
} | |||
|
|||
if (growthTracker.isAnotherMilestone(iteration++)) { | |||
LOG.info("~{}~ collected files to process", filesToProcess.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this log the same number of files every time there's a milestone? I believe you intend to log how many files were processed thus far when reaching each milestone... correct?
ideally you might put some description in the first substitution (~{}~
) to contextualize the processing, and add the size in addition.
also, collapse adjacent logging calls, rather than keeping separate. e.g.
LOG.info("~{}~ collected {} (of {}) files to process; most-recent source path: '{}'",
description, iteration, files.size(), sourcePath);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the log lines , it was confusing initially.
filesToProcess -- is the list of file we are collecting here,
only goal here to add log inside is to find if code is stucked or not as we saw that due to too much granularity code was stucked and we have to get thread dump to see where the code was stucked and nothing was in log lines.
gobblin-core/src/main/java/org/apache/gobblin/source/DatePartitionedNestedRetriever.java
Outdated
Show resolved
Hide resolved
gobblin-core/src/main/java/org/apache/gobblin/source/DatePartitionedNestedRetriever.java
Outdated
Show resolved
Hide resolved
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionAwareFileRetrieverUtils.java
Outdated
Show resolved
Hide resolved
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionAwareFileRetrieverUtils.java
Outdated
Show resolved
Hide resolved
* @return an {@link Optional} containing the {@link Duration} if the lookback time is valid, or | ||
* an empty {@link Optional} if the lookback time is invalid or cannot be parsed. | ||
*/ | ||
public static Optional<Duration> getLookbackTimeDuration(String lookBackTime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although I am a big proponent of Optional
, in this case it loses the specific errors, putting them in the log rather than "returning" them to the caller, where they might be presented in a meaningful end-user error message (rather than logs, which are for developers).
since string parsing like this regularly requires error handling, it seems reasonable and basically expected to throw an exception. you can avow in javadoc that the Duration
return value is never null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to throw IOException
/** | ||
* DEFAULT LOOKBACK TIME KEY property | ||
*/ | ||
public static final String DEFAULT_COPY_LOOKBACK_TIME_KEY = "copy.lookbackTime"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not 100% clear here...
is the issue that:
- the same prop name is already in use elsewhere so we should avoid conflicting?
- that a config w/ similar semantics already exists, and we might intentionally borrow it here for uniformity?
- or something else...?
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Tests
PartitionAwareFileRetrieverUtilsTest
Commits