Ensure splits are sorted by path name #200

Open: wants to merge 1 commit into base: master
12 changes: 8 additions & 4 deletions src/main/java/org/seqdoop/hadoop_bam/AnySAMInputFormat.java
@@ -24,6 +24,7 @@

import java.io.IOException;
import java.util.ArrayList;
+import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
@@ -229,18 +230,21 @@ else if (split instanceof FileVirtualSplit)
final List<InputSplit> origSplits =
BAMInputFormat.removeIndexFiles(super.getSplits(job));

+final List<InputSplit> sortedSplits = new ArrayList<>(origSplits);
+sortedSplits.sort(Comparator.comparing(split -> ((FileSplit) split).getPath()));
Contributor

Is this cast problematic? I ran into a case where some splits are FileVirtualSplits, which are not a subtype of FileSplit, and there isn't any common supertype that offers getPath. Maybe FileVirtualSplit should extend FileSplit?

Member Author

AnySAMInputFormat extends Hadoop's FileInputFormat, which returns FileSplit objects and never FileVirtualSplit. So the cast is safe at this point in the code.
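
For readers following the thread, here is a minimal, self-contained sketch of the two points discussed: sorting splits by path, and what a type-tolerant path lookup might look like if mixed split types ever reached this code. It does not use Hadoop. `Split`, `PlainSplit`, `VirtualSplit`, `pathOf` and `sortedByPath` are hypothetical stand-ins (for `InputSplit`, `FileSplit`, `FileVirtualSplit` and helper code), and it assumes both split types expose a path accessor. It is an illustration, not the change proposed in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortSplitsSketch {
    // Hypothetical stand-in for Hadoop's InputSplit.
    interface Split {}

    // Hypothetical stand-in for FileSplit: a path plus a start offset.
    static final class PlainSplit implements Split {
        final String path;
        final long start;
        PlainSplit(String path, long start) { this.path = path; this.start = start; }
        String getPath() { return path; }
    }

    // Hypothetical stand-in for FileVirtualSplit: unrelated type, same accessor name.
    static final class VirtualSplit implements Split {
        final String path;
        VirtualSplit(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Hypothetical helper: resolves the path without assuming one concrete type,
    // and fails loudly on anything it does not recognise.
    static String pathOf(Split s) {
        if (s instanceof PlainSplit) return ((PlainSplit) s).getPath();
        if (s instanceof VirtualSplit) return ((VirtualSplit) s).getPath();
        throw new IllegalArgumentException("Unsupported split type: " + s.getClass());
    }

    // Sorts a copy of the list by path. List.sort is stable, so splits that
    // share a path keep their original relative order.
    static List<Split> sortedByPath(List<? extends Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparing(SortSplitsSketch::pathOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new PlainSplit("b.bam", 0),
            new PlainSplit("a.sam", 0),
            new PlainSplit("b.bam", 100),
            new VirtualSplit("a.cram"));
        for (Split s : sortedByPath(splits)) {
            System.out.println(pathOf(s));
        }
    }
}
```

Because the sort is stable, the two `b.bam` entries stay in their original order (offset 0 before offset 100), so ordering by path alone does not shuffle the splits within a single file.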


// We have to partition the splits by input format and hand them over to
// the *InputFormats for any further handling.
//
// BAMInputFormat and CRAMInputFormat need to change the split boundaries, so we can
// just extract the BAM and CRAM ones and leave the rest as they are.

final List<InputSplit>
-    bamOrigSplits = new ArrayList<InputSplit>(origSplits.size()),
-    cramOrigSplits = new ArrayList<InputSplit>(origSplits.size()),
-    newSplits = new ArrayList<InputSplit>(origSplits.size());
+    bamOrigSplits = new ArrayList<InputSplit>(sortedSplits.size()),
+    cramOrigSplits = new ArrayList<InputSplit>(sortedSplits.size()),
+    newSplits = new ArrayList<InputSplit>(sortedSplits.size());

-for (final InputSplit iSplit : origSplits) {
+for (final InputSplit iSplit : sortedSplits) {
final FileSplit split = (FileSplit)iSplit;

if (SAMFormat.BAM.equals(getFormat(split.getPath())))