
GlusterVolume modification to wrap getPos() functionality in underlying ... #59

Open

jayunit100 wants to merge 1 commit into master

Conversation

jayunit100
Contributor

This pull request wraps calls to getPos() and records the last value, returning that last value in the case that the underlying stream closes. This fix is essential for distributed MapReduce jobs that use the mapred.* API, because that API expects a working call to FSInputStream.getPos() even after the stream has been read. And because RawLocalFileSystem buffers that stream (unlike DFSInputStream, which doesn't wrap the stream), the buffer closes and sets the underlying stream to null. See BZ 100362 for details.
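A minimal sketch of the caching approach described above, assuming illustrative names; this is not the actual GlusterVolume patch:

import java.io.IOException;
import org.apache.hadoop.fs.FSInputStream;

// Wraps a real stream, records the last known position, and falls back
// to it once the underlying stream has been closed or nulled.
public class LastPosCachingInputStream extends FSInputStream {

    private final FSInputStream wrapped;
    private long lastPos = 0;

    public LastPosCachingInputStream(FSInputStream wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public synchronized long getPos() throws IOException {
        try {
            lastPos = wrapped.getPos(); // record the latest known position
        } catch (IOException e) {
            // Stream reported itself closed: fall back to the cached value.
        } catch (NullPointerException e) {
            // Inner reference nulled by a wrapping buffer: same fallback.
        }
        return lastPos;
    }

    @Override
    public int read() throws IOException {
        return wrapped.read();
    }

    @Override
    public void seek(long pos) throws IOException {
        wrapped.seek(pos);
        lastPos = pos;
    }

    @Override
    public boolean seekToNewSource(long targetPos) throws IOException {
        return wrapped.seekToNewSource(targetPos);
    }

    @Override
    public void close() throws IOException {
        wrapped.close(); // lastPos deliberately survives the close
    }
}

With this wrapper, a caller that queries getPos() after close() gets the last position that was observed, mirroring the DFSClient-style behaviour the mapred.* API expects.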

@childsb
Contributor

childsb commented Sep 18, 2013

A typical file I/O workflow goes:
openFile();
fileIOOperations()
closeFile()

any type of I/O after closeFile() is invalid. The GlusterVolume class is correctly throwing an exception if an attempt is made to access the file's stream position after it's been closed. There is no stream open, thus there is no valid position after the stream is closed, and calling getPos() on a closed stream means the open/read/close flow is broken further up the calling chain.
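Concretely, the disputed call order looks like this (hypothetical client code against the Hadoop FileSystem API, not taken from the PR):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BrokenFlowExample {
    static void brokenFlow(FileSystem fs, Path p) throws IOException {
        FSDataInputStream in = fs.open(p);
        long before = in.getPos();  // valid: the stream is open
        in.read(new byte[128]);
        in.close();
        long after = in.getPos();   // broken flow: getPos() after close()
    }
}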

I don't like nerfing up the code by catching exceptions and not reporting them. The responsibility is on the caller NOT to attempt file I/O operations after it has closed the file or stream. If the caller attempts to do so, we're justified in throwing an exception.

I would like to see the open(), read(), close(), getPos() code that's exhibiting this incorrect I/O workflow before OKing guarding against the behavior at the file system level.

@jayunit100
Contributor Author

To clarify Brad's comment: it is true that the bug can be thought of as existing in the RecordReader implementation in the mapred version of MultiFileWordCount, and not necessarily as a bug in our shim. This patch allows you to call getPos() so that the behaviour is similar to the DFSClient stream: it always returns a valid long. Without the patch, the version of MultiFileWordCount in hadoop-examples will not run. I'm in agreement that the overlying "bug" is in the older MultiFileWordCount impl, but nevertheless, the older mapred.* MultiFileWordCount impl does work on HDFS due to the way the DFSClient is implemented.

@jayunit100
Contributor Author

The offending code in the RecordReader for MultiFileWordCount violates the access pattern; here is why (I agree it's not the right way to use a stream). But in any case, here is the code path Brad requested to see:

By the way, some of this code is JDK dependent as well; see how streams get set to null when closed: http://grepcode.com/file_/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/BufferedReader.java/?v=source

  1. The RecordReader calls getPos() BEFORE and AFTER reading a record, in the TrackingRecordReader implementation, which is in mapred.

  2. At the first call, before the underlying stream is closed, getPos() returns something valid. Then the stream is closed, and the implementation in the MultiFileWordCount custom RecordReader returns the result of reading getPos() from the "currentStream".

  3. But, since currentStream is actually wrapped by a buffered stream, which sets currentStream to null when it's closed, the second call to getPos() fails with an NPE.

So the code path is:

#In the MapTask in the mapred.* framework, hadoop 1.X
TrackingRecordReader calls MultiFileWordCountRecordReader.getPos()

#In the MultiFileWordCount MapReduce job - in hadoop-examples for 1.X
MultiFileWordCountRecordReader.getPos() calls currentStream.getPos():

public long getPos() throws IOException {
    long currentOffset = currentStream == null ? 0 : currentStream.getPos();
    return offset + currentOffset;
}

and currentStream has already been set to null by the overlying BufferedReader, since currentReader is closed and currentReader is an instance of BufferedReader:

#Why is an NPE thrown? Because of the way BufferedReader works... Finally, from src/examples/org/apache/hadoop/examples/MultiFileWordCount.java:
currentReader=new BufferedReader(new InputStreamReader(currentStream));
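A self-contained JDK demo of that close behavior (nothing Hadoop-specific; it only shows that closing the outer BufferedReader closes and detaches the whole chain while the currentStream reference itself stays non-null):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ClosedReaderDemo {
    public static void main(String[] args) throws IOException {
        InputStream currentStream = new ByteArrayInputStream("data".getBytes());
        BufferedReader currentReader =
                new BufferedReader(new InputStreamReader(currentStream));

        System.out.println(currentReader.readLine()); // prints "data"
        currentReader.close(); // BufferedReader nulls its internal reader

        // currentStream (the reference) is still non-null here, but the
        // whole chain has been closed; further reads through currentReader
        // throw IOException("Stream closed") rather than returning data.
    }
}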

@childsb
Contributor

childsb commented Sep 18, 2013

Will you walk through the error/code in TrackingRecordReader by line number?

@jayunit100
Contributor Author

From ./src/examples/org/apache/hadoop/examples/MultiFileWordCount.java in hadoop 1.2:

151 public long getPos() throws IOException {
152 long currentOffset = currentStream == null ? 0 : currentStream.getPos();
153 return offset + currentOffset;
154 }

The error is thrown at getPos(), because currentStream is not null, but the InputStreamReader is null. Makes sense?
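In other words, the null check on line 152 only guards currentStream itself; it cannot see a nulled field inside the closed chain. A hypothetical wrapper sketch of that failure mode (illustrative names, not Hadoop source):

import java.io.IOException;

// Hypothetical stand-in for a buffering stream that nulls its inner
// reference on close; illustrates why a "currentStream == null" guard
// does not prevent the NPE described above.
class BufferingPosStream {
    interface PosStream {
        long getPos() throws IOException;
        void close() throws IOException;
    }

    private PosStream in; // set to null by close()

    BufferingPosStream(PosStream in) { this.in = in; }

    long getPos() throws IOException {
        return in.getPos(); // NPE once in == null
    }

    void close() throws IOException {
        try { in.close(); } finally { in = null; }
    }
}

Here the caller's reference to the BufferingPosStream is still non-null after close(), so the line-152 guard passes, and the dereference of the nulled inner field is what actually throws.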

@childsb
Contributor

childsb commented Oct 7, 2013

Per the code you pasted in, if the stream is null, "0" is assumed; if it's not, then the position is queried.

This shows that MultiFileWordCount is already guarding somewhat so that a position is returned even if a stream doesn't exist, so I don't see why the GlusterVolume code should be doing the same thing.

@jayunit100
Contributor Author

Alas, you're right! The problem is that currentStream is not null.
However, the InputStreamReader IS set to null.
That is the problem :) !

@jayunit100
Contributor Author

Does it make sense now? The currentStream still exists, but the overlying wrapper (the buffer) SETs the input stream reader to null.

Thus, when the input stream reader is accessed, an NPE is thrown. But the logic in this method only checks the current stream.

This is NOT a problem in HDFS, because in HDFS the underlying stream doesn't proxy calls to getPos().
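For contrast, a sketch of the field-backed design that keeps getPos() safe in HDFS-style streams (an assumption about the general design, not a quote of the DFSClient source):

// Position lives in a plain field, so getPos() never dereferences a
// wrapped stream and stays valid even after close().
class FieldBackedPosStream {
    private long pos = 0;

    void advance(int n) { pos += n; }  // stand-in for read() bookkeeping
    long getPos() { return pos; }      // plain field: safe after close()
    void close() { /* pos is left intact; getPos() keeps working */ }
}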

@jayunit100
Contributor Author

https://issues.apache.org/jira/browse/MAPREDUCE-5572 <--- corresponding JIRA pointing at the root cause, which is a custom RecordReader that relies on undefined FileSystem semantics

@childsb
Contributor

childsb commented Oct 9, 2013
