-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GlusterVolume modification to wrap getPos() functionality in underlying ... #59
base: master
Are you sure you want to change the base?
Conversation
A typical file I/O workflow goes: any type of I/O after closeFile() is invalid. GlusterVolume class is correctly throwing an exception if an attempt is made to access the file's stream position after it's been closed. there is no stream open, thus there is no valid position after the stream is closed, and calling getPos() on a closed stream means the open/read/close flow is broken further up the calling chain. i dont like nurfing up the code by catching exceptions and not reporting them. the responsibility is on the caller NOT to attempt file I/O operations after its closed the file or stream. if the caller attempts to do so, we're justified throwing exception. i would like to see the open(), read(), close(), getPos() code thats exhibiting this incorrect I/O workflow before OKing guarding against the behavior at a file system level. |
To clarify brad's comment: IT is true that the bug can be thought of as existing in the RR implementation in mapred version of MultiFileWordCount, and not necessarily a bug in our shim. This allows you to call "getPos()" so that the behaviour is similar to DFSClient stream - always returning a valid long. without the patch, the version of MultiFileWordCount in hadoop-examples will not run. Im in agreement that the overlying "bug" is in the older MultiFileWordCount impl, but nevertheless, the older mapred.* MultiFileWordCount impl does work on HDFS due to the way the DFSClient is implemented. |
The offending code in RR for MultiFileWordCount violates the access pattern, here is why. (i agree its not the right way to use a stream). But in any case, here is the code path brad requests to see: By the way, some of this code is JDK dependant also, see how streams get set to null whenclosed : http://grepcode.com/file_/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/BufferedReader.java/?v=source
So the code path is that #In the MapTask in mapred.* framework, hadoop 1.X #In the MultiFileWordCount MapReduce job - in hadoop-examples for 1.X public long gc long getPos() throws IOException { and currentStream has set to null already, by the overlying BufferedReader : since the currentReader is closed and currentreader is an instance of BufferedReader : #Why is NPE thrown ? Because of the way bufferedreader works... Finally, from src/examples/org/apache/hadoop/examples/MultiFileWordCount.java |
will you walk through error/code in TrackingRecordReader by line number? |
From ./src/examples/org/apache/hadoop/examples/MultiFileWordCount.java in hadoop 1.2: 151 public long getPos() throws IOException { The error is thrown at getPos(), because cuurentStream is not null, but the inputStreamReader is null. makes sense? |
per the code you pasted in, if the stream is null "0" is assumed; if its not then position is queried. this shows that the MuleFileWordCount is guarding somewhat that a position is returned even if a stream doesn't exist, so i dont see why the GlusterVolume code should be doing the same thing. |
Alas, your right ! The problem is that currentStream is not null. |
does it make sense now? The currentStream still exists, but the overlying wrapper (the Buffer) SETs the input stream reader to null. Thus, when the input stream reader is accesssed, a NPE is thrown. But the logic in this method only checks the current stream. this is NOT a problem in HDFS, because in HDFS, the underlying stream doesnt proxy calls to getPos(). |
https://issues.apache.org/jira/browse/MAPREDUCE-5572 <--- corresponding jira to point to the root cause which is a custom RecordReader which relies on undefined FileSYstem semantics |
955da2a
to
9cab8ab
Compare
This PullRequest Wraps calls to get pos and records the last value, returning that last value in the case that the underlying stream closes. This fix is essential for distributed mapreduce jobs that use the mapred.* API, because that API expects a working call to FSInputStream.getPos() even after the stream has been read: And because the RawLocalFileSystem buffers that stream (unlike DFSInputStream, which handles doesnt wrap the stream), the buffer closes and sets the underlying stream to null . See BZ 100362 for details.