-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Functionality to accept compressed files as input to predict when using a Predictor #5299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but needs a few tweaks.
I would be OK with not having the manual override functionality at all. If you name your gzip file "something.bz2"
, you deserve what you get. But if you have a scenario in mind where we need this, let's keep it. It's good practice.
allennlp/commands/predict.py
Outdated
try: | ||
with open_compressed(input_file) as file_input: | ||
for line in file_input: | ||
if not line.isspace(): | ||
yield self._predictor.load_line(line) | ||
except OSError: | ||
if self.compression_type: | ||
with open_compressed(input_file, self.compression_type) as file_input: | ||
for line in file_input: | ||
if not line.isspace(): | ||
yield self._predictor.load_line(line) | ||
else: | ||
print( | ||
"Automatic detection of compression type failed, please specify the compression type argument" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic needs to be the other way around. If the compression type is specified, we have to always respect it. If it's not specified, we autodetect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, will incorporate it
allennlp/common/file_utils.py
Outdated
@@ -1085,20 +1085,27 @@ def get_file_extension(path: str, dot=True, lower: bool = True): | |||
|
|||
|
|||
def open_compressed( | |||
filename: Union[str, PathLike], mode: str = "rt", encoding: Optional[str] = "UTF-8", **kwargs | |||
filename: Union[str, PathLike], | |||
compression_type: str = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the last parameter, so we don't break existing usage of positional arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the type of this should be Optional[None]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the make typecheck
commands gave me incompatibility error on changing it to Optional[None]
so I kept it as Optional[str]
, will that be fine?
allennlp/common/file_utils.py
Outdated
elif filename.endswith(".bz2"): | ||
import bz2 | ||
compression_modules = {"gz": "gzip", "bz2": "bz2", "lzma": "lzma"} | ||
if not compression_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I pass in an empty string for compression_type
, we will go down this path. I don't think that's what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I simply consider an empty string for compression_type to be equivalent to it being 'None' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be surprising to a user of the library. You can just compare to None
:
if not compression_type: | |
if compression_type is None: |
allennlp/common/file_utils.py
Outdated
open_fn = module.open | ||
break | ||
else: | ||
module = __import__(compression_modules[extension]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think extension
is undefined here? Or am I blind and can't see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that was an error at my end, will fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close to done!
@@ -152,6 +160,7 @@ def __init__( | |||
batch_size: int, | |||
print_to_console: bool, | |||
has_dataset_reader: bool, | |||
compression_type: str = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type should be Optional[str]
.
except OSError: | ||
print("please specify the correct compression type argument.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why OSError
? I don't think you have to catch this exception at all. If it fails, it fails. The only thing we might want to do is make sure that open_compressed()
throws exceptions that are understandable.
open_fn = module.open | ||
else: | ||
for extension in compression_modules: | ||
if filename.endswith(extension): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use os.path.splitext()
here to make that detection? I don't want a file named info.fogbugz
to show up as a gzip file.
Since I can't push to your branch and you are MIA, I'm continuing this in #5578. |
Fixes #5237
Changes proposed in this pull request:
Before submitting
section of the
CONTRIBUTING
docs.Writing docstrings section of the
CONTRIBUTING
docs.After submitting
codecov/patch
reports high test coverage (at least 90%).You can find this under the "Actions" tab of the pull request once the other checks have finished.