Linguist is tongue-tied when it encounters invalid/inconsistent encodings. #830
Comments
You might want to take a look at my last comment from the pull linked above.
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue on Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly converts text to UTF-8, replacing invalid characters prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
- Adds the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) test suite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
Pull Request #841 closes this if merged.
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue on Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly converts text to UTF-8, replacing invalid characters prior to splitting into lines and/or parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb) and [sample.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/sample.rb).
- Adds a test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) test suite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue on Dec 23, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.7. But can I ask which way you're leaning here? If it's OK, I'm going to go ahead and re-open the issue; that way you can (a) close the issue if/when you choose to merge this, (b) close the pull if you think this will be resolved another way, or (c) close them both if this is a wontfix. It's totally fine however you choose; it's your project, after all. I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. It addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8 input, converting the text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8 input, converting to UTF-16 and then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Working around incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.

See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.

It also includes the following new test cases for the above, all taken from real repositories here on GitHub:

- [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
- [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
- [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
- [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue on Dec 29, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.8. But can I ask which way you're leaning here? If it's OK, I'm going to go ahead and re-open the issue; that way you can (a) close the issue if/when you choose to merge this, (b) close the pull if you think this will be resolved another way, or (c) close them both if this is a wontfix. It's totally fine however you choose; it's your project, after all. I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. It addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8 input, converting the text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8 input, converting to UTF-16 and then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Working around incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.

See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.

It also includes the following new test cases for the above, all taken from real repositories here on GitHub:

- [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
- [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
- [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
- [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
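
For anyone skimming, the three steps in the commit message can be sketched roughly as below. This is only an illustration and not the code in blob_helper.rb: `scrub_to_utf8` and its `detected_encoding` parameter are hypothetical names, the `'?'` replacement character is an arbitrary choice, and it assumes Ruby 2.0 or later (for `String#b` and `**options`), so it would not run on 1.8.7.

```ruby
# Hypothetical helper, NOT Linguist's actual implementation.
# data:              raw bytes read from a blob
# detected_encoding: encoding name reported by a detector (assumed input)
def scrub_to_utf8(data, detected_encoding = 'UTF-8')
  # Workaround for names Ruby does not know: Encoding.find raises
  # ArgumentError, so fall back to treating the data as raw bytes.
  source = begin
    Encoding.find(detected_encoding)
  rescue ArgumentError
    Encoding::BINARY
  end

  text = data.dup.force_encoding(source)
  options = { invalid: :replace, undef: :replace, replace: '?' }

  if source == Encoding::UTF_8 || source == Encoding::US_ASCII
    # Round trip through UTF-16 so that invalid bytes are actually
    # visited and replaced on the way out.
    text.encode(Encoding::UTF_16BE, **options).encode(Encoding::UTF_8)
  else
    # Different source encoding: convert straight to UTF-8, replacing
    # bytes that are invalid or have no UTF-8 equivalent.
    text.encode(Encoding::UTF_8, **options)
  end
end

clean = scrub_to_utf8("abc\xFFdef".b)   # => "abc?def"
lines = clean.split("\n")               # safe to split and tokenize now
```

The fallback to `Encoding::BINARY` is one possible reading of the "workaround for incorrect encodings" bullet; it keeps ASCII bytes and replaces everything else, which loses information but avoids raising.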
See #829, or just:
Oh the indignity!