Linguist is tongue-tied when it encounters invalid/inconsistent encodings. #830

geoff-nixon · 2013-12-12T12:58:53Z

See #829, or just:

curl -OL http://llvm.org/svn/llvm-project/llvm/trunk/utils/lit/tests/shtest-encoding.py
linguist shtest-encoding.py
...

Oh the indignity!


geoff-nixon · 2013-12-12T13:49:02Z

Put that in a gist and... what would you call that? A spinning octocat of death?

I believe this is reproducible: clone a dummy gist; push with the bad file; click edit; click cancel.

geoff-nixon · 2013-12-13T01:22:28Z

You might want to take a look at my last comment from the pull linked above.
Or just skip to the gist.

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly converts text to UTF-8, replacing invalid characters prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
- Adds the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.

geoff-nixon · 2013-12-16T12:28:30Z

Pull Request #841 closes this if merged.

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly converts text to UTF-8, replacing invalid characters prior to splitting into lines and/or parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb) and [sample.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/sample.rb).
- Adds a test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
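For anyone skimming, the "convert to UTF-8, replacing invalid characters, then split into lines" step could look roughly like the sketch below. It is only an illustration, not the code from the pull request: `to_utf8_lines` and its `source_encoding` argument are made-up names, and it relies on `String#encode`, so it assumes Ruby 1.9 or later (1.8.7 has no `String#encode`).

```ruby
# Hypothetical helper, NOT Linguist's actual implementation.
# Assumes Ruby 1.9+ and that the caller already knows (or has detected)
# the source encoding of the raw bytes.
def to_utf8_lines(raw, source_encoding)
  # Tag the bytes with the encoding they are believed to be in.
  text = raw.dup.force_encoding(source_encoding)

  # Transcode to UTF-8. Byte sequences that are invalid in the source
  # encoding, or that have no UTF-8 equivalent, are replaced instead of
  # raising Encoding::InvalidByteSequenceError / UndefinedConversionError.
  utf8 = text.encode("UTF-8", invalid: :replace, undef: :replace)

  # Only split once the string is known to be valid UTF-8.
  utf8.split(/\r?\n/)
end

lines = to_utf8_lines("caf\xE9\nbar", "ISO-8859-1")
puts lines.inspect
```

Doing the replacement before the split matters because splitting a string on a regexp can raise on invalid byte sequences.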

…errors). So I've gone ahead and rebased this onto 2.10.7... But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this; b) close the pull if you think this will be resolved another way; or c) close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
- Includes the following new test cases for the above, all taken from real repositories here on GitHub:
  - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
  - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
  - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
  - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
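The UTF-16 round trip for text that claims to be UTF-8 could be sketched like this. Again, this is an assumption-laden illustration rather than the pull request's code: `scrub_via_utf16` is a hypothetical name, and the sketch assumes Ruby 1.9+ (Ruby 2.1 and later also offer `String#scrub`, which makes the detour unnecessary).

```ruby
# Hypothetical helper, NOT Linguist's actual implementation.
# Assumes Ruby 1.9+.
def scrub_via_utf16(raw)
  text = raw.dup.force_encoding("UTF-8")

  # Nothing to repair if the bytes already form valid UTF-8.
  return text if text.valid_encoding?

  # Encoding UTF-8 -> UTF-8 may be treated as a no-op, so go through a
  # different Unicode encoding to force the invalid bytes to be replaced.
  # With a Unicode destination, the default replacement is U+FFFD.
  utf16 = text.encode("UTF-16BE", invalid: :replace, undef: :replace)

  # Everything in UTF-16 is representable in UTF-8, so this step is lossless.
  utf16.encode("UTF-8")
end

cleaned = scrub_via_utf16("ok \xFF ok\nsecond line")
puts cleaned.split("\n").length
```

UTF-16 is a convenient intermediate because it can represent every Unicode character, so the only data lost is the bytes that were invalid in the first place.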

…errors). So I've gone ahead and rebased this onto 2.10.8... But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this; b) close the pull if you think this will be resolved another way; or c) close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
- Includes the following new test cases for the above, all taken from real repositories here on GitHub:
  - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
  - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
  - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
  - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
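One possible shape for the "encoding reported by the detector is not usable" workaround is sketched below. This is a guess at the general idea, not what the pull request does: `ruby_encoding_for` and `decode_with_reported_encoding` are hypothetical names, the binary fallback is an assumption, and the sketch assumes Ruby 1.9+. It takes the reported encoding name as a plain string and does not call charlock itself.

```ruby
# Hypothetical helpers, NOT Linguist's actual implementation.
# Assumes Ruby 1.9+.

# Map a detector-reported encoding name to a Ruby Encoding, falling back
# to a default when Ruby does not recognise the name.
def ruby_encoding_for(name, fallback = Encoding::BINARY)
  Encoding.find(name.to_s)
rescue ArgumentError
  # Encoding.find raises ArgumentError for unknown encoding names.
  fallback
end

# Decode raw bytes using the reported name if Ruby knows it; otherwise
# treat the input as binary, so every non-ASCII byte gets replaced.
# Note: if the reported name is already UTF-8, some Ruby versions may
# leave invalid bytes untouched here.
def decode_with_reported_encoding(raw, reported_name)
  encoding = ruby_encoding_for(reported_name)
  raw.dup.force_encoding(encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

puts ruby_encoding_for("UTF-8").name
puts decode_with_reported_encoding("x\xE9y", "definitely-not-an-encoding").valid_encoding?
```

Falling back to binary is lossy for non-ASCII bytes, but it keeps the ASCII portion of the file intact, which is usually enough for tokenizing source code.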

geoff-nixon mentioned this issue Dec 13, 2013

Update libmagic; link system ICU on Darwin; use RbConfig CFLAGS, etc. brianmario/charlock_holmes#55

Closed

geoff-nixon mentioned this issue Dec 16, 2013

Fix handling of invalid UTF-8 byte sequences in Ruby 1.9+. #840

Closed

geoff-nixon mentioned this issue Dec 16, 2013

Fix handling of invalid UTF-8 byte sequences in Ruby 1.9+. #841

Closed

geoff-nixon closed this as completed Dec 16, 2013

geoff-nixon mentioned this issue Dec 17, 2013

Fix handling of invalid UTF-8 (and other character encoding errors). #845

Closed

geoff-nixon reopened this Dec 23, 2013

geoff-nixon closed this as completed Dec 23, 2013

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
