Linguist is tongue-tied when it encounters invalid/inconsistent encodings. #830

Closed
geoff-nixon opened this issue Dec 12, 2013 · 3 comments

Comments

@geoff-nixon

See #829, or just:

curl -OL http://llvm.org/svn/llvm-project/llvm/trunk/utils/lit/tests/shtest-encoding.py
linguist shtest-encoding.py
...

Oh the indignity!

@geoff-nixon

Put that in a gist and... what would you call that? A spinning octocat of death?

I believe this is reproducible: clone a dummy gist; push with the bad file; click edit; click cancel.

[Image: octocat of death]

@geoff-nixon

You might want to take a look at my last comment from the pull linked above.
Or just skip to the gist.

geoff-nixon pushed a commit to pullreq/linguist that referenced this issue Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Basically:
 - explicitly convert text to UTF-8, replacing invalid characters
   prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
 - Adds the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
@geoff-nixon

Pull Request #841 closes this if merged.

geoff-nixon pushed a commit to pullreq/linguist that referenced this issue Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Basically:
 - explicitly convert text to UTF-8, replacing invalid characters prior to splitting into lines and/or parsing.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb) and [sample.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/sample.rb).
 - Adds a test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue Dec 23, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.7...

But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this, b) close the pull if you think this will be resolved another way, or c) close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess?

Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:
 - For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
 - For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
 - Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
 - Includes the following new test cases for the above, all taken from real repositories here on GitHub:
    - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
    - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
    - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
    - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
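
For illustration only, the three strategies listed above might look roughly like the following in plain Ruby. This is a minimal sketch, not the code from the pull request: `scrub_utf8`, `to_valid_utf8`, and `REPLACE` are hypothetical names, and the encoding name is assumed to arrive as a string from whatever detector is in use.

```ruby
# Hypothetical sketch; these helpers are not part of Linguist.
REPLACE = { invalid: :replace, undef: :replace }.freeze

# Data that claims to be ASCII/UTF-8: round-trip through UTF-16 so that
# invalid byte sequences are actually replaced (a same-encoding conversion
# may be treated as a no-op and leave them in place).
def scrub_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
       .encode(Encoding::UTF_16LE, **REPLACE)
       .encode(Encoding::UTF_8)
end

# Convert raw bytes to valid UTF-8 given a detected encoding name.
def to_valid_utf8(bytes, detected_name)
  source = Encoding.find(detected_name.to_s)
  if [Encoding::UTF_8, Encoding::US_ASCII].include?(source)
    return scrub_utf8(bytes)
  end
  # Any other known encoding: transcode, replacing bytes that are invalid
  # in the source or have no UTF-8 equivalent.
  bytes.dup.force_encoding(source).encode(Encoding::UTF_8, **REPLACE)
rescue ArgumentError, Encoding::ConverterNotFoundError
  # Name unknown to Ruby (e.g. "IBM424_rtl") or no converter available:
  # fall back to treating the data as UTF-8.
  scrub_utf8(bytes)
end
```

With something like this in place, a call such as `to_valid_utf8(File.binread(path), name).lines` should no longer raise on the sample files, at the cost of showing replacement characters where bytes could not be decoded.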
@geoff-nixon geoff-nixon reopened this Dec 23, 2013
geoff-nixon pushed a commit to pullreq/linguist that referenced this issue Dec 29, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.8...

But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this, b) close the pull if you think this will be resolved another way, or c) close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess?

Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:
 - For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
 - For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
 - Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
 - Includes the following new test cases for the above, all taken from real repositories here on GitHub:
    - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
    - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
    - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
    - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024