Fix handling of invalid UTF-8 byte sequences in Ruby 1.9+.
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Basically:
 - Explicitly converts text to UTF-8, replacing invalid characters
   prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
 - Adds the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
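
As a rough illustration of the first bullet, and not the code in this commit: on Ruby 1.9+ a regex `split` on a string holding invalid UTF-8 bytes raises `ArgumentError`, so the offending bytes have to be replaced before the text is split into lines. The sketch below uses `String#scrub` (Ruby 2.1+, so newer than the Rubies tested here) instead of the `Encoding::Converter` used in the patch. `utf8_lines` is a hypothetical name, and the input is assumed to be intended as UTF-8.

```ruby
# Hypothetical helper for illustration only; not the method added in this commit.
# Assumes the raw bytes are meant to be UTF-8.
def utf8_lines(raw)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Replace each invalid byte sequence with U+FFFD (requires Ruby 2.1+).
  text = text.scrub("\uFFFD") unless text.valid_encoding?
  # Same line-splitting regex as Blob#lines; -1 keeps trailing empty fields.
  text.split(/\r\n|\r|\n/, -1)
end

# A lone 0xC2 lead byte followed by "." is not valid UTF-8.
raw = "# ok\n# bad byte: \xC2.\n".b
lines = utf8_lines(raw)
```

Here `lines` contains only valid UTF-8 strings, with the stray byte replaced by U+FFFD, so later regex matching no longer raises.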
Geoff committed Dec 16, 2013
1 parent 03cadf2 commit 12b0706
Showing 2 changed files with 20 additions and 2 deletions.
18 changes: 16 additions & 2 deletions lib/linguist/blob_helper.rb
@@ -233,6 +233,20 @@ def vendored?
       name =~ VendoredRegexp ? true : false
     end

+    # Internal: Explicitly remove invalid UTF-8 sequences by conversion.
+    #
+    # Only affects Ruby 1.9+ since 1.8 is charset naive.
+    #
+    # Returns the data blob with invalid characters replaced with \uFFFD if needed.
+    def _safe_data
+      if ''.respond_to?(:encode!)
+        safe_utf8 = Encoding::Converter.new(encoding, 'utf-8', :invalid => :replace)
+        safe_utf8.convert(data).dump
+      else
+        data
+      end
+    end
+
     # Public: Get each line of data
     #
     # Requires Blob#data
@@ -241,7 +255,7 @@ def vendored?
     def lines
       @lines ||=
         if viewable? && data
-          data.split(/\r\n|\r|\n/, -1)
+          _safe_data.split(/\r\n|\r|\n/, -1)
         else
           []
         end
@@ -274,7 +288,7 @@ def sloc
     #
     # Return true or false
     def generated?
-      @_generated ||= Generated.generated?(name, lambda { data })
+      @_generated ||= Generated.generated?(name, lambda { _safe_data })
     end

     # Public: Detects the Language of the blob.
4 changes: 4 additions & 0 deletions samples/Python/invalid-encoding.py
@@ -0,0 +1,4 @@
+#!/usr/bin/env python
+# Here is a string that cannot be decoded in line mode: Â.
+
+# From: llvm.org/svn/llvm-project/llvm/trunk/utils/lit/tests/shtest-encoding.py
