Fix handling of invalid UTF-8 byte sequences in Ruby 1.9+.

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly convert text to UTF-8, replacing invalid characters prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
- Add the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
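
For reviewers who want to reproduce the problem outside Linguist, here is a minimal standalone sketch of the idea. It is not the code in this pull request: `safe_utf8` is a hypothetical helper name, and it assumes Ruby 2.1+ so that `String#scrub` is available (the patch itself targets 1.9+ and uses `Encoding::Converter`). The input is an ASCII line with a stray `0xC2` byte, which is an assumption about what an invalid sequence typically looks like.

```ruby
# Hypothetical helper (not part of Linguist): return a UTF-8 copy of +raw+
# in which every invalid byte sequence is replaced by U+FFFD.
# Assumes Ruby 2.1+ for String#scrub.
def safe_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?
  text.scrub("\uFFFD")
end

# A lone 0xC2 lead byte followed by a space is not valid UTF-8.
raw = "# caf\xC2 ok\nprint 1\n".b

# On Ruby 1.9+, a regexp match against a string with a broken encoding
# raises ArgumentError, which is the kind of failure this change avoids.
broken = raw.dup.force_encoding(Encoding::UTF_8)
begin
  broken =~ /\n/
rescue ArgumentError => e
  puts "regexp match failed: #{e.message}"
end

# After cleaning, splitting into lines works as usual.
lines = safe_utf8(raw).split(/\r\n|\r|\n/, -1)
puts lines.length
```

The helper only touches strings that fail `valid_encoding?`, so well-formed input passes through unchanged.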
pullreq · Dec 16, 2013 · 12b0706
1 parent 03cadf2
commit 12b0706
Showing 2 changed files with 20 additions and 2 deletions.
diff --git a/lib/linguist/blob_helper.rb b/lib/linguist/blob_helper.rb
@@ -233,6 +233,20 @@ def vendored?
       name =~ VendoredRegexp ? true : false
     end
 
+    # Internal: Explicitly remove invalid UTF-8 sequences by conversion.
+    #
+    # Only affects Ruby 1.9+ since 1.8 is charset naive.
+    #
+    # Returns the data blob with invalid characters replaced with \uFFFD if needed.
+    def _safe_data
+      if ''.respond_to?(:encode!)
+        safe_utf8 = Encoding::Converter.new(encoding, 'utf-8', :invalid => :replace)
+        safe_utf8.convert(data)
+      else
+        data
+      end
+    end
+
     # Public: Get each line of data
     #
     # Requires Blob#data
@@ -241,7 +255,7 @@ def vendored?
     def lines
       @lines ||=
         if viewable? && data
-          data.split(/\r\n|\r|\n/, -1)
+          _safe_data.split(/\r\n|\r|\n/, -1)
         else
           []
         end
@@ -274,7 +288,7 @@ def sloc
     #
     # Return true or false
     def generated?
-      @_generated ||= Generated.generated?(name, lambda { data })
+      @_generated ||= Generated.generated?(name, lambda { _safe_data })
     end
 
     # Public: Detects the Language of the blob.

diff --git a/samples/Python/invalid-encoding.py b/samples/Python/invalid-encoding.py
@@ -0,0 +1,4 @@
+#!/usr/bin/env python
+# Here is a string that cannot be decoded in line mode: Â.
+
+# From: llvm.org/svn/llvm-project/llvm/trunk/utils/lit/tests/shtest-encoding.py