Fix handling of invalid UTF-8 byte sequences in Ruby 1.9+. #840

Closed
18 changes: 16 additions & 2 deletions lib/linguist/blob_helper.rb
@@ -233,6 +233,20 @@ def vendored?
name =~ VendoredRegexp ? true : false
end

+ # Internal: Explicitly remove invalid UTF-8 sequences by conversion.
+ #
+ # Only affects Ruby 1.9+ since 1.8 is charset naive.
+ #
+ # Returns the data blob with invalid characters replaced with \uFFFD if needed.
+ def _safe_data
+ if ''.respond_to?(:encode!)
+ safe_utf8 = Encoding::Converter.new(encoding, 'utf-8', :invalid => :replace)
+ safe_utf8.convert(data).dump
+ else
+ data
+ end
+ end
+
# Public: Get each line of data
#
# Requires Blob#data
@@ -241,7 +255,7 @@ def vendored?
def lines
@lines ||=
if viewable? && data
- data.split(/\r\n|\r|\n/, -1)
+ _safe_data.split(/\r\n|\r|\n/, -1)
else
[]
end
@@ -274,7 +288,7 @@ def sloc
#
# Return true or false
def generated?
- @_generated ||= Generated.generated?(name, lambda { data })
+ @_generated ||= Generated.generated?(name, lambda { _safe_data })
end

# Public: Detects the Language of the blob.
4 changes: 4 additions & 0 deletions samples/Python/invalid-encoding.py
@@ -0,0 +1,4 @@
#!/usr/bin/env python
# Here is a string that cannot be decoded in line mode: �.

# From: llvm.org/svn/llvm-project/llvm/trunk/utils/lit/tests/shtest-encoding.py
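
For comparison, here is a minimal sketch of the same idea (replace invalid UTF-8 sequences with U+FFFD before splitting into lines) using `String#scrub`. That method only exists in Ruby 2.1 and later, so it is not something this patch could have relied on for 1.9. The helper name `scrubbed_utf8` is hypothetical and not part of Linguist, and the sketch assumes the blob is meant to be UTF-8. The last part illustrates one thing that may be worth checking in the patch: `String#dump` returns a quoted, escaped copy of the string, so splitting a dumped string on line breaks finds none.

```ruby
# Hypothetical helper (not part of Linguist): return a UTF-8 copy of `data`
# with every invalid byte sequence replaced by U+FFFD.
# Assumes Ruby 2.1+ for String#scrub and that the content is meant to be UTF-8.
def scrubbed_utf8(data)
  # dup first so force_encoding does not retag the caller's string
  data.dup.force_encoding(Encoding::UTF_8).scrub
end

# Raw bytes containing one invalid UTF-8 byte (0xFF) and mixed line endings.
raw = "# bad byte: \xFF.\nsecond line\r\nthird".b

clean = scrubbed_utf8(raw)
lines = clean.split(/\r\n|\r|\n/, -1)

# String#dump escapes newlines and wraps the result in quotes,
# so the dumped copy contains no literal line breaks to split on.
dumped_lines = clean.dump.split(/\r\n|\r|\n/, -1)

puts clean.valid_encoding?
puts lines.length
puts dumped_lines.length
```

Scrubbing a copy keeps the original bytes available to callers that need them, while `lines` and similar consumers see a string that regular expressions can process without raising on invalid byte sequences.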