readtext sometimes produces invalid UTF-8 #108
Comments
Encoding issue... the file is Latin1, so you need to specify the conversion at input. Could be the "£" symbol.
system2("file", tmp)
## /var/folders/46/zfn6gwj15d3_n6dhyy1cvwc00000gp/T//RtmpL5BugH/filee0d4589b4674.txt: ISO-8859 text, with CRLF line terminators
data <- readtext(tmp, encoding = "ISO-8859-1")
cat(substr(data$text, 875900, 875960))
## command of her beauty, and her £20,000, any one who could sat
Nice vignette!
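To make the "£" suspicion concrete, here is a minimal base R sketch (it does not use readtext, and the three-byte sample string is made up for illustration). In ISO-8859-1 the pound sign is the single byte 0xA3, which is not a valid UTF-8 sequence on its own, so the text fails validation until the source encoding is declared and converted.

```r
# Build a string from raw Latin-1 bytes: 0xA3 is the pound sign in ISO-8859-1.
latin1_bytes <- as.raw(c(0xA3, 0x32, 0x30))  # "£20" encoded as Latin-1
x <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, the lone 0xA3 byte is invalid.
validUTF8(x)
## [1] FALSE

# Declaring the source encoding and converting yields valid UTF-8.
y <- iconv(x, from = "ISO-8859-1", to = "UTF-8")
validUTF8(y)
## [1] TRUE

# The pound sign is now the two-byte sequence c2 a3.
charToRaw(y)
## [1] c2 a3 32 30
```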
I learned from the best! Would it be possible for readtext to detect that the text is invalid? Something like this: If the user does not specify the encoding, do the following:
On step 3, instead of assuming Latin-1, you could try some automatic encoding detection (still with a warning to the user). My prior is a high probability of Latin-1, so it might not be worth it to detect the encoding automatically.
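The fallback idea in this comment could look roughly like the base R sketch below. The exact steps are an assumption pieced together from the discussion (validate as UTF-8; if that fails, assume Latin-1 and warn). `read_with_fallback` is a hypothetical helper, not a readtext function, and it does not attempt automatic encoding detection.

```r
# Hypothetical helper (not part of readtext): return the file contents as
# UTF-8, assuming `fallback` when the bytes are not valid UTF-8.
read_with_fallback <- function(path, fallback = "ISO-8859-1") {
  # Read raw bytes so that no encoding is assumed up front.
  # Note: rawToChar() errors on embedded NUL bytes; this sketch ignores that case.
  bytes <- readBin(path, what = "raw", n = file.size(path))
  text <- rawToChar(bytes)

  if (validUTF8(text)) {
    Encoding(text) <- "UTF-8"
    return(text)
  }

  # Assumed behaviour: warn and fall back rather than fail.
  warning("Input is not valid UTF-8; assuming ", fallback,
          ". Specify the encoding explicitly to avoid this warning.")
  iconv(text, from = fallback, to = "UTF-8")
}

# Usage: write three Latin-1 bytes ("£20") to a temporary file and read them back.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xA3, 0x32, 0x30)), tmp)
txt <- suppressWarnings(read_with_fallback(tmp))
validUTF8(txt)
## [1] TRUE
```

Warning instead of failing keeps the common Latin-1 case working while still telling the user that the encoding was guessed.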
Very doable, since we have the function
data <- readtext(tmp)
encoding(data)
## readtext object consisting of 1 document and 0 docvars.
## # data.frame [1 x 2]
## doc_id text
## <chr> <chr>
## 1 filee0d4589b4674.txt "\"The Projec\"..."
## Probable encoding: ISO-8859-1
## (but note: detector often reports ISO-8859-1 when encoding is actually UTF-8.)
Does your new package utf8 offer a way to solve this?
You could check for validity with
With the current development version of readtext:
See https://github.com/patperry/r-corpus/blob/master/vignettes/unicode.Rmd for more context.