Add support for rtf format #152

Merged: kbenoit merged 1 commit into master from add-rtf on May 7, 2019
kbenoit (Collaborator) commented May 6, 2019

Adds support for RTF files. Resolves #90.

@jeroen, the striprtf package seems to handle encoding issues more robustly than unrtf, so I went with that.

library("readtext")
library("quanteda")
## Package version: 1.4.3

readtext("https://jeroen.github.io/files/sample.rtf") %>%
  texts() %>%
  cat()
## It is an example test rtf-file to RTF2XML bean for testing
## 
## Font size 10, plain text;
## Font size 12, bold text. Underline,bold text.
##  Underline,italic,bold text. 
## Font size 22, plain text.
##                                                  Bold text.
##                 Italic text.
## 
##    Simple table :
## 
## 
## *| 1st column | 2nd column | 3rd column | 4th column | 5th column | 
## *| 1.1 item | 1.2 item | 1.3 item | 1.4 item | 1.5 item | 
## *| 2.1 item | 2.2 item | 2.3 item | 2.4 item | 2.5 item | 
## *| 3.1 item | 3.2 item | 3.3 item | 3.4 item | 3.5 item | 
## *| 4.1 item | 4.2 item | 4.3 item | 4.4 item | 4.5 item | 
## *| 5.1 item | 5.2 item | 5.3 item | 5.4 item | 5.5 item | 
## *| Empty  | 
## *| …
## *|  | 
## *| …
## *|  | 
## *| …
## *|  | Empty | 
## *| Last items | 
## *| …
## *|  | 
## *| …
## *|  | 
## *| …
## *|  | Last items | 
## 
## 
## List :
## 
## It is the 1st row of the list
## It is the 2nd row of the list
## …
## 
## …
## 
## …
## 
## It is the last row of the list
## 
##  Here is a brief Courier text.
##   Here is a brief MS Sans - Serif text.
##   Here is a brief MS Serif text.
##   Here is a brief Times New Roman text.
## 
##   
## 
##  Some paragraphs :
## 
## Align left :
## 
##      The text you are reading is aligned left. It is an align – left text. It is also an align – left sentence.           
##         
## Align right:
## 
##   The text you are reading is aligned right. It is an align – right text. It is also an align – right sentence. 
## 
## Align centered:
## 
##         The text you are reading is aligned center. It is an align – centered text. It is also an align – centered sentence. 
## 
## Align justified:
## 
##           The text you are reading is aligned justify. It is an align – justified text. It is also an align – justified sentence.
## 
## Here are some special characters: 
## ö
## t 
## á
## rv
## í
## zt
## û
## r
## õ
##  
## ü
## tvef
## ú
## r
## ó
## g
## é
## p, which means “five flood resistant hammer drills” () in Hungarian.
## 
##    At last you can see an image :
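
For reference, here is a minimal sketch of calling striprtf directly on a local file, which is roughly the kind of extraction step the reprex above relies on. The tiny RTF string, the temporary file, and the variable names are made up for illustration, and the way the lines are joined is an assumption, not necessarily what readtext does internally.

```r
library("striprtf")

# A tiny hand-written RTF document (illustrative only, not taken from this PR)
rtf <- "{\\rtf1\\ansi\\deff0 {\\fonttbl{\\f0 Times New Roman;}}\\f0 Hello RTF\\par}"

# Write it to a temporary .rtf file so it can be read like any other document
path <- tempfile(fileext = ".rtf")
writeLines(rtf, path)

# read_rtf() returns a character vector of extracted text lines
lines <- striprtf::read_rtf(path)

# Assumption: collapse the lines into a single string, one text per document
text <- paste(lines, collapse = "\n")
cat(text)
```

Table rows in the output above start with `*| ` because that is the default `row_start` marker in `read_rtf()`; passing `ignore_tables = TRUE` should skip table content if it is not wanted.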

kbenoit requested a review from amatsuo May 6, 2019 07:12
kbenoit merged commit 5abeab0 into master May 7, 2019
kbenoit deleted the add-rtf branch May 7, 2019 21:57