Worditize

Worditize is a Shiny application allowing users to explore and compare documents available on Project Gutenberg by performing basic text mining tasks in the browser. This application is powered by the wonderful tidytext and gutenbergr packages, as well as the ever-helpful tidyverse group of packages.

Getting Started

Upon running the application, the user is presented with the main page below:

Main Page

Counts and Sentiments

In the text box on the left, the user may begin typing the title of a public domain document that is available on Project Gutenberg. The first 10 matching results are displayed in a dropdown menu. Selecting a title from the default tab, "Topics and Feels", generates the 10 most common words in the text, excluding stop words (e.g. 'a', 'an', 'the', 'but', 'and'), along with sentiment information, as shown in the images below.

Top 10 Words

Sentiment Information

The sentiment tab shows sentiment data for the three sentiment lexicons available through the tidytext package. The NRC lexicon associates words with emotion categories (such as joy, anger, or fear) as well as with positive and negative sentiment. The AFINN lexicon rates words on a point scale from -5 to 5, indicating not only whether a word is positive or negative, but also to what degree. Finally, the Bing lexicon simply rates words as "positive" or "negative".

In each case, the application counts the words that fall under each category or score the respective lexicon defines.
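The counting steps described above can be sketched roughly as follows. This is a minimal illustration and not the application's actual code: the `book` tibble is a hypothetical stand-in for a text downloaded from Project Gutenberg, and only the Bing lexicon is used because it ships with tidytext (AFINN and NRC are fetched separately through the textdata package).

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a downloaded text: one row per line of the book.
book <- tibble::tibble(text = c(
  "It was a happy and wonderful day in the village.",
  "The terrible storm ruined the happy day.",
  "A wonderful day turned into a sad night."
))

# Split the text into one word per row, then drop stop words.
words <- book %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# The (up to) ten most common remaining words.
top_words <- words %>%
  count(word, sort = TRUE) %>%
  head(10)

# Counts of positive and negative words according to the Bing lexicon.
bing_counts <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

print(top_words)
print(bing_counts)
```

Swapping `get_sentiments("bing")` for `get_sentiments("afinn")` or `get_sentiments("nrc")` follows the same join-and-count pattern, with the counted column changing to match what each lexicon provides.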

Topic Modeling (In Progress)

Currently, the topic modeling part of the application (under the "Topics" header) uses the topicmodels package to fit a Latent Dirichlet Allocation (LDA) topic model from tidy-formatted text data. An option to select Pachinko topic modeling is also shown, but it is not yet implemented.

At present, two documents at a time can be mined for topics. The user may choose to generate between one and ten topics (default: 5), and may also choose how many words to show for each topic. As shown below, running the analysis generates a faceted chart of the top N words for each topic, with each word plotted against its probability of being generated by that topic.

Initial Topic Modeling Results

On the "Document Probabilities" tab, we can see, for each document, the estimated proportion of its words that were generated by each topic, as shown by the gamma value.

Document Probabilities
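For readers curious how the two views relate, here is a rough sketch of fitting an LDA model and extracting both sets of probabilities with topicmodels and tidytext. It is an illustration under stated assumptions, not the application's code: the `word_counts` table, the document names, `k = 2`, and the seed are all made up for the example.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical per-document word counts for two documents.
word_counts <- tibble::tribble(
  ~document, ~word,     ~n,
  "Book A",  "whale",   12,
  "Book A",  "sea",      9,
  "Book A",  "ship",     7,
  "Book A",  "captain",  5,
  "Book B",  "love",    11,
  "Book B",  "letter",   8,
  "Book B",  "sister",   6,
  "Book B",  "ship",     2
)

# Cast the tidy counts into the document-term matrix that LDA() expects.
dtm <- word_counts %>%
  cast_dtm(document, word, n)

# Fit a two-topic model; the seed only makes the run repeatable.
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta): one row per topic-term pair.
beta_all <- tidy(lda, matrix = "beta")

# Top words per topic, as in the faceted chart.
topic_terms <- beta_all %>%
  group_by(topic) %>%
  slice_max(beta, n = 3) %>%
  ungroup()

# Per-document topic proportions (gamma): one row per document-topic pair.
doc_topics <- tidy(lda, matrix = "gamma")

print(topic_terms)
print(doc_topics)
```

Within each topic the beta values sum to 1 across all words, and within each document the gamma values sum to 1 across topics, which is why the two tabs answer different questions about the same model.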

Future Work

I would like to make additional topic modeling algorithms available, likely by way of the RMallet package, which provides an R interface to MALLET, a topic modeling library written in Java.

Link

Try me out at shinyapps.io!