NLP Text Summarization, Sentence Compression, NER Summarization

After completing this post you will:

  1. Be able to run naive, basic, non machine learning code that summarizes text based on TFIDF.
  2. Know why such a naive implementation works only partially.
  3. Know the 3 main methods for text summarization:

    3.1 **Text Summarization** Summarize a block of text; a brief overview of techniques (from a published paper).

    3.2 **Sentence Compression** This one aims to compress a single sentence.

    3.3 **NER based summarization** Look at the entities of the text (price, brand, product) and summarize by them.

  4. Know the difference between an extractive text summarizer and a semantic summarizer.
  5. Run code that identifies the topic and brand of a text with deep learning (online demo available).
  6. Run code that summarizes multiple sentences with the takahe github project.
  7. See how Walmart handles attribute extraction from product titles in eCommerce.
  8. Run code that extracts common NERs (brand, money, location) from a sentence.
  9. Know what SumBasic is.
  10. Have plenty of resources and directions to continue from here.

Here are some of the examples we are going to go through:

  1. Naive summarization of the Wikipedia page about Leonardo da Vinci.
  2. Extract a topic.
  3. Detect the topic of the pneumonia Wikipedia article.
  4. Find out the category and brand of an Amazon product.
  5. Detect the topic of the abortion Wikipedia article.
  6. Summarize news about Clinton.
  7. Use spaCy to extract features of a text.

Sounds like a lot, let's get started.

Disclosure

I just began to study this topic; most of the things I'm talking about or practicing here I'm doing for the first time. I'm by no means an expert, and not even a novice :). There will be mistakes in this article. I will share my learning and discoveries with you. The document is a work in progress and will get updated. In addition, while running most of the examples it looked like nothing predicted all the cases I entered well; there were always mistakes!

Our plan

So we want to understand text summarization / sentence compression / NER based summarization, let’s have a plan:

  1. The jargon.
  2. Published research and references.
  3. History of text summarization.
  4. The different methods.
  5. Let’s write some code.
  6. Where do I plan to head on.
  7. Summary.

Jargon

I always find that in any topic I study the jargon/taxonomy/terminology is one of the most important things to know so here it is:

| Term | Description |
|------|-------------|
| Text Summarization | A computer creating meaningful summaries of text |
| Sentence Compression | Take a long sentence (or a few) and compress it |
| NER | Named Entity Recognition (price, brand, geo) |
| Google knowledge graph | An enhancement to search that shows informative data in a right-hand panel next to search results |
| Extractive Summary | A summary built solely from words and snippets extracted from the text |
| Abstractive Summary | A summary not necessarily built only from items in the text; involves understanding the text (ex. https://www.google.com/intl/es419/insidesearch/features/search/assets/img/snapshot.jpg) |
| Ontology | Domain-specific information: "The ontologies on the Web range from large taxonomies categorizing Web sites (such as on Yahoo!) to categorizations of products for sale and their features" |
| NLG | Natural Language Generation |
| Word Embedding | Vectorizing words; word2vec, GloVe; solves the sparse matrix problem; uses context |

Introduction

What is text summarization? An example works best here, so below is a real-world one:

Article:

novell inc. chief executive officer eric schmidt has been named chairman of the internet search-engine company google .

Human Summary:

novell ceo named google chairman

Machine Summary:

novell chief executive named to head internet company

Reference: TensorFlow Research Text Summarization

Yes, most text summarization training data, research and example models are focused on news; if you are not in the news business, chances are you will need to get your own data and retrain, as there are no ready-made models for you.

How do we (humans, although some bots are also reading this..) summarize text? We read it fully or partially, understand, fill in context, reread, read other docs, think, apply intuition, apply templates (finance), assume audience expectations, highlight important items, sleep on it, I have to stop here..

And then:

**We come up with a much shorter version of the original document which contains the main ideas and shares the intent of the original: the glorious summary.**

or as “Text Summarization Techniques” paper says:

a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that

How do they (computers) summarize text, given that the process above is so complex?

Who knows!

Can they do that? If yes can they do that in a satisfactory manner? Will they have embarrassing mistakes? How far are they from humans? Or maybe how advanced?

First Paper - Text Summarization Techniques

Text Summarization Techniques: A Brief Survey

This is the paper that we need to get started; its premise is:

We review the different processes for summarization and describe the effectiveness and shortcomings of the different methods.

Sounds promising. We will get back to this paper, but I want to scan the topic some more, so let's even start with a naive example.

Naive Code

**Note: in our use case we are less interested in complete human sentences and more in a few words together that summarize the topic.**

Our first naive code implementation: no machine learning, just take some text and try to summarize it somehow, with common sense. Let's see:

Step 1: Here is our bunch of text to summarize:

text: str = """
Leonardo da Vinci
Leonardo di ser Piero da Vinci (Italian: [leoˈnardo di ˌsɛr ˈpjɛːro da (v)ˈvintʃi] (About this sound listen); 15 April 1452 – 2 May 1519), more commonly Leonardo da Vinci or simply Leonardo, was an Italian polymath of the Renaissance, whose areas of interest included invention, painting, sculpting, architecture, science, music, mathematics, engineering, literature, anatomy, geology, astronomy, botany, writing, history, and cartography. He has been variously called the father of palaeontology, ichnology, and architecture, and is widely considered one of the greatest painters of all time. Sometimes credited with the inventions of the parachute, helicopter and tank,[1][2][3] he epitomised the Renaissance humanist ideal.

Many historians and scholars regard Leonardo as the prime exemplar of the "Universal Genius" or "Renaissance Man", an individual of "unquenchable curiosity" and "feverishly inventive imagination",[4] and he is widely considered one of the most diversely talented individuals ever to have lived.[5] According to art historian Helen Gardner, the scope and depth of his interests were without precedent in recorded history, and "his mind and personality seem to us superhuman, while the man himself mysterious and remote".[4] Marco Rosci notes that while there is much speculation regarding his life and personality, his view of the world was logical rather than mysterious, and that the empirical methods he employed were unorthodox for his time.[6]

Born out of wedlock to a notary, Piero da Vinci, and a peasant woman, Caterina, in Vinci in the region of Florence, Leonardo was educated in the studio of the renowned Florentine painter Andrea del Verrocchio. Much of his earlier working life was spent in the service of Ludovico il Moro in Milan. He later worked in Rome, Bologna and Venice, and he spent his last years in France at the home awarded to him by Francis I of France."""

Leonardo was a good man, let’s naively summarize him.

First, how would you summarize this text, let’s say limiting to 7 words?

I would say this:

My modest summary: “Leonardo da Vinci, Italian, Renaissance, painter, sculptor”

Now let's move on with our naive code implementation:

Step 2: Tokenize the words:

from nltk.tokenize import word_tokenize

words = word_tokenize(text)  # thanks nltk

Step 3: Score words based on their frequency

from nltk.probability import FreqDist

words_score: FreqDist = FreqDist()  # thanks nltk
for word in words:
    words_score[word.lower()] += 1

Step 4: The summary would be our top 7 frequent words:

def word_index(text: str, scored_word) -> int:
    # Assumed helper: first position of the word in the text, so the summary keeps the original word order.
    return text.lower().find(scored_word[0])

def top_scores_sorted_by_text(w_scores: FreqDist, k: int):
    return sorted(w_scores.most_common(k), key=lambda w: word_index(text, w))

summary = top_scores_sorted_by_text(words_score, 7)
print(summary)

Let’s see our result

[('[', 15), ('his', 17), (',', 67), ('of', 31), ('the', 32), ('and', 26), ('.', 21)] # that's a horrible summary!

We have his, of, the; obviously we don't want them in our summary, so let's get rid of them:

Step 5: Get rid of stop words

from typing import Set
from nltk.corpus import stopwords

stop_words: Set[str] = set(stopwords.words("english"))  # thanks nltk
words = [w for w in words if w not in stop_words]  # thanks python
text = ' '.join(words)  # and the updated text (sorry, immutability) is now a join of the words without stop words.

Now let’s print again the resulting summary

[('leonardo', 11), ('da', 5), ('vinci', 6), ('[', 15), (']', 15), (',', 67), ('.', 21)]

This is a somewhat better version: we have leonardo da vinci as the first 3 words of the summary, which sounds perfect! But we also have a lot of punctuation, so let's get rid of it:

Step 6: Get rid of punctuations

import string  # standard python (thanks)

def remove_punctuations(s: str) -> str:
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

text = remove_punctuations(text)

And print again the summary:

[('leonardo', 9), ('da', 5), ('vinci', 6), ('he', 4), ('renaissance', 4), ('painting', 4), ('engineering', 3)]

Ah, looks much better. There is one issue: we have he in the summary and we don't want it; we have only 7 words and no space to waste. Could it be that Leonardo was proficient in yet another topic?

Step 7: Fix stop word bug

We have a bug: we removed the stopwords with [w for w in words if w not in stop_words], but somehow the he stopword sneaked in. Let's fix it. The problem is that we didn't lowercase the text, so He was not recognized as the stopword he.

text = text.lower() # no immutability small example.

And now let’s run the summary again:

[('leonardo', 9), ('da', 5), ('vinci', 6), ('renaissance', 4), ('painting', 4), ('engineering', 3), ('inventions', 3)]

No more he stopword. This even looks like a much better summary than my original (human) one!

**But don't get excited: there are millions if not billions of texts that this naive, dumb summarizer would not handle, just think of products for sale. If we think of products for sale, we need a better flow.**

We could think of more enhancements:

  1. Give a higher score to words appearing in the title (a small sketch follows this list).
  2. Refer to the query (if the reader got to this page via search).
  3. More..
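As an illustration of the first idea, here is a minimal sketch; the title string and the 2x boost factor are assumptions for illustration, not part of the original code:

# Boost words that also appear in the title; "title" and the factor 2 are arbitrary assumptions.
title = "leonardo da vinci"
title_words = set(title.split())

boosted_scores: FreqDist = FreqDist()
for word, score in words_score.items():
    boosted_scores[word] = score * 2 if word in title_words else score

print(top_scores_sorted_by_text(boosted_scores, 7))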

Let’s summary what we have done in the above naive summarizer:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Text Summarization Very Naive Implementation                                                         │
│                                                                                                     │
│┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐     ┌───────────────────┐│
││                   │      │                   │      │                   │     │                   ││
││Get Some text from │      │      Cleanup      │      │   Words Scoring   │     │Select top k words ││
││     wikipedia     │─────▶│                   │─────▶│                   │────▶│  as our summary   ││
││                   │      │                   │      │                   │     │                   ││
│└───────────────────┘      └───────────────────┘      └───────────────────┘     └───────────────────┘│
│                                     │                          │                                    │
│                                     ▼                          ▼                                    │
│                           ┌───────────────────┐      ┌───────────────────┐                          │
│                           │Remove punctuations│      │  Frequency Table  │                          │
│                           └───────────────────┘      └───────────────────┘                          │
│                                     │                                                               │
│                                     ▼                                                               │
│                           ┌───────────────────┐                                                     │
│                           │    Lower case     │                                                     │
│                           └───────────────────┘                                                     │
│                                     │                                                               │
│                                     ▼                                                               │
│                           ┌───────────────────┐                                                     │
│                           │ Remove stopwords  │                                                     │
│                           └───────────────────┘                                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

A few points to note:

  1. This is an extractive text summarizer: we didn't invent anything, there is no semantic understanding, we just selected words.
  2. There is a better algorithm called SumBasic.

The difference between extractive and semantic summarization is that an extractive summarizer takes phrases from the text, so in that sense it cannot go wrong: it only uses things which preexisted in the text. A semantic summarizer tries to actually understand the text and compose new text.

SumBasic

Here is the formula for SumBasic:

\begin{equation} g(S_j)=\frac{\sum_{w_i \in S_j} P(w_i)}{|\{w_i \mid w_i \in S_j\}|} \end{equation}

This looks complex to me. But I found that after I got what each symbol means it became simple, even embarrassingly simple.

Here is the meaning of that formula:

| Term | Meaning |
|------|---------|
| g(S_j) | Weight of sentence j |
| w_i ∈ S_j | A word w_i that belongs to sentence j |
| ∑_{w_i ∈ S_j} P(w_i) | The sum of the probabilities of all words that belong to sentence j |
| card{w_i : w_i ∈ S_j} | The number of words in sentence j |

So g(S_j) turns out to be the average probability of the words in sentence j, where the word probability P(w_i) is simply the number of occurrences of w_i divided by the total number of words in the document.

This is very similar to what we did with words, without knowing SumBasic! In our case we wanted to get a bunch of words and not a bunch of sentences, so we just took the most frequent words, which is similar to taking the sentences with the highest average word probability.

SumBasic then updates each chosen word's probability by multiplying it by itself (reducing it) so that other sentences can now be picked, and it keeps looping until we have picked as many sentences as we meant to. A minimal sketch of the whole loop follows.
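Here is that sketch; it is my own reading of the algorithm described above, not a reference implementation, and the toy sentences are made up:

from collections import Counter

def sumbasic(sentences, k):
    # sentences: list of lists of already cleaned, lower-cased words.
    all_words = [w for s in sentences for w in s]
    prob = {w: c / float(len(all_words)) for w, c in Counter(all_words).items()}

    summary = []
    while len(summary) < k and len(summary) < len(sentences):
        # g(S_j): average word probability per sentence.
        best = max((s for s in sentences if s not in summary),
                   key=lambda s: sum(prob[w] for w in s) / float(len(s)))
        summary.append(best)
        # Reduce the probability of the words we already used: p <- p * p.
        for w in best:
            prob[w] = prob[w] * prob[w]
    return summary

print(sumbasic([["leonardo", "was", "a", "painter"],
                ["leonardo", "invented", "machines"],
                ["painting", "is", "art"]], k=2))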

Multi Sentence Compression

There is an interesting github repo named takahe (based on the paper referenced below):

takahe is a multi-sentence compression module. Given a set of redundant sentences, a word-graph is constructed by iteratively adding sentences to it. The best compression is obtained by finding the shortest path in the word graph. The original algorithm was published and described in:

Katja Filippova, Multi-Sentence Compression: Finding Shortest Paths in Word Graphs, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322-330, 2010.

Let’s give it a shot:

conda create -n takahe-py2 python=2.7
conda activate takahe-py2
conda install -y graphviz pygraphviz spyder numpy networkx
git clone https://github.com/boudinfl/takahe
pip install secretstorage
pip install networkx==1.1

Now we give it some text, but note that it requires POS-annotated text:

["The/DT wife/NN of/IN a/DT former/JJ U.S./NNP president/NN 
#Bill/NNP Clinton/NNP Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ 
#Monday/NNP ./PUNCT", "Hillary/NNP Clinton/NNP wanted/VBD to/TO visit/VB China/NNP 
#last/JJ month/NN but/CC postponed/VBD her/PRP$ plans/NNS till/IN Monday/NNP 
#last/JJ week/NN ./PUNCT", "Hillary/NNP Clinton/NNP paid/VBD a/DT visit/NN to/TO 
#the/DT People/NNP Republic/NNP of/IN China/NNP on/IN Monday/NNP ./PUNCT",
"Last/JJ week/NN the/DT Secretary/NNP of/IN State/NNP Ms./NNP Clinton/NNP 
#visited/VBD Chinese/JJ officials/NNS ./PUNCT"]
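Feeding these tagged sentences into takahe looks roughly like this; it is adapted from the usage example in the takahe README, so treat it as a sketch rather than verified code:

# -*- coding: utf-8 -*-
import takahe

# "sentences" is the list of POS-annotated strings shown above (shortened here).
sentences = ["Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ Monday/NNP ./PUNCT",
             "Hillary/NNP Clinton/NNP paid/VBD a/DT visit/NN to/TO China/NNP on/IN Monday/NNP ./PUNCT"]

compresser = takahe.word_graph(sentences, nb_words=6, lang='en', punct_tag="PUNCT")
candidates = compresser.get_compression(50)  # 50 shortest paths in the word graph

for score, path in candidates:
    # Re-rank by path weight normalized by length, as in the README.
    print('%.3f %s' % (score / len(path), ' '.join(w[0] for w in path)))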

And the summarization results are:

0.234 hillary clinton visited china last week.
0.247 hillary clinton visited china on monday last week.
.
.
.
0.306 hillary clinton paid a visit to the people of republic of china last week.
.
.
.

We are still summarizing news :( we need to revisit our plan and github and google searches :)

Updated Plan

Now that we have done a variation of SumBasic for words instead of sentences, let's move on to more examples from the web; namely, algorithms that do more understanding of the text and compose new text, rather than just choosing and extracting a ready-made summary from the existing text.

**Step 1: Model: Classify text**

Is the text about an artist? Is it about a car? Is it about an electric cleaning machine?

**Step 2: Manual: Identify the main features of the topic**

That is the ontology. We want to identify the topic; once we get the topic we can get better at the summarization (you see, we get to understand the text). Say we have identified that the text is about an electric washing machine; this means we need these features (identifying them is the task):

  1. Watts
  2. Target
  3. Price
  4. Size

But how can we get the topic? And how can we then get the relevant features?

**Step 3: Given an article identify topic fill in feature values**

So given an article identify:

  1. Which topic is it about?
  2. What are the features of that topic?
  3. Fill in the values from the article about the features of that topic.

Sounds like a plan!

Step 1: Identify Article Topic

This is also called **Text Classification**. There are 3 main approaches to achieving text classification:

  1. Rules
  2. Standard Machine Learning Models
  3. Deep Learning

I don't have time for rules, my laptop is too slow for deep learning and I'm not sure I have enough data, so I'll go with option 2, standard models, and then move on to deep learning on EC2.

There is a great example (I'm doing this for the first time) on the sklearn website of how to build a model to classify text: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html. I'm simply going to use and run it.

Creating the model and predicting the class/topic of the article will involve the following steps:

  1. Load labeled newsgroups data with topics.
  2. Vectorize the documents, BOW (Bag Of Words).
  3. We can do better than BOW, so we are going to TFIDF the docs to get the target vectors.
  4. Run training.
  5. Predict.

We are not going to check the accuracy, just run an arbitrary example through the model.

Note that sklearn will handle the large sparse matrix issue (consuming much of the RAM) for us; it's going to shrink it automatically. (Did I say thanks, sklearn?)

**Step 1: Load Labeled newsgroups data with topics**

from sklearn.feature_extraction.text import CountVectorizer
import json

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)
twenty_train.target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In the above code we:

  1. Define our categories; we have defined 4 newsgroup categories. Note that sklearn knows how to fetch this example data automatically for us.
  2. Load the text data into a variable twenty_train.
  3. Add a member to twenty_train named target_names with our categories.

**Step 2: Feature engineering**

We have loaded our data, which is just a set of newsgroup posts. What are its features? It's text data, so it has words, right? So each distinct word is going to serve as a feature. In our case BOW means a matrix where each doc is a row, each column is a word, and we count the number of times each word appears in each doc. Guess what: sklearn will do that automatically for us and also keep the sparse matrix small (most words do not appear in each doc).

BOW code:

count_vect = CountVectorizer() 
X_train_counts = count_vect.fit_transform(twenty_train.data) # Tokenize, Filter Stopwords, BOW Features, Transform to vetor, this returns Term Document Matrix! thanks sklearn

That's it: with 2 lines we have tokenized the newsgroup messages, filtered stopwords, extracted BOW features, and transformed them into vectors (numbers).

BOW is skewed toward large documents, where words appear more often, so we are going to turn to TFIDF vectorization instead of BOW; here is the code to do that:

**Step 3: Replace BOW with TFIDF**

from sklearn.feature_extraction.text import TfidfTransformer

# First plain TF (no IDF), just to see the API: transform the count matrix to a normalized tf representation.
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)  # X_train_tf.shape

# And now the TF-IDF representation we will actually use.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

The above code is self-explanatory: we first do TF and then TF-IDF. Note that we do all operations with just a few lines; sklearn appears to be very developer friendly and has a concise, clear API, no wonder it's so common.

Now that we have our data loaded and all the features extracted from it (vectorized with TFIDF), it's time to build the model.

**Step 4: Build the model to predict class of newsgroup message**

from sklearn.naive_bayes import MultinomialNB # Naive bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

There are multiple possible classifiers; we are following the sklearn example, so we have chosen the same one. We then call fit and pass as input X_train_tfidf, the set of features for each doc (the tfidf vectors), and as the labels/output twenty_train.target, the vector of topics for each row.
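As a side note, the same vectorize, TFIDF and classify chain can also be wrapped in a single sklearn Pipeline; this mirrors the sklearn tutorial and is shown only as a more compact alternative to the steps above:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer()),      # tokenize + BOW counts
    ('tfidf', TfidfTransformer()),    # counts -> tf-idf
    ('clf', MultinomialNB()),         # the classifier
])
text_clf.fit(twenty_train.data, twenty_train.target)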

Now for the money time: we are going to predict something. I'm going to take an arbitrary Wikipedia article that deals with one of the 4 categories and see if it's predicted well. So what have we got there: science/medicine, religion, computer graphics, and atheism.

To test the prediction we are not going to run on a whole set of articles, just pick two example articles from Wikipedia and see the outcome prediction. At first let's pick an easy one, I think: the Wikipedia article about pneumonia. I will take the first two sections, run them through the model prediction and see the category chosen.

## Predict document class!

# https://en.wikipedia.org/wiki/Pneumonia

docs_new = ["""pneumonia is an inflammatory condition of the lung affecting primarily the small air sacs known as alveoli.[4][13] Typically symptoms include some combination of productive or dry cough, chest pain, fever, and trouble breathing.[2] Severity is variable.  Pneumonia is usually caused by infection with viruses or bacteria and less commonly by other microorganisms, certain medications and conditions such as autoimmune diseases.[4][5] Risk factors include other lung diseases such as cystic fibrosis, COPD, and asthma, diabetes, heart failure, a history of smoking, a poor ability to cough such as following a stroke, or a weak immune system.[6] Diagnosis is often based on the symptoms and physical examination.[7] Chest X-ray, blood tests, and culture of the sputum may help confirm the diagnosis.[7] The disease may be classified by where it was acquired with community, hospital, or health care associated pneumonia"""]
X_new_counts = count_vect.transform(docs_new) # Extract new doc features.
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

Now after running this pneumonia text we get from the model this prediction:

it was acquired with community, hospital, or health care associated pneumonia' => sci.med (science/medicine), so it got categorized as sci.med, which is simply correct!

Now let's try another piece of text, this time about abortion, and see what the model will predict. Here is the new text we fed it, again the first section of https://en.wikipedia.org/wiki/Abortion:

> Abortion is the ending of pregnancy by removing an embryo or fetus before it can survive outside the uterus.[note 1] An abortion that occurs spontaneously is also known as a miscarriage. An abortion may be caused purposely and is then called an induced abortion, or less frequently, “induced miscarriage”. The word abortion is often used to mean only induced abortions. A similar procedure after the fetus could potentially survive outside the womb is known as a “late termination of pregnancy”

And the resulting prediction by the model is:

...survive outside the womb is known as a "late termination of pregnancy' => soc.religion.christian

Which means that abortion was categorized under the soc.religion.christian category => I don't know whether to be happy, sad, depressed, or excited about this prediction.

**Summary of step 1**

It looks like there is a way to determine the class of a text snippet by its content using machine learning models; for sure there are challenges, but this appears to be a rather well-known problem and there are available methods for solving and optimizing it (changing the model, the parameters, better training input data).

Now for the next step: we expect that for each class/topic we are going to select the set of features to use for text summarization. I'm afraid this part has to be manual. We have to say that for the topic "disease" the features are going to be a closed set such as "mortality rate", "susceptible age group", "name", "average length", while for the "cars" topic the summary template variables are going to be "manufacturer", "engine type", "year", "color", "used/new", etc. It appears that these sets of summary template variables are going to be hand crafted.

The question for step 3 is whether a model could extract the set of "variable values" from articles and compose a summary from them. I don't have the answer, at least not at my current googling phase.

Steps 2 and 3 look like a lot of manual work; is it possible that some more googling would find better, more automatic solutions or better approaches to this summarization problem?

Step 2 Extract Features

As we said in the previous section, extracting the relevant features for a topic is either heavy manual work or magic-computer work. You see, every topic and every discussion has its own unique set of features: if it's luggage you have the dimensions, the color, whether it fits low-cost airlines or not, and of course the brand name. I'm sure there must be a way out of this without programming the universe from scratch again.

After some more googling, NER looks like a good candidate, at least for part of the problem. NER? Looking at spacy.io I see they have already implemented some common NER types and have an API to train new ones; the Stanford NLP libraries also have an NER, this time in Java.

According to Towards Data Science:

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc

Let's have a look at the abilities of spaCy and what it can do for us. According to spaCy's documentation:

The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

According to its documentation it can identify, among others, the following entities: PERSON, ORG (companies), PRODUCT, WORK_OF_ART (books, ..), PERCENT, MONEY, QUANTITY, and a few more.

In addition it allows you to extend and train new models to recognize new entities.

Let's try out its basic usage.

We start with their example:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

And when I run it I get:

(u'Apple', 0, 5, u'ORG')
(u'U.K.', 27, 31, u'GPE')
(u'$1 billion', 44, 54, u'MONEY')

So it has recognized the company Apple, the geographical entity U.K., and a small amount of money: $1 billion.

Let's change the input sentence to Google is looking at buying U.K. startup for $1 billion, if it works it might buy Apple and see that it now identifies two companies; the result of running the above code is:

(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')
(u'Apple', 84, 89, u'ORG')

What if I change Apple to apple, that is: Google is looking at buying U.K. startup for $1 billion, if it works it might buy apple

(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')

Aha, so apple in lower case does not count as a company. What if Google decides to eat an Apple, with upper case: Google is looking at buying U.K. startup for $1 billion, if it works it might eat an Apple

(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')
(u'Apple', 85, 90, u'ORG')

Apparently it's a company: if Google decides to eat an Apple, it's eating a company. Interesting.

Let's take an arbitrary product from eBay and feed it into spaCy's NER. I'm taking ~Apple iPhone 8 4.7” Display 64GB UNLOCKED Smartphone US $499.99~; let's see how spaCy's NER parses it:

(u'Apple iPhone 8 4.7', 0, 18, u'ORG')
(u'64', 28, 30, u'CARDINAL')
(u'UNLOCKED', 33, 41, u'PERSON')
(u'Smartphone', 42, 52, u'DATE')
(u'US', 53, 55, u'GPE')
(u'499.99', 57, 63, u'MONEY')

So the ORG was identified as Apple iPhone 8 4.7, which is not so good: I'm not aware of such a company, and it should have been a product. 64 was identified as CARDINAL, which is good; UNLOCKED as a person; Smartphone as a date; US as geography; and 499.99 as money. This is partially good but definitely not satisfactory.

The good thing to remember is that spaCy provides a way to train new models, so possibly with additional training on more domain-specific items we could reach better results. A rough sketch of what such training looks like is below.
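This is a minimal sketch based on spaCy 2.x's training API, starting from a blank model with a single made-up example and entity offsets chosen by hand; it only illustrates the mechanics, a useful model needs far more data:

import random
import spacy

# One toy example; character offsets mark "Apple" as ORG and "iPhone 8" as PRODUCT.
TRAIN_DATA = [
    (u"Apple iPhone 8 64GB Smartphone",
     {"entities": [(0, 5, "ORG"), (6, 14, "PRODUCT")]}),
]

nlp = spacy.blank("en")            # start from a blank English pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ORG")
ner.add_label("PRODUCT")

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], drop=0.35, sgd=optimizer, losses=losses)
    print(itn, losses)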


Code - Product categorization and named entity recognition

The code below, from the github project ProductNER, is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code and the implemented algorithms, provide background information including the current state of the art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Let's analyze what it's doing! The code uses deep learning for NLP. Deep learning is especially important for our topic as it provides better performance; the models may require more data, but they require less linguistic expertise to train and operate. In addition, deep learning models can learn the features themselves from the raw text, rather than having an expert extract them, which is required even for standard machine learning.

In general our manually designed features tend to be overspecified, incomplete, take a long time to design and validate, and only get you to a certain level of performance at the end of the day. Whereas the learned features are easy to adapt, fast to train, and they can keep on learning so that they get to a better level of performance than we've been able to achieve previously.

Chris Manning, Lecture 1 – Natural Language Processing with Deep Learning, 2017.

Input Data

According to the documentation we first run python parse.py metadata.json, so let's see what parse.py does.

Let's first see what our input looks like; it's called metadata.json and here are its first few lines (truncated):

{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 't
{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'E
{'asin': '0000037214', 'related': {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', 'B00E1YRI4C']}, 'title': 'Purple Sequin Tiny Dancer Tutu Ballet Dance Fa
{'asin': '0000032069', 'title': 'Adult Ballet Tutu Cheetah Pink', 'price': 7.89, 'imUrl': 'http://ecx.images-amazon.com/images/I/51EzU6quNML._SX342_.jp
{'asin': '0000031909', 'related': {'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '0000031895', 'B00D2K1M3O', '0000031852', 'B00D0WDS9A', 'B00D10CLVW', 'B
{'asin': '0000032034', 'title': 'Adult Ballet Tutu Yellow', 'price': 7.87, 'imUrl': 'http://ecx.images-amazon.com/images/I/21GNUNIa1CL.jpg', 'related':
{'asin': '0000589012', 'title': "Why Don't They Just Quit? DVD Roundtable Discussion: What Families and Friends need to Know About Addiction and Recove

Preprocessing Scripts

It opens metadata.json and reads it line by line; for each line it checks:

if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):

So it checks whether each of the above is in the line and, if yes, puts them into variables together with the description and categories; its output is products.csv:

Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory,Big Dreams,,"Clothing, Shoes & Jewelry / Girls / Clothing, Shoes & Jewelry
Adult Ballet Tutu Cheetah Pink,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts
Girls Ballet Tutu Neon Pink,Unknown,High quality 3 layer ballet tutu. 12 inches in length,Sports & Outdoors / Other Sports / Dance
Adult Ballet Tutu Yellow,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts
Girls Ballet Tutu Zebra Hot Pink,Coxlures,TUtu,Sports & Outdoors / Other Sports / Dance
Adult Ballet Tutu Purple,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts

So what we see above is title,brand,description,categories inside products.csv, and that was parse.py. A rough sketch of the step is below.
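This is my own reconstruction of that parsing step, not the repo's parse.py; it assumes the metadata lines are Python-style dict literals, which is how this Amazon metadata dump is usually distributed:

import ast
import csv

with open('metadata.json') as inp, open('products.csv', 'w') as out:
    writer = csv.writer(out)
    for line in inp:
        # Keep only records that carry a title, a brand and categories.
        if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
            record = ast.literal_eval(line)
            writer.writerow([
                record.get('title', ''),
                record.get('brand', ''),
                record.get('description', ''),
                ' / '.join(record['categories'][0]),  # flatten the first category path
            ])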

Now to the next file to run: python normalize.py products.csv, which normalizes the product data. The script lower-cases all words and replaces \n with a space, so the file's format is normalized. The output is products.normalized.csv, which is given in turn to the next script.

products.normalized.csv:

purple sequin tiny dancer tutu ballet dance fairy princess costume accessory,big dreams,,"clothing, shoes & jewelry / girls / clothing, shoes & jewelry
adult ballet tutu cheetah pink,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts
girls ballet tutu neon pink,unknown,high quality 3 layer ballet tutu. 12 inches in length,sports & outdoors / other sports / dance
adult ballet tutu yellow,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts
girls ballet tutu zebra hot pink,coxlures,tutu,sports & outdoors / other sports / dance
adult ballet tutu purple,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts

The next script to run is python trim.py products.normalized.csv; this script removes any unknown brands:

if brand == 'unknown' or brand == '' or brand == 'generic':
                trimmed += 1

So we are left only with known brands.

The next script to run is python supplement.py products.normalized.trimmed.csv; this script prepends the brand name to the title and prepends the title to the description, so now all titles have the brand name inside them, see below:

if not (brand in title):
    supplemented += 1
    title = brand + ' ' + title
description = title + ' ' + description

The next script to run is python tag.py products.normalized.trimmed.supplemented.csv: it adds the actual tags, in the standard IOB style used for sequence tagging, for example tagging += 'B-B ' (Begin Brand), tagging += 'I-B ' (In Brand) and tagging += 'O ' (Other, no brand). Roughly, the idea looks like the sketch below.
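A rough sketch of that tagging idea; this is my reconstruction of the scheme described above, not the repo's tag.py:

def tag_title(title, brand):
    # Brand tokens get B-B / I-B, everything else gets O.
    title_tokens = title.split()
    brand_tokens = brand.split()
    tags = []
    i = 0
    while i < len(title_tokens):
        if title_tokens[i:i + len(brand_tokens)] == brand_tokens:
            tags.append('B-B')
            tags.extend(['I-B'] * (len(brand_tokens) - 1))
            i += len(brand_tokens)
        else:
            tags.append('O')
            i += 1
    return tags

print(tag_title('bububibi adult ballet tutu cheetah pink', 'bububibi'))
# ['B-B', 'O', 'O', 'O', 'O', 'O']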

Training Scripts

These are the training scripts to run:

mkdir -p ./models/
python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv

Let’s see what they do one by one first: python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv:

from tokenizer import WordTokenizer

# Tokenize texts
tokenizer = WordTokenizer()
tokenizer.train(texts)

Well, it's calling .train(texts). According to the documentation, .train:

Takes a list of texts, fits a tokenizer to them, and creates the embedding matrix.

What is an embedding? Let's google it:

Word embedding is an improvement over the traditional bag-of-words encoding, where large sparse vectors were used to represent each word; in a word embedding the position of a word within the vector space is learned from text. Examples: Word2Vec, GloVe.

Therefore the tokenizer creates an embedding matrix, so the output of the tokenizer is a vector space containing a representation of the words in our products.
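To make "fit a tokenizer and build an embedding matrix" concrete, here is a minimal Keras sketch; it is not the repo's WordTokenizer, and random vectors stand in for the pre-trained GloVe vectors the project actually loads:

import numpy as np
from keras.preprocessing.text import Tokenizer

texts = ['adult ballet tutu cheetah pink', 'girls ballet tutu neon pink']
EMBEDDING_DIM = 50  # assumed vector size

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index  # word -> integer index

# In the real project each row would come from a pre-trained GloVe vector;
# here we fill the matrix randomly just to show its shape.
embedding_matrix = np.random.rand(len(word_index) + 1, EMBEDDING_DIM)

print(tokenizer.texts_to_sequences(texts))
print(embedding_matrix.shape)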

On to the next script, python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv. This script:

trains a product category classifier based on product titles and descriptions

So here we want to extract the product category! It utilizes classifier.py, which in turn:

  1. Takes as input data (np.array): 2D array representing descriptions of the product and/or product title
  2. And its output: list(dict(str, float)): List of dictionaries of product categories with associated confidence

How does it do it? It trains a model; after all, we have labels (the categories) in our data, so we can train a model.

@startuml

left to right direction

title Train Product Labels Classifier

[Product Reviews with Categories] as CSV
[Labels] as LB
[Products] as PD
[GloVe] as GL
[Word Embedding] as WE
[Network] as NW
[models/classifier.h5] as CP
CSV --> LB : Extract
CSV --> PD : Extract
PD --> WE : Compile Network
LB --> NW : Train 
WE --> NW : Train
GL --> NW : Train
NW --> CP : Predict

@enduml

The output is the model created at models/classifier.h5, and it prints the summary below (results estimated via cross validation).

In code the output layer looks as follows: preds = Dense(len(self.category_map), activation='softmax')(x)

The activation of the output layer (so I read, not that I fully get what it means) is softmax; from what I read, softmax is the activation function used in the output layer when we have multiple classes to predict.

Other possible output functions

  1. linear - linear regression
  2. sigmoid - binary classification
  3. softmax - (the one we use) multi-class classification, which is indeed our problem

Then it compiles the model and it’s using following loss function:

self.model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           metrics=['acc'])

As we can both read, the loss function is 'categorical_crossentropy' (I have no idea which exact function that is, but it's the loss function being used), and the optimization algorithm is rmsprop. An alternative optimizer could be sgd, Stochastic Gradient Descent, but this time we will go with rmsprop, which according to the documentation means: "Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. This is the mini-batch version of just using the sign of the gradient."

# Train a product category classifier based on product titles and descriptions

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
                           precision    recall  f1-score   support

clothing, shoes & jewelry   0.768944  0.683034  0.723448      7250
        sports & outdoors   0.697127  0.700144  0.698632     18022
             toys & games   0.744507  0.877790  0.805673     21193
              movies & tv   0.863326  0.819637  0.840914      2312
                     baby   0.556271  0.666802  0.606542      2461
 tools & home improvement   0.772414  0.678099  0.722190     17698
               automotive   0.871059  0.887794  0.879347     26389
           home & kitchen   0.727050  0.802991  0.763136     16649
    arts, crafts & sewing   0.769580  0.631638  0.693819      5367
          office products   0.678700  0.756802  0.715626      7204
                    books   0.000000  0.000000  0.000000        21
 office & school supplies   0.000000  0.000000  0.000000       109
              electronics   0.752167  0.875671  0.809234     13971
                computers   0.000000  0.000000  0.000000        31
cell phones & accessories   0.910150  0.808887  0.856536      2993
             pet supplies   0.891313  0.773756  0.828384      5967
   health & personal care   0.708116  0.680906  0.694244     15146
              cds & vinyl   0.726473  0.795404  0.759377      1349
      musical instruments   0.866925  0.762178  0.811184      4701
                 software   0.000000  0.000000  0.000000        37
  industrial & scientific   0.441718  0.031115  0.058135      2314
               all beauty   0.000000  0.000000  0.000000       259
              video games   0.000000  0.000000  0.000000        63
                   beauty   0.817036  0.910148  0.861082     14101
     patio, lawn & garden   0.782244  0.611744  0.686567      5790
   grocery & gourmet food   0.873358  0.879315  0.876327      7184
          all electronics   0.000000  0.000000  0.000000        79
            baby products   0.594203  0.093394  0.161417       439
         kitchen & dining   0.000000  0.000000  0.000000        96
          car electronics   0.000000  0.000000  0.000000        11
            digital music   0.000000  0.000000  0.000000       111
         home improvement   0.000000  0.000000  0.000000       117
           amazon fashion   0.546512  0.129121  0.208889       364
               appliances   0.000000  0.000000  0.000000        16
           camera & photo   0.000000  0.000000  0.000000         3
         purchase circles   0.000000  0.000000  0.000000        12
         gps & navigation   0.000000  0.000000  0.000000        15
mp3 players & accessories   0.000000  0.000000  0.000000        23
  collectibles & fine art   0.000000  0.000000  0.000000       103
            luxury beauty   0.000000  0.000000  0.000000        12
         furniture & dcor   0.000000  0.000000  0.000000        17
                            0.000000  0.000000  0.000000         1

              avg / total   0.766003  0.772215  0.763889    200000


real	326m7.851s
user	475m9.852s
sys	25m13.631s

Demo

https://angular-p6yyuv.stackblitz.io

Paper Summary - Attribute Extraction from Product Titles in eCommerce

With no syntactic structure in product titles, this is a challenging problem. The paper concentrates on brand NER extraction.

Vocabulary

| Item | Description |
|------|-------------|
| Product | Any commodity which may be sold by a retailer, e.g. iPhone |
| Attribute | A feature that describes a specific property of a product or a product listing, e.g. color, brand |
| Attribute Value | A particular value assumed by the attribute; for example, for the product title below |

Example: Apple iPad Mini 3 16GB Wi-Fi Refurbished, Gold

| Attribute Name | Attribute Value |
|----------------|-----------------|
| Brand | Apple |
| Product | iPad Mini 3 |
| Color | Gold |
| RAM | 16GB |
| Condition | Refurbished |

Getting both those attribute names and values automatically, without rules, from free-text product titles is challenging.

The common use case which is described in this paper is:

  1. The user searches for a t-shirt.
  2. The user filters by the color red (checkbox/facet).
  3. Results should contain only red t-shirts; note that the filtering is on the unstructured title/description.

The following challenges are presented by the paper:

  1. Lack of syntactic structure, e.g. "Chihuahua Bella Decorative Pillow by Manual Woodworkers and Weavers - SLCBCH", "Real Deal Memorabilia BCosbyAlbumMF Bill Cos...". Due to the diversity of products sold on any leading eCommerce site, product titles do not follow any specific composition.
  2. Different products may contain slightly varying spellings of the same brand.
  3. Some titles may contain abbreviations of brand names.
  4. Brand names in titles may contain typographical errors.
  5. Generic or unbranded products.
  6. There are categories of products for which brand name is not an important attribute.
  7. The list of brand names relevant to a given product catalog is constantly changing.
  8. Collecting expert feedback, either for generating training data or for validating model-generated labels, is subject to inter-annotator disagreement.

You get the idea.

The paper continues and describes other approaches such as:

Other Approaches

Dictionary based lookup

prepare a curated lexicon of attribute values and given a product title, scan it to find a value from the list

Alas:

  1. The curated list needs to be constantly updated.
  2. For certain attributes, the number of values of a single attribute is on the order of the number of products (e.g. part number).
  3. An attribute value may appear in multiple forms, so the curated list needs to keep track of all variations.
  4. Multiple matches: the system needs to decide which value to choose.

Crowd Sourcing

Ineffective: at the scale of a retail catalog (millions of products), attribute values need to be standardized and expert intervention is needed.

Rule based extraction

Rule-based systems have had success with texts that have grammatical structure. However:

product titles do not conform to a syntactical structure or grammar unlike news articles or prose

So maybe apply rules to the product description and not only the title? But what if the description refers to competitors?

Creating and maintaining rules for hundreds or thousands of attributes is challenging. Smells like machine learning models are needed.

Supervised text classification

With naive Bayes, SVM or logistic regression. According to the paper, these methods can be suitable when the number of classes is known and small. It adds the following:

In contrast, when the number of classes is in tens of thousands, we will need a lot labeled training data and the model footprint will also be large. However, the main drawback with these models for attributes like brand and manufacturer part number is that they can only predict classes on which they are trained. Thus, in order to predict new brand values, the training data will need to be constantly updated with labeled data corresponding to new brands. In the case of manufacturer part number, this approach is essentially worthless since every new product will likely have an unseen part number

Sequence Labeling Approaches

The paper moves on to the way it's going to extract the features and values of products, which falls under the category of "Sequence Labeling Approaches". While we are talking about sequences, a mini google search for what "sequence labeling" means yields the following informative description:

Often we deal with sets in applied machine learning such as a train or test sets of samples.

Each sample in the set can be thought of as an observation from the domain.

In a set, the order of the observations is not important.

A sequence is different. The sequence imposes an explicit order on the observations.

The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.

https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/07/Example-of-a-Sequence-Prediction-Problem.png

And according to: “— Sequence Learning: From Recognition and Prediction to Sequential Decision Making, 2001.”:

Sequence prediction attempts to predict elements of a sequence on the basis of the preceding elements

For example, given a sequence of previous weather temperatures, predict the following day's temperature.

Note also that sequence generation can generate novel sequences, for example generating music!

They then give an example of a feature function. A feature function scores word sequence x and label sequence y at index i (no product types yet), for example for POS (Part Of Speech) tagging. Here is the example function.

We have a labeled sequence: for each word x_i we have a label y_i, and we want a feature function.

\begin{equation} f(x,y,i) = \begin{cases} 1 & \text{if } x_i = \text{the and } y_i = \text{DT} \\ 0 & \text{otherwise} \end{cases} \end{equation}

Meaning, for the word at index i the feature fires if the word x_i is the and its label y_i is DT (the determiner POS tag), so the output of the feature function is either 0 or 1 for each word.
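In code that indicator feature is essentially the following (a hypothetical helper mirroring the formula above):

def feature_fn(x, y, i):
    # Fires (returns 1) only when token i is "the" and its label is DT (determiner).
    return 1 if x[i] == 'the' and y[i] == 'DT' else 0

print(feature_fn(['the', 'cat'], ['DT', 'NN'], 0))  # 1
print(feature_fn(['the', 'cat'], ['DT', 'NN'], 1))  # 0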

Creating the training set

To create the training set, the paper mentions that instead of manually labeling, they created a set of regular expressions that catch exact brand names; this also limited the noise because they didn't catch errors (at least they think so). They also added product titles that did not have any brand name, so the labeled training set includes examples without brands.

Interpreting output labels

Their function is currently output_labels = learning_algorithm(product-title-x): Seq[(Token, Label)], meaning that if they apply their learning algorithm they get back each token of the product title together with the label the learning algorithm assigned to it.

Now they need to transform this labeling into a candidate brand name: toBrand(Seq[(Token, Label)]): BrandName, and they do this, not surprisingly, by looking for the "Brand" label among the tagged tokens.
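So the final mapping from labeled tokens to a brand-name candidate is essentially the following hypothetical sketch, assuming the B-B/I-B/O labels from the earlier tagging scheme:

def to_brand(tagged_tokens):
    # Collect the tokens the sequence model labeled as part of a brand.
    return ' '.join(tok for tok, label in tagged_tokens if label in ('B-B', 'I-B'))

print(to_brand([('apple', 'B-B'), ('ipad', 'O'), ('mini', 'O')]))  # "apple"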

Sentence Compression

While googling some more I've noticed there is another approach to text summarization called "Sentence Compression". This approach is more compelling to me because, from all the search results I get, it looks like a fully automatic process (except for training). In this case we are taking a rather small text and compressing it, deleting unneeded words.

Sentence compression is a paraphrasing task where the goal is to generate sentences shorter than given while preserving the essential content

Sentence compression is a standard NLP task where the goal is to generate a shorter paraphrase of a sentence. Dozens of systems have been introduced in the past two decades and most of them are deletion-based: generated compressions are token subsequences of the input sentences (Jing, 2000; Knight & Marcu, 2000; McDonald, 2006; Clarke & Lapata, 2008; Berg-Kirkpatrick et al., 2011, to name a few).

References:

Overcoming the Lack of Parallel Data in Sentence Compression

Sentence Compression by Deletion with LSTMs

Resources

| Resource | Link |
|----------|------|
| Sentence Compression by Deletion with LSTMs | https://research.google.com/pubs/archive/43852.pdf |
| Models Zoo - Ready Made Models | https://modelzoo.co/ |
| A Neural Attention Model for Abstractive Sentence Summarization | https://arxiv.org/abs/1509.00685 |
| TensorFlow-Summarization | https://github.com/thunlp/TensorFlow-Summarization |
| Web scraper | http://webscraper.io/ |
| Dzone on text summarization | https://dzone.com/articles/a-guide-to-natural-language-processing-part-3 |
| DataSet | https://duc.nist.gov/duc2004/ |
| Google Research DataSets for Sentence Compression | https://github.com/google-research-datasets/sentence-compression |
| How do I download the DUC dataset for text summarization? | https://www.quora.com/How-do-I-download-DUC-dataset-for-text-summarization |
| **EXAMPLE**: Keras text summarization on news | https://github.com/chen0040/keras-text-summarization |
| Example: NLTK Simple Summarization | https://dev.to/davidisrawi/build-a-quick-summarizer-with-python-and-nltk |
| Example: Text Summarization ROUGE scoring | http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85 |
| SumBasic Clustering | http://www.cs.middlebury.edu/~mpettit/project.html |
| Keras Text Classification | https://medium.com/skyshidigital/getting-started-with-keras-624dbf106c87 |
| NLP for hackers: TextRank for Text Summarization | https://nlpforhackers.io/textrank-text-summarization/ |
| Track NLP Status and Progress - Summarization | https://github.com/sebastianruder/NLP-progress/blob/master/summarization.md |
| Sentence Compression and Text Summarization - Many resources | https://github.com/mathsyouth/awesome-text-summarization |
| Google AI Portal | https://ai.google |
| Text Summarization Thesis | https://tinyurl.com/text-summarization-thesis |
| Text Compression Deletion Impl based on Katja Filippova's Paper | https://github.com/zhaohengyang/Generate-Parallel-Data-for-Sentence-Compression |
| Katja Filippova Multi-Sentence Compression Paper | http://www.aclweb.org/anthology/C10-1037 |
| Overcoming the Lack of Parallel Data in Sentence Compression | https://www.aclweb.org/anthology/D/D13/D13-1155.pdf |
| Towards Data Science NER | https://tinyurl.com/towarddatascience-ner |
| Walmart Ajinkya Product Attributes | https://tinyurl.com/ajnkya-product-attributes |
| Walmart Ajinkya Product Attributes Paper | https://arxiv.org/pdf/1608.04670.pdf |
| Current state of NLP Summarization | https://github.com/sebastianruder/NLP-progress/blob/master/summarization.md |
| Current state of NLP | https://github.com/sebastianruder/NLP-progress |
| Google Search: Transform Text To Human Readable | https://tinyurl.com/search-transform-human-readabl |
| Google Search: Stanford NLP Extract From Title | https://tinyurl.com/stanford-nlp-extract-product-d |
| Google Search: Product NER | https://tinyurl.com/google-search-product-ner |
| Robust Tree-Structured Named Entities Recognition from Speech | http://www.irisa.fr/texmex/people/raymond/pub/icassp2013.pdf |
| An Extractive Text Summarizer Based on Significant Words | https://tinyurl.com/extractive-words-summary |
| Google Scholar: Text Summarization | https://tinyurl.com/google-scholar-text-summarizat |
| Text Summarization on HackerNews | https://tinyurl.com/text-summarization-hn |
| Online Web Scraper | http://webscraper.io/ |
| Awesome Training Data | https://tinyurl.com/google-search-awesome-training |
| Awesome Public Data Sets | https://github.com/awesomedata/awesome-public-datasets |
| Kaggle Text Summarization | https://tinyurl.com/kaggle-text-summarization |
| Awesome Text Summarization | https://github.com/icoxfog417/awesome-text-summarization |
| NLTK scikit tensorflow text summarization | https://tinyurl.com/nltk-scikit-tensorflow-text-su |

Summary

We have seen that there are existing methods, github repositories and papers for summarizing text, for sentence compression, for identifying a topic based on product title and description, and for producing summaries based on NER. The future looks both interesting and promising, but also very difficult.