- Be able to run naive, basic, non-machine-learning code that summarizes text based on TFIDF.
- Why such a naive implementation works only partially.
- The 3 main methods for text summarization:
3.1 **Text Summarization** Summarize a block of text; a brief overview of techniques (from a published paper).
3.2 **Sentence Compression** This one aims to compress a single sentence.
3.3 **NER based summarization** Look at the entities of the text (price, brand, product) and summarize by them.
- Extractive text summarizers vs semantic summarizers.
- Run code to identify the topic and brand of a text with deep learning (online demo available).
- Run code to summarize multiple sentences with the takahe GitHub project.
- How Walmart handles attribute extraction from product titles in eCommerce.
- Run code that extracts common NER entities (brand, money, location) from a sentence.
- What SumBasic is
- Have plenty of resources and directions to continue from here.
Here are some of the examples we are going to go through:
- Naive summarization of a Wikipedia page about Leonardo da Vinci.
- Extract the topic.
- Detect the topic of the pneumonia Wikipedia article.
- Find out the category and brand of an Amazon product.
- Detect the topic of the abortion Wikipedia article.
- Summarize news about Clinton.
- Use spacy to extract features of text.
Sounds like a lot, let’s get started.
I just began to study this topic; most of the things I’m either talking about or practicing here I’m doing for the first time. I’m by no means an expert and not even a novice :). There will be mistakes in this article. I will share my learning and discoveries with you. This document is a work in progress and will get updated. In addition, while running most of the examples it looked like nothing predicted every case I entered well; there were always mistakes!
So we want to understand text summarization / sentence compression / NER based summarization, let’s have a plan:
- The jargon.
- Published research and references to it.
- History of text summarization.
- The different methods.
- Let’s write some code.
- Where I plan to head next.
- Summary.
I always find that in any topic I study the jargon/taxonomy/terminology is one of the most important things to know so here it is:
What is text summarization? An example works great here, so below is a real-world one:
Article:
novell inc. chief executive officer eric schmidt has been named chairman of the internet search-engine company google .
Human Summary:
novell ceo named google chairman
Machine Summary:
novell chief executive named to head internet company
Reference: TensorFlow Research Text Summarization
Yes, most text summarization training data, research, and example models are focused on news; if you are not in the news business, chances are you will need to get your own data and retrain, as there are no ready-made models for you.
How do we (humans, although some bots are also reading this..) summarize text? We read it fully or partially, understand, fill in context, reread, read other docs, think, apply intuition, apply templates (finance), assume audience expectations, highlight important items, sleep on it... I have to stop here.
And then:
**We come up with a much shorter version of the original doc which contains the main ideas and shares the intent of the original doc - the glorious summary**
or as the “Text Summarization Techniques” paper says:
a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that
How do they (computers) summarize text, given that the above process is so complex?
Who knows!
Can they do that? If yes can they do that in a satisfactory manner? Will they have embarrassing mistakes? How far are they from humans? Or maybe how advanced?
Text Summarization Techniques: A Brief Survey
This is the paper we need to get started; its premise is:
We review the different processes for summarization and describe the effectiveness and shortcomings of the different methods.
Sounds promising. We will get back to this paper, but I want to scan the topic some more; let’s even start with a naive example.
**Note: in our use case we are less interested in complete human sentences and more in a few words together which summarize the topic.**
Our first naive code implementation: no machine learning, just take some text and try to summarize it somehow, with common sense. Let’s see:
Step 1: Here is our bunch of text to summarize:
text: str = """
Leonardo da Vinci
Leonardo di ser Piero da Vinci (Italian: [leoˈnardo di ˌsɛr ˈpjɛːro da (v)ˈvintʃi] (About this sound listen); 15 April 1452 – 2 May 1519), more commonly Leonardo da Vinci or simply Leonardo, was an Italian polymath of the Renaissance, whose areas of interest included invention, painting, sculpting, architecture, science, music, mathematics, engineering, literature, anatomy, geology, astronomy, botany, writing, history, and cartography. He has been variously called the father of palaeontology, ichnology, and architecture, and is widely considered one of the greatest painters of all time. Sometimes credited with the inventions of the parachute, helicopter and tank,[1][2][3] he epitomised the Renaissance humanist ideal.
Many historians and scholars regard Leonardo as the prime exemplar of the "Universal Genius" or "Renaissance Man", an individual of "unquenchable curiosity" and "feverishly inventive imagination",[4] and he is widely considered one of the most diversely talented individuals ever to have lived.[5] According to art historian Helen Gardner, the scope and depth of his interests were without precedent in recorded history, and "his mind and personality seem to us superhuman, while the man himself mysterious and remote".[4] Marco Rosci notes that while there is much speculation regarding his life and personality, his view of the world was logical rather than mysterious, and that the empirical methods he employed were unorthodox for his time.[6]
Born out of wedlock to a notary, Piero da Vinci, and a peasant woman, Caterina, in Vinci in the region of Florence, Leonardo was educated in the studio of the renowned Florentine painter Andrea del Verrocchio. Much of his earlier working life was spent in the service of Ludovico il Moro in Milan. He later worked in Rome, Bologna and Venice, and he spent his last years in France at the home awarded to him by Francis I of France."""
Leonardo was a good man, let’s naively summarize him.
First, how would you summarize this text, let’s say limiting it to 7 words?
I would say this:
My modest summary: “Leonardo Da Vinci, Italian, Renaissance, painter, sculptor”
Now let’s move on with our naive code implementation:
Step 2: Tokenize the words:
from nltk.tokenize import word_tokenize  # thanks nltk

words = word_tokenize(text)
Step 3: Score words based on their frequency
from nltk.probability import FreqDist  # thanks nltk

words_score: FreqDist = FreqDist()
for word in words:
    words_score[word.lower()] += 1
Step 4: The summary would be our top 7 frequent words:
def word_index(text: str, w) -> int:
    # assumed helper: position of the word's first appearance in the text (w is a (word, count) tuple)
    return text.lower().find(w[0])

def top_scores_sorted_by_text(w_scores: FreqDist, k: int):
    return sorted(w_scores.most_common(k), key=lambda w: word_index(text, w))
summary = top_scores_sorted_by_text(words_score, 7)
print(summary)
Let’s see our result
[('[', 15), ('his', 17), (',', 67), ('of', 31), ('the', 32), ('and', 26), ('.', 21)] # that's a horrible summary!
We have his, of, and the in there; obviously we don’t want them in our summary, so let’s get rid of them:
Step 5: Get rid of stop words
from typing import Set
from nltk.corpus import stopwords  # thanks nltk

stop_words: Set[str] = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]  # thanks python
text = ' '.join(words)  # the updated text (sorry, immutability) is now a join of the words without the stop words
Now let’s print the resulting summary again:
[('leonardo', 11), ('da', 5), ('vinci', 6), ('[', 15), (']', 15), (',', 67), ('.', 21)]
This is a somewhat better version: having leonardo da vinci as the first 3 words of the summary sounds perfect! But we also have a lot of punctuation, let’s get rid of it:
Step 6: Get rid of punctuations
import string  # standard python (thanks)

def remove_punctuations(s: str) -> str:
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)
text = remove_punctuations(text)
And print the summary again:
[('leonardo', 9), ('da', 5), ('vinci', 6), ('he', 4), ('renaissance', 4), ('painting', 4), ('engineering', 3)]
Uh, this looks much better. There is one issue: we have he in the summary and we don’t want it; we have only 7 words and no space to waste. Could it be that Leonardo was proficient in yet another topic?
Step 7: Fix stop word bug
We have a bug: we removed the stopwords with [w for w in words if w not in stop_words], but somehow the he stopword sneaked in. Let’s fix it. The problem is that we didn’t lower-case the text, so He was not recognized as the stopword he:
text = text.lower() # no immutability small example.
And now let’s run the summary again:
[('leonardo', 9), ('da', 5), ('vinci', 6), ('renaissance', 4), ('painting', 4), ('engineering', 3), ('inventions', 3)]
No more he stopword. This even looks like a much better summary than my original (human) one!
**But don’t get excited; there are millions if not billions of summaries this naive, dumb summarizer would not pass, just think of products for sale. If we think of products for sale we need a better flow.**
We could think of more enhancements:
- Give a higher score to words appearing in the title.
- Refer to the query (if the user got to this page via search).
- More..
Let’s summarize what we have done in the above naive summarizer:
Text Summarization, Very Naive Implementation:
- Get some text from Wikipedia
- Cleanup: remove punctuations, lower case, remove stopwords
- Words scoring: build a frequency table
- Select the top k words as our summary
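Putting it all together, here is a consolidated, runnable sketch of the naive pipeline above (it assumes nltk and its stopwords/punkt data are installed, and the same text variable as before):

import string
from typing import List, Tuple
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords

def naive_summary(text: str, k: int = 7) -> List[Tuple[str, int]]:
    # Cleanup: remove punctuation and lower case
    cleaned = text.translate(str.maketrans({p: None for p in string.punctuation})).lower()
    # Tokenize and drop stop words
    stop_words = set(stopwords.words("english"))
    words = [w for w in word_tokenize(cleaned) if w not in stop_words]
    # Score words by frequency and keep the top k, ordered by first appearance in the text
    scores = FreqDist(words)
    return sorted(scores.most_common(k), key=lambda w: cleaned.find(w[0]))

# naive_summary(text, 7)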
A few points to note:
- This is an extractive text summarizer: we didn’t invent anything, there is no semantic understanding, we just selected words.
- There is a better algorithm called SumBasic.
The difference between extractive and semantic is that an extractive summarizer takes phrases from the text, so in that sense it cannot go wrong, it only takes things which preexisted in the text; a semantic summarizer will try to actually understand the text and compose new text.
Here is the formula for SumBasic:
\begin{equation} g(S_j) = \frac{\sum_{w_i \in S_j} P(w_i)}{\left|\{w_i \mid w_i \in S_j\}\right|} \end{equation}
This looks complex to me. But I found that after I got what each symbol means it became simple, even embarrassingly simple.
Here is the meaning of that formula:
term | meaning |
---|---|
g(S_j) | Weight of sentence j |
w_i ∈ S_j | For each word that belongs to sentence j |
∑_{w_i ∈ S_j} P(w_i) | The sum of the probabilities of all words that belong to sentence j |
\|{w_i : w_i ∈ S_j}\| | The number of (distinct) words in sentence j |
So g(S_j) turns out to be the average probability of the words in sentence j, where a word’s probability P(w_i) is simply the number of occurrences of w_i inside the document divided by the total number of words.
This is very similar to what we did with words, without knowing SumBasic! In our case we wanted a bunch of words and not a bunch of sentences, so we just took the words appearing most often, which is similar to taking the sentences with the highest word probability.
SumBasic then updates the probability of each word that was already picked by multiplying it by itself (squaring it reduces it, since probabilities are smaller than 1), so that other sentences can now be picked, and it keeps looping until we have picked as many sentences as we meant to.
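Here is a minimal sketch of that SumBasic loop as I understand it from the formula above (my own toy version, not a reference implementation; sentence splitting and cleanup are simplified):

from collections import Counter
from nltk import sent_tokenize, word_tokenize

def sumbasic(text: str, n_sentences: int = 3):
    sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
    all_words = [w for s in sentences for w in s]
    # P(w) = occurrences of w divided by the total number of words in the document
    prob = {w: c / len(all_words) for w, c in Counter(all_words).items()}
    chosen = []
    while len(chosen) < min(n_sentences, len(sentences)):
        # g(S_j): average probability of the (distinct) words in each not-yet-chosen sentence
        best = max((s for s in sentences if s not in chosen),
                   key=lambda s: sum(prob[w] for w in set(s)) / len(set(s)))
        chosen.append(best)
        # Reduce the probability of the words we already used: p <- p * p
        for w in set(best):
            prob[w] *= prob[w]
    return [' '.join(s) for s in chosen]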
There is an interesting GitHub repo named takahe (based on the papers below); let’s give it a shot:
takahe is a multi-sentence compression module. Given a set of redundant sentences, a word-graph is constructed by iteratively adding sentences to it. The best compression is obtained by finding the shortest path in the word graph. The original algorithm was published and described in:
Katja Filippova, Multi-Sentence Compression: Finding Shortest Paths in Word Graphs, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322-330, 2010.
Let’s give it a shot:
conda create -n takahe-py2 python=2.7
conda activate takahe-py2
conda install -y graphviz pygraphviz spyder numpy networkx
git clone https://github.com/boudinfl/takahe
pip install secretstorage
pip install networkx==1.1
Now we give it some text, but it requires POS-annotated text:
["The/DT wife/NN of/IN a/DT former/JJ U.S./NNP president/NN
#Bill/NNP Clinton/NNP Hillary/NNP Clinton/NNP visited/VBD China/NNP last/JJ
#Monday/NNP ./PUNCT", "Hillary/NNP Clinton/NNP wanted/VBD to/TO visit/VB China/NNP
#last/JJ month/NN but/CC postponed/VBD her/PRP$ plans/NNS till/IN Monday/NNP
#last/JJ week/NN ./PUNCT", "Hillary/NNP Clinton/NNP paid/VBD a/DT visit/NN to/TO
#the/DT People/NNP Republic/NNP of/IN China/NNP on/IN Monday/NNP ./PUNCT",
"Last/JJ week/NN the/DT Secretary/NNP of/IN State/NNP Ms./NNP Clinton/NNP
#visited/VBD Chinese/JJ officials/NNS ./PUNCT"]
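For reference, this is roughly how the annotated list above (stored in the variable sentences) is fed into takahe according to the project's README; I am quoting the API from memory, so the exact parameter names may differ:

import takahe

# Build the word graph over the redundant, POS-annotated sentences
compresser = takahe.word_graph(sentences, nb_words=6, lang='en', punct_tag="PUNCT")
# Get the 50 best compression candidates (shortest paths in the word graph)
candidates = compresser.get_compression(50)
for cumulative_score, path in candidates:
    # Normalize the path score by its length and print the compressed sentence
    normalized_score = cumulative_score / len(path)
    print("%0.3f %s" % (normalized_score, ' '.join(token for token, tag in path)))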
And the summarization results are:
0.234 hillary clinton visited china last week.
0.247 hillary clinton visited china on monday last week.
...
0.306 hillary clinton paid a visit to the people republic of china last week.
...
We are still summarizing news :( we need to revisit our plan and our GitHub and Google searches :)
Now that we have done a variation on SumBasic for words instead of sentences, let’s move on with more examples appearing on the web, namely algorithms that do more actual understanding of the text and compose new text, rather than just choosing and extracting a ready-made summary from our existing text.
**Step 1: Model: Classify text**
Is the text about an artist? Is the text about a car? Is the text about an electric cleaning machine?
**Step 2: Manual: Identify the main features of the topic**
That is the ontology: we want to identify the topic, and once we get the topic we can get better at the summarization (you see, we get to understand the text). Say we have identified that the text is about an electric washing machine; this means we need these features (and this is the task: identify the features):
- Watts
- Target
- Price
- Size
But how can we get the topic? And how can we then get the relevant features?
**Step 3: Given an article, identify the topic and fill in the feature values**
So given an article identify:
- Which topic is it about?
- What are the features of that topic?
- Fill in the values from the article about the features of that topic.
Sounds like a plan!
This is also called **Text Classification**. There are 3 main approaches to achieving Text Classification:
- Rules
- Standard Machine Learning Models
- Deep Learning
I don’t have time for rules, my laptop is too slow for deep learning, and I’m not sure I have enough data, so I’ll go with option 2, standard models, and then move on to deep learning on EC2.
There is a great example (I’m doing this for the first time) on the sklearn website for how to build a model to classify text: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html I’m simply going to use and run it.
Creating the model and predicting the class/topic for an article will involve the following steps:
- Load labeled newsgroups data with topics.
- Vectorize the documents, BOW (Bag Of Words).
- We can do better than BOW, so we are going to TFIDF the docs to get the feature vectors.
- Run train
- Predict
We are not going to check the accuracy, just run an arbitrary example through the model.
Note that sklearn will handle the large sparse matrix issue (which would otherwise consume a lot of RAM) for us; it stores the matrices in a sparse representation automatically. (Did I say thanks, sklearn?)
**Step 1: Load Labeled newsgroups data with topics**
from sklearn.feature_extraction.text import CountVectorizer
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)
twenty_train.target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
In the above code we:
- Define our categories; we have defined 4 newsgroup categories. Note that sklearn knows to fetch this example data automatically for us.
- Load the text data into the variable twenty_train.
- Add a member named target_names to twenty_train with our categories.
**Step 2: Feature engineering**
We have loaded our data, which is just a set of newsgroup posts. What are its features? It’s text data, so it has words, right? So each distinct word is going to serve as a feature. In our case BOW means a matrix where each doc is a row and each column is a word, and we count the number of times each word appears in each doc. Guess what, sklearn will do that automatically for us and also store the sparse matrix efficiently (most words do not appear in every doc).
BOW code:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data) # Tokenize, filter stopwords, extract BOW features, transform to vectors; this returns the Term-Document Matrix! thanks sklearn
That’s it: with 2 lines we have tokenized the newsgroup messages, filtered stopwords, extracted BOW features, and transformed them to vectors (numbers).
BOW is skewed toward large documents, where words appear more times, so we are going to use TFIDF vectorizing instead of plain BOW; here is the code to do that:
**Step 3: Replace BOW with TFIDF**
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts) # Transform a count matrix to a normalized tf or tf-idf representation
X_train_tf = tf_transformer.transform(X_train_counts) # Transform a count matrix to a tf or tf-idf representation # X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
The above code is self-explanatory: we first do TF and then IDF. Note that we do all these operations with just a few lines; sklearn appears to be very developer friendly and has a concise and clear API, no wonder it’s so common.
Now that we have our data loaded, and extracted all the features from it (vectorized with tfidf) it’s time to build the model.
**Step 4: Build the model to predict class of newsgroup message**
from sklearn.naive_bayes import MultinomialNB # Naive bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
There are multiple possible classifiers; we are following the sklearn example, so we have chosen the same one. We then called fit and passed as input X_train_tfidf, the set of features for each doc (the tfidf vectors), and as the labels/output we passed twenty_train.target, which is the vector of topics for each row.
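As a side note, the same sklearn tutorial shows that these three steps (counting, tfidf weighting, classifier) can be chained with a Pipeline, which makes retraining and predicting a single call:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Chain vectorizing, tfidf weighting and the classifier into one estimator
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(twenty_train.data, twenty_train.target)
# text_clf.predict(docs_new) would then run the whole chain on new documents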
Now for money time: we are going to predict something. I’m going to take an arbitrary Wikipedia article that deals with one of the 4 categories and see if it’s predicted well. So what have we got there: science medicine, religion, computer graphics, and atheism.
To test the prediction we are not going to run on a full set of articles, just pick two example articles from Wikipedia and see the predicted outcome. First let’s pick what I think is an easy one: the Wikipedia article about pneumonia. I will take its first two sections, run them through the model prediction, and see which category is chosen.
## Predict document class!
# https://en.wikipedia.org/wiki/Pneumonia
docs_new = ["""pneumonia is an inflammatory condition of the lung affecting primarily the small air sacs known as alveoli.[4][13] Typically symptoms include some combination of productive or dry cough, chest pain, fever, and trouble breathing.[2] Severity is variable. Pneumonia is usually caused by infection with viruses or bacteria and less commonly by other microorganisms, certain medications and conditions such as autoimmune diseases.[4][5] Risk factors include other lung diseases such as cystic fibrosis, COPD, and asthma, diabetes, heart failure, a history of smoking, a poor ability to cough such as following a stroke, or a weak immune system.[6] Diagnosis is often based on the symptoms and physical examination.[7] Chest X-ray, blood tests, and culture of the sputum may help confirm the diagnosis.[7] The disease may be classified by where it was acquired with community, hospital, or health care associated pneumonia"""]
X_new_counts = count_vect.transform(docs_new) # Extract new doc features.
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
print('%r => %s' % (doc, twenty_train.target_names[category]))
Now after running this pneumonia text through the model we get this prediction:
it was acquired with community, hospital, or health care associated pneumonia' => sci.med
(science medical), so it got categorized as sci.med, which is simply correct!
Now let’s try another piece of text, this time about abortion, and see what the model predicts. Here is the new text we have fed it, the first section of https://en.wikipedia.org/wiki/Abortion again, which is:
> Abortion is the ending of pregnancy by removing an embryo or fetus before it can survive outside the uterus.[note 1] An abortion that occurs spontaneously is also known as a miscarriage. An abortion may be caused purposely and is then called an induced abortion, or less frequently, “induced miscarriage”. The word abortion is often used to mean only induced abortions. A similar procedure after the fetus could potentially survive outside the womb is known as a “late termination of pregnancy”
And the resulting prediction by the model is:
...survive outside the womb is known as a "late termination of pregnancy' => soc.religion.christian
Which means that abortion was categorized under the soc.religion.christian category => I don’t know whether to be happy, sad, depressed, or excited by this prediction.
**Summary of step 1**
It looks like there is a way to determine the class of a text snippet by its content using machine learning models. For sure there are challenges, but this appears to be a rather well-known problem and there are available methods for solving and optimizing it (changing the model, tuning parameters, better training input data).
Now for the next step: we expected that for each class/topic we would select the set of features to be used for text summarization. I’m afraid this part has to be manual; we have to say that for the topic “disease”, the features are going to be a closed set such as “mortality rate”, “susceptible age group”, “name”, “average length”, while for the “cars” topic the summary template variables are going to be: “manufacturer”, “engine type”, “year”, “color”, “used/new”, etc. It appears that these sets of summary template variables are going to be hand crafted, as in the sketch below.
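To make the idea concrete, a hand-crafted template mapping could look something like this (the topic names and features here are made up for illustration):

# Hypothetical, hand-crafted summary templates per topic
summary_templates = {
    "disease": ["name", "mortality rate", "susceptible age group", "average length"],
    "cars": ["manufacturer", "engine type", "year", "color", "used/new"],
}

def template_for(topic: str):
    # Fall back to an empty template for topics we have not crafted yet
    return summary_templates.get(topic, [])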
The question for step 3 is whether a model could extract the set of “variable values” from articles and build a summary from them. I don’t have the answer, at least not at my current googling phase.
Steps 2 and 3 look like a lot of manual work; is it possible that some googling could surface better, more automatic solutions or better approaches to this summarization problem?
As we said in the previous section, extracting the relevant features for a topic is either heavy manual work or magic-computer work. You see, every topic, every discussion, has its own unique set of features: if it’s a luggage you have the dimensions, the color, whether it applies to low-cost airlines or not, and of course the brand name. I’m sure there must be a way out of this without programming the universe from scratch again.
After doing some more googling, NER looks like a good candidate, at least for part of the problem. NER? Looking at spacy.io I see they have already implemented some common NER entity types and have an API to train new ones; the Stanford NLP libraries also have an NER, this time in Java.
According to Towards Data Science:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc
Let’s have a look at the abilities of spacy and what it can do for us. According to spacy’s documentation:
The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.
According to its documentation it can identify (among others) the following entities: PERSON, ORG (companies), PRODUCT, WORK_OF_ART (books, ..), PERCENT, MONEY, QUANTITY, and a few more.
In addition it allows you to extend it and train new models to recognize new entities.
Let’s try it out with its basic usage.
We start with their example:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
And when I run it I get:
(u'Apple', 0, 5, u'ORG')
(u'U.K.', 27, 31, u'GPE')
(u'$1 billion', 44, 54, u'MONEY')
So it has recognized the company Apple, the geopolitical entity (GPE) U.K., and a small amount of money: $1 billion.
Let’s change the input sentence to: Google is looking at buying U.K. startup for $1 billion, if it works it might buy Apple, and see whether it now identifies two companies. The result of running the above code is:
(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')
(u'Apple', 84, 89, u'ORG')
What if I change Apple to apple, that is: Google is looking at buying U.K. startup for $1 billion, if it works it might buy apple
(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')
Aha, so apple in lower case does not count as a company. What if Google decides to eat an Apple, with upper case: Google is looking at buying U.K. startup for $1 billion, if it works it might eat an Apple
(u'Google', 0, 6, u'ORG')
(u'U.K.', 28, 32, u'GPE')
(u'$1 billion', 45, 55, u'MONEY')
(u'Apple', 85, 90, u'ORG')
So apparently if Google decides to eat an Apple, it’s eating a company. Interesting.
Let’s take some arbitrary product from eBay and feed it into spacy’s NER. I’m taking ~Apple iPhone 8 4.7” Display 64GB UNLOCKED Smartphone US $499.99~ and let’s see how spacy’s NER parses it:
(u'Apple iPhone 8 4.7', 0, 18, u'ORG')
(u'64', 28, 30, u'CARDINAL')
(u'UNLOCKED', 33, 41, u'PERSON')
(u'Smartphone', 42, 52, u'DATE')
(u'US', 53, 55, u'GPE')
(u'499.99', 57, 63, u'MONEY')
So the ORG was identified as Apple iPhone 8 4.7, which is not so good: I’m not aware of such a company, and it should have been a product. 64 was identified as CARDINAL, which is good. UNLOCKED was identified as a person, Smartphone as a date, US as a geographic entity, and 499.99 as money. This is partially good, but definitely not satisfactory.
The good thing to remember is that spacy says you can train new models, so possibly with additional training on more domain-specific items we could reach better results.
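For completeness, here is a rough sketch of what training spacy on a new entity type looks like, based on the spaCy 2.x training example in their docs; the training example itself is made up, and the exact API differs between spaCy versions:

import random
import spacy

# A single made-up training example: character offsets plus the entity label
TRAIN_DATA = [
    ("Apple iPhone 8 4.7 Display 64GB UNLOCKED Smartphone",
     {"entities": [(0, 5, "ORG"), (6, 14, "PRODUCT")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("PRODUCT")

optimizer = nlp.begin_training()
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        # Update the model one example at a time
        nlp.update([text], [annotations], sgd=optimizer, drop=0.35)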
The code below, from the GitHub project ProductNER, is meant to automatically extract features from product titles and descriptions. Below we explain how to install and run the code, and the implemented algorithms. We also provide background information including the current state-of-the-art in both sequence classification and sequence tagging, and suggest possible improvements to the current implementation. Let’s analyze what it’s doing! The code uses deep learning for NLP. Deep learning is especially important here as it provides better performance; the models may require more data, but they require less linguistic expertise to train and operate. In addition, deep learning models can learn the features themselves from the raw text, rather than having an expert extract them, which is required even for standard machine learning.
In general our manually designed features tend to be overspecified, incomplete, take a long time to design and validate, and only get you to a certain level of performance at the end of the day. Where the learned features are easy to adapt, fast to train, and they can keep on learning so that they get to a better level of performance than we’ve been able to achieve previously.
Chris Manning, Lecture 1 – Natural Language Processing with Deep Learning, 2017.
According to the documentation we first run python parse.py metadata.json, so let’s see what parse.py does. First, how does our input look? It’s called metadata.json and here are its first few lines:
{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 't
{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'E
{'asin': '0000037214', 'related': {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', 'B00E1YRI4C']}, 'title': 'Purple Sequin Tiny Dancer Tutu Ballet Dance Fa
{'asin': '0000032069', 'title': 'Adult Ballet Tutu Cheetah Pink', 'price': 7.89, 'imUrl': 'http://ecx.images-amazon.com/images/I/51EzU6quNML._SX342_.jp
{'asin': '0000031909', 'related': {'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '0000031895', 'B00D2K1M3O', '0000031852', 'B00D0WDS9A', 'B00D10CLVW', 'B
{'asin': '0000032034', 'title': 'Adult Ballet Tutu Yellow', 'price': 7.87, 'imUrl': 'http://ecx.images-amazon.com/images/I/21GNUNIa1CL.jpg', 'related':
{'asin': '0000589012', 'title': "Why Don't They Just Quit? DVD Roundtable Discussion: What Families and Friends need to Know About Addiction and Recove
It opens metadata.json and then reads it line by line; for each line it searches for:
if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
So it checks whether each of the above is in the line, and if so it puts them into variables together with the description and categories. Its output is products.csv:
Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory,Big Dreams,,"Clothing, Shoes & Jewelry / Girls / Clothing, Shoes & Jewelry
Adult Ballet Tutu Cheetah Pink,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts
Girls Ballet Tutu Neon Pink,Unknown,High quality 3 layer ballet tutu. 12 inches in length,Sports & Outdoors / Other Sports / Dance
Adult Ballet Tutu Yellow,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts
Girls Ballet Tutu Zebra Hot Pink,Coxlures,TUtu,Sports & Outdoors / Other Sports / Dance
Adult Ballet Tutu Purple,BubuBibi,,Sports & Outdoors / Other Sports / Dance / Clothing / Girls / Skirts
So what we see above is title,brand,description,categories inside products.csv, and that was parse.py.
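A simplified sketch of what this parsing step boils down to (my own approximation, not the actual parse.py code; the Amazon metadata lines are Python-style dicts, so ast.literal_eval can read them):

import ast
import csv

with open('metadata.json') as src, open('products.csv', 'w') as dst:
    writer = csv.writer(dst)
    for line in src:
        # Keep only lines that carry a title, a brand and categories
        if ("'title':" in line) and ("'brand':" in line) and ("'categories':" in line):
            product = ast.literal_eval(line)
            writer.writerow([product.get('title', ''),
                             product.get('brand', ''),
                             product.get('description', ''),
                             ' / '.join(product['categories'][0])])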
Now to the next file to run: python normalize.py products.csv, which normalizes the product data. As seen below, the script lower-cases all words and replaces \n with a space, so the file format is normalized. The output is products.normalized.csv, which is given in turn to the next script. Here is products.normalized.csv:
purple sequin tiny dancer tutu ballet dance fairy princess costume accessory,big dreams,,"clothing, shoes & jewelry / girls / clothing, shoes & jewelry
adult ballet tutu cheetah pink,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts
girls ballet tutu neon pink,unknown,high quality 3 layer ballet tutu. 12 inches in length,sports & outdoors / other sports / dance
adult ballet tutu yellow,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts
girls ballet tutu zebra hot pink,coxlures,tutu,sports & outdoors / other sports / dance
adult ballet tutu purple,bububibi,,sports & outdoors / other sports / dance / clothing / girls / skirts
Next script to be run is: python trim.py products.normalized.csv. This script removes any unknown brands:
if brand == 'unknown' or brand == '' or brand == 'generic':
trimmed += 1
So we are left only with known brands.
Next script to run is: python supplement.py products.normalized.trimmed.csv. This script prepends the brand name to the title and prepends the title to the description, so now all titles have the brand name inside them, see below:
if not (brand in title):
supplemented += 1
title = brand + ' ' + title
description = title + ' ' + description
Next script to run is: python tag.py products.normalized.trimmed.supplemented.csv: it adds tags in the standard IOB scheme (in the spirit of POS, Part Of Speech, tagging), for example tagging += 'B-B ' (Begin Brand), tagging += 'I-B ' (In Brand) and tagging += 'O ' (Other, no brand).
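To illustrate the tagging, a toy version of this step could look like this (my own simplification, assuming the brand appears verbatim in the title after the supplement step):

def tag_title(title: str, brand: str):
    # Toy IOB tagging: brand tokens get B-B / I-B, every other token gets O
    title_tokens = title.lower().split()
    brand_tokens = brand.lower().split()
    tags = []
    for i in range(len(title_tokens)):
        if brand_tokens and title_tokens[i:i + len(brand_tokens)] == brand_tokens and len(tags) <= i:
            tags.extend(['B-B'] + ['I-B'] * (len(brand_tokens) - 1))
        elif len(tags) <= i:
            tags.append('O')
    return list(zip(title_tokens, tags))

# tag_title('bububibi adult ballet tutu yellow', 'bububibi')
# -> [('bububibi', 'B-B'), ('adult', 'O'), ('ballet', 'O'), ('tutu', 'O'), ('yellow', 'O')]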
These are the training scripts to run:
mkdir -p ./models/
python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv
python train_ner.py data/products.normalized.trimmed.supplemented.tagged.csv
Let’s see what they do one by one. First: python train_tokenizer.py data/products.normalized.trimmed.supplemented.tagged.csv:
from tokenizer import WordTokenizer

# Tokenize texts
tokenizer = WordTokenizer()
tokenizer.train(texts)
Well, it’s calling .train(texts). According to the documentation, .train:
Takes a list of texts, fits a tokenizer to them, and creates the embedding matrix.
What is an embedding? Let’s google it:
Word embedding is an improvement over the traditional bag-of-words encoding, where large sparse vectors were used to represent each word; in a word embedding, the position of a word within the vector space is learned from text. Examples: Word2Vec, GloVe.
Therefore the tokenizer creates an embedding matrix, so the output of the tokenizer is a vector space containing a representation of the words in our products.
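As an illustration of what building that embedding matrix typically involves (my own sketch with Keras and GloVe, not the repo's tokenizer.py; the GloVe file path and the 100-dimension size are assumptions):

import numpy as np
from keras.preprocessing.text import Tokenizer

texts = ["adult ballet tutu cheetah pink", "girls ballet tutu neon pink"]

# Fit a word index over the product texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Load pre-trained GloVe vectors: one word followed by its vector per line
embeddings = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# One row per word in our vocabulary, filled with its GloVe vector when available
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, 100))
for word, i in tokenizer.word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]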
On to the next script: python train_classifier.py data/products.normalized.trimmed.supplemented.tagged.csv. This script:
trains a product category classifier based on product titles and descriptions
So here we want to extract the product category! It’s utilizing classifier.py, which in turn:
- Takes as input: data (np.array): 2D array representing descriptions of the product and/or product title
- And as its output: list(dict(str, float)): List of dictionaries of product categories with associated confidence
How does it do it? It trains a model; after all, we have labels (the categories in our data), so we can train a model.
@startuml
left to right direction
title Train Product Labels Classifier
[Product Reviews with Categories] as CSV
[Labels] as LB
[Products] as PD
[GloVe] as GL
[Word Embedding] as WE
[Network] as NW
[models/classifier.h5] as CP
CSV --> LB : Extract
CSV --> PD : Extract
PD --> WE : Compile Network
LB --> NW : Train
WE --> NW : Train
GL --> NW : Train
NW --> CP : Predict
@enduml
The output is the model created at models/classifier.h5, and it prints the evaluation summary shown further below (precision/recall estimated via cross-validation):
In code it looks as follows: preds = Dense(len(self.category_map), activation='softmax')(x). The activation of the model’s output layer (so I read, not that I fully get what it means yet) is softmax; from what I read, softmax is the activation function used in the output layer when we have multiple classes to predict.
Other possible output activations:
- linear - linear regression
- sigmoid - binary classification
- softmax - (the one we use) multi-class classification, which is indeed our problem
Then it compiles the model and it’s using following loss function:
self.model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
As we can both read, the loss function is 'categorical_crossentropy'. I have no idea exactly which function that is, but this is the loss function it uses, and the optimization algorithm is rmsprop; an alternative optimization algorithm could be sgd, Stochastic Gradient Descent, but this time we will go with rmsprop, which according to the documentation: "Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. This is the mini-batch version of just using the sign of the gradient."
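To see how these pieces (softmax output, categorical crossentropy, rmsprop) fit together, here is a minimal standalone Keras classifier in the same spirit; the layer sizes and shapes are placeholders of mine, not the repo's actual network:

from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

num_categories = 40   # placeholder: number of product categories
vocab_size = 20000    # placeholder: vocabulary size
max_len = 100         # placeholder: max tokens per description

model = Sequential([
    Embedding(vocab_size, 100, input_length=max_len),   # word embeddings
    GlobalAveragePooling1D(),                            # average the word vectors per doc
    Dense(64, activation='relu'),
    Dense(num_categories, activation='softmax'),         # one probability per category
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
# model.fit(X_train, y_train_one_hot, epochs=5, validation_split=0.1)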
# Train a product category classifier based on product titles and descriptions
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
precision recall f1-score support
clothing, shoes & jewelry 0.768944 0.683034 0.723448 7250
sports & outdoors 0.697127 0.700144 0.698632 18022
toys & games 0.744507 0.877790 0.805673 21193
movies & tv 0.863326 0.819637 0.840914 2312
baby 0.556271 0.666802 0.606542 2461
tools & home improvement 0.772414 0.678099 0.722190 17698
automotive 0.871059 0.887794 0.879347 26389
home & kitchen 0.727050 0.802991 0.763136 16649
arts, crafts & sewing 0.769580 0.631638 0.693819 5367
office products 0.678700 0.756802 0.715626 7204
books 0.000000 0.000000 0.000000 21
office & school supplies 0.000000 0.000000 0.000000 109
electronics 0.752167 0.875671 0.809234 13971
computers 0.000000 0.000000 0.000000 31
cell phones & accessories 0.910150 0.808887 0.856536 2993
pet supplies 0.891313 0.773756 0.828384 5967
health & personal care 0.708116 0.680906 0.694244 15146
cds & vinyl 0.726473 0.795404 0.759377 1349
musical instruments 0.866925 0.762178 0.811184 4701
software 0.000000 0.000000 0.000000 37
industrial & scientific 0.441718 0.031115 0.058135 2314
all beauty 0.000000 0.000000 0.000000 259
video games 0.000000 0.000000 0.000000 63
beauty 0.817036 0.910148 0.861082 14101
patio, lawn & garden 0.782244 0.611744 0.686567 5790
grocery & gourmet food 0.873358 0.879315 0.876327 7184
all electronics 0.000000 0.000000 0.000000 79
baby products 0.594203 0.093394 0.161417 439
kitchen & dining 0.000000 0.000000 0.000000 96
car electronics 0.000000 0.000000 0.000000 11
digital music 0.000000 0.000000 0.000000 111
home improvement 0.000000 0.000000 0.000000 117
amazon fashion 0.546512 0.129121 0.208889 364
appliances 0.000000 0.000000 0.000000 16
camera & photo 0.000000 0.000000 0.000000 3
purchase circles 0.000000 0.000000 0.000000 12
gps & navigation 0.000000 0.000000 0.000000 15
mp3 players & accessories 0.000000 0.000000 0.000000 23
collectibles & fine art 0.000000 0.000000 0.000000 103
luxury beauty 0.000000 0.000000 0.000000 12
furniture & dcor 0.000000 0.000000 0.000000 17
0.000000 0.000000 0.000000 1
avg / total 0.766003 0.772215 0.763889 200000
real 326m7.851s
user 475m9.852s
sys 25m13.631s
With no syntactic structure in product titles, this is a challenging problem. The next paper, on how Walmart handles attribute extraction from product titles in eCommerce, concentrates on brand NER extraction.
Vocabulary
Item | Description |
---|---|
Product | any commodity which may be sold by a retailer, e.g. iPhone. |
Attribute | a feature that describes a specific property of a product or a product listing, e.g. color, brand. |
Attribute Value | a particular value assumed by the attribute. For example, for the product title below: |
Example: Apple iPad Mini 3 16GB Wi-Fi Refurbished, Gold
Attribute Name | Attribute Value |
---|---|
Brand | Apple |
Product | iPad Mini 3 |
Color | Gold |
RAM | 16GB |
Condition | Refurbished |
Getting both these attribute names and values automatically, without rules, from free-text product titles is challenging.
The common use case which is described in this paper is:
- User searches for t-shirt
- User filters by color red (checkbox/facet)
- Results should contain only red t-shirts; note that the filtering is on the unstructured title/description.
The following challenges are presented by the paper:
- Lack of syntactic structure
– Chihuahua Bella Decorative Pillow by Manual Woodworkers and Weavers - SLCBCH – Real Deal Memorabilia BCosbyAlbumMF Bill Cos
…
Due to the diversity of products sold in any leading eCommerce site, product titles do not follow any specific composition
…
different products may contain slightly varying spellings of the same brand
…
Some titles may contain abbreviations of brand names
…
Brand names in titles may contain typographical errors
…
generic or unbranded products.
…
There are categories of products for which brand name is not an important attribute.
…
The list of brand names relevant to a given product catalog is constantly changing
…
Collecting expert feedback either for the purposes of generating training data or validating model generated labels is subject to inter-annotator disagreement
You get the idea.
The paper continues and describes other approaches such as:
prepare a curated lexicon of attribute values and given a product title, scan it to find a value from the list
Alas:
- The curated list needs to be constantly updated
- For certain attributes the number of values of a single attribute is on the order of the number of products (e.g. part numbers)
- An attribute value may appear in multiple forms, so the curated list needs to keep track of all variations
- Multiple matches: the system needs to decide which value to choose
Ineffective: at the scale of a retail catalog with millions of products, attribute values need to be standardized and expert intervention is needed.
Rule-based systems have had success with texts that have grammatical structure. However:
product titles do not conform to a syntactical structure or grammar unlike news articles or prose
So maybe apply rule-based extraction to the product description and not only the title? But what if the description refers to competitors?
Creating and maintaining rules for hundreds or thousands of attributes is challenging. Smells like machine learning models are needed.
One option is classification with naive Bayes, SVM, or logistic regression. According to the paper, these methods can be suitable when the number of classes is known and small. It adds the following:
In contrast, when the number of classes is in tens of thousands, we will need a lot labeled training data and the model footprint will also be large. However, the main drawback with these models for attributes like brand and manufacturer part number is that they can only predict classes on which they are trained. Thus, in order to predict new brand values, the training data will need to be constantly updated with labeled data corresponding to new brands. In the case of manufacturer part number, this approach is essentially worthless since every new product will likely have an unseen part number
The paper then moves on to the way it’s going to extract the features and values of products; this falls under the category of “Sequence Labeling Approaches”. While we are talking about sequences, a quick google search for what “sequence labeling” means yields the following informative description:
Often we deal with sets in applied machine learning such as a train or test sets of samples.
Each sample in the set can be thought of as an observation from the domain.
In a set, the order of the observations is not important.
A sequence is different. The sequence imposes an explicit order on the observations.
The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.
And according to “Sequence Learning: From Recognition and Prediction to Sequential Decision Making, 2001”:
Sequence prediction attempts to predict elements of a sequence on the basis of the preceding elements
For example, given a sequence of previous days’ temperatures, predict the following day’s temperature.
Note also that sequence generation can produce novel sequences, for example generating music!
They then give an example of a feature function. A feature function assigns a value for word x and label sequence y at index i (we’re not at product types yet), for example for POS (Part Of Speech) tagging. Here is the example function. We have a labeled sequence: for each word x_i we have a label y_i, and we want a feature function.
\begin{equation}
f(x, y, i) =
\begin{cases}
1 & \text{if } x_i = \text{the} \ \text{and} \ y_i = \text{DT} \\
0 & \text{otherwise}
\end{cases}
\end{equation}
Meaning: for the word at index i, if the word x_i is “the” and its label y_i is DT (the determiner POS tag), the output of the feature function is 1, otherwise it is 0, for each word.
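In code, that indicator feature function is just a couple of lines (a direct transcription of the formula, using token and tag lists):

def feature(x, y, i):
    # Fires (returns 1) only when token i is "the" and its label is the determiner tag DT
    return 1 if x[i].lower() == 'the' and y[i] == 'DT' else 0

# feature(['the', 'wife', 'of'], ['DT', 'NN', 'IN'], 0)  # -> 1
# feature(['the', 'wife', 'of'], ['DT', 'NN', 'IN'], 1)  # -> 0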
Creating the training set
To create the training set, the paper mentions that instead of manually labeling, they created a set of regular expressions which matched exact brand names; this also limited the noise because they didn’t catch errors (at least they think so). They also added product titles which did not have any brand name, so that the training set contains examples without any brands as well.
Their function is currently: output_labels = learning_algorithm(product-title-x): Seq[(Token, Label)]. Meaning, if they apply their learning algorithm they get back the sequence of tokens in the product title, and for each token the learning algorithm has assigned a label. Now they need to transform this labeling into a candidate brand name: toBrand(Seq[(Token, Label)]): BrandName, and they do this, not surprisingly, by looking for the “Brand” label in the tagged tokens.
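A toy sketch of both steps, regex-based labeling of titles to build training data and pulling the brand back out of a predicted (token, label) sequence (my own illustration, not the paper's code; the brand lexicon is made up):

import re

known_brands = ['apple', 'bububibi', 'coxlures']  # illustrative brand lexicon

def label_title(title):
    # Produce (token, label) pairs by exact-matching known brand names
    tokens = title.lower().split()
    return [(t, 'Brand' if any(re.fullmatch(re.escape(b), t) for b in known_brands) else 'O')
            for t in tokens]

def to_brand(labeled_tokens):
    # Candidate brand name = the tokens labeled as 'Brand'
    return ' '.join(t for t, label in labeled_tokens if label == 'Brand')

# to_brand(label_title('bububibi adult ballet tutu yellow'))  # -> 'bububibi'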
While googling some more I’ve noticed there is another approach to text summarization called “Sentence Compression”; this approach is more compelling to me because, from all the search results I get, it looks like a fully automatic process (except for training).
Note that although we have text summarization, there is another important topic called Sentence Compression: in this case we take a rather short text and compress it, deleting unneeded words.
Sentence compression is a paraphrasing task where the goal is to generate sentences shorter than given while preserving the essential content
Sentence compression is a standard NLP task where the goal is to generate a shorter paraphrase of a sentence. Dozens of systems have been introduced in the past two decades and most of them are deletion-based: generated compressions are token subsequences of the input sentences (Jing, 2000; Knight & Marcu, 2000; McDonald, 2006; Clarke & Lapata, 2008; Berg-Kirkpatrick et al., 2011, to name a few).
References:
- Overcoming the Lack of Parallel Data in Sentence Compression
- Sentence Compression by Deletion with LSTMs
We have seen that there are existing methods, GitHub repositories, and papers for summarizing text, for sentence compression, for identifying the topic based on product title and description, and for producing summaries based on NER. The future looks both interesting and promising, but also very difficult.