A simple algorithm for clustering web pages. A wrapper around KMeans. Web pages are converted to vectors, where each vector entry is just the count of a given tag and class attribute. The dimension of the vectors will change as new pages with new tags or class attributes arrive. Also a simple outlier detection is available and enabled by default. This allows for rejecting web pages that are highly improbable to belong to any cluster.
pip install page_clustering
import page_clustering
clt = page_clustering.OnlineKMeans(n_clusters=5)
# `pages` must have been obtained somehow
for page in pages:
clt.add_page(page)
y = clt.classify(new_page)
for page in more_pages:
clt.add_page(page)
y = clt.classify(yet_another_page)
wget -r --quota=5M https://news.ycombinator.com
python demo.py news.ycombinator.com
cd tests
py.test
The first part, vectorization, transforms the web page to a vector. For example, take the following page:
<html>
<body>
<ul class="list1">
<li>A</li>
<li>B</li>
</ul>
<ul class="list2">
<li>Y</li>
<li>Z</li>
</ul>
</body>
</html>
Each non-closing (tag, class) pair is mapped to a vector position and the number of times it appears in the document is the value of the vector at that position.
tag, class | position | count |
---|---|---|
html | 0 | 1 |
body | 1 | 1 |
ul, list1 | 2 | 1 |
li | 3 | 4 |
ul, list2 | 4 | 1 |
The vector is therefore [1, 1, 1, 4, 1]
. This vector is normalized so that
it's elements sum up to 1 and the final frequency vector is:
[0.125, 0.125, 0.125, 0.5, 0.125]
When a new page arrives it can be possible that new (tag, class) pairs appear. For example imagine that this new page arrives:
<html>
<body>
<p>Another page with a paragraph tag </p>
</body>
</html>
The new page would be mapped according to this table:
tag, class | position | count |
---|---|---|
html | 0 | 1 |
body | 1 | 1 |
ul, list1 | 2 | 0 |
li | 3 | 0 |
ul, list2 | 4 | 0 |
p | 5 | 1 |
The vector for this page would be [1, 1, 0, 0, 0, 1]
, and with normalization:
[0.33, 0.33, 0, 0, 0, 0.33]
.
The new vector has 6 dimensions, this means that the previous page vector needs
to be extended accordingly with zeros to the right: [0.125, 0.125, 0.125, 0.5, 0.125, 0]
.
Once all needed pages are vectorized, KMeans is applied.