Visualizing the Hacker News frontpage
The source code of this experiment is available on GitHub.
Scraping the frontpage with Python
First of all, we need data before we can visualize anything. Python is really useful for scraping web content because it offers a lot of libraries that do this job well.
The most obvious first step is to download the frontpage of Hacker News. We use the requests library for this task.
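Something along these lines does the job (a minimal sketch; the exact error handling is an assumption):

```python
import requests

# Fetch the current frontpage; raise an error if the request failed.
r = requests.get('https://news.ycombinator.com/')
r.raise_for_status()
```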
If all goes well, `r` holds the downloaded data.
Next, we fire up BeautifulSoup with the lxml backend to parse the HTML.
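A minimal sketch of this step (the CSS selector is an assumption based on Hacker News's markup at the time):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')
# Every story row carries the "athing" class (selector assumed from HN's markup).
headlines = soup.select('tr.athing')
```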
As a result, `headlines` is a list of elements, each containing one headline together with its surrounding data (comments, rank). We now have to pick apart the different data components of each element. For example, to get the headline title and the current rank, we extract the text from `a.storylink` and `span.rank`.
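Extracting both pieces might look like this (a sketch; the selectors match HN's markup of that era):

```python
for headline in headlines:
    # The rank is rendered as "1.", "2.", ..., so the trailing dot is stripped.
    title = headline.select_one('a.storylink').get_text()
    rank = int(headline.select_one('span.rank').get_text().rstrip('.'))
```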
As the script evolves over time and multiple test runs, more and more edge cases appear that need to be handled, for example the distinction between the strings "1 comment" and "x comments".
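A sketch of such a case distinction (the subtext markup and the helper name are assumptions):

```python
def parse_comment_count(subtext):
    # The last link in a story's subtext row reads "discuss",
    # "1 comment" or "x comments" (markup assumed).
    text = subtext.select('a')[-1].get_text()
    if 'comment' not in text:
        return 0  # "discuss" means no comments yet
    return int(text.split()[0])
```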
After the extraction is done, the script dumps all collected data into a well-formatted JSON file; each run produces one snapshot. You can use cron or watch to run the script regularly.
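The dump itself can be as simple as this (the file naming is an assumption):

```python
import json
import time

# `stories` is the list of per-headline dicts built above (name assumed).
# One snapshot file per run, named by its Unix timestamp.
with open('json/{}.json'.format(int(time.time())), 'w') as f:
    json.dump(stories, f, indent=2)
```

A crontab entry like `*/2 * * * * python3 /path/to/scrape.py` (paths hypothetical) would match the roughly two-minute interval of the sample data mentioned at the end.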
Rendering images from the snapshots
Part 2 of this experiment is visualizing the collected JSON snapshots with R. I decided to use R because I wanted to learn more about it; this script is my first experience with the language.
Skipping over reading in the JSON files and turning them into data frames, let's look at the plotting part.
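The plotting call boils down to something like this (column names and exact colors are assumptions; see the breakdown below):

```r
library(ggplot2)

# `snapshot` is the data frame for one JSON file (column names assumed).
p <- ggplot(snapshot, aes(x = points, y = comments)) +
  geom_label(aes(label = title, fill = rank), colour = 'white', size = 3) +
  scale_x_continuous(trans = 'log10') +
  scale_y_continuous(trans = 'log10') +
  # rank 1 sits at the top of the page, so the low end of the scale is red
  scale_fill_gradient(low = 'red', high = 'green') +
  ggtitle(timestamp) +  # timestamp of the snapshot (assumed variable)
  xlab('Points') +
  ylab('Comments') +
  theme_bw(base_size = 16)
```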
In detail:

- `ggtitle` sets the title
- `theme_bw` sets the theme of the graph, including the font size
- `xlab`/`ylab` add labels to the axes
- `scale_x_continuous` and `scale_y_continuous` switch both axes to logarithmic scales
- `scale_fill_gradient` colors each headline on a gradient from green (low rank) to red (high rank)
- `geom_label` formats the labels of all headlines
I decided to use logarithmic scales because some headlines have a lot of comments or points; with linear scaling they would push all the other headlines into the bottom-left corner. The gradient fill of the headline labels is very useful for identifying hot topics.
Sometimes a new headline appears already colored red, meaning it went from nowhere to rank 1 in a very short period of time. The graphs show various other things as well, e.g. you can observe when there is high activity and when Americans sleep (the timestamps are UTC+1, by the way).
All surrounding code handles the file input and output. The script consumes all JSON files that were collected with the Python script above and skips already rendered PNGs. The image size can be changed by altering `png(file.path('./pngs', filename.target), height = 720, width = 1280)`. But be aware that rendering thousands of images at 1920x1080 takes a lot of computing power.
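That surrounding loop might look roughly like this (file layout is an assumption, and `plot_snapshot` is a hypothetical helper wrapping the ggplot code above):

```r
library(jsonlite)

# Render every snapshot that does not have a PNG yet.
for (filename.source in list.files('./json', pattern = '\\.json$')) {
  filename.target <- sub('json$', 'png', filename.source)
  if (file.exists(file.path('./pngs', filename.target))) next  # already rendered

  snapshot <- fromJSON(file.path('./json', filename.source))
  png(file.path('./pngs', filename.target), height = 720, width = 1280)
  print(plot_snapshot(snapshot))  # hypothetical helper building the ggplot object
  dev.off()
}
```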
Gluing all the PNGs together
Now, at last, we stitch the resulting PNG files together. For this we use FFmpeg:
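A command along these lines does the job (framerate, codec, and filenames are assumptions; roughly 14 fps matches the hour-long sample mentioned below):

```sh
ffmpeg -framerate 14 -pattern_type glob -i './pngs/*.png' -c:v libx264 -pix_fmt yuv420p hn.mp4
```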
The result might look like the following video:
You can download a sample of data collected from 2017-04-11 to 2017-06-20 at an interval of around two minutes. Rendering this collection of around 50,000 files with FFmpeg results in a video of one full hour with a size of about 1100 MB.