Visualizing the Hacker News frontpage
The source code of this experiment is available on GitHub.
Scraping the frontpage with Python
First of all, we need data before we can visualize anything. Python is really useful for scraping web content because it offers a lot of libraries that do this job well.
The most obvious first step is to download the frontpage of Hacker News. We use the requests library for this task.
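Something along these lines does the job (a minimal sketch; the exact error handling is an assumption):

```python
import requests

# Fetch the current frontpage; raise an error if the request failed.
r = requests.get('https://news.ycombinator.com/')
r.raise_for_status()
```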
If all goes well, `r` holds the downloaded data.
Next, we fire up BeautifulSoup with the lxml backend to parse the HTML.
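A minimal sketch of this step (the CSS selector is an assumption based on Hacker News's markup at the time):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')
# Every story row carries the "athing" class (selector assumed from HN's markup).
headlines = soup.select('tr.athing')
```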
As a result, `headlines` is a list of elements, each containing one headline together with its surrounding data (comments, rank). We now have to pick apart the different data components of each element. For example, to get the headline title and the current rank, we extract the text from `a.storylink` and `span.rank`.
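Extracting both pieces might look like this (a sketch; the selectors match HN's markup of that era):

```python
for headline in headlines:
    # The rank is rendered as "1.", "2.", ..., so the trailing dot is stripped.
    title = headline.select_one('a.storylink').get_text()
    rank = int(headline.select_one('span.rank').get_text().rstrip('.'))
```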
As the script evolves over time and multiple test runs, more and more edge cases appear that need to be handled, for example the distinction between the strings "1 comment" and "x comments".
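A sketch of such a case distinction (the subtext markup and the helper name are assumptions):

```python
def parse_comment_count(subtext):
    # The last link in a story's subtext row reads "discuss",
    # "1 comment" or "x comments" (markup assumed).
    text = subtext.select('a')[-1].get_text()
    if 'comment' not in text:
        return 0  # "discuss" means no comments yet
    return int(text.split()[0])
```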
After the extraction is done, the script dumps all collected data into a well-formatted JSON file; each run produces one snapshot. You can use cron or watch to run the script regularly.
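The dump itself can be as simple as this (the file naming is an assumption):

```python
import json
import time

# `stories` is the list of per-headline dicts built above (name assumed).
# One snapshot file per run, named by its Unix timestamp.
with open('json/{}.json'.format(int(time.time())), 'w') as f:
    json.dump(stories, f, indent=2)
```

A crontab entry like `*/2 * * * * python3 /path/to/scrape.py` (paths hypothetical) would match the roughly two-minute interval of the sample data mentioned at the end.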
Rendering images from the snapshots
Part 2 of this experiment is visualizing the collected JSON snapshots with R. I decided to use R because I wanted to learn more about it; this script is my first experience with the language.
Skipping over reading in the JSON files and turning them into data frames, let's look at the plotting part.
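The plotting call boils down to something like this (column names and exact colors are assumptions; see the breakdown below):

```r
library(ggplot2)

# `snapshot` is the data frame for one JSON file (column names assumed).
p <- ggplot(snapshot, aes(x = points, y = comments)) +
  geom_label(aes(label = title, fill = rank), colour = 'white', size = 3) +
  scale_x_continuous(trans = 'log10') +
  scale_y_continuous(trans = 'log10') +
  # rank 1 sits at the top of the page, so the low end of the scale is red
  scale_fill_gradient(low = 'red', high = 'green') +
  ggtitle(timestamp) +  # timestamp of the snapshot (assumed variable)
  xlab('Points') +
  ylab('Comments') +
  theme_bw(base_size = 16)
```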
In detail:

- `ggtitle` sets the title
- `theme_bw` sets the theme of the graph, including the font size
- `xlab`/`ylab` add labels to the axes
- `scale_x_continuous` and `scale_y_continuous` switch both axes to logarithmic scales
- `scale_fill_gradient` colors each headline on a gradient from green (low rank) to red (high rank)
- `geom_label` formats the labels of all headlines
I decided to use logarithmic scales because some headlines have a lot of comments or points; with linear scaling they would push all the other headlines into the bottom-left corner. The gradient fill of the headline labels is very useful for identifying hot topics.
Sometimes a new headline appears already colored red, meaning it went from nowhere to rank 1 in a very short period of time. The graphs show various other things as well, e.g. you can observe when there is high activity and when Americans sleep (the timestamps are UTC+1, by the way).
All surrounding code handles the file input and output. The script consumes all JSON files that were collected with the Python script above and skips already rendered PNGs. The image size can be changed by altering `png(file.path('./pngs', filename.target), height = 720, width = 1280)`. But be aware that rendering thousands of images at 1920x1080 takes a lot of computing power.
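That surrounding loop might look roughly like this (file layout is an assumption, and `plot_snapshot` is a hypothetical helper wrapping the ggplot code above):

```r
library(jsonlite)

# Render every snapshot that does not have a PNG yet.
for (filename.source in list.files('./json', pattern = '\\.json$')) {
  filename.target <- sub('json$', 'png', filename.source)
  if (file.exists(file.path('./pngs', filename.target))) next  # already rendered

  snapshot <- fromJSON(file.path('./json', filename.source))
  png(file.path('./pngs', filename.target), height = 720, width = 1280)
  print(plot_snapshot(snapshot))  # hypothetical helper building the ggplot object
  dev.off()
}
```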
Gluing all the PNGs together
Now, at last, we stitch the resulting PNG files together. For this we use FFmpeg:
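A command along these lines does the job (framerate, codec, and filenames are assumptions; roughly 14 fps matches the hour-long sample mentioned below):

```sh
ffmpeg -framerate 14 -pattern_type glob -i './pngs/*.png' -c:v libx264 -pix_fmt yuv420p hn.mp4
```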
The result might look like the following video:
You can download a sample of data collected from 2017-04-11 to 2017-06-20 at an interval of around two minutes. Rendering this collection of around 50,000 files with FFmpeg results in a video of one full hour with a size of about 1100 MB.