In this blog post we'll explore some interesting implementation aspects of the analysis tool used in "The underlying problem of the Fediverse and other decentralised platforms". You could see it as the technical counterpart of the bundle. First, I'll recapitulate some points of the 2019 version. Then the 2020 version serves as the reference for the three highlighted topics: slicing, caching and threading. After reading this article you also might be able to speed up your Python apps (here: from around six minutes to four seconds). The source code of this tool is available in the bitkeks/fediverse-infra-analysis Github repository.
Some words about the methodologies applied during the analysis created last year. The graphs in 2019 were created with Libre Office Calc, based on an instances dump from instances.social. For each instance, the hostname was resolved to the target IP addresses and then a
whois call was issued for these IPs. The cumulated data was then put together in a CSV file, ready to be examined with Calc.
whois not just returns the name of a hosting provider, but the name of an allocated network address range, mapping from one to the other has to be done. In the 2019 version, this was implemented by utilizing a
dict, as shown below. Please note that querying the WHOIS databases might be subject to fair use policies. I solved this by reducing the queries with a cache file, in which successful WHOIS queries were stored and not sent again (this mechanism was improved in the 2020 version, as you'll see later).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Easy task! Collect the
whois output, split at each newline and then look for the given keys in each line. For example, the line
netname: OVH-CLOUD-LIM would be matched, and the
ovh would match in line 18. The value
ovh would then be returned and used as the key in
counters, adding this instance to the list of instances hosted at this provider.
For the 2020 version, I created a new Python tool which has a lot more features than the last one and uses
matplotlib for plots. The graph below shows the processing pipeline with file inputs and task chains.
It begins on the left, processing input files for IP-to-AS mappings and for the instances data. Both tasks use cache files, which we will explore in detail later. As the next step, a clean up job is executed for all cache files, creating a clean base for the workers to run. Then the pool of worker threads is started. Each worker has two assignments: get the IP address of the instance and then map this IP address to an AS. The topic of threading will be picked up in part three. When the workers finish, their results are collected. Lastly, graphs and exports are created.
In 2020, IP-to-AS mapping (WHOIS) was vastly improved from the previous approach. When you're just hacking along, putting together some ideas it might be okay to be lazy and use
whois to gather the information you need. But hey, performance-wise this is pretty inefficient. Looking for an alternative source for these mappings I stumbled upon iptoasn.com. A free IP address to ASN database, downloadable, easy to parse, updated hourly was exactly what I was looking for. We now have a list of all AS with their name, IP range and ASN. The following code handles the new input.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
You might wonder what
current_slice does exactly. It is one of the many performance tuners built into the tool. In the first version of this implementation, each AS was placed as a top-level entry in the
ip_networks dict. Simple, easy, what could go wrong. But this data structure quickly reaches its limits when it comes to search. Later in the process, IP addresses have to be mapped to their AS, based on the data in
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Compare this implementation to the first version in which all networks are iterated and converted on the fly:
1 2 3 4 5
The IPv4 data set has 418362 entries. Due to the implementation running through all the entries, this means 418362 iterations. Each iteration executes two IP conversions with
ipaddress.ip_address() for the comparison, which consumes a lot of performance. In the new version there are only 419 keys in the
ip_networks dict, with 1000 entries each, so only a maximum of 1419 iterations are needed. Creating
ipaddress objects still happens, but they are only applied to a specific subset of the whole data set (additionally using a cache as well).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
This method of splitting a huge batch of items into smaller chunks I call slicing. In Python, slices are applied to lists to get only a subset of its items, e.g.
list_with_items[1:5] returns items 2-5 from the list. In data structure theory we'd now enter the realm of trees and partitions. More on that at Wikipedia. Since we already touched caching, let's continue with this topic!
Caching is an extremly efficient way to persist results of a process for later re-use. Imagine you read an input file and you apply a parsing task on all the entries in this file, resulting in some output like a dict. Your script exits and you wish to re-run it. Should it need to do everything again, if the input has not changed? No, of course not. This is where caching comes in.
In my tool there are five caches: IP-to-ASN input file parsing, IP resolving (one for IP cache and one for hostnames which have no IP address), IP-to-AS mapping and the
ip_networks conversion as shown above. The first four caches are persisted to disk and read on subsequent runs of the program. In the code snippet below, the first code example in this post (
asnfile_init) is extended with caching functionality.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
The function receives the file name as parameter, creates a hash for this file and uses the resulting hash value as an identifier. If some previous run has already done the work of parsing the file, read the cache and return it directly. Else the code shown in the first code block is continued with.
Another example is the resolution of hostnames to IP addresses. Sure enough, your DNS server already caches requests, but we can save even more time by not even contacting the DNS server. Note that DNS entries have their own lifetime (time to live, TTL), which we ignore in this implementation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Some times it happened that the tool was run ten times a minute for small adjustments. What would you think how much time it saves to not iterate over the whole list of instances again, to resolve their IP addresses which surely did not change in the last minute?
And as a last example, the clean up job.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Every cache entry has a
timestamp field in addition to its value. At the start all entries are iterated through and each timestamp is evaluated, searching for expired entries. Expired? Removed from the dict! The dict is held in memory and used during the data processing. When the program finishes the dict is persisted, dropping the expired items which were removed in
Talking about IP address resolution, we will now come to the third and last part of this exercise: threading.
In the previous section the IP address resolution function was shown as an example. In it
socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP) has the vital mission to fetch all available IP addresses that are stored for the corresponding hostname, for IPv4 as well as IPv6. What do you think might go wrong here?
Maybe you have guessed it already - it's the request timeout applied by your OS to
getaddrinfo which can give us a hard time. So what easily goes wrong is plain simple: a DNS request can take a long time (in some configurations up to 60 seconds) to fail. These delays accumulate to an insane amount of waiting time which is wasted because other hostnames could be resolved in the meantime. And this is where asynchronous workers come into play.
Please note that in Python threading is a limited endeavor. The so called Global Interpreter Lock (GIL) prevents a running Python program from utilising the full functionality of threading, namely parallel execution. If you need full speed, use
multiprocessing. If you prefer
asyncio. I opted for
threading because it's very easy to implement with
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
Let's go through this step-by-step. In lines 1-17 the
worker function is described. Each worker has the task to a) get the IP address and b) get the corresponding AS for up to ten hostnames. From line 19 the main program runs. Define a
NUM_WORKERS threads (for example
NUM_WORKERS = 4), iterate over all instances (line 27) and fill batches of ten hostnames to process in a worker (line 38-40). The last batch might not reach an amount of ten items, so it's handled extra (lines 44-46). Now that all instance hostnames are distributed, wait for all workers to finish (
pool.join() in line 49). As a last step of threading, the returned lists of
WorkerResult objects are unpacked and concatenated (line 52).
Easy right? Just define the
ThreadPool, run a function via
pool.apply_async and collect the results. In my case eight workers were used and they did a really good job. If from the 4233 entries in my instances data set only 12 were unresolvable, my setup would delay two minutes (ten seconds request timeout, which is low!) in the sequential implementation. With workers, the failing requests still block, but other hostnames are resolved in the meantime, virtually parallelising the request pipeline. In the best case the longest waiting time for the whole operation would be the DNS timeout, with all other requests finishing during the waiting period of the blocking request.
Threading concludes this blog post. In your next Python application, you'll now be able to search faster, re-use processed data and efficiently distribute workloads. For the whole code base, have a look at my Github repository. Thank you for reading and happy performance hacking!