The underlying problem of the Fediverse and other decentralised platforms


ActivityPub and Mastodon brought new momentum to the world of decentralised communication platforms, so much so that I would call the Fediverse a serious alternative to platforms like Twitter. But all the effort made by hundreds of individuals every day – administering servers, developing software and moderating communities – has a weak spot which needs to be addressed in the near future: who has control over the underlying computing infrastructure of the Fediverse? And are users aware of the conditions?

This post focuses on the distribution of Mastodon instances across hosting providers throughout the Fediverse. First, I'll revisit an analysis from May 2019, in which I commented on the then-current state of the Fediverse on Twitter. Later, we'll reopen the case in 2020. How did the distribution change? And what conclusions can we draw from these analyses?

Side note: I took a break from using Mastodon around May 2019, mostly because my instance was migrated to CloudFlare. More on that in my last Toot. Nevertheless, the approach of establishing decentralised platforms based on free software is a fantastic idea! They are viable alternatives to centralised services like Facebook and Twitter. Contribute now! Have a look at my Mastodon-to-Twitter bot for an example (and somewhere there's code that runs a Mastodon-to-XMPP bot...). The Fediverse is still very young and Mastodon has received a tremendous amount of contributions and commits over the last months. Other software projects like Pixelfed and Pleroma gained traction, too. And since they are all compatible with each other thanks to the ActivityPub protocol, they share community and development philosophies.

And for all the data science fans out there, I have published the follow-up post "Python and data: slicing, caching, threading" containing the technical aspects of the analyser tool.

2019

In May 2019 I wrote some Tweets about the current "state of the #fediverse", intending to highlight the problems I see in regard to "privacy and independency on decentralised platforms" vs. the actual condition of ownership and data control Fediverse users have.

State of the #Fediverse: recommending #Mastodon because of #privacy might be misleading. Looking at the top 2937 instances, TWO hosting providers are home to more than half of the 2.032.419 users! We seriously need to rethink what #decentralization means! 1/4 #stateOfTheFediverse

My statement was based on the data analysis of existing instances in the Mastodon Fediverse and their respective hosting platforms. Let's reiterate what the data showed in 2019.

First graph with absolute numbers: how many instances and users are hosted by each provider? OVH, Sakura Internet, Amazon, CloudFlare, DigitalOcean and Hetzner make up 1441 of these instances. Insight: the number of hosted instances does not correlate with hosted users. 2/4

I used a list of Mastodon instances from instances.social as the data source, resulting in 2937 entries. Through a pipeline of commands they were mapped to IP addresses, which in turn were mapped to their hosting providers (autonomous systems, also known as AS, identified by their AS number, or ASN). Each IP address range in use on the Internet is registered to an AS.

For example, the IP for twitter.com (104.244.42.1) is registered as the range 104.244.40.0 - 104.244.47.255 with the net name TWITTER-NETWORK by the organization Twitter Inc. So if an instance with the example name bird.site appears in the list of instances and it has the IP address 104.244.42.1, we would know that Twitter hosts its own instance in the Fediverse.
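The lookup described above can be sketched in Python with the standard library alone. This is a minimal illustration, not the analyser tool itself: the table `PREFIX_TABLE` and the helper names `range_to_networks` and `lookup_net_name` are made up for this example, and a real run would load a complete IP-to-AS database and resolve each instance name to an IP first (for instance with `socket.gethostbyname`).

```python
import ipaddress

# Hypothetical, hand-made prefix table with a single entry taken from the
# example in the text. A real analysis would load a full IP-to-AS data set.
PREFIX_TABLE = [
    (ipaddress.ip_network("104.244.40.0/21"), "TWITTER-NETWORK"),
]


def range_to_networks(first, last):
    """Collapse a registered address range into the CIDR prefixes covering it."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    )


def lookup_net_name(ip, table=PREFIX_TABLE):
    """Return the net name of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(net, name) for net, name in table if addr in net]
    if not matches:
        return None
    # Prefer the longest (most specific) matching prefix.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(range_to_networks("104.244.40.0", "104.244.47.255"))
print(lookup_net_name("104.244.42.1"))
```

With the range from the example, the registered block collapses to the single prefix 104.244.40.0/21, and the address 104.244.42.1 falls inside it.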

Coming back to the data source: instances.social gathers public information from each instance, such as name, description, different counters (users, toots) and more. Details can be found in the Mastodon API docs. This collection of data entries can be combined with other sources, in this case the data about the hosting provider! So, put everything together and the graph below can be produced.
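As a rough sketch of that combination step, the snippet below joins per-instance records with a provider mapping and sums up instances and users per provider. The field names `name` and `users`, the mapping `provider_of` and all sample values are assumptions for illustration; they are neither the real instances.social schema nor real data.

```python
from collections import defaultdict


def aggregate_by_provider(instances, provider_of):
    """Sum instance and user counts per hosting provider.

    instances:   list of dicts with (assumed) keys "name" and "users"
    provider_of: dict mapping instance name -> provider name
    """
    totals = defaultdict(lambda: {"instances": 0, "users": 0})
    for inst in instances:
        # Instances without a known provider end up in an "unknown" group.
        provider = provider_of.get(inst["name"], "unknown")
        totals[provider]["instances"] += 1
        totals[provider]["users"] += inst["users"]
    return dict(totals)


# Invented sample data, only to show the shape of the result.
sample_instances = [
    {"name": "a.example", "users": 120},
    {"name": "b.example", "users": 5},
    {"name": "c.example", "users": 4000},
]
sample_providers = {
    "a.example": "ProviderX",
    "b.example": "ProviderX",
    "c.example": "ProviderY",
}

stats = aggregate_by_provider(sample_instances, sample_providers)
print(stats)
```

Even this toy data shows the effect discussed in the post: the provider with the most instances is not necessarily the one with the most users.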

Fediverse chart 2019 1

From left to right, the bars are grouped by provider. The chart begins with OVH, which has the highest number of hosted instances, followed by the others, each with fewer. What can also be observed is that the number of users does not correlate with the number of instances. OVH hosts many small communities, probably due to the fact that masto.host, a Mastodon hosting service, runs its infrastructure at OVH.

Second graph with shares: what's the share of each hoster regarding users and instances? Amazon and CloudFlare take the cake with more than 53% users on their infrastructure. Hetzner, OVH, DigitalOcean and Sakura Internet follow with 27%, filling up 80% of total users. 3/4

Fediverse chart 2019 2

This chart uses the same data as the first one, but sorts the providers by their respective number of users (summed over all hosted instances) and displays each provider's share of the overall total, instead of absolute numbers. Another observation: many users are hosted by only a few providers. This concentration is one very important weakness of the Fediverse (more on that in the Research section).
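The share view can be reproduced with a few lines of Python. The sketch below is a hedged illustration, not the original plotting code: the function name `user_shares` and the sample numbers are invented. It sorts providers by users and reports both the individual share and the running (cumulative) share, which is how statements like "the top two host more than half of all users" are derived.

```python
def user_shares(users_by_provider):
    """Return (provider, share, cumulative_share) rows, sorted by users descending."""
    total = sum(users_by_provider.values())
    if total == 0:
        return []
    ranked = sorted(users_by_provider.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0
    for provider, users in ranked:
        running += users
        rows.append((provider, users / total, running / total))
    return rows


# Invented numbers, only to demonstrate the calculation.
rows = user_shares({"ProviderA": 50, "ProviderB": 30, "ProviderC": 20})
for provider, share, cumulative in rows:
    print(f"{provider}: {share:.0%} of users, {cumulative:.0%} cumulative")
```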

Conclusion: #decentralization cannot be done with software alone. A serious examination of our #infrastructure must follow. Using #privacy as motivation for pulling people from #Twitter onto instances, which are hosted by yet another data-aggregator, is misleading at best! 😕 4/4

A polemic statement, yet I stand by it. I can't stress this enough: your hosting provider has access to all your and your users' data. If an AS like Amazon hosts thousands of users over hundreds of instances, it may look like paradise at the higher level, but it's technically not better than using Twitter if the provider decides to act maliciously with the data. And no, it does not matter if you use TLS or disk encryption. Once the machine runs, the underlying virtualisation stack has full access (write me an email if you want to discuss secure enclaves). I would even go as far as calling it worse than using Twitter, because of the misleading promises the Fediverse claims as its selling points – decentralisation and control over data.

2020

Now that nearly one year has passed since the last analysis, let's have a look at the current state of the Fediverse. As mentioned, for technical aspects of the analyser tool please refer to the follow-up post "Python and data: slicing, caching, threading". We will be looking at bar charts again, but this time absolute numbers and shares are merged into one plot. The charts differ in their sorting: by number of instances and by number of users. Again, instances.social was used as a source, with data exported on March 4, 2020. Instead of dropping small providers on the right side of the chart, they are aggregated in groups (where (2-9) means the group of providers which host 2 to 9 instances). The respective share of each hosting provider is added to the label on the x-axis. I have to admit that the comparison of the charts from 2019 and 2020 is to be taken with a grain of salt, because the IP-to-AS mapping databases differ between the two years. Let's start with the first chart; the explanation follows below.
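The grouping of small providers could look roughly like the following sketch. Only the (2-9) group is named in the text; the other bucket boundaries in `BUCKETS`, the function names and the sample data are assumptions made for illustration. Providers whose instance count falls outside all buckets keep their own bar.

```python
from collections import defaultdict

# Assumed bucket boundaries (inclusive). Only (2-9) is taken from the text.
BUCKETS = [(1, 1), (2, 9), (10, 99)]


def bucket_label(instance_count):
    """Return a label like "(2-9)" for small providers, or None for large ones."""
    for low, high in BUCKETS:
        if low <= instance_count <= high:
            return f"({low})" if low == high else f"({low}-{high})"
    return None


def group_small_providers(stats):
    """Merge small providers into buckets; keep large providers by name.

    stats: dict mapping provider -> {"instances": int, "users": int}
    """
    grouped = defaultdict(lambda: {"instances": 0, "users": 0, "providers": 0})
    for provider, values in stats.items():
        key = bucket_label(values["instances"]) or provider
        grouped[key]["instances"] += values["instances"]
        grouped[key]["users"] += values["users"]
        grouped[key]["providers"] += 1
    return dict(grouped)


# Invented sample data.
grouped = group_small_providers({
    "BigHost": {"instances": 150, "users": 1000},
    "SmallA": {"instances": 3, "users": 600},
    "SmallB": {"instances": 5, "users": 10},
    "Closet": {"instances": 1, "users": 4},
})
print(grouped)
```

A single large instance at a small provider can dominate its bucket, which is why a group like (2-9) may show few instances but many users.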

Fediverse chart 2020 2

First chart: providers are sorted by their number of hosted instances. OVH still leads the competition, and that's probably due to the ongoing hosting service offered by masto.host. It is followed by Sakura, Amazon, Hetzner, CloudFlare and DigitalOcean, same as in 2019. Some providers gained instances and users, others lost some. In total, these six providers host more than 50% of Fediverse instances. The high number of users in the (2-9) group is linked to the big drop in users at Amazon: pawoo.net migrated with its 605888 users from AWS (2019) to IDC Frontier (2020).

Fediverse chart 2020 2

Second chart, sorted by the number of hosted users. More visible in this chart is how pawoo.net is taking the cake: IDC Frontier replaces Amazon as the biggest user-hosting provider. Hetzner switched places with CloudFlare; OVH gains more but stays fourth. And what else do we see? IDC Frontier and Hetzner now host nearly 55% of users! 86% if you take the top five. Still problematic. But I also like the fact that there are so many little providers, hosting fewer than a hundred instances, or even just one! Maybe someone has put up a Raspberry Pi in the closet, home to a group of like-minded people? Not to forget that the big instances also attract a lot of one-shot users, who register and never return. Luckily, instances.social also gathers the active_users data point!

Fediverse chart 2020 3

Third chart, sorted by the number of active users. This data comes from querying each instance at its activity API endpoint, which returns weekly login counters. I'm not sure how instances.social treats this input (aggregation, snapshot, average), but we'll use it anyway. What changes? Hetzner suddenly jumps to first place, followed by OVH and IDC Frontier. Hetzner is interesting, since it has 27.11% of hosted and 26.85% of active users, a pretty even ratio. IDC Frontier falls from 27.9% to 17.3%, and CloudFlare likewise from 18% to 10.4%. The winner is OVH, with a rise from 10.2% to 24%. One could guess that paid-for instances have a stronger incentive to keep their users active, but that cannot be read from the data. To go further we could explore the composition of instances for each hoster. Hetzner for example hosts mastodon.social, the flagship instance maintained by the creator of Mastodon, with 29831 active users (around 70% of Hetzner's active users). But that analysis will have to wait for another time.
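To make the shift between the two views easier to compare, the percentages quoted above can be turned into differences in percentage points. The snippet is only a small worked calculation on the figures from this paragraph; the names `SHARES` and `share_shift` are made up for the example.

```python
# Shares in percent as quoted in the text: (hosted users, active users).
SHARES = {
    "Hetzner": (27.11, 26.85),
    "IDC Frontier": (27.9, 17.3),
    "CloudFlare": (18.0, 10.4),
    "OVH": (10.2, 24.0),
}


def share_shift(shares):
    """Return provider -> change in percentage points (active minus hosted)."""
    return {
        provider: round(active - hosted, 2)
        for provider, (hosted, active) in shares.items()
    }


shifts = share_shift(SHARES)
# Print the providers from biggest gain to biggest loss.
for provider, delta in sorted(shifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{provider:13s} {delta:+.2f} pp")
```

OVH gains the most when only active users are counted, Hetzner stays almost unchanged, and IDC Frontier and CloudFlare lose the most.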

All in all, the landscape has not changed much since 2019. We still have a long way to go, but I'm optimistic that people who care will also be the people who build the future of decentralised platforms. Regarding the analysis, that's it for now. Below, you'll find more information regarding research and my conclusion.

Fun fact: in some cases IP addresses resolved to strange networks, like Facebook or Twitter. I thought, okay, these might be legitimate test instances. It turns out the registration of the IP block had changed so fast that my IP-to-AS data set was outdated after two days! More on that in the technical follow-up.

Research

During my search for similar investigations of the Fediverse, I stumbled upon the research paper Challenges in the Decentralised Web: The Mastodon Case (published September 2019, discovered via Wikipedia). In it, Aravindh Raman et al. explain the basics of the Fediverse and pick out multiple weak spots they discovered during their examination of instances and federation. They especially focused on the network with regard to availability and centralisation; other aspects they cover are the openness of registration and the usage of tags. Also included are comparisons drawn by utilising a Twitter data set as well as an examination of HTTPS statistics (spoiler: Let's Encrypt provides more than 85% of instances with their TLS certificates). In their conclusion they highlight the risk [of] converging to semi-centralised systems, propose improvements and share related and future work.

Another thing that landed on my desk was my own design of a decentralised platform, written in March 2017. The so-called DUMP – Distributed Unified Module Platform was meant to be a student thesis proposal for further research during my time at the TU Dresden, at the chair of privacy and data security. The basic idea was to establish a base platform stack which is run by independent servers, with each of these servers having the ability to offer a defined adapter (API) for standardised modules.

The DUMP

These modules would then implement services for the local users, with the server handling data exchange with the module instances running on other servers. Summary: DUMP is the project name of a distributed social network platform that not only connects the profiles of its users but also offers services on top of this social network. The aim of the project is to create a concept that can be implemented in software and should result in the migration of centralised services as we know them on the public internet towards instances on the DUMP. I did not pursue writing a thesis about this topic. Regardless, it's fun to see that old ideas sometimes come back into use, even if it's just for reference!

Conclusion

If you are a member of the Fediverse – watch out for your data. If you are an admin running an instance in the Fediverse – watch out for your and other people's data. If you care about availability, privacy and the future of decentralised infrastructure, think about who owns your server hardware. And lastly, if you are a small infrastructure provider - you can be part of the progress as well! Offer discounts to admins, provide tutorials, host yourself – you name it.

As always, thanks for reading! Don't forget about the follow-up post "Python and data: slicing, caching, threading". Comments welcome via Twitter or e-mail. Happy to hear from you!