Measuring IPv4 vs. IPv6 Performance
Bash,Networks,University,Web.Posted October 19, 2015. 4195 words.
I produced a series of bash scripts to automate the process of pinging the list of websites. I choose bash as it is trivial to pipe the output from ping into various other command line programs such as: sed, gawk and wget. As it was completely automated I decided to start early and just let it run. In total I pinged the top 100,000 websites up to 100 times each using script.sh (See Appendix A), recording useful statistics.
The script is very simple; a while loop to iterate over each site, and ping/ping6 piped into gawk to process the result. Gawk is very good at processing this sort of data, and more than fast enough to perform the task. The result was output to a large (11.1MiB) CSV file, with the IPv4 and IPv6 of each site, on separate lines.
As part of creating this script, I assumed that every site, if it exists would be able to respond within 10 seconds if it was online. If not, my script would timeout and assume the site was down. I feel this is a reasonable assumption as any remotely popular site should respond quickly, unless it’s currently being DDOSed. Another assumption I made is that any site I scanned would be able to withstand 5 requests per second. Even a raspberry pi is capable of serving 43 static pages per second1. As I sent a maximum of 50 requests, the brief period of slightly increased load should be negligible for any of the sites I scanned.
In hindsight, I would have combined both IPv4 and IPv6 into a single line from the start, as manipulating the data in excel is significantly easier to do if it is all on a single line. By that time I had already scanned the top 100,000 site’s so simply regathering the data was impossible. To fix this, I created combine.sh (See Appendix B) which simply echo’s the IPv4 line without a newline, then the IPv6 line with one. This is the reason I have some duplicated columns in my combined output. These are removed in Appendix D.
Whilst looking through the IPv6 column I noticed a very common prefix: “2400:cb00:”. After some research I discovered that this prefix belongs to cloudflare2. Using the prefixes I found on whatmyip.co3, I created a table mapping the hosting company to the number of sites it hosts. The results are impressive.
I decided to lookup the geolocation of each website. Looking around for a convenient database or API, I stumbled upon freegeoip.net4. It allows you to easily gather geolocation information for a specified IP in CSV form, perfect for my coursework. To retrieve this information using lookup.sh (See Appendix C) I self-hosted my own instance, then used cURL and a simple while loop to request and printf all the location information about each site to a file. I decided to record all the information given, to keep the script simple and retain all the information, to ensure I didn’t need to re-run the script.
Once the data was collected, it was time to head to Excel to analyze the data and draw conclusions. Having a large dataset let me create very good graphs, and draw good conclusions but was tedious to work with in Excel. Certain formulas, such as the ones used to create the Average Response Time per Country over Distance graph, managed to crash Excel numerous times and it even ran out of memory every now and then. In future when dealing with similarly sized amounts of data I would need to look into other graphing tools.
Despite not being perfect, I am happy with what data I gathered and how I gathered it. The results speak for themselves.
I decided to plot all 100,000 points as an X-Y scatter. In this, and subsequent graphs, IPv4 is blue and IPv6 is red. Immediately I noticed rather obvious bands of pings, which are shown in the Histogram, below.
There are large peaks in the graph above. It’s interesting to note that despite the lower adoption of IPv6, the initial peak is half the height of IPv4. Past that, the frequency is low, indicating lower IPv6 response times.
Thanks to CDNs and local sites, the top 1000 sites are concentrated around the 4-6ms mark with both averages trending slowly upwards. IPv6 is always significantly lower than IPv4 in the graph above.
From a glance, you can see regions such as Africa, the Caribbean, and the Middle East without any IPv6 deployment. Sites are concentrated around the U.S.A, Europe and East Asia, with barren areas in-between.
Average response times for IPv4 only and IPv6 only sites is roughly the same at about 104ms. The average min and max are largely identical as well; nothing surprising. On the other hand, the averages for sites running both IPv4 and IPv6 is very low in comparison – only 25ms compared with over 100ms! Now the question is, why are sites running both IPv4 and IPv6 significantly faster?
A Vast majority of 90% of the internet is IPv4 only, with only 4.5% of sites providing both. In fact, more sites provide neither than both! It’s impossible to not have an IPv4 address and be truly connected.
A large number of the top 100,000 websites are either blocking ICMP echo requests (ping) or are simply offline. Alternatively they could only be only listening to a specific sub-domain. I didn’t check for this.
One reason sites that provide both IPv4 and IPv6 are faster, is that 65% of them are behind cloudflare, or google. Both have worldwide CDNs, and cloudflare provides a free IPv6 gateway; allowing IPv4 only sites to be connected to using IPv6.
The U.S.A. is the world leader in number of hosted sites with 43% of the market. Comparatively, every other country is trailing behind with Canada at 9%, Germany at 6%, and Hong Kong at 6%. This is despite the existence of global CDNs.
To test the geolocation accuracy, I potted estimated distance, over average response time. Sites above the diagonal are likely closer than their IP suggests. Sites below the diagonal simply have a poor connection.
Besides a colourful graph, this shows the grouping of sites in different countries; explaining why there are so many peaks in the Histogram. There is a minimum amount of time it takes to connect to distant hosts.
I only ran the script once, and as the script took easily a week to analyze the top 100,000 sites, chances are the sites at the top had changed since the start. I could have gotten around this by parallelizing the script, or running it multiple times on a smaller set and taking an average. Parallelizing seemed too complex for the task at hand, and I didn’t really consider running it multiple times before it had almost finished analyzing. As I didn’t want to discard all the data, I decided to go ahead with the data I had.
The geolocation database I used isn’t 100% accurate - that is virtually impossible. You can see that it isn’t on the Average Response Time over Distance graph. Many sites are so significantly above the diagonal that the only way that the time would be possible is by breaking the speed of light; and indication that they are located much closer than their IP suggests. This is likely as common as it is, due the how exhausted IPv4 is, organizations are trading the limited number of IPv4’s that they have access to. This doesn’t really matter too much, due to the size of my data set. In future I could remove the outliers to produce cleaner results.
It is clear to see that sites that serve both IPv4 and IPv6 traffic are, on average, significantly faster than those that don’t (4 times faster on average). Every single graph I have produced shows this simple fact. I feel that this can be attributed to several factors:
- Cloudflare and Google account for ~65% of all IPv6 traffic and both have global CDNs to ensure a fast and reliable connection. There are many more IPv4 sites that neither are associated with. Instead of connecting to a distance site, a CDN acts as a proxy, speeding up the time it takes to connect.
- IPv6 isn’t widely deployed yet, with only 6% of sites serving it. Those 6% are usually the bigger websites; smaller sites may be stuck behind a single IPv4 address, in a datacenter that doesn’t support IPv6.
- Sites that serve both IPv4 and IPv6 traffic are concentrated in North America, Europe and East Asia – regions with geographically closer and better connected than the rest of the world. They are practically absent from regions such as Africa, the Caribbean and the Middle East. There is also little deployment in South America, West Asia and Oceania.
However, despite the fact that average response time for IPv6 is significantly faster than IPv4, it’s unlikely you’ll see any speed increase switching between IPv4 and IPv6 on a host that supports both. IPv6 is faster because the hosts that server both are well connected with fast response time, regardless of protocol.
Despite the best efforts of organizations such as worldipv6launch.org, cloudflare, and Google; IPv6 access is an afterthought. Despite claims of 500% growth since 2012, it’s 2015 and only 4.5% of sites support IPv6. As we go into the future, IPv6 deployment will surely grow as larger populations and the Internet of Things will strain the exhausted IPv4 pool even further. Until IPv6 is widespread, anyone with only an IPv6 address will be unable to connect directly to IPv4 only hosts, without the aid of a tunnel.
IPv6 isn’t evenly geographically distributed, compared with IPv4. If you’re in Africa, the Caribbean, or the Middle East, virtually no sites support IPv6. This suggests to me that the infrastructure required to support IPv6 just isn’t there.
Bigger sites are more likely to support and have a fast IPv4 and IPv6 connection than smaller sites. As you go through the different sites, the further down you get, the slower the site is to respond, on average.