Post Mortem, AS701 IPv6 network outage 2020-09-15

Yesterday, at the time of writing (2020-09-16), there was a large outage of unconfirmed cause on the AS701 network. IPv6 packets close in size to the network’s MTU (>1250 bytes) were dropped about every thirty seconds, causing almost all TCP connections that defaulted to the IPv6 route to fail quickly.

The incident began at around 21:23 on 2020-09-14, when a user reported problems uploading files over HTTPS to a dual-stacked IPv4/v6 web server and the network operator was quickly notified. Over the next few hours the network was tested, and the operator identified that the tunnel used to provide IPv6 service on the AS701 network was not tunneling packets properly, though no specific pattern had been identified yet. Most users either moved to an alternative network or removed their default IPv6 route to mitigate the issue. Certain systems and users that relied on IPv6 could not remove IPv6 routing completely, which caused a day-long outage of their service.

These users were deemed non-critical, and the incident was considered temporarily mitigated until the next day (2020-09-15), when further testing concluded that only large packets (>1400 bytes) were being dropped and that all protocols over IPv6 were affected, not just TCP as previously suspected. This was not our first diagnosis: we initially suspected a hardware failure in our network switch. That hypothesis was tested and turned out to be inaccurate, but we gained valuable metrics from the testing.

At this point, we had a testing methodology in place to solve one of the problems causing the incident, involving traceroutes with large packet sizes, iperf, and tcptraceroute6. After we adjusted the MTU announced in the router’s RA, all packets were treated equally and the dropping issue was no longer detected, but connectivity had not yet been restored.
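The size-dependent testing described above can be sketched in Python. This is only an illustration, not the tooling used during the incident: the function names are hypothetical, the probe assumes the Linux iputils `ping` (`-6`, `-M do`, `-s`, `-c`, `-W`; older systems may ship `ping6` instead), and the 1280 to 1500 byte search range is an assumption. Because the drops were intermittent, a single probe per size can mislead, so in practice each size would likely need to be probed several times.

```python
import subprocess
from typing import Callable, Optional

IPV6_HEADER = 40    # fixed IPv6 header size in bytes
ICMPV6_HEADER = 8   # ICMPv6 echo request header size in bytes


def ping6_probe(host: str, payload: int, timeout_s: int = 2) -> bool:
    """Send one ICMPv6 echo with `payload` data bytes and fragmentation prohibited.

    Assumes Linux iputils ping; returns True if a reply came back.
    """
    cmd = ["ping", "-6", "-c", "1", "-W", str(timeout_s),
           "-M", "do", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_passing_packet(probe: Callable[[int], bool],
                           low: int = 1280,
                           high: int = 1500) -> Optional[int]:
    """Binary-search the largest on-wire IPv6 packet size in [low, high]
    for which `probe` succeeds.

    `probe` receives the ICMPv6 payload size, i.e. the packet size minus
    the IPv6 and ICMPv6 headers. Returns None if even `low` fails. The
    search assumes every size above the limit fails consistently.
    """
    def passes(packet_size: int) -> bool:
        return probe(packet_size - IPV6_HEADER - ICMPV6_HEADER)

    if not passes(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if passes(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Against a real host this could be called as `largest_passing_packet(lambda size: ping6_probe("2001:db8::1", size))`, where the address is a documentation placeholder rather than a host from this incident.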

Further testing confirmed that speeds over UDPv6 were one tenth of the speeds over UDPv4, as previously identified, and the tunneling mechanism was finally shut down. The original tunnel was a SIT tunnel, which was dropping packets with no visible pattern, so a new GRE tunnel was created to replace it. The new tunnel was fully functional within 20 minutes, and all earlier modifications to the network that were no longer necessary were quickly rolled back. Full connectivity was restored to all systems and users at 18:57 on 2020-09-15.
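When swapping a SIT tunnel for a GRE tunnel, it helps to remember that the two have different encapsulation overheads, which changes the largest IPv6 packet that fits without fragmentation. The sketch below works through that arithmetic. The 1500-byte underlay MTU is an assumption (the actual value on this link is not stated here), and the GRE figure assumes the base 4-byte header with no key, checksum, or sequence number.

```python
IPV4_HEADER = 20  # outer IPv4 header without options, in bytes
GRE_HEADER = 4    # base GRE header (no key, checksum, or sequence number)

# Bytes added in front of each IPv6 packet by the tunnel type.
TUNNEL_OVERHEAD = {
    "sit": IPV4_HEADER,               # IPv6 directly inside IPv4 (protocol 41)
    "gre": IPV4_HEADER + GRE_HEADER,  # IPv6 inside GRE inside IPv4
}


def tunnel_mtu(kind: str, underlay_mtu: int = 1500) -> int:
    """Largest IPv6 packet that fits in one underlay packet for this tunnel."""
    return underlay_mtu - TUNNEL_OVERHEAD[kind]


def fits_unfragmented(packet_size: int, kind: str,
                      underlay_mtu: int = 1500) -> bool:
    """True if an IPv6 packet of `packet_size` bytes needs no fragmentation."""
    return packet_size <= tunnel_mtu(kind, underlay_mtu)


if __name__ == "__main__":
    for kind in TUNNEL_OVERHEAD:
        print(kind, tunnel_mtu(kind))
```

Under these assumptions the GRE tunnel carries 4 bytes less per packet than the SIT tunnel, so an MTU announced in the RA for the old tunnel may need to be lowered to match.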

In hindsight, we believe that Verizon changed their traffic monitoring system without prior notification, and that the changed system was unable to handle SIT traffic, particularly larger packets. This was extremely difficult to diagnose due to a lack of communication from Verizon and a lack of proper testing. In the future, we plan to test all backup connections to machines to ensure remote access over the backup links, so that any authorized connection to the internet can be used to reach headless systems and mitigate outages. We also plan to collect more metrics, including tests using larger packet sizes to aid data collection for issues that depend on packet size, and other TCP tests (HTTPS) for further diagnostics. Finally, systems to move clients reliant on IPv6 to the backup network, or to isolate the faulty protocols, will be put into place to minimize downtime for clients and systems.
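As a rough idea of what the planned size-aware metrics could look like, the Python sketch below groups probe results by packet size and flags sizes with unusually high loss. It is a minimal illustration only: the function names, the `(size, replied)` input format, and the 5% threshold are all assumptions, not part of any existing monitoring setup.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def loss_by_size(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Compute the loss rate per packet size.

    `results` is an iterable of (packet_size, replied) pairs, one per probe.
    Returns a dict mapping each size to its fraction of lost probes,
    ordered by ascending size.
    """
    sent: Dict[int, int] = defaultdict(int)
    lost: Dict[int, int] = defaultdict(int)
    for size, replied in results:
        sent[size] += 1
        if not replied:
            lost[size] += 1
    return {size: lost[size] / sent[size] for size in sorted(sent)}


def suspicious_sizes(rates: Dict[int, float],
                     threshold: float = 0.05) -> List[int]:
    """Return the packet sizes whose loss rate exceeds `threshold`."""
    return [size for size, rate in rates.items() if rate > threshold]
```

A pattern where small sizes show no loss while sizes near the MTU show heavy loss would point to a size-dependent problem much earlier than a generic reachability check.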
