As part of our DoS protection strategy, we use MaxMind, a digital mapping service provider, for IP location data. Yesterday MaxMind deprecated an IP address, which immediately degraded our DoS responsiveness.
On Thursday afternoon, when the problem first appeared, my instinct was that it was an upstream issue, as the complaints were about clipping and call quality. However, our transit and upstream monitoring showed a tonne of capacity, with sub-millisecond connectivity between the networks.
When we realised it was occurring within our network, we initially believed it was an issue with our RTP port range. By 12:00 pm Friday we’d isolated it to DoS prevention, and a short while later to the MaxMind change.
While we have now implemented a few additional checks and alarms, the cause was a simple change of IP at our geolocation provider.
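A minimal sketch of the kind of check mentioned above, not the actual implementation: it compares the geolocation result for a handful of known IPs against an expected baseline and reports any drift, so an alarm can be raised. The `EXPECTED` table, the example addresses (taken from documentation ranges) and the `lookup` callable are all hypothetical; `lookup` stands in for whatever queries the geolocation data.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical baseline: IPs we depend on and the country code expected for each.
# These addresses come from documentation ranges (RFC 5737), not real systems.
EXPECTED: Dict[str, str] = {
    "192.0.2.10": "AU",
    "198.51.100.20": "AU",
}


def find_geo_drift(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]],
) -> List[str]:
    """Return one message per IP whose lookup result differs from the baseline."""
    problems: List[str] = []
    for ip, want in expected.items():
        got = lookup(ip)  # assumed to return a country code, or None if unknown
        if got is None:
            problems.append(f"{ip}: no geolocation result (expected {want})")
        elif got != want:
            problems.append(f"{ip}: expected {want}, got {got}")
    return problems


if __name__ == "__main__":
    # Stand-in for the provider data: the second IP has disappeared from it.
    fake_data = {"192.0.2.10": "AU"}
    for message in find_geo_drift(EXPECTED, fake_data.get):
        print("ALARM:", message)
```

Run on a schedule, a check like this would flag a provider-side change without waiting for it to coincide with a DoS event.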
As to why it took 24 hours: the problem was intermittent and dependent on coinciding with a DoS event. While we saw the issue at a high level, we still needed to capture an actual call and ultimately analyse it in Wireshark. Simply enabling verbose logging against terabytes of data would have taken months to analyse. In the end, Gunjan captured an actual call, which we were able to trace to the DoS prevention and eventually to the MaxMind change of IP.
Starting in July, we are phasing systems into production that will add significantly more scale to the way we register and maintain subscriber presence. As for the way we handle DoS, while we have introduced granularity in the way we classify and treat offending IPs, that handling simply shifts into our session border.
Once again, if you’ve made it this far, thank you for reading. I’m happy to chat face to face if you require deeper detail.
Posted Jun 15, 2019 - 10:54 AEST
A fix has been implemented and we are monitoring the results.
Posted Jun 14, 2019 - 12:08 AEST
We are currently investigating call quality issues and will update once we know more.
Posted Jun 13, 2019 - 15:02 AEST
This incident affected: YourCloudTelco Calling Platform (Network).