20 October, 2023
This is a story about an incident that happened some time ago at Zalando. We had started preparing to introduce English on all European markets. Before this initiative, each Zalando market operated in its own language: French Zalando was French-only, Latvia's Zalando was in Lithuanian, and so on.
There were some outliers: the Swiss and Belgian Zalandos had several languages available, and German Zalando had English enabled. A lot of nuances.
Zalando has a rich ecosystem of microservices.
There are rules and policies that try to standardise the quality of services running in production. However, you can be sure of one thing: you won't know everything that happens before or after the service(s) your team owns. Each team has its own best practices, operational guidelines, and ways of working. And, as we discovered during the incident, you can expect each service to have its own localisation settings and validation rules.
This was all discovered later. Before that, we were the team owning the edge Zalando router service, and we were sure we did the language-settings stuff. Others should listen to us and accept the settings we pass, shouldn't they?
We are sitting on the edge and know what we are doing.
We developed a new set of language validation and language detection steps, started rolling it out to production, and then received a cold wave of failing responses from several main APIs. Zalando stopped working for thousands of people who had English as the preferred language in their browsers.
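The edge-side detection step can be sketched roughly like this. Everything here is illustrative (function names and the supported-language list are my assumptions, not the actual service code): pick the first language from the browser's Accept-Language header that the market supports, otherwise fall back to the market default.

```rust
// Hypothetical sketch of edge-side language detection, not the real
// Zalando router code. Given the browser's Accept-Language header,
// return the first market-supported language, or the market default.
fn detect_language<'a>(
    accept_language: &str,
    supported: &[&'a str],
    default: &'a str,
) -> &'a str {
    accept_language
        .split(',')
        // Strip quality weights like ";q=0.9" and surrounding spaces.
        .map(|part| part.split(';').next().unwrap_or("").trim())
        // First browser preference that matches a supported language.
        .find_map(|tag| supported.iter().find(|&&s| tag.starts_with(s)).copied())
        .unwrap_or(default)
}

fn main() {
    // A browser preferring English on the German market:
    let lang = detect_language("en-GB,en;q=0.9,de;q=0.8", &["de", "en"], "de");
    println!("{lang}"); // prints "en"
}
```

The subtle part is exactly what burned us: the edge decides a language is valid, but every downstream service still has to agree with that decision.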
Some hard-coded settings in downstream services could not recognise the new language settings passed to them. Some of those services did nothing with the settings at all; they simply checked that whatever came in matched what had been written long ago. In some places that code hadn't been touched for years.
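The kind of downstream check that bit us can be sketched like this (hypothetical; the real services and their allowlists differ). Note that the service never uses the language for anything — it only validates it against a list frozen at launch time:

```rust
// Hypothetical sketch of a hard-coded locale validator in a downstream
// service. The allowlist was written years ago, long before English was
// enabled on most markets, so "en-DE" and plain "en" are simply absent.
fn is_known_locale(locale: &str) -> bool {
    const KNOWN_LOCALES: &[&str] = &[
        "de-DE", "fr-FR", "de-CH", "fr-CH", "nl-BE", "fr-BE",
    ];
    KNOWN_LOCALES.contains(&locale)
}

fn main() {
    assert!(is_known_locale("de-DE"));
    // The new edge service started forwarding English settings,
    // which this old check rejects outright.
    assert!(!is_known_locale("en-DE"));
    println!("old validator rejects en-DE");
}
```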
The incident may seem quick to recover from, but it backfired in a different place. iOS clients realised their tracking requests were being rejected and started to spin retries. That burst of retries became a DDoS attack on our ingress, which could not scale in time. Ingress saturation started to produce timeouts and dropped requests from customers who had no English settings at all. Now the whole of Zalando was affected, because those retries were set up with a backoff strategy, and over time they just grew more and more.
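A retry schedule like the one those clients used can be sketched as follows (a minimal sketch with assumed base delay and cap, not Zalando's or Apple's client code). The point is that backoff spreads retries out in time, but it does not drop them: every failed tracking request stays queued, so during a long outage the total backlog keeps growing.

```rust
use std::time::Duration;

// Minimal sketch of an exponential-backoff retry schedule. The base
// delay (500 ms) and the cap on the exponent are assumptions for
// illustration. Each failed attempt waits roughly twice as long as
// the previous one before retrying.
fn backoff_delay(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    // Cap the exponent so the delay tops out at 128x the base.
    let factor = 2u64.saturating_pow(attempt.min(7));
    Duration::from_millis(base_ms.saturating_mul(factor))
}

fn main() {
    for attempt in 0..5 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt));
    }
}
```

Backoff protects a single client, but when thousands of clients fail at once and none of them ever give up, the server still sees an ever-growing wave of retries on top of regular traffic.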
Rolling back to the previous version of the edge service did the trick. All tracking requests made their way to their end destinations, and everything was back to normal.
But there were a few lessons:
Troy Köhler is a software engineer living in Berlin, Germany, with ~6 years of experience in the technology industry. He used to work on one of the biggest e-commerce products in Ukraine, and now works at Zalando, which has more than 7 million paying customers. He focuses his study and expertise on the Rust language, complex backend systems, product development and engineering platforms.
I don't do emails now, but you can subscribe for the future.