Description of the issue:

The root cause was that Mailgun suddenly began sending us emails with a body of null, and our system was not built to handle a null email.body. When a Kafka message cannot be processed, it blocks its partition, because messages within a partition are processed sequentially rather than in parallel. Our Kafka consumer reads from 6 partitions, so while some were blocked, others were not, and most emails still got through. The emails that triggered the error, and every email queued behind them in each blocked partition, did not.
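The blocking behavior described above can be illustrated with a small simulation. This is a sketch only, not our consumer code: it uses no Kafka client, and the names EmailMessage, parseBody, and drainPartition are hypothetical and exist purely for illustration. It assumes a failing message is never skipped, so the partition cannot advance past it.

```typescript
// Hypothetical message shape; field names are assumptions for illustration.
interface EmailMessage {
  id: string;
  body: string | null;
}

// Stand-in for the parsing step that failed on a null body.
function parseBody(message: EmailMessage): number {
  if (message.body === null) {
    throw new TypeError(`email ${message.id} has a null body`);
  }
  return message.body.length;
}

// Process one partition strictly in order and stop at the first failure,
// modelling a partition that cannot advance past a message that keeps failing.
function drainPartition(messages: EmailMessage[]): { processed: string[]; stuck: string[] } {
  const processed: string[] = [];
  for (let i = 0; i < messages.length; i++) {
    try {
      parseBody(messages[i]);
      processed.push(messages[i].id);
    } catch {
      // The failing message and everything queued behind it stay unprocessed.
      return { processed, stuck: messages.slice(i).map((m) => m.id) };
    }
  }
  return { processed, stuck: [] };
}

// Two simulated partitions: the first contains a null-body email, the second does not.
const partitions: EmailMessage[][] = [
  [{ id: "a1", body: "hello" }, { id: "a2", body: null }, { id: "a3", body: "hi" }],
  [{ id: "b1", body: "ok" }, { id: "b2", body: "fine" }],
];

const results = partitions.map(drainPartition);
```

In this simulation the first partition processes only a1 and leaves a2 and a3 stuck, while the second partition drains completely, which mirrors why most emails still arrived during the incident.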

On Friday, January 21 at 10:29 AM PT, Chris Chavez posted in the #tool-dal Slack channel that we had missed an inbound email from a client. Further investigation showed that several of our Kafka pipelines were blocked by errors thrown while parsing emails with a body of null. As a result, many emails were received via Mailgun and added to the Kafka queue but were never processed, so they did not land in our DAL Postgres database and did not appear in the DAL as expected.

What was the cause?

We are still investigating why Mailgun suddenly started sending us emails with a body of null; this had never happened in prior years.

What was done to resolve the issue?

The fix was quite simple. Rodrigo (who also provided context on this issue) put together a PR, which I approved. The PR sets a default value for email.body in the emailReceived consumer: if no truthy value is provided, the body is set to an empty string.
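A minimal sketch of that kind of default is shown here. It is not the code from the PR: the InboundEmail shape and the normalizeBody helper are hypothetical names, and the real emailReceived payload may look different.

```typescript
// Hypothetical payload shape; the real emailReceived payload is not shown in this report.
interface InboundEmail {
  id: string;
  body?: string | null;
}

// Fall back to an empty string whenever body is not truthy (null, undefined, or "").
function normalizeBody(email: InboundEmail): InboundEmail & { body: string } {
  return { ...email, body: email.body || "" };
}

// Usage: a null body becomes an empty string, and a real body passes through unchanged.
const fromNull = normalizeBody({ id: "1", body: null });
const fromText = normalizeBody({ id: "2", body: "Hello" });
```

Using `||` rather than `??` matches the "no truthy value" wording above, since it also covers undefined and an already empty body.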

PR is here: https://github.com/invisible-tech/yggdrasil/pull/6274

Timeline of events (all times UTC on Friday, January 21):

What was the impact?