The cause was that we suddenly started receiving emails from Mailgun with a body of null, and our system was not built to handle null values for email.body. When Kafka is unable to process an email, that partition is blocked, because messages within a partition are processed sequentially rather than in parallel. Our Kafka consumer has 6 partitions, so while some were blocked, others were not, and most emails were still getting through. But the emails that caused the issue, as well as the emails queued behind them in each blocked partition, were not.
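The blocking behaviour described above can be illustrated with a minimal sketch. It does not use Kafka or our actual consumer code; the InboundEmail type, processEmail, and drainPartition are hypothetical names, and a blocked partition is modelled simply as stopping at the first message that fails.

```typescript
// Hypothetical message shape; only `body` matters for this illustration.
type InboundEmail = { id: string; body: string | null };

// Hypothetical handler that, like the pre-fix consumer, assumes body is a string.
function processEmail(email: InboundEmail): string {
  if (email.body === null) {
    throw new TypeError(`email ${email.id} has a null body`);
  }
  return email.body.trim();
}

// Process one partition strictly in order. The first failing message stops
// the loop, so it and everything queued behind it stay unprocessed.
function drainPartition(queue: InboundEmail[]): { processed: string[]; stuck: string[] } {
  const processed: string[] = [];
  for (let i = 0; i < queue.length; i++) {
    try {
      processEmail(queue[i]);
      processed.push(queue[i].id);
    } catch {
      return { processed, stuck: queue.slice(i).map((e) => e.id) };
    }
  }
  return { processed, stuck: [] };
}

// One partition with only well-formed emails, one containing a null body.
const healthy = drainPartition([
  { id: "a1", body: "hi" },
  { id: "a2", body: "hello" },
]);
const blocked = drainPartition([
  { id: "b1", body: "ok" },
  { id: "b2", body: null },
  { id: "b3", body: "never reached" },
]);
```

In this sketch the healthy partition drains completely, while the blocked one stops at b2 and leaves b3 waiting behind it, which matches the pattern of most emails arriving and a subset going missing.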
On Friday January 21 at 10:29 AM PT, Chris Chavez posted in the #tool-dal Slack channel that we had missed an inbound email from a client. Further investigation showed that several of our Kafka pipelines were blocked due to errors when parsing emails with a body of null. As a result, many emails were being received via Mailgun and added to the Kafka queue, but were not being processed and written to our DAL Postgres database, and so were not displaying in the DAL as expected.
We are still investigating why we suddenly started receiving emails from Mailgun with a body of null, as this had never happened in prior years.
The eventual fix was quite simple. Rodrigo (who assisted with providing context on this issue) put together a PR that I approved. The PR set a default value for email.body within the emailReceived consumer, setting it to an empty string if no truthy value was provided.
PR is here: https://github.com/invisible-tech/yggdrasil/pull/6274
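The defaulting approach described above can be sketched as follows. This is an illustration only, not the code from the PR: the EmailReceivedPayload type and the withDefaultBody helper are hypothetical, and the real emailReceived consumer may structure this differently.

```typescript
// Hypothetical payload shape; the real emailReceived payload is not shown in this report.
type EmailReceivedPayload = { subject?: string; body?: string | null };

// Fall back to an empty string when body is null, undefined, or otherwise falsy,
// so downstream parsing always receives a string.
function withDefaultBody(
  payload: EmailReceivedPayload
): EmailReceivedPayload & { body: string } {
  return { ...payload, body: payload.body || "" };
}

const normalized = withDefaultBody({ subject: "Hello", body: null });
const untouched = withDefaultBody({ subject: "Hello", body: "Some text" });
```

The sketch uses || rather than ?? because the fix was described as applying whenever no truthy value is provided. For a string field the two only differ in that || also replaces an empty string, which yields an empty string either way.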
Timeline of events (all times UTC on Friday, January 25):
6:29 PM: Chris Chavez reported the issue of at least one email not landing in the DAL in #tool-dal
7:04 PM: Zack responded to the thread with a request to clarify the issue and provide more information
7:09 PM: Zack received a copy of the email that was missed, forwarded from Chris.
7:13 PM: Zack began investigating, including:
7:34 PM: Zack could replicate the issue and confirmed:
accepted
emails table, indicating it had not been processed
7:35 PM: Zack reached out to Rodrigo to ask for context on the DAL’s email receipt system. Rodrigo began sharing context and assisting with the investigation.
8:51 PM: Additional Operations personnel (Oscar Barrios and Eric Franco) began reporting the same issue: they were not receiving emails in the DAL.
10:03 PM: Rodrigo restarted Mimir to see whether that would resolve the issue; it did not.
10:14 PM: Rodrigo saw the error logged in Mimir’s Heroku logs:
10:21 PM: Rodrigo provided this PR (https://github.com/invisible-tech/yggdrasil/pull/6274) setting the default email.body to an empty string. Zack approved and deployed.
10:48 PM: All emails in the Kafka queue were processed and landed in the DAL, and the issue was resolved.