The On-Call rotation for the Client Engineering (ex-Flow) team is on OpsGenie.

It is set to rotate daily, not weekly like the other On-Calls, because that is what the team members preferred when this was raised during standups.

Since everyone can look ahead on the chart and see which days they will be occupied, try to plan those days accordingly. If you cannot be available on a given day, we can always do an override so you can exchange shifts with another engineer who is available. Just make sure you raise it in advance so no one has to do a surprise On-Call shift out of the blue.

The responsibilities of the On-Call Engineer will be as follows:

  1. On-Call starts at 4 AM EST.
  2. 4 AM to 8 AM EST: Check Slack for any fires. If there are none, there is no need to log in right away. Just make sure WhatsApp is active with notifications on and your phone is with you in case someone needs something urgent. This should rarely happen.
  3. 8 AM to 6 PM EST: Be online on Slack and extremely vigilant. If you see P0 reports on channels related to migrated processes, jump in without waiting to be tagged by a PM. Triage and diagnose the issue, and check the obvious places for logs such as Datadog and Sentry. Meet with agents experiencing issues and check their consoles. If you cannot figure it out, rope in engineers with more context. Use your best judgement: strike a balance between getting Ops a quick resolution and sparing your teammates a context switch.
  4. 6 PM EST to 4 AM EST the next morning: These should be quiet hours. You can log off from Slack and consider the work day over, but keep your phone near you with WhatsApp connected in case someone needs to call about something urgent. Once this period is over, the next person in the cycle takes over and you return to a regular work day.
  5. Before you hand over, please make sure all Closed and Pending incidents that you covered during your On-Call shift are updated here on Notion in the Incidents' Database: https://www.notion.so/invisibletech/823eeb3190c94ae89e5ef44d889a2141?v=f940ea4b3032490ca411069593bb0404. This ensures the next On-Call knows of any ongoing issues and has context on the issues from the past few days, as they tend to recur.
  6. We are not setting up call alerts or similar OpsGenie features at the moment, because a lot of these issues are not even found at the "Error Logs" level. We will keep it simple with Slack and WhatsApp calls for now.
  7. We are not scheduling any "handover meetings" (for now at least). But when your On-Call shift ends, please drop a message here on this channel and tag the next On-Call. The next On-Call should acknowledge that they have taken over.
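
For anyone who wants the shift windows at a glance, here is a minimal sketch of the three phases above as a lookup. The function name, phase labels, and the choice to treat each window as start-inclusive and end-exclusive are assumptions for illustration only; nothing like this is wired into OpsGenie or Slack.

```python
from datetime import time

# Phase boundaries from the list above, on the team's EST clock.
EARLY_START = time(4, 0)    # On-Call starts
ACTIVE_START = time(8, 0)   # must be online on Slack from here
ACTIVE_END = time(18, 0)    # can log off Slack from here

# Hypothetical labels; the text paraphrases the responsibilities above.
EXPECTATIONS = {
    "early_standby": "Check Slack for fires; keep WhatsApp notifications on and phone with you.",
    "active": "Be online on Slack; jump on P0s for migrated processes without waiting to be tagged.",
    "overnight_standby": "Log off Slack; keep phone nearby with WhatsApp connected for urgent calls.",
}


def on_call_phase(now_est: time) -> str:
    """Return the phase label for a wall-clock time given in EST.

    Assumption: windows are start-inclusive and end-exclusive, so
    exactly 8 AM counts as "active" and exactly 6 PM as overnight.
    """
    if EARLY_START <= now_est < ACTIVE_START:
        return "early_standby"
    if ACTIVE_START <= now_est < ACTIVE_END:
        return "active"
    # Everything else falls in the 6 PM to 4 AM window.
    return "overnight_standby"


if __name__ == "__main__":
    phase = on_call_phase(time(9, 30))
    print(phase, "->", EXPECTATIONS[phase])
```

The input is a plain EST wall-clock time, so convert from your local time zone before calling it.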

What Product/Management needs to ensure:

  1. 6 PM EST is already EXTREMELY late for pretty much any engineer on this team.
  2. We need to do much better than we have recently at making sure P0s don't arise at the tail end of the day, US time. We need to set very clear expectations with Ops about support. If something happens after 6 PM EST that can't wait until the following day without risking SLAs, it means we have already failed to leave ourselves that margin for error.
  3. Of course, depending on how much action the On-Call engineer sees, their assigned tasks for the week may take a hit, so factor in that margin for the worst case.
  4. This rotation is for CE & Migrations specific use-cases. If there is a system-wide issue or outage, the regular EGT On-Call should be the person to go to. This On-Call applies ONLY when a single process is experiencing an issue that seems to involve its configs, data, custom WACs, automations, etc.