Mystery Timeouts with MS Graph API Webhook Subscriptions and AWS API Gateway

Microsoft Graph API supports a Subscription Resource type that manages their webhook implementation for delivering change notifications to clients. Subscriptions are enabled for several resources, including Outlook messages — this was a perfect fit for my need to manage incoming messages to a shared inbox in O365.

At a high level, an implementation looks like this:

Make a subscription request, providing a custom notificationUrl endpoint
Respond to the notification validation request
Begin handling webhooks on the listener
Periodically renew the above subscription — max expiration time depends on the resource type
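
As a rough illustration of the first step, here is a minimal sketch of the request body for creating a subscription. The field names (changeType, notificationUrl, resource, expirationDateTime) come from the Graph subscription resource; the helper name, the example URL and the mailbox path are placeholders I made up for illustration, not values from my setup.

```python
from datetime import datetime, timedelta, timezone

# Graph endpoint that accepts the subscription POST.
GRAPH_SUBSCRIPTIONS_URL = "https://graph.microsoft.com/v1.0/subscriptions"


def build_subscription_payload(notification_url, resource, lifetime_minutes, now=None):
    """Build the JSON body for a subscription create request (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" in the timestamp is accurate.
    expires = now.astimezone(timezone.utc) + timedelta(minutes=lifetime_minutes)
    return {
        "changeType": "created",              # notify on new messages only
        "notificationUrl": notification_url,  # the webhook listener endpoint
        "resource": resource,                 # what to watch
        "expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


# Placeholder values for illustration only.
payload = build_subscription_payload(
    "https://example.com/webhook",
    "users/shared-inbox@example.com/mailFolders('Inbox')/messages",
    lifetime_minutes=60,
)
```

The payload would be sent as JSON with a bearer token to GRAPH_SUBSCRIPTIONS_URL; authentication is left out of this sketch.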

Implementation

My approach was pretty straightforward: an API Gateway hooked up to a Lambda handler for webhooks, and a CloudWatch rule to trigger the Subscription lifecycle management Lambda.

Importantly, the Lambda behind the API Gateway (webhook request/response handler) handles initial subscription validation in addition to actual incoming subscription notifications. Here’s a quick summary of the requirements of the validation process:

Accept an incoming HTTP POST with a validationToken value passed in the query string
Decode and escape the token value
Return an HTTP Response with:
- body containing the decoded token
- status code of 200
- content type of text/plain
- delivered within 10 seconds
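
The validation requirements above can be sketched as a small Lambda handler. This is a simplified illustration rather than my production code: it assumes an API Gateway Lambda proxy integration (so the event carries queryStringParameters that API Gateway has already URL-decoded), and the 202 reply for ordinary notifications is my assumption about a reasonable acknowledgement.

```python
import html


def handler(event, context):
    """Answer Graph validation requests; acknowledge everything else."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    token = params.get("validationToken")

    if token is not None:
        # Validation request: echo the token back as plain text with a 200.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "text/plain"},
            "body": html.escape(token),  # escape before echoing
        }

    # Regular change notification: real processing would go here.
    # Assumption: reply quickly with a 2xx and do heavy work elsewhere.
    return {"statusCode": 202, "body": ""}
```

Keeping the validation branch free of any network calls or heavy imports helps the response stay well inside the 10 second window, even on a cold start.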

The maximum expiration time for Outlook messages is 4230 minutes (just under three days), so I set up the CloudWatch rule to run once per day to be safe. This worked perfectly: every day, the subscription extension Lambda was invoked and the Graph API replied successfully. The total round trip time, which also included a validation request to the webhook listener, was usually under a second.
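
For a sense of what the daily renewal does, here is a minimal sketch that builds (but does not send) the PATCH request. The 4230 minute limit is the one mentioned above; the 30 minute safety margin, the function name and the token handling are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

MAX_LIFETIME_MINUTES = 4230   # Outlook message limit quoted above
SAFETY_MARGIN_MINUTES = 30    # assumption: stay a little under the maximum


def build_renewal_request(subscription_id, access_token, now=None):
    """Return a urllib Request that extends the subscription (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    lifetime = timedelta(minutes=MAX_LIFETIME_MINUTES - SAFETY_MARGIN_MINUTES)
    expires = now.astimezone(timezone.utc) + lifetime
    body = json.dumps(
        {"expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ")}
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://graph.microsoft.com/v1.0/subscriptions/{subscription_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to urllib.request.urlopen would perform the call. Since each renewal pushes the expiration out by almost three days, a daily schedule leaves room for one or two failed runs before the subscription actually lapses.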

Mystery Timeouts

The above implementation worked as expected for months — until one day, I started to receive errors when updating/extending the subscription:

{
   "code": "InvalidRequest",
   "message": "Subscription validation request timed out."
}

With no code or infrastructure changes on our end, I started debugging by manually running the subscription update request, e.g.:

PATCH https://graph.microsoft.com/v1.0/subscriptions/{id}
Content-type: application/json

{
   "expirationDateTime":"2020-09-22T18:23:45.9356913Z"
}

This failed the first few times with the same error as above, yet succeeded on the 4th or 5th attempt. The same pattern repeated over several more rounds of testing: a run of failures, ending at some random point with a successful response.

As in the months before this issue began, successful requests returned in under a second. Notably, the failed requests did seem to time out in some sense, as I would not receive the HTTP 400 until at least 10 seconds had passed. According to our logs, these failed requests never appeared to reach our infrastructure at all.
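
Since a retry eventually succeeded, one stopgap would be to automate what I was doing by hand. To be clear, this is not the fix I ended up with, just a sketch of the idea: retry only when the response is the specific validation timeout error, with a growing delay between attempts. The function names and the send callable (which is assumed to return a status code and a parsed JSON body) are hypothetical.

```python
import time

VALIDATION_TIMEOUT_MESSAGE = "Subscription validation request timed out."


def is_validation_timeout(status, body):
    """True if the response looks like the validation timeout error."""
    if not isinstance(body, dict):
        return False
    # Accept the error object either nested under "error" or at the top level.
    error = body.get("error", body)
    if not isinstance(error, dict):
        return False
    return status == 400 and error.get("message") == VALIDATION_TIMEOUT_MESSAGE


def renew_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning the validation timeout error.

    Returns (status, body, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_validation_timeout(status, body):
            return status, body, attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return status, body, max_attempts
```

Each failed attempt takes at least 10 seconds to come back, so five attempts can run past a minute; the Lambda timeout would need to allow for that.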

Hypothesis

The nature of the random successful requests, sprinkled within failures that never appear to reach our endpoint, seemed to hint at some type of rate limiting, either on the Microsoft side or perhaps enforced on API Gateway at AWS. It’s possible that the number of integrations developed with API Gateway increased over time, reaching some threshold which triggered the limiting of connections from MS Graph API to API Gateway. Additional troubleshooting with public webhook listener testing tools was unable to reproduce this issue, which seemed to point further to something specific to API Gateway.

Graph API support declined to provide assistance, suggesting we didn’t have the appropriate enterprise support plan. Personally, I don’t think the nature of this issue should require one. If anyone at MS support reads this, I invite you to take a look as this may be affecting other implementations in us-east-1.

Workaround

In the end, I was able to work around this issue by switching from API Gateway to a public ALB. For reference, here’s a handy CloudFormation snippet for the Serverless Framework.
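
One thing to watch when moving a handler behind an ALB is that the Lambda event and response shapes differ from API Gateway. The sketch below is an illustration under assumptions rather than my exact handler: it relies on the documented ALB behaviour of passing query string values through without URL-decoding them, and on the ALB response format that includes isBase64Encoded and statusDescription. The function name is made up.

```python
import html
from urllib.parse import unquote_plus


def alb_handler(event, context):
    """Validation handler adapted for an ALB Lambda target (sketch)."""
    params = event.get("queryStringParameters") or {}
    raw_token = params.get("validationToken")

    if raw_token is not None:
        # ALB does not decode query parameters, so decode here.
        token = unquote_plus(raw_token)
        return {
            "statusCode": 200,
            "statusDescription": "200 OK",
            "isBase64Encoded": False,
            "headers": {"Content-Type": "text/plain"},
            "body": html.escape(token),
        }

    # Regular change notification: acknowledge quickly (assumed 202).
    return {
        "statusCode": 202,
        "statusDescription": "202 Accepted",
        "isBase64Encoded": False,
        "headers": {"Content-Type": "text/plain"},
        "body": "",
    }
```

Forgetting the decode step would echo back a still-encoded token, which would not match what Graph expects and would fail validation for a different reason than the timeouts described here.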
