Someone messaged me the other day asking for assistance.

Their question was intriguing: “Can you help figure out why integration requests to our API have started failing, with an authentication error, when extra whitespace is added to the incoming JSON payload?”

The mysterious case of whitespace that… breaks API authentication

The breaking requests looked something like this:

{
  "data" : {
    "type" : "object",
    "name" : "roses",
    "color" : "red"
  }
}

Our systems responded with HTTP 403 when given this payload, even though the requests were properly authenticated.

Meanwhile, the very same requests would succeed if formatted like this:

{
  "data": {
    "type": "object",
    "name": "roses",
    "color": "red"
  }
}

The only difference is the presence of extra whitespace in between the JSON keys and the subsequent colon.

I’m used to seeing JSON object keys formatted with no whitespace between "key" and the subsequent colon :, like "key": "value". I double-checked the JSON spec in RFC 7159 section 2 (since superseded by RFC 8259, which keeps the same wording) and found that both ways are perfectly valid JSON:

Insignificant whitespace is allowed before or after any of the six structural characters.

(The “structural characters” include name-separator = ws %x3A ws ; : colon.)
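To make the spec's point concrete, here is a minimal sketch using Python's standard json module. It shows that a conforming parser sees the two payloads as the same value even though their raw text differs. The variable names are mine, chosen for illustration.

```python
import json

# Same payload, two formattings: with and without a space before each colon.
spaced = '''{
  "data" : {
    "type" : "object",
    "name" : "roses",
    "color" : "red"
  }
}'''

compact = '''{
  "data": {
    "type": "object",
    "name": "roses",
    "color": "red"
  }
}'''

# A conforming parser yields the same value for both documents...
same_value = json.loads(spaced) == json.loads(compact)

# ...even though the raw text differs.
same_text = spaced == compact

# json.dumps can reproduce the "spaced" style through its separators argument,
# which is one plausible way a client ends up sending this formatting.
regenerated = json.dumps(json.loads(compact), indent=2, separators=(",", " : "))

print(same_value, same_text, regenerated == spaced)
```

Anything that parses the JSON should treat the two requests identically, so a component that tells them apart is probably looking at the raw bytes instead.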

Where does the problem occur?

I checked to see if we had changed our application’s JSON parsing library. I checked if anything had changed on the sending side. In both cases, nothing had recently changed.

In our configuration, our public API endpoints are normally handled by AWS API Gateway.

API Gateway has a bit of internal complexity, but roughly speaking we use it like this:

sequenceDiagram
    participant c as Client
    participant gw as AWS API Gateway
    participant app as Application Server

    c->>gw: Incoming request
    gw->>gw: Checks authentication
    Note right of gw: Could have auth/parsing errors?
    gw->>app: Sends allowed, authenticated requests
    Note right of app: Could have auth/parsing errors?
    app-->>gw: Returns response
    gw-->>c: Returns response

We don’t expect API Gateway to do much more than handle authentication and a bit of URL path routing for us. We expect it to pass all authenticated inbound payloads to our back end servers for processing.

However, strangely, the whitespace problem was only reproducible when requests were sent through AWS API Gateway. If we sent test requests straight to our application servers, bypassing the API Gateway, then the whitespace issue vanished.
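This kind of comparison can be scripted. The sketch below is a rough illustration, not the tooling we actually used: it posts each payload formatting to each route and collects the status codes. The function names are hypothetical, the URLs in the usage note are placeholders, and authentication headers are left out because they depend on your setup.

```python
import urllib.error
import urllib.request


def post_status(url: str, body: str) -> int:
    """POST a raw JSON string and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def status_matrix(routes, bodies, send=post_status):
    """Return {(route_name, body_name): status} for every combination."""
    return {
        (route_name, body_name): send(url, body)
        for route_name, url in routes.items()
        for body_name, body in bodies.items()
    }
```

Called with something like routes = {"gateway": "https://api.example.invalid/things", "direct": "https://app.example.invalid/things"} and the two payload strings as bodies, a 403 that appears only in the ("gateway", "spaced") cell points at something between the client and the application. The send parameter exists so the matrix logic can be exercised with a stub instead of real network calls.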

That localized the problem to something on the API Gateway side of the request path.

What does API Gateway really do?

I checked whether API Gateway does any kind of processing on the request payloads. I was curious if it attempted to validate JSON syntax and if it somehow considered the extra whitespace a syntax error.

No, and no.

And even if it did validate request payloads, one would not expect it to emit an auth error (HTTP 403 Forbidden, in particular).

So I started looking for the API Gateway logs to see what it thought was happening.

It’s the invisible layers of the stack

The API Gateway access logs promptly revealed the problem: the broken requests were all being blocked by AWS WAF. That’s the web application firewall, a service that inspects incoming requests and tries to block malicious traffic.

The API Gateway logs themselves didn’t reveal why these requests were blocked by the WAF, but this was already enough to work with. We chatted with the infrastructure person who maintains WAF configuration, and found that some new filtering rules had recently been deployed. These rules had erroneously blocked some legit traffic since deployment. The QA team initially discovered the problem in our staging environment, and escalated to engineering.

Around this point, I realized my high-level mental model of the system was incomplete.

An expanded model would look like this:

sequenceDiagram
    participant c as Client
    participant gw as AWS API Gateway
    participant waf as AWS WAF
    participant app as Application Server

    c->>gw: Incoming request
    gw->>gw: Checks authentication
    Note right of gw: Auth/parsing errors?
    gw->>waf: Checks requests against WAF rule set
    Note right of waf: What happens here?
    waf-->>gw: Responds blocked or allowed
    gw->>app: Sends allowed, authenticated requests
    Note right of app: Auth/parsing errors?
    app-->>gw: Returns response
    gw-->>c: Returns response

In other words, WAF constitutes another major failure point in the request flow, one that I hadn’t given much thought to. It’s always the invisible layers in the stack that get you. There’s just so much happening in complex technical systems, and it’s easy to neglect the layers that usually work silently, without problems.

But why would WAF block JSON because of extra whitespace?

I don’t have the whole story here as I don’t have access to WAF rule definitions. But I learned that the request was being blocked by a rule set that tried to block known Windows shell attacks. I presume something about "type" : triggered some regex.

Our back end servers don’t run Windows, but these rules were enabled anyway, causing the issue.
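To illustrate how this sort of false positive can happen, here is a purely hypothetical pattern. I made it up for this post; it is not taken from any AWS rule set, and I have no evidence the real rule looks like this. It naively searches for the Windows type command (which prints a file's contents) by matching the word "type", an optional closing quote, some whitespace, and then an argument.

```python
import re

# HYPOTHETICAL pattern, invented for illustration only. It is NOT an actual
# AWS WAF rule. It looks for: the word "type", an optional double quote,
# at least one whitespace character, then any non-whitespace character.
NAIVE_TYPE_COMMAND = re.compile(r'\btype"?\s+\S')


def looks_like_shell_attack(body: str) -> bool:
    """Return True if the naive pattern matches anywhere in the raw body."""
    return NAIVE_TYPE_COMMAND.search(body) is not None


spaced_hit = looks_like_shell_attack('"type" : "object"')
compact_hit = looks_like_shell_attack('"type": "object"')
attack_hit = looks_like_shell_attack(r"type C:\Windows\win.ini")

print(spaced_hit, compact_hit, attack_hit)
```

With this made-up pattern, the spaced JSON and the real command both match, while the compact JSON does not. A rule that scans raw text without understanding the format could plausibly behave the same way.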

As AWS docs explain, WAF will respond with HTTP 403 when a request is blocked. Unfortunately this status is also sometimes emitted by our downstream application servers, so it doesn’t help us localize the issue. But at least now I know to look for this case if we see 403s.

Fixing blind spots

We’ll probably deploy better reporting to try to prevent this scenario in the future. We could have caught it faster if we had alerts on the right things in our logs.
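As a rough idea of what such reporting could look like, the sketch below counts 403s in API Gateway access logs by the layer that likely produced them. It assumes the access log format is JSON and has been configured to include "status" (from $context.status) and "wafResponseCode" (from $context.wafResponseCode, which API Gateway documents as WAF_ALLOW or WAF_BLOCK). Those key names are my assumption about the log format, and the function names are hypothetical.

```python
import json
from collections import Counter


def classify_403(entry: dict) -> str:
    """Label a 403 log entry by the layer that most likely produced it."""
    # Assumed key: "wafResponseCode", populated from $context.wafResponseCode.
    if entry.get("wafResponseCode") == "WAF_BLOCK":
        return "waf"
    return "gateway_or_app"


def count_403_sources(log_lines):
    """Count 403 responses per source across JSON-formatted access log lines."""
    counts = Counter()
    for line in log_lines:
        entry = json.loads(line)
        # Status may be logged as a string or a number, so normalize it.
        if str(entry.get("status")) == "403":
            counts[classify_403(entry)] += 1
    return counts
```

An alert on a sudden rise in the "waf" count, especially right after a rule deployment, would have flagged this incident without anyone having to guess which layer was responsible.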

It’s still very unintuitive that anyone could ever get an auth error from adding extra whitespace in a JSON payload. But things make more sense when you remember that every request is screened against a mostly opaque set of security rules that probably don’t even consider the inbound request format.

The thing people like about AWS WAF is that it is supposed to mitigate security risks without much work on the application developer’s part. The problem is, this requires it to work invisibly, like a black box, without ever failing. When it doesn’t work properly, you still end up having to partly understand it anyway.

At the very least, now we remember it’s there, and we know that this failure mode is possible.