Troubleshooting Azure API 504 Gateway Timeout: Diagnosing and Fixing Backend Latency
Resolve Azure API 504 Gateway Timeout errors by tuning forward-request policies, diagnosing SNAT port exhaustion, and optimizing backend performance.
- Azure API Management (APIM) enforces a strict 20-second default timeout for backend requests, resulting in a 504 Gateway Timeout if exceeded.
- Timeouts are frequently caused by SNAT port exhaustion when scaling outbound connections without a NAT Gateway.
- Quick Fix: Extend the timeout using the `<forward-request timeout="120" />` XML policy at the API or operation scope.
- Long-term Fix: Refactor long-running synchronous API calls to use an asynchronous polling pattern (HTTP 202 Accepted) or optimize backend database queries.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase APIM `<forward-request>` Timeout | Quick mitigation for backend operations taking 20-120 seconds. | 5 mins | Low |
| Scale Up Backend App Service/AKS | Backend compute CPU/Memory limits are causing queuing and slow responses. | 15 mins | Low |
| Implement VNet NAT Gateway | Diagnostics indicate SNAT port exhaustion dropping outbound connections. | 1 hour | Medium |
| Refactor to Async (HTTP 202 Polling) | Operations inherently take minutes (e.g., report generation, data processing). | Days/Weeks | High |
Understanding the Azure API Timeout Error
When working within the Azure ecosystem—specifically utilizing Azure API Management (APIM), Azure App Service, Azure Functions, or Application Gateway—encountering a 504 Gateway Timeout or 408 Request Timeout is a critical incident that DevOps and Site Reliability Engineering (SRE) teams must address immediately.
This error explicitly indicates that a server acting as a gateway or proxy (like APIM) did not receive a timely response from an upstream server (your backend service). In microservice architectures, an API request typically traverses multiple hops. If any segment of that journey exceeds its allotted time, the connection is forcibly severed, and a timeout exception is returned to the client.
The Anatomy of the 504 Error in Azure
By default, Azure API Management sets a hard forward-request timeout of 20 seconds. If your backend service (e.g., an Azure Function or an AKS pod) takes 21 seconds to complete its database query and return the payload, APIM terminates the connection at exactly 20 seconds. The client receives the following standard response:
```json
{
  "statusCode": 504,
  "message": "Gateway Timeout"
}
```
If you enable request tracing with the Ocp-Apim-Trace header and inspect the trace output, you will often find an exception logged right at the timeout boundary:

```json
"source": "forward-request",
"timestamp": "2023-10-24T10:00:20.000Z",
"elapsed": "20001",
"data": { "message": "The operation has timed out." }
```
Step 1: Diagnosing the Root Cause
Before modifying infrastructure or altering code, you must determine where the latency is introduced. Timeouts generally fall into three categories: Network Level (SNAT Exhaustion, DNS resolution), Compute Level (CPU throttling, memory starvation), or Application Level (inefficient queries, deadlocks).
1.1 Querying Log Analytics for APIM Failures
If your APIM instance is tied to an Azure Monitor Log Analytics workspace, you can execute a Kusto Query Language (KQL) script to identify exactly which APIs and operations are timing out.
```kusto
ApiManagementGatewayLogs
| where TimeGenerated > ago(1d)
| where ResponseCode == 504 or BackendResponseCode == 504
| summarize count() by ApiId, OperationId, BackendId, bin(TimeGenerated, 1h)
| render timechart
```
If BackendResponseCode is empty or 0, it means APIM gave up before the backend even established a TCP connection. This strongly suggests network latency, a cold start (in Azure Functions), or SNAT port exhaustion. If BackendResponseCode is 504, the backend App Service or Load Balancer explicitly returned the timeout.
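A follow-up query over the same table shows how close surviving requests already run to the 20-second limit. This sketch assumes the resource-specific `ApiManagementGatewayLogs` schema, where `BackendTime` records backend latency; verify the column names in your own workspace:

```kusto
ApiManagementGatewayLogs
| where TimeGenerated > ago(1d)
| summarize percentiles(BackendTime, 50, 95, 99) by ApiId, bin(TimeGenerated, 1h)
| render timechart
```

APIs whose p95 already hovers near 20,000 ms are the ones most likely to tip into 504s under load, even before any hard failure appears.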
1.2 Identifying SNAT Port Exhaustion
SNAT (Source Network Address Translation) port exhaustion is a notoriously difficult issue to debug. When an Azure App Service makes outbound calls (e.g., to a SQL database or another API), it uses a finite pool of SNAT ports. If your API experiences a sudden burst of traffic, it can exhaust these ports. When this happens, new outbound connections queue up and eventually time out, resulting in a 504 at the APIM level.
You can diagnose SNAT exhaustion in the Azure Portal:
- Navigate to your App Service.
- Go to Diagnose and solve problems.
- Search for SNAT Port Exhaustion.
- Review the graphs for Failed outbound connections.
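The arithmetic behind exhaustion is easy to model. A minimal sketch, assuming the 128-port preallocation cited in the App Service diagnostics (the request counts and pool size below are purely illustrative):

```python
# Toy model: each concurrent outbound connection holds one SNAT port until
# it is closed. App Service preallocates 128 SNAT ports per instance, so a
# client that opens a fresh connection per request exhausts the pool during
# a burst, while a pooled HTTP client (a reused HttpClient/requests.Session)
# caps concurrent connections and stays within the allocation.

SNAT_PORTS_PER_INSTANCE = 128

def concurrent_ports(in_flight_requests, connections_per_request, pool_size=None):
    """Concurrent SNAT ports consumed during a traffic burst."""
    demand = in_flight_requests * connections_per_request
    if pool_size is None:
        return demand              # no pooling: one port per connection
    return min(demand, pool_size)  # pooling caps concurrent connections

no_pool = concurrent_ports(200, 2)               # 400 ports demanded
pooled = concurrent_ports(200, 2, pool_size=20)  # capped at 20

assert no_pool > SNAT_PORTS_PER_INSTANCE  # exhaustion: new connections queue, then 504
assert pooled < SNAT_PORTS_PER_INSTANCE   # pooled client never approaches the limit
```

This is why connection reuse in application code is the first mitigation to check before reaching for infrastructure changes like a NAT Gateway.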
Step 2: Implementing the Fixes
Depending on your diagnostic results, you will need to apply one or more of the following solutions. We will start with the fastest mitigations and progress to architectural overhauls.
2.1 Quick Fix: Extending the APIM Forward-Request Timeout
If your backend simply requires more than 20 seconds to process complex requests (e.g., legacy data aggregation), the most immediate fix is to increase the APIM timeout. This is done by modifying the inbound XML policy for the specific API or Operation.
Navigate to APIM > APIs > Select your API > Design > Inbound Processing > Policy Code Editor. Insert the <forward-request> policy in the <backend> section.
```xml
<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <!-- Increase timeout to 120 seconds (2 minutes) -->
        <forward-request timeout="120" />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
```
Warning: While Azure APIM allows timeouts up to 240 seconds (and even up to 300 seconds in specific tiers), keeping HTTP connections open for minutes is an anti-pattern. It ties up threads on both the client and server, increasing vulnerability to connection drops.
2.2 Fixing Cold Starts in Azure Functions and App Services
If timeouts only occur on the first few requests after a period of inactivity, you are experiencing cold starts.
For Azure Functions (Consumption Plan): The infrastructure takes time to allocate resources. Consider migrating to a Premium Plan where you can utilize Always Ready instances.
For Azure App Service: Ensure the Always On setting is enabled.
- Go to App Service > Configuration > General settings.
- Toggle Always On to On.
- Save and restart the application. This prevents the worker process from idling out after 20 minutes of inactivity.
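The same setting can be scripted. A sketch using the Azure CLI's `az webapp config set` command (placeholders follow the conventions used elsewhere in this article; verify flag names against your CLI version):

```shell
# Enable Always On so the worker process is never unloaded for inactivity
az webapp config set \
  --resource-group "{rg-name}" \
  --name "{app-name}" \
  --always-on true
```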
2.3 Resolving SNAT Port Exhaustion with NAT Gateway
If your API is making numerous outbound calls and hitting the preallocated limit of 128 SNAT ports per instance, scaling up the App Service won't reliably fix the issue, because SNAT allocation is tied to instance count rather than instance size. The enterprise solution is to route outbound traffic through an Azure NAT Gateway.
- Deploy an Azure NAT Gateway in your VNet.
- Attach a Public IP Address to the NAT Gateway (providing 64,000 SNAT ports).
- Associate the NAT Gateway with the subnet delegated to your App Service (using VNet Integration).
- Route all outbound traffic (`vnetRouteAllEnabled = true`) through the NAT Gateway.
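The four steps above map onto a handful of Azure CLI calls. This is a sketch under the assumption of current CLI flag names (in particular, confirm `--vnet-route-all-enabled` exists in your CLI version; resource names are illustrative):

```shell
# 1-2. Create the NAT Gateway with a Standard public IP (~64,000 SNAT ports)
az network public-ip create --resource-group "{rg-name}" --name "nat-pip" --sku Standard
az network nat gateway create --resource-group "{rg-name}" --name "api-natgw" \
  --public-ip-addresses "nat-pip" --idle-timeout 10

# 3. Attach it to the subnet used for App Service VNet Integration
az network vnet subnet update --resource-group "{rg-name}" \
  --vnet-name "{vnet-name}" --name "{integration-subnet}" --nat-gateway "api-natgw"

# 4. Force all outbound traffic through the VNet (and thus the NAT Gateway)
az webapp config set --resource-group "{rg-name}" --name "{app-name}" \
  --vnet-route-all-enabled true
```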
2.4 Architectural Fix: The Asynchronous Polling Pattern (HTTP 202)
If an API operation naturally takes longer than 60-120 seconds (e.g., generating a massive PDF report, training an ML model), you must refactor the architecture. Synchronous HTTP is not designed for long-running compute.
Instead, implement the Asynchronous Request-Reply pattern:
- Initial Request: The client sends a
POST /api/reports. - Immediate Response: The API immediately queues a message (e.g., Azure Service Bus) and returns an
HTTP 202 Acceptedwith aLocationheader pointing to a status endpoint (e.g.,Location: /api/reports/status/123). - Background Processing: An Azure Function triggers off the Service Bus queue, generates the report, and saves it to Blob Storage, updating the status in a database.
- Polling: The client periodically GETs the status endpoint. Once complete, the status endpoint returns an
HTTP 303 See Otheror anHTTP 200 OKwith a download link.
This pattern entirely eliminates 504 Gateway Timeouts from the APIM layer because the initial synchronous request takes only milliseconds to queue the work.
Step 3: Verifying the Resolution
After applying your fixes, you must validate that the system is stable under load. Do not rely solely on manual testing.
Use Azure Load Testing or tools like k6 or JMeter to simulate concurrent API traffic. Monitor the GatewayTimeout metrics in Azure Monitor. If you implemented the APIM policy fix, you should see the BackendDuration metrics increase beyond 20 seconds without resulting in a 5xx series error.
Additionally, review the TCP Connections and SNAT Connection Count metrics on your App Service to ensure connection pooling is functioning correctly and ports are not leaking. By systematically addressing network configuration, gateway policies, and application architecture, you can permanently eradicate Azure API timeouts from your environment.
Frequently Asked Questions
```shell
# Azure CLI snippet to diagnose 504 errors and update the APIM policy

# 1. Query Azure Monitor for 504 Gateway Timeouts on APIM
az monitor metrics list \
  --resource "/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.ApiManagement/service/{apim-name}" \
  --metric "Requests" \
  --filter "BackendResponseCode eq '504'" \
  --aggregation Total \
  --interval PT1H

# 2. Check App Service SNAT port exhaustion (failed outbound connections)
az monitor metrics list \
  --resource "/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.Web/sites/{app-name}" \
  --metric "ConnectionsFailed" \
  --aggregation Total \
  --interval PT1H

# 3. Quick fix: apply a policy extending the timeout to 120s. The CLI has no
# dedicated APIM policy command, so PUT the policy resource via az rest.
# policy.json wraps the XML, e.g.:
#   {"properties": {"format": "xml",
#     "value": "<policies>...<forward-request timeout=\"120\" />...</policies>"}}
az rest --method put \
  --url "https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.ApiManagement/service/{apim-name}/apis/{api-id}/policies/policy?api-version=2022-08-01" \
  --body @policy.json
```

Error Medic Editorial
Written by senior Site Reliability Engineers and Azure Cloud Architects dedicated to solving complex cloud infrastructure, networking, and microservice deployment bottlenecks.
Sources
- https://learn.microsoft.com/en-us/azure/api-management/api-management-troubleshoot-timeouts
- https://learn.microsoft.com/en-us/azure/app-service/troubleshoot-intermittent-outbound-connection-errors
- https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply
- https://learn.microsoft.com/en-us/azure/virtual-network/nat-gateway/nat-gateway-resource