Troubleshooting "504 Gateway Timeout" and "TaskCanceledException" in Azure APIs
Fix Azure API timeout errors (504, 408) by extending APIM forward-request timeouts, implementing async polling patterns, and resolving SNAT exhaustion.
- Azure API Management (APIM) enforces a strict 20-second default timeout for backend requests.
- Azure App Service and its underlying Load Balancer have a hardcoded, unchangeable 230-second idle timeout.
- Client-side TaskCanceledExceptions in .NET typically occur because the default HttpClient timeout is 100 seconds.
- Quick fix: Increase the forward-request timeout in APIM or implement an Asynchronous Request-Reply (202 Accepted) pattern for long-running processes.
| Method | When to Use | Time to Implement | Risk Level |
|---|---|---|---|
| Increase APIM <forward-request> timeout | When backend responses take between 20s and 230s. | 5 minutes | Low |
| Async Request-Reply Pattern (202 Accepted) | For complex tasks inherently taking longer than 230 seconds. | Hours to Days | High (Requires architecture changes) |
| Use IHttpClientFactory (Fix SNAT Exhaustion) | When facing intermittent socket/connection errors under load. | 1-2 hours | Medium |
| Scale Up/Out App Service Plan | When CPU/Memory exhaustion is dragging down response times. | 10 minutes | Low (Increases monthly cost) |
Understanding the Error
When working with Azure APIs—whether through Azure API Management (APIM), Azure App Services, or Azure Functions—one of the most common and frustrating issues is encountering timeout errors. These typically manifest as 504 Gateway Timeout, 408 Request Timeout, or in .NET applications as a TaskCanceledException: A task was canceled.
The root cause fundamentally stems from a mismatch between how long a client (or intermediary proxy like APIM) is willing to wait and how long the backend service takes to generate a response. Azure enforces several hard and soft limits across its networking stack to prevent resource exhaustion and ensure system stability.
Key Timeout Limits in Azure
- Azure API Management (APIM): By default, APIM waits 20 seconds for a backend service to respond. If the backend doesn't reply within this window, APIM terminates the connection and returns a `500 Internal Server Error` or `504 Gateway Timeout` to the client, accompanied by an `OperationCanceledException` in the APIM logs.
- Azure App Service & Azure Load Balancer: The Azure Load Balancer sitting in front of App Services has a hardcoded, unchangeable idle timeout of 230 seconds (3 minutes and 50 seconds). If your application takes longer than this to return the first byte of data, the load balancer will drop the connection, and the client will receive a `500 Server Error` or a network connection reset.
- HttpClient Default Timeout: If you are calling an Azure API from a .NET client, the default `HttpClient.Timeout` is 100 seconds. Requests that run longer are canceled on the client side and surface as a `TaskCanceledException`.
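The mismatch described above can be reasoned about as a chain of deadlines: every hop enforces its own limit, and the strictest one wins. A small illustrative Python sketch (constants taken from the limits listed above; the function name is just for illustration):

```python
# Each hop in the chain enforces its own deadline; the strictest one wins.
# Values below come from the limits listed in this article.
APIM_DEFAULT = 20          # APIM forward-request default (seconds)
LB_IDLE_TIMEOUT = 230      # Azure Load Balancer idle timeout (fixed)
HTTPCLIENT_DEFAULT = 100   # .NET HttpClient.Timeout default

def effective_timeout(*hop_timeouts: float) -> float:
    """A request only survives as long as the strictest hop allows."""
    return min(hop_timeouts)

# With all defaults in play, APIM's 20 seconds is the binding limit,
# even though the client would happily wait 100 seconds.
print(effective_timeout(HTTPCLIENT_DEFAULT, APIM_DEFAULT, LB_IDLE_TIMEOUT))
```

This is why raising only the client timeout rarely helps: the 504 is produced by whichever intermediary gives up first.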
Step 1: Diagnose the Timeout
Before applying a fix, you must pinpoint where the timeout is occurring. Is it in APIM, the App Service, or the database?
1. Check Azure Application Insights
Application Insights is your best friend here. Run the following Kusto Query Language (KQL) query to find timeout exceptions in your application logs:
```kusto
exceptions
| where itemType == "exception"
| where type in ("System.Threading.Tasks.TaskCanceledException", "System.Net.Http.HttpRequestException")
| order by timestamp desc
```
2. Inspect APIM Logs
If using APIM, use the "Trace" feature in the Azure Portal or check the ApiManagementGatewayLogs to see the exact duration of the backend pipeline. Look for forward-request errors that indicate the APIM proxy gave up waiting for your App Service or Function.
3. Analyze Backend Performance
Check the requests table in App Insights to see if the server response time is gradually creeping up. A steady increase indicates resource exhaustion (CPU/Memory) or database locking issues rather than a simple configuration problem.
Step 2: Fix - Increasing APIM Timeout
If your backend legitimately takes 45 seconds to process a request and you are using APIM, you need to override the default 20-second limit.
Navigate to your API in the Azure Portal, go to the Design tab, and edit the Inbound Processing policy. Add or modify the <forward-request> policy inside the <backend> node:
```xml
<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <!-- Increase timeout to 120 seconds -->
        <forward-request timeout="120" />
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
```
Step 3: Fix - Handling the 230-Second App Service Limit
If your process takes longer than 230 seconds, you cannot simply increase a timeout setting. The Azure Load Balancer will drop the connection unconditionally. You must re-architect the endpoint using the Asynchronous Request-Reply Pattern.
1. Client Request: The client sends a `POST` request to start the job.
2. Immediate Response: The API immediately returns a `202 Accepted` status code. The response includes a `Location` header pointing to a status endpoint (e.g., `/api/jobs/{id}/status`).
3. Background Processing: The API hands the work off to a background worker (e.g., placing a message on an Azure Service Bus queue which triggers an Azure Function, or using a BackgroundService in ASP.NET Core).
4. Client Polling: The client periodically polls the status endpoint.
5. Completion: Once the background task finishes, the status endpoint returns `200 OK` with the result, or a `302 Found` redirecting to the final resource.
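The steps above can be sketched end to end. This is a minimal, framework-agnostic Python illustration with an in-memory job store (`start_job`, `get_status`, and the worker are hypothetical names, not an ASP.NET Core implementation; a real system would use durable storage and a queue):

```python
import threading
import time
import uuid

# In-memory job store. A real deployment would use durable, shared storage
# (e.g. a database or Azure Table Storage) visible to all instances.
jobs = {}

def start_job(payload: str):
    """Handle the initial POST: accept the work, return 202 + Location."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "Running", "result": None}
    # Hand off to a background worker (stand-in for a Service Bus consumer).
    threading.Thread(target=_worker, args=(job_id, payload), daemon=True).start()
    return 202, {"Location": f"/api/jobs/{job_id}/status"}

def _worker(job_id: str, payload: str):
    time.sleep(0.1)  # simulate a long-running task
    jobs[job_id] = {"status": "Succeeded", "result": payload.upper()}

def get_status(job_id: str):
    """Handle client polling: report progress, then the final result."""
    job = jobs[job_id]
    if job["status"] == "Running":
        return 200, {"status": "Running"}
    return 200, {"status": "Succeeded", "result": job["result"]}
```

Because every individual HTTP exchange now completes in milliseconds, no hop in the chain ever approaches its timeout, regardless of how long the job itself runs.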
Step 4: Fix - TCP Connection Exhaustion (SNAT Port Exhaustion)
Sometimes timeouts occur not because the backend is slow, but because the App Service has run out of outbound network connections (SNAT ports) when calling another API or a database.
If you see errors like `An attempt was made to access a socket in a way forbidden by its access permissions`, you are likely hitting SNAT exhaustion.
Solution:
- .NET Core/5+: Ensure you are using `IHttpClientFactory` rather than instantiating `new HttpClient()` in a `using` block for every request. Creating too many HttpClient instances drains available ports.
- Node.js: Configure keep-alive agents so connections are reused.
- Scale Out: If you legitimately need more outbound connections, consider integrating your App Service with an Azure NAT Gateway.
Step 5: Keep-Alive Pings (Workaround)
If rewriting to the Async Request-Reply pattern is not immediately feasible, a temporary workaround to bypass the 230-second idle timeout is to send data over the connection before it closes. You can periodically flush whitespace or a specific keep-alive packet to the response stream while the background work continues. However, this is considered an anti-pattern for REST APIs and should be avoided in favor of proper background processing queues.
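For completeness, the workaround looks roughly like this. The sketch below is an illustrative Python generator (in ASP.NET Core or Express you would flush whitespace to the actual response stream instead); `stream_with_keepalive` is a hypothetical name, and as noted above, the Async Request-Reply pattern in Step 3 is the proper fix:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stream_with_keepalive(work, interval: float = 0.05):
    """Yield whitespace while `work` runs, then yield the final payload.

    Each yielded chunk stands in for a flush to the HTTP response stream,
    keeping the connection non-idle so the load balancer does not drop it.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(work)
        while not future.done():
            yield " "              # keep-alive chunk sent to the client
            time.sleep(interval)   # real code would use ~30-60 seconds
        yield future.result()      # finally, the actual response body
```

Note that the client must tolerate leading whitespace before the real payload, which is one reason this is considered an anti-pattern for REST APIs.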
Error Medic Editorial
A collective of senior Cloud Architects and DevOps engineers dedicated to solving the most frustrating infrastructure and deployment issues.
Sources
- https://learn.microsoft.com/en-us/azure/api-management/api-management-advanced-policies#ForwardRequest
- https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-issues
- https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply
- https://learn.microsoft.com/en-us/azure/app-service/troubleshoot-intermittent-outbound-connection-errors