Azure API Timeout: How to Diagnose and Fix 408/504 Timeout Errors
Fix Azure API timeout errors (408, 504, OperationTimedOut) by adjusting timeout settings, enabling retries, and optimizing long-running calls. Step-by-step guid
- Azure API timeouts surface as HTTP 408, 504, or the exception message 'The operation timed out' / 'OperationTimedOut' and stem from four root causes: client-side timeout too short, Azure API Management (APIM) gateway timeout, backend service cold start, or a long-running operation exceeding the 230-second Azure Load Balancer hard limit.
- The public Azure Load Balancer applies a 4-minute (240 s) idle TCP timeout by default, and services that sit behind it, such as Azure App Service and HTTP-triggered Azure Functions, cap a request at roughly 230 s; any HTTP request that takes longer than 230 s end-to-end is silently dropped by the fabric before your backend responds.
- Quick fix summary: (1) set HttpClient.Timeout / Axios timeout to at least 100 s for synchronous calls; (2) raise the APIM policy timeout to match; (3) convert calls longer than 90 s to the async polling pattern (202 Accepted + Location header); (4) add an exponential-backoff retry policy with jitter for transient 429/503/504 responses.
| Method | When to Use | Implementation Time | Risk |
|---|---|---|---|
| Raise client HttpClient timeout | Client times out before server responds; 408 on client side | < 15 min | Low – isolated to your client code |
| Raise APIM forward-request timeout | APIM policy returns 504 before backend finishes | 15–30 min | Low – scoped to one API/operation policy |
| Switch to async polling (202 + Location) | Operations regularly exceed 90 s (reports, exports, ML inference) | 2–8 h | Medium – requires API contract change |
| Add Polly retry with exponential backoff | Transient 429 / 503 / 504 bursts | 30–60 min | Low – safe only for idempotent requests; retrying a non-idempotent POST can duplicate work |
| Enable APIM caching for repeated reads | Repeated identical GET calls timing out under load | 30–60 min | Low – stale-data risk on mutable resources |
| Scale out / warm up backend | Cold-start latency on Azure Functions consumption plan | 1–4 h | Low-Medium – cost increase, needs load testing |
| Move to Azure Durable Functions | Workflows that fan-out, aggregate, or run > 5 min | 1–3 days | Medium – architectural refactor |
Understanding Azure API Timeout Errors
When an Azure API call exceeds a time boundary, the failure can originate at several distinct layers, each producing a different error signature:
- Client SDK / HttpClient – throws TaskCanceledException (C#) or ECONNABORTED (Node.js) with message: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
- Azure API Management gateway – returns HTTP 504 Gateway Timeout with body { "statusCode": 504, "message": "Origin server did not respond in time." }
- Azure Load Balancer idle timeout – silently resets the TCP connection after 4 minutes of inactivity; the client sees a connection reset or SocketException.
- Azure Resource Manager (ARM) polling – returns HTTP 202 Accepted immediately but the polling loop eventually times out with CloudException: OperationTimedOut.
- Azure SQL / Cosmos DB – surfaces as SqlException: Timeout expired or RequestRateTooLargeException (429) which, if unretried, manifests as a logical timeout.
Understanding which layer fired is the mandatory first step before applying any fix.
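As a rough illustration of that triage, the signatures listed above can be turned into a small lookup. This is a minimal sketch, not a complete classifier: the function name `classify_timeout_layer` and the returned labels are hypothetical, and real error messages vary by SDK version.

```python
from typing import Optional


def classify_timeout_layer(status_code: Optional[int], error_text: str = "") -> str:
    """Map an observed failure to the layer that most likely produced it.

    The labels are illustrative names chosen for this sketch.
    """
    text = error_text.lower()
    if status_code == 504:
        return "gateway"              # APIM, Application Gateway, or Front Door gave up
    if status_code == 408:
        return "client-or-server"     # an endpoint explicitly signalled a timeout
    if status_code == 429 or "requestratetoolarge" in text:
        return "throttling"           # unretried throttling looks like a timeout
    if "operationtimedout" in text:
        return "arm-polling"
    if "timeout expired" in text:
        return "database"
    if "httpclient.timeout" in text or "econnaborted" in text:
        return "client"
    if status_code is None and ("reset" in text or "socketexception" in text):
        return "load-balancer-idle"   # no HTTP status at all: TCP-level drop
    return "unknown"
```

Feeding it the status code (or `None` when the connection dropped) together with the full exception text gives a first guess at which of the following steps to start with.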
Step 1: Identify the Timeout Layer
1a. Read the full exception chain. In .NET, always call exception.ToString() rather than .Message – the inner TaskCanceledException or SocketException reveals whether the cancellation token came from your code or the HTTP stack.
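The same habit applies outside .NET. As a sketch of the idea in Python, the helper below (a hypothetical name, `exception_chain`) walks an exception's `__cause__` / `__context__` links so the innermost error is not hidden behind the outer message.

```python
from typing import List


def exception_chain(exc: BaseException) -> List[str]:
    """Return 'Type: message' for the exception and every chained inner exception."""
    chain: List[str] = []
    seen = set()
    current = exc
    # Follow explicit causes first, then implicit context; guard against cycles.
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(f"{type(current).__name__}: {current}")
        current = current.__cause__ or current.__context__
    return chain
```

Logging the whole list rather than only the first entry shows whether a timeout was raised by your own code or bubbled up from a lower-level socket error.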
1b. Check the HTTP status code. 408 = client or server explicitly signaled timeout. 504 = intermediate proxy (APIM, Application Gateway, or Azure Front Door) gave up. A connection-reset with no status code = TCP-layer idle timeout from the Load Balancer.
1c. Pull APIM diagnostic logs. In the Azure portal go to API Management → APIs → [your API] → Test and inspect the trace, or enable Application Insights on APIM:
GET https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ApiManagement/service/{apim}/apis/{api}/diagnostics/applicationinsights?api-version=2022-08-01
Look for backend-duration in the trace. If it is close to your forward-request timeout value, the backend is the bottleneck, not your client.
1d. Check Azure Monitor / Application Insights. Run this KQL query in Log Analytics to find all requests that exceeded 30 seconds:
requests
| where timestamp > ago(1h)
| where duration > 30000
| project timestamp, name, resultCode, duration, cloud_RoleName
| order by duration desc
Step 2: Fix Client-Side Timeouts (C# / .NET)
The default HttpClient timeout is 100 seconds. For APIs that legitimately take longer, create a named client via IHttpClientFactory:
// Program.cs / Startup.cs
using System.Net;
using Polly;
using Polly.Extensions.Http;

builder.Services.AddHttpClient("AzureBackend", client =>
{
    client.BaseAddress = new Uri("https://myapi.azure-api.net");
    client.Timeout = TimeSpan.FromSeconds(180); // explicit, documented; covers all retry attempts together
})
.AddPolicyHandler(GetRetryPolicy());

// Retries transient failures with exponential backoff plus up to 500 ms of jitter.
static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy() =>
    HttpPolicyExtensions
        .HandleTransientHttpError()                         // 5xx, 408, and network errors
        .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            retryCount: 4,
            sleepDurationProvider: attempt =>
                TimeSpan.FromSeconds(Math.Pow(2, attempt))  // 2, 4, 8, 16 s
                + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500))); // shared RNG (.NET 6+)
Important: Pass a CancellationToken with the request itself when you need per-request control. Do not change HttpClient.Timeout at runtime: the property cannot be modified once the client has sent its first request, and the client instance is shared across threads.
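To reason about how the retry schedule interacts with the overall timeout, it can help to compute the delays and the worst-case total. The following is a minimal sketch in Python that mirrors the 2, 4, 8, 16 s schedule with up to 500 ms of jitter; the function names and the 30-second per-attempt timeout in the usage note are assumptions for illustration, not part of Polly or HttpClient.

```python
import random
from typing import List, Optional


def backoff_delays(retry_count: int = 4, base: float = 2.0, max_jitter_ms: int = 500,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delay in seconds before each retry: base**attempt plus random jitter."""
    rng = rng or random.Random()
    return [base ** attempt + rng.uniform(0, max_jitter_ms) / 1000.0
            for attempt in range(1, retry_count + 1)]


def worst_case_seconds(per_attempt_timeout: float, delays: List[float]) -> float:
    """Upper bound on total time: every attempt runs to its timeout, plus all waits."""
    attempts = len(delays) + 1  # the initial try plus one try per retry
    return per_attempt_timeout * attempts + sum(delays)
```

Ignoring jitter, `worst_case_seconds(30, [2, 4, 8, 16])` comes to 180 seconds. When the retry handler runs inside the HttpClient pipeline, the client's overall timeout has to cover that whole budget, not just a single attempt.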
Step 3: Fix APIM Gateway Timeouts
In Azure API Management, the default forward-request timeout is 300 seconds (since API version 2021+) but older services default to 60 seconds. Set it explicitly on the forward-request element, which belongs in the backend section of the policy:
<!-- APIM Policy (API or Operation scope) -->
<policies>
  <inbound>
    <base />
  </inbound>
  <backend>
    <forward-request timeout="180" follow-redirects="true" />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>
Note that the timeout attribute is in seconds. Treat 230 seconds as the practical ceiling: larger values may not be honored, because the underlying Azure network infrastructure can drop the idle connection before the policy timeout fires. If your operation needs more than 230 seconds, you must use the async pattern described in Step 4.
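Because the client, gateway, and backend timeouts have to line up, a quick consistency check can catch mismatches before deployment. This is a sketch under stated assumptions: the function name, the messages, and the rule that each outer layer should wait at least as long as the layer behind it are illustrative choices, and the 230-second constant is simply the limit quoted in this guide.

```python
from typing import List

PLATFORM_LIMIT_S = 230  # ceiling quoted in this guide for a single HTTP request


def check_timeout_chain(client_s: float, gateway_s: float, backend_s: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means the chain is consistent."""
    problems: List[str] = []
    if gateway_s > PLATFORM_LIMIT_S:
        problems.append(
            f"gateway timeout {gateway_s}s exceeds the {PLATFORM_LIMIT_S}s platform limit")
    if client_s < gateway_s:
        problems.append(
            f"client timeout {client_s}s is shorter than gateway timeout {gateway_s}s")
    if backend_s > gateway_s:
        problems.append(
            f"backend may run {backend_s}s but the gateway gives up after {gateway_s}s")
    return problems
```

For the values used in this guide, `check_timeout_chain(180, 180, 120)` returns an empty list, while a backend that may run longer than the gateway waits is flagged as a candidate for the async pattern.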
Step 4: Convert Long-Running Operations to Async Polling
The Azure-standard pattern for operations > 90 seconds is the REST Long-Running Operations (LRO) specification:
- Client POSTs the request.
- Backend immediately returns 202 Accepted with a Location or Operation-Location header pointing to a status endpoint.
- Client polls the status endpoint (with exponential back-off) until it receives 200/201 with the final result or a terminal error.
import time
import requests

def start_operation(endpoint, payload, token):
    """POST the request; return the status URL if the backend answered 202."""
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    r = requests.post(endpoint, json=payload, headers=headers, timeout=30)
    r.raise_for_status()
    if r.status_code == 202:
        # Either header may carry the status URL
        return r.headers.get("Operation-Location") or r.headers.get("Location")
    return None  # synchronous completion

def poll_until_done(operation_url, token, max_wait=600):
    """Poll the status endpoint until a terminal state or until max_wait seconds of waiting."""
    headers = {"Authorization": f"Bearer {token}"}
    elapsed = 0
    interval = 5
    while elapsed < max_wait:
        r = requests.get(operation_url, headers=headers, timeout=30)
        r.raise_for_status()
        body = r.json()
        status = body.get("status", "").lower()
        if status in ("succeeded", "failed", "canceled"):
            return body
        time.sleep(interval)
        elapsed += interval
        interval = min(interval * 1.5, 30)  # back-off up to 30 s
    raise TimeoutError(f"Operation did not complete within {max_wait}s")
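Status endpoints that follow this pattern often suggest a polling delay through the standard HTTP Retry-After header. The helper below is a minimal sketch of how a polling loop could honor it and otherwise fall back to the 1.5x back-off used above; the function name `next_poll_interval` is hypothetical, and only the delay-in-seconds form of the header is handled.

```python
from typing import Mapping


def next_poll_interval(headers: Mapping[str, str], current: float,
                       factor: float = 1.5, cap: float = 30.0) -> float:
    """Pick the next wait in seconds: the server's Retry-After if usable, else back off."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    return min(current * factor, cap)
```

Note that a plain dict lookup is case-sensitive, whereas the headers object returned by `requests` is case-insensitive, so passing `r.headers` directly avoids a missed match.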
Step 5: Fix Azure Function Cold-Start Timeouts
Azure Functions on the Consumption plan can take 5–15 seconds to cold-start. If your API call hits a cold instance, the cumulative latency often triggers client timeouts.
Options:
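- Set "functionTimeout": "00:10:00" in host.json (max 10 min on Consumption, unlimited on Premium/Dedicated). This extends how long a function may run but does not remove the cold start, and an HTTP-triggered function must still respond within 230 seconds.
- Enable Always On (App Service Plan) or pre-warmed instances (Premium plan) to eliminate cold starts.
- Use Azure Front Door health probes to keep instances warm.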
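A configured functionTimeout can be sanity-checked against the plan limit before deployment. The sketch below assumes the hh:mm:ss form shown above and the limits stated in this guide (10 minutes on Consumption, unlimited on Premium/Dedicated); the function names are hypothetical, and other accepted forms such as "-1" are not handled.

```python
from datetime import timedelta

CONSUMPTION_MAX = timedelta(minutes=10)  # limit stated in this guide


def parse_function_timeout(value: str) -> timedelta:
    """Parse an 'hh:mm:ss' functionTimeout string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def timeout_allowed(value: str, plan: str) -> bool:
    """True if the timeout fits the plan; only the Consumption plan is capped here."""
    timeout = parse_function_timeout(value)
    if plan.lower() == "consumption":
        return timeout <= CONSUMPTION_MAX
    return True  # treated as unlimited on Premium/Dedicated, per the guide
```

For example, `timeout_allowed("00:15:00", "Consumption")` is false, which signals that the workload either needs a different plan or should move to the async pattern.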
Step 6: Verify the Fix in Staging
After applying changes, validate with a load test using Azure Load Testing or k6 before promoting to production:
# k6 smoke test – set AZURE_TOKEN and API_BASE_URL in the environment first
k6 run --vus 10 --duration 60s - <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

const TOKEN = __ENV.AZURE_TOKEN;
const BASE = __ENV.API_BASE_URL;

export default function () {
  const res = http.post(`${BASE}/api/long-running`, JSON.stringify({ input: 'test' }), {
    headers: { 'Authorization': `Bearer ${TOKEN}`, 'Content-Type': 'application/json' },
    timeout: '190s',
  });
  check(res, {
    'status is 200 or 202': (r) => r.status === 200 || r.status === 202,
    'no timeout': (r) => r.status !== 408 && r.status !== 504,
  });
  sleep(1);
}
EOF
#!/usr/bin/env bash
# Azure API Timeout Diagnostic Script
# Prerequisites: az CLI logged in, jq, curl
# Usage: APIM_NAME=mygw RG=mygroup API_ID=myapi bash diagnose-api-timeout.sh
set -euo pipefail
APIM_NAME="${APIM_NAME:?Set APIM_NAME}"
RG="${RG:?Set RG}"
API_ID="${API_ID:?Set API_ID}"
SUB=$(az account show --query id -o tsv)
echo "=== 1. Check APIM SKU and forward-request timeout ==="
az apim show -n "$APIM_NAME" -g "$RG" \
--query '{sku:sku.name, capacity:sku.capacity, provisioningState:provisioningState}' -o table
echo ""
echo "=== 2. Fetch backend policy for API $API_ID ==="
az rest --method GET \
--url "https://management.azure.com/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.ApiManagement/service/$APIM_NAME/apis/$API_ID/policies/policy?api-version=2022-08-01" \
--query 'properties.value' -o tsv 2>/dev/null | grep -oP '(?<=forward-request timeout=")\d+' \
&& echo " seconds" || echo "forward-request timeout not explicitly set (check inherited policy)"
echo ""
echo "=== 3. Recent 504/408 errors from APIM in Azure Monitor (last 1h) ==="
az monitor log-analytics query \
--workspace "$(az monitor log-analytics workspace list -g "$RG" --query '[0].customerId' -o tsv)" \
--analytics-query "
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where ResponseCode in (408, 504)
| project TimeGenerated, OperationId, BackendId, BackendResponseCode, DurationMs
| order by DurationMs desc
| limit 20" \
--output table 2>/dev/null || echo "Log Analytics workspace not found or insufficient permissions"
echo ""
echo "=== 4. Check Function App timeout setting ==="
FUNC_APPS=$(az functionapp list -g "$RG" --query '[].name' -o tsv)
for FUNC in $FUNC_APPS; do
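TIMEOUT=$(az functionapp config appsettings list -n "$FUNC" -g "$RG" \
--query "[?name=='AzureFunctionsJobHost__functionTimeout'].value" -o tsv 2>/dev/null || echo "default")
# storageAccountRequired is a boolean, so take the account name from the
# AzureWebJobsStorage connection string (assumes key-based, not identity-based, storage access)
STORAGE_ACCOUNT=$(az functionapp config appsettings list -n "$FUNC" -g "$RG" \
--query "[?name=='AzureWebJobsStorage'].value" -o tsv 2>/dev/null | sed -n 's/.*AccountName=\([^;]*\).*/\1/p' || true)
HOST_JSON=$(az storage file download --account-name "$STORAGE_ACCOUNT" \
--share-name "$FUNC" --path host.json --dest /dev/stdout 2>/dev/null | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('functionTimeout','not set'))" 2>/dev/null || echo "could not read")
echo " Function App: $FUNC | AppSetting timeout: ${TIMEOUT:-default} | host.json: $HOST_JSON"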
done
echo ""
echo "=== 5. Measure raw backend latency bypassing APIM ==="
BACKEND_URL="${BACKEND_URL:-}"
if [[ -n "$BACKEND_URL" ]]; then
curl -o /dev/null -s -w \
"DNS: %{time_namelookup}s | Connect: %{time_connect}s | TTFB: %{time_starttransfer}s | Total: %{time_total}s\n" \
"$BACKEND_URL"
else
echo " Set BACKEND_URL env var to measure raw backend latency"
fi
echo ""
echo "=== Diagnostics complete ==="
Error Medic Editorial
Error Medic Editorial is a team of senior DevOps and SRE engineers with hands-on experience designing and operating production systems on Azure, AWS, and GCP. Our troubleshooting guides are built from real incident postmortems, not documentation summaries.
Sources
- https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-policies
- https://learn.microsoft.com/en-us/azure/azure-functions/functions-host-json#functiontimeout
- https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply
- https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/core/Azure.Core/samples/Configuration.md#retry-options
- https://stackoverflow.com/questions/66243958/azure-apim-504-gateway-timeout-increasing-timeout
- https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout