
AWS ECS Timeout: Task Failed ELB Health Checks & Container Startup Timeouts — Complete Fix Guide

Fix AWS ECS timeout errors including failed ELB health checks, container startup timeouts, and deployment stalls with step-by-step CLI commands and config fixes

Key Takeaways
  • ECS timeouts most often stem from misconfigured ALB target group health check paths, intervals, or grace periods — not the application itself
  • Container startup timeouts occur when the app takes longer to bind to its port than the health check unhealthy threshold allows; increase the health check grace period or tune startup logic
  • Security groups that block the ALB's traffic to the container port are the silent killer — always verify ingress rules on the task's security group allow the ALB security group on the container port
  • stopTimeout defaults to 30 seconds; if your app needs a graceful drain window longer than that ECS will SIGKILL before it finishes shutting down
  • Fargate tasks failing to pull images due to NAT/VPC endpoint misconfiguration produce timeout-like symptoms that are easy to confuse with health check failures
ECS Timeout Fix Approaches Compared
| Method | When to Use | Estimated Time | Risk |
| --- | --- | --- | --- |
| Increase ALB health check grace period | App starts slowly; tasks are healthy but deregistered before they finish booting | 5 min (console or CLI) | Low — no code change required |
| Tune health check path, interval & threshold | Health check path returns non-200 or check fires too aggressively | 10–15 min | Low — ALB change only |
| Fix security group ingress rules | ALB cannot reach container port; all tasks marked unhealthy immediately | 5–10 min | Low — additive rule change |
| Increase task stopTimeout | Graceful shutdown takes longer than 30 s; data loss on scale-in or deploy | 5 min (task definition update) | Low — requires new task def revision |
| Switch to awsvpc + VPC endpoints | Fargate pull timeouts in private subnets without NAT | 30–60 min | Medium — network change |
| Enable ECS Exec for live debugging | Need to inspect running container to find why app is not binding | 10 min setup | Low — read-only investigation |
| Optimize container startup (lazy init, readiness probe) | App genuinely takes 60+ s to become ready; cannot extend grace period further | Hours — code change | Medium — requires release cycle |

Understanding AWS ECS Timeout Errors

AWS ECS surfaces timeout failures in several distinct but related forms. Understanding which layer is timing out is the first step to fixing the problem fast. The three most common error surfaces are:

  1. ELB health check timeouts — the Application Load Balancer (ALB) or Network Load Balancer (NLB) marks targets unhealthy because it cannot get a successful HTTP response within the configured threshold.
  2. Container startup / deployment timeouts — the ECS service rolls back a deployment because new tasks never transition to a RUNNING+HEALTHY state within the healthCheckGracePeriodSeconds window.
  3. Container stop / drain timeouts — ECS sends SIGTERM during a scale-in or rolling deploy, then forcefully SIGKILLs the task after stopTimeout seconds if the container has not exited.

Exact error messages you will see in the ECS console or CloudWatch Logs:

(service my-service) (task abc123) failed to register targets in 
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc): 
The following targets are not in a valid state for attachment to a load balancer: i-0abc

(service my-service) tasks are failing ELB health checks in 
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc)

cancel reason: Task failed to start

DockerTimeoutError: Could not transition to started; timed out after waiting 3m0s

ResourceInitializationError: unable to pull secrets or registry auth: 
The task cannot pull image; max attempts exceeded. RequestCanceled: request context canceled

Step 1: Identify the Exact Timeout Layer

Before changing anything, determine which layer is generating the timeout. Run these commands in sequence:

# 1. List recent stopped tasks and their stopped reasons
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status STOPPED \
  --query 'taskArns' --output text

# 2. Describe stopped tasks for stoppedReason
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn-1> <task-arn-2> \
  --query 'tasks[*].{id:taskArn,status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# 3. Check service events for health check failure messages
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:10]'

# 4. Check ALB target health directly
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --query 'TargetHealthDescriptions[*].{target:Target,health:TargetHealth}'

If stoppedReason contains Task failed to start, the issue is pre-health-check — usually an image pull failure, missing secret, or a container that crashes on boot. If the service events mention failing ELB health checks, proceed to Step 2.
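The triage above can be sketched as a tiny helper that buckets a stoppedReason string by the layer that timed out. The function name and match patterns below are illustrative, not an exhaustive list of ECS messages:

```shell
# Hypothetical triage helper: classify an ECS stoppedReason string.
classify_stop_reason() {
  case "$1" in
    *"failing ELB health checks"*|*"health checks"*) echo "health-check" ;;
    *CannotPullContainerError*|*ResourceInitializationError*) echo "image-pull" ;;
    *"Task failed to start"*|*DockerTimeoutError*) echo "startup" ;;
    *"Essential container in task exited"*) echo "app-crash" ;;
    *) echo "unknown" ;;
  esac
}

classify_stop_reason "Task failed to start"   # prints "startup"
```

Pipe the `stoppedReason` values from step 2 through this to see at a glance whether you are fighting the load balancer, the image pull, or the app itself.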


Step 2: Fix ALB Health Check Configuration

The most common cause of ECS timeout complaints is an ALB target group misconfigured for the application's actual startup behavior.

Check current health check settings:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Protocol:HealthCheckProtocol,Port:HealthCheckPort,IntervalSeconds:HealthCheckIntervalSeconds,TimeoutSeconds:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount}'

Common misconfiguration patterns:

| Symptom | Likely Setting | Recommended Fix |
| --- | --- | --- |
| Tasks marked unhealthy immediately | Grace period = 0 | Set healthCheckGracePeriodSeconds to 60–120 |
| Path returns 404 | Wrong health check path | Update path to /health or /actuator/health |
| Timeout before app binds | HealthCheckTimeoutSeconds too low | Increase to 10–15 s |
| Two failed checks kill task | UnhealthyThresholdCount = 2 | Increase to 3–5 for slow starters |
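A quick back-of-the-envelope check helps when tuning these numbers: ECS ignores unhealthy status during the grace period, and after that the target must fail roughly interval × unhealthy-threshold seconds of consecutive checks before it is deregistered. Using the values recommended below:

```shell
# Rough startup budget for a slow-starting task (approximate upper bound).
interval=30            # HealthCheckIntervalSeconds
unhealthy_threshold=5  # UnhealthyThresholdCount
grace_period=120       # healthCheckGracePeriodSeconds on the ECS service

budget=$(( grace_period + interval * unhealthy_threshold ))
echo "Worst-case time before a failing task is stopped: ~${budget}s"  # ~270s
```

If your app reliably binds its port inside that window, tune the thresholds; if not, you need the startup optimizations in the comparison table above.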

Apply fixes via CLI:

# Update the health check path and thresholds on the target group
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5

# Increase the ECS service health check grace period
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 120
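Separately from the ALB probe, ECS can run a container-level health check defined in the task definition; its startPeriod field gives slow starters breathing room before failures count against the container. A sketch of the relevant container-definition fragment (the path, port, and timings are assumptions to adapt):

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}
```

Note this check runs inside the container via the agent, so it succeeds or fails independently of security groups and the ALB; comparing the two is a fast way to isolate a networking problem.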

Step 3: Verify Security Group Rules

If tasks are registered with the target group but immediately marked unhealthy, the ALB probe is almost certainly being blocked by security groups. With awsvpc networking (the only mode on Fargate), each task runs with its own ENI and its own security group, so rules attached to a container instance's security group do not apply to the task.

# Find the security group attached to your ECS tasks
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].attachments[0].details'

# Get the ENI ID from output, then find its security groups
aws ec2 describe-network-interfaces \
  --network-interface-ids eni-0abc123 \
  --query 'NetworkInterfaces[0].Groups'

# Verify the task security group allows inbound from ALB security group
aws ec2 describe-security-groups \
  --group-ids sg-task123 \
  --query 'SecurityGroups[0].IpPermissions'

The task's security group must allow inbound TCP on the container port from the ALB's security group (not a CIDR range — use the ALB SG ID as the source):

# Add the correct ingress rule (replace sg-alb456 and port 8080)
aws ec2 authorize-security-group-ingress \
  --group-id sg-task123 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb456

Step 4: Fix Container Stop Timeouts

When ECS needs to stop a task (rolling deploy, scale-in, Spot reclamation), it sends SIGTERM and waits stopTimeout seconds before sending SIGKILL. The default is 30 seconds, and on Fargate the maximum is 120. If your application needs longer for graceful shutdown (draining connections, flushing queues), override this in the task definition:

{
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "my-repo/my-app:latest",
      "stopTimeout": 120,
      "portMappings": [{"containerPort": 8080}]
    }
  ]
}

Register the new revision and force a service update:

aws ecs register-task-definition --cli-input-json file://task-def.json
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task:NEW_REVISION
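Raising stopTimeout only helps if the process actually reacts to SIGTERM. A minimal entrypoint sketch — the drain function is a hypothetical placeholder for your app's real shutdown hook, and the self-sent signal at the end simulates what ECS does on stop:

```shell
#!/usr/bin/env bash
# Sketch of a SIGTERM-aware entrypoint. ECS sends SIGTERM first, then
# SIGKILL after stopTimeout seconds, so cleanup must finish in that window.
drain() {
  echo "SIGTERM received, draining..."
  # stop accepting new work, flush queues, close pools, then exit cleanly
  exit 0
}
trap drain TERM

echo "app running (pid $$)"
# In a real entrypoint you would 'wait' on your server process here.
# For demonstration, send ourselves the SIGTERM that ECS would send:
kill -TERM $$
```

If your image runs the app via a shell script without a trap, PID 1 may swallow SIGTERM entirely; either trap it as above or run the app with exec so it receives signals directly.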

Step 5: Fix Fargate Image Pull Timeouts in Private Subnets

If tasks in private subnets fail with DockerTimeoutError or ResourceInitializationError, the task cannot reach ECR, Secrets Manager, or CloudWatch Logs endpoints. Solutions in order of preference:

  1. Add VPC Interface Endpoints for ecr.api, ecr.dkr, logs, and secretsmanager, plus an S3 Gateway Endpoint (ECR stores image layers in S3).
  2. Add a NAT Gateway to the private subnet's route table.
  3. Move tasks to public subnets (assign public IPs) — acceptable for non-production.

# Create ECR API endpoint (repeat for ecr.dkr, logs, secretsmanager)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-private1 subnet-private2 \
  --security-group-ids sg-vpc-endpoints \
  --private-dns-enabled
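ECR pulls fetch image layers from S3, so the interface endpoints alone are not sufficient; the private route tables also need an S3 Gateway Endpoint. A sketch with placeholder IDs:

```shell
# S3 Gateway endpoint for ECR layer downloads (no hourly charge;
# attaches to the private subnets' route table rather than an ENI)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-private1
```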

Step 6: Enable ECS Exec for Live Container Debugging

If the above steps don't resolve the issue, connect directly to a running container to inspect what port the app is actually binding to:

# Enable ECS Exec on the service
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --enable-execute-command

# Open a shell in the running container
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-arn> \
  --container my-app \
  --interactive \
  --command '/bin/sh'

# Inside the container — verify the port is actually bound
netstat -tlnp 2>/dev/null || ss -tlnp   # fall back to ss on slim images without net-tools
curl -v http://localhost:8080/health

This reveals whether the app is listening on the wrong interface (e.g., 127.0.0.1 instead of 0.0.0.0) or on the wrong port entirely — both of which cause ECS health checks to fail silently.

Bonus: All-in-One Diagnostic Script

The script below bundles the checks from Steps 1–3 into a single run:
#!/usr/bin/env bash
# ecs-timeout-diagnose.sh — run against a misbehaving ECS service
# Usage: CLUSTER=my-cluster SERVICE=my-service bash ecs-timeout-diagnose.sh

set -euo pipefail
CLUSTER=${CLUSTER:?Set CLUSTER env var}
SERVICE=${SERVICE:?Set SERVICE env var}

echo "=== [1] Service events (last 10) ==="
aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].events[:10].[createdAt,message]' \
  --output table

echo ""
echo "=== [2] Health check grace period ==="
aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].healthCheckGracePeriodSeconds'

echo ""
echo "=== [3] Stopped tasks — stoppedReason ==="
STOPPED_TASKS=$(aws ecs list-tasks \
  --cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status STOPPED --max-results 5 \
  --query 'taskArns' --output text)

if [ -n "$STOPPED_TASKS" ]; then
  # shellcheck disable=SC2086
  aws ecs describe-tasks \
    --cluster "$CLUSTER" --tasks $STOPPED_TASKS \
    --query 'tasks[*].{task:taskArn,reason:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}' \
    --output json
else
  echo "No recently stopped tasks found."
fi

echo ""
echo "=== [4] Target group health ==="
TG_ARN=$(aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].loadBalancers[0].targetGroupArn' --output text)

if [ "$TG_ARN" != "None" ] && [ -n "$TG_ARN" ]; then
  aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'TargetHealthDescriptions[*].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason,desc:TargetHealth.Description}' \
    --output table

  echo ""
  echo "=== [5] Target group health check config ==="
  aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --query 'TargetGroups[0].{Path:HealthCheckPath,IntervalSec:HealthCheckIntervalSeconds,TimeoutSec:HealthCheckTimeoutSeconds,HealthyN:HealthyThresholdCount,UnhealthyN:UnhealthyThresholdCount}' \
    --output table
else
  echo "No load balancer attached to this service."
fi

echo ""
echo "=== [6] Running task ENIs + security groups ==="
RUNNING_TASK=$(aws ecs list-tasks \
  --cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status RUNNING --max-results 1 \
  --query 'taskArns[0]' --output text)

if [ "$RUNNING_TASK" != "None" ] && [ -n "$RUNNING_TASK" ]; then
  ENI_ID=$(aws ecs describe-tasks \
    --cluster "$CLUSTER" --tasks "$RUNNING_TASK" \
    --query 'tasks[0].attachments[0].details[?name==`networkInterfaceId`].value' \
    --output text)
  if [ -n "$ENI_ID" ]; then
    echo "ENI: $ENI_ID"
    aws ec2 describe-network-interfaces \
      --network-interface-ids "$ENI_ID" \
      --query 'NetworkInterfaces[0].Groups[*].{id:GroupId,name:GroupName}' \
      --output table
  fi
else
  echo "No running tasks found."
fi

echo ""
echo "=== Done. Review output above for misconfigurations. ==="

Error Medic Editorial

The Error Medic Editorial team comprises senior DevOps engineers and SREs with collective experience spanning AWS, GCP, and on-premises infrastructure. They specialize in container orchestration, distributed systems reliability, and translating cryptic cloud error messages into actionable fixes.
