PingOne Platform APIs

Retries: Best practice for managing transient API errors

Transient API errors occur because of temporary service interruptions such as network issues, rate limiting, or server congestion. This topic outlines recommended best practices for implementing retry-handling logic to maintain resiliency in your applications and integrations with PingOne.

These HTTP status codes indicate temporary issues that can resolve over time, making them ideal candidates for automatic retries by a calling client.

  • 408 (Request Timeout): The server took too long to respond to the client request, possibly caused by a weak or slow connection to the client application.

  • 429 (Too Many Requests): Rate limiting controls denied the request, often because the caller is over quota for the request. Retry after the delay specified in the Retry-After header. Refer to PingOne Platform Limits in the PingOne admin documentation for more information.

  • 500 (Internal Server Error): The server encountered an unexpected error while processing the request.

  • 502 (Bad Gateway): Suggests a temporary network issue or a disruption in the service stack caused by a communication problem between servers.

  • 503 (Service Unavailable): The server cannot process the request, possibly because of a temporary service outage or an in-progress deployment.

  • 504 (Gateway Timeout): A downstream server (for example, DNS) didn’t respond in time to complete the request.

These HTTP status codes represent permanent errors or client-side issues that automatic retries cannot resolve. In such cases, retrying the request only leads to unnecessary resource consumption or further complications.

  • 400 (Bad Request): The client sent an invalid request. Fix the issue in the request before trying again.

  • 401 (Unauthorized): Authentication is required or failed. Fix the authentication issue before trying again (for example, get a fresh token).

  • 403 (Forbidden): The client lacks permission to access the resource. Retrying will not change the server’s response. Fix the authorization issue, such as getting a new token with additional scopes, before trying again.

  • 405 (Method Not Allowed): The HTTP method used is not supported. Retrying with the same method will not resolve the issue.

  • 409 (Conflict): Indicates a conflict in the request, such as a violation of a unique constraint or of referential integrity. Retrying without addressing the conflict will continue to fail.

  • 422 (Unprocessable Entity): The server understands the request but cannot process it due to semantic errors. Fix the issue in the request before trying again.
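
A minimal TypeScript sketch of this distinction, assuming you want to encode it directly in your client (the set contents mirror the lists above, and the helper name is illustrative rather than part of any PingOne SDK):

// TypeScript
// Status codes from the lists above: retry the first group, never the second.
const RETRYABLE_STATUS_CODES = new Set([408, 429, 500, 502, 503, 504]);
const NON_RETRYABLE_STATUS_CODES = new Set([400, 401, 403, 405, 409, 422]);

// Returns true only for errors that are worth retrying automatically.
function isRetryable(status: number): boolean {
    return RETRYABLE_STATUS_CODES.has(status);
}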

Accounting for latency

API clients can experience latency when creating resources across services in the PingOne platform multi-region architecture, where the newly created resource has not propagated through the system to allow for the successful completion of a follow-up request. For example, if you create a resource with one internal service, it is possible that other internal services might not be aware of that new resource in time for your code’s next step. An immediate call to the second resource can fail. Given this potential for latency, all applications should be written to retry the request.

The primary areas that experience latency are:

  • Applications and Secrets

    Latency occurs when you create an application and then try immediately to retrieve the system-generated secret.

  • Applications and Scopes

    Latency occurs when you create an application and then try immediately to retrieve its resource access grants.

  • SAML configuration and attributes

    Latency occurs when you create a SAML configuration and then try immediately to retrieve its attribute mappings.

  • Environments, role assignments, and applications

    Latency occurs when you create an environment and then try immediately to retrieve its role assignments or add an application to the new environment.

  • Populations and role assignments

    Latency occurs when you create a population and then try immediately to retrieve its role assignments.

If you use Terraform and you create a new environment, calls to create or read configuration resources in that new environment immediately after creation can generate 401 errors. This occurs because the environment is still propagating on the resource server. This scenario represents a specific case in which an application should be written to initiate a retry for a 401 error. For more information on best practices for implementing retries, refer to Retries: Best practice for managing transient API errors.

Eventual consistency and retryable 404 errors

Latency most often affects read operations immediately after a create request. For 404 Not Found responses where eventual consistency can delay resource visibility across services, it is recommended that you retry only if:

  • You recently performed a successful create operation on a parent resource (such as Environment or Application), and

  • The 404 response is received during a GET request for a subordinate resource shortly afterward.

For example, after creating an environment with POST /environments, an immediate follow-up GET /environments/{{envID}}/signOnPolicies might return a 404 briefly. Likewise, after creating an application with POST /environments/{{envID}}/applications, an immediate GET /environments/{{envID}}/applications/{{appID}}/secret might return a 404 briefly until the secret gets generated asynchronously.
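
The following is a minimal TypeScript sketch of this pattern, assuming a Node.js 18+ or browser environment with the built-in fetch API; the apiBase value and the function name are illustrative, and accessToken is a worker application token obtained elsewhere:

// TypeScript
// Retry a 404 on a subordinate resource only because its parent was just created.
const apiBase = "https://api.pingone.com/v1"; // adjust for your region

async function getAppSecretAfterCreate(
    accessToken: string,
    envId: string,
    appId: string,
    maxAttempts = 5,
): Promise<unknown> {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const res = await fetch(
            `${apiBase}/environments/${envId}/applications/${appId}/secret`,
            { headers: { Authorization: `Bearer ${accessToken}` } },
        );
        if (res.ok) {
            return res.json();
        }
        if (res.status === 404) {
            // The application was just created, so the secret may not have propagated yet.
            await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
            continue;
        }
        throw new Error(`Unexpected response: ${res.status}`);
    }
    throw new Error("Secret not available after retries.");
}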

Eventual consistency and delete operations

Similar to create cases, eventual consistency can affect reading resources after a delete operation. In these cases, subordinate resources might still be readable for a period of time until the system propagates the delete across services.

For example, after deleting an environment with DELETE /environments/{{envID}}, an immediate read operation on applications with GET /environments/{{envID}}/applications might return some results temporarily. In such cases, it is recommended that you do not retry the original delete and do not delete the currently readable resource(s). The system is eventually consistent and will catch up.

For more information about these status codes, refer to MDN Web Docs HTTP response status codes.

Retry-handling logic with exponential backoff and jitter

Do not simply create a loop that rapidly re-submits the same request over and over. You could unintentionally trigger rate limiting and receive 429 responses, or worse, have your IP address blocked. Instead, implement exponential backoff with jitter, which gradually increases the time between retry attempts. Start with a small delay and increase it exponentially until you reach a maximum delay. This gives the server or network time to recover, and your application gets the response it needs.

For example, you can increase the number of seconds between retries in powers of two. Adding jitter, which introduces randomness into those retry delays, helps avoid "retry storms" when multiple clients retry at the same time.

// Pseudo code
// Example of an exponential backoff calculator
function calculateExponentialBackoff(attempt, baseDelay) {
    jitter = random(0, 100); // Add random jitter
    return baseDelay * (2 ** attempt) + jitter;
}
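
A runnable TypeScript version of the same calculation might look like the following; the maximum delay cap is an assumed value that keeps the backoff from growing without bound:

// TypeScript
// Exponential backoff with jitter; all delays are in milliseconds.
function calculateExponentialBackoff(
    attempt: number,
    baseDelay = 1000,
    maxDelay = 30000, // assumed cap, not a PingOne-mandated value
): number {
    const exponential = Math.min(baseDelay * 2 ** attempt, maxDelay);
    const jitter = Math.random() * 100; // up to 100 ms of randomness
    return exponential + jitter;
}
// Attempts 0..4 produce roughly 1s, 2s, 4s, 8s, 16s plus jitter.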

Honor the Retry-After header

Status codes 429 and 503 often include a Retry-After header that specifies one of two things, depending on the status code. For a 429, it specifies how long the client should wait before retrying the request, designated in number of seconds. For these cases, your retry handler must wait for the amount of time provided by the service.

For a 503 status code, the Retry-After header specifies when the service is expected to be available, designated as a date. In both cases, the Retry-After header tells you when your application can reasonably make the request again. For the date format, calculate the wait time from the current time to determine when you can retry. If the retry delay is too long, return a friendly error message to the client instead.

// Pseudo code
// Example of using the Retry-After header
if response.status == 429 or response.status == 503:
    retryAfter = response.getHeader("Retry-After");
    if retryAfter exists:
        wait(retryAfter); // convert a date value to a delay before waiting
    else:
        delay = calculateExponentialBackoff(attempt, baseDelay);
        wait(delay);
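
Because the Retry-After value can be either a number of seconds or an HTTP date, your handler needs to normalize it into a delay before waiting. A minimal TypeScript sketch of that conversion (the function name is illustrative):

// TypeScript
// Converts a Retry-After header value into a wait time in milliseconds.
// The value may be delta-seconds (for example, "120") or an HTTP date
// (for example, "Wed, 21 Oct 2026 07:28:00 GMT").
function retryAfterToMillis(retryAfter: string): number | null {
    const seconds = Number(retryAfter);
    if (!Number.isNaN(seconds)) {
        return Math.max(0, seconds * 1000);
    }
    const until = Date.parse(retryAfter);
    if (!Number.isNaN(until)) {
        return Math.max(0, until - Date.now());
    }
    return null; // unparseable; fall back to exponential backoff
}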

Don’t make retry handling a catchall

It’s good practice to avoid a catchall approach to errors, particularly if you are implementing retry logic in your API calls. The goal is to build resiliency into your application and provide the best user experience. Do not handle all status codes in the same way: the retryable status codes outlined above mean different things and call for different handling.

The following is an example of an anti-pattern that can have unintended consequences.

// Pseudo code
// Example of bad practice: every status code is handled the same way
switch(statusCode) {
    case 408:
    case 429:
    case 500:
    case 502:
    case 503:
    case 504:
        // callout to service (same retry behavior for all of these codes)
    default:
        // catchall for other status codes
}

Refer to the "Sample retry handler" section below for an example that follows best practices.

Fail gracefully

If you reach the maximum number of retries without a successful request, it’s important to consider how you handle the error. You need to take into account the user experience as well as your business requirements.

Don’t return raw error responses to the user. Provide a useful, friendly error message with instructions on how to retry later or who to contact. If an unsuccessful request has business implications, consider which operations might need to be rolled back and what logging should occur before, or at the same time as, returning the error message to the client (particularly in authentication and authorization transactions). You might also need to revoke tokens or sessions, depending on your use case.
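
As a sketch of this idea, assuming your own logger and rollback hook (the names and message text here are placeholders, not a prescribed implementation):

// TypeScript
// After the retry budget is exhausted: log the details, undo side effects,
// and surface only a friendly message to the caller.
async function failGracefully(error: unknown, rollback?: () => Promise<void>): Promise<never> {
    console.error("API request failed after all retries", error); // replace with your logger
    if (rollback) {
        await rollback(); // for example, revoke tokens or sessions created earlier
    }
    throw new Error("We can't complete your request right now. Please try again later.");
}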

Security considerations

To avoid exposing your applications to vulnerabilities or attacks, it is important to design your retry handler properly. Consider the following factors.

  • Maximum retry counts

    If your retry logic supports an unlimited number of retries, you could expose your application to resource exhaustion attacks. Attackers can use your application as a proxy for denial of service (DoS) attacks against PingOne APIs or as a conduit for brute force credential-stuffing attacks.

    Consult with your app owners and security teams to determine a reasonable number of retries. Set that as the maximum number of retries in your retry handler in a way that cannot be overridden.

  • Validate reasonable Retry-After limits

    Similar to the best practice of validating input data at the client and server, you should also validate the Retry-After header in API responses when the status code is 429 or 503. Man-in-the-middle attacks could manipulate the response headers: an attacker could reset the value to 1, causing your application to retry too often and get rate limited, or worse, have your IP address blocked. Likewise, an attacker could set the value to an extremely long wait time, such as 172800 seconds (2 days), causing your customers to abandon your application before a transaction is completed.

    If no Retry-After header is present, make sure your exponential backoff uses reasonable delay values to avoid rate limiting, IP blocking, or retry storms. For one way to clamp the header value to sane bounds, see the sketch after this list.

  • Check for token expiration

    If your application is retrying requests, keep checking the HTTP status code in case it changes to a 401 (Unauthorized) or 403 (Forbidden). During your retry attempts, your token could expire, in which case you should re-authenticate before trying again. If you’re using OAuth/OIDC, and depending on your security requirements, you have options for silent authentication with the prompt="none" parameter, the login_hint_token request parameter, or refresh tokens to get a new access token before retrying. These options provide a better user experience than forcing users to log in again.

  • Alert on excessive failures

    Monitor, log, and alert on the number of retry attempts a user or customer generates while using your application (if your client application has that capability). Excessive retries could be a sign of a misuse or abuse case that you need to handle.
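
For the Retry-After validation described above, the following TypeScript sketch clamps a server-provided delay to defensive bounds; the minimum and maximum values are assumptions to adjust for your own requirements, not PingOne-defined limits:

// TypeScript
// Clamp a server-provided wait time (in milliseconds) so a manipulated
// Retry-After header cannot make the client retry too aggressively or
// wait unreasonably long.
const MIN_RETRY_DELAY_MS = 1_000;   // assumed lower bound (1 second)
const MAX_RETRY_DELAY_MS = 300_000; // assumed upper bound (5 minutes)

function clampRetryDelay(delayMs: number): number {
    return Math.min(Math.max(delayMs, MIN_RETRY_DELAY_MS), MAX_RETRY_DELAY_MS);
}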

Sample retry handler

The following pseudo code sample demonstrates these retry-logic best practices.

// Pseudo code
// Example of retry logic
function callApiWithRetry(apiEndpoint, maxRetries = 5, baseDelay = 1000) {
    let attempt = 0;

    while (attempt < maxRetries) {
        response = makeApiCall(apiEndpoint);

        // Success Case (any 2xx response)
        if response.status >= 200 and response.status < 300:
            return response.data;

        // Handle Transient Errors
        if response.status in [408, 500, 502, 504]:
            delay = calculateExponentialBackoff(attempt, baseDelay);
            wait(delay);

        // Handle Rate-Limiting
        else if response.status == 429 or response.status == 503:
            retryAfter = response.getHeader("Retry-After");
            if retryAfter exists:
                wait(retryAfter); // convert a date value to a delay before waiting
            else:
                delay = calculateExponentialBackoff(attempt, baseDelay);
                wait(delay);

        // Handle Authentication Failures
        else if response.status == 401 or response.status == 403:
            refreshToken(); // Secure token refresh or re-authentication
            // Fall through so the retry counter still increments

        // Handle other errors (No retries)
        else:
            handleError(response); // Sanitize and log the error
            throw Error("Request failed with status code: " + response.status);

        // Increment Retry Counter and Monitor
        attempt += 1;
        monitorRetries(apiEndpoint, attempt); // Security monitoring for abnormal retry behavior
    }

    // Fail Gracefully After Max Retries
    throw Error("Max retries reached. Request failed.");
}

function calculateExponentialBackoff(attempt, baseDelay) {
    jitter = random(0, 100); // Add random jitter
    return baseDelay * (2 ** attempt) + jitter;
}