///////////////////////////////////////////////////////////////////////////////
Copyright (c) 2020, 2024 Oracle and/or its affiliates.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
///////////////////////////////////////////////////////////////////////////////
= Fault Tolerance in Helidon
:toc:
:toc-placement: preamble
:description: Fault Tolerance in Helidon
:keywords: helidon, java, fault, tolerance, fault tolerance
:feature-name: Fault Tolerance
:rootdir: {docdir}/..
include::{rootdir}/includes/se.adoc[]
== Contents
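- <<Overview, Overview>>
- <<Maven Coordinates, Maven Coordinates>>
- <<API, API>>
- <<Examples, Examples>>
- <<Additional Information, Additional Information>>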
== Overview
Helidon Fault Tolerance support is inspired by
link:{microprofile-fault-tolerance-spec-url}[MicroProfile Fault Tolerance].
The API defines the notion of a _fault handler_ that can be combined with other handlers to
improve application robustness. Handlers are created to manage error conditions (faults)
that may occur in real-world application environments. Examples include service restarts,
network delays, temporary infrastructure instabilities, etc.
The interaction of multiple microservices brings new challenges from distributed systems
that require careful planning. Faults in distributed systems should be compartmentalized
to avoid unnecessary service interruptions. For example, if comparable information can
be obtained from multiple sources, a user request _should not_ be denied when a subset
of these sources is unreachable or offline. Similarly, if a non-essential source has been
flagged as unreachable, an application should avoid continuous access to that source
as that would result in much higher response times.
In order to tackle the most common types of application faults, the Helidon Fault
Tolerance API provides support for circuit breakers, retries, timeouts, bulkheads and fallbacks.
In addition, the API makes it very easy to create and monitor asynchronous tasks that
do not require explicit creation and management of threads or executors.
For more information, see
link:{faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance API Javadocs].
include::{rootdir}/includes/dependencies.adoc[]
[source,xml]
----
<dependency>
    <groupId>io.helidon.fault-tolerance</groupId>
    <artifactId>helidon-fault-tolerance</artifactId>
</dependency>
----
== API
The Fault Tolerance API is _blocking_ and based on the JDK's virtual thread model.
As a result, methods return _direct_ values instead of promises in the form of
`Single` or `Multi`.
In the sections that follow, we shall briefly explore each of the constructs provided
by this API.
=== Retries
Temporary networking problems can sometimes be mitigated by simply retrying
a certain task. A `Retry` handler is created using a `RetryPolicy` that
indicates the number of retries, delay between retries, etc.
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_1, indent=0]
----
The sample code above will retry calls to the supplier `this::retryOnFailure`
up to 3 times with a 100-millisecond delay between them.
NOTE: The return type of method `retryOnFailure` in the example above must
be some `T`, and the parameter to the retry handler's `invoke`
method must be a `Supplier<? extends T>`.
If the call to the supplier completes exceptionally, it will be treated as
a failure and retried until the maximum number of attempts is reached. Finer control
is possible by creating a retry policy and using methods such as
`applyOn(Class<? extends Throwable>... classes)` and
`skipOn(Class<? extends Throwable>... classes)` to select the exceptions
that trigger a retry and those that do not.
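The retry behavior described above can be sketched in plain Java. This is only an illustration of the semantics and not the Helidon implementation: the class `RetrySketch`, its `retry` method and all of its parameters are hypothetical names, and treating the limit as a total number of attempts is an assumption of the sketch.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of retry semantics in plain Java; not Helidon's Retry implementation.
final class RetrySketch {

    // Calls the supplier up to maxAttempts times in total (an assumption of this sketch),
    // sleeping delayMillis between attempts. Exceptions matching skipOn are rethrown at
    // once; when applyOn is non-empty, only exceptions matching applyOn cause a retry.
    static <T> T retry(Supplier<? extends T> supplier,
                       int maxAttempts,
                       long delayMillis,
                       List<Class<? extends Throwable>> applyOn,
                       List<Class<? extends Throwable>> skipOn) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return supplier.get();
            } catch (RuntimeException e) {
                boolean skipped = skipOn.stream().anyMatch(c -> c.isInstance(e));
                boolean applies = applyOn.isEmpty()
                        || applyOn.stream().anyMatch(c -> c.isInstance(e));
                if (skipped || !applies) {
                    throw e; // not retryable under the configured filters
                }
                last = e;
                if (attempt < maxAttempts) {
                    pause(delayMillis);
                }
            }
        }
        throw last; // every attempt failed
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fails twice, then succeeds on the third call.
        Supplier<String> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        };
        String result = retry(flaky, 3, 100, List.of(IllegalStateException.class), List.of());
        System.out.println(result + " after " + calls.get() + " calls");
    }
}
```

In this sketch an exception listed in `skipOn` always wins over `applyOn`, which is one reasonable reading of the two filters; the Javadocs are the authority on how Helidon resolves them.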
=== Timeouts
A request to a service that is inaccessible or simply unavailable should be bounded
to ensure a certain quality of service and response time. Timeouts can be configured
to avoid excessive waiting times. In addition, a fallback action can be defined
for when a timeout expires, as covered in the next section.
The following is an example of using `Timeout`:
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_2, indent=0]
----
NOTE: Using a handler's `create` method is an alternative to using a builder that is
more convenient when default settings are acceptable.
The example above monitors the call to method `mayTakeVeryLong` and reports a
`TimeoutException` if the execution takes more than 10 milliseconds to complete.
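To make the idea of bounding a call concrete, here is a minimal plain-Java sketch that runs a supplier on a worker thread and gives up after a deadline. It does not use the Helidon API: `TimeoutSketch`, `SketchTimeoutException` and the stand-in `mayTakeVeryLong(long)` are hypothetical names introduced for illustration only.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch of a timeout guard in plain Java; not Helidon's Timeout implementation.
final class TimeoutSketch {

    // Unchecked exception used by this sketch only (a made-up name).
    static final class SketchTimeoutException extends RuntimeException {
        SketchTimeoutException(String message) {
            super(message);
        }
    }

    // Runs the supplier on a worker thread and waits at most 'limit' for its value.
    static <T> T invoke(Supplier<? extends T> supplier, Duration limit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<T> task = supplier::get;
            Future<T> future = executor.submit(task);
            return future.get(limit.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new SketchTimeoutException("call exceeded " + limit.toMillis() + " ms");
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow(); // interrupts the worker if it is still running
        }
    }

    // Stand-in for a slow remote call.
    static String mayTakeVeryLong(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "done";
    }

    public static void main(String[] args) {
        System.out.println(invoke(() -> mayTakeVeryLong(0), Duration.ofMillis(500)));
        try {
            invoke(() -> mayTakeVeryLong(2_000), Duration.ofMillis(10));
        } catch (SketchTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

Creating an executor per call keeps the sketch short; a real handler would reuse threads and, in Helidon, reports its own `TimeoutException` type as described above.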
=== Fallbacks
A fallback to a _known_ result can sometimes be an alternative to
reporting an error. For example, if we are unable to access a service
we may fall back to the last result obtained from that service at an
earlier time.
A `Fallback` instance is created by providing a function that takes a `Throwable`
and produces some `T` to be used when the intended method failed to return a value:
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_3, indent=0]
----
This example calls the method `mayFail` and if it produces a `Throwable`, it maps
it to the last known value using the fallback handler.
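The mapping from a `Throwable` to a replacement value can be illustrated without the library. The following is a plain-Java sketch under stated assumptions: `FallbackSketch` and its `invoke` method are hypothetical, and the "last known value" is simulated with an `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of a fallback in plain Java; not Helidon's Fallback implementation.
final class FallbackSketch {

    // Returns the supplier's value, or the value produced by the fallback function
    // when the supplier throws.
    static <T> T invoke(Supplier<? extends T> supplier,
                        Function<? super Throwable, ? extends T> fallback) {
        try {
            return supplier.get();
        } catch (RuntimeException e) {
            return fallback.apply(e);
        }
    }

    public static void main(String[] args) {
        // Simulated cache holding the last value obtained from the service.
        AtomicReference<String> lastKnown = new AtomicReference<>("cached value");
        Supplier<String> mayFail = () -> {
            throw new IllegalStateException("service unavailable");
        };
        String result = invoke(mayFail, t -> lastKnown.get());
        System.out.println(result);
    }
}
```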
=== Circuit Breakers
Failing to execute a certain task or to call another service repeatedly can have a direct
impact on application performance. It is often preferable to avoid calls to non-essential
services by simply preventing that logic from executing altogether. A circuit breaker can be
configured to monitor such calls and block attempts that are likely to fail, thus improving
overall performance.
Circuit breakers start in a _closed_ state, letting calls proceed normally; after
detecting a certain number of errors during a pre-defined processing window, they can _open_ to
prevent additional failures. After a circuit has been opened, it can transition
first to a _half-open_ state before finally transitioning back to a closed state.
The use of an intermediate state (half-open)
makes transitions from open to closed more progressive, and prevents a circuit breaker
from eagerly transitioning to states without considering sufficient observations.
NOTE: Any failure while a circuit breaker is in half-open state will immediately
cause it to transition back to an open state.
Consider the following example in which `this::mayFail` is monitored by a
circuit breaker:
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_4, indent=0]
----
The circuit breaker in this example defines a processing window of size 10, an error
ratio of 30%, a duration to transition to half-open state of 200 milliseconds, and
a success threshold to transition from half-open to closed state of 2 observations.
It follows that,
* After completing the processing window, if at least 3 errors are detected, the
circuit breaker will transition to the open state, thus blocking the execution
of any subsequent calls.
* After 200 milliseconds, the circuit breaker will transition to half-open and
allow calls to proceed again.
* If the next two calls after transitioning to half-open are successful, the
circuit breaker will transition to closed state; otherwise, it will
transition back to open state, waiting for another 200 milliseconds
before attempting to transition to half-open again.
A circuit breaker will throw an
`io.helidon.faulttolerance.CircuitBreakerOpenException`
if an attempt to make an invocation takes place while it is in open state.
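The state transitions listed above can be traced with a small state machine. This is a simplified plain-Java sketch, not Helidon's `CircuitBreaker`: the class `CircuitBreakerSketch`, its constructor parameters and the injectable clock are hypothetical, and an `IllegalStateException` stands in for `CircuitBreakerOpenException`.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of the state machine described above; not Helidon's CircuitBreaker.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int volume;            // size of the processing window
    private final int errorPercent;      // error ratio, as a percentage
    private final long delayMillis;      // time spent open before moving to half-open
    private final int successThreshold;  // successes needed to close from half-open
    private final LongSupplier clock;    // current time in milliseconds (injectable)

    private State state = State.CLOSED;
    private int calls;
    private int errors;
    private int halfOpenSuccesses;
    private long openedAt;

    CircuitBreakerSketch(int volume, int errorPercent, long delayMillis,
                         int successThreshold, LongSupplier clock) {
        this.volume = volume;
        this.errorPercent = errorPercent;
        this.delayMillis = delayMillis;
        this.successThreshold = successThreshold;
        this.clock = clock;
    }

    // Reports the current state, moving from OPEN to HALF_OPEN once the delay has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= delayMillis) {
            state = State.HALF_OPEN;
            halfOpenSuccesses = 0;
        }
        return state;
    }

    <T> T invoke(Supplier<? extends T> supplier) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit breaker is open");
        }
        T result;
        try {
            result = supplier.get();
        } catch (RuntimeException e) {
            record(false);
            throw e;
        }
        record(true);
        return result;
    }

    private synchronized void record(boolean success) {
        if (state == State.OPEN) {
            return; // a concurrent call already opened the circuit
        }
        if (state == State.HALF_OPEN) {
            if (!success) {
                open(); // any failure while half-open reopens the circuit
            } else if (++halfOpenSuccesses >= successThreshold) {
                reset(State.CLOSED);
            }
            return;
        }
        calls++;
        if (!success) {
            errors++;
        }
        if (calls == volume) { // window complete: compare the observed error ratio
            if (errors * 100 >= errorPercent * volume) {
                open();
            } else {
                reset(State.CLOSED);
            }
        }
    }

    private void open() {
        reset(State.OPEN);
        openedAt = clock.getAsLong();
    }

    private void reset(State next) {
        state = next;
        calls = 0;
        errors = 0;
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock so the demo does not need to sleep
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(10, 30, 200, 2, () -> now[0]);
        for (int i = 0; i < 10; i++) {
            boolean fail = i < 3;
            try {
                breaker.invoke(() -> {
                    if (fail) {
                        throw new IllegalStateException("boom");
                    }
                    return "ok";
                });
            } catch (IllegalStateException expected) {
                // failures are counted by the breaker
            }
        }
        System.out.println(breaker.state()); // OPEN
        now[0] = 200;
        System.out.println(breaker.state()); // HALF_OPEN
        breaker.invoke(() -> "ok");
        breaker.invoke(() -> "ok");
        System.out.println(breaker.state()); // CLOSED
    }
}
```

The sketch evaluates the error ratio only when a full window of calls has completed and then starts a fresh window, which matches the bullet list above but is not necessarily how the library maintains its window internally.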
=== Bulkheads
Concurrent access to certain components may need to be limited to avoid
excessive use of resources. For example, if an invocation that opens
a network connection is allowed to execute concurrently without
any restriction, and if the service on the other end is slow to respond,
it is possible for the rate at which network connections are opened
to exceed the maximum number of connections allowed. Faults of this
type can be prevented by guarding these invocations using a bulkhead.
NOTE: The name _bulkhead_ comes from the partitions that
comprise a ship's hull. If some partition is somehow compromised
(e.g., filled with water), it can be isolated so as not to
affect the rest of the hull.
A waiting queue can be associated with a bulkhead to handle tasks
that are submitted when the bulkhead is already at full capacity.
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_5, indent=0]
----
This example creates a bulkhead that limits concurrent executions
of `this::usesResources` to at most 3, with a queue of size 5. The
bulkhead will report an `io.helidon.faulttolerance.BulkheadException` if unable
to proceed with the call, either due to the limit being reached or the queue
being at maximum capacity.
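One way to picture a bulkhead with a waiting queue is as two counting semaphores: one for running calls and one for all admitted calls (running plus queued). The plain-Java sketch below follows that idea. It is not Helidon's `Bulkhead`: `BulkheadSketch` is a hypothetical class, an `IllegalStateException` stands in for `BulkheadException`, and the demo uses a limit of 1 with no queue so that the rejection is deterministic on a single thread.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch of a bulkhead in plain Java; not Helidon's Bulkhead implementation.
final class BulkheadSketch {
    private final Semaphore running;   // permits for calls executing concurrently
    private final Semaphore admitted;  // permits for running plus queued calls

    BulkheadSketch(int limit, int queueLength) {
        this.running = new Semaphore(limit, true);
        this.admitted = new Semaphore(limit + queueLength);
    }

    <T> T invoke(Supplier<? extends T> supplier) {
        if (!admitted.tryAcquire()) {
            // Both the execution slots and the queue are full: reject right away.
            throw new IllegalStateException("bulkhead is full");
        }
        try {
            running.acquireUninterruptibly(); // "queued" callers block here for a free slot
            try {
                return supplier.get();
            } finally {
                running.release();
            }
        } finally {
            admitted.release();
        }
    }

    public static void main(String[] args) {
        BulkheadSketch bulkhead = new BulkheadSketch(1, 0);
        Supplier<String> inner = () -> "inner ran";
        Supplier<String> outer = () -> {
            try {
                return bulkhead.invoke(inner); // the only slot is held by the outer call
            } catch (IllegalStateException e) {
                return "inner rejected";
            }
        };
        System.out.println(bulkhead.invoke(outer));
    }
}
```

With the settings from the example above, the equivalent construction would be `new BulkheadSketch(3, 5)`: up to three callers run, up to five more wait, and any further caller is rejected.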
=== Asynchronous
Asynchronous tasks can be created or forked by using an `Async` instance. A supplier of type
`T` is provided as the argument when invoking this handler. For example:
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_6, indent=0]
----
The supplier `() -> Thread.currentThread()` is executed in a new thread, and
the value it produces is printed by the consumer passed to `thenAccept`.
By default, asynchronous tasks are executed using a _new virtual thread per
task_ based on the `ExecutorService` defined in
`io.helidon.faulttolerance.FaultTolerance` and
configurable by an application. Alternatively, an `ExecutorService` can be specified
when building a non-standard `Async` instance.
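The general pattern of forking a supplier onto an executor and consuming its result later can be shown with the JDK alone. This sketch does not use Helidon's `Async`: `AsyncSketch` is a hypothetical helper built on `CompletableFuture.supplyAsync`, and it uses a cached thread pool so that it runs on older JDKs; on Java 21 or later, `Executors.newVirtualThreadPerTaskExecutor()` could be substituted to get one virtual thread per task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch of forking a task in plain Java; not Helidon's Async implementation.
final class AsyncSketch {

    // Runs the supplier on the given executor and returns a future for its value.
    static <T> CompletableFuture<T> invoke(Supplier<T> supplier, ExecutorService executor) {
        return CompletableFuture.supplyAsync(supplier, executor);
    }

    public static void main(String[] args) {
        // Assumption: a platform-thread pool; swap in a virtual-thread executor on Java 21+.
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            invoke(Thread::currentThread, executor)
                    .thenAccept(t -> System.out.println("Async task executed in thread " + t.getName()))
                    .join(); // wait so the demo does not exit before the task completes
        } finally {
            executor.shutdown();
        }
    }
}
```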
=== Handler Composition
Method invocations can be guarded by any combination of the handlers
presented above. For example, an invocation that
times out can be retried a few times before resorting to a fallback value,
assuming it never succeeds.
The easiest way to achieve handler composition is by using a builder in the
`FaultTolerance` class as shown in the following example:
[source,java]
----
include::{sourcedir}/se/FaultToleranceSnippets.java[tag=snippet_7, indent=0]
----
The exact order in which handlers are added to a builder depends on the use case,
but generally the order, from innermost to outermost, should be: bulkhead,
timeout, circuit breaker, retry and fallback. That is, fallback is the first
handler in the chain (the last to be executed once a value is returned)
and bulkhead is the last one (the first to be executed once a value is returned).
NOTE: This is the ordering used by the MicroProfile Fault Tolerance implementation
in Helidon when a method is decorated with multiple annotations.
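The effect of nesting handlers can be illustrated with simple decorators over a `Supplier`. The sketch below is plain Java and does not use the `FaultTolerance` builder: `CompositionSketch`, `compose`, `fallback` and `retry` are hypothetical names, and the convention that the first handler in the list becomes the outermost one is an assumption of the sketch.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical sketch of handler nesting in plain Java; not Helidon's FaultTolerance builder.
final class CompositionSketch {

    // Wraps the call in each handler; the first handler in the list ends up outermost
    // (an assumption of this sketch).
    static <T> Supplier<T> compose(Supplier<T> call, List<UnaryOperator<Supplier<T>>> handlers) {
        Supplier<T> wrapped = call;
        for (int i = handlers.size() - 1; i >= 0; i--) {
            wrapped = handlers.get(i).apply(wrapped);
        }
        return wrapped;
    }

    // Handler that replaces any failure of the inner call with a fixed value.
    static <T> UnaryOperator<Supplier<T>> fallback(T value) {
        return next -> () -> {
            try {
                return next.get();
            } catch (RuntimeException e) {
                return value;
            }
        };
    }

    // Handler that calls the inner supplier up to maxAttempts times in total.
    static <T> UnaryOperator<Supplier<T>> retry(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        return next -> () -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return next.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last; // every attempt failed; an outer handler may still recover
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> alwaysFails = () -> {
            calls.incrementAndGet();
            throw new IllegalStateException("service unavailable");
        };
        // Fallback is outermost, retry is inside it, and the call is innermost.
        List<UnaryOperator<Supplier<String>>> handlers = List.of(
                CompositionSketch.<String>fallback("last known value"),
                CompositionSketch.<String>retry(3));
        String result = compose(alwaysFails, handlers).get();
        System.out.println(result + " after " + calls.get() + " calls");
    }
}
```

Because the fallback wraps the retry, the fallback value is used only after all three attempts have failed; reversing the two handlers would make the fallback answer on the first failure and leave the retry with nothing to do.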
== Examples
See the <<API, API>> section for examples.
== Additional Information
For additional information, see the
link:{faulttolerance-javadoc-base-url}/module-summary.html[Fault Tolerance API Javadocs].