Error Recovery
This chapter describes how hardware connection errors are handled and gives the detailed design of the implementation.
Handling of Hardware Connection Errors
Error recovery has five states, defined in ErrorRecoveryState: Disconnected, OK, Issue, Reconnect and Error. It is configured using the time thresholds defined in ErrorRecoveryConfig.
Once an engine is connected to hardware I/O, the state is OK. This state is kept until a read or write error occurs. Such an error causes the state to transition to Issue.
In this state, a successful read or write transitions the state back to OK. If no successful read or write occurs within issue_timeout_seconds, the state is set to Reconnect.
In state Reconnect, reconnect attempts are started. If a reconnect succeeds, the state is set to OK. If no reconnect succeeds within error_timeout_seconds, the state is set to Error.
While in states Issue and Reconnect, any read and write errors are masked by returning last-known-good values for reads and caching values for writes. This means that the engine (and the user) will not notice the connection loss immediately.
The one exception to this is UOD commands. We have to assume that a UOD command cannot execute correctly without a hardware connection, so UOD commands are failed. (A possible improvement would be to require UOD commands to consider the connection themselves and fail with predefined exception types.)
In state Error, reconnect attempts continue but errors are no longer masked. The Connection Status tag is set to Disconnected. If a reconnect succeeds, the state is set to OK and Connection Status is set to Connected.
The consequence of no longer masking errors is that the Engine enters the paused-on-error state, where the user can decide whether to continue or not.
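The transitions described above can be sketched as a small state machine. This is a minimal illustration, not the project's implementation: the enum members and the two timeout names come from the text, while the RecoveryStateMachine class, its method names, the default timeout values, and the choice to measure error_timeout_seconds from the moment Reconnect is entered are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ErrorRecoveryState(Enum):
    Disconnected = auto()
    OK = auto()
    Issue = auto()
    Reconnect = auto()
    Error = auto()


@dataclass
class ErrorRecoveryConfig:
    issue_timeout_seconds: float = 5.0   # illustrative default, not from the source
    error_timeout_seconds: float = 30.0  # illustrative default, not from the source


class RecoveryStateMachine:
    """Hypothetical helper that tracks the state from I/O results and time."""

    def __init__(self, config):
        self.config = config
        self.state = ErrorRecoveryState.Disconnected
        self.entered_at = 0.0  # time at which the current state was entered

    def _set(self, state, now):
        self.state = state
        self.entered_at = now

    def on_connected(self, now):
        self._set(ErrorRecoveryState.OK, now)

    def on_io_error(self, now):
        if self.state is ErrorRecoveryState.OK:
            self._set(ErrorRecoveryState.Issue, now)

    def on_io_success(self, now):
        if self.state is ErrorRecoveryState.Issue:
            self._set(ErrorRecoveryState.OK, now)

    def on_reconnect_success(self, now):
        if self.state in (ErrorRecoveryState.Reconnect, ErrorRecoveryState.Error):
            self._set(ErrorRecoveryState.OK, now)

    def tick(self, now):
        """Apply the timeout-driven transitions."""
        elapsed = now - self.entered_at
        if (self.state is ErrorRecoveryState.Issue
                and elapsed >= self.config.issue_timeout_seconds):
            self._set(ErrorRecoveryState.Reconnect, now)
        elif (self.state is ErrorRecoveryState.Reconnect
                and elapsed >= self.config.error_timeout_seconds):
            # Assumption: the timeout counts from entering Reconnect.
            self._set(ErrorRecoveryState.Error, now)

    @property
    def masks_errors(self):
        # Errors are masked only in Issue and Reconnect.
        return self.state in (ErrorRecoveryState.Issue, ErrorRecoveryState.Reconnect)
```

Passing the current time in as an argument, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.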
Note
It is unlikely but possible that errors occur so soon after successful connection that no value is yet available as last-known-good. In this case, a read returns a None value and the error is logged.
Implementation
Error recovery is implemented using a decorator pattern around the HardwareLayerBase hardware abstraction class. The class ErrorRecoveryDecorator implements the HardwareLayerBase interface by wrapping the concrete HardwareLayerBase class, e.g. OPCUA_Hardware, and delegating calls to it. However, when the concrete class fails with a connection-related error, the decorator masks the read or write error as discussed above.
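The wrapping and masking idea can be illustrated with a reduced example. Everything here is a stand-in written for this sketch: the two-method HardwareLayerBase, the HardwareConnectionError exception, the MaskingDecorator class and its masking flag, and the FlakyHardware test double are assumptions and do not reproduce the real ErrorRecoveryDecorator API.

```python
class HardwareConnectionError(Exception):
    """Hypothetical connection-related error raised by a concrete hardware class."""


class HardwareLayerBase:
    """Stand-in for the real abstraction; only read and write are modelled."""

    def read(self, name):
        raise NotImplementedError

    def write(self, name, value):
        raise NotImplementedError


class MaskingDecorator(HardwareLayerBase):
    """Wraps a concrete hardware object and masks connection errors."""

    def __init__(self, decorated):
        self.decorated = decorated
        self.masking = True        # would follow the recovery state (Issue/Reconnect)
        self.last_known_good = {}  # most recent successfully read value per name
        self.cached_writes = {}    # values that could not be written

    def read(self, name):
        try:
            value = self.decorated.read(name)
        except HardwareConnectionError:
            if not self.masking:
                raise
            # Returns None if nothing was read successfully before the error.
            return self.last_known_good.get(name)
        self.last_known_good[name] = value
        return value

    def write(self, name, value):
        try:
            self.decorated.write(name, value)
        except HardwareConnectionError:
            if not self.masking:
                raise
            self.cached_writes[name] = value


class FlakyHardware(HardwareLayerBase):
    """Test double whose connection can be switched off."""

    def __init__(self):
        self.connected = True
        self.values = {"TT01": 21.5}

    def read(self, name):
        if not self.connected:
            raise HardwareConnectionError(name)
        return self.values[name]

    def write(self, name, value):
        if not self.connected:
            raise HardwareConnectionError(name)
        self.values[name] = value
```

Because the decorator exposes the same interface as the class it wraps, calling code does not need to know whether it talks to the hardware directly or through the masking layer.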
Note
The implementation uses the tick() method to detect the need for a reconnect and to perform it. If this takes too long and hurts engine timing, the hardware should instead implement its reconnect in a separate thread.
Handling of Engine/Aggregator Protocol Errors
This section describes the handling of errors in the Engine-Aggregator connection and gives the detailed design of the implementation.
The overall aim is to ensure that both Aggregator and Engine are resilient to temporary connection errors. If the network is down during a run and then comes back up later, both Aggregator and Engine should be able to recover from this such that:
When the connection is lost:
- Engine creates a ReconnectedMsg that contains a snapshot of its state at the time of the disconnect
- Engine begins buffering samples of the data that cannot be sent to Aggregator
- If a run is active, Engine keeps running the method
- Aggregator notices that the Engine is unavailable and reports this status to the frontend
- Frontend should display this state, similar to the “Interrupted by error” state
- Commands cannot be sent to Engine
When the connection is recovered:
- Engine sends the ReconnectedMsg created earlier
- Engine sends the buffered values
- Aggregator notices that Engine is available and reports this status to the frontend
- Aggregator processes the ReconnectedMsg to restore its engine_data state for the engine at the time of the disconnect
- If the run is still active, the run continues
- If the run has failed or completed, the persisted state is updated to reflect this, the same as if the engine had been connected when it happened
Aggregator restart must be supported such that:
- Any connected engines reconnect when Aggregator comes back up
- Aggregator detects whether engines are in an active run or have completed a run and stores the correct information
Engine
The engine's error handling detects connection errors and, while disconnected, samples and batches up data messages. When the connection is reestablished, the batched data is sent.
It is implemented in the EngineRunner class, which uses EngineDispatcher to implement an autonomous and self-healing connection (at least when the connection is recovered in reasonable time).
To make the dispatchers testable, base classes are introduced that contain non-network details. These are subclassed in production versions using REST/WebSockets and in test versions using direct connection.
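The split between a network-free base class and its production and test subclasses might look roughly like the following. The names DispatcherBase, DirectDispatcher, _transmit and network_up are invented for this sketch, and the choice to buffer inside the base class is an assumption; only ProtocolNetworkException is a name taken from this document, and it is redefined here as a stand-in.

```python
from abc import ABC, abstractmethod


class ProtocolNetworkException(Exception):
    """Stand-in for the exception raised on network errors."""


class DispatcherBase(ABC):
    """Holds the non-network details; subclasses provide the transport."""

    def __init__(self):
        self.buffer = []  # messages that could not be delivered

    @abstractmethod
    def _transmit(self, message):
        """Deliver one message or raise ProtocolNetworkException."""

    def send(self, message):
        """Return True if delivered, False if the message was buffered."""
        try:
            self._transmit(message)
            return True
        except ProtocolNetworkException:
            self.buffer.append(message)
            return False


class DirectDispatcher(DispatcherBase):
    """Test version: hands messages to an in-process receiver, no network."""

    def __init__(self, receiver):
        super().__init__()
        self.receiver = receiver
        self.network_up = True  # flip to False to simulate an outage

    def _transmit(self, message):
        if not self.network_up:
            raise ProtocolNetworkException("simulated outage")
        self.receiver(message)
```

A production subclass would implement _transmit with REST or WebSocket calls, while tests exercise the buffering logic through DirectDispatcher without any network.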
Aggregator
Little is needed to make the aggregator resilient to connection errors. When an engine is disconnected, the WebSocket on_disconnect callback fires and the data for that engine is removed. Additionally, the engine data is saved as a RecentEngine.
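A reduced sketch of that disconnect handling is shown below. The AggregatorState class, the engine_data and recent_engines dictionaries, and the fields of RecentEngine are assumptions chosen for illustration; the source only states that the live data is removed and saved as a RecentEngine.

```python
from dataclasses import dataclass, field


@dataclass
class RecentEngine:
    """Hypothetical record of an engine that is no longer connected."""
    engine_id: str
    data: dict = field(default_factory=dict)


class AggregatorState:
    """Hypothetical holder of per-engine data in the aggregator."""

    def __init__(self):
        self.engine_data = {}     # live engines, keyed by engine_id
        self.recent_engines = {}  # disconnected engines, keyed by engine_id

    def on_disconnect(self, engine_id):
        """Would be called from the WebSocket on_disconnect callback."""
        data = self.engine_data.pop(engine_id, None)  # remove the live data
        if data is not None:
            self.recent_engines[engine_id] = RecentEngine(engine_id, data)
```

Calling on_disconnect for an unknown engine is a no-op in this sketch, which keeps the callback safe if it fires more than once.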
Sequences
Connect Sequence
When engine starts, it starts the Connect sequence.
sequenceDiagram
    participant E as Engine
    participant A as Aggregator
    Note over E, A: Connect sequence
    E ->> A: register (post)
    activate E
    activate A
    A -->> E: engine_id
    deactivate A
    E ->> A: connect (websocket)
    activate A
    A ->> E: get_engine_id_async
    E -->> A: engine_id
    A ->> A: verify engine_id
    A -->> E: accept websocket connection
    deactivate A
    Note over E, A: Connection error
    E ->> E: raise ProtocolNetworkException
    deactivate E
Fig. 11 Engine connect sequence.
Steady-State Sequence
When the Connect sequence has executed successfully, the Steady-State sequence is activated.
sequenceDiagram
    participant E as Engine
    participant A as Aggregator
    Note over E, A: Steady-State sequence
    activate E
    loop every 0.5 second
        E ->> A: control messages (websocket)
    end
    loop every second
        E ->> A: data messages (websocket)
    end
    alt user command
        A ->> E: command (websocket)
        activate A
        E -->> A: response
        deactivate A
    end
    Note over E, A: Network error
    E ->> E: raise ProtocolNetworkException
    deactivate E
Fig. 12 Engine steady-state sequence.
Reconnect Sequence
When a ProtocolNetworkException is encountered in either the Connect sequence or the Steady-State sequence, the error recovery mechanism switches to the Reconnect sequence.
The Reconnect sequence is just the Connect sequence followed by:
- A single ReconnectEngMsg (which allows the aggregator to restore its state for the particular engine)
- A number of data and control messages that were buffered while the Engine was disconnected
If a network error occurs, the sequence is reset and started again.
When the Reconnect sequence (including Catch-up) is complete, the Steady-State sequence is activated.
sequenceDiagram
    participant E as Engine
    participant A as Aggregator
    Note over E, A: Reconnect sequence
    E ->> A: register (post)
    activate E
    activate A
    A -->> E: engine_id
    deactivate A
    E ->> A: connect (websocket)
    activate A
    A ->> E: get_engine_id_async
    E -->> A: engine_id
    A -->> E: accept websocket connection
    deactivate A
    Note over E, A: Catch-up
    E ->> A: reconnect message
    activate A
    A ->> A: reestablish session
    loop Send buffered messages
        E --> A: buffered message
    end
    deactivate A
    Note over E, A: Network error
    E ->> E: raise ProtocolNetworkException
    deactivate E
Fig. 13 Engine reconnect sequence.
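The ordering of the Reconnect sequence, including the reset on a network error, can be sketched as a retry loop. The function run_reconnect_sequence and its parameters are hypothetical, and two behaviours are assumptions not stated in the text: the reconnect message is resent on every attempt, and a buffered message is dropped from the buffer only after it has been sent successfully.

```python
from collections import deque


class ProtocolNetworkException(Exception):
    """Stand-in for the exception raised on network errors."""


def run_reconnect_sequence(connect, send, reconnect_msg, buffered, max_attempts=5):
    """Return True once Connect and Catch-up both complete, else False.

    connect:  callable performing the Connect sequence (register + websocket)
    send:     callable sending one message, may raise ProtocolNetworkException
    buffered: deque of messages collected while disconnected
    """
    for _ in range(max_attempts):
        try:
            connect()
            send(reconnect_msg)     # the single reconnect message goes first
            while buffered:
                send(buffered[0])   # catch-up: oldest buffered message first
                buffered.popleft()  # drop it only after a successful send
            return True
        except ProtocolNetworkException:
            continue                # reset: start again from the Connect sequence
    return False
```

When the function returns True, the caller would switch to the Steady-State sequence.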
Recovery States
These are the system states of connection recovery in the Aggregator-Engine protocol. They are implemented in the engine.engine_runner module.
stateDiagram-v2
    Started --> Connected
    Started --> Failed
    Connected --> Failed
    Failed --> Disconnected
    Disconnected --> Reconnecting
    Disconnected --> Failed
    Reconnecting --> CatchingUp
    Reconnecting --> Failed
    CatchingUp --> Reconnected
    CatchingUp --> Failed
    Reconnected --> Failed
Fig. 14 Engine recovery states.
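The state diagram can also be written down as a transition table, which makes it easy to check whether a given transition is allowed. The state names are taken from the diagram; the RecoveryState enum, the TRANSITIONS mapping and the can_transition helper are hypothetical names for this sketch and are not claimed to match the engine.engine_runner module.

```python
from enum import Enum, auto


class RecoveryState(Enum):
    Started = auto()
    Connected = auto()
    Failed = auto()
    Disconnected = auto()
    Reconnecting = auto()
    CatchingUp = auto()
    Reconnected = auto()


S = RecoveryState

# Allowed transitions, copied edge by edge from the state diagram.
TRANSITIONS = {
    S.Started: {S.Connected, S.Failed},
    S.Connected: {S.Failed},
    S.Failed: {S.Disconnected},
    S.Disconnected: {S.Reconnecting, S.Failed},
    S.Reconnecting: {S.CatchingUp, S.Failed},
    S.CatchingUp: {S.Reconnected, S.Failed},
    S.Reconnected: {S.Failed},
}


def can_transition(current, target):
    """Return True if the diagram contains an edge from current to target."""
    return target in TRANSITIONS.get(current, set())
```

Note that every state other than Failed has an edge to Failed, and Failed leads only to Disconnected, so each recovery attempt passes through Disconnected.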