The Short Answer

Mismatched server clocks give inaccurate results. For example, jobs executed with different Agents will not display accurate start times and end times in the JS7 - History and in log files.

If a JS7 - Controller Cluster or JS7 - Agent Cluster is used, then the requirements for accuracy of server clocks are stricter: the cluster can fail and journals can be corrupted if the difference in server clock time exceeds reasonable limits.

This applies to a Controller Cluster, not to Standalone Controller instances.
This applies to Director Agents in an Agent Cluster. It does not apply to Subagents, which are not affected by inaccurate server clocks.

To check if server times are synchronized, see JS7 - How to check Synchronization of Clocks between Servers.

The Long Answer

Basics about Clocks

Which clocks does a computer use?

There are two of them:
- The hardware clock is a clock generator that is initialized when the computer is started and that counts clock pulses, e.g. in nanoseconds. This clock is used if, for example, the number of milliseconds since 1970 is requested. The clock provides the clock pulses that have passed (in seconds or nanoseconds), in other words: the elapsed time, not the time of day.
- The real-time clock is the system time of the server calculated from the hardware clock, amongst other things. It is kept running by a battery, for example, while the computer is switched off.
Independently of the two clocks there is the wall-clock time: the external time that is obtained, for example, from a time server. This is the time we go by when we run to the train station.

What do the clocks have to do with each other?

Whether or not the hardware clock is accurate, its beat cannot be changed. The real-time clock is set to wall-clock time by synchronizing with a time server. This is usually done once or several times a day.
A server stores clock beats, i.e. the elapsed milliseconds of the hardware clock, the last wall-clock time obtained and the difference between the two: hardware clock + difference to wall-clock time = real-time clock.
If the real-time clock is set to the wall-clock time by time synchronization, then the stored difference to the wall-clock time is updated in the server.
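
The relationship between the clocks can be sketched in a few lines of Python. This is a minimal illustration, not how an operating system implements it: the class `ServerClock` and its methods are hypothetical names, and `time.monotonic()` merely stands in for the hardware clock.

```python
import time


class ServerClock:
    """Hypothetical model: real-time clock = hardware clock + stored difference."""

    def __init__(self, hardware_clock=time.monotonic):
        # The hardware clock only counts elapsed seconds, it knows no date.
        self._hardware_clock = hardware_clock
        self._offset = 0.0

    def synchronize(self, wall_clock_time):
        """Update the stored difference to wall-clock time and return the clock leap."""
        new_offset = wall_clock_time - self._hardware_clock()
        leap = new_offset - self._offset
        self._offset = new_offset
        return leap

    def real_time(self):
        """Real-time clock derived from the hardware clock and the stored difference."""
        return self._hardware_clock() + self._offset
```

The value returned by `synchronize()` corresponds to the correction applied at synchronization: the longer the hardware clock has been drifting, the larger that correction becomes.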

How accurate are these clocks?

The accuracy of a hardware clock depends on its design, but it is generally quite accurate.
- Traditionally, hardware interrupts are supplied to the OS. This frequently provides an accuracy of 1/60 to 1/100 of a second.
But wait, what about virtualization? The hardware clock of a VM is emulated by software, i.e. it is a software clock. That means it is not exact but depends on many things, not least on the load placed on the hypervisor. If the hypervisor is overloaded because too many VMs are running on it, then the emulated hardware clocks of the VMs become less accurate, that is: slower. A hardware clock will not run faster than wall-clock time, but it can run slower if hardware interrupts are missed due to server load.
- Find more details at https://www.vmware.com/docs/vmware_timekeeping.
- Users who perform further research will find numerous complaints about inaccurate hardware clocks with VMs. In fact, the hardware clock is frequently installed as a driver in the server, which suggests that this can go wrong and that a failure might not be noticed.

JS7 Use of Clocks

Which clocks does JS7 use?

The hardware clock
- is used to calculate timeouts, in particular between clustered Controller instances and JOC Cockpit acting as Cluster Watch, and similarly between Director Agent instances and the Active Controller. Inaccurately calculated timeouts cause severe problems in a cluster.
- is used to calculate Event IDs. This is not a severe problem: Event IDs are made up of milliseconds of the hardware clock and are in addition incremented monotonically. Their uniqueness is guaranteed, even if the hardware clock falls behind significantly.
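
Why Event IDs remain unique when the clock falls behind can be illustrated as follows. This is a sketch only and not the JS7 implementation: the class `EventIdGenerator`, the ID layout (milliseconds multiplied by 1000) and the default clock function are assumptions made for the example.

```python
import time


class EventIdGenerator:
    """Hypothetical sketch: IDs derived from clock milliseconds that never repeat."""

    def __init__(self, clock_ms=lambda: time.time_ns() // 1_000_000):
        self._clock_ms = clock_ms
        self._last_id = 0

    def next_id(self):
        # Assumed layout: milliseconds * 1000 leaves room for a counter per millisecond.
        candidate = self._clock_ms() * 1000
        # If the clock stalls or falls behind, continue counting up from the last ID.
        self._last_id = max(candidate, self._last_id + 1)
        return self._last_id
```

With clock readings of 5, 5, 3 and 6 milliseconds the generator returns 5000, 5001, 5002 and 6000: the IDs keep ascending although the third reading lies in the past.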

The real-time clock is used for start times, for example order starts, retry intervals etc.
- There can be a severe impact if jobs do not start accurately at the desired point in time. In addition, if the real-time clocks of Controller and Agent differ, then the history in JOC Cockpit, for example, can display an order with an end time that is earlier than its start time. This is because the order's start time is generated by the Agent (which acts autonomously for order starts) and the end time is generated by the Controller. The result in JOC Cockpit will look like time travel and will be confusing, but it does not harm the operation of JS7 products.
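
The "time travel" effect is simple arithmetic. The following sketch uses a hypothetical function `displayed_duration` and made-up skew values; it only illustrates how a start time stamped by one clock and an end time stamped by another clock can produce a negative duration.

```python
from datetime import datetime, timedelta


def displayed_duration(true_start, true_runtime, agent_skew, controller_skew):
    """Return the duration a history view would show for the given clock skews."""
    start_shown = true_start + agent_skew                    # start time stamped by the Agent
    end_shown = true_start + true_runtime + controller_skew  # end time stamped by the Controller
    return end_shown - start_shown


# The job really runs for 2s, the Controller's clock is 5s behind the Agent's clock.
duration = displayed_duration(
    true_start=datetime(2024, 1, 1, 12, 0, 0),
    true_runtime=timedelta(seconds=2),
    agent_skew=timedelta(0),
    controller_skew=timedelta(seconds=-5),
)
print(duration.total_seconds())  # -3.0
```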

The Problem

What's the problem?

Timeouts in cluster operation are calculated using hardware clocks. If these differ between the servers of JOC Cockpit (Cluster Watch), Active Controller and Standby Controller, then a difference of up to 3s is acceptable. Within this tolerance, the cluster can catch up by repeating requests. A difference of 20s is beyond acceptable limits.
How can there be a difference of >3s or >20s?
- Because time synchronization takes place at different intervals on both servers. If a server synchronizes 3 times a day, then the clock leap, i.e. the correction of the difference to its hardware clock, will be small. If a server synchronizes once a week or, for example, only after months, then the clock leap will be greater. The problem is not caused by an inaccurate clock, but by a large clock leap.
- Hardware clocks do not necessarily slow down linearly. If things go wrong, then a server may miss hardware interrupts for 500ms or more at a certain point in time. It is not possible to predict when the slowdown of a hardware clock will exceed the threshold value.
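
A rough calculation shows how the synchronization interval drives the size of the clock leap. The drift of 50ms per hour is an invented figure for illustration, and the function `clock_leap_seconds` assumes a constant drift, which real servers do not guarantee.

```python
def clock_leap_seconds(drift_ms_per_hour, hours_between_syncs):
    """Correction applied at the next synchronization, assuming constant drift."""
    return drift_ms_per_hour * hours_between_syncs / 1000.0


# Hypothetical drift of 50ms per hour with three different synchronization intervals.
for label, hours in [("3 times a day", 8), ("once a week", 168), ("every 3 months", 2160)]:
    print(f"{label}: {clock_leap_seconds(50, hours):.1f}s")
```

Under these assumptions the leap is 0.4s when synchronizing three times a day, 8.4s when synchronizing once a week and 108.0s after three months. The clock is equally inaccurate in all three cases; only the size of the correction differs.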

If the threshold value for clock leaps is exceeded and the hardware clock of the Active Controller instance is slower than that of the Standby Controller instance, then the Cluster Watch in JOC Cockpit and the Standby Controller will initiate fail-over as they consider the messages of the Active Controller to be outdated.
- However, the Active Controller is still alive (while all others consider it dead due to the time difference) and is still connected to Agents.
- At the same time, the Standby Controller becomes active and starts exchanging events with Agents. This will not last long: after 1-3s the Cluster Watch will instruct the Active Controller (if reachable) to become standby. But in the meantime, the (former) Active Controller may have exchanged events with Agents that the new Active Controller does not know about (and vice versa). This can result in journal corruption, which is indicated by warnings such as “inapplicable event”.

  The cluster no longer couples if both Controller instances receive events from Agents while they are active at the same time. Agent responses arrive for requests made by the other Controller instance, and there is nothing a Controller instance can do about a response for which it did not send the request. Both Controller instances assume they are on standby as they do not receive current events, and neither instance will take the lead in the cluster. This means the cluster is inoperable.

The Resolution

For improved resilience the following is made available:

JS-2209

What can users do?

Respect traffic lights

When it comes to non-matching server clocks, consider the following thresholds for clock leaps:

Difference	Traffic Light	Meaning
<3s	green	acceptable difference between server clocks
3-10s	yellow	the cluster is affected but can usually recover
>10s	red	the cluster might recover in rare cases, otherwise it will crash
>20s		the cluster will definitely crash
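
For monitoring scripts, the table can be expressed as a small function. This is a sketch with a hypothetical function name; it treats differences above 20s as red as well, which is an assumption since the table names no colour for that row.

```python
def traffic_light(difference_seconds):
    """Map the clock difference between two servers to a traffic light colour."""
    diff = abs(difference_seconds)  # the sign does not matter, only the size
    if diff < 3:
        return "green"
    if diff <= 10:
        return "yellow"
    return "red"
```

For example, `traffic_light(0.4)` returns "green" and `traffic_light(-12)` returns "red".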

Check presence of the server's hardware clock

For VMs that emulate the hardware clock by software, there is a risk that the hardware clock (driver) is not functional, for example due to a failed installation.
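
On Linux, one possible check is to look at the clock source the kernel currently uses. The sketch below assumes a Linux server that exposes the usual sysfs directory `/sys/devices/system/clocksource/clocksource0`; the function name `read_clocksource` is hypothetical, and which clock source is appropriate depends on the hypervisor in use.

```python
from pathlib import Path

# Usual Linux sysfs location of the kernel's clock source information (assumption: Linux).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if they cannot be read."""
    try:
        current = (base_dir / "current_clocksource").read_text().strip()
        available = (base_dir / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available
```

Calling `read_clocksource()` on a VM typically reports a paravirtualized or emulated source. A result of `(None, [])` means the information could not be read and calls for a closer look.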

Verify that Time Servers are used for synchronization

For Unix environments, NTP and Chrony are frequently used packages for time synchronization. Users should check the time server pool managed by such packages and should verify that a server's real-time clock is synchronized at regular intervals. This similarly applies to Windows environments, which offer the use of time servers.
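
Where Chrony is in use, the current offset of the system clock can be read from the output of `chronyc tracking`. The following sketch assumes that Chrony is installed and that the output contains a line of the form "System time : ... seconds fast|slow of NTP time"; the function names are hypothetical.

```python
import re
import subprocess


def parse_system_time_offset(tracking_output):
    """Extract the signed offset in seconds from `chronyc tracking` output.

    Positive values mean the system clock is ahead of NTP time, negative values
    mean it is behind. Returns None if no "System time" line is found.
    """
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    offset = float(match.group(1))
    return offset if match.group(2) == "fast" else -offset


def current_offset():
    """Run `chronyc tracking`; assumes Chrony is installed and chronyd is running."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return parse_system_time_offset(result.stdout)
```

Running `current_offset()` on each server of a cluster and comparing the results gives a first impression of how far the clocks are apart.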

Users who operate servers without internet access to public time servers should operate their own time server: even if the time server is not accurate, it guarantees that all servers receive the same wall-clock time.

Avoid wrong assumptions

Users who assume that distributing the active and passive instances of a Controller Cluster or Director Agent Cluster across different sites or different cloud providers will improve the situation might be wrong.

Using the same time server for two VMs running JS7 cluster instances (from different sites or cloud providers) is the right step, but is not sufficient. It also matters whether the slowdown of the hardware clocks of both VMs is corrected similarly and in a timely manner, and whether a single correction stays below the threshold value for clock leaps.
If one site operates hypervisors at the edge of utilization while the other site balances resources much better, there will be a problem if the difference between the respective slowdown of hardware clocks exceeds the threshold value for clock leaps.
If users keep both Controller instances or Director Agent instances at the same site, then chances are somewhat better that the slowdown of hardware clocks will be more uniform. However, this is not guaranteed.

Resources

JS7 - How to check Synchronization of Clocks between Servers
