The Short Answer

Divergent server clocks give inaccurate results, for example in the JS7 - Order History.

If a JS7 - Controller Cluster or JS7 - Agent Cluster is used, then the requirements for the accuracy of server clocks are stricter: the cluster can fail and journals can be disrupted if the divergence of server clocks exceeds reasonable limits.

  • This applies to a Controller Cluster, not to Standalone Controllers.
  • This applies to Director Agents in an Agent Cluster. It does not apply to Subagents, which are more robust against inaccurate server clocks.

The Long Answer

Basics about Clocks

Which clocks does a computer use?

  • There are two of them:
    • The hardware clock is a clock generator that is initialized when the computer is started and that counts clock pulses, e.g. in nanoseconds. This clock is used if, for example, the number of milliseconds since 1970 is requested. The clock provides elapsed clock pulses (in seconds or nanoseconds), in other words: the elapsed time, not directly the time of day.
    • The real-time clock is the system time of the server, calculated from the hardware clock among other things. It is backed by a battery, for example, while the computer is switched off.
  • Independently of the two clocks there is the wall-clock time. This is the external time, which is obtained, for example, from a time server. This is the time we know when we run to the train station.

What do the clocks have to do with each other?

  • Whether the hardware clock is accurate or not, its beat cannot be changed. The real-time clock is set to wall-clock time by synchronizing with a time server. This is usually done once or several times a day.
  • A server stores the current clock value, i.e. the elapsed milliseconds of the hardware clock, the last wall-clock time obtained and the difference between the two: Hardware clock + difference to wall-clock time = real-time clock
  • If the real-time clock is set to the wall-clock time by time synchronization, then the difference value to the wall-clock time is updated in the server.
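The relationship "hardware clock + difference to wall-clock time = real-time clock" can be illustrated with a minimal Python sketch. The class `SoftwareClock` and its method names are hypothetical and only model the bookkeeping described above; they are not part of JS7 or of any operating system API.

```python
import time


class SoftwareClock:
    """Toy model: real-time clock = hardware clock + stored difference."""

    def __init__(self, hardware_now, wall_now):
        # hardware_now: callable returning elapsed seconds (the "hardware clock")
        # wall_now: wall-clock time in seconds, e.g. obtained from a time server
        self._hardware_now = hardware_now
        self.difference = wall_now - hardware_now()

    def real_time(self):
        """Return the real-time clock derived from the hardware clock."""
        return self._hardware_now() + self.difference

    def synchronize(self, wall_now):
        """Set the real-time clock to wall-clock time.

        Only the stored difference is updated; the hardware clock itself
        is not changed. Returns the correction that was applied.
        """
        hardware = self._hardware_now()
        correction = wall_now - (hardware + self.difference)
        self.difference = wall_now - hardware
        return correction


# Usage with Python's own clocks as stand-ins (an assumption for illustration):
# time.monotonic() plays the hardware clock, time.time() the wall-clock time.
clock = SoftwareClock(time.monotonic, wall_now=time.time())
print(clock.real_time())
```

If the hardware clock has counted fewer seconds than have really passed, `synchronize` returns a positive correction; the longer the interval between synchronizations, the larger that correction tends to be.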

How accurate are these clocks?

  • The accuracy of a hardware clock depends on the design, but it is quite accurate. 
    • Traditionally, hardware interrupts are supplied to the OS. This frequently provides an accuracy from 1/60 to 1/100 of a second.
  • But wait, what about virtualization? The hardware clock of a VM is emulated by software, i.e. it is a software clock. This means that it is not exact: its accuracy depends on many things, not least on the load placed on the hypervisor. If the hypervisor struggles because too many VMs are running on it, then the emulated hardware clocks of the VMs become less accurate, that is, slower. A hardware clock will not run faster than the wall-clock time, but it can run slower if hardware interrupts are missed due to server load.
    • Find more details at https://www.vmware.com/docs/vmware_timekeeping.
    • Users who perform further research will find numerous complaints about inaccurate hardware clocks in VMs. In fact, the hardware clock is frequently installed as a driver in the server, which suggests that this can go wrong and that a failure might not be noticed. The reasons can depend on the load of the server.

JS7 Use of Clocks

Which clocks does JS7 use?

  • The hardware clock 
    • is used to calculate timeouts, in particular between clustered Controller instances and JOC Cockpit as Cluster Watch, and similarly between Director Agent instances and the Active Controller. Inaccurately calculated timeouts cause severe problems in a cluster.
    • is used to calculate Event IDs. This is not a severe problem: Event IDs are made up of milliseconds of the hardware clock and are also incremented monotonically in ascending order. Their uniqueness is guaranteed, even if the hardware clock falls behind significantly.
  • The real-time clock is used for start times, for example order starts, retry intervals etc. 
    • There is no severe impact here. If the real-time clocks of the Controller and Agent differ, then the history in JOC Cockpit, for example, can display an order with an end time that is earlier than its start time. This is because the order's start time is generated by the Agent (which acts autonomously for order starts) and the end time is generated by the Controller. The result in JOC Cockpit will look like time travel and will be confusing, but it does not harm daily operation of JS7 products.
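The statement that Event IDs remain unique even if the clock falls behind can be illustrated with a short Python sketch. This is not JS7's implementation; the class `EventIdGenerator` and the method `next_id` are hypothetical names that only show the general technique of combining clock milliseconds with a strictly increasing counter.

```python
class EventIdGenerator:
    """Issue strictly increasing IDs based on clock milliseconds."""

    def __init__(self, clock_millis):
        # clock_millis: callable returning the current clock value in milliseconds
        self._clock_millis = clock_millis
        self._last = 0

    def next_id(self):
        # Use the clock value if it is ahead of the last ID issued,
        # otherwise increment the last ID by one.
        self._last = max(self._clock_millis(), self._last + 1)
        return self._last


# A clock that stalls and then falls behind still yields unique, ascending IDs.
readings = iter([1000, 1000, 990, 2000])
generator = EventIdGenerator(lambda: next(readings))
print([generator.next_id() for _ in range(4)])
```

In this sketch a stalled or lagging clock affects only how closely an ID matches the real time, not its uniqueness or order.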

The Problem

What's the problem?

  • Timeouts in cluster operation are calculated using hardware clocks. If these diverge on the servers of JOC Cockpit (Cluster Watch), Active Controller and Standby Controller, then this is acceptable up to a threshold value of 3s. Within this tolerance, the cluster can catch up by repeating requests. A difference of 20s is beyond acceptable limits.
  • How can there be a difference of >3s or >20s? 
    • Because the time synchronization takes place at different intervals on both servers. If a server synchronizes 3 times a day, then the correction of the difference to its hardware clock will be small. If a server synchronizes once a week or, for example, only after months, then the correction will be greater. The problem is not caused by an inaccurate clock, but by large corrections. 
    • Hardware clocks do not necessarily slow down linearly. If things go wrong, then a server may miss hardware interrupts amounting to 500ms or more at a certain point in time. It is therefore not possible to predict when the slowdown of a hardware clock will exceed the threshold value.
  • If the threshold value is exceeded and the hardware clock of the Active Controller is slower than that of the Standby Controller, then JOC Cockpit and the Standby Controller initiate fail-over as they consider the messages of the Active Controller to be outdated.
    • However, the Active Controller is still alive (while all others consider it dead due to the time difference) and is still connected to Agents. 
    • At the same time, the Standby Controller becomes active and starts exchanging events with the Agents. This state does not last long: after 1-3s JOC Cockpit, acting as Cluster Watch, will instruct the (former) Active Controller, if reachable, to become standby. In the meantime, however, the former Active Controller may have exchanged events with Agents that the new Active Controller does not know about (and vice versa). This can result in journal corruption, which is indicated by warnings such as “inapplicable event”.

      The cluster no longer couples if both Controller instances receive events from the Agents during the period in which both are active. Agent responses arrive for requests made by the other Controller, and a Controller instance cannot do anything with a response to a request that it did not send. Both Controller instances then believe they are on standby, as they do not receive current events, and neither instance wants to take the lead in the cluster. This means the cluster is inoperable.
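The effect of different synchronization intervals on the size of a correction can be estimated with simple arithmetic. The following Python sketch assumes a constant slowdown expressed in parts per million (ppm); both the function `accumulated_drift` and the figure of 50 ppm are assumptions for illustration, and real VMs do not necessarily slow down linearly, as noted above.

```python
def accumulated_drift(drift_ppm: float, sync_interval_s: float) -> float:
    """Seconds a clock falls behind between two synchronizations,
    assuming a constant slowdown of drift_ppm parts per million."""
    return drift_ppm / 1_000_000 * sync_interval_s


HOUR = 3600
DAY = 24 * HOUR

# Assumed slowdown of 50 ppm, with three different synchronization intervals.
for label, interval in [("3 times a day", 8 * HOUR),
                        ("once a day", DAY),
                        ("once a week", 7 * DAY)]:
    drift = accumulated_drift(50, interval)
    print(f"{label}: correction of about {drift:.2f}s")
```

Under these assumptions the same clock needs a correction well below 3s when synchronized three times a day, and a correction beyond 20s when synchronized only once a week.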

The Resolution

What can users do?

Respect traffic lights

When it comes to divergence of server clocks consider the following thresholds:

| Divergence | Traffic Light | Meaning |
|------------|---------------|---------|
| <3s        | green         | divergence of server clocks is acceptable |
| 3-10s      | yellow        | the cluster is affected but can usually recover |
| >10s       | red           | in rare cases the cluster might be able to recover and otherwise will crash |
| >20s       | red           | definitely means crash |
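For monitoring scripts, the thresholds above can be expressed as a small helper. This is a minimal Python sketch; the function name `traffic_light` is hypothetical, and treating a divergence of exactly 10s as yellow is an assumption about how the range boundaries are read.

```python
def traffic_light(divergence_s: float) -> str:
    """Map a clock divergence in seconds to the traffic light above."""
    divergence = abs(divergence_s)  # the direction of the divergence does not matter
    if divergence < 3:
        return "green"
    if divergence <= 10:
        return "yellow"
    return "red"


for seconds in (0.5, 3, 10, 12, 25):
    print(f"{seconds}s -> {traffic_light(seconds)}")
```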

Verify that Time Servers are used for synchronization

For Unix environments, NTP and Chrony are frequently used packages for time synchronization. Users should check the time server pool managed by such packages and should verify that a server's real-time clock is synchronized at regular intervals. This similarly applies to Windows environments, which also offer the use of time servers.
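Where Chrony is in use, the current offset of the system clock can be read from the output of `chronyc tracking`. The following Python sketch parses the "System time" line of that output. The function names and the sample text are illustrative assumptions; the output format should be verified against the installed Chrony version.

```python
import re
import subprocess


def parse_system_time_offset(tracking_output: str) -> float:
    """Return the offset in seconds from 'chronyc tracking' output.

    Positive values mean the system clock is fast of NTP time,
    negative values mean it is slow.
    """
    match = re.search(
        r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
        tracking_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'System time' line found")
    value = float(match.group(1))
    return value if match.group(2) == "fast" else -value


def read_offset() -> float:
    """Run 'chronyc tracking' and return the parsed offset (requires Chrony)."""
    result = subprocess.run(
        ["chronyc", "tracking"], capture_output=True, text=True, check=True
    )
    return parse_system_time_offset(result.stdout)


# Illustrative sample text, not output captured from a real server.
SAMPLE = "Stratum         : 3\nSystem time     : 0.000006523 seconds slow of NTP time\n"
print(parse_system_time_offset(SAMPLE))
```

The parsed value is the offset of a single server against its time source. To assess a cluster, the offsets of all servers involved have to be compared.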

Users who operate servers without internet access to public time servers should operate their own time server.

Don't make wrong assumptions

Users who assume that distributing the active and passive instances of a Controller Cluster or Director Agent Cluster across different sites or different cloud providers will improve the situation might be wrong.

  • Using the same time server for two VMs running cluster instances (at different sites or with different cloud providers) is the right step, but it is not enough. In fact, it depends on whether the slowdown of the hardware clocks of both VMs is corrected similarly and in a timely manner, and on whether a single correction stays below the threshold value.
  • If one site operates hypervisors at the limit of their capacity while the other site balances resources better, there will be a problem if the difference between the respective slowdowns of the hardware clocks exceeds the threshold value.
  • If users keep both Controller instances or Director Agent instances at the same site, then chances are somewhat better that the slowdown of hardware clocks will be more uniform. However, this is not guaranteed.


