openttd/docs/desync.md

# Some explanations about Desyncs

Last updated: 2014-02-23

## Table of contents

- 1.0) Desync theory
    - 1.1) [OpenTTD multiplayer architecture](#11-openttd-multiplayer-architecture)
    - 1.2) [What is a Desync and how is it detected](#12-what-is-a-desync-and-how-is-it-detected)
    - 1.3) [Typical causes of Desyncs](#13-typical-causes-of-desyncs)
- 2.0) What to do in case of a Desync
    - 2.1) [Cache debugging](#21-cache-debugging)
    - 2.2) [Desync recording](#22-desync-recording)
- 3.0) Evaluating the Desync records
    - 3.1) [Replaying](#31-replaying)
    - 3.2) [Evaluation of the replay](#32-evaluation-of-the-replay)
    - 3.3) [Comparing savegames](#33-comparing-savegames)


## 1.1) OpenTTD multiplayer architecture

  OpenTTD has a huge gamestate, which changes all of the time.
  The savegame contains the complete gamestate at a specific point
  in time. But this state changes completely each tick: Vehicles move
  and trees grow.

  However, most of these changes in the gamestate are deterministic:
  Without a player interfering a vehicle follows its orders always
  in the same way, and trees always grow the same.

  In OpenTTD multiplayer synchronisation works by creating a savegame
  when clients join, and then transferring that savegame to the client,
  so it has the complete gamestate at a fixed point in time.

  Afterwards clients only receive 'commands', that is: Stuff which is
  not predictable, like
   - player actions
   - AI actions
   - GameScript actions
   - Admin Port command
   - rcon commands
   - ...

  These commands contain the information on how to execute the command,
  and when to execute it. Time is measured in 'network frames'.
  Mind that network frames to not match ingame time. Network frames
  also run while the game is paused, to give a defined behaviour to
  stuff that is executing while the game is paused.

  The deterministic part of the gamestate is run by the clients on
  their own. All they get from the server is the instruction to
  run the gamestate up to a certain network time, which basically
  says that there are no commands scheduled in that time.

  When a client (which includes the server itself) wants to execute
  a command (i.e. a non-predictable action), it does this by
   - calling DoCommandP resp. DoCommandPInternal
   - These functions first do a local test-run of the command to
     check simple preconditions. (Just to give the client an
     immediate response without bothering the server and waiting for
     the response.) The test-run may not actually change the
     gamestate, all changes must be discarded.
   - If the local test-run succeeds the command is sent to the server.
   - The server inserts the command into the command queue, which
     assigns a network frame to the commands, i.e. when it shall be
     executed on all clients.
   - Enhanced with this specific timestamp, the command is send to all
     clients, which execute the command simultaneously in the same
     network frame in the same order.

## 1.2) What is a Desync and how is it detected

  In the ideal case all clients have the same gamestate as the server
  and run in sync. That is, vehicle movement is the same on all
  clients, and commands are executed the same everywhere and
  have the same results.

  When a Desync happens, it means that the gamestates on the clients
  (including the server) are no longer the same. Just imagine
  that a vehicle picks the left line instead of the right line at
  a junction on one client.

  The important thing here is, that no one notices when a Desync
  occurs. The desync client will continue to simulate the gamestate
  and execute commands from the server. Once the gamestate differs
  it will increasingly spiral out of control: If a vehicle picks a
  different route, it will arrive at a different time at a station,
  which will load different cargo, which causes other vehicles to
  load other stuff, which causes industries to notice different
  servicing, which causes industries to change production, ...
  the client could run all day in a different universe.

  To limit how long a Desync can remain unnoticed, the server
  transfers some checksums every now and then for the gamestate.
  Currently this checksum is the state of the random number
  generator of the game logic. A lot of things in OpenTTD depend
  on the RNG, and if the gamestate differs, it is likely that the
  RNG is called at different times, and the state differs when
  checked.

  The clients compare this 'checksum' with the checksum of their
  own gamestate at the specific network frame. If they differ,
  the client disconnects with a Desync error.

  The important thing here is: The detection of the Desync is
  only an ultimate failure detection. It does not give any
  indication on when the Desync happened. The Desync may after
  all have occurred long ago, and just did not affect the checksum
  up to now. The checksum may have matched 10 times or more
  since the Desync happened, and only now the Desync has spiraled
  enough to finally affect the checksum. (There was once a desync
  which was only noticed by the checksum after 20 game years.)

## 1.3) Typical causes of Desyncs

  Desyncs can be caused by the following scenarios:
   - The savegame does not describe the complete gamestate.
      - Some information which affects the progression of the
        gamestate is not saved in the savegame.
      - Some information which affects the progression of the
        gamestate is not loaded from the savegame.
        This includes the case that something is not completely
        reset before loading the savegame, so data from the
        previous game is carried over to the new one.
   - The gamestate does not behave deterministic.
      - Cache mismatch: The game logic depends on some cached
        values, which are not invalidated properly. This is
        the usual case for NewGRF-specific Desyncs.
      - Undefined behaviour: The game logic performs multiple
        things in an undefined order or with an undefined
        result. E.g. when sorting something with a key while
        some keys are equal. Or some computation that depends
        on the CPU architecture (32/64 bit, little/big endian).
   - The gamestate is modified when it shall not be modified.
      - The test-run of a command alters the gamestate.
      - The gamestate is altered by a player or script without
        using commands.


## 2.1) Cache debugging

  Desyncs which are caused by improper cache validation can
  often be found by enabling cache validation:
   - Start OpenTTD with '-d desync=2'.
   - This will enable validation of caches every tick.
     That is, cached values are recomputed every tick and compared
     to the cached value.
   - Differences are logged to 'commands-out.log' in the autosave
     folder.

  Mind that this type of debugging can also be done in singleplayer.

## 2.2) Desync recording

  If you have a server, which happens to encounter Desyncs often,
  you can enable recording of the gamestate alterations. This
  will later allow the replay the gamestate and locate the Desync
  cause.

  There are two levels of Desync recording, which are enabled
  via '-d desync=2' resp. '-d desync=3'. Both will record all
  commands to a file 'commands-out.log' in the autosave folder.

  If you have the savegame from the start of the server, and
  this command log you can replay the whole game. (see Section 3.1)

  If you do not start the server from a savegame, there will
  also be a savegame created just after a map has been generated.
  The savegame will be named 'dmp_cmds_*.sav' and be put into
  the autosave folder.

  In addition to that '-d desync=3' also creates regular savegames
  at defined spots in network time. (more defined than regular
  autosaves). These will be created in the autosave folder
  and will also be named 'dmp_cmds_*.sav'.

  These saves allow comparing the gamestate with the original
  gamestate during replaying, and thus greatly help debugging.
  However, they also take a lot of disk space.


## 3.1) Replaying

  To replay a Desync recording, you need these files:
   - The savegame from when the server was started, resp.
     the automatically created savegame from when the map
     was generated.
   - The 'commands-out.log' file.
   - Optionally the 'dmp_cmds_*.sav'.
  Put these files into a safe spot. (Not your autosave folder!)

  Next, prepare your OpenTTD for replaying:
   - Get the same version of OpenTTD as the original server was running.
   - Uncomment/enable the define 'DEBUG_DUMP_COMMANDS' in
     'src/network/network_func.h'.
     (DEBUG_FAILED_DUMP_COMMANDS is explained later)
   - Put the 'commands-out.log' into the root save folder, and rename
      it to 'commands.log'.
   - Run 'openttd -D -d desync=3 -g startsavegame.sav'.
     This replays the server log and creates new 'commands-out.log'
     and 'dmp_cmds_*.sav' in your autosave folder.

## 3.2) Evaluation of the replay

  The replaying will also compare the checksums which are part of
  the 'commands-out.log' with the replayed gamestate.
  If they differ, it will trigger a 'NOT_REACHED'.

  If the replay succeeds without mismatch, that is the replay reproduces
  the original server state:
   - Repeat the replay starting from incrementally later 'dmp_cmds_*.sav'
     while truncating the 'commands.log' at the beginning appropriately.
     The 'dmp_cmds_*.sav' can be your own ones from the first reply, or
     the ones from the original server (if you have them).
     (This simulates the view of joining clients during the game.)
   - If one of those replays fails, you have located the Desync between
     the last dmp_cmds that reproduces the replay and the first one
     that fails.

  If the replay does not succeed without mismatch, you can check the logs
  whether there were failed commands. Then you may try to replay with
  DEBUG_FAILED_DUMP_COMMANDS enabled. If the replay then fails, the
  command test-run of the failed command modified the game state.

  If you have the original 'dmp_cmds_*.sav', you can also compare those
  savegames with your own ones from the replay. You can also comment/disable
  the 'NOT_REACHED' mentioned above, to get another 'dmp_cmds_*.sav' from
  the replay after the mismatch has already been detected.
  See Section 3.2 on how to compare savegames.
  If the saves differ you have located the Desync between the last dmp_cmds
  that match and the first one that does not. The difference of the saves
  may point you in the direction of what causes it.

  If the replay succeeds without mismatch, and you do not have any
  'dmp_cmd_*.sav' from the original server, it is a lost case.
  Enable creation of the 'dmp_cmd_*.sav' on the server, and wait for the
  next Desync.

  Finally, you can also compare the 'commands-out.log' from the original
  server with the one from the replay. They will differ in stuff like
  dates, and the original log will contain the chat, but otherwise they
  should match.

## 3.3) Comparing savegames

  The binary form of the savegames from the original server and from
  your replay will always differ:
   - The savegame contains paths to used NewGRF files.
   - The gamelog will log your loading of the savegame.
   - The savegame data of AIs and the Gamescript will differ.
     Scripts are not run during the replay, only their recorded commands
     are replayed. Their internal state will thus not change in the
     replay and will differ.

  To compare savegame more semantically, there exist some ugly hackish
  tools at:
     http://devs.openttd.org/~frosch/texts/zpipe.c
     http://devs.openttd.org/~frosch/texts/printhunk.c

  The first one decompresses OpenTTD savegames. The second one creates
  a textual representation of an uncompressed savegame, by parsing hunks
  and arrays and such. With both tools you need to be a bit careful
  since they work on stdin and stdout, which may not deal well with
  binary data.

  If you have the textual representation of the savegames, you can
  compare them with regular diff tools.