268 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			268 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
# Some explanations about Desyncs
 | 
						|
 | 
						|
Last updated: 2014-02-23
 | 
						|
 | 
						|
## Table of contents
 | 
						|
 | 
						|
- 1.0) Desync theory
 | 
						|
    - 1.1) [OpenTTD multiplayer architecture](#11-openttd-multiplayer-architecture)
 | 
						|
    - 1.2) [What is a Desync and how is it detected](#12-what-is-a-desync-and-how-is-it-detected)
 | 
						|
    - 1.3) [Typical causes of Desyncs](#13-typical-causes-of-desyncs)
 | 
						|
- 2.0) What to do in case of a Desync
 | 
						|
    - 2.1) [Cache debugging](#21-cache-debugging)
 | 
						|
    - 2.2) [Desync recording](#22-desync-recording)
 | 
						|
- 3.0) Evaluating the Desync records
 | 
						|
    - 3.1) [Replaying](#31-replaying)
 | 
						|
    - 3.2) [Evaluation of the replay](#32-evaluation-of-the-replay)
 | 
						|
    - 3.3) [Comparing savegames](#33-comparing-savegames)
 | 
						|
 | 
						|
 | 
						|
## 1.1) OpenTTD multiplayer architecture
 | 
						|
 | 
						|
  OpenTTD has a huge gamestate, which changes all of the time.
 | 
						|
  The savegame contains the complete gamestate at a specific point
 | 
						|
  in time. But this state changes completely each tick: Vehicles move
 | 
						|
  and trees grow.
 | 
						|
 | 
						|
  However, most of these changes in the gamestate are deterministic:
 | 
						|
  Without a player interfering a vehicle follows its orders always
 | 
						|
  in the same way, and trees always grow the same.
 | 
						|
 | 
						|
  In OpenTTD multiplayer synchronisation works by creating a savegame
 | 
						|
  when clients join, and then transferring that savegame to the client,
 | 
						|
  so it has the complete gamestate at a fixed point in time.
 | 
						|
 | 
						|
  Afterwards clients only receive 'commands', that is: Stuff which is
 | 
						|
  not predictable, like
 | 
						|
   - player actions
 | 
						|
   - AI actions
 | 
						|
   - GameScript actions
 | 
						|
   - Admin Port command
 | 
						|
   - rcon commands
 | 
						|
   - ...
 | 
						|
 | 
						|
  These commands contain the information on how to execute the command,
 | 
						|
  and when to execute it. Time is measured in 'network frames'.
 | 
						|
  Mind that network frames to not match ingame time. Network frames
 | 
						|
  also run while the game is paused, to give a defined behaviour to
 | 
						|
  stuff that is executing while the game is paused.
 | 
						|
 | 
						|
  The deterministic part of the gamestate is run by the clients on
 | 
						|
  their own. All they get from the server is the instruction to
 | 
						|
  run the gamestate up to a certain network time, which basically
 | 
						|
  says that there are no commands scheduled in that time.
 | 
						|
 | 
						|
  When a client (which includes the server itself) wants to execute
 | 
						|
  a command (i.e. a non-predictable action), it does this by
 | 
						|
   - calling DoCommandP resp. DoCommandPInternal
 | 
						|
   - These functions first do a local test-run of the command to
 | 
						|
     check simple preconditions. (Just to give the client an
 | 
						|
     immediate response without bothering the server and waiting for
 | 
						|
     the response.) The test-run may not actually change the
 | 
						|
     gamestate, all changes must be discarded.
 | 
						|
   - If the local test-run succeeds the command is sent to the server.
 | 
						|
   - The server inserts the command into the command queue, which
 | 
						|
     assigns a network frame to the commands, i.e. when it shall be
 | 
						|
     executed on all clients.
 | 
						|
   - Enhanced with this specific timestamp, the command is send to all
 | 
						|
     clients, which execute the command simultaneously in the same
 | 
						|
     network frame in the same order.
 | 
						|
 | 
						|
## 1.2) What is a Desync and how is it detected
 | 
						|
 | 
						|
  In the ideal case all clients have the same gamestate as the server
 | 
						|
  and run in sync. That is, vehicle movement is the same on all
 | 
						|
  clients, and commands are executed the same everywhere and
 | 
						|
  have the same results.
 | 
						|
 | 
						|
  When a Desync happens, it means that the gamestates on the clients
 | 
						|
  (including the server) are no longer the same. Just imagine
 | 
						|
  that a vehicle picks the left line instead of the right line at
 | 
						|
  a junction on one client.
 | 
						|
 | 
						|
  The important thing here is, that no one notices when a Desync
 | 
						|
  occurs. The desync client will continue to simulate the gamestate
 | 
						|
  and execute commands from the server. Once the gamestate differs
 | 
						|
  it will increasingly spiral out of control: If a vehicle picks a
 | 
						|
  different route, it will arrive at a different time at a station,
 | 
						|
  which will load different cargo, which causes other vehicles to
 | 
						|
  load other stuff, which causes industries to notice different
 | 
						|
  servicing, which causes industries to change production, ...
 | 
						|
  the client could run all day in a different universe.
 | 
						|
 | 
						|
  To limit how long a Desync can remain unnoticed, the server
 | 
						|
  transfers some checksums every now and then for the gamestate.
 | 
						|
  Currently this checksum is the state of the random number
 | 
						|
  generator of the game logic. A lot of things in OpenTTD depend
 | 
						|
  on the RNG, and if the gamestate differs, it is likely that the
 | 
						|
  RNG is called at different times, and the state differs when
 | 
						|
  checked.
 | 
						|
 | 
						|
  The clients compare this 'checksum' with the checksum of their
 | 
						|
  own gamestate at the specific network frame. If they differ,
 | 
						|
  the client disconnects with a Desync error.
 | 
						|
 | 
						|
  The important thing here is: The detection of the Desync is
 | 
						|
  only an ultimate failure detection. It does not give any
 | 
						|
  indication on when the Desync happened. The Desync may after
 | 
						|
  all have occurred long ago, and just did not affect the checksum
 | 
						|
  up to now. The checksum may have matched 10 times or more
 | 
						|
  since the Desync happened, and only now the Desync has spiraled
 | 
						|
  enough to finally affect the checksum. (There was once a desync
 | 
						|
  which was only noticed by the checksum after 20 game years.)
 | 
						|
 | 
						|
## 1.3) Typical causes of Desyncs
 | 
						|
 | 
						|
  Desyncs can be caused by the following scenarios:
 | 
						|
   - The savegame does not describe the complete gamestate.
 | 
						|
      - Some information which affects the progression of the
 | 
						|
        gamestate is not saved in the savegame.
 | 
						|
      - Some information which affects the progression of the
 | 
						|
        gamestate is not loaded from the savegame.
 | 
						|
        This includes the case that something is not completely
 | 
						|
        reset before loading the savegame, so data from the
 | 
						|
        previous game is carried over to the new one.
 | 
						|
   - The gamestate does not behave deterministic.
 | 
						|
      - Cache mismatch: The game logic depends on some cached
 | 
						|
        values, which are not invalidated properly. This is
 | 
						|
        the usual case for NewGRF-specific Desyncs.
 | 
						|
      - Undefined behaviour: The game logic performs multiple
 | 
						|
        things in an undefined order or with an undefined
 | 
						|
        result. E.g. when sorting something with a key while
 | 
						|
        some keys are equal. Or some computation that depends
 | 
						|
        on the CPU architecture (32/64 bit, little/big endian).
 | 
						|
   - The gamestate is modified when it shall not be modified.
 | 
						|
      - The test-run of a command alters the gamestate.
 | 
						|
      - The gamestate is altered by a player or script without
 | 
						|
        using commands.
 | 
						|
 | 
						|
 | 
						|
## 2.1) Cache debugging
 | 
						|
 | 
						|
  Desyncs which are caused by improper cache validation can
 | 
						|
  often be found by enabling cache validation:
 | 
						|
   - Start OpenTTD with '-d desync=2'.
 | 
						|
   - This will enable validation of caches every tick.
 | 
						|
     That is, cached values are recomputed every tick and compared
 | 
						|
     to the cached value.
 | 
						|
   - Differences are logged to 'commands-out.log' in the autosave
 | 
						|
     folder.
 | 
						|
 | 
						|
  Mind that this type of debugging can also be done in singleplayer.
 | 
						|
 | 
						|
## 2.2) Desync recording
 | 
						|
 | 
						|
  If you have a server, which happens to encounter Desyncs often,
 | 
						|
  you can enable recording of the gamestate alterations. This
 | 
						|
  will later allow the replay the gamestate and locate the Desync
 | 
						|
  cause.
 | 
						|
 | 
						|
  There are two levels of Desync recording, which are enabled
 | 
						|
  via '-d desync=2' resp. '-d desync=3'. Both will record all
 | 
						|
  commands to a file 'commands-out.log' in the autosave folder.
 | 
						|
 | 
						|
  If you have the savegame from the start of the server, and
 | 
						|
  this command log you can replay the whole game. (see Section 3.1)
 | 
						|
 | 
						|
  If you do not start the server from a savegame, there will
 | 
						|
  also be a savegame created just after a map has been generated.
 | 
						|
  The savegame will be named 'dmp_cmds_*.sav' and be put into
 | 
						|
  the autosave folder.
 | 
						|
 | 
						|
  In addition to that '-d desync=3' also creates regular savegames
 | 
						|
  at defined spots in network time. (more defined than regular
 | 
						|
  autosaves). These will be created in the autosave folder
 | 
						|
  and will also be named 'dmp_cmds_*.sav'.
 | 
						|
 | 
						|
  These saves allow comparing the gamestate with the original
 | 
						|
  gamestate during replaying, and thus greatly help debugging.
 | 
						|
  However, they also take a lot of disk space.
 | 
						|
 | 
						|
 | 
						|
## 3.1) Replaying
 | 
						|
 | 
						|
  To replay a Desync recording, you need these files:
 | 
						|
   - The savegame from when the server was started, resp.
 | 
						|
     the automatically created savegame from when the map
 | 
						|
     was generated.
 | 
						|
   - The 'commands-out.log' file.
 | 
						|
   - Optionally the 'dmp_cmds_*.sav'.
 | 
						|
  Put these files into a safe spot. (Not your autosave folder!)
 | 
						|
 | 
						|
  Next, prepare your OpenTTD for replaying:
 | 
						|
   - Get the same version of OpenTTD as the original server was running.
 | 
						|
   - Uncomment/enable the define 'DEBUG_DUMP_COMMANDS' in
 | 
						|
     'src/network/network_func.h'.
 | 
						|
     (DEBUG_FAILED_DUMP_COMMANDS is explained later)
 | 
						|
   - Put the 'commands-out.log' into the root save folder, and rename
 | 
						|
      it to 'commands.log'.
 | 
						|
   - Run 'openttd -D -d desync=3 -g startsavegame.sav'.
 | 
						|
     This replays the server log and creates new 'commands-out.log'
 | 
						|
     and 'dmp_cmds_*.sav' in your autosave folder.
 | 
						|
 | 
						|
## 3.2) Evaluation of the replay
 | 
						|
 | 
						|
  The replaying will also compare the checksums which are part of
 | 
						|
  the 'commands-out.log' with the replayed gamestate.
 | 
						|
  If they differ, it will trigger a 'NOT_REACHED'.
 | 
						|
 | 
						|
  If the replay succeeds without mismatch, that is the replay reproduces
 | 
						|
  the original server state:
 | 
						|
   - Repeat the replay starting from incrementally later 'dmp_cmds_*.sav'
 | 
						|
     while truncating the 'commands.log' at the beginning appropriately.
 | 
						|
     The 'dmp_cmds_*.sav' can be your own ones from the first reply, or
 | 
						|
     the ones from the original server (if you have them).
 | 
						|
     (This simulates the view of joining clients during the game.)
 | 
						|
   - If one of those replays fails, you have located the Desync between
 | 
						|
     the last dmp_cmds that reproduces the replay and the first one
 | 
						|
     that fails.
 | 
						|
 | 
						|
  If the replay does not succeed without mismatch, you can check the logs
 | 
						|
  whether there were failed commands. Then you may try to replay with
 | 
						|
  DEBUG_FAILED_DUMP_COMMANDS enabled. If the replay then fails, the
 | 
						|
  command test-run of the failed command modified the game state.
 | 
						|
 | 
						|
  If you have the original 'dmp_cmds_*.sav', you can also compare those
 | 
						|
  savegames with your own ones from the replay. You can also comment/disable
 | 
						|
  the 'NOT_REACHED' mentioned above, to get another 'dmp_cmds_*.sav' from
 | 
						|
  the replay after the mismatch has already been detected.
 | 
						|
  See Section 3.2 on how to compare savegames.
 | 
						|
  If the saves differ you have located the Desync between the last dmp_cmds
 | 
						|
  that match and the first one that does not. The difference of the saves
 | 
						|
  may point you in the direction of what causes it.
 | 
						|
 | 
						|
  If the replay succeeds without mismatch, and you do not have any
 | 
						|
  'dmp_cmd_*.sav' from the original server, it is a lost case.
 | 
						|
  Enable creation of the 'dmp_cmd_*.sav' on the server, and wait for the
 | 
						|
  next Desync.
 | 
						|
 | 
						|
  Finally, you can also compare the 'commands-out.log' from the original
 | 
						|
  server with the one from the replay. They will differ in stuff like
 | 
						|
  dates, and the original log will contain the chat, but otherwise they
 | 
						|
  should match.
 | 
						|
 | 
						|
## 3.3) Comparing savegames
 | 
						|
 | 
						|
  The binary form of the savegames from the original server and from
 | 
						|
  your replay will always differ:
 | 
						|
   - The savegame contains paths to used NewGRF files.
 | 
						|
   - The gamelog will log your loading of the savegame.
 | 
						|
   - The savegame data of AIs and the Gamescript will differ.
 | 
						|
     Scripts are not run during the replay, only their recorded commands
 | 
						|
     are replayed. Their internal state will thus not change in the
 | 
						|
     replay and will differ.
 | 
						|
 | 
						|
  To compare savegame more semantically, there exist some ugly hackish
 | 
						|
  tools at:
 | 
						|
     http://devs.openttd.org/~frosch/texts/zpipe.c
 | 
						|
     http://devs.openttd.org/~frosch/texts/printhunk.c
 | 
						|
 | 
						|
  The first one decompresses OpenTTD savegames. The second one creates
 | 
						|
  a textual representation of an uncompressed savegame, by parsing hunks
 | 
						|
  and arrays and such. With both tools you need to be a bit careful
 | 
						|
  since they work on stdin and stdout, which may not deal well with
 | 
						|
  binary data.
 | 
						|
 | 
						|
  If you have the textual representation of the savegames, you can
 | 
						|
  compare them with regular diff tools.
 |