268 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			268 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| # Some explanations about Desyncs
 | |
| 
 | |
| Last updated: 2014-02-23
 | |
| 
 | |
| ## Table of contents
 | |
| 
 | |
| - 1.0) Desync theory
 | |
|     - 1.1) [OpenTTD multiplayer architecture](#11-openttd-multiplayer-architecture)
 | |
|     - 1.2) [What is a Desync and how is it detected](#12-what-is-a-desync-and-how-is-it-detected)
 | |
|     - 1.3) [Typical causes of Desyncs](#13-typical-causes-of-desyncs)
 | |
| - 2.0) What to do in case of a Desync
 | |
|     - 2.1) [Cache debugging](#21-cache-debugging)
 | |
|     - 2.2) [Desync recording](#22-desync-recording)
 | |
| - 3.0) Evaluating the Desync records
 | |
|     - 3.1) [Replaying](#31-replaying)
 | |
|     - 3.2) [Evaluation of the replay](#32-evaluation-of-the-replay)
 | |
|     - 3.3) [Comparing savegames](#33-comparing-savegames)
 | |
| 
 | |
| 
 | |
| ## 1.1) OpenTTD multiplayer architecture
 | |
| 
 | |
|   OpenTTD has a huge gamestate, which changes all of the time.
 | |
|   The savegame contains the complete gamestate at a specific point
 | |
|   in time. But this state changes completely each tick: Vehicles move
 | |
|   and trees grow.
 | |
| 
 | |
|   However, most of these changes in the gamestate are deterministic:
 | |
|   Without a player interfering a vehicle follows its orders always
 | |
|   in the same way, and trees always grow the same.
 | |
| 
 | |
|   In OpenTTD multiplayer synchronisation works by creating a savegame
 | |
|   when clients join, and then transferring that savegame to the client,
 | |
|   so it has the complete gamestate at a fixed point in time.
 | |
| 
 | |
|   Afterwards clients only receive 'commands', that is: Stuff which is
 | |
|   not predictable, like
 | |
|    - player actions
 | |
|    - AI actions
 | |
|    - GameScript actions
 | |
|    - Admin Port command
 | |
|    - rcon commands
 | |
|    - ...
 | |
| 
 | |
|   These commands contain the information on how to execute the command,
 | |
|   and when to execute it. Time is measured in 'network frames'.
 | |
|   Mind that network frames to not match ingame time. Network frames
 | |
|   also run while the game is paused, to give a defined behaviour to
 | |
|   stuff that is executing while the game is paused.
 | |
| 
 | |
|   The deterministic part of the gamestate is run by the clients on
 | |
|   their own. All they get from the server is the instruction to
 | |
|   run the gamestate up to a certain network time, which basically
 | |
|   says that there are no commands scheduled in that time.
 | |
| 
 | |
|   When a client (which includes the server itself) wants to execute
 | |
|   a command (i.e. a non-predictable action), it does this by
 | |
|    - calling DoCommandP resp. DoCommandPInternal
 | |
|    - These functions first do a local test-run of the command to
 | |
|      check simple preconditions. (Just to give the client an
 | |
|      immediate response without bothering the server and waiting for
 | |
|      the response.) The test-run may not actually change the
 | |
|      gamestate, all changes must be discarded.
 | |
|    - If the local test-run succeeds the command is sent to the server.
 | |
|    - The server inserts the command into the command queue, which
 | |
|      assigns a network frame to the commands, i.e. when it shall be
 | |
|      executed on all clients.
 | |
|    - Enhanced with this specific timestamp, the command is send to all
 | |
|      clients, which execute the command simultaneously in the same
 | |
|      network frame in the same order.
 | |
| 
 | |
| ## 1.2) What is a Desync and how is it detected
 | |
| 
 | |
|   In the ideal case all clients have the same gamestate as the server
 | |
|   and run in sync. That is, vehicle movement is the same on all
 | |
|   clients, and commands are executed the same everywhere and
 | |
|   have the same results.
 | |
| 
 | |
|   When a Desync happens, it means that the gamestates on the clients
 | |
|   (including the server) are no longer the same. Just imagine
 | |
|   that a vehicle picks the left line instead of the right line at
 | |
|   a junction on one client.
 | |
| 
 | |
|   The important thing here is, that no one notices when a Desync
 | |
|   occurs. The desync client will continue to simulate the gamestate
 | |
|   and execute commands from the server. Once the gamestate differs
 | |
|   it will increasingly spiral out of control: If a vehicle picks a
 | |
|   different route, it will arrive at a different time at a station,
 | |
|   which will load different cargo, which causes other vehicles to
 | |
|   load other stuff, which causes industries to notice different
 | |
|   servicing, which causes industries to change production, ...
 | |
|   the client could run all day in a different universe.
 | |
| 
 | |
|   To limit how long a Desync can remain unnoticed, the server
 | |
|   transfers some checksums every now and then for the gamestate.
 | |
|   Currently this checksum is the state of the random number
 | |
|   generator of the game logic. A lot of things in OpenTTD depend
 | |
|   on the RNG, and if the gamestate differs, it is likely that the
 | |
|   RNG is called at different times, and the state differs when
 | |
|   checked.
 | |
| 
 | |
|   The clients compare this 'checksum' with the checksum of their
 | |
|   own gamestate at the specific network frame. If they differ,
 | |
|   the client disconnects with a Desync error.
 | |
| 
 | |
|   The important thing here is: The detection of the Desync is
 | |
|   only an ultimate failure detection. It does not give any
 | |
|   indication on when the Desync happened. The Desync may after
 | |
|   all have occurred long ago, and just did not affect the checksum
 | |
|   up to now. The checksum may have matched 10 times or more
 | |
|   since the Desync happened, and only now the Desync has spiraled
 | |
|   enough to finally affect the checksum. (There was once a desync
 | |
|   which was only noticed by the checksum after 20 game years.)
 | |
| 
 | |
| ## 1.3) Typical causes of Desyncs
 | |
| 
 | |
|   Desyncs can be caused by the following scenarios:
 | |
|    - The savegame does not describe the complete gamestate.
 | |
|       - Some information which affects the progression of the
 | |
|         gamestate is not saved in the savegame.
 | |
|       - Some information which affects the progression of the
 | |
|         gamestate is not loaded from the savegame.
 | |
|         This includes the case that something is not completely
 | |
|         reset before loading the savegame, so data from the
 | |
|         previous game is carried over to the new one.
 | |
|    - The gamestate does not behave deterministic.
 | |
|       - Cache mismatch: The game logic depends on some cached
 | |
|         values, which are not invalidated properly. This is
 | |
|         the usual case for NewGRF-specific Desyncs.
 | |
|       - Undefined behaviour: The game logic performs multiple
 | |
|         things in an undefined order or with an undefined
 | |
|         result. E.g. when sorting something with a key while
 | |
|         some keys are equal. Or some computation that depends
 | |
|         on the CPU architecture (32/64 bit, little/big endian).
 | |
|    - The gamestate is modified when it shall not be modified.
 | |
|       - The test-run of a command alters the gamestate.
 | |
|       - The gamestate is altered by a player or script without
 | |
|         using commands.
 | |
| 
 | |
| 
 | |
| ## 2.1) Cache debugging
 | |
| 
 | |
|   Desyncs which are caused by improper cache validation can
 | |
|   often be found by enabling cache validation:
 | |
|    - Start OpenTTD with '-d desync=2'.
 | |
|    - This will enable validation of caches every tick.
 | |
|      That is, cached values are recomputed every tick and compared
 | |
|      to the cached value.
 | |
|    - Differences are logged to 'commands-out.log' in the autosave
 | |
|      folder.
 | |
| 
 | |
|   Mind that this type of debugging can also be done in singleplayer.
 | |
| 
 | |
| ## 2.2) Desync recording
 | |
| 
 | |
|   If you have a server, which happens to encounter Desyncs often,
 | |
|   you can enable recording of the gamestate alterations. This
 | |
|   will later allow the replay the gamestate and locate the Desync
 | |
|   cause.
 | |
| 
 | |
|   There are two levels of Desync recording, which are enabled
 | |
|   via '-d desync=2' resp. '-d desync=3'. Both will record all
 | |
|   commands to a file 'commands-out.log' in the autosave folder.
 | |
| 
 | |
|   If you have the savegame from the start of the server, and
 | |
|   this command log you can replay the whole game. (see Section 3.1)
 | |
| 
 | |
|   If you do not start the server from a savegame, there will
 | |
|   also be a savegame created just after a map has been generated.
 | |
|   The savegame will be named 'dmp_cmds_*.sav' and be put into
 | |
|   the autosave folder.
 | |
| 
 | |
|   In addition to that '-d desync=3' also creates regular savegames
 | |
|   at defined spots in network time. (more defined than regular
 | |
|   autosaves). These will be created in the autosave folder
 | |
|   and will also be named 'dmp_cmds_*.sav'.
 | |
| 
 | |
|   These saves allow comparing the gamestate with the original
 | |
|   gamestate during replaying, and thus greatly help debugging.
 | |
|   However, they also take a lot of disk space.
 | |
| 
 | |
| 
 | |
| ## 3.1) Replaying
 | |
| 
 | |
|   To replay a Desync recording, you need these files:
 | |
|    - The savegame from when the server was started, resp.
 | |
|      the automatically created savegame from when the map
 | |
|      was generated.
 | |
|    - The 'commands-out.log' file.
 | |
|    - Optionally the 'dmp_cmds_*.sav'.
 | |
|   Put these files into a safe spot. (Not your autosave folder!)
 | |
| 
 | |
|   Next, prepare your OpenTTD for replaying:
 | |
|    - Get the same version of OpenTTD as the original server was running.
 | |
|    - Uncomment/enable the define 'DEBUG_DUMP_COMMANDS' in
 | |
|      'src/network/network_func.h'.
 | |
|      (DEBUG_FAILED_DUMP_COMMANDS is explained later)
 | |
|    - Put the 'commands-out.log' into the root save folder, and rename
 | |
|       it to 'commands.log'.
 | |
|    - Run 'openttd -D -d desync=3 -g startsavegame.sav'.
 | |
|      This replays the server log and creates new 'commands-out.log'
 | |
|      and 'dmp_cmds_*.sav' in your autosave folder.
 | |
| 
 | |
| ## 3.2) Evaluation of the replay
 | |
| 
 | |
|   The replaying will also compare the checksums which are part of
 | |
|   the 'commands-out.log' with the replayed gamestate.
 | |
|   If they differ, it will trigger a 'NOT_REACHED'.
 | |
| 
 | |
|   If the replay succeeds without mismatch, that is the replay reproduces
 | |
|   the original server state:
 | |
|    - Repeat the replay starting from incrementally later 'dmp_cmds_*.sav'
 | |
|      while truncating the 'commands.log' at the beginning appropriately.
 | |
|      The 'dmp_cmds_*.sav' can be your own ones from the first reply, or
 | |
|      the ones from the original server (if you have them).
 | |
|      (This simulates the view of joining clients during the game.)
 | |
|    - If one of those replays fails, you have located the Desync between
 | |
|      the last dmp_cmds that reproduces the replay and the first one
 | |
|      that fails.
 | |
| 
 | |
|   If the replay does not succeed without mismatch, you can check the logs
 | |
|   whether there were failed commands. Then you may try to replay with
 | |
|   DEBUG_FAILED_DUMP_COMMANDS enabled. If the replay then fails, the
 | |
|   command test-run of the failed command modified the game state.
 | |
| 
 | |
|   If you have the original 'dmp_cmds_*.sav', you can also compare those
 | |
|   savegames with your own ones from the replay. You can also comment/disable
 | |
|   the 'NOT_REACHED' mentioned above, to get another 'dmp_cmds_*.sav' from
 | |
|   the replay after the mismatch has already been detected.
 | |
|   See Section 3.2 on how to compare savegames.
 | |
|   If the saves differ you have located the Desync between the last dmp_cmds
 | |
|   that match and the first one that does not. The difference of the saves
 | |
|   may point you in the direction of what causes it.
 | |
| 
 | |
|   If the replay succeeds without mismatch, and you do not have any
 | |
|   'dmp_cmd_*.sav' from the original server, it is a lost case.
 | |
|   Enable creation of the 'dmp_cmd_*.sav' on the server, and wait for the
 | |
|   next Desync.
 | |
| 
 | |
|   Finally, you can also compare the 'commands-out.log' from the original
 | |
|   server with the one from the replay. They will differ in stuff like
 | |
|   dates, and the original log will contain the chat, but otherwise they
 | |
|   should match.
 | |
| 
 | |
| ## 3.3) Comparing savegames
 | |
| 
 | |
|   The binary form of the savegames from the original server and from
 | |
|   your replay will always differ:
 | |
|    - The savegame contains paths to used NewGRF files.
 | |
|    - The gamelog will log your loading of the savegame.
 | |
|    - The savegame data of AIs and the Gamescript will differ.
 | |
|      Scripts are not run during the replay, only their recorded commands
 | |
|      are replayed. Their internal state will thus not change in the
 | |
|      replay and will differ.
 | |
| 
 | |
|   To compare savegame more semantically, there exist some ugly hackish
 | |
|   tools at:
 | |
|      http://devs.openttd.org/~frosch/texts/zpipe.c
 | |
|      http://devs.openttd.org/~frosch/texts/printhunk.c
 | |
| 
 | |
|   The first one decompresses OpenTTD savegames. The second one creates
 | |
|   a textual representation of an uncompressed savegame, by parsing hunks
 | |
|   and arrays and such. With both tools you need to be a bit careful
 | |
|   since they work on stdin and stdout, which may not deal well with
 | |
|   binary data.
 | |
| 
 | |
|   If you have the textual representation of the savegames, you can
 | |
|   compare them with regular diff tools.
 | 
