pikdum's blog


Building a WoW server in Elixir

Thistle Tea is my new World of Warcraft private server project. You can log in, create a character, run around, and cast spells to kill mobs, with everything synchronized between players as expected for an MMO. I’d had the idea of building this floating around in my head for a while, since I have an incurable nostalgia for early WoW. I first mentioned it on May 13th, and didn’t expect to get any further than login, character creation, and spawning into the map. Here’s a recap of the first month of development.

Day 0

Before coding, I did some research and came up with a plan.

Day 1 - June 2nd

There are two parts that needed to be built: the authentication server and the game server. Up first was authentication, since you can’t do anything without logging in.

My plan was to build a MITM proxy between the client and a MaNGOS server to log all packets. It wasn’t as useful as expected, but it did help me internalize how the requests and responses worked.

The auth flow can be simplified as:

Packets have a header section that contains the opcode, or message type, and the size, followed by the payload.

It uses SRP6, which I hadn’t heard of before this. Seems like the idea is to avoid transmitting an unencrypted password, and instead both the client and server independently calculate something that only matches if they both know the correct password.
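To convince myself of that property, here’s a toy sketch of the SRP6 math with tiny numbers. The ToySRP6 module is hypothetical, and the parameters (N = 23, g = 5, k = 3) and the fixed a, b, u, and x values are made up for illustration and nowhere near secure. As far as I know, the real protocol uses a 32-byte prime, derives x and u from SHA-1 hashes, and picks a and b randomly.

```elixir
defmodule ToySRP6 do
  # Hypothetical toy parameters: far too small to be secure.
  @n 23
  @g 5
  @k 3

  # Square-and-multiply modular exponentiation on plain integers.
  def mod_pow(_base, 0, _mod), do: 1

  def mod_pow(base, exp, mod) do
    half = mod_pow(base, div(exp, 2), mod)
    sq = rem(half * half, mod)
    if rem(exp, 2) == 1, do: rem(sq * base, mod), else: sq
  end

  # The server stores a verifier instead of the password: v = g^x mod N.
  def verifier(x), do: mod_pow(@g, x, @n)

  # Server public value: B = (k*v + g^b) mod N.
  def server_public(v, b), do: rem(@k * v + mod_pow(@g, b, @n), @n)

  # Client public value: A = g^a mod N.
  def client_public(a), do: mod_pow(@g, a, @n)

  # Client side: S = (B - k*g^x)^(a + u*x) mod N. Only the client knows x.
  def client_secret(b_pub, a, u, x) do
    base = Integer.mod(b_pub - @k * mod_pow(@g, x, @n), @n)
    mod_pow(base, a + u * x, @n)
  end

  # Server side: S = (A * v^u)^b mod N. The server only knows v.
  def server_secret(a_pub, v, u, b) do
    mod_pow(rem(a_pub * mod_pow(v, u, @n), @n), b, @n)
  end
end
```

With x = 6, a = 4, b = 7, and u = 2, both client_secret/4 and server_secret/4 come out to 2, even though the server never sees x, only the verifier.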

So basically, what I needed to do was:

This whole part is well documented, but I still ran into some issues with the cryptography. Luckily, I found a blog post and an accompanying Elixir implementation, so I was able to substitute my broken cryptography with working cryptography. Without that, I would’ve been stuck at this part for a very long time (maybe forever). Still wasn’t able to get login working on day 1, but I was close.

Links:

Day 2 - June 3rd

I spent some time cleaning up the code and realized I had a logic error where I reversed the byte order of some crypto values that weren’t supposed to be reversed. Fixing that made auth work, and I finally got a successful login with hardcoded credentials.
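Byte order is an easy thing to get wrong here. As far as I understand it, the auth protocol sends the big SRP6 numbers (like A, B, N, and the salt) as little-endian byte arrays, while hash outputs go over as raw digests, so only some values need flipping. A small sketch of converting with Erlang’s :binary module; the SrpBytes module name and the fixed width are assumptions for illustration.

```elixir
defmodule SrpBytes do
  # Hypothetical helpers. Decode a little-endian byte array into an integer.
  def from_wire(bin) when is_binary(bin), do: :binary.decode_unsigned(bin, :little)

  # Encode an integer as a fixed-width little-endian byte array.
  # :binary.encode_unsigned/2 drops leading zero bytes, so use a sized
  # segment when the protocol expects an exact length. Values that don't
  # fit in `size_bytes` are silently truncated.
  def to_wire(int, size_bytes) when is_integer(int) and int >= 0 do
    bits = size_bytes * 8
    <<int::little-unsigned-size(bits)>>
  end
end
```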

Next up was getting the realm list to work, by handling CMD_REALM_LIST and returning which game server to connect to.

This got me out of the tedious auth bits and I could get to building the game server.

Links:

Day 3 - June 4th

The goal for today was to get spawned into the world. But first, more tedious auth bits.

The game server auth flow can be simplified as:

This basically negotiates how to encrypt/decrypt future packet headers. Luckily, Shadowburn also had crypto code for this, so I was able to use it here.
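My understanding of that header cipher for 1.12 (it’s what MaNGOS-derived servers implement) is a simple rolling scheme: each header byte is XORed with the next session key byte and added to the previous encrypted byte. A minimal sketch with a made-up key and a hypothetical HeaderCrypt module; the real key comes out of the SRP6 exchange. The key index and the last byte carry over from one packet to the next, so the state has to be threaded through every call.

```elixir
defmodule HeaderCrypt do
  import Bitwise

  # i: current position in the key, j: previous encrypted byte.
  defstruct key: <<>>, i: 0, j: 0

  def new(key) when byte_size(key) > 0, do: %__MODULE__{key: key}

  # Encrypt each byte: x = ((b XOR key[i]) + j) mod 256, then j = x.
  def encrypt(data, %__MODULE__{} = st) do
    {out, i, j} =
      for <<b <- data>>, reduce: {<<>>, st.i, st.j} do
        {acc, i, j} ->
          x = (bxor(b, :binary.at(st.key, i)) + j) &&& 0xFF
          {<<acc::binary, x>>, next(i, st.key), x}
      end

    {out, %{st | i: i, j: j}}
  end

  # Decrypt each byte: b = ((x - j) mod 256) XOR key[i], then j = x.
  def decrypt(data, %__MODULE__{} = st) do
    {out, i, j} =
      for <<x <- data>>, reduce: {<<>>, st.i, st.j} do
        {acc, i, j} ->
          b = bxor((x - j) &&& 0xFF, :binary.at(st.key, i))
          {<<acc::binary, b>>, next(i, st.key), x}
      end

    {out, %{st | i: i, j: j}}
  end

  defp next(i, key), do: rem(i + 1, byte_size(key))
end
```

Decrypting a second header with a fresh state, instead of the one returned by the first call, generally produces garbage.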

After that, it’s something like:

First is handling CMSG_CHAR_CREATE and CMSG_CHAR_ENUM:

Then I got side-tracked for a bit trying to get equipment to show up, since I had all the equipment display bytes just hardcoded to 0 before.

After that was handling CMSG_PLAYER_LOGIN. I found an example minimal SMSG_UPDATE_OBJECT spawn packet, which was supposed to spawn me in Northshire Abbey.

That’s probably the most important packet, since it does everything from:

It has a lot of different forms, can update multiple objects in a single packet, and has a compressed variant.

Whoops, had the coordinates a bit off. After fixing that, I was in the human starting area as expected. No player model yet, though.

Next up was adding more to that spawn packet to use the player race and proper starting area. The starting areas were grabbed from a MaNGOS database that I converted over to SQLite.

Last for the night was to get logout working.

The implementation was something like:

This was the first piece that really took advantage of Elixir’s message passing.
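The general idea, as a minimal sketch: when the client asks to log out, the connection process schedules a message to itself with Process.send_after/3, and a cancel request stops that timer with Process.cancel_timer/1. The LogoutDemo module, its message names, and the notify pid are hypothetical stand-ins for sending the real logout packets, and the delay is a parameter so it’s easy to try out.

```elixir
defmodule LogoutDemo do
  use GenServer

  # `notify` stands in for the client socket; `delay_ms` for the logout countdown.
  def start_link(notify, delay_ms), do: GenServer.start_link(__MODULE__, {notify, delay_ms})

  def request_logout(pid), do: GenServer.call(pid, :logout_request)
  def cancel_logout(pid), do: GenServer.call(pid, :logout_cancel)

  @impl true
  def init({notify, delay_ms}), do: {:ok, %{notify: notify, delay: delay_ms, timer: nil}}

  @impl true
  def handle_call(:logout_request, _from, state) do
    # Schedule completion as a message to ourselves instead of blocking.
    timer = Process.send_after(self(), :logout_complete, state.delay)
    {:reply, :ok, %{state | timer: timer}}
  end

  def handle_call(:logout_cancel, _from, %{timer: nil} = state), do: {:reply, :ok, state}

  def handle_call(:logout_cancel, _from, state) do
    Process.cancel_timer(state.timer)
    {:reply, :ok, %{state | timer: nil}}
  end

  # Ignore a completion that was already in the mailbox when we cancelled.
  @impl true
  def handle_info(:logout_complete, %{timer: nil} = state), do: {:noreply, state}

  def handle_info(:logout_complete, state) do
    send(state.notify, :logged_out)
    {:noreply, %{state | timer: nil}}
  end
end
```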

The white chat box was weird, but it was nice being able to log in.

Links:

Day 4 - June 5th

First up was reorganizing the code, since my game.ex GenServer was getting too large.

My strategy for that was:

It worked, but it messed with line numbers in error messages and made things harder to debug.

After that, I wanted to generate that spawn packet properly rather than hardcoding. The largest piece of this was figuring out the update mask for the update fields.

Simplified: there are a ton of fields for objects, units, players, etc. Before the field values in an update message, there’s a bit mask with bits set at the offsets that correspond to the fields being sent. Without that mask, the client wouldn’t know which fields the values belong to. Luckily it’s all well documented, but it still took a while to implement.
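A minimal sketch of building that block, assuming the 1.12 layout as I understand it: a byte with the number of 32-bit mask blocks, then the mask blocks as little-endian integers, then only the values whose bits are set, in index order. The UpdateMask module is hypothetical, every value is treated as a 32-bit field, and the field indexes in the test are made up; real ones come from the documented update field tables.

```elixir
defmodule UpdateMask do
  import Bitwise

  # `fields` maps a field index to a 32-bit unsigned value.
  def build(fields) when map_size(fields) > 0 do
    indexes = fields |> Map.keys() |> Enum.sort()
    block_count = div(List.last(indexes), 32) + 1

    # Field idx sets bit (idx mod 32) in block (idx div 32).
    mask =
      for block <- 0..(block_count - 1), into: <<>> do
        bits =
          indexes
          |> Enum.filter(&(div(&1, 32) == block))
          |> Enum.reduce(0, fn idx, acc -> acc ||| (1 <<< rem(idx, 32)) end)

        <<bits::little-32>>
      end

    # Values follow in ascending index order, matching the set bits.
    values = for idx <- indexes, into: <<>>, do: <<Map.fetch!(fields, idx)::little-32>>

    <<block_count::8, mask::binary, values::binary>>
  end
end
```

For fields 0, 2, and 33, that gives two mask blocks (5 and 2) followed by the three values. Getting a single index or field size wrong shifts everything after it, which is how the odd visual glitches happen.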

Links:

Day 5 - June 6th

Referencing MaNGOS, I added some more messages that the server sends to the client after a CMSG_PLAYER_LOGIN. One of these, SMSG_ACCOUNT_DATA_TIMES, fixed the white chat box and keybinds being reset. I also added SMSG_COMPRESSED_UPDATE_OBJECT, which compresses the update packet with :zlib.compress/1.
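As far as I can tell, the compressed variant’s payload is the uncompressed size as a little-endian 32-bit integer, followed by the zlib-compressed update data. A sketch of both directions; the CompressedUpdate module and its function names are hypothetical.

```elixir
defmodule CompressedUpdate do
  # SMSG_COMPRESSED_UPDATE_OBJECT payload: original size, then zlib data.
  def compress(update_payload) when is_binary(update_payload) do
    <<byte_size(update_payload)::little-32, :zlib.compress(update_payload)::binary>>
  end

  # Inverse, handy for checking packets captured from a reference server.
  def decompress(<<size::little-32, data::binary>>) do
    payload = :zlib.uncompress(data)
    # Crash loudly if the size prefix doesn't match.
    ^size = byte_size(payload)
    payload
  end
end
```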

Movement would come up soon, so I started adding the handlers for those packets.

Day 6 - June 7th

In the update packet, I still had the object guid hardcoded. This is because it used a packed guid, and I needed to write some functions to handle that. Rather than the entire guid, a packed guid is a byte mask followed by all non-zero bytes. The byte mask has bits set that correspond to where the following bytes go in the unpacked guid.

This took a while, because the client was crashing when I changed the packed guid from <<1, 4>> to anything else. After trying different things, I realized that the guid was in two places in the packet and they needed to match. One fix later and things were working as expected.
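A minimal sketch of packing and unpacking, treating a guid as a 64-bit integer laid out little-endian. A guid of 4 packs to <<1, 4>>: mask bit 0 is set, followed by the single non-zero byte. The PackedGuid module is hypothetical; unpack/1 also returns the rest of the binary, since a packed guid usually sits in front of other packet data.

```elixir
defmodule PackedGuid do
  import Bitwise

  # Mask byte with bit i set when byte i of the guid is non-zero,
  # followed by just those non-zero bytes in order.
  def pack(guid) when is_integer(guid) and guid >= 0 do
    {mask, packed} =
      <<guid::little-64>>
      |> :binary.bin_to_list()
      |> Enum.with_index()
      |> Enum.reduce({0, <<>>}, fn
        {0, _i}, acc -> acc
        {b, i}, {mask, packed} -> {mask ||| (1 <<< i), <<packed::binary, b>>}
      end)

    <<mask, packed::binary>>
  end

  # Rebuild the 8 guid bytes, filling unset positions with zero.
  def unpack(<<mask, rest::binary>>) do
    {bytes, rest} =
      Enum.reduce(0..7, {[], rest}, fn i, {acc, bin} ->
        if (mask &&& (1 <<< i)) != 0 do
          <<b, tail::binary>> = bin
          {[b | acc], tail}
        else
          {[0 | acc], bin}
        end
      end)

    <<guid::little-64>> = bytes |> Enum.reverse() |> :binary.list_to_bin()
    {guid, rest}
  end
end
```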

Links:

Day 7 - June 8th

It was about time to start implementing the actual MMO features, starting with seeing other players. To test, I hardcoded another update packet after the player’s with a different guid, to try and spawn something.

Then I used a Registry to keep track of logged in players and their spawn packets. After entering the world, I would use Registry.dispatch/3 to:

After that, I added a similar dispatch when handling movement packets to broadcast movement to all other players. When broadcasting movement, the client message doesn’t even really need to be parsed, since the server message is essentially the same with the player’s guid prepended to the payload. This is where the choice of Elixir really started to shine, and I quickly had players able to see each other move around the screen.

I tested this approach with multiple windows open and it was very cool to see everything synchronized.
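A minimal sketch of that kind of broadcast, using a Registry with duplicate keys so every player process can register under the same key. Here the key is a map id, which is what the registries ended up keyed by. The PlayerBroadcast module and the message tuples are hypothetical; each player process would turn the message into a packet for its own client.

```elixir
defmodule PlayerBroadcast do
  def start_registry, do: Registry.start_link(keys: :duplicate, name: __MODULE__)

  # Called by a player's connection process when it enters the world.
  def join(map_id, guid), do: Registry.register(__MODULE__, map_id, guid)

  # Registry.dispatch/3 invokes the function with the {pid, value} entries
  # registered under the key; skip the sender so it doesn't echo to itself.
  def broadcast(map_id, sender, message) do
    Registry.dispatch(__MODULE__, map_id, fn entries ->
      for {pid, _guid} <- entries, pid != sender, do: send(pid, message)
    end)
  end
end
```

Because the registry entries are removed automatically when a process exits, a disconnected player simply stops receiving broadcasts.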

I added a handler for CMSG_NAME_QUERY so names would stop showing up as Unknown, and also despawned players with SMSG_DESTROY_OBJECT when they log out.

This is where I started noticing a bug: occasionally a packet wouldn’t decrypt successfully, which caused every later attempt to fail too, since there’s a counter as part of the decryption function. I couldn’t figure out how to resolve it or reliably reproduce it yet, though.

Up next was working on chat.

Links:

Day 8 - June 9th

To get chat working, I handled CMSG_MESSAGECHAT and broadcast SMSG_MESSAGECHAT to players, using Registry.dispatch/3 here too. I only focused on /say for now, and it goes to all players rather than just nearby ones.

While chasing that decryption bug, I found that sometimes I was receiving more than one packet at once and needed to split them. I added some logic to use the packet size in the header to grab only the right number of bytes, handling any leftovers separately. This seemed to help, but didn’t resolve the issue entirely.

Links:

Day 9 - June 10th

I still had authentication with a hardcoded username, password, and salt, so it was about time to fix that. Rather than go with PostgreSQL or SQLite for the database, I decided to go with Mnesia, since one of my goals was to learn more about Elixir and its ecosystem. I briefly tried plain :mnesia, but decided to use Memento for a cleaner interface.

So I added models for Account and Character and refactored everything to use them. Rather than save to the database every time the character changes, I just keep it in the process state and only save on logout or disconnect. I’m thinking of saving on CMSG_PING too, eventually. Right now data isn’t persisted to disk, since I’m still iterating on the data model, but that should be straightforward to toggle later.

Links:

Day 10 - June 11th

Today was about standardizing the logging, handling a bit more of chat, and handling an unencrypted CMSG_PING. I was thinking that could be part of the intermittent issues too, but looking back I don’t think I’ve ever had my client send that unencrypted anyway.

Day 11 - June 12th

I wanted equipment working so players weren’t naked all the time, so I started on that. Using the MaNGOS item_template table, I wired things up to set random equipment on character creation. Then I got that added to the response to CMSG_CHAR_ENUM, so they would show up in the login screen.

Up next was getting it showing in game.

Day 12 - June 13th

Took a bit to figure out the proper offsets for each piece of equipment in the update mask, but eventually got it working.

By adding it to the function that builds the update object packet, it also just worked for showing other players’ equipment.

Day 13 - June 14th

I had player movement synchronizing between players properly, so I wanted to get sitting working too.

Whoops. Weird things happen when field offsets or sizes are incorrect when building that update mask.

After that, I wanted to play around a bit by randomizing equipment on every jump. Here I learned that you need to send all fields in the update object packet, like health, or they get reset. I was trying to just send the equipment changes, but I’d die on every jump.

After making sure to send all fields, it was working as expected.

Day 14 - June 15th

Took a break.

Day 15 - June 16th

Today was refactoring and improvements. I reworked things into proper modules, since it was getting hard to debug when all the line numbers were wrong. So now game.ex called the appropriate module’s handle_packet/3 function.
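A sketch of that kind of routing, where game.ex only looks up which module owns an opcode and calls its handle_packet/3. The handler modules are hypothetical, the three arguments (opcode, payload, connection state) are an assumption, and the opcode values are the ones I’d expect from the usual 1.12 opcode tables, so double-check them against a reference.

```elixir
defmodule Game.CharacterHandler do
  # Hypothetical: a real handler would build SMSG_CHAR_ENUM here.
  def handle_packet(_opcode, _payload, state),
    do: {:ok, Map.update(state, :handled, 1, &(&1 + 1))}
end

defmodule Game.ChatHandler do
  # Hypothetical: a real handler would parse and broadcast the message.
  def handle_packet(_opcode, payload, state), do: {:ok, Map.put(state, :last_chat, payload)}
end

defmodule Game.Router do
  require Logger

  @handlers %{
    # CMSG_CHAR_ENUM
    0x037 => Game.CharacterHandler,
    # CMSG_MESSAGECHAT
    0x095 => Game.ChatHandler
  }

  def route(opcode, payload, state) do
    case Map.fetch(@handlers, opcode) do
      {:ok, module} ->
        module.handle_packet(opcode, payload, state)

      :error ->
        Logger.debug("Unhandled opcode 0x#{Integer.to_string(opcode, 16)}")
        {:ok, state}
    end
  end
end
```

Since each handler lives in its own module, stack traces point at real file and line numbers again.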

I also reworked things so players were spawned with their current position, rather than the original that I saved in the registry. This included some rework to make building an update packet more straightforward.

Day 16 - June 17th

Today was just playing around and no code changes.

Not sure why the model is messed up here, but it seems like it’s something with my computer rather than anything server related.

Day 17 - June 18th

The world was feeling a bit empty, so I wanted to spawn in mobs. First was hardcoding an update packet that should spawn a mob, and having it trigger on /say.

After that, I used the creature table of the MaNGOS database to get proper mobs spawning. I used a GenServer for this, so every mob would be a process and keep track of their own state. First I hardcoded a few select ids in the starting area to load, and after that worked I loaded them all.

Rather than spawn all ~57k mobs for the player, though, I wired things up to only spawn things within a certain range. This looked like:

It worked really well, and I could run around and see the mobs.

Next up was optimization and despawning mobs that were now out of range.

Day 18 - June 19th

For optimization, I didn’t want to send duplicate spawn packets for mobs that were already spawned. I also wanted to despawn mobs that were out of range. To do this, I used ETS to track which guids were spawned.

In the dispatch, the logic was:

Despawn was done through the same SMSG_DESTROY_OBJECT packet that I used for player logout.
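A minimal sketch of the range check plus the spawn/despawn bookkeeping, assuming mobs are given as {guid, x, y, z} tuples and the spawned set lives in an ETS :set table owned by the player’s process. The SpawnTracker module and the 100-unit range are made up.

```elixir
defmodule SpawnTracker do
  # Hypothetical spawn range, in world units.
  @range 100.0

  def new_table, do: :ets.new(:spawned_guids, [:set, :private])

  # Compare squared distances to avoid a sqrt per mob.
  def in_range?({px, py, pz}, {_guid, x, y, z}) do
    dx = x - px
    dy = y - py
    dz = z - pz
    dx * dx + dy * dy + dz * dz <= @range * @range
  end

  # Returns {guids_to_spawn, guids_to_despawn} and updates the table.
  def update(table, player_pos, mobs) do
    nearby = for mob <- mobs, in_range?(player_pos, mob), do: elem(mob, 0)
    nearby_set = MapSet.new(nearby)

    # In range but not yet spawned for this player.
    to_spawn = Enum.reject(nearby, &:ets.member(table, &1))

    # Spawned earlier but now out of range.
    to_despawn =
      for {guid} <- :ets.tab2list(table), not MapSet.member?(nearby_set, guid), do: guid

    Enum.each(to_spawn, &:ets.insert(table, {&1}))
    Enum.each(to_despawn, &:ets.delete(table, &1))
    {to_spawn, to_despawn}
  end
end
```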

After getting that working, I ran around the world and explored for a bit.

Found a bug in Westfall. Turns out I wasn’t separating mobs by map, so Westfall had mobs from Silithus mixed in. To fix, I reworked both the mob and player registries to use map as the key.

Having mobs standing in place was a bit boring, so I wanted them to move around. Turns out this is pretty complicated, and I’ll actually have to read the maps so mobs stay on the ground properly. There are a few projects for this, but they’re all a bit difficult to include in an Elixir project. I’m thinking I’ll look into RPC, but there’s a chance that might not be performant enough.

The standard update object packet can be used for mob movement here, but it looks like there might be some more specialized packets to look into later too.

Without the map data, I couldn’t really get the server movement to line up with what happened in the client. So, I settled with getting mobs to spin at random speeds.

That was a bit silly and used a lot of CPU, so I tweaked it to just randomly change orientation instead.

Links:

Day 19 - June 20th

Here I got mob names working by implementing CMSG_CREATURE_QUERY. This seemed to crash when querying mobs that didn’t have a model, so I removed them from being loaded. I also started loading in mob movement data and optimized the query a bit.

I also finally got some people to help me test the networking. It didn’t start very well.

Turns out I hadn’t tested this locally since adding mobs, and the player + mob spawn/despawns were conflicting with each other due to guid collisions. So players were being constantly spawned in and out.

I did some emergency patching to make it so players are never despawned, even out of range. I also turned off /say spawning boars, since that was getting annoying. That worked for now.

There were still some major issues. My helper had 450 ms latency and would crash when running to areas with a lot of mobs. I couldn’t reproduce it, though, with my 60 ms latency.

Links:

Day 20 - June 21st

To reproduce that issue, I set things up so I could connect to my local server from my laptop on the same network. On my laptop, I used tc to simulate a lot of latency and wired things up so equipment would change on any movement instead of only on jumps. This sent a ton of packets when spinning, and I was finally able to reproduce it.

Turns out, the crashing issues were from the server not receiving an entire packet, but still trying to decrypt and handle it. I was handling if the server got more than one packet, but not if the server got a partial packet.

Referencing Shadowburn, the fix for this is to let the packet data accumulate until there’s enough to handle. This seems to have fixed all the network-related issues.
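A minimal sketch of that accumulate-then-split approach on plaintext data, assuming the 1.12 client header layout as I understand it: a 2-byte big-endian size (counting the opcode and payload, not itself) followed by a 4-byte little-endian opcode. The PacketBuffer module is hypothetical and leaves out header decryption. Since the cipher state advances with every byte, a real implementation has to decrypt each header exactly once and remember it, rather than decrypting again while it waits for the rest of the packet.

```elixir
defmodule PacketBuffer do
  # Append newly received bytes to the leftovers and split out every complete
  # packet. Returns {[{opcode, payload}], leftover_bytes}.
  def feed(buffer, data), do: split(buffer <> data, [])

  # A full packet is 2 size bytes plus `len` bytes (4-byte opcode + payload).
  defp split(<<len::big-16, rest::binary>>, acc) when len >= 4 and byte_size(rest) >= len do
    payload_size = len - 4
    <<opcode::little-32, payload::binary-size(payload_size), tail::binary>> = rest
    split(tail, [{opcode, payload} | acc])
  end

  # Partial packet (or nothing left): keep the bytes for the next read.
  defp split(leftover, acc), do: {Enum.reverse(acc), leftover}
end
```

The connection process keeps the leftover binary in its state and passes it back into feed/2 on the next TCP message.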

To fix the guid collision issue, I added a large offset to creature guids, so they’ll never conflict.

Day 21 - June 22nd

Took a break.

Day 22 - June 23rd

Worked on CMSG_ITEM_NAME_QUERY a bit, but I think there’s still something wrong here.

Decided spells would be next, so I started that. First was sending spells over with SMSG_INITIAL_SPELLS on login, using the initial spells in MaNGOS. Everything was instant cast though, for some reason.

Turns out I needed to set unit_mod_cast_speed in the player update packet for cast times to show up properly in the client.

I started by handling CMSG_CAST_SPELL, which would send a successful SMSG_CAST_RESULT after the spell cast time, so other spells could be cast. I also handled CMSG_CANCEL_CAST, to cancel that timer. This looked a bit like the logout logic.

The starting animation for casting a spell would play, but no cast bar or anything further.

Links:

Days 23 to 26 - June 24th to 27th

Took a longer break.

Day 27 - June 28th

I was able to get a cast bar showing up by sending SMSG_SPELL_START after reading the cast spell packet.

The projectile effect took a bit longer to figure out. I needed to send a SMSG_SPELL_GO after the cast was complete, with the proper target guids.

Links:

Day 28 - June 29th

I got self-cast spells working, by setting the target guid to the player’s guid.

Day 29 - June 30th

Another break.

Day 30 - July 1st

Since I had spells somewhat working, I had to clean up the implementation. I dispatched the SMSG_SPELL_START and SMSG_SPELL_GO packets to nearby players and fixed spell cancelling, so that moving would cancel a cast as expected.
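A rough sketch of that casting flow as a small state machine: CMSG_CAST_SPELL sends SMSG_SPELL_START right away and schedules completion after the cast time; completion sends SMSG_SPELL_GO with the target guids and a successful SMSG_CAST_RESULT; a cancel (or movement) drops the timer; and a missing target means a self-cast. The SpellCast module and the action tuples are hypothetical and packet bodies are omitted. The connection process would carry out the returned actions (sending, dispatching to nearby players, starting or cancelling timers).

```elixir
defmodule SpellCast do
  defstruct guid: nil, casting: nil

  def new(guid), do: %__MODULE__{guid: guid}

  # CMSG_CAST_SPELL: show the cast bar now, finish after the cast time.
  # A nil target is treated as a self-cast on the caster's own guid.
  def start(%__MODULE__{casting: nil} = st, spell_id, target_guid, cast_ms) do
    target = target_guid || st.guid

    actions = [
      {:send, {:smsg_spell_start, st.guid, spell_id, target}},
      {:schedule, :cast_complete, cast_ms}
    ]

    {actions, %{st | casting: %{spell: spell_id, target: target}}}
  end

  # Already casting: this sketch just ignores the new request.
  def start(st, _spell_id, _target_guid, _cast_ms), do: {[], st}

  # Cast timer fired: SMSG_SPELL_GO with the hit targets, then the result.
  def complete(%__MODULE__{casting: %{spell: spell, target: target}} = st) do
    actions = [
      {:send, {:smsg_spell_go, st.guid, spell, [target]}},
      {:send, {:smsg_cast_result, spell, :success}}
    ]

    {actions, %{st | casting: nil}}
  end

  def complete(st), do: {[], st}

  # CMSG_CANCEL_CAST, or a movement packet arriving mid-cast.
  def cancel(%__MODULE__{casting: nil} = st), do: {[], st}
  def cancel(st), do: {[{:cancel_timer, :cast_complete}], %{st | casting: nil}}
end
```

Keeping sockets and timers out of the state machine makes it easy to test, and lets movement handlers simply call cancel/1 whenever a packet arrives mid-cast.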

Day 31 - July 2nd

I added levels to mobs, random from their minimum to maximum level, rather than hardcoding them to 1. Then I made spells do hardcoded damage, so mobs could die. I noticed that mobs would still change orientation when dead, so I added a check to only move if alive.
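A minimal sketch of a mob process along those lines: a random level between the minimum and maximum, a periodic timer that picks a new random orientation, damage that can bring health to zero, and a check so dead mobs stop turning. The Mob module, its message names, and the default tick interval are hypothetical.

```elixir
defmodule Mob do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def take_damage(pid, amount), do: GenServer.call(pid, {:damage, amount})
  def get_state(pid), do: GenServer.call(pid, :get_state)

  @impl true
  def init(opts) do
    # Random level in min_level..max_level instead of a hardcoded 1.
    state = %{
      level: Enum.random(opts[:min_level]..opts[:max_level]),
      health: opts[:health],
      orientation: 0.0,
      tick_ms: Keyword.get(opts, :tick_ms, 5_000)
    }

    schedule_turn(state.tick_ms)
    {:ok, state}
  end

  @impl true
  def handle_call({:damage, amount}, _from, state) do
    health = max(state.health - amount, 0)
    {:reply, health, %{state | health: health}}
  end

  def handle_call(:get_state, _from, state), do: {:reply, state, state}

  # Dead mobs stay put: no new orientation and no further ticks.
  @impl true
  def handle_info(:turn, %{health: 0} = state), do: {:noreply, state}

  def handle_info(:turn, state) do
    schedule_turn(state.tick_ms)
    {:noreply, %{state | orientation: :rand.uniform() * 2 * :math.pi()}}
  end

  defp schedule_turn(ms), do: Process.send_after(self(), :turn, ms)
end
```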

Future Plans

I plan on slowly working on this, adding more functionality as I go. My goal isn’t really a 1:1 Vanilla server, but more something that fits well with Elixir’s capabilities. I’d like to see how many players this approach can handle, and how it compares in performance to MaNGOS, eventually.

Some things on the list:

So still plenty more work to do. :)

