Why Blizzard Stumbled With Diablo 3 Launch

A database administrator explains how Blizzard, having successfully run the servers of a massive MMORPG like WoW, managed to stumble with basic things on Diablo 3. If you’re short on time, the summary is basically this – WoW and D3 are very different beasts, and years of experience running one doesn’t mean you aren’t going to run into wholly new and unexpected problems when launching the other.

Hi. Database administrator here. The design of WoW and the design of D3 are different enough at the database level that knowing how to run one doesn’t make you an automatic expert at the other.

With WoW, the player base is naturally split into shards. Except you don’t call them “shards”, you call them servers. The biggest servers have around 30,000 active players (source), and that’s over the course of a month. Peak active load on any single server is probably more like 20,000 people, which is something a traditional RDBMS can easily handle when tuned well. Your userbase naturally segments itself into separate silos (servers), each with its own database and game-server. They’re self-contained, and problems inside one silo won’t necessarily affect the others. Later on Blizzard allowed some “inter-silo” activity with battlegroups, cross-server dungeons and battlegrounds, but at the end of the instance every player goes back to their home silo. The total userbase inside each silo is small enough that your well-tuned queries don’t hammer the disk, and everything moves along smoothly. If any one shard (server) scales too big, you gate logins with a queue and some fraction of the excess will say “fuck this queue shit” and go away, to another server or maybe even another game. Either way that shard stabilizes at a sustainable population, and you can turn off the login queue.
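To make the silo-plus-queue idea concrete, here is a minimal sketch of a single realm with a population cap and a login queue. The `Realm` class, its method names, and the capacity of 2 are all made up for illustration; this is not how Blizzard actually implements login queues.

```python
from collections import deque


class Realm:
    """Hypothetical WoW-style silo: one realm with a cap on concurrent players."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.online = set()
        self.queue = deque()

    def login(self, player):
        """Admit the player if there is room, otherwise put them in the queue."""
        if len(self.online) < self.capacity:
            self.online.add(player)
            return "online"
        self.queue.append(player)
        return "queued"

    def give_up(self, player):
        """A queued player gets fed up and goes elsewhere."""
        if player in self.queue:
            self.queue.remove(player)

    def logout(self, player):
        """Free a slot and admit the next queued player, if any."""
        self.online.discard(player)
        if self.queue and len(self.online) < self.capacity:
            self.online.add(self.queue.popleft())


realm = Realm("Hypothetical-1", capacity=2)
results = [realm.login(p) for p in ["ann", "bob", "cat", "dan"]]
realm.give_up("cat")   # some fraction of the excess walks away
realm.logout("ann")    # a slot frees up, so the next in line gets in
```

Nothing in here touches any other realm, which is the point: the cap, the queue, and any trouble stay inside one silo.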

Edit: Originally I referred to “vertical” partitioning above, which was incorrect, as pointed out by a very large string of numbers.

With Diablo 3, there are no discrete servers, so there are no natural shards. I guarantee they aren’t cramming the millions of us in each region into a single traditional database; it would crumble under the load. They are almost certainly using some kind of dynamic sharding, or possibly a very large NoSQL-type cluster. Either way it’s a major architectural shift that creates all sorts of unique problems that you need to design for. With dynamic sharding, you assign new users to a shard, and then your game server has to know which shard to go to in order to get data about each player in one of its games. Maybe you’re on shard A and your friend is on shard B. You guys buddy up and land on game server N, which now needs to pull data from two different shards. But whoops! The shard A DB is having problems. How does the game server handle this error condition? Does it dump just you? Does it shit-can your instance, even though your friend’s datastore is unaffected? From the datastore side of things, load and concurrency are a much bigger issue. Now instead of talking to one game server, you may have to talk to all of them. What if, for whatever reason, almost all of the players on shard N decided to play right now? What happens then? D3 isn’t a collection of neat, clean silos like WoW, it’s a mesh. Problems with one node in the mesh could impact other nodes, depending on the nature of the problem.
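Here is a rough sketch of the shard A / shard B scenario, assuming a simple directory that assigns each new player to the least-loaded shard. Every name in it (`Shard`, `Directory`, `start_game`, `ShardDown`) is hypothetical, and the policy shown, dropping only the players whose shard is down, is just one of the possible answers to the questions above, not a claim about what D3 does.

```python
class ShardDown(Exception):
    """Raised when a shard's database cannot be reached."""


class Shard:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.rows = {}

    def load(self, player):
        if not self.healthy:
            raise ShardDown(self.name)
        return self.rows[player]


class Directory:
    """Assigns each new player to the least-loaded shard and remembers the choice."""

    def __init__(self, shards):
        self.shards = shards
        self.assignment = {}

    def register(self, player, data):
        shard = min(self.shards, key=lambda s: len(s.rows))
        shard.rows[player] = data
        self.assignment[player] = shard
        return shard

    def shard_for(self, player):
        return self.assignment[player]


def start_game(directory, players):
    """Load every player; drop only those whose shard is unavailable."""
    loaded, dropped = {}, []
    for player in players:
        try:
            loaded[player] = directory.shard_for(player).load(player)
        except ShardDown:
            dropped.append(player)
    return loaded, dropped


shard_a, shard_b = Shard("A"), Shard("B")
directory = Directory([shard_a, shard_b])
directory.register("you", {"level": 30})      # lands on shard A
directory.register("friend", {"level": 28})   # lands on shard B
shard_a.healthy = False                        # whoops, shard A is having problems
loaded, dropped = start_game(directory, ["you", "friend"])
```

The game server now depends on two datastores at once, so it needs an explicit decision for the partial-failure case. In the silo model that case never comes up.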

Regarding NoSQL, that’s kind of an umbrella term for any datastore that’s not a traditional RDBMS. These systems “cut corners” on traditional RDBMS givens like ACID compliance in order to get much better scaling. So it’s possible they might have one Texas-level huge NoSQL system in each region. God knows NoSQL systems can scale to this level and far beyond. Google does it with their proprietary BigTable system, and there are a plethora of open-source and commercially available implementations of that model. I know for certain that Blizzard has at least some NoSQL-type stuff internally: this job post is looking for people with experience working with Cassandra (a NoSQL-type datastore whose data model is based on BigTable), memcached (an open-source memory caching tool often used for scaling applications to so-called “web-scale”), and Redis (another NoSQL-type datastore). That might be for the web side of Battle.net and not so much on the game side, I’m not sure. Anyway, you really need to understand how your datastore works when developing an application, and the skills you used to develop WoW (it used Oracle RAC, at least at launch) might not translate over to D3.

Also, I think you may be over-estimating Blizzard’s accomplishments, especially when it comes to launches of new services. Were you playing WoW at launch? Do you remember loot lag? If you do, then you know that Blizzard’s track record with wholly new online services is not good. I distinctly remember reading DBA job postings at Blizzard at the time of WoW’s launch, and gleaning insight into the problems they were having, based on the expertise they were looking for at the time.

Now, for the one example you listed: the auction house. WoW has an “auction house”, D3 has an “auction house”, same kettle o’ fish, right? Oh hell no! The D3 auction house is every item listed by every player in a region, in a game with far more loot drops. It is orders of magnitude larger than the AH on the biggest of WoW servers. You can’t easily shard the dataset either, because then you’d be forcing the game server or AH server to gather all the multiple segments and reassemble them in memory before it could answer questions like “show me all 2-handed bows between levels 25 and 30 with dexterity and life on hit and a socket and a buyout price less than 50,000 gold”. It’s probably its own separate server, maybe even its own cluster. I’m frankly amazed that searches and sales are as fast as they are, at least when it isn’t having issues.
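To show why sharding the AH dataset is awkward, here is a toy scatter-gather search, assuming listings are plain dicts spread across three in-memory “shards”. The field names, the sample data, and the `search_shard` / `search_all` helpers are invented for the example; I have no idea what the real schema looks like.

```python
import heapq


def search_shard(listings, predicate, limit):
    """Each shard filters locally and returns its cheapest `limit` matches."""
    matches = [item for item in listings if predicate(item)]
    return heapq.nsmallest(limit, matches, key=lambda item: item["buyout"])


def search_all(shards, predicate, limit):
    """Scatter the query to every shard, then merge the partial results in memory."""
    partials = [search_shard(shard, predicate, limit) for shard in shards]
    merged = [item for part in partials for item in part]
    return heapq.nsmallest(limit, merged, key=lambda item: item["buyout"])


# Made-up listings spread over three hypothetical shards.
shards = [
    [{"id": 1, "type": "bow", "level": 27, "buyout": 42000},
     {"id": 2, "type": "axe", "level": 26, "buyout": 1000}],
    [{"id": 3, "type": "bow", "level": 29, "buyout": 15000},
     {"id": 4, "type": "bow", "level": 31, "buyout": 5000}],
    [{"id": 5, "type": "bow", "level": 25, "buyout": 60000},
     {"id": 6, "type": "bow", "level": 30, "buyout": 30000}],
]


def wanted(item):
    """Bows between levels 25 and 30 with a buyout under 50,000 gold."""
    return (item["type"] == "bow"
            and 25 <= item["level"] <= 30
            and item["buyout"] < 50000)


top = search_all(shards, wanted, limit=2)
top_ids = [item["id"] for item in top]
```

Even in this toy version, every search touches every shard and the caller has to re-sort the merged results, so one slow or broken shard slows down or breaks every query.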

Final edit: A number of people have mentioned that scaling issues don’t really excuse what they see as feature gaps or lack of polish (e.g. missing functionality in AH search, quality of life features like locking action bars in elective mode, etc.). Totally valid point! Some feature omissions (e.g. rich AH search) may be design choices for performance reasons, and thus related to scaling, but many omissions clearly have no link to system scaling. My points are focused mainly on the stability of WoW’s live service vs. D3’s live service, and why there’s a perceived gap there. In terms of feature gaps or omissions, those are more likely to be due to development budget, pressure to launch, etc., issues that are wholly separate from scaling.
