Networked Games -- More on Architectural Design

  • Jian Huang
  • Networked Games

    Large-Scale Computing

    Obviously, networked games involve a lot of considerations regarding parallelism. Using a different server to handle each zone, i.e. zone servers, is just one simple example (in fact, not a good idea if all the zone servers are on the same cluster).

    Let's consider a few numbers for comparison. They are in no way rigorous, but give some idea of scale:

    Knowing these, unless we are talking about millions of clients, the problem is not that massive. That is the reason I am teaching "networked games" rather than using the trendier term "massively parallel multi-player games". Basically, we are not at that scale yet.

    With networked games, here are some ballpark figures. For a single server, it would be extremely hard to put 50,000 users on the same machine. A console of today's capability has enough processing power to support around 30 to 50 simultaneous users, if we use it as a mini-server. To support 20,000 CPUs, we need ~4 MWatts of electricity, 70% of which goes to A/C.
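
    To see the per-CPU arithmetic behind this, here is a quick sketch; the only inputs are the figures above, with the 70/30 split between A/C and compute taken directly from them:

      # Rough power budget implied by the numbers above.
      TOTAL_WATTS = 4e6      # ~4 MWatts for the whole installation
      NUM_CPUS = 20_000
      AC_FRACTION = 0.70     # share consumed by A/C

      watts_per_cpu = TOTAL_WATTS / NUM_CPUS                      # ~200 W each, cooling included
      compute_watts = TOTAL_WATTS * (1 - AC_FRACTION) / NUM_CPUS  # ~60 W each for actual compute
      print(watts_per_cpu, compute_watts)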

    We have a communication need of O(N^2) complexity with N users. Assuming each new message is generated for the whole world to see, and each message is a small 50 bytes, a T1 line (1.5 Mbps) is good for about 19,000 players. A T3 (44.7 Mbps) allows us to support 104,000 users. Going up to OC-12 (622 Mbps), we can support 390,000 users. 1.5 million users would require an OC-192 with 10 Gbps of bandwidth.
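
    These figures follow a square-root law: with O(N^2) traffic, the supportable population grows as the square root of the bandwidth. A small Python sketch, calibrated to the T1 baseline above, reproduces them:

      # Back-of-the-envelope check of the O(N^2) bandwidth figures above.
      # With N players each broadcasting to all others, traffic grows as N^2,
      # so the supportable population scales as sqrt(bandwidth). We calibrate
      # the constant from the T1 baseline (19,000 players on 1.5 Mbps).
      import math

      T1_BPS = 1.5e6        # T1 baseline bandwidth
      T1_PLAYERS = 19_000   # players a T1 can carry, per the figures above

      def supportable_players(bandwidth_bps):
          """Players supportable on a link, scaling as sqrt(bandwidth)."""
          return int(T1_PLAYERS * math.sqrt(bandwidth_bps / T1_BPS))

      for name, bps in [("T3", 44.7e6), ("OC-12", 622e6), ("OC-192", 10e9)]:
          print(f"{name}: ~{supportable_players(bps):,} players")
      # Prints roughly 104,000 / 390,000 / 1,550,000 -- matching the figures above.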

    Looking at the numbers, do we have any hope of being really massive? :-) Well, we'd have to do it right; otherwise we hit either the processing-power bottleneck or the bandwidth limits.


    Existing Architectures

    As before, what we discuss here are results of general computer science research, not work by developers of computer games.

    There are in general two flavors: client-server (C/S) and peer-to-peer (P2P). Contrary to current perception, C/S and P2P have been peer technologies for decades; the two have co-existed basically from the very beginning. C/S has simply played a more dominant role over the years, so people have heard about it more often. That's all.

    C/S is the setup used by the classic mainframe computers of the 1960s. There, the mainframe carries out all the processing and monopolizes access to resources of all kinds. Users sit in front of terminals connected to the mainframe through some network connection. The interactivity achieved on the clients is real-time.

    This is a very mature technology, with all aspects of its deployment and maintenance widely understood. There is a strong mechanism to enforce security, with all controls centralized. If there is a need to support more terminals, the mainframe can be extended in a modular fashion.

    Most existing networked games use the C/S architecture. Zone servers are a good example. Unfortunately, the C/S setup is not the best for interactive game play. In the mainframe days, each terminal operated independently and was, in fact, a physically independent entity. But in networked games, players by nature interact with each other and with the virtual universe. The ensuing level of network communication could easily cripple any system not directly connected to the backbone of the Internet.

    Web servers are a natural fit for the C/S framework, because web requests have no relation to one another. With web caching, web servers can scale up well; under ideal scenarios, a modern web server can achieve O(lg(N)) complexity, with N being the number of concurrent users. In contrast, adopting C/S for networked games is done only out of convenience.

    Even if communication is not the bottleneck, it is still hard to put more than 50,000 players on one server. If a bunch of friends all sign up to play a game together and they end up on different servers, the results are obvious.


    In P2P, the term "peer" implies that each participant operates as both a client and a server at the same time. Email is a classic example of P2P, particularly the way mail servers collaborate among themselves to deliver mail. DNS servers operate in a similar manner.

    The advantage of P2P is its inherent scalability. With each participant bringing its own resources, such as processing power, storage and bandwidth, there is virtually no limit on the size of the system. For the same reason, there is no single point of failure. Collectively, P2P systems offer a massive amount of processing power and storage capability.

    The decentralized infrastructure of P2P gained immense popularity in file-sharing applications. Napster is an iconic system of this kind. After it ceased to operate, other successful examples followed, such as KaZaA and Gnutella.

    The weakness of P2P also stems from the decentralized structure. There is no way to tell the authenticity of the network traffic, or to control the behavior of the network in general.


    Must Haves

    So we know scalability and performance are really tough. Before pulling our hair out to design a perfect system, what are some of the desired properties?

    We want to support millions of users while staying in the comfort zone. We cannot afford to lose responsiveness or security. Storage for user-created content (a custom facial texture, for a simple example) will soon be a standard feature. A powerful and direct control mechanism must be in place. There cannot be a single point of failure that affects the entire game world. Finally, the cheaper the better.


    To be able to do all of the above is a grand goal. The community has not found a silver bullet for this yet. The best idea (still not time-tested) I have seen is by Max Skibinsky of HiveMind (thequest@h-mind.com). You are welcome to implement it and try it out. The following is a very simplified version of Skibinsky's concept.

    The idea starts with distinguishing different types of changes: significant vs. real-time. The significant changes take place on an interval of 15 to 30 minutes. A single server, with some infrastructural support, can comfortably handle the significant changes generated by millions of users. Real-time changes are those representing moment-to-moment gameplay actions.

    Take a game scenario as an example: encountering a monster, and killing or getting killed, are significant changes. The detailed punches, kicks, spell casts, etc. are real-time changes that only need to be kept consistent locally.
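
    As a minimal sketch of this split, routing could look like the following; the event names here are hypothetical, and the real taxonomy would be game-specific:

      # A minimal sketch of the significant vs. real-time split.
      from enum import Enum, auto

      class Scope(Enum):
          SIGNIFICANT = auto()  # persisted by the zone server, ~15-30 min cadence
          REAL_TIME = auto()    # kept consistent only within the local group

      # Hypothetical classification table for the monster-fight example above.
      EVENT_SCOPE = {
          "monster_encountered": Scope.SIGNIFICANT,
          "player_killed":       Scope.SIGNIFICANT,
          "monster_killed":      Scope.SIGNIFICANT,
          "punch":               Scope.REAL_TIME,
          "kick":                Scope.REAL_TIME,
          "spell_cast":          Scope.REAL_TIME,
      }

      def route(event):
          """Send significant events to the zone server, the rest to the group host."""
          if EVENT_SCOPE.get(event, Scope.REAL_TIME) is Scope.SIGNIFICANT:
              return "zone_server"
          return "group_host"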

    Based on this distinction, a concept of vicinity is introduced. Like in real life, from a distance we can see that people are waiting in line to get coffee, but what exactly is being ordered we cannot tell and probably don't care much about. In a game scene, we then distinguish the objects or players on which a given player can effect a change from the background, and likewise those that can effect changes on that player. For convenience, let's refer to these two groups as action/reaction groups.

    In a scene there can be many such groups, possibly overlapping one another. Each player can belong to multiple groups. Everything outside the groups is not of concern.
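
    One simple way to form such groups is by proximity. Below is a simplified sketch that unions players within a made-up influence radius; note that it collapses overlapping groups into one, whereas the full concept allows a player to sit in several groups at once, and a real game would use proper interest-management structures rather than an O(N^2) pairwise pass:

      # A simplified sketch of forming action/reaction groups by proximity.
      from itertools import combinations

      INFLUENCE_RADIUS = 30.0  # hypothetical: how far a player's actions reach

      def vicinity_groups(positions):
          """Union players into groups whenever they are within reach of each other.

          positions maps a player id to an (x, y) coordinate pair.
          """
          groups = {pid: {pid} for pid in positions}
          for a, b in combinations(positions, 2):
              (ax, ay), (bx, by) = positions[a], positions[b]
              if (ax - bx) ** 2 + (ay - by) ** 2 <= INFLUENCE_RADIUS ** 2:
                  merged = groups[a] | groups[b]
                  for pid in merged:        # all members share one group object
                      groups[pid] = merged
          # de-duplicate: each distinct set is one action/reaction group
          return list({id(g): g for g in groups.values()}.values())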

    For each group, we support all of its actions using P2P methods. The simplest scheme is to pick a different machine to host each group. All entities in the group correspond directly with that machine to keep all real-time changes up to date. In this way, the original zone servers only maintain significant events and keep the global database intact. The server process of each group corresponds with the zone server.
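
    A sketch of this hosting scheme might look like the following, with hypothetical class and method names; entities are assumed to expose a send() method, and the zone server a record() method:

      # A sketch of per-group hosting: real-time updates fan out to members,
      # only significant events travel up to the zone server.
      class GroupHost:
          def __init__(self, group_id, members, zone_server):
              self.group_id = group_id
              self.members = set(members)   # entities in this action/reaction group
              self.zone_server = zone_server
              self.state = {}               # authoritative real-time state for the group

          def on_update(self, entity, update, significant=False):
              self.state[entity] = update
              for m in self.members - {entity}:   # keep the group up to date
                  m.send(update)
              if significant:                     # only these reach the zone server
                  self.zone_server.record(self.group_id, entity, update)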

    For security purposes, the server process should not run on the machine of any player in the group. It is also a good idea to move the server process to a different machine on a regular basis (say, hourly), and to keep a backup copy as a checkpoint during the move.
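
    The migration policy could be sketched as follows (all names are hypothetical): checkpoint the host state, pick a machine not owned by any group member, and restart the process there:

      # A sketch of the hourly host migration described above.
      import random
      import time

      def migrate_periodically(host, machines, interval_s=3600):
          while True:
              time.sleep(interval_s)
              checkpoint = dict(host.state)        # backup copy kept as a checkpoint
              eligible = [m for m in machines
                          if m.owner not in host.members]  # never a group member's machine
              target = random.choice(eligible)
              target.start_group_host(host.group_id, checkpoint)
              host = target.current_host           # old process can now be retired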


    This concept is a mixture of C/S and P2P, using P2P for scalability and performance, and C/S for security, global consistency and control. It is not hard to implement, provided you have done your design right. Let's discuss it in more detail during group meetings.