Networked Games -- Automated Testing

  • Jian Huang

    Automated Testing

    We have already talked about some basic concepts of test-driven development in class, particularly how important testing really is. Online gaming is one of the most challenging kinds of software to test, so the current practice of automated testing in this field is very sophisticated and effective.

    Unfortunately, given the steep learning curve, it is very unlikely that you will fully adopt the cutting-edge practice. Nonetheless, let's still take a comprehensive look at this subject, hoping to achieve the following goals. When it is necessary for you to choose a testing framework, you know what to look for. When you need to retool some components of the best framework you can find for your project (this happens a lot!), you know where to start. Lastly, if you see a practice here that can be integrated into your current project, by all means try it!

    Throughout this discussion, always keep in mind the following:

        Automation leads to accurate, repeatable and measurable tests. It augments
        manual testing, but in no way replaces manual testing.
    

    What is it?

    In short, it is just to build systems (software and hardware) that can drive the game without human intervention, run the same tests repeatedly, and record the results for later analysis.

    From this perspective, testing is an equal partner with development. It requires its own hardware (e.g. testing servers), specialized tools, and its own staff to design, refine, analyze and maintain the tests as well as the testing system. Considering also that core functionality tests are indeed one of the best kinds of documentation of the overall design, testing is no less crucial a job than the development work itself in networked games.

    There are two major categories of tests: regression tests and load tests.

    For regression testing, you simply compare a specialized "test validation log" from known-good test runs against the current test results. You are already (or, should be) doing this in your projects. Please pay special attention to false fails caused by simple text differences, insignificant numerical variance (such as the 4th digit after the decimal point), as well as irrelevant data in the output. These can easily throw off a plain diff on your system.
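
    As a concrete illustration, here is a minimal Python sketch of the idea (file names and the log format are made up): normalize both logs to hide timestamps and insignificant numerical variance before diffing, so that only real differences count as fails.

        import difflib
        import re

        def normalize(line):
            """Hide irrelevant data so trivial differences do not cause false fails."""
            # Drop timestamps such as "12:34:56.789" -- irrelevant to correctness.
            line = re.sub(r'\b\d{2}:\d{2}:\d{2}(\.\d+)?\b', '<TIME>', line)
            # Round floating point numbers to 3 decimal places to hide tiny numerical variance.
            line = re.sub(r'-?\d+\.\d+', lambda m: '%.3f' % float(m.group()), line)
            return line

        def compare_logs(reference_path, current_path):
            """Diff a known-good validation log against the current run, after normalization."""
            with open(reference_path) as f:
                ref = [normalize(l) for l in f]
            with open(current_path) as f:
                cur = [normalize(l) for l in f]
            return list(difflib.unified_diff(ref, cur, 'reference', 'current'))

        diffs = compare_logs('golden_run.log', 'current_run.log')
        print('PASS' if not diffs else 'FAIL\n' + ''.join(diffs))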

    Load testing aims to tell whether each requested operation on the remote server passed or failed, and to record the response time for each request. It is common for a new module to function well for a while, but mysteriously fail as soon as the load on the server picks up. Here "fail" could mean either that the server crashed, or that the response time has become unacceptably long.

    With either regression or load tests, a large amount of generated output data is inevitable. The faster you can get a clear, summarized view of the results, the more useful the entire testing system is. A serious plan for a series of Report Generators should always be in order. You are not yet doing this for your projects, but it is not hard to implement even with commonplace tools like awk and ed. More advanced tools like perl, python, etc. are also good choices. Add this functionality to your bag of tricks, if possible.
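
    A report generator does not have to be fancy. Below is a minimal Python sketch (the log format and file name are invented for illustration) that rolls a raw results log of "request_name PASS|FAIL response_ms" lines into a one-screen summary.

        from collections import defaultdict

        def summarize(log_path):
            """Condense a raw results log into per-request pass/fail counts and response times."""
            stats = defaultdict(lambda: {'pass': 0, 'fail': 0, 'times': []})
            with open(log_path) as f:
                for line in f:
                    # Assumed line format: "<request_name> <PASS|FAIL> <response_ms>"
                    name, status, ms = line.split()
                    entry = stats[name]
                    entry['pass' if status == 'PASS' else 'fail'] += 1
                    entry['times'].append(float(ms))
            print('%-20s %6s %6s %10s %10s' % ('request', 'pass', 'fail', 'avg ms', 'peak ms'))
            for name, e in sorted(stats.items()):
                avg = sum(e['times']) / len(e['times'])
                print('%-20s %6d %6d %10.1f %10.1f' % (name, e['pass'], e['fail'], avg, max(e['times'])))

        summarize('load_test_results.log')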

    Note: it is not a good idea to trust return values as indicators of pass/fail. It is a lot more credible to dump the full contents of an object into a log via an independent operation (hmm, what design pattern is that?) and then check for errors.
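
    A minimal sketch of that idea (class and field names are hypothetical): every game object exposes a separate dump operation, the test harness writes the full dump into the validation log, and the regression diff catches the errors.

        class Player:
            """Toy game object that can dump its complete state for test validation."""
            def __init__(self, name, hp, x, y):
                self.name, self.hp, self.x, self.y = name, hp, x, y

            def dump_state(self):
                # Independent dump operation: every field goes into the log,
                # instead of trusting a pass/fail return value.
                return 'Player name=%s hp=%d pos=(%d,%d)' % (self.name, self.hp, self.x, self.y)

        def log_world_state(objects, log):
            """Write the full state of every object so the regression diff sees real errors."""
            for obj in objects:
                log.write(obj.dump_state() + '\n')

        with open('test_validation.log', 'w') as log:
            log_world_state([Player('hunter1', 100, 3, 7)], log)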


    Presentation Layer

    For obvious reasons, automated testing requires results to be repeatable. The goal is to emulate players in ways highly similar to actual fielded conditions. Surprisingly, it is generally not a good practice to generate emulated sequences of mouse and keyboard events. As automated testing runs thousands of tests, loading such tests quickly becomes prohibitively expensive.

    Usual techniques include: (1) algorithmic event generation, (2) event recorders, (3) packet snooping and replication, and (4) class-level unit testing. Great care is necessary in deciding how you use each technique. Let's discuss each separately.

    The event recorder is a good concept; however, its brute-force use is very counter-productive. The brute-force approach is to capture everything during game play, recording all events regardless of type and source. The major problem with doing so stems from how fast the source code evolves daily in even a moderately sized development team. It is common for an event recording to become useless as soon as a new build comes out. Note that with regression tests the key point is to test across all builds, so a crude event recorder is not suitable for general-purpose testing. It is necessary to record events at a much higher level of abstraction. Packet snooping suffers from similar drawbacks.
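
    For instance, rather than recording "mouse down at (412, 310)", record the presentation-level command that the input resolved to. A minimal sketch in Python, with invented command names:

        import json
        import time

        class HighLevelRecorder:
            """Records high-level commands rather than raw mouse/keyboard events,
            so the recording stays meaningful across daily builds."""
            def __init__(self, path):
                self.out = open(path, 'w')

            def record(self, command, **args):
                # e.g. record('BuyObject', item='chair') instead of pixel coordinates.
                self.out.write(json.dumps({'t': time.time(), 'cmd': command, 'args': args}) + '\n')

        recorder = HighLevelRecorder('session.events')
        recorder.record('EnterLot', lot_id=42)
        recorder.record('BuyObject', item='chair', slot=3)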

    Class-level unit testing is also a good idea, however not as useful to product development as one might expect. It is a great way to get to all the bugs, but given the time pressure, oftentimes it is better to do relentless unit testing at the feature level. Granted, this is purely economically driven, and may not be technically sound.

    Algorithmic testing is the most powerful form of testing. It pretty much covers all testing needs by dynamically generating events from an algorithm, or an emulator. Let's focus on how this works. The first question here is: how do you let a program control an application driven by a graphical user interface?

    Thankfully, the community has already worked hard on this front, through a painful process. EA's The Sims Online team implemented a great idea: add a layer between the game GUI and the game logic. In other words, shield [game logic] on local clients from local [game states and client interface]. They call this layer the Presentation Layer, which defines all high-level entry points into the game logic.


    With this abstraction and via the Command design pattern, the invoker of any method needs NO idea of what those objects really are or how the operations function. This allows a program to take the place of a user combined with the GUI. In their context, the regular client is called a View client, and the client driven by a program is termed a nullView client. nullView is the lightweight test client that is run automatically.
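
    Here is a minimal sketch of how that might look (all entry-point names are invented): the Presentation Layer exposes high-level operations, the Command pattern wraps them as objects, and a nullView client is simply a script of such commands replayed without any GUI.

        class PresentationLayer:
            """High-level entry points into the game logic; the only surface the GUI
            (View client) or a test driver (nullView client) ever touches."""
            def __init__(self, game_logic):
                self.logic = game_logic

            def enter_lot(self, lot_id):
                return self.logic.handle('enter_lot', lot_id=lot_id)

            def chat(self, text):
                return self.logic.handle('chat', text=text)

        class Command:
            """Command pattern: the invoker holds a callable plus its arguments, nothing more."""
            def __init__(self, func, **kwargs):
                self.func, self.kwargs = func, kwargs

            def execute(self):
                return self.func(**self.kwargs)

        def run_null_view(script):
            """A nullView client: replay a list of commands with no GUI at all."""
            for cmd in script:
                cmd.execute()

        class FakeGameLogic:
            def handle(self, op, **kw):
                print('game logic received', op, kw)

        pl = PresentationLayer(FakeGameLogic())
        run_null_view([Command(pl.enter_lot, lot_id=7), Command(pl.chat, text='hello')])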

    In a way, our timebox 2 is about the nullView client (only game logic, no graphics :-)). It's just that you were not required to define all the entry points in a Presentation Layer.

    Additionally, the Presentation Layer provides an ideal place for event recorders to operate.


    Ordering Events

    It is not sufficient to just order events independently on different processors, whether time-stamp based or not. In networked games, distributed synchronization primitives are indispensable.

    The event generators need to provide 4 basic primitives: (1) WaitUntil (e.g. wait until there are 3 players), (2) Rendezvous (e.g. all hunters be on the field), (3) WaitFor (e.g. wait for 5 seconds), and (4) RemoteCommand (e.g. to issue a command to be executed remotely). With these primitives, it is then possible to implement a very functional event generator to emulate distributed runs.
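
    As a sketch of how a test script might use those primitives, here is a toy single-process stand-in in Python (the emulated-client interface, i.e. at_barrier() and send(), is assumed):

        import time

        class EventGenerator:
            """Toy stand-in for a distributed event generator exposing the four primitives."""
            def wait_until(self, condition, poll=0.5, timeout=60):
                # WaitUntil: block until a predicate on game state holds (e.g. 3 players joined).
                deadline = time.time() + timeout
                while not condition():
                    if time.time() > deadline:
                        raise TimeoutError('condition never became true')
                    time.sleep(poll)

            def wait_for(self, seconds):
                # WaitFor: plain timed wait (e.g. wait for 5 seconds).
                time.sleep(seconds)

            def rendezvous(self, clients, barrier):
                # Rendezvous: block until every emulated client reports it reached the barrier.
                self.wait_until(lambda: all(c.at_barrier(barrier) for c in clients))

            def remote_command(self, client, command, **args):
                # RemoteCommand: ship a command to another test client for remote execution.
                client.send(command, **args)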


    Testing for Stability During Development

    Automated testing is great for stabilizing your code base. Besides build regression, there are two other tests that you can run: (1) sniff test and (2) monkey test.


    A sniff test is a lightweight test that only runs through a critical path of the game code. Obviously, you need multiple sniff tests for different critical paths. The idea is that every developer must run all sniff tests (usually just a small number of them, say 2 or 3) before check-in, thereby catching all obvious bugs without worrying about currently incomplete features to be implemented in future timeboxes.
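
    In code, a sniff test can be as simple as a handful of assertions walking one critical path through the Presentation Layer sketched earlier (the entry points and their return values here are invented):

        def sniff_login_and_chat(presentation):
            """Sniff test: one critical path every developer runs before check-in."""
            assert presentation.login('test_user', 'test_pass') == 'OK'
            assert presentation.enter_lot(lot_id=1) == 'OK'
            assert presentation.chat('hello') == 'OK'
            assert presentation.logout() == 'OK'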

    The smoke test is often discussed together with the sniff test, since a smoke test is more about breadth of coverage while a sniff test takes a very small number of point samples. A smoke test usually contains about 20 to 30 tests, each testing a major feature of the game. (Imagine smoke coming out of an electric circuit board.) It is not uncommon for a smoke test to run for an hour, and it is hence run less frequently by the development team than sniff tests.

    So here is the sequence followed by a developer. New code is written, and assumed to be working after debugging. Then the sniff tests are run, more bugs are caught, and this process iterates until all bugs at this stage are found. Then the code is checked into the repository. After a build, smoke tests are run. (Actually, it is quite a big deal to fail smoke tests, since they are publicized to all team members as the expected quality of work. Failing a smoke test could mean problems with team discipline.) After the smoke tests pass, regression and load tests are run automatically, usually after work hours.


    With monkey testing, each monkey grabs the latest code in the repository, builds a new test client (i.e. nullView), and runs the client against that day's Reference server. Monkey tests are run hourly. Each monkey test is in fact a unit test of one core functionality element on a critical path (the ones covered by sniff tests, for example).

    By recording the results of monkey tests, the team not only has a record of when new problems were introduced and what new progress has been made, but also gains, in an indirect way, a view of how the reference server performed under stress.
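
    A rough sketch of one hourly monkey, with the version-control, build and run commands all as placeholders:

        import datetime
        import subprocess

        def run_monkey(workdir, test_name):
            """One monkey: sync the latest code, build a nullView client, run one
            critical-path unit test against the Reference server, and log the outcome."""
            steps = [
                ['git', 'pull'],                      # grab the latest code (placeholder VCS command)
                ['make', 'nullview'],                 # build the test client (placeholder build command)
                ['./nullview', '--test', test_name,   # run one unit test (placeholder run command)
                 '--server', 'reference.example.com'],
            ]
            status = 'PASS'
            for cmd in steps:
                if subprocess.run(cmd, cwd=workdir).returncode != 0:
                    status = 'FAIL at ' + ' '.join(cmd)
                    break
            with open('monkey.log', 'a') as log:
                log.write('%s %s %s\n' % (datetime.datetime.now().isoformat(), test_name, status))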


    Testing for Scalability

    Automation is the only way to do this kind of testing: load tests. In a load test, the overall system is assumed to function properly under light loads, and the code is in reasonably good shape. Then, the number of test clients is gradually increased until something crashes.

    With load tests, there is no need to guess (or, defend your guess :-)) about what will prevent scaling. It is crucial here to have repeatable tests, as load testing always reveals more problems than the team can hope to fix. Repeatable tests make it feasible to at least verify the handful of things that did get fixed. In load testing, there is also a distinction between common use cases and extreme cases. Save load tests on extreme cases for last.

    In a load test, "fail" could also mean that some metrics are not met. Try to record the following important ones: average and peak client latency, number of supportable clients, mean time between failures, and resource utilization (CPU, memory, network traffic and number of page faults) on each server.
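
    A minimal sketch of the ramp-up loop (the emulated-client class, its do_one_request() method and the latency budget are all invented): keep adding clients and record latencies until a request fails or the latency budget is blown.

        import statistics
        import time

        def ramp_up(make_client, max_clients, latency_budget_s=1.0):
            """Add emulated clients one at a time; stop when requests fail or latency explodes."""
            clients, latencies = [], []
            for n in range(1, max_clients + 1):
                clients.append(make_client())
                # Every client issues one round of its common-use-case requests.
                for c in clients:
                    start = time.time()
                    ok = c.do_one_request()          # assumed client API
                    latencies.append(time.time() - start)
                    if not ok:
                        print('FAIL: request error with %d clients' % n)
                        return n
                avg, peak = statistics.mean(latencies), max(latencies)
                print('%d clients: avg %.3fs peak %.3fs' % (n, avg, peak))
                if peak > latency_budget_s:
                    print('FAIL: latency budget exceeded with %d clients' % n)
                    return n
            return max_clients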


    Testing Graphics

    Graphics is a tricky thing to test. One can always dump the rendered frames as images and do byte-wise comparisons, but this is a very brittle approach and hence unreliable.

    A better approach here is to combine automated and manual testing. Use automation to place the game in repeatable situations (e.g. specific reference game states), and save snapshots of the screen for later evaluation by human testers. The tester can quickly flip through the images to spot visual artifacts.

    On linux, if you have ImageMagick installed, you can capture screen image using:

        import -window MyApp myapp.gif
    

    This saves the image inside the window named "MyApp" into a GIF image: myapp.gif. You can easily write a shell script to automatically capture a large number of images.

    If you are pretty savvy with ImageMagick, you can use "convert" to turn the newly captured images into an animated GIF, with your reference sequence placed in the same animated GIF, right beside the new one. When you come to work the next morning, just quickly flip through and say "pass" or "fail".
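
    For example, assuming the reference frames are named ref_*.gif and the fresh captures new_*.gif, a small Python driver can ask ImageMagick to paste each pair side by side (+append) and then string all the pairs into one animated GIF to flip through (file names and the one-second frame delay are arbitrary):

        import glob
        import subprocess

        pairs = []
        for ref in sorted(glob.glob('ref_*.gif')):
            new = ref.replace('ref_', 'new_')
            pair = ref.replace('ref_', 'pair_')
            # +append pastes the reference and the new capture side by side.
            subprocess.run(['convert', ref, new, '+append', pair], check=True)
            pairs.append(pair)

        # One animated GIF, one second per frame (-delay is in 1/100 s), looping forever.
        subprocess.run(['convert', '-delay', '100', '-loop', '0'] + pairs + ['review.gif'], check=True)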


    This is a pretty operable summary of automated testing. Note that a key theme keeps coming back: a sufficient abstraction, plus a high-level programming interface to leverage that abstraction. In most cases, this interface takes the form of a scripting interface. While general-purpose scripting languages like Python and Ruby could be used here, many game engines (especially the highly specialized ones) have their own tailored scripting language.

    In fact, scripting is the most advanced form of interface design. If done right, it not only makes your development efficient but also makes testing much easier. However, it is not an easy job. A good bet is to reuse existing, successful scripting languages as much as possible.