Posted by: David Harley | April 30, 2015

Gaming the tests: who’s being cheated?

[Update: a joint statement by AV-Comparatives, AV-Test and Virus Bulletin is now available here: it appears that the products submitted by Qihoo for testing had the Bitdefender engine enabled by default and its own QVM engine disabled, whereas ‘all versions made generally available to users in Qihoo’s main market regions had the Bitdefender engine disabled and the QVM engine active.’ The testers state that this engine provides ‘a considerably lower level of protection and a higher likelihood of false positives.’]

You may have the impression, if you’ve read some of the stuff I’ve written about testing over the years (surely somebody must have read a bit of it????), that I’m anti-tester. It’s not the case, though I’m passionately against bad testing: while many tests and testers make me want to shake somebody, I recognize that the Internet would be a more (ok, an even more) dangerous place without competent testers. Of whom there are quite a few, these days, and I think AMTSO, for all its false steps, can take some of the credit for that.

It’s easy to forget what a free-for-all testing was when AMTSO was actually conceived. Let’s be clear: there were always good (and bad, and mediocre) testers, and that’s still the case today, but many technical and ethical issues have been resolved – in the mainstream, at any rate – by exhaustive (and sometimes exhausting) discussion at workshops, in forums and by email.

I remember a time when there was much criticism of AMTSO because people suspected collusion between the two sides of the vendor/tester divide. In fact, a more accurate picture might be of two parties whose aims overlap but are by no means totally compatible, working towards methodologies that actually benefit customers rather than mislead them. There are testing organizations that decline to compromise their credibility by engaging with security companies in AMTSO or elsewhere, and I can see why they’d want to preserve their neutrality: the problem there is that testing is difficult, requiring a depth and breadth of knowledge and experience that is rarely found outside the security industry, and they’re cutting themselves off from a major source of information on how they can improve their testing.

Mainstream testers and vendors have a good knowledge of each other’s area of expertise, but there are consumer organizations who are convinced that testing AV is as easy as evaluating a pair of headphones or a car insurance policy. If they don’t feel competent to do it themselves and outsource the testing to a professional tester, that’s fine, but sometimes they prefer to use outside organizations who may have strong security connections, but aren’t sufficiently au fait with the subtleties of anti-malware technology.

Incompetent and downright dishonest testers, on the other hand, should be held accountable to and by the users of security products who are exposed to misleading test results and conclusions. But that doesn’t mean that vendors qualify for sainthood.

To some extent, it’s inevitable that vendors bear in mind the sort of tests they expect their products to undergo and configure them accordingly. Years ago, many anti-virus products would flag all sorts of non-viral, non-malicious files because they knew that high-profile testers were using poorly-filtered virus libraries that contained all sorts of unverified ‘garbage files’. In other words, they would detect and flag objects that posed no real threat to the user, because they knew that they would be penalized in comparative tests for not detecting them when other products did.

In the 90s, there was some controversy when a particular product (Dr Solomon’s) was found to be configured so that if it found more than ten known viruses on a system, it assumed that it was being used by a tester/reviewer to scan a library of virus samples, and switched from using only static signatures to heuristic mode, to increase the likelihood that it would catch malware for which it didn’t yet have a static signature. McAfee (among others) claimed that this ‘cheat mode’ gave the Dr Solomon’s product an unfair advantage and misled the public, since the ‘extra’ viruses would not be detected in a real-world situation. Which did seem to be the case according to McAfee’s own testing, but it also contributed to making people aware that (a) heuristic scanning might be quite a good idea as more and more previously unknown viruses were appearing and (b) the Dr Solomon’s range of products were really rather good at heuristics. (Unfortunately, it would be hard for any product to match that sort of performance on unknown malware today, because scanners now need to detect a far wider range of malicious behaviour than just the ability to self-replicate.)
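For readers who like to see the logic spelled out, here is a minimal sketch of the kind of threshold switch described above. It is purely illustrative and is not the Dr Solomon’s code: the Sample class, its two flags, the scan function and the name CHEAT_THRESHOLD are all hypothetical, and the only detail taken from the account above is the ‘more than ten known viruses’ trigger.

```python
from dataclasses import dataclass

# Assumption: the trigger is "more than ten known viruses", as described above.
CHEAT_THRESHOLD = 10


@dataclass
class Sample:
    name: str
    signature_match: bool  # hypothetical: a static signature exists for this file
    looks_viral: bool      # hypothetical: a heuristic check would flag this file


def scan(samples):
    """Return the names of detected samples, in scan order."""
    detected = []
    known_hits = 0
    heuristic_mode = False  # starts in signature-only mode
    for sample in samples:
        if sample.signature_match:
            known_hits += 1
            detected.append(sample.name)
            # Many known viruses suggests a test library, so switch modes.
            if known_hits > CHEAT_THRESHOLD:
                heuristic_mode = True
        elif heuristic_mode and sample.looks_viral:
            # Only caught once heuristics have been switched on.
            detected.append(sample.name)
    return detected


# A tester's library: eleven known viruses followed by two new ones.
lab = [Sample(f"known{i}", True, True) for i in range(11)]
lab += [Sample("new1", False, True), Sample("new2", False, True)]

# An ordinary system: one known virus and one new one.
home = [Sample("known0", True, True), Sample("new1", False, True)]

print(len(scan(lab)))  # both new samples are caught
print(scan(home))      # the new sample is missed
```

In this simplified version the switch only affects files scanned after the threshold is crossed, which is a choice made for brevity rather than a claim about how the real product behaved. It does show why the two sides disagreed: the same package gives different results depending on whether it thinks it is looking at a sample library.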

Nowadays, static testing using huge collections of everything anyone has ever considered to be a ‘virus’ is the exception rather than the norm. At least, it’s not how competent mainstream testers work. A product that restricted itself to non-heuristic detection would be of very little use, so it now seems ludicrous that one company thought that another company was cheating by using heuristics. Perhaps that’s more understandable if you recall that there were concerns at the time that heuristics might increase the risk of false positives and have a negative impact on general processing speed.

But there is still a fine line between accommodating known testing methodologies and actually gaming a test. The Dr Solomon’s ‘cheat’ involved adding functionality to the same package used by its customers, even though that functionality was of doubtful direct benefit to the user (though it was obviously intended to benefit the vendor). Was that cheating? Lots of us didn’t think so at the time, and it seems to have become a non-issue since. Especially to McAfee, who actually bought the company subsequently.

However, AV-Comparatives has announced that it is investigating (in collaboration with AV-Test and Virus Bulletin) vendors who have submitted versions of their products for testing that are specifically engineered to optimize their performance in a testing environment, and that are not the same product generally in use among their customers. A joint statement is expected, but hasn’t yet been published, so we don’t know for sure which products/vendors are at issue.

Nor do we know exactly how the products at issue differ from the usual production versions, though I can think of a number of ways in which a product’s test performance might be boosted. For example:

  • By detecting ‘possibly unwanted’ software by default. In general, security products are cautious about detecting PUAs/PUPs/PUS by default, for a number of reasons. That caution puts a product at a disadvantage, though, where testers insist on using default settings and don’t filter PUAs out of their sample sets.
  • By enabling by default a heuristic level so paranoid that in the real world it would generate an unacceptable level of false positives. (Though this might be a less effective strategy where a tester included FP testing in its test suite.)

Well, we’ll see what the testers’ joint statement tells us. What is clear, though, is this. A well-conceived test should reflect real-world experience as closely as possible. When the actual product isn’t the one that the real-world customer is using, the test can’t reflect real-world experience accurately, though we can’t say at the moment how much real difference to the test results the tweaking of the submitted version actually made.

But it isn’t just the tester who’s being cheated, it’s all the potential customers who will expect more of the out-of-the-box product than it actually provides. Hopefully, it can be configured to generate the same detection rates, if in fact boosting detection rates artificially in some way was the actual purpose of the tweak. However, many customers expect not to have to make any decisions at all about configuration: I think that’s an unhealthy expectation, but it is what it is. And if the product’s performance can be boosted to equal its detection under test, what are the implications for its performance in other respects, with or without tweaking?

David Harley
