Chapter V: A Lab to Certify Software, by Caroline Benner, 12/3/04


Software companies do not make secure software because consumers do not demand it. This shibboleth of the IT industry is beginning to be challenged. It is dawning on consumers that security matters as computers at work and home are sickened with viruses, as the press gives more play to cybersecurity stories, and as software companies work to market security to consumers. Now, consumers are beginning to want to translate their developing, yet vague, understanding that security is important into actionable steps, such as buying more secure software products. This, in turn, will encourage companies to produce more secure software. Unfortunately, consumers have no good way to tell how secure software products are.


Security is an intangible beast. When a consumer bought a new release of Office, he could see that the software now offered a handy little paper clip to guide him through the letter he was writing. He understood that Microsoft had made tangible changes to the software, perhaps worth paying for, perhaps not. Security does not lend itself to such neat packaging: Microsoft can say it puts more resources into security, but should the consumer trust that this means the software is more secure? How does the security of Microsoft’s products compare to that of other companies? To help the consumer answer these questions, we need an independent lab to test and certify software security. If consumers trusted this lab, they could begin to vote for security with their dollars, and companies would have an incentive to work harder at providing it.


This seems to be an ingenious solution. Consumers do not know what security looks like so they will trust the software security experts running the lab to tell them. The trouble is, those software security experts don’t know what security looks like either.


Why is software security so hard?

Quite simply, software security experts don’t know whether a piece of software is secure because it is practically impossible to ascertain whether a program has a given property, for all but the simplest classes of properties and the simplest programs. Security, needless to say, is not one of those simple properties. Why is this true? Consider first what security vulnerabilities are: they are typically errors, or bugs, made by the designers or implementers of a piece of software. These bugs, many of them very minor mistakes, live in a huge sea of code, millions of lines long for much commercial software, that is set up in untold numbers of different environments, with different configurations, different inputs from the user and from other software, and different interactions with other software. This creates an exponential explosion in the number of things we’re asking the software to do without making a mistake.


So where do bugs come from? Bugs are either features that the software’s designers planned but that did not make it into the final product, or undesired features that the designers did not ask for but that became part of the final product anyway. It’s very easy for a programmer to introduce the latter type of error, which is where most security vulnerabilities sneak in. Perhaps Roger the developer writes perfect code, an impossibility in itself, since humans make mistakes. Now Joe next door writes another piece of code that uses Roger’s code in a way Roger did not expect, and this can introduce bugs. Or Roger’s code, written and tested on Roger’s computer with its particular configuration, works fine there but not on Joe’s machine next door.
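
To make the second kind of bug concrete, here is a minimal, hypothetical sketch in Python (the function names, and the database connection in the style of Python’s sqlite3 module, are assumptions for illustration). Roger’s function is correct under his assumption that callers sanitize the username; the vulnerability appears only when Joe calls it with raw user input.

    import sqlite3

    # Roger's code: looks up a user record. Roger assumes every caller
    # passes a username that has already been sanitized.
    def find_user(conn: sqlite3.Connection, username: str):
        query = "SELECT * FROM users WHERE name = '" + username + "'"
        return conn.execute(query).fetchall()

    # Joe's code: reuses Roger's function, but feeds it raw text from a
    # login form. An input such as  x' OR '1'='1  turns Roger's query
    # into an SQL injection hole, even though neither function, read on
    # its own, looks obviously wrong.
    def handle_login(conn: sqlite3.Connection, form: dict):
        return find_user(conn, form["username"])

The fix (a parameterized query) is easy once the flaw is seen; the point is that neither developer’s code, read alone, reveals it.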


If you can’t avoid making errors, can you catch them? The best we can do is test the software. Testing will never catch all errors since you need to test for an unlimited number of possible things the software might do. If you want to keep your cat fenced up in your backyard, you have to anticipate all possible ways the wily cat might escape. You can climb on the roof and guess whether the cat would be willing to jump off it over the fence. You can crouch down and guess whether a cat can slink under the fence. And still, the cat will probably escape via the route you overlooked. It’s very hard to prove definitively that something is not going to happen, that the cat cannot possibly escape the yard, that a piece of software has no vulnerabilities.


Two other problems make building secure software extremely difficult and set it apart from other complicated engineering tasks, such as building a bridge. First, there is often little correlation between the magnitude of an error in the software and the problems that the error will cause. If you forget to drill in a bolt when building a bridge, the bridge likely won’t collapse. However, if you forget to type a line of code, or even a single character, you may well compromise the integrity of the entire program.


The second problem is that software is constantly being probed for vulnerabilities by attackers because spectacular attacks can be carried out while incurring little cost—they require minimal skill and effort, and the risk of detection is quite slim. A bridge’s vulnerabilities are not easy to exploit—it would be a hassle to try to blow up a bridge—and so the payoff for trying is not worth it to most.


Trying to measure security

Software’s complexity makes it difficult not only to avoid making mistakes but also to devise foolproof ways to detect those mistakes and to measure how many there are and how consequential they are. However, computer scientists have devised a few methods for making educated guesses at how secure a piece of software is.


One way to attempt to evaluate security is to look at the design of the software and at the implementation of that design. Experts could study the plan, or specification, for the software to see whether security features, such as encryption and access control, have been included. They can also look at the software’s threat model: its assessment of what threats the software might face and how it will deal with them.

On the implementation side, the experts could evaluate how closely the software matches its specification. They can also look at the process by which the developers created the software. Many computer scientists believe that the more closely a software development process mirrors the formal, step-by-step exactitude of a civil engineering process, the more secure the code should be. So the experts can check whether the development team documented its source code so that others can understand what the code does. They can see whether a developer’s code has been reviewed by the developer’s peers. They can evaluate whether the software was developed on a reasonable schedule rather than rushed to market. They can look at how well the software was tested. The hope, then, is that if the developers take measures like these, the code is likely to have fewer mistakes and thus be more secure.


The experts might also run tests on the software themselves. They might run programs that check the source code for types of vulnerabilities known to occur frequently, such as buffer overflows. Or they might employ human testers to probe the code for vulnerabilities. Finally, the experts could follow the software after it has shipped, tracking how often patches to fix vulnerabilities are released, how severe the problems those patches fix are, and how long the company takes to release a patch once a vulnerability is found.
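
As a hedged sketch of what such automated checking might look like, the toy scanner below (the script and its list of flagged functions are illustrative, not any real lab’s tool) searches C source files for library calls that are frequent sources of buffer overflows. Real tools do far deeper analysis, but the workflow of flagging suspicious code for human review is the same.

    import re
    import sys

    # Illustrative list of C library calls that are common sources of
    # buffer overflows; a real analysis tool models data flow and buffer
    # sizes instead of just matching names.
    RISKY_CALLS = ("strcpy", "strcat", "sprintf", "gets", "scanf")

    def scan_file(path):
        """Report lines that appear to call one of the risky functions."""
        pattern = re.compile(r"\b(" + "|".join(RISKY_CALLS) + r")\s*\(")
        findings = []
        with open(path, encoding="utf-8", errors="replace") as src:
            for lineno, line in enumerate(src, start=1):
                if pattern.search(line):
                    findings.append((lineno, line.strip()))
        return findings

    if __name__ == "__main__":
        # Usage: python scan.py file1.c file2.c ...
        for path in sys.argv[1:]:
            for lineno, line in scan_file(path):
                print(f"{path}:{lineno}: possible unsafe call: {line}")

Every line the scanner prints still has to be judged by a person; most will turn out to be harmless, which foreshadows the false-positive problem discussed below.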


Creating a certifying lab: Requirements

For a lab to inspire consumers to trust its ratings and begin to buy products it has rated secure, the lab must meet several requirements.


1. Reasonable tests for measuring security: The first requirement for an independent lab would be to decide which of the above methods it would use to evaluate security. The lab’s method must produce a result that enough people believe has merit. Some security experts [1] argue that the methods we currently use to evaluate software aren’t good enough yet, and so there are no reasonable tests for measuring security. Talk of a lab, they say, is premature: even if we can find some security vulnerabilities, which we can using these methods, there will always be another flaw that is overlooked. Therefore, any attempt to certify a piece of software as “secure” is essentially meaningless, since such certification can’t guarantee with any degree of certainty that the software cannot fall victim to a devastating attack. Others [2] argue that since we know how to find some security vulnerabilities, certifying software based on whether or not it has these findable flaws is better than doing nothing.


Assuming for the moment that we agree with the computer scientists who believe doing something is better than doing nothing, which of these methods should the lab use? Ideally, if the only consideration were doing the best possible job of estimating whether the software is secure, the lab would look at all of them. But in the real world, the lab will face constraints such as the time and money it takes to evaluate the software and access to proprietary information such as source code and development processes.


Our current best attempt to certify software security works by employing experts to evaluate documentation about the software’s design and implementation, checking especially for security features such as encryption, access control and authentication. Software companies submit their products to labs that use this scheme, called the Common Criteria, to evaluate their software. A good analogy for how this works is to consider a house with a door and a lock. The Common Criteria are about examining how good that lock is: Is it the right lock? Is it installed properly? Critics of the Common Criteria say a major shortcoming of this method is that it relies too much on the documentation of the design and implementation and essentially ignores the source code itself, which can be full of vulnerabilities. [3] Figuring out whether the lock on your door is a good one is hardly useful if the bad guys are poking holes through the walls of your house, which is effectively what flawed code lets attackers do. Another problem is that a Common Criteria evaluation considers the software in a very limited environment (Windows, for example, was evaluated for use on a computer with no network connections and no removable media [4]), too limited to be meaningful in any general sense.


Another approach to measuring software security is under development at Carnegie Mellon University in partnership with leading IT companies including Microsoft, Oracle, Cisco and Hewlett-Packard. The Cylab is working on a plan to plug holes in the walls of the house by running automated checking tools, programs that run against a piece of software’s source code much like spell-checkers, to catch three common types of security vulnerabilities, including buffer overflows. According to Larry Maccherone, executive director of CMU’s Sustainable Computing Consortium, who is involved in this effort, 85 percent of the vulnerabilities cataloged in the CERT database (a database run by CMU’s Software Engineering Institute and one of the more comprehensive collections of known security vulnerabilities) are of the kind that the Cylab’s tools will test for. The flaw with this method is that it looks at only part of the problem: in this case, we’re ignoring the locks on the doors. A lab that looks at both the plans for the software and the software’s code itself would seem to be most useful.


2. Meaningful ratings: Once the lab decides which methods it will use to evaluate the software, it needs to devise a way to convey its findings to consumers. Any rating system must say how secure the lab thinks the software is, but at the same time not give users a false sense of security, so to speak. The lab should make clear that the ratings are simply an attempt to measure security and can never guarantee the software is secure.


In addition, the descriptors the rating system uses to describe how secure a piece of software is must make sense to the average consumer and at the same time map to metrics intelligible to computer scientists. A simple solution is to pass or fail the software, as Maccherone proposes. However, translating what passing means, and offering the appropriate caveats as described above, is no mean feat. For the average user you might try to explain that the tools the Cylab would use to scan the source code are intended to find the kinds of vulnerabilities that make up 85 percent of the vulnerabilities in the CERT database, and that different pieces of software contain certain numbers of these vulnerabilities. However, as McGraw points out, the average user, much like Gary Larson’s cartoon dogs, would hear “Blah blah blah CERT database.” Confused, the user might then ask “does this mean my computer is going to turn into a spam-sending zombie or not?”, which is a perfectly fair, and unanswerable, question. Spafford adds that it would be difficult to present meaningful information even to sophisticated users like system administrators.


Hal Varian, an economist at Berkeley, suggests ratings might go beyond a general pass/fail system and note that the software is “certified for home use” or “certified for business use.” This scheme would confront the same problem: What does it mean for a piece of software to be certified for home or business use?


The Common Criteria, the primary certification system in use today, makes no attempt to make its ratings meaningful to the average consumer: the ratings are largely intended for use by government agencies. Vendors who want to sell to certain parts of the US government must ensure their products are certified under the Common Criteria. Companies do tout their Common Criteria certification in marketing literature, but this likely means almost nothing to consumers.


3. Educated consumers: Consumers, then, would need to know about and value the ratings. Press attention, the endorsement of the ratings by leading industry figures and computer scientists, and more consumer education on why security is important would help here.


4. Critical mass of participating companies: Companies will only care about getting their software rated if many of their competitors are participating. Therefore, a critical mass of companies would need to take part in the certification process for it to be useful. There are various ways to convince companies to participate. Industry leaders might take the initiative and participate for the benefit of the industry as a whole, as the leading IT companies partnered in the Cylab are doing. According to Maccherone, industry giants like Microsoft have every interest in promoting industry-wide independent evaluations: after all, two-thirds of the “blue screen of death” incidents in Windows NT, where the operating system crashes, were caused by third-party drivers.


Alternatively, government policy might mandate that certain government agencies only use software that has been certified, which is how the Common Criteria certifying process works.


5. Costs in time and money: It must not be prohibitively expensive to submit software for review, so that the smaller players in the industry can afford to be included. This is a major complaint about the Common Criteria labs: the process is too expensive, which bars all but the biggest companies from having their software evaluated.


As new software releases come with some regularity, the review process must be quick enough for the software to have time on the market before its next version comes out. Critics fault the Common Criteria for being too slow as well.


The Cylab, because it relies on automated tools, might well deliver speedier and cheaper results than a process like the Common Criteria, which depends on human experts. Then again, a major problem with automated tools is that each of their findings would need to be evaluated by a human, since the number of false positives they uncover is huge. Spafford noted that in one test he performed, the tools uncovered hundreds of false positives in ten thousand lines of code. Windows is 60 million lines of code long. How much would handling these false positives increase the lab’s costs in time and money?
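
As a hedged back-of-envelope calculation (the per-line false-positive rate below is an assumption drawn loosely from Spafford’s anecdote, not a measured figure for any particular tool), the scale of the triage problem looks something like this:

    # Rough scale estimate; both numbers are illustrative assumptions.
    false_positives_per_10k_lines = 200   # "hundreds" in one test
    lines_of_code = 60_000_000            # rough size of Windows

    estimate = lines_of_code / 10_000 * false_positives_per_10k_lines
    print(f"~{estimate:,.0f} findings for humans to triage")
    # prints: ~1,200,000 findings for humans to triage

Under these illustrative numbers, even a few minutes of review per finding would add tens of thousands of hours of expert labor to a single evaluation.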


6. Accountability and independence: Finally, the lab must be accountable and independent. It needs to be held responsible for its scoring process, and it must be able to evaluate software without regard to whose software it is evaluating or who is funding the evaluation. Jonathan Shapiro, professor of Computer Science at Johns Hopkins University, points out that the Common Criteria labs are not as independent and accountable as they could be: he says that companies play the labs off against each other for favorable treatment. While the Cylab’s reliance on tools might increase its ability to be independent, since tools give impartial results, humans still need to evaluate the tools’ findings, and processes with human intervention must be structured so that they maintain independence and accountability.


Is a lab worth doing now?

So would it be more useful to have a lab that can test for some flaws than it would be to do nothing? Would widely publicizing the Common Criteria ratings, or hyping the Cylab’s certification to consumers when it goes live, inspire companies to make certifiable software? It seems it would, at least as long as consumers perceived their computers to be “safer” than they were before buying certified software. However, if consumers began to feel their computers were no more secure for having bought certified software, the certification system would quickly fall apart. (How consumers might perceive this to be true is another question entirely; most consumers probably believe their computers are pretty secure right now anyway.) It is for this reason that we may have just one shot at making a workable lab that is widely used by consumers, and so perhaps we should wait until we can be reasonably confident that certification means a piece of software really is more secure than its uncertified rival.


Rather than encouraging the creation of a lab now, policy-making effort is probably better spent on funding security research so that we can come up with metrics meaningful enough for a lab to use. Spafford points out that research into security metrics, as well as into security more broadly, is woefully underfunded. He believes we need to educate those who hold the purse strings that security is about more than anti-virus software and patches. Security is also about thinking of ways to make the software you are about to build secure, not just trying to clean up after faulty software. Funding for research into better tools, new languages and new architectures is probably the best contribution policy-makers can make toward improving software security.



Bibliography

  • Interviews with Larry Maccherone, Executive Director, Sustainable Computing Consortium, Carnegie Mellon University, November 11 and 18, 2004
  • Interview with Jonathan Shapiro, Professor of Computer Science, Johns Hopkins University, November 17, 2004
  • Interview with Stuart W. Katzke, Ph.D., Senior Research Scientist, National Institute of Standards and Technology, November 12, 2004
  • Interview with Jeannette Wing, Chair, Department of Computer Science, Carnegie Mellon University, November 12, 2004
  • Interview with Gary McGraw, Chief Technology Officer, Cigital, November 30, 2004
  • Interview with Eugene Spafford, Professor of Computer Science, Purdue University, December 2, 2004
  • Email exchange with Hal Varian, Professor of Economics, Business, and Information Systems, UC Berkeley, November 15, 2004
  • Email exchanges with Steve Maurer, Professor of Economics, UC Berkeley, November 10-20, 2004
  • Interview with Valentin Razmov, UW CSE graduate student, November 4, 2004