Sunday, October 6, 2013

Static Analysis Tools - Evaluating Their Effectiveness

This blog post discusses static analysis tools for software and approaches to evaluating their effectiveness.

An introduction to static analysis tools, how they work, and their pros and cons can be found here:
Show Notes:

Why evaluate static analysis tools?
Now, there are quite a few static analysis tools available for each language. Each of these tools finds a subset of all the bugs, and there is some overlap in the bugs found. No one tool completely subsumes all the others, and hence there is no clear winner. In such a situation, it becomes necessary to evaluate the effectiveness of these tools.

A number of metrics have been discussed for evaluating the effectiveness of bug-finding tools (a short counting sketch follows the list):
  • False Positive (FP): A false positive is a bug warning that is not really a bug.
  • True Positive (TP): A true positive is a bug warning that correctly identifies a bug and leads to a fix.
  • True Negative (TN): A true negative is a line of code for which no bug warning was reported and that was found to be bug-free.
  • False Negative (FN): A false negative is a line of code for which no bug warning was reported but a bug was later found.
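
To make these counts concrete, here is a minimal Python sketch (not from any of the approaches cited above) that tallies TP, FP, TN, and FN for a single file. It assumes we already know which lines the tool flagged and which lines were later confirmed to contain real bugs; both input sets and the numbers in the example are made up.

```python
# Minimal sketch: tally TP/FP/TN/FN for one file, assuming we already know
# which lines the tool flagged and which lines were later confirmed buggy.
# Both input sets are hypothetical; real data would come from the tool's
# warning report and the project's bug-fix history.

def confusion_counts(total_lines, warned_lines, buggy_lines):
    warned = set(warned_lines)
    buggy = set(buggy_lines)
    tp = len(warned & buggy)            # warned and really buggy
    fp = len(warned - buggy)            # warned but not buggy
    fn = len(buggy - warned)            # buggy but never warned
    tn = total_lines - tp - fp - fn     # neither warned nor buggy
    return tp, fp, tn, fn

# Example with made-up numbers: a 100-line file, 5 warnings, 4 known bugs.
print(confusion_counts(100, warned_lines={3, 10, 42, 57, 80},
                       buggy_lines={10, 42, 61, 80}))   # -> (3, 2, 94, 1)
```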


This leads to the definition of other related metrics (computed in the sketch after the list), such as:
  • False Positive Rate (FPR) = FP / (FP + TP), the fraction of warnings that are false alarms; one approach instead uses the classical definition FP / (FP + TN)
  • Precision = TP / (TP + FP), which is equivalent to 1 - FPR under the first definition of FPR
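
Continuing the toy counts from the previous sketch, the snippet below shows how the two FPR variants and precision relate; the numbers are purely illustrative.

```python
# Sketch: derived metrics from the raw counts, using made-up numbers.
# FPR is computed both ways mentioned above: against all warnings (FP+TP)
# and against all non-buggy lines (FP+TN).

def derived_metrics(tp, fp, tn):
    fpr_warnings = fp / (fp + tp)   # share of warnings that are false alarms
    fpr_classic  = fp / (fp + tn)   # classical false positive rate
    precision    = tp / (tp + fp)   # equals 1 - fpr_warnings
    return fpr_warnings, fpr_classic, precision

tp, fp, tn = 3, 2, 94               # counts from the sketch above
fpr_w, fpr_c, prec = derived_metrics(tp, fp, tn)
print(round(fpr_w, 3), round(fpr_c, 3), round(prec, 3))   # 0.4 0.021 0.6
```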


Another line of work argues that these metrics do not consider the project-specific importance of bugs. One approach treats the warnings that eventually lead to fixes by the developers as true bugs, and uses the lifetime of a warning: the number of fix versions over which the warning remains unfixed. The intuition is that if a warning is not fixed over several fix versions, it is probably not very relevant to the project. In a case study using this approach, relevance rates above 50% were found for FindBugs and 10.1% for PMD.
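
A rough sketch of the lifetime idea is shown below. It assumes each warning has already been matched across an ordered list of fix versions (the matching by file, type, and location is glossed over), and the relevance threshold used here is a hypothetical choice, not the one from the case study.

```python
# Rough sketch of the warning-lifetime idea. Assumes we already tracked each
# warning across an ordered list of fix versions and recorded, per warning,
# whether it was still present in each version. The warning matching itself
# (by file, type, and location) is glossed over here.

def lifetime(presence):
    """Number of fix versions over which the warning remained unfixed."""
    return sum(presence)

def relevance_rate(warnings, max_lifetime=3):
    """Fraction of warnings fixed within max_lifetime versions (hypothetical threshold)."""
    relevant = [w for w in warnings if lifetime(w) <= max_lifetime]
    return len(relevant) / len(warnings)

# Made-up data: True means the warning is still present in that fix version.
warnings = [
    [True, False, False, False],   # fixed after one version -> likely relevant
    [True, True, True, True],      # never fixed -> likely irrelevant
    [True, True, False, False],    # fixed after two versions
]
print(relevance_rate(warnings))    # about 0.67 with this toy data and threshold
```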

One discussion argues that these metrics do not consider the priority of bugs, which means tools that detect trivial bugs will be rated higher. Several categories of reported bugs are of lower significance to the developer:
i. True but low-impact defects: Bugs that are easy to find but will not cause big problems. If any of the above metrics are used, tools that find easy-to-spot but unimportant bugs will be rated higher.
ii. Deliberate errors: Intentionally throwing a runtime exception.
iii. Masked errors: Error conditions that will never occur.
iv. Already doomed: Sometimes an error can occur only in a situation where the computation is already doomed, and throwing a runtime exception is not significantly worse than any other behavior that might result from fixing the defect.
v. Testing code: In testing code, developers will often try things to break the system.
vi. Unimportant cases or low-importance functions.
The argument against fixing these bugs is that it consumes resources such as developer time in return for very little value; a hedged sketch of down-ranking such warnings is shown below.
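
As an illustration, the snippet below down-ranks warnings that fall into such low-value categories or live in testing code. The category names, warning fields, and scoring are hypothetical and would need to be mapped onto the actual output of a tool such as FindBugs or PMD.

```python
# Hedged sketch of down-ranking warnings in the low-value categories above.
# The category names, warning fields, and scoring are all hypothetical; a real
# setup would map them onto the report format of a tool such as FindBugs or PMD.

LOW_VALUE_CATEGORIES = {"low_impact", "deliberate_error", "masked_error",
                        "already_doomed"}

def prioritize(warnings):
    """Return warnings sorted so the likely-valuable ones come first."""
    def score(w):
        s = 0
        if w["category"] in LOW_VALUE_CATEGORIES:
            s += 2                        # category the team has deemed low value
        if "/test/" in w["path"]:
            s += 1                        # testing code deliberately breaks things
        return s                          # lower score = higher priority
    return sorted(warnings, key=score)

warnings = [
    {"path": "src/main/Parser.java", "category": "null_deref"},
    {"path": "src/test/ParserTest.java", "category": "null_deref"},
    {"path": "src/main/Util.java", "category": "masked_error"},
]
for w in prioritize(warnings):
    print(w["path"], w["category"])
```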

So we can see that quite a few methods for evaluating these tools have been proposed, and this is an active field of research. Evaluating tools helps improve the tools themselves and guides tool selection for specific project needs.
Our screencast here discusses different approaches to using static analysis tools more effectively.

Nachiket V. Naik
Computer Science
North Carolina State University
Raleigh, US
nvnaik@ncsu.edu

Preeti Satoskar
Computer Science
North Carolina State University
Raleigh, US
pysatosk@ncsu.edu