Software Test and Analysis in a Nutshell
Luciano Baresi 1
Dipartimento di Elettronica e Informazione
Politecnico di Milano
Milano, Italy

Mauro Pezzè 2
Dipartimento di Informatica, Sistemistica e Comunicazione
Università degli Studi di Milano Bicocca
Milano, Italy
Abstract
The development of large software systems is a complex and error-prone process.
Faults can occur at any development stage, and they must be identified and removed
as early as possible to stop their propagation and reduce verification costs. Quality
engineers must be involved in the development process from the very early phases,
to identify the required qualities and estimate their impact on the development
process. Their tasks span the whole development cycle and go beyond product
deployment, through maintenance and post-mortem analysis. Developing
and enacting an effective quality process is not a simple task: it requires blending
many activities that satisfy the quality requirements, the product characteristics,
the development process structure, the availability of resources and skills, and the
budget constraints.
This paper overviews the main aspects of software quality: it discusses the characteristics of the quality process, surveys the key testing phases, and presents
modern functional and model-based testing approaches.
Key words: Software Quality, Software Test, Integration Testing,
System and Acceptance Testing, Functional Testing, Model-based
Testing.
1 Email: baresi@elet.polimi.it
2 Email: pezze@disco.unimib.it

© 2004 Published by Elsevier Science B.V.
1 Introduction
The development of large software products involves many activities that need
to be suitably coordinated to meet the desired requirements. Among these
tasks, we can distinguish activities that contribute mainly to the construction
of the product, e.g., requirements analysis, design specification, module integration, and activities that aim at checking the quality of the process and of
the intermediate and final results, e.g., design and code inspection, unit and
integration testing, acceptance testing. This classification is not sharp, since
most activities contribute to some extent both to advancing the development and
to checking quality, and in some cases characterizing activities in this
way is not easy. The classification nevertheless helps identify an important thread of the development
process that includes all quality-related activities and is often referred to as
the quality process.
The quality process is not a phase: it spans the whole development cycle, starting
during the feasibility study and going beyond product deployment, through
maintenance and post-mortem analysis. The qualities relevant to the product
must be defined already in the feasibility study; requirements and design
specifications must be inspected and analyzed as early as possible, to identify
and remove design errors that are otherwise hard to reveal and expensive to
remove; tests must be designed and planned in the early design phases, to
improve the specifications and reduce the chances of delivering badly tested,
low-quality products, and must be executed many times through different builds
and releases to assure non-regression of product functionality.
The quality process includes many complementary activities that must
be suitably blended to fit the specific development process, and to meet the
costs and quality requirements of the different products. The quality engineer
must face sometimes contradictory requirements, like keeping costs low,
guaranteeing high quality results, avoiding interference with the development
process, and meeting stringent deadlines. Selecting a suitable set of quality
activities is hard and requires deep experience in both analysis and testing,
strong knowledge of design and development, a good background in management
and planning, and excellent abilities to mediate among different, often
conflicting, aspects and needs.
Quality activities address two different classes of problems: revealing faults
and assessing the readiness of the product. Quality cannot be added at the end
of the process, but must be enforced through the whole development cycle.
Important classes of faults are difficult to reveal and hard to remove from
the final product. Many analysis and test techniques aim at revealing faults,
which can then be eliminated. Identifying and removing faults through the
development process certainly help increase the quality of the final product,
but cannot assure the absence of faults that can persist after product delivery.
Continuing to search for faults until all faults are revealed and removed would
postpone quality activities forever, without any rationale. Users are interested
in solving problems, and they measure the quality of software products in
terms of dependability, usability, costs, and ultimately the ability to meet their
expectations, not in terms of avoided or removed faults. Users tolerate a few
annoying failures in products that address their problems cost-effectively, but
do not accept critical failures, even if very rare, or too-frequent
annoying failures. Thus, it is important to pair quality activities that aim
at revealing faults with activities that estimate the readiness of a product in
terms of dependability and usability.
Most quality activities are carried out independently of the development,
but their effectiveness strongly depends on the quality of the development
process. For example, reviews and inspections of requirements specification
are much more effective when specifications are well structured than when they
are badly written, and integration testing is more effective when software is
implemented according to good design principles than when it is developed
without a coherent approach. Quality activities can help increase the quality
of the development practice by providing feedback on root causes of faults and
on common errors.
The rest of the paper is organized as follows. Section 2 discusses the
main characteristics of the quality process, emphasizing the design of a quality
plan. Section 3 presents the problems of integration testing and overviews
different approaches to integration testing. Section 4 moves on towards system
and acceptance testing, by discussing the differences between verification and
validation and surveying the problem of regression testing. Section 5 addresses
the problems of functional and model-based testing, sampling some popular
modern approaches and highlighting how they complement each other. Section 6 identifies
relevant open research problems.
2 The Quality Process
The quality process involves many activities that can be grouped into five main
classes: planning and monitoring, verification of specifications, test case generation, test case execution and software validation, and process improvement.
Planning and monitoring activities aim at steering the quality activities
towards a product that satisfies the initial quality requirements. Planning
activities start in the early phases of development with the identification of
the required qualities and the definition of an early analysis and test plan, and
continue through the whole development by monitoring the quality process
and by refining and adjusting it, to take care of new problems and to prevent
deviations from the original plan from leading to project failures.
Specifications can be verified both for internal consistency and for consistency
with respect to the corresponding specifications. Verification of intra-specification
consistency aims at revealing and eliminating inconsistency or
incompleteness of the specifications; verification of inter-specification consistency
aims at revealing deviations in the development process that manifest
themselves either as differences with respect to the corresponding specifications
or as missing elements in the detailed specifications. Specifications can be
verified by means of many technologies that span from simple syntactic checks
and low-technology inspection to model checking and formal verification.
Test cases are usually generated from specifications, and are integrated
with information from the application environment, development technology,
and code coverage. The application environment and development technology can provide information on common faults and risks. Many organizations
collect general test suites that derive from legacy systems and characterize
specific application environments, either in the form of regression test suites
for specific applications, or in the form of general suites for well known features. Programming languages and development technologies present different
weaknesses that can lead to specific classes of faults. For example, the C++
freedom in memory management comes with well-known risks that can lead
to dangerous memory leaks, which can be limited —but not avoided— with
disciplined design and programming practice, suitably paired with analysis
and tests. Code coverage indicates regions of code that have not been adequately tested, and may suggest additional test cases to complete functional
test suites or indicate ill-designed code.
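Coverage tooling aside, the idea can be illustrated with a toy line tracer; all names below are hypothetical and not taken from any specific tool:

```python
import sys

def run_with_line_coverage(func, *args):
    """Run func, recording which of its source lines execute (a toy tracer)."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is code and event == "line":
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return executed

# Hypothetical module-under-test: one branch handles the empty-list case.
def average(values):
    if not values:
        return 0.0                     # <- missed by the test input below
    return sum(values) / len(values)

first = average.__code__.co_firstlineno
all_lines = {first + 1, first + 2, first + 3}   # the three body lines
covered = run_with_line_coverage(average, [2, 4, 6])
missed = all_lines - covered
print("uncovered lines:", sorted(missed))       # the empty-list branch
```

Here the functional test suite exercises only the non-empty case, and the uncovered line points at the missing test: a call such as `average([])`.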
Test cases might and should be generated as soon as the corresponding
specifications are available. Early test case generation has the obvious advantage of alleviating scheduling problems: tests can be generated in parallel with
development activities, and thus be excluded from critical paths. Moreover,
test execution can start as soon as the corresponding code is ready, thus reducing the testing time after coding. Early generation of test cases also has an
important side effect of helping validate specifications. Experience shows that
many specification errors are easier to detect early than during design
or validation. Waiting until specifications are used for coding may be too
late and may lead to high recovery costs and expensive project delays. Many
modern methodologies suggest early generation of test cases up to the extreme
case of XP that substitutes module specifications with test case generation.
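The XP-style test-first idea can be sketched as follows; the `apply_discount` specification and all names are invented for illustration:

```python
import unittest

# Hypothetical specification: apply_discount(price, percent) returns the price
# reduced by percent, and rejects percents outside [0, 100]. In a test-first
# style, the tests below are written before the module itself is coded.

def apply_discount(price, percent):
    # Minimal implementation, written after (and driven by) the tests below.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return price * (100 - percent) / 100

class DiscountSpecTests(unittest.TestCase):
    def test_nominal_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_full_discount_is_free(self):
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_invalid_percent_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(10.0, 120)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiscountSpecTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```

The test class doubles as an executable record of the specification, which is precisely the role XP assigns to it.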
Test cases may need to be executed in the absence of the complete system.
This can be due either to the decision of incrementally testing modules without waiting for the development of the whole system, or to the need of isolating
the modules-under-test to focus on particular features and facilitate the localization of faults. Executing test cases without the whole system requires the
development of adequate scaffolding, i.e., a structure that substitutes the
missing parts of the system. Adequate scaffolding includes drivers, stubs, and
oracles. Drivers and stubs surrogate the executing environment by preparing
the activation environment for the module-under-test (drivers) and by simulating the behavior of missing components that may be required for the
execution of the module-under-test (stubs). Oracles evaluate the results of
executing test cases and signal anomalous behaviors. To build adequate scaffolding, software engineers must find a good cost-benefit tradeoff. Accurate
scaffolding can be very useful for fast test execution, but it can be extremely
expensive and dangerously faulty. Cheap scaffolding can reduce costs, but
may be useless for executing tests.
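A minimal sketch of such scaffolding, with all names hypothetical: a stub stands in for a missing rate-service component, a driver prepares the activation environment and runs the test cases, and an inline oracle checks the results:

```python
# Toy module-under-test: computes a shipping quote using a rate service
# that is not yet implemented. All names here are hypothetical.

def shipping_quote(weight_kg, rate_service):
    """Module-under-test: price = weight * rate, with a minimum charge."""
    rate = rate_service.rate_per_kg()
    return max(5.0, weight_kg * rate)

class StubRateService:
    """Stub: simulates the missing rate component with a canned answer."""
    def rate_per_kg(self):
        return 2.5

def driver():
    """Driver: prepares the activation environment and runs the test cases."""
    stub = StubRateService()
    test_cases = [(10.0, 25.0), (1.0, 5.0)]   # (input, expected result) pairs
    failures = []
    for weight, expected in test_cases:
        actual = shipping_quote(weight, stub)
        # Oracle: evaluates results and signals anomalous behaviors.
        if abs(actual - expected) > 1e-9:
            failures.append((weight, expected, actual))
    return failures

print("failures:", driver())   # -> failures: []
```

Even in this toy form, the cost-benefit tradeoff is visible: the stub must be faithful enough for the test to be meaningful, yet cheap enough not to become a second development project.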
Testing can reveal many kinds of failures, but may not be adequate for
others, and thus should be complemented with alternative validation activities. For example, the Cleanroom approach suggests complementing testing
with code inspection to reveal faults at the unit level [5].
Process improvement focuses mostly on clusters of projects that share similar processes, engineering teams, and development environments. Quality
experts collect and analyze data on current projects to identify frequent problems and reduce their effects. Problems can be either eliminated by changing
development activities, or alleviated by introducing specific quality activities
for removing problems as early as possible. Although some corrective actions
can already be taken in the current project, corrective actions
are usually applied to future projects.
2.1 Quality Plan
The quality engineer should design the quality plan in the very early phases of
the design cycle. The initial plan is defined according to the test strategy of
the company and the experience of the quality engineer. The test strategy describes the quality infrastructure of the company that includes process issues,
e.g., the adoption of a specific development process, organizational issues, e.g.,
the choice of outsourcing specific testing activities, tool and development elements, e.g., the need for using a particular toolsuite, and any other element
that characterizes that can influence the quality plan.
The initial plan is then refined to take into account the incremental knowledge about the ongoing project, e.g., by detailing module testing activities
once design specifications are available. When the current plan cannot be detailed and adapted to the project needs, it can be replaced
with an emergency plan, to cope with new, unforeseen situations.
A quality plan should include all information needed to control and monitor
the quality process, from general information, like the items to be tested, to
very detailed information, like the schedule of the individual quality activities
and the resources allocated to conduct them. Figure 1 summarizes the main
elements that belong to a good quality plan.
2.2 Monitoring the Quality Process
The quality process must be monitored to reveal deviations and contingencies,
and to adapt the quality activities to the new scenarios as early as possible.
Monitoring can take two forms: the classic supervision of the progress of activities, and the evaluation of quantitative parameters about testing results.
The supervision of activity progress consists of periodically collecting
Test items characterize the items that have to be tested, indicating for example
the versions or the configurations of the system that will be tested.

Features to be tested indicate the features that have to be tested, among all the
features offered by the items to be tested.

Features not to be tested indicate the features that are not considered in the
plan. This helps check the completeness of the plan, since we can verify that
all features were explicitly considered before selecting the ones to be tested.

Approach describes the approach to be chosen. We can for example require all modules to be inspected, and we may prescribe specific testing techniques for subsystem
and system testing, according to the company standards.

Pass/Fail criteria define the acceptance criteria, i.e., the criteria used for deciding the readiness of the software-under-test. We may for example tolerate some
low-impact faults, but ask for the absence of critical faults before approving a module.

Suspension and resumption criteria describe the conditions under which testing activities cannot be profitably executed. We may for example decide to suspend testing activities when the failure rate prevents a reasonable execution of the
system-under-test, and resume testing only after the success of a “sanity test”
that checks for a minimum level of executability of the unit-under-test.

Risks and contingencies identify risks and define suitable contingency plans.
Testing may face both risks common to many other development activities, e.g.,
personnel (loss of technical staff), technology (low familiarity with the technology) and planning (delays in some testing tasks) risks, and risks specific to testing, e.g., development (delivery of poor-quality components to the testing group),
execution (unexpected delays in executing test cases) and requirements (critical
requirements) risks.

Deliverables list all required deliverables.

Task and schedule articulate the overall quality process in tasks scheduled according to development, timing and resource constraints to meet the deadlines.
The early plan is detailed while progressing with the development, to adjust to the
project structure. The early plan will for example indicate generic module testing
and inspection tasks that will be detailed when the design identifies the specific
modules comprising the system.

Staff and responsibilities identify the quality staff and allocate responsibilities.

Environmental needs indicate any requirements that may derive from the environment, e.g., specific equipment that is required to run test cases.

Fig. 1. The structure of a standard quality plan.
[Figure 2 plots the number of faults (total, critical, severe, and moderate) against system builds 1 to 10.]

Fig. 2. A typical distribution of faults for system builds over time.
information about the amount of work done and comparing it with the plan: the
quality engineer records the start, end, consumed resources and progress of each
activity, and reacts to deviations from the current plan either by adapting it,
when deviations are tolerable, or by adopting a contingency plan, when deviations are critical. The evaluation of quantitative progress is difficult and has
been exploited only recently. It consists of gathering information about the
distribution of faults and comparing it with historical data. Fault exposure
follows similar patterns across similar projects both in terms of frequency of
faults and in terms of distribution across different fault categories.
The diagram of Figure 2 is taken from [6] and illustrates the distribution of
faults over releases, considering three levels of severity. The diagram clearly
indicates that fault occurrence grows for the first builds before decreasing.
The number of faults then decreases at different speeds: critical and severe faults
decrease faster than moderate ones, and we can even expect a slight growth
of moderate faults. Different distributions of faults signal possible anomalies
in the quality process: a non-increasing fault occurrence in the early builds
may indicate poor tests, while a non-decreasing fault occurrence in the later
releases may indicate bad fault diagnosis and removal.
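As a rough illustration of such quantitative monitoring, the sketch below compares fault counts of the first and last builds against the expected rise-then-fall profile; the heuristic and the numbers are invented, not taken from [6]:

```python
def fault_trend_warnings(faults_per_build, window=3):
    """Toy monitor for the fault distribution across builds.

    Historically, fault counts grow over the first builds and then decrease;
    a deviating profile may signal problems in the quality process.
    """
    warnings = []
    head, tail = faults_per_build[:window], faults_per_build[-window:]
    if not all(b > a for a, b in zip(head, head[1:])):
        warnings.append("non-increasing faults in early builds: possibly poor tests")
    if not all(b < a for a, b in zip(tail, tail[1:])):
        warnings.append("non-decreasing faults in late builds: check fault removal")
    return warnings

# A healthy profile (rising, peaking, then falling) raises no warnings:
healthy = [20, 60, 110, 150, 120, 80, 40, 20, 10, 5]
print(fault_trend_warnings(healthy))            # -> []

# Fault counts that keep growing in the last builds are flagged:
suspect = [20, 60, 110, 150, 140, 130, 135, 140, 150, 160]
print(fault_trend_warnings(suspect))
```

A real monitor would compare the observed distribution against historical data from similar projects rather than against a fixed shape, but the principle is the same.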
The Orthogonal Defect Classification (ODC), introduced at IBM in the
early nineties, proposes a detailed classification of faults and suggests monitoring different distributions to reveal possible problems in the quality process.
Details about ODC can be found in [7] and [1].
Despite available technologies and tools, the quality process heavily depends on people. The allocation of responsibilities is a key element of quality
strategies and plans and can determine the success of a project. As with many
other aspects, there is no unique solution, but approaches depend on organization, process and project. Large organizations usually prefer to separate
development and quality teams, while small organizations and agile processes
(e.g., XP) tend to assign development and quality responsibilities to the same
[Figure 3 depicts the V model: development phases (actual needs and constraints, system specifications, subsystem design, component specifications, modules) are paired with the corresponding test/analysis phases (acceptance test, system test/analysis, integration test/analysis, module test); reviews connect adjacent specifications, and the legend distinguishes verification from validation relations.]

Fig. 3. Development and Testing phases in the V model.
team. Separating development and quality teams encourages objective judgement of quality and prevents it from being subverted by scheduling pressure,
but restricts scheduling freedom and requires good communication mechanisms
between the two teams.
3 Integration and Component-based Testing
The V model shown in Figure 3 illustrates the four main testing levels: module, integration, system, and acceptance testing. Module or unit testing aims
at checking that the single modules behave as expected; integration testing
aims at checking module compatibility; system and acceptance testing aim at
verifying the behavior of the whole system with respect to system specifications and user needs, respectively.
The quality of single modules is necessary but not sufficient to guarantee
the quality of the final system. The failure of low-quality modules inevitably leads
to system failures that are often difficult to diagnose, and hard and expensive
to remove. Unfortunately, many subtle failures are caused by unexpected interactions among well designed modules. Well-known examples of unexpected
interactions among well-designed software modules are described in the investigation reports of the Ariane 5 accident that caused the loss of the rocket
on July 4th, 1996 [9], and of the Mars Climate Orbiter failure to achieve the
Mars orbit on September 23rd, 1999 [13].
In the Ariane accident, a module that was adequately tested and successfully used in previous Ariane 4 missions failed in the first Ariane 5 mission,
causing the chain of events that led to the loss of the rocket. The module
was in charge of computing the horizontal bias, and it failed because of an
overflow caused by the horizontal velocity of the Ariane 5 rocket, which is
higher than that of the Ariane 4. The Mars Climate Orbiter failure was caused by the
unexpected interactions between software developed by the JPL laboratories
and software developed by the prime contractor Lockheed Martin. The software developed by Lockheed Martin produced data in English units, while
the spacecraft operating data needed for navigation were expected in metric
units. In both cases, the modules worked well both stand-alone and when
integrated in working systems, but failed in other contexts due to integration
problems.
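The Ariane-style failure mode can be sketched in a few lines; the bias values below are illustrative, not the actual flight data:

```python
# Sketch of the Ariane-style integration fault: a floating-point value that
# always fit in a signed 16-bit integer on Ariane 4 overflows on Ariane 5,
# whose horizontal velocity is higher. Values are made up for illustration.

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_wrapping(value):
    """What an unchecked conversion effectively does: silent two's-complement wrap."""
    return (int(value) - INT16_MIN) % 2**16 + INT16_MIN

def to_int16_checked(value):
    """Defensive conversion: reject values outside the representable range."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)

ariane4_bias = 30000.0    # in range: conversion behaves as expected
ariane5_bias = 64000.0    # out of range: silently corrupted without a check

print(to_int16_wrapping(ariane4_bias))   # -> 30000
print(to_int16_wrapping(ariane5_bias))   # -> -1536, a nonsense bias value
```

The module was correct under Ariane 4's operational profile; the fault only surfaces when the reused module meets the new context, which is exactly why integration testing must revisit assumptions that unit testing took for granted.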
Integration faults are ultimately caused by incomplete specifications or
faulty implementations of interfaces, resource usage, or required properties.
Unfortunately, it may be difficult or cost-ineffective to specify all module interactions completely. For example, it may be very difficult to predict interactions between remote and apparently unrelated modules that share a
temporary hidden file, which happens to be given the same name by both, particularly if the name clash appears rarely and only in some installation
configurations. Integration faults can come from many sources:
• inconsistent interpretation of parameters or values, as in the case of the
Mars Climate Orbiter, where English and metric units were mixed;

• violations of value domains or of capacity/size, as happened in some versions of the Apache 2 web server, which could overflow a buffer while expanding
environment variables during configuration file parsing;

• side effects on parameters or resources, as can happen when modules use
resources not explicitly mentioned in their interfaces;

• missing or misunderstood functionality, as can happen when incomplete
specifications are badly interpreted;

• non-functional problems, which derive from under-specified non-functional
properties, like performance;

• dynamic mismatches, which derive from unexpected dynamic bindings.
Integration testing deals with many communicating modules. Big-bang
testing, which waits until all modules are integrated, is rarely effective, since integration faults may hide across different modules and remain uncaught, or
may manifest as failures long after their occurrence, thus becoming difficult to
localize and remove. Most integration testing strategies suggest testing integrated modules incrementally. Integration strategies can be classified as structural and feature-driven. Structural strategies define the order of integration
according to the design structure, and include bottom-up and top-down approaches,
and their combination, sometimes referred to as the sandwich or backbone strategy. They consist in integrating modules according to the use/include relation, starting from the top, the bottom, or both sides, respectively.
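Given a hypothetical use/include relation, a structural integration order can be derived mechanically with a topological sort, as in this sketch:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical use/include relation: each module maps to the modules it uses.
uses = {
    "ui":      {"pricing", "catalog"},
    "pricing": {"catalog", "db"},
    "catalog": {"db"},
    "db":      set(),
}

# Bottom-up integration: every module is tested only after the modules it
# uses, so drivers are needed but stubs are not.
bottom_up = list(TopologicalSorter(uses).static_order())
print("bottom-up:", bottom_up)       # db first, ui last

# Top-down integration walks the same relation in reverse, stubbing callees.
print("top-down:", bottom_up[::-1])
```

A sandwich strategy would interleave the two walks, meeting at an intermediate layer such as `pricing`.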
Feature-driven strategies define the order of integration according to the
dynamic collaboration patterns among modules, and include thread and critical-module strategies. Thread testing suggests integrating modules according
to threads of execution that correspond to the system features. Critical-module testing integrates modules according to an associated risk factor that
describes the criticality of modules.
describes the criticality of modules.
Feature-driven test strategies better match development strategies that
produce early executable systems, and may thus benefit from early user feedback, but they usually require more complex planning and management than
structural strategies. Thus, they are preferable only for large systems, where
the advantages outweigh the extra costs.
The use of COTS components further complicates integration testing.
Components differ from classical modules in that they are reused in different contexts, independently of their development. System designers, who reuse components, often do not have access to the source code or to the developers of
such components, but can only rely on the specifications of the components’
interfaces. Moreover, components are often reused in contexts that were not
foreseen at development time, and their behavior may not fully match
the specific requirements. When testing components, designers should identify
the different usage profiles of components and provide test suites for each of
the identified profiles. System designers should match their requirements with
the provided profiles and re-execute the integration tests associated with the
identified profiles, before deriving test suites specific to the considered context.
4 System, Acceptance and Regression Testing
Module and integration testing can provide confidence in the quality of the
single modules and in their interactions, but not in the behavior of the
overall system. For example, knowing that a module correctly handles the
product database, and that the product database inter-operates correctly with
the module that computes prices, does not assure that the whole system implements the discount policies as specified in the requirements. Moreover,
knowing that the system matches the requirements specifications does not assure that it behaves as expected by the users, whose expectations may not
fully match the results of the early analysis. We thus need to complete the
verification and validation process by testing the overall system against its
specifications and users’ needs. System testing verifies the correspondence
between the overall system and its specifications, while acceptance testing
verifies the correspondence between the system and users’ expectations.
4.1 Verification and Validation
Activities that aim at checking the correspondence of an implementation with
its specification are called verification activities, while activities that aim at
checking the correspondence between a system and users’ expectations are
called validation activities. The distinction between validation and verification
is informally illustrated by the diagram of Figure 4 and has been well framed
by Barry Boehm [2], who memorably described validation as “building the
right system” and verification as “building the system right”.

[Figure 4 contrasts validation, which relates actual needs and constraints to the delivered package, with verification, which relates requirements specifications to the delivered package.]

Fig. 4. The different perspectives of validation versus verification.
Validation and verification activities complement each other. Verification
can involve all development stages with users’ reviews of requirements and
design specifications, but the extent of users’ reviews is limited by the ability
of users to understand design and development details. Thus, the main validation activities concentrate on the final product that can be extensively tested
by the users during acceptance testing. The reliance on users and the late execution
of validation activities lead to high costs and risks; thus, validation must be
paired with verification activities, which can be executed extensively at early
stages, as illustrated in Figure 3, and do not require expensive and not-always-available user involvement.
Software is characterized by many properties that include dependability,
usability, security, interoperability, etc. Module and integration testing focus mostly on dependability issues, while system testing must consider all
properties, and thus involves many aspects and different techniques. Some
properties can be naturally verified. For example, the ability of a web application to serve up to a given number N of users with response time below a
given threshold τ can be verified by simulating the presence of N users and
measuring the response time. Other properties can be difficult to verify and
are a natural target for validation. For example, the ability of users to easily
obtain the required information from a web application is hard to verify, but
can be validated, e.g., by monitoring a sample population of users.
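The first kind of check can be sketched as follows, with a simulated request standing in for the real web application and illustrative values for N and τ:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(user_id):
    """Hypothetical stand-in for issuing one request to the web application."""
    start = time.perf_counter()
    time.sleep(0.01)                      # simulated service time
    return time.perf_counter() - start

def verify_response_time(n_users, threshold_s):
    """Simulate N concurrent users and check all response times stay below the threshold."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        times = list(pool.map(send_request, range(n_users)))
    return max(times) <= threshold_s, max(times)

ok, worst = verify_response_time(n_users=50, threshold_s=0.5)
print(f"within threshold: {ok}, worst observed: {worst:.3f}s")
```

Because the property is quantified (N users, threshold τ), a pass/fail verdict falls out mechanically; a subjective property like “easily” admits no such automation, which is what pushes it toward validation.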
Ultimately, the verifiability of properties depends on the way properties are
expressed. For example, a property of a web application expressed as: users
must be able to easily add an item to the shopping cart without experiencing
annoying delays cannot be intuitively verified, since it refers to subjective
feelings (“easily” and “annoying”). However, the same property expressed as:
users must be able to add a given item to the shopping cart with no more than 4
mouse clicks starting from the home page, and the delay of the application must
not exceed a second after the click when the application is serving up to ten
thousand users working concurrently can be verified, since subjective feelings
are rendered as quantifiable entities (“number of mouse clicks” and “delays
in seconds”). Thus, system testing starts early in the design, when writing
requirements specifications. Mature development processes schedule inspection
activities to assess the testability of requirements specifications, and thus
maximize the verifiability of the software product.
4.2 Different Validation Approaches
Verification and validation of different properties require specific approaches.
Dependability properties can be verified using functional and model-based
testing techniques discussed in the next section, and can be validated by monitoring a selected set of users. To illustrate the different techniques that are
needed for non-functional properties, here we briefly survey approaches for
verifying and validating usability properties. A standard process for verifying
and validating usability includes the following main steps:
• Inspecting specifications with classic inspection techniques and ad-hoc checklists.

• Testing early prototypes produced by statically simulating the user interfaces.

• Testing incremental releases with both usability experts and end users.

• Performing final system and acceptance testing that includes expert-based
inspection and testing, user-based testing, comparison testing, and automatic analysis and checks.
We can notice that usability testing heavily relies on users, unlike
functional and model-based testing, which do not require user intervention.
User based testing is carefully planned and evaluated: the usability team
identifies classes of users, selects suitable samples of the population according
to the identified classes, defines sets of interactions that well represent common
and critical usages of the system, carefully monitors the interactions of the
selected users with the system, and finally evaluates the results. More details
on usability testing can be found in [11].
4.3 Regression testing
Software is not produced linearly, but undergoes several builds and releases.
Each new build or release may introduce or uncover faults, and thus result in
failures not experienced in previous versions. It is thus important to check
that each new release does not regress with respect to the previous ones. Such
testing is often called non-regression testing or, for short, regression testing.
A simple approach to regression testing, known as the retest-all approach, consists in re-running all test cases designed for the previous versions, to check whether the new version presents anomalous behaviors not experienced in the former versions. This simple approach may present nontrivial problems and significant costs, which derive from the need to adapt test cases that are not immediately re-executable on the new version. Moreover, re-running all test cases may be too expensive and not always useful.
The number of test cases to be re-executed can be reduced with ad-hoc
techniques tailored to the specific application or with general-purpose selection
or prioritization techniques. Selection techniques work on code or on specifications. Those that work on code record the program elements exercised by the tests on previous releases, and select the test cases that exercise elements changed in the current release. Various code-based selection techniques focus on different program elements: control flow, data flow, program slices, etc. Code-based selection techniques enjoy good tool support and work even when specifications are not properly maintained, but they do not scale up easily: they apply well to simple local changes, but present difficulties when changes affect large portions of the software.
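As an illustration, the code-based selection step can be sketched as follows; the shape of the coverage data and the element names are hypothetical, since the paper does not prescribe a concrete representation:

```python
def select_regression_tests(coverage, changed_elements):
    """Code-based regression test selection: keep the test cases that
    exercised at least one program element (statement, branch, ...)
    changed in the current release."""
    return sorted(test for test, elements in coverage.items()
                  if elements & changed_elements)

# hypothetical coverage data recorded on the previous release
coverage = {
    "t1": {"stmt_1", "stmt_2"},
    "t2": {"stmt_2", "stmt_3"},
    "t3": {"stmt_4"},
}
print(select_regression_tests(coverage, {"stmt_3", "stmt_4"}))
# selects t2 and t3, the tests that touch the changed statements
```

The same skeleton applies to any granularity of program element; only the recorded coverage sets change.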
Specification-based selection techniques focus on changes in the specifications. They scale up much better than code-based techniques, since they are not bound to the amount of changed code, but they require properly maintained specifications. They work particularly well with model-based testing techniques, which can be augmented with change trackers to identify the tests to be re-executed. For example, if the system is modeled with finite state
to be re-executed. For example, if the system is modeled with finite state
machines, we can easily extend classic test generation criteria to focus on elements of the finite state machines that are changed or added in the current
build [4].
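Under the assumption that each test case is recorded as the sequence of finite-state-machine transitions it traverses, this selection criterion can be sketched as follows (the encoding of transitions as (source, event, target) triples is illustrative):

```python
def select_tests_for_changed_fsm(test_paths, changed_transitions):
    """Specification-based selection on a finite state machine model:
    re-execute the tests whose path traverses at least one transition
    changed or added in the current build."""
    return sorted(name for name, path in test_paths.items()
                  if any(t in changed_transitions for t in path))

# hypothetical test paths over (source, event, target) transitions
paths = {
    "t1": [("s0", "a", "s1"), ("s1", "b", "s2")],
    "t2": [("s0", "a", "s1"), ("s1", "c", "s0")],
}
print(select_tests_for_changed_fsm(paths, {("s1", "c", "s0")}))
# only t2 traverses the changed transition
```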
Test case prioritization techniques do not select a subset of test cases, but
rather define priorities among tests and suggest different execution strategies.
Priorities aim at maximizing the effectiveness of tests by postponing the execution of cases that are less likely to reveal faults. Popular priority schemas are
based on execution history, fault detection effectiveness and code structure.
History-based priority schemas assign low priority to recently executed test
cases. In this way, we guarantee that all test cases will be re-executed in the
long run. This technique works particularly well for frequent releases, e.g.,
overnight regression testing. Priority schemas that focus on fault-detection
effectiveness raise the priority of tests that revealed faults in recent versions,
and thus are likely to exercise unstable portions of the code and reveal new or
persistent faults. Structural priority schemas give priority either to test cases
that exercise elements not recently executed or to test cases that result in high
coverage. In the first case, they try to minimize the chances that portions of
the code will not be tested across many consecutive versions; in the second
case, they try to minimize the set of tests to be re-executed to achieve high
coverage [8].
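A minimal sketch of a combined priority schema, assuming each test carries the timestamp of its last execution and a flag recording whether it recently revealed a fault (both hypothetical bookkeeping, not an algorithm from [8]):

```python
def prioritize(tests, last_run, revealed_fault):
    """Order regression tests: tests that recently revealed faults run
    first; among the others, the least recently executed run earlier,
    so that every test is eventually re-executed in the long run."""
    return sorted(tests,
                  key=lambda t: (t not in revealed_fault,  # fault-revealing first
                                 last_run.get(t, 0)))      # then oldest execution

tests = ["t1", "t2", "t3"]
last_run = {"t1": 10, "t2": 3, "t3": 7}   # hypothetical timestamps
print(prioritize(tests, last_run, revealed_fault={"t1"}))
# ['t1', 't2', 't3']: t1 revealed a fault, then t2 and t3 by age
```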
All regression testing techniques rely on good maintenance of test cases and
good test documentation. High-quality test suites can be maintained across
versions by identifying and removing obsolete test cases, and revealing and
suitably marking redundant test cases. Good test documentation includes
planning, test specifications and reporting documents. Planning documents
include test plans that we briefly described in Section 2, and test strategies
that summarize the quality strategies of the company or the group. Test
specification documents focus on test suites and single test cases. Reporting
documents summarize the results of the execution of single test cases and
test suites. Regression testing requires the coordination of specification and
reporting documents.
5 Functional and Model-based Testing
Functional testing is the baseline technique for designing test cases, for a number of reasons: it applies to all levels of the requirements specification and design process, from system to module testing, and is the only test design technique with such wide and early applicability; it is effective in finding classes of faults that typically elude code- and fault-based testing; it can be applied to any description of program behavior; and, finally, functional test cases are typically less expensive to design and execute than white-box tests.
Functional testing denotes a wide set of techniques that apply to different specification models and application domains. In many cases, specifications are given in the form of models or can easily be rendered as models. For example, specifications are sometimes given in the form of finite state machines, or decision or control models, or can be captured with one of these models even if expressed informally. In these cases, functional test cases can be derived directly from the model, and we speak of model-based testing.
Modern general-purpose functional testing approaches include category-partition [12], combinatorial [3] and catalog-based testing [10].
Category partition testing applies to specifications expressed in natural
language and consists of three main steps:
decompose the specification into independently testable features: the
test designer identifies specification items that can be tested independently,
and identifies parameters and environment elements that determine the behavior of the considered feature (categories). For example, if we are testing
a web presence, we may identify the catalog handler functionality described
in Figure 5 as an independently testable feature.
From the informal specification, we can deduce the following categories
that influence the behavior of the functionality: number of orders in the
last period, amount in stock, type and status of the item, status of the
assembled items.
identify relevant values: the test designer selects a set of representative classes of values (choices) for each parameter characteristic. Values are selected in isolation, independently of other parameter characteristics. For example, Figure 6 shows a possible set of choices for the categories extracted from the specification of the catalog handler of Figure 5 (readers should not consider the keywords in square brackets yet). We notice that choices are not always individual values, but in general indicate classes of
Catalog handler
...
The production of an item is suspended if in the last sales period the number of orders falls below a given threshold t1, or if it falls below a threshold t2 > t1 and the amount in stock is above a threshold s2. An item is removed from the catalog if it is not in production, the amount of orders in the previous period remains below t1, and the amount in stock falls below a threshold s1 < s2. The production of an item in the catalog is resumed if the amount of orders in the former period is higher than t2 and the amount in stock is less than s1.
Items that are sold also in combination with other items are handled jointly with the assembled items, i.e., the production is not suspended if one of the assembled items is still in production, regardless of the sales and the stock of the considered item, and similarly an item is kept in the catalog, even if eligible for withdrawal, if the assembled items are kept in the catalog.
The amount in stock cannot exceed the maximum capacity for each item.
...
Fig. 5. An excerpt of a specification of a catalog handler that determines the
production and the sale of items on the basis of the sales and the stock in the last
sale period.
homogeneous values.
When selecting choices, test designers shall refer to normal values as well
as boundary and error values, i.e., values on the borderline between different
classes (e.g., zero, or the values of the thresholds for the number of orders
or the amount in stock that are relevant for the considered case.)
generate test case specifications: test case specifications can be straightforwardly generated as combinations of the choices identified in the above
steps.
Unfortunately, the mere combination of all possible choices produces extremely large test suites. For example, the simple set of choices of Figure
6 produces more than 1,000 combinations. Moreover, many combinations
may make little or no sense. For example, combining individual items with
different status of the assembled items makes no sense, or test designers
may decide to test only once the boundary cases.
Test designers can eliminate erroneous combinations and limit singletons by imposing simple constraints: [error] marks erroneous values and requires at most one test case for that value; [single] marks singleton values and likewise requires at most one test case for them; the pair ["label"], [if-"label"] constrains one value to occur only in combination with a value marked with the label. The constraints given in Figure 6 reduce the number of generated test cases from more than 1,000 to 82, which represent an adequate set of test cases.
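The generation step, including the [error] and [single] constraints, can be sketched as follows; the encoding of categories as lists of (choice, tag) pairs is an assumption for illustration, not the notation of [12], and the [if-label] property constraints are omitted for brevity:

```python
from itertools import product

def category_partition(categories):
    """Generate test case specifications from categories and choices.
    A choice tagged 'error' or 'single' yields exactly one test case;
    the remaining choices are combined exhaustively."""
    combinable, singletons = {}, []
    for category, choices in categories.items():
        combinable[category] = []
        for value, tag in choices:
            if tag in ("error", "single"):
                singletons.append({category: value})  # one case only
            else:
                combinable[category].append(value)
    combos = [dict(zip(combinable, values))
              for values in product(*combinable.values())]
    return combos + singletons

# a tiny hypothetical example: 2 x 2 combinations plus 2 singletons
categories = {
    "amount in stock": [("< s1", None), ("s1", None), ("> smax", "error")],
    "type of item": [("individual", None), ("assembled", None)],
}
print(len(category_partition(categories)))   # 2 * 2 + 1 = 5
```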
The category partition approach helps when natural constraints between
# of orders in the last period: 0 [single]; < t1; t1; < t2; t2 [single]; > t2; >> t2 [single]
amount in stock: 0; < s1 [single]; s1; < s2 [single]; s2; < smax; smax [single]; > smax [error]
type of item: individual; assembled [assmbl]
status of the item: in production; in catalog; not available [error]
status of the assembled item: in production [if assmbl]; in catalog [if assmbl]; not available [if assmbl][error]
Fig. 6. Category partition testing: A simple example of categories, choices and constraints for the catalog handler specification of Figure 5. The choices for each category are listed after the category name. Constraints are shown in square brackets.
choices reduce the number of combinations, as in the simple catalog handler example. In other cases, however, choices are not naturally constrained, and the number of test cases generated by considering all possible combinations of choices may exceed the testing budget. Consider for example the choices of Figure 8, which derive from the informal specification of the discount policy given in Figure 7. None of the choices are naturally constrained, and thus we obtain a combinatorial number of test cases, which in the current example amounts to 162, but could grow exponentially with a slightly larger number of categories or choices, as in most real cases.
Forcing constraints does not help, since it reduces the number of generated
test cases by eliminating combinations of values that could reveal important
faults. A different approach consists in limiting the combinations by covering
only pairwise combinations of choices. For example, all pairwise combinations
of Figure 8 can be covered with less than 18 test cases. Generating test cases
that cover only pairwise combinations is known as combinatorial testing and is
based on experimental evidence that most failures depend on single choices or
pairwise combinations of choices, and rarely depend on specific combinations
of many different choices. Thus covering all pairwise combinations of choices
can reveal most potential failures with test suites of limited size.
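A greedy construction of a pairwise-covering suite, in the spirit of (but much simpler than) the AETG system [3], might look like this; the exhaustive enumeration of candidate test cases is affordable only because the example is small:

```python
from itertools import combinations, product

def pairwise_suite(categories):
    """Greedily pick complete test cases until every pair of choices
    from every pair of categories is covered at least once."""
    names = list(categories)
    uncovered = {(a, va, b, vb)
                 for a, b in combinations(names, 2)
                 for va in categories[a]
                 for vb in categories[b]}
    suite = []
    while uncovered:
        # keep the complete test case covering the most new pairs
        best = max((dict(zip(names, values))
                    for values in product(*categories.values())),
                   key=lambda case: sum((a, case[a], b, case[b]) in uncovered
                                        for a, b in combinations(names, 2)))
        suite.append(best)
        uncovered -= {(a, best[a], b, best[b])
                      for a, b in combinations(names, 2)}
    return suite

# the categories and choices of Figure 8
categories = {
    "customer type": ["individual", "business", "educational"],
    "# of orders in the period": ["<= o1", "<= o2", "> o2"],
    "total invoices in the period": ["<= t1", "<= t2", "> t2"],
    "amount in the invoice": ["<= i1", "<= i2", "> i2"],
    "credit situation": ["ok", "risky"],
}
print(len(pairwise_suite(categories)))  # well below the 162 exhaustive combinations
```

Industrial tools use heuristics instead of exhaustive enumeration, but the covering goal is the same.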
Discount handler
...
Discounts are applied to both individuals and educational institutions. Additional discounts are applied if the amount of orders in the considered sale period exceeds a threshold o1 or a threshold o2 > o1, respectively. Discounts are also applied if the amount of the current invoice or the total amount of invoices in the current sales period exceeds given thresholds (i1 and i2 > i1 for the current invoice, and t1 and t2 > t1 for the total invoices in the period). Discounts can be cumulated without limits. Customers with a risky credit history do not qualify for any discount.
...
Fig. 7. An excerpt of a specification of a discount policy handler that determines
the applicable discount on the basis of the status of the customer and the amount
of purchases.
customer type: individual; business; educational
# of orders in the period: ≤ o1; ≤ o2; > o2
total invoices in the period: ≤ t1; ≤ t2; > t2
amount in the invoice: ≤ i1; ≤ i2; > i2
credit situation: ok; risky
Fig. 8. A set of categories and choices that do not have natural constraints.
Category partition and combinatorial testing can be fruitfully combined by first constraining the combinations of choices and then covering only pairwise combinations.
When selecting choices for the identified categories, we identified normal values as well as boundary and error conditions. In general, many faults hide in special cases that depend on the type of the considered elements. For example, when dealing with a range [low, high] of values, test experts suggest considering at least a value within the boundaries, the bounds low and high themselves, the values immediately before and after each bound, and at least one more value outside the range. The knowledge of expert test designers can be captured in catalogs that list all the cases that shall be considered for each type of specification element. We may build both general-purpose and specialized catalogs: the former can be used in most cases, while the latter apply to specific domains characterized by particular cases. Catalogs work best with well-structured specifications. Thus, in general, catalog-based testing first transforms specifications into the form of pre- and post-conditions, variables, definitions and functions, and then applies catalogs to derive a complete set of test cases.
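The catalog entry for the range case mentioned above can be sketched as a small generator; the exact value set is one reasonable reading of the expert advice, not a standard catalog:

```python
def range_catalog(low, high):
    """Test values suggested for a numeric range [low, high]: the
    bounds, their immediate neighbors (inside and outside the range),
    and a value well within the range."""
    return [low - 1, low, low + 1,        # around the lower bound
            (low + high) // 2,            # a value inside the range
            high - 1, high, high + 1]     # around the upper bound

print(range_catalog(0, 100))
# [-1, 0, 1, 50, 99, 100, 101]
```

A specialized catalog would add domain-specific cases, e.g., the thresholds t1, t2, s1 and s2 of the catalog handler example.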
In many cases, specifications are given as models, or can easily be mapped onto models. Common cases include decision structures, control and data flow graphs, and finite state models in different forms. In all these cases,
...
Shopping cart
A shopping cart can be manipulated with the following functions:
createCart() that creates a new empty cart.
addItem(item, quantity) that adds the indicated quantity of items to the cart.
removeItem(item, quantity) that removes the indicated quantity of items from
the cart.
clearCart() that empties the cart regardless of its content.
buy() that freezes the content of the cart and computes the price.
...
Fig. 9. An excerpt of a specification of a shopping cart for a web shopping application.
we can derive test cases by applying test generation criteria to the models of
the specification. In this way, we concentrate on the creative steps of testing
(the derivation of the finite state model) and use automatic techniques for
the repetitive tasks (the application of test case generation criteria). In the
following, we focus on finite state models, but the same approach can be
applied to many different models.
When specifications describe transitions among a finite set of states, it is often natural to derive a finite state model for generating test case specifications. For example, the informal specification of a shoppingCart functionality of a web presence given in Figure 9 can be modeled with the finite state machine of Figure 10. The finite state machine does not replace the informal specification, but captures its state-related aspects. Test cases can be generated by deriving sequences of invocations that cover different elements of the finite state machine. A simple criterion requires covering all transitions (transition coverage). More sophisticated criteria require covering different kinds of paths (single state path coverage, single transition path coverage, boundary interior loop coverage).
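Transition coverage can be automated with a simple breadth-first construction; the FSM encoding and the rendering of the shopping cart machine below are illustrative assumptions, not an established algorithm from the paper:

```python
from collections import deque

def transition_coverage(transitions, start):
    """Build test paths from the start state that jointly cover every
    transition: for each uncovered transition, prepend the shortest
    prefix reaching its source state."""
    def shortest_prefix(target):
        # breadth-first search over (source, event, destination) edges
        seen, queue = {start: []}, deque([start])
        while queue:
            state = queue.popleft()
            if state == target:
                return seen[state]
            for (src, event, dst) in transitions:
                if src == state and dst not in seen:
                    seen[dst] = seen[state] + [(src, event, dst)]
                    queue.append(dst)
        raise ValueError(f"state {target} is unreachable")

    tests, covered = [], set()
    for transition in transitions:
        if transition not in covered:
            path = shortest_prefix(transition[0]) + [transition]
            covered.update(path)
            tests.append(path)
    return tests

# a hypothetical rendering of the shopping cart machine of Figure 10
fsm = [("noCart", "createCart", "emptyCart"),
       ("emptyCart", "addItem", "filledCart"),
       ("filledCart", "removeItem", "emptyCart"),
       ("filledCart", "buy", "readyForPurchasing")]
for path in transition_coverage(fsm, "noCart"):
    print([event for (_, event, _) in path])
```

Each printed event sequence is a test case specification; the path-based criteria mentioned above would replace the shortest-prefix step with richer path enumeration.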
Finite state behaviors are often rendered with models that provide features for simplifying the model. Statecharts are one of the best known examples: and- and or-decomposed states, as well as history and default states, can greatly simplify a plain finite state machine model. For example, in the Statechart of Figure 11, we grouped the emptyCart and filledCart states of the finite state machine of Figure 10 into a single or-decomposed state, thus merging the two buy edges into a single one. In most cases, we can easily extend the criteria available for finite state machines and add new ones that take advantage of the specific structure of the new model. For example, in the presence of large Statecharts, we can reduce the size of the generated test cases
[Figure 10 depicts a finite state machine with states noCart, emptyCart, filledCart and ready for purchasing, and transitions labeled createCart(), addItem(item, quantity), removeItem(item, quantity), clearCart() and buy().]
Fig. 10. A finite state machine model extracted from the informal specification of the shopping cart given in Figure 9.
[Figure 11 depicts a Statechart with a noCart state and a single or-decomposed state grouping emptyCart and filledCart; transitions are labeled createCart(), addItem(item, quantity), removeItem(item, quantity), clearCart() and buy().]
Fig. 11. A Statecharts specification of the shopping cart described in Figure 9.
by covering only the transitions of the Statechart and not all the transitions of the equivalent finite state machine. In the example, we would cover transition buy only once, and not twice as in the equivalent machine.
6 Conclusions
Software test and analysis has been an active research area for many decades, and today quality engineers can benefit from many results, tools and techniques. However, the challenges are far from over: while many traditional research problems are still open, advances in design and applications raise many new ones. So far, research on testing theory has produced mostly negative results that indicate the limits of the discipline but call for additional study: we still lack a convincing framework for comparing different criteria and approaches. The available testing techniques are certainly useful, but not completely satisfactory yet: we need more techniques to address new programming paradigms and application domains and, more importantly, better support for test automation.
Component-based development, heterogeneous mobile applications, and complex computer systems raise new challenges: it is often impossible to predict all possible usages and execution environments of software systems, and thus we need to move from classic analysis and testing approaches, which mostly work before deployment, to approaches that work after deployment, such as dynamic analysis and self-healing, self-managing and self-organizing software.
References
[1] I. Bhandari, M. J. Halliday, J. Chaar, K. Jones, J. Atkinson, C. Lepori-Costello, P. Y. Jasper, E. D. Tarver, C. C. Lewis, and M. Yonezawa. In-process improvement through defect data interpretation. IBM Systems Journal, 33(1):182–214, 1994.
[2] B. W. Boehm. Software Engineering Economics. Prentice Hall, Englewood
Cliffs, NJ, 1981.
[3] D. M. Cohen, S. R. Dalal, M. L. Fredman, and G. C. Patton. The AETG system: An approach to testing based on combinatorial design. IEEE Transactions on Software Engineering, 23(7):437–444, July 1997.
[4] T. Graves, M. J. Harrold, J.-M. Kim, A. Porter, and G. Rothermel. An
empirical study of regression test selection techniques. In Proceedings of the
20th International Conference on Software Engineering, pages 188–197. IEEE
Computer Society Press, April 1998.
[5] P. A. Hausler, R. C. Linger, and C. J. Trammell. Adopting cleanroom software
engineering with a phased approach. IBM Systems Journal, March 1994.
[6] A. Jaaksi. Assessing software projects: Tools for business owners. In
Proceedings of the 9th European Software Engineering Conference held jointly
with 10th ACM SIGSOFT International Symposium on Foundations of Software
Engineering (ESEC/FSE 2003), pages 15–18. ACM Press, September 2003.
[7] J. K. Chaar, M. J. Halliday, I. S. Bhandari, and R. Chillarege. In-process evaluation for software inspection and test. IEEE Transactions on Software Engineering, 19(11):1055–1070, November 1993.
[8] J.-M. Kim and A. Porter. A history-based test prioritization technique for
regression testing in resource constrained environments. In Proceedings of the
International Conference on Software Engineering (ICSE 2002), May 2002.
[9] J. L. Lions. Ariane 5, flight 501 failure, report by the inquiry board. Technical
report, 1996.
[10] B. Marick. The Craft of Software Testing: Subsystems Testing Including Object-Based and Object-Oriented Testing. Prentice-Hall, 1997.
[11] J. Nielsen. Designing Web Usability: The Practice of Simplicity. New Riders
Publishing, Indianapolis, 2000.
[12] T. J. Ostrand and M. J. Balcer. The category-partition method for specifying
and generating functional tests. Commun. ACM, 31(6):676–686, June 1988.
[13] Mars Program Independent Assessment Team. Mars program independent assessment team summary report. Technical report, NASA, 2000.