
JID: INFSOF

ARTICLE IN PRESS [m5GeSdc;September 26, 2018;19:9]

Information and Software Technology 000 (2018) 1–22

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

Guidelines for including grey literature and conducting multivocal literature reviews in software engineering
Vahid Garousi a,∗, Michael Felderer b,c, Mika V. Mäntylä d
a Information Technology Group, Wageningen University, Netherlands
b University of Innsbruck, Austria
c Blekinge Institute of Technology, Sweden
d M3S, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland

Article info

Keywords: Multivocal literature review; Grey literature; Guidelines; Systematic literature review; Systematic mapping study; Literature study; Evidence-based software engineering

Abstract

Context: A Multivocal Literature Review (MLR) is a form of a Systematic Literature Review (SLR) which includes the grey literature (e.g., blog posts, videos and white papers) in addition to the published (formal) literature (e.g., journal and conference papers). MLRs are useful for both researchers and practitioners since they provide summaries of both the state-of-the-art and the state-of-the-practice in a given area. MLRs are popular in other fields and have recently started to appear in software engineering (SE). As more MLR studies are conducted and reported, it is important to have a set of guidelines to ensure high quality of MLR processes and their results.
Objective: There are several guidelines for conducting SLR studies in SE. However, several phases of MLRs differ from those of traditional SLRs, for instance with respect to the search process and source quality assessment. Therefore, SLR guidelines are only partially useful for conducting MLR studies. Our goal in this paper is to present guidelines on how to conduct MLR studies in SE.
Method: To develop the MLR guidelines, we benefit from several inputs: (1) existing SLR guidelines in SE, (2) a literature survey of MLR guidelines and experience papers in other fields, and (3) our own experiences in conducting several MLRs in SE. We took the popular SLR guidelines of Kitchenham and Charters as the baseline and extended/adapted them for conducting MLR studies in SE. All derived guidelines are discussed in the context of an already-published MLR in SE as the running example.
Results: The resulting guidelines cover all phases of conducting and reporting MLRs in SE, from the planning phase, over conducting the review, to the final reporting of the review. In particular, we believe that incorporating and adapting a vast set of experience-based recommendations from MLR guidelines and experience papers in other fields has enabled us to propose a set of guidelines with solid foundations.
Conclusion: Having been developed on the basis of several types of experience and evidence, the provided MLR guidelines will support researchers in effectively and efficiently conducting new MLRs in any area of SE. We recommend that researchers utilize these guidelines in their MLR studies and then share their lessons learned and experiences.

1. Introduction

Systematic Literature Reviews (SLR) and Systematic Mapping (SM) studies were adopted from medical sciences in the mid-2000s [1], and since then numerous SLR studies have been published in software engineering (SE) [2, 3]. SLRs are valuable as they help practitioners and researchers by indexing evidence and gaps of a particular research area, which may consist of several hundreds of papers [4-9]. Unfortunately, SLRs fall short in providing full benefits since they typically review the formally-published literature only, while excluding the large bodies of the "grey" literature (GL), which are constantly produced by SE practitioners outside of academic forums [10]. As SE is a practitioner-oriented and application-oriented field [11], the role of GL should be formally recognized, as has been done for example in educational research [12, 13], health sciences [14–16], and management [17]. We think that GL can enable a rigorous identification of emerging research topics in SE, as many research topics already stem from the software industry.

SLRs which include both the academic and the grey literature were termed Multivocal Literature Reviews (MLR) in educational research [12, 13] in the early 1990s. The main difference between an MLR and an SLR is


∗ Corresponding author.
E-mail addresses: vahid.garousi@wur.nl (V. Garousi), michael.felderer@uibk.ac.at (M. Felderer), mika.mantyla@oulu.fi (M.V. Mäntylä).

https://doi.org/10.1016/j.infsof.2018.09.006
Received 8 May 2018; Received in revised form 15 September 2018; Accepted 17 September 2018
Available online xxx
0950-5849/© 2018 Elsevier B.V. All rights reserved.

Please cite this article as: V. Garousi et al., Information and Software Technology (2018), https://doi.org/10.1016/j.infsof.2018.09.006

the fact that, while SLRs use as input only academic peer-reviewed papers, MLRs in addition also use sources from the GL, e.g., blogs, videos, white papers and web-pages [18]. MLRs recognize the need for "multiple" voices rather than constructing evidence from only the knowledge rigorously reported in academic settings (formal literature). The MLR definition from [12] elaborates this: "Multivocal literatures are comprised of all accessible writings on a common, often contemporary topic. The writings embody the views or voices of diverse sets of authors (academics, practitioners, journalists, policy centers, state offices of education, local school districts, independent research and development firms, and others). The writings appear in a variety of forms. They reflect different purposes, perspectives, and information bases. They address different aspects of the topic and incorporate different research or non-research logics".

Table 1
Spectrum of the 'white', 'grey' and 'black' literature (from [25]).

'White' literature: published journal papers; conference proceedings; books.
'Grey' literature: preprints; e-prints; technical reports; lectures; data sets; audio-video (AV) media; blogs.
'Black' literature: ideas; concepts; thoughts.

Many SLR recommendations and guidelines, e.g., Cochrane [19], do not prevent including GL in SLR studies; on the contrary, they recommend considering the GL as long as GL sources meet the inclusion/exclusion criteria [20]. Yet, nearly all SLR papers in the SE domain exclude GL, a situation which hurts both academia and industry in our field. To facilitate adoption of the guidelines, we integrate boxes throughout the paper that cover concrete guidelines summarizing more detailed discussions of specific issues in the respective sections.

The purpose of this paper is therefore to promote the role of GL in SE and to provide specific guidelines for including GL and conducting multivocal literature reviews. We aim at complementing the existing guidelines for SLR studies [3, 21, 22] in SE to address peculiarities of including the GL in our field. Without proper guidelines, conducting MLRs by different teams of researchers may result in review papers with different styles and depth. We support the idea that "more specific guidelines for scholars on including grey literature in reviews are important as the practice of systematic review in our field continues to mature", which originates from the field of management sciences [17]. Although multiple MLR guidelines have appeared in areas outside SE, e.g., [19, 20], we think they are not directly applicable for two reasons. First, the specific nature of GL in SE needs to be considered (the type of blogs, question/answer sites, and other GL sources in SE). Second, the guidelines are scattered across different disciplines and offer conflicting suggestions. Thus, in this paper we integrate them all and utilize our prior MLR expertise to present a single "synthesized" guideline.

This paper is structured similarly to the SLR [22] and SM [3] guidelines in SE and considers three phases: (1) planning the review, (2) conducting the review, and (3) reporting the review results. The remainder of this guidelines paper is structured as follows. Section 2 provides a background on concepts of GL and MLRs. Section 3 explains how we developed the guidelines. Section 4 presents guidelines on planning an MLR, Section 5 on conducting an MLR, and Section 6 on reporting an MLR. Finally, in Section 8, we draw conclusions and suggest areas for further work.

2. Background

We review the concept of GL in Section 2.1. We then discuss different types of secondary studies (of which the MLR is one) in Section 2.2. Section 2.3 reviews the emergence of and need for MLRs in SE. We then motivate the need for a set of guidelines for conducting MLR studies in Section 2.4.

2.1. An overview of the concept of grey literature

We found several definitions of GL in the literature. The most widely used and accepted definition is the so-called Luxembourg definition, which states that "<grey literature> is produced on all levels of government, academics, business and industry in print and electronic formats, but which is not controlled by commercial publishers, i.e., where publishing is not the primary activity of the producing body" [23]. The Cochrane handbook for systematic reviews of interventions [24] defines GL as "literature that is not formally published in sources such as books or journal articles". Additionally, there is an annual conference on the topic of GL (www.textrelease.com) and an international journal on the topic (www.emeraldinsight.com/toc/ijgl/1/4). There is also a Grey Literature Network Service (www.greynet.org) which is "dedicated to research, publication, open access, education, and public awareness to grey literature".

To classify different types of sources in the GL, we adapted an existing model from the management domain [17] to SE in Fig. 1. The change that we made to the model in [17] to make it more applicable to SE was a revision of the outlets on the right-hand side under the three "tier" categories, e.g., we added the Q/A websites (such as StackOverflow).

The model shown in Fig. 1 has two dimensions: expertise and outlet control. Both dimensions run between the extremes "unknown" and "known". Expertise is the extent to which the authority and knowledge of the producer of the content can be determined. Outlet control is the extent to which content is produced, moderated or edited in conformance with explicit and transparent knowledge creation criteria. Rather than having discrete bands, the gradation in both dimensions is on a continuous range between known and unknown, producing the shades of GL.

The "shades" of grey model shown in Fig. 1 is quite consistent with Table 1, which shows the spectrum of the 'white', 'grey' and 'black' literature from another source [25]. The 'white' literature is visible in both Fig. 1 and Table 1 and denotes the sources where both expertise and outlet control are fully known. 'Grey' literature according to Table 1 corresponds mainly to the 2nd tier in Fig. 1 with moderate outlet control and credibility. For SE, we add Q/A sites like StackOverflow to the 2nd tier. 'Black' literature finally corresponds to ideas, concepts and thoughts. As blogs, but also emails and tweets, mainly refer to ideas, concepts or thoughts, they are in the 3rd tier. However, there are even "shades" of grey within this classification, and depending on the concrete content, a specific type of grey literature can be in a different tier than shown in Fig. 1. For instance, if a presentation (or a video, which is often linked to a presentation) is about new ideas, then it would fall into the 3rd tier.

Due to the limited control of expertise and outlet in GL, it is important to also identify GL producers. According to [25], the following GL producers were identified: (1) government departments and agencies (i.e., at municipal, provincial, or national levels); (2) non-profit economic and trade organizations; (3) academic and research institutions; (4) societies and political parties; (5) libraries, museums, and archives; (6) businesses and corporations; and (7) freelance individuals, i.e., bloggers, consultants, and web 2.0 enthusiasts. For SE, it might in addition also be relevant to distinguish different types of companies, e.g., startups versus established organizations, or different governmental organizations, e.g., military versus municipalities, producing GL. From a highly-cited paper from the medical domain [26], we can see that GL searches can go far beyond simple Google searches, as the authors searched "44 online resource and database websites, 14 surveillance system websites, nine regional harm reduction websites, three prison literature databases, and 33 country-specific drug control agencies and ministry of health websites". That paper highlighted the benefits of the GL by pointing out that 75% to 85% of their results were based on data sourced from the GL.
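The three-band spectrum of Table 1 can be expressed as a simple lookup from source type to literature category. The sketch below is purely illustrative (it is not part of the guidelines themselves) and only encodes the Table 1 columns:

```python
# Illustrative encoding of Table 1: spectrum of 'white', 'grey' and
# 'black' literature (source types per Table 1, from [25]).
SPECTRUM = {
    "white": {"published journal paper", "conference proceedings", "book"},
    "grey":  {"preprint", "e-print", "technical report", "lecture",
              "data set", "audio-video media", "blog"},
    "black": {"idea", "concept", "thought"},
}

def classify(source_type: str) -> str:
    """Return the Table 1 literature category for a given source type."""
    needle = source_type.strip().lower()
    for category, types in SPECTRUM.items():
        if needle in types:
            return category
    return "unknown"
```

For example, `classify("blog")` yields `"grey"`, while a source type not listed in Table 1 falls through to `"unknown"`.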


Fig. 1. “Shades” of grey literatures (from [17]).
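The two dimensions of the model in Fig. 1 can be made concrete as a small classification sketch. The tier names follow Fig. 1, but the numeric thresholds and the min() aggregation are our own illustrative assumptions, not values prescribed by [17]:

```python
def grey_tier(expertise: float, outlet_control: float) -> str:
    """Map the two Fig. 1 dimensions (0.0 = unknown, 1.0 = known) to a
    grey-literature tier. Thresholds (0.66, 0.33) are illustrative only."""
    # Assumption: a source is only as credible as its weaker dimension.
    score = min(expertise, outlet_control)
    if score >= 0.66:
        return "1st tier"   # e.g., technical reports, white papers
    if score >= 0.33:
        return "2nd tier"   # e.g., Q/A sites such as StackOverflow, wiki articles
    return "3rd tier"       # e.g., blogs, emails, tweets
```

This also reflects the caveat discussed above: the gradation is continuous, so a given source type (e.g., a presentation about new ideas) can land in a different tier than its usual position in Fig. 1.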

2.2. Different types of secondary studies

A secondary study is a study of studies. A secondary study does usually not generate any new data from a "direct" (primary) research study; instead, it analyses a set of primary studies and usually seeks to aggregate the results from these in order to provide stronger forms of evidence about a particular phenomenon [27]. In the research community, a secondary study is sometimes also called a "survey paper" or a "review paper" [28, 29]. There are different types of secondary studies. For example, a review of 101 secondary studies in software testing [29] classified secondary studies into the following types: regular surveys, systematic literature reviews (SLR), and systematic literature mappings (SLM or SM).

The number of secondary studies in many research fields has grown very rapidly in recent years. To get a sense of the popularity of systematic reviews, we searched for the term "systematic review" in paper titles in the Scopus search engine. As of this writing (April 24, 2018), this phrase returned 86,525 papers. We also did the same search, but wanted to focus only on the SE discipline. To do so in an automated manner, we specified in the search criteria that the term "software" appears in the "source title", i.e., venue (journal or conference) name. This approach was used in several recent bibliometric studies, e.g., [30-32], and was shown to be a precise way to automatically search for SE papers in Scopus. The search for "systematic review" in SE paper titles returned 401 papers as of this writing (April 2018).

In general, secondary studies are of high value both for SE practice and research. For instance, when asked about the benefit of a recent survey paper on testing embedded software, a practitioner tester mentioned that [33]: "There are a lot of studies in the pool of this review study, which would benefit us in choosing the best methods to test embedded software systems. I think review studies such as this one could be very beneficial for companies like ours". Furthermore, a recent tertiary study on software testing (an SLR of 101 secondary studies in software testing) [29] stresses the important role of secondary studies in SE in general and software testing in particular. It compared citations of secondary studies with citations of primary studies. The study found that citation metrics of the secondary studies were higher than those of the papers in the pools of three SM studies (web testing [34], GUI testing [35] and UML-SPE [36]). This suggests that the research community has already recognized the value of secondary studies, as secondary studies are cited on average more often than regular primary studies. Thus, it appears that if a secondary study (or an MLR) is conducted with interesting and "useful" RQs, it could bring value and benefit to practitioners and researchers.

As publishing various types of GL besides formal scientific literature is becoming more popular and widespread, adapted types of secondary studies, e.g., Multivocal Literature Reviews (MLR), are becoming popular as well. Therefore, respective guidelines for Multivocal Literature Reviews, which take GL into account, are needed. This article provides guidelines to perform these newer types of secondary studies to ensure effective/efficient execution of such studies and high quality of reported reviews.

To better characterize secondary studies in SE, we categorize the types of systematic secondary studies in SE and briefly discuss their similarities, differences and relationships. Based on the review of the literature and our studies in this area, e.g., [29], we categorize secondary studies in SE into six types, i.e., Systematic Literature Mapping (SLM), Systematic Literature Review (SLR), Grey Literature Mapping (GLM), Grey Literature Review (GLR), Multivocal Literature Mapping (MLM), and Multivocal Literature Review (MLR) (see Fig. 2).

As we specify in Fig. 2, the differentiation factors of the six types of systematic secondary studies are: type of analysis, and types of sources under study. For example, the difference between an MLR and an SLR is the fact that, while SLRs use as input only academic peer-reviewed articles, MLRs in addition also use sources from the GL, e.g., blogs, white papers, videos and web-pages [18].

Another type of literature review is the GLR. As the name implies, GLRs only consider GL sources in their pool of reviewed sources. Many GLR studies have also appeared in other disciplines, e.g., in medicine or social science [37–40]. For example, a GLR of special events for promoting cancer screenings was reported in [37]. To better understand and characterize the relationship between SLM, GLM and MLR studies, we visualize their relationship as a Venn diagram in Fig. 3. The same relationship holds among SLR, GLR and MLM studies (see Fig. 2). As Fig. 3 clearly shows, an MLR in a given subject field is a union of the sources that would be studied in an SLR and in a GLR of that field. As a result, an MLR, in principle, is expected to provide a more complete picture of the evidence as well as the state-of-the-art and -practice in a given field than an SLR or a GLR (we will discuss this aspect more in the next sub-section by rephrasing some results of our previous work in [41]).

Studies from all six types shown in Fig. 2 have started to appear in SE, e.g., a recent GLR paper [42] was published on the subject of choosing the right test automation tools. A Multivocal Literature Mapping (MLM) is conducted to classify the body of knowledge in a specific area, e.g., an MLM on software test maturity assessment and test process improvement [43]. Similar to the relationship of SLM and SLR studies


Fig. 2. Relationship among different types of systematic secondary studies.
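The source-pool relationships depicted in Fig. 2 and in the Venn diagram of Fig. 3 amount to plain set operations: the pool of an MLR is the union of the pools that an SLR and a GLR of the same field would review. A minimal sketch, with hypothetical source names:

```python
# Hypothetical source pools for one subject field (the names are made up
# for illustration; they are not from any of the reviewed studies).
slr_pool = {"journal-paper-A", "conference-paper-B"}    # formal literature
glr_pool = {"blog-post-C", "white-paper-D", "video-E"}  # grey literature

# Per Fig. 3, an MLR reviews the union of both pools.
mlr_pool = slr_pool | glr_pool

assert slr_pool <= mlr_pool and glr_pool <= mlr_pool  # an MLR subsumes both
assert slr_pool.isdisjoint(glr_pool)  # formal and grey sources do not overlap
```

This also captures why an MLR is expected to give a more complete picture than either an SLR or a GLR alone: each of the latter reviews only a subset of the MLR pool.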

Fig. 3. Venn diagram showing the relationship of SLR, GLR and MLR studies.

[22], an MLM can be extended by follow-up studies into a Multivocal Literature Review (MLR), where an additional in-depth analysis or qualitative coding of the issues and evidence in a given subject is performed, e.g., [44].

2.3. Benefits of and need for including grey literature in review studies (conducting MLRs)

Our previous work [41] explored the need for MLRs in SE. Our key findings indicated that (1) GL can give substantial benefits in certain areas of SE, and that (2) the inclusion of GL brings forward certain challenges, as the evidence in GL sources is often experience- and opinion-based. We found examples of numerous practitioner sources that had been ignored in previous SLRs, and we think that missing such information could have a profound impact on steering research directions. On the other hand, in that paper, we demonstrated the information gained when making an MLR. For example, the MLR on the subject of deciding when and what to automate in testing [44] would have missed a lot of expertise from test engineers if we had not included the GL, see Fig. 4.

Also, in other domains (e.g., in educational sciences), a key benefit of MLRs has been "closing the gap between academic research and professional practice" [45], which was reported as early as 1991. We have also observed this in the execution and usage of review results in a few MLRs that we have been involved in, e.g., [43, 44]. One main reason to conduct both of those MLR studies [43, 44] was the real-world needs in industrial settings that we had w.r.t. the topics of these two MLRs: when and what to automate in software testing in the case of [44], and software test maturity and test process improvement in the case of [43]. As reported in [46–49], we were faced with the challenge of systematically deciding when (in the lifecycle) and what (which test cases) to automate in several industrial contexts. The MLR study that we conducted [44] synthesized both the state-of-the-art and the state-of-the-practice to ensure that we would benefit from both research and industrial knowledge to answer the challenging questions (see Fig. 4). In a recent study [46], we used the results of the MLR [44] in practice and found the results very useful. We had a similar positive experience in using results from the other MLR [43] in our recent projects in software test maturity and test process improvement.

It should be highlighted that we are not advocating that all SLRs in SE should include GL and become MLRs. Instead, as we explain in Section 4.2, researchers considering conducting an SLR of formal literature only in a given SE topic should assess whether "broadening" the scope and including GL would add value and benefits to the review study and, only when the answer to those questions is positive, should they plan an MLR instead of an SLR. We will review the existing guidelines for those decisions in Section 4.2 and will adapt them to the SE context. Finally, it should be noted that including the GL in review studies is not always straightforward or advantageous [50]. There are some drawbacks as well, e.g., lower-quality reporting, particularly when describing research methodology. Thus, careful considerations should be taken in the different steps of an MLR study to be aware of such drawbacks (details in Sections 4–6).

2.4. GL and MLRs in SE

While extensive GL is available in the field of SE and the volume of GL in SE is clearly expanding at a very rapid pace (e.g., in blogs and free online books), little effort has been made to utilize such knowledge in SE research. Recently, small steps in this direction have been made by Rainer, who reported in [51] a preliminary framework and methodology based on "argumentation" theory [52] to identify, extract and structure SE practitioners' evidence, inference and beliefs. The authors argued that practitioners use (factual) stories, analogies, examples and popular opinion as evidence, and use that evidence in defeasible reasoning to justify their beliefs in GL sources (such as blogs) and to rebut the beliefs of other practitioners. Their paper [51] showed that the presented framework, methodology and examples could provide a foundation for SE researchers to develop a more sophisticated understanding of, and appreciation for, practitioners' defeasible evidence, inference and belief. We will utilize some inputs from the study of Rainer [51] in the development of our guidelines, especially for data synthesis (Section 5.5).

MLRs have recently started to emerge as a type of secondary study in SE. The "multivocal" terminology has recently started to appear in SE. Based on a literature search, we found several MLR studies in SE [18, 43, 44, 53–59]. We list those MLRs in Table 2 together with their topics, years of publication, the number of sources from the formal literature and the GL, as well as the ratio (%) of GL in the pool.


Table 2
List of MLRs in SE (sorted by year of publication). Each entry gives the year, topic and reference, the total number of sources in the pool and the percentage of GL in the pool, followed by the literature used for the MLR methodology and a brief summary of the MLR process.

2013, An exploration of technical debt [18] (35 sources, 100% GL): This paper used MLR information from [12]. The authors used a previously performed SLR for designing a grey literature review. After the grey literature, interviews were also done to collect primary data. They included the top 50 hits from Google and performed two iterations of searches, where the second iteration included new terms found in the first iteration. Quality filtering was done case-by-case.

2015, iOS applications testing [53] (21 sources, 42.9% GL): This paper used MLR information from [12]. It first performed academic searches (SLR). Then it used keywords from the academic search that were modified for the grey literature search. The paper studied the first 50 hits provided by the Google search engine. Topic- and quality-based filtering was done for the MLR.

2016, When and what to automate in software testing [44] (78 sources, 66.7% GL): This (MLR-AutoTest) is one of our prior works and it references multiple prior works about MLRs and including grey literature, yet the depth does not match this paper, as it was not a methodological paper. It is used as an example throughout this paper.

2016, Gamification of software testing [54] (20 sources, 70.0% GL): This is one of our prior works that uses the same strategy as in [44], but in general the approach is more limited, as it was only a short paper for a conference rather than a journal paper.

2016, Relationship of DevOps to agile, lean and continuous deployment [55] (234 sources, 85.9% GL): This paper used MLR information from [12, 18], and [44] to devise a search strategy. The paper combined three data sources: it first performed a grey literature review, then did an update of an SLR, and finally collected primary information from practitioners. The paper makes no mention of how the SLR and the grey literature search are linked. The first 230 hits of the Google search engine were included, as it was determined that hits below that were mostly job ads. Topic- and quality-based filtering was done.

2016, Characterizing DevOps [56] (43 sources, 44.2% GL): This paper used MLR information from [12]. The authors searched Google (grey literature) and Google Scholar (MLR); no indication is given whether one was searched before the other. Data collection and extraction were interleaved, and the search was stopped when no additional data could be extracted from new sources.

2017, Threat intelligence sharing platforms of software vendors [57] (22 sources, % of GL not available): This paper used MLR information from [12, 18], and our previous work [41]. The paper used 9 academic search engines and 2 search engines. No details on stopping criteria were given. Quality criteria were used for filtering.

2017, Serious games for software process standards education [58] (7 sources, 14.3% GL): In this paper, scientific searches were done first. The grey literature search was performed using only the scientific search results. It consisted of two steps, both using the academic primary studies: 1) backward and forward snowballing, and 2) studying the publication list of each academic author to find all the works the authors have performed in this area.

2017, Software test maturity and test process improvement [43] (181 sources, 28.2% GL): This is one of our prior works that uses the same strategy as in [44].

2018, Smells in software test code [59] (166 sources, 27.7% GL): This is one of our prior works that uses the same strategy as in [44].
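The "% of GL in the pool" figures reported in Table 2 are simply the grey share of each study's final pool. As a sketch, using the MLR-AutoTest row (78 sources in total; the count of 52 GL sources is our inference from the reported 66.7%, not a number stated in the table):

```python
def gl_ratio(num_gl_sources: int, total_sources: int) -> float:
    """Percentage of grey-literature sources in an MLR's final pool."""
    return 100.0 * num_gl_sources / total_sources

# MLR-AutoTest row of Table 2: 78 sources in total; a 66.7% GL ratio
# corresponds to 52 GL sources (52 is inferred from the percentage).
print(f"{gl_ratio(52, 78):.1f}% of the pool is grey literature")  # prints 66.7%
```

The same computation reproduces, e.g., the 14.3% of the [58] row (1 GL source out of 7).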

Fig. 4. An output of the MLR on deciding when and what to automate in testing (MLR-AutoTest).

From Table 2, one can see that MLRs are a recent trend in SE, as more researchers are seeing the benefit in conducting them (as discussed above). About nine MLRs have been published in SE between 2015 and 2018. As Table 2 shows, the scale of the listed MLRs varies w.r.t. the number of sources reviewed. While [58] studied serious games for software process standards education on a small set of 7 sources (of which only 1 was from the GL), [55] reviewed the relationship of DevOps to agile, lean and continuous deployment on a large set of 234 sources (of which 201 were from the GL). The ratio of GL in the pools of the MLRs also varies, from 14.3% in [58] to 85.9% in [55], which of course is due to the nature of the topic under study, i.e., the relationship of DevOps to agile, lean and continuous deployment seems to be a topic very active in industry compared to academia.

In some SE MLRs, an SLR has been performed prior to undertaking the grey literature review of the MLR, or the authors' prior work has had an existing SLR, e.g., [18, 53, 55, 58] (see Table 2). However, there are papers that have done parallel SLR and grey literature reviews, e.g., [43, 44, 54, 56, 59]. Some have also combined MLRs with interviews, e.g., [18, 55]. There are also some papers that have only done a grey literature review, e.g., [42, 60]. It is hard to reason about the order, as it depends on the goal and the existing body of academic and practitioner work.

Other SLRs have also included the GL in their reviews and have not used the "multivocal" terminology, e.g., [61]. A 2012 MSc thesis [50] explored the state of including the GL in the SE SLRs and found that the ratio of grey evidence in the SE SLRs was only about 9%, and that the GL evidence was concentrated mostly in the recent past (∼48% between the years 2007–2012). Furthermore, using GL as data has been described as a case study, as was done in a 2017 paper investigating pivoting in software start-up companies [60].

2.5. Lack of existing guidelines for conducting MLR studies in SE

Although the existing SLR guidelines (e.g., those by Kitchenham and Charters [22]) have briefly discussed the idea of including GL sources in SLR studies, most SLRs published so far in SE have not actually


Fig. 5. An overview of our methodology for developing the guidelines in this paper.

included GL in their studies. A search for the word “grey” in the SLR guideline by Kitchenham and Charters [22] returns just two hits, which we cite below:

“Other sources of evidence must also be searched (sometimes manually) including:

• Reference lists from relevant primary studies and review articles
• Journals (including company journals such as the IBM Journal of Research and Development), grey literature (i.e. technical reports, work in progress) and conference proceedings
• Research registers
• The Internet”

And:

“Many of the standard search strategies identified above are used, …, including:

• Scanning the grey literature
• Scanning conference proceedings”

While guidelines for SLR studies, e.g., [22], and SM studies [3, 21], could be useful for conducting MLRs, they do not provide specific guidance on how to treat GL in particular, since GL sources should be assessed differently in some steps compared to the formal literature, e.g., in quality assessment (as we discuss in Section 5.3).

Table 2 presents an analysis which shows that the first MLR works in SE mainly cited [12] from the education sciences when presenting their MLR process. More recent works have cited already-existing MLR studies in SE, such as [18] and [44], when presenting the MLR process. In the papers of Table 2, the treatment of the MLR methodology is typically quite brief (2–4 paragraphs), as they are not methodological papers. Our guidelines offer much broader coverage of the MLR literature than any of the previous MLR studies in SE.

To summarize, there is a lack of MLR guidelines in the SE literature. In particular, two papers explicitly discussed this shortage as follows: “there are no systematic guidelines for conducting MLRs in computer science” [57] and “There is no explicit guideline for collecting ML [multivocal literature]” [55]. We address that need in this paper.

3. An overview of the guidelines and its development

In Section 3.1, we explain how we developed the guidelines, and Section 3.4 provides an overview of the guidelines.

3.1. Developing the guidelines

In this section, we discuss our approach to deriving the guidelines for including the GL and conducting MLRs in SE. Fig. 5 shows an overview of our methodology. Four sources are used as input in the development of the MLR guidelines:

(1) A survey of 24 MLR guidelines and experience papers in other fields;
(2) Existing guidelines for SLR and SM studies in SE, notably the popular SLR guidelines by Kitchenham and Charters [22];
(3) The experience of the authors in conducting several MLRs [43, 44, 54, 62] and one GLR [42]; and
(4) A recent study by Rainer [51] on using argumentation theory to analyze software practitioners’ defeasible evidence, inference and belief.

There are several guidelines for SLR and SM studies in SE available [3, 21, 22, 63]. Yet, they mostly ignore the utilization of GL, as discussed in Section 2.4. Therefore, we see that our guidelines fill a gap by raising the importance of including GL in review studies in SE and by providing concrete guidelines with examples on how to address and include GL in review studies.

As shown in Fig. 5, we also used our own expertise from our recently-published MLRs [43, 44, 54, 62] and one GLR [42]. Additionally, our experience includes several SLR studies, e.g., [34–36, 64–68].

3.2. Surveying MLR guidelines in other fields

As shown in Fig. 5, one of the important sources used as input in the development of our MLR guidelines was a survey of MLR guidelines and experience papers in other fields. Via a systematic survey, we identified 24 such papers and conducted a review of those studies. The references of those 24 papers are as follows: [12–15, 17, 19, 20, 25, 45, 50, 69–82].

Each of those 24 MLR guideline and experience papers provided guidelines for one or several phases of an MLR: (1) the decision to include GL in review studies, (2) MLR planning, (3) the search process, (4) source selection (inclusion/exclusion), (5) source quality assessment, (6) data extraction, (7) data synthesis, (8) reporting the review (dissemination), and (9) any other type of guideline. In the rest of this paper, we have synthesized those guidelines and adapted them to the context of MLRs in SE by consolidating them with our own experience in MLRs.

Fig. 6 shows the number of papers, from the set of those 24 papers, per phase of an MLR. For example, 14 of those 24 papers provided guidelines for the search process of conducting an MLR. Details about this classification of MLR guideline papers can be found in an online source [83] available at goo.gl/b2u1E5.

3.3. Running example

We selected one MLR [44], on deciding when and what to automate in testing, as the running example, and we refer to it as MLR-AutoTest in the remainder of this guideline paper. When we present guidelines for each step of the MLR process in the next sections, we discuss whether and how the respective step and the guidelines were implemented in MLR-AutoTest.


Fig. 6. Number of papers in other fields presenting guidelines of different activities of MLRs (details can be found in [83]).

Table 3
Phases of the Kitchenham and Charters’ SLR guidelines (taken from page 6 of [22]).

Planning the review:
• Identification of the need for a review
• Commissioning a review
• Specifying the research question(s)
• Developing a review protocol
• Evaluating the review protocol

Conducting the review:
• Identification of research
• Selection of primary studies
• Study quality assessment
• Data extraction and monitoring
• Data synthesis

Reporting the review:
• Specifying dissemination mechanisms
• Formatting the main report
• Evaluating the report

Since we developed the guidelines presented in this paper after conducting several MLR studies, and based on our accumulated experience, it could be that certain steps of the guidelines were not systematically applied in MLR-AutoTest. In such cases, we will discuss how the guidelines of a specific step “should have been” conducted in that MLR. After all, working with GL has been a learning experience for all three authors.

3.4. Overview of the guidelines

From the SLR guidelines of Kitchenham and Charters [22], we adopt the three phases (1) planning the review, (2) conducting the review, and (3) reporting the review for conducting MLRs, since we have found them to be well classified and applicable to MLRs. The corresponding phases of our guidelines are presented in Sections 3, 4 and 5, respectively. There are also sub-steps for each phase, as shown in Table 3. To prevent duplication, we do not repeat the steps of the SLR guidelines [22] that are the same for conducting MLRs, but only present the steps that are different for conducting MLRs. Therefore, our guidelines focus mainly on GL sources, as handling sources from the formal literature is already covered by the existing SLR guidelines. Integrating both types of sources in an MLR is usually straightforward, as per our experience in conducting MLRs [43, 44, 54, 62].

4. Planning a MLR

As shown in Fig. 7, the MLR planning phase consists of the following two steps: (1) establishing the need for an MLR on a given topic, and (2) defining the MLR’s goal and raising its research questions (RQs). In this section, these two steps are discussed.

4.1. A typical process for MLR studies

We illustrate a typical MLR process in Fig. 7. As one can see, this process is based on the SLR process as presented in Kitchenham and Charters’ guidelines [22] and has been adapted to the context of multivocal literature reviews. Our figure visualizes the process, for better understandability, and we have extended it to make it suitable for MLRs. In Fig. 7, we have also added the numbers of the sections where we cover guidelines for specific process steps, to ease traceability between this process and the paper text. The process can also be applied to structure a protocol on how the review will be conducted. An alternative way to develop a protocol for MLRs is to apply the standard structure of a protocol for SLRs [27] and to consider the guidelines provided in this paper as specific variation points on how to consider GL. We believe that having a baseline process (template) from which other researchers can make their extensions/revisions could provide a semi-homogenous


Fig. 7. An overview of a typical MLR process.

process for conducting MLRs, and thus we provide the first of our set of guidelines as follows:

• Guideline 1: The provided typical process of an MLR can be applied to structure a protocol on how the review will be conducted. Alternatively, the standard protocol structure of SLRs in SE can be applied and the provided guidelines can be considered as variation points.

4.2. Raising (motivating) the need for a MLR

Prior to undertaking an SLR or an MLR, researchers should ensure that conducting a systematic review is necessary. In particular, researchers should identify and review any existing reviews of the phenomenon of interest [22]. We also think that conductors of an MLR or SLR should pay close attention to ensuring the usefulness of an MLR for its intended audience, i.e., researchers and/or practitioners, as early as its planning phase, when defining its scope, goal and review questions [3].

For example, the motivation of the MLR-AutoTest started entirely from our industry-academia collaboration on test automation. Our industry partners had challenges in systematically deciding when and what to automate in testing, e.g., [46-49], and thus we felt the real industrial need to conduct the MLR-AutoTest. Furthermore, since we found many GL sources on that topic, conducting an MLR was seen as much more logical than an SLR of academic sources. This brings us to an important guideline about motivating the need for an MLR:

• Guideline 2: Identify any existing reviews and plan/execute the MLR to explicitly provide usefulness for its intended audience (researchers and/or practitioners).

While establishing the need for a review, one should assess whether to perform an SLR, GLR or MLR, or their mapping study counterparts, see Fig. 2. Note that the question of whether or not to include the GL is the same as whether or not to conduct an MLR instead of an SLR. If the answer to that question is negative, then the next question is whether or not to conduct an SLR instead, which has been covered by the respective guidelines [3, 22]. Several MLR guidelines from other fields have addressed the decision whether to include the GL and conduct an MLR instead of an SLR. For example, they provide the following suggestions:

• GL provides “current” perspectives and complements gaps of the formal literature [25].
• Including GL may help avoid publication bias. Yet, the GL that can be located may be an unrepresentative sample of all unpublished studies [19].
• The decision to include GL in an MLR was a result of consultation with stakeholders, practicing ergonomists, and health and safety professionals [80].
• If GL were not included, the researchers thought that an important perspective on the topic would have been lost [80], and we observed a similar situation in the MLR-AutoTest, see Fig. 4.

Importantly, we found two checklists on whether to include GL in an MLR. A checklist from [81] includes six criteria. We want to highlight that, according to [81], GL is important when context has a large effect on the implementation and the outcome, which is typically the case in SE [84, 85]. We think that GL may help in revealing how SE outcomes are influenced by context factors like the domain, people, or applied technology. Another guideline paper [17] suggests including GL in reviews when relevant knowledge is not reported adequately in academic articles, for validating scientific outcomes with practical experience, and for challenging assumptions in practice using academic research. This guideline also suggests excluding GL from reviews of relatively mature and bounded academic topics. In SE, this would mean topics such as the mathematical aspects of formal methods, which are relatively bounded in the academic domain only, i.e., one would not find many practitioner-generated GL sources on this subject.


Table 4
Questions to decide whether to include the GL in software engineering reviews.

# Question Possible answers MLR-AutoTest

1 Is the subject “complex” and not solvable by considering only the formal literature? Yes/No Yes
2 Is there a lack of volume or quality of evidence, or a lack of consensus of outcome measurement in the formal literature? Yes/No Yes
3 Is the contextual information important to the subject under study? Yes/No Yes
4 Is it the goal to validate or corroborate scientific outcomes with practical experiences? Yes/No Yes
5 Is it the goal to challenge assumptions or falsify results from practice using academic research or vice versa? Yes/No Yes
6 Would a synthesis of insights and evidence from the industrial and academic community be useful to one or even both communities? Yes/No Yes
7 Is there a large volume of practitioner sources indicating high practitioner interest in a topic? Yes/No Yes

Note: One or more “yes” responses suggest inclusion of GL.
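
The tallying logic behind Table 4 can be made mechanical. The following is a minimal Python sketch of our own (not part of the original checklist): the question wordings are paraphrased, and the `assess_gl_inclusion` helper is a hypothetical name chosen for illustration.

```python
# Illustrative sketch (not from the original checklist): tallying the
# Table 4 decision aid. One or more "yes" answers suggest including
# grey literature (GL), i.e., conducting an MLR instead of an SLR.

QUESTIONS = [
    "Is the subject complex and not solvable by the formal literature alone?",
    "Is there a lack of volume/quality of evidence, or of consensus on outcome measurement?",
    "Is contextual information important to the subject under study?",
    "Is the goal to validate or corroborate scientific outcomes with practical experiences?",
    "Is the goal to challenge assumptions or falsify results from practice (or vice versa)?",
    "Would a synthesis of industrial and academic insights be useful to either community?",
    "Is there a large volume of practitioner sources indicating high practitioner interest?",
]

def assess_gl_inclusion(answers):
    """answers: list of booleans, one per Table 4 question.
    Returns (yes_count, recommendation)."""
    if len(answers) != len(QUESTIONS):
        raise ValueError("one answer per question expected")
    yes_count = sum(answers)
    # One or more "yes" responses favour an MLR; the larger the sum,
    # the stronger the need for including GL.
    recommendation = "MLR (include GL)" if yes_count >= 1 else "SLR (formal literature only)"
    return yes_count, recommendation

# MLR-AutoTest answered "yes" to all seven questions:
print(assess_gl_inclusion([True] * 7))  # (7, 'MLR (include GL)')
```

The sketch simply encodes the table’s decision rule; a review team would still assess each question jointly, as discussed below for MLR-AutoTest.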

Based on [81] and [17] and our experience, we present our synthesized decision aid in Table 4. Note that one or more “yes” responses suggest the inclusion of GL. Items 1 to 5 are adapted from prior sources [17, 81], while items 6 and 7 are added based on our own experience in conducting MLRs. For example, item #3 originally [17, 81] was: “Is the context important to the outcome or to implementing intervention?”. We have adapted it as shown in Table 4. It is increasingly discussed in the SE community that “contextual” information (e.g., what approach works for whom, where, when, and why?) [86-89] is critical for most SE research topics and shall be carefully considered. Since GL sometimes provides contextual information, including it and conducting an MLR would be important. It is true that question 3 would (almost) always be answered “yes” for most SE topics, but we would still like to keep it in the list of questions, just in case.

In Table 4, we also apply the checklist of [81] to our running MLR example (MLR-AutoTest) as an “a-posteriori” analysis. While some of the seven criteria in this list may seem subjective, we think that a team of researchers can assess each aspect objectively. For MLR-AutoTest, the sum of “Yes” answers is seven, as all items have “Yes” answers. The larger the sum, the higher the need for conducting an MLR on that topic.

• Guideline 3: The decision whether to include the GL in a review study and to conduct an MLR study (instead of a conventional SLR) should be made systematically using a well-defined set of criteria/questions (e.g., using the criteria in Table 4).

Table 5
The RQs raised in the example MLR (MLR-AutoTest).

MLR-AutoTest:
• RQ 1 - Mapping of sources by contribution and research-method types:
  ○ RQ 1.1 - How many studies present methods, techniques, tools, models, metrics, or processes for the when/what to automate questions?
  ○ RQ 1.2 - What type of research methods have been used in the studies in this area?
• RQ 2 - What factors are considered in the when/what questions?
• RQ 3 - What tools have been proposed to support the when/what questions?
• RQ 4 - What are attributes of those systems and projects?
  ○ RQ 4.1 - How many software systems or projects under analysis have been used in each source?
  ○ RQ 4.2 - What are the domains of the software systems or projects under analysis that have been studied in the sources (e.g., embedded, safety-critical, and control software)?
  ○ RQ 4.3 - What types of measurements, in the context of the software systems under analysis, have been provided to support the when/what questions?

4.3. Setting the goal and raising the research questions

The SLR guidelines of Kitchenham and Charters [22] state that specifying the RQs is the most important part of any systematic review. To make the connection among the review’s goal, the research (review) questions (RQs), and the metrics to collect in a more structured and traceable way, we have often made use of the Goal-Question-Metric (GQM) methodology [90] in our previous SM, SLR and MLR studies [34–36, 64–68]. In fact, the RQs drive the entire review by affecting the following aspects directly:

• The search process must identify primary studies that address the RQs
• The data extraction process must extract the data items needed to answer the RQs
• The data analysis (synthesis) phase must synthesize the data in such a way that the RQs are properly answered

Table 5 shows the RQs raised in the example MLR. MLR-AutoTest raised four RQs and several sub-RQs under some of the top-level RQs. This style was also applied in many other SM and SLR studies to group the RQs in categories.

RQs should also match specific needs of the target audience. For example, in the planning phase of the MLR-AutoTest, we paid close attention to ensuring the usefulness of that MLR for its intended audience (practitioners) by raising RQs which would benefit them, e.g., what factors should be considered for the when/what questions?

Another important criterion in raising RQs is to ensure that they are as objective and measurable as possible. Open-ended and exploratory RQs are acceptable, but RQs should not be fuzzy or vague.

• Guideline 4: Based on your research goal and target audience, define the research (or “review”) questions (RQs) in a way to (1) clearly relate to and systematically address the review goal, (2) match specific needs of the target audience, and (3) be as objective and measurable as possible.

Based on our own experience, it would also be beneficial to be explicit about the proper type of the raised RQs. Easterbrook et al. [91] provide a classification of RQ types that we used to classify a total of 267 RQs studied in a pool of 101 literature reviews in software testing [29]. The adopted RQ classification scheme [91] and example RQs from the reviewed studies in [29] are shown in Table 6. The findings of the study [29] showed that, in its pool of studies, descriptive-classification RQs were the most popular by a large margin. The study [29] further reported that there is a shortage or lack of RQs of the types towards the bottom of the classification scheme. For example, among all the studies, no single RQ of type Causality-Comparative Interaction or Design was raised.
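
The GQM-style traceability from goal to questions to metrics can be recorded as a simple data structure in a review protocol. The sketch below is our own illustration (the field names are ours, and the entries paraphrase two of MLR-AutoTest’s RQs rather than quoting [90]):

```python
# Minimal sketch of a Goal-Question-Metric (GQM) mapping for a review
# protocol. Field names and metric entries are illustrative, not
# prescribed by the GQM methodology itself.
gqm = {
    "goal": "Characterize when and what to automate in software testing",
    "questions": [
        {
            "id": "RQ1.2",
            "text": "What type of research methods have been used in this area?",
            "metrics": ["count of sources per research-method type"],
        },
        {
            "id": "RQ3",
            "text": "What tools have been proposed to support the when/what questions?",
            "metrics": ["list of tools", "count of sources per tool"],
        },
    ],
}

# Traceability check: every review question must carry at least one
# metric, so that data extraction can answer it.
assert all(q["metrics"] for q in gqm["questions"])
```

Keeping such a mapping explicit makes it easy to verify that each RQ is backed by extractable data items before the data extraction phase starts.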


Table 6
A classification scheme for RQs as proposed by [91] and example RQs from a tertiary study [29].

Exploratory — Existence: Does X exist?
• Do the approaches in the area of product lines testing define any measures to evaluate the testing activities? [S2]
• Is there any evidence regarding the scalability of the meta-heuristic in the area of search-based test-case generation?
• Can we identify and list currently available testing tools that can provide automation support during the unit-testing phase?

Exploratory — Description-Classification: What is X like?
• Which testing levels are supported by existing software-product-lines testing tools?
• What are the published model-based testing approaches?
• What are existing approaches that combine static and dynamic quality assurance techniques and how can they be classified?

Exploratory — Descriptive-Comparative: How does X differ from Y?
• Are there significant differences between regression test selection techniques that can be established using empirical evidence?

Base-rate — Frequency Distribution: How often does X occur?
• How many manual versus automated testing approaches have been proposed?
• In which sources and in which years were approaches regarding the combination of static and dynamic quality assurance techniques published?
• What are the most referenced studies (in the area of formal testing approaches for web services)?

Base-rate — Descriptive-Process: How does X normally work?
• How are software-product-lines testing tools evolving?
• How do the software-product-lines testing approaches deal with tests of non-functional requirements?
• When are the tests of service-oriented architectures performed?

Relationship — Relationship: Are X and Y related?
• Is it possible to prove the independence of various regression-test-prioritization techniques from their implementation languages?

Causality — Causality: Does X cause (or prevent) Y?
• How well is the random variation inherent in search-based software testing accounted for in the design of empirical studies?
• How effective are static analysis tools in detecting Java multi-threaded bugs and bug patterns?
• What evidence is there to confirm that the objectives and activities of the software testing process defined in DO-178B provide high quality standards in critical embedded systems?

Causality — Causality-Comparative: Does X cause more Y than does Z?
• Can a given regression-test selection technique be shown to be superior to another technique, based on empirical evidence?
• Are commercial static-analysis tools better than open-source static-analysis tools in detecting Java multi-threaded defects?
• Have different web-application-testing techniques been empirically compared with each other?

Causality — Causality-Comparative Interaction: Does X or Z cause more Y under one condition but not others?
• There were no such RQs in the pool of the tertiary study [29]

Design — Design: What’s an effective way to achieve X?
• There were no such RQs in the pool of the tertiary study [29]

For MLR-AutoTest, as shown in Table 5, all of its four RQs were of type “descriptive-classification”. If researchers are planning an MLR with the goal of finding out about “relationships”, “causality”, or “design” of certain phenomena, then they should raise the corresponding type of RQs. We would like to express the need for RQs of the types “relationships”, “causality”, or “design” in future MLR studies in SE. However, we are aware that the primary studies may not allow such questions to be answered.

• Guideline 5: Try adopting various RQ types (e.g., see those in Table 6) but be aware that primary studies may not allow all question types to be answered.

5. Conducting the review

Once an MLR is planned, it shall be conducted. This section is structured according to five phases of conducting an MLR:

• Search process (Section 5.1)
• Source selection (Section 5.2)
• Study quality assessment (Section 5.3)
• Data extraction (Section 5.4)
• Data synthesis (Section 5.5)

5.1. Search process

Searching either the formal literature or GL is typically done by means of defined search strings. Defining the search strings is an iterative process, where the initial exploratory searches reveal more relevant search


strings. Literature can also be searched via a technique called “snowballing” [92], where one follows citations either backward or forward from a set of seed papers. Here we highlight the differences between searching in the formal literature versus GL.

5.1.1. Where to search

Formally-published literature is searched via either broad-coverage abstract databases, e.g., Scopus, Web of Science, or Google Scholar, or via full-text databases with more limited coverage, e.g., IEEE Xplore, the ACM Digital Library, or ScienceDirect. The search strategy for GL is obviously different, since academic databases do not index GL. The classified MLR guideline papers (as discussed in Section 3.2) identified several strategies, as discussed next:

• General web search engine: For example, conventional web search engines such as Google were used in many GL review studies in management [79] and health sciences [78]. This advice is valid and easily applicable in the SE context as well.
• Specialized databases and websites: Many papers mentioned specialized databases and websites that would be different for each discipline. For example, in medical sciences, clinical trial registries are relevant (e.g., the International Standard Randomized Controlled Trials Number, www.isrctn.com). As another example, in management sciences, investment sites have been used (e.g., www.socialfunds.com). The GL database www.opengrey.eu provides broader coverage, but a search for “software engineering” resulted in only 4,115 hits as of this writing (March 21, 2017). For comparison, Scopus provides 120,056 hits for the same search. Relevant databases for SE would be non-peer-reviewed electronic archives (e.g., www.arxiv.org) and social question-answer websites (e.g., www.stackoverflow.com). In essence, the choice of websites that the review authors should focus on would depend on the particular search goals. For example, if one is interested in agile software development, a suitable website could be the Agile Alliance (www.agilealliance.org). A focused source for software testing would be the website of the International Software Testing Qualifications Board (ISTQB, www.istqb.org). Additionally, many annual surveys in SE exist which provide inputs to MLRs, e.g., the World Quality Report [93], the annual State of Agile report [94], worldwide software developer and ICT-skilled worker estimates by the International Data Corporation (IDC) (www.idc.com), national-level surveys such as the survey of software companies in Finland (“Ohjelmistoyrityskartoitus” in Finnish) [95], or the Turkish Software Quality report [96] by the Turkish Testing Board. However, figuring out suitable specialized databases is not trivial, which brings us to our next method (contacting individuals).
• Contacting individuals directly or via social media: Individuals can be contacted for multiple purposes, for example to provide their unpublished studies or to find out specialized databases where relevant information could be searched. [79] mentions contacting individuals via multiple methods: direct requests, general requests to organizations, requests to professional societies via mailing lists, and open requests for information on social media (Twitter or Facebook).
• Reference lists and backlinks: Studying reference lists, so-called snowballing [92], is done in white (formal) literature reviews as well as in GL reviews. However, in GL, and particularly GL on websites, formal citations are often missing. Therefore, features such as backlinks can be navigated either forward or backward. Backlinks can be extracted using various online backlink-checking tools, e.g., MAJESTIC (www.majestic.com).

Due to the lack of standardization of terminology in SE in general, and the issue that this problem may be even more significant for GL, the definition of key search terms in search engines and databases requires special attention. For MLRs, we therefore recommend performing an informal pre-search to find different synonyms for specific topics, as well as consulting bodies of knowledge such as the Software Engineering Body of Knowledge (SWEBOK) [97] for SE in general or, for instance, the standard glossary of terms used in software testing from the ISTQB [98] for testing in particular.

In MLR-AutoTest, the authors used Google search to search for GL and Google Scholar to search for the academic literature. The authors used four separate search strings. In addition, forward and backward snowballing [92] was applied to include as many relevant sources as possible.

Based on the MLR goal and RQs, researchers should choose the relevant GL types and/or GL producers (data sources) for the MLR, and such decisions should be made as explicit and justified as possible. Any mistake in missing certain GL types could lead to the final MLR output (report) missing important knowledge and evidence on the subject under study. For example, for MLR-AutoTest, we considered white papers, blog posts and even YouTube videos, and we found insightful GL resources of all these types.

• Guideline 6: Identify the relevant GL types and/or GL producers (data sources) for your review study early on.
• Guideline 7: General web search engines, specialized databases and websites, backlinks, and contacting individuals directly are ways to search for grey literature.

5.1.2. When to stop the search

In the formal literature, one first develops the search string and then uses this search string to collect all the relevant literature from an abstract or full-text database. This brings a clear stopping condition for the search process and allows moving to the study’s next phases. We refer to such a condition as the data exhaustion stopping criterion. However, the issue of when to stop the GL search is not that simple. Through our own experiences in MLR studies [43, 44, 54, 62], we have observed that different stopping criteria for GL searches are needed.

First, the stopping rules are intertwined with the goals and the types of evidence for including GL. If the evidence is mostly qualitative, one can reach theoretical saturation, i.e., a point where adding new sources does not increase the number of findings, even if one decides to stop the search before finding all the relevant sources.

Second, the stopping rules can be influenced by large volumes of data. For example, in MLR-AutoTest, we received 1,330,000 hits from Google. Obviously, in such cases, one needs to rely on the search engine’s page-rank algorithm [99] and choose to investigate only a suitable number of hits.

Third, stopping rules are influenced by the varying quality and availability of evidence (see the model for differentiating the GL in Fig. 1). For instance, in our review of gamification of software testing [54], the quality of evidence quickly declined when moving down the search results provided by the Google search engine. More and higher-quality evidence was available for our MLR-AutoTest. Thus, the availability of not only resources but also of evidence can determine whether the data exhaustion stopping rule is appropriate.

To summarize, we offer three possible stopping criteria for GL searches:

1 Theoretical saturation, i.e., when no new concepts emerge from the search results anymore
2 Effort bounded, i.e., only include the top N search engine hits
3 Evidence exhaustion, i.e., extract all the evidence

In MLR-AutoTest, the authors limited their search to the first 100 search hits and continued the search further if the hits on the last page still revealed additional relevant search results. This partially matches the “effort bounded” stopping rule, augmented with an exhaustive-like subjective stopping criterion.
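
The stopping criteria can also be combined in a single search loop. The sketch below is our own illustration; `fetch_hits` and `extract_concepts` are hypothetical callables standing in for a search-engine query (in rank order) and the reviewer’s concept coding of each hit.

```python
def search_gl(fetch_hits, extract_concepts, max_hits=100, saturation_window=10):
    """Illustrative GL search loop combining two of the stopping criteria:
    effort bounded (stop after the top max_hits results) and theoretical
    saturation (stop once saturation_window consecutive hits add no new
    concepts). Both callables are hypothetical and supplied by the reviewer:
    fetch_hits() yields hits in search-engine rank order, and
    extract_concepts(hit) returns the set of concepts coded from a hit."""
    seen_concepts, selected, no_new_streak = set(), [], 0
    for rank, hit in enumerate(fetch_hits(), start=1):
        if rank > max_hits:  # effort bounded: only the top N hits
            break
        new = extract_concepts(hit) - seen_concepts
        if new:
            seen_concepts |= new
            selected.append(hit)
            no_new_streak = 0
        else:
            no_new_streak += 1
            if no_new_streak >= saturation_window:  # theoretical saturation
                break
    return selected
```

In practice, the saturation judgment is made by the review team rather than computed, but making the rule explicit (as MLR-AutoTest did with its first-100-hits cutoff) keeps the search process auditable.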


• Guideline 8: When searching for GL on SE topics, three possible stopping criteria for GL searches are: (1) theoretical saturation, i.e., when no new concepts emerge from the search results; (2) effort bounded, i.e., only include the top N search engine hits; and (3) evidence exhaustion, i.e., extract all the evidence.

5.2. Source selection

Once the potentially relevant primary sources have been obtained, they need to be assessed for their actual relevance. The source selection process normally includes determining the selection criteria and performing the selection process. As GL is more diverse and less controlled than the formal literature, source selection can be particularly time-consuming and difficult. Therefore, the selection criteria should be more fine-grained and take criteria considering the source type, and specific quality assessment criteria for GL (see Table 7), into account. The source selection process itself is not specific to GL, but it is typically more time-consuming, as the selection criteria are more diverse and can be quite vague, and it furthermore requires a coordinated integration with the selection process for the formal literature.

5.2.1. Inclusion and exclusion criteria for source selection

Source selection criteria are intended to identify those sources that provide direct evidence about the MLR’s research (review) questions. As we discuss in Section 5.3, in practice, source selection (inclusion and exclusion criteria) overlaps and is sometimes even integrated with study quality assessment [43, 44, 54, 62]. Therefore, quality assessment criteria (see Table 7) should also be used for the purpose of source selection. For instance, the methodology, the date of publication, or the number of backlinks can be used as a selection criterion. The benefit of this approach is that the more sources one can exclude with certainty based on suitable selection criteria, the less effort is needed for study quality assessment, which requires the more time-consuming content

• Guideline 10: In the source selection process of an MLR, one should ensure a coordinated integration of the source selection processes for grey literature and formal literature.

5.3. Quality assessment of sources

Quality assessment of sources is about determining the extent to which a source is valid and free of bias. Differing from the formal literature, which normally follows a controlled review and publication process, the processes for GL are more diverse and less controlled. Consequently, the quality of GL is more diverse and often more laborious to assess. A variety of models for quality assessment of GL sources exists, and we found the ones in [17, 50] and [70] the most well-developed.

To present a synthesized approach for the quality assessment of GL sources, we used the suggestions provided in [17, 50, 51, 70] and complemented them with our own expertise from [43, 44, 54, 62] to develop the quality assessment checklist shown in Table 7. Each of our checklist criteria has strengths and weaknesses. Some are suitable only for specific types of GL sources; e.g., online comments only exist for source types open for comments, like blog posts, news articles or videos. A highly commented blog post may indicate popularity, but on the other hand, spam comments may bias the number of comments, thus invalidating the high popularity.

In principle, one can use any item of the quality assessment checklist for source selection as well. For instance, the methodology, the date of publication, or the number of backlinks can be used as a selection criterion. As stated before, the advantage of selection criteria is that the more sources one can exclude with certainty based on a set of criteria, the less effort is needed for the more time-consuming study quality assessment. Furthermore, when using the “research method” as the selection criterion in a specific source, e.g., survey, case study or experiment, it enables further assessment of the study quality (rigor). To investigate the quality (rigor) of specific study types in detail, check-
analysis of a source. lists tailored to specific study types are available. For instance, Host and
In MLR-AutoTest, sources were included if they are (a) in the area Runeson [100] presented a quality checklist for case studies, which can
of automated testing ROI calculations since they could be used as a de- also be utilized for case studies reported in formal literature.
cision support mechanism for balancing and deciding between manual
versus automated software testing, or (b) sources which provide deci-
sion support for the two questions “what to automate” and “when to
• Guideline 11: Apply and adapt the criteria authority of the
automate” . Sources that did not meet the above criteria were excluded.
producer, methodology, objectivity, date, novelty, impact, as
well as outlet control (e.g., see Table 7), for study quality as-
sessment of grey literature.
• Guideline 9: Combine inclusion and exclusion criteria for grey ○ Consider which criteria can already be applied for source
literature with quality assessment criteria (see Table 7). selection.
○ There is no one-size-fits-all quality model for all types of
GL. Thus, one should make suitable adjustments to the
5.2.2. Source selection process quality criteria checklist and consider reductions or exten-
The source selection process comprises the definition of inclusions sions if focusing on particular studies such as survey, case
and exclusion criteria, see previous section, as well as performing the study or experiment.
process itself. The source selection process for GL requires a coordinated
integration with the selection process for formal literature. Both formal
and GL outlets should be investigated adequately and effort required The decision whether to include a source or not can go beyond a bare
to analyze one source type shall not reduce the effort required for the binary decision (“yes” or “no” informally decided on guiding question),
other source type. Furthermore, source selection can overlap with the and can be based on a richer scoring scheme. For instance, da Silva
searching process when searching involves snowballing or contacting et al. [101] used a 3-point Likert scale (yes = 1, partly = 0.5, and no = 0)
the authors of relevant papers. When two or more researchers assess to assign scores to assessment questions. Based on these scoring results,
each paper, agreement between researchers is required and disagree- agreement between different persons can be measured and a threshold
ments must be discussed and resolved, e.g., by voting. for the inclusion of sources can be defined.
In MLR-AutoTest, the same criteria were applied to GL and to for- As discussed above, we have not seen any of the SE MLRs (even
mal literature. Furthermore, only inclusion criteria were provided, see our working example MLR-AutoTest) using such comprehensive qual-
Section 5.2.1 for the criteria, and sources not meeting them were ex- ity assessment models. Thus, to show an example of how GL quality
cluded. The final decision on inclusion or exclusion for unclear papers assessment models can be applied in practice, we apply our checklist,
was made in a voting between the two authors. The inclusions criteria Table 7, to five random GL sources from the pool of MLR-AutoTest (refer
were also applied in the performed forward and backward snowballing. to Table 8).
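To make the scoring idea concrete, the following is a minimal sketch of a Likert-based inclusion decision in the style of da Silva et al. [101]. The source names, the number of questions, the answers, and the threshold value are all hypothetical choices for illustration, not data from MLR-AutoTest.

```python
# Illustrative sketch of a Likert-based inclusion decision (yes = 1,
# partly = 0.5, no = 0), in the style of da Silva et al. [101].
# The sources and answers below are hypothetical.
LIKERT = {"yes": 1.0, "partly": 0.5, "no": 0.0}

def score_source(answers):
    """Sum the Likert scores of one source over all checklist questions."""
    return sum(LIKERT[a] for a in answers)

# One reviewer's answers to four quality assessment questions per source
assessments = {
    "blog-post-A":   ["yes", "partly", "no", "yes"],
    "white-paper-B": ["yes", "yes", "partly", "yes"],
    "video-C":       ["no", "partly", "no", "no"],
}

THRESHOLD = 2.0  # e.g., half of the maximum score (4 questions -> max 4.0)
included = {name: score_source(a) >= THRESHOLD
            for name, a in assessments.items()}
print(included)  # {'blog-post-A': True, 'white-paper-B': True, 'video-C': False}
```

With scores from two or more reviewers, the same per-question numbers can additionally feed an inter-rater agreement measure before the threshold is applied.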


Table 7
Quality assessment checklist of grey literature for software engineering.

Authority of the producer
• Is the publishing organization reputable? E.g., the Software Engineering Institute (SEI)
• Is an individual author associated with a reputable organization?
• Has the author published other work in the field?
• Does the author have expertise in the area? (e.g., job title principal software engineer)

Methodology
• Does the source have a clearly stated aim?
• Does the source have a stated methodology?
• Is the source supported by authoritative, contemporary references?
• Are any limits clearly stated?
• Does the work cover a specific question?
• Does the work refer to a particular population or case?

Objectivity
• Does the work seem to be balanced in presentation?
• Is the statement in the sources as objective as possible? Or, is the statement a subjective opinion?
• Is there vested interest? E.g., a tool comparison by authors that are working for a particular tool vendor
• Are the conclusions supported by the data?

Date
• Does the item have a clearly stated date?

Position w.r.t. related sources
• Have key related GL or formal sources been linked to / discussed?

Novelty
• Does it enrich or add something unique to the research?
• Does it strengthen or refute a current position?

Impact
• Normalize all the following impact metrics into a single aggregated impact metric (when data are available): number of citations, number of backlinks, number of social media shares (the so-called "alt-metrics"), number of comments posted for specific online entries like a blog post or a video, number of page or paper views

Outlet type
• 1st tier GL (measure = 1): High outlet control / high credibility: books, magazines, theses, government reports, white papers
• 2nd tier GL (measure = 0.5): Moderate outlet control / moderate credibility: annual reports, news articles, presentations, videos, Q/A sites (such as StackOverflow), Wiki articles
• 3rd tier GL (measure = 0): Low outlet control / low credibility: blogs, emails, tweets
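One way to operationalize the "Outlet type" criterion above is to encode the three tiers as a lookup table; the sketch below is illustrative, and the function name and lower-casing convention are our own additions.

```python
# The three GL outlet tiers of Table 7 as a lookup table mapping an outlet
# type to its control/credibility measure (1, 0.5, or 0).
OUTLET_TIER = {
    # 1st tier: high outlet control / high credibility
    "book": 1.0, "magazine": 1.0, "thesis": 1.0,
    "government report": 1.0, "white paper": 1.0,
    # 2nd tier: moderate outlet control / moderate credibility
    "annual report": 0.5, "news article": 0.5, "presentation": 0.5,
    "video": 0.5, "q/a site": 0.5, "wiki article": 0.5,
    # 3rd tier: low outlet control / low credibility
    "blog": 0.0, "email": 0.0, "tweet": 0.0,
}

def outlet_measure(outlet: str) -> float:
    """Return the outlet-control measure (0, 0.5, or 1) for an outlet type."""
    return OUTLET_TIER[outlet.lower()]

print(outlet_measure("Thesis"))  # 1.0
print(outlet_measure("video"))   # 0.5
print(outlet_measure("Blog"))    # 0.0
```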

Table 8
Five randomly-selected GL sources from the candidate pool of MLR-AutoTest.

GL1: B. Galen, "Automation Selection Criteria – Picking the "Right" Candidates," http://www.logigear.com/magazine/test-automation/automation-selection-criteria-%E2%80%93-picking-the-%E2%80%9Cright%E2%80%9D-candidates/, 2007, Last accessed: Nov. 2017.
GL2: B. L. Suer, "Choosing What To Automate," http://sqgne.org/presentations/2009–10/LeSuer-Jun-2010.pdf, 2010, Last accessed: Nov. 2017.
GL3: Galmont Consulting, "Determining What to Automate," http://galmont.com/wp-content/uploads/2013/11/Determining-What-to-Automate-2013_11.13.pdf, 2013, Last accessed: Nov. 2017.
GL4: R. Rice, "What to Automate First," https://www.youtube.com/watch?v=eo66ouKGyVk, 2014, Last accessed: Nov. 2017.
GL5: J. Andersson and K. Andersson, "Automated Software Testing in an Embedded Real-Time System," BSc Thesis, Linköping University, Sweden, 2007.


Table 9
Example application of the quality assessment checklist (in Table 7) to the five example GL sources (see Table 8) in the pool of MLR-AutoTest. Scores are listed in the order GL1, GL2, GL3, GL4, GL5.

Authority of the producer
• Is the publishing organization reputable? — 0, 0, 1, 0, 0. Only GL3 has no person as an author name; thus its authorship can only be attributed to an organization.
• Is an individual author associated with a reputable organization? — 1, 1, 0, 1, 1. All sources, except GL3, are written by individual authors.
• Has the author published other work in the field? — 1, 1, 1, 1, 0. GL5 is a BSc thesis, and a Google search for the authors' names does not return any other technical writing in this area.
• Does the author have expertise in the area? (e.g., job title principal software engineer) — 1, 1, 1, 1, 1. We considered the information in the web pages.

Methodology
• Does the source have a clearly stated aim? — 1, 1, 1, 1, 1. All five sources have a clearly stated aim.
• Does the source have a stated methodology? — 1, 1, 1, 1, 0. GL5 only has a section about the topic of when and what to automate (Sec. 6.1), and that section has no stated methodology.
• Is the source supported by authoritative, documented references? — 0, 1, 0, 0, 1. For GL, references are usually hyperlinks from the GL source (e.g., blog post).
• Are any limits clearly stated? — 0, 1, 1, 1, 0. GL1 and GL5 are rather short and do not discuss the limitations of the ideas.
• Does the work cover a specific question? — 1, 1, 1, 1, 1. All five sources answer the question on when and what to automate.
• Does the work refer to a particular population? — 1, 1, 1, 1, 1. All five sources refer to the population of test cases that should be automated.

Objectivity
• Does the work seem to be balanced in presentation? — 1, 1, 1, 1, 0. We checked whether the source also explicitly looked at the issue of whether a given test case should NOT be automated.
• Is the statement in the sources as objective as possible? Or, is the statement a subjective opinion? — 1, 1, 1, 1, 1. When enough evidence is provided in a source, it becomes less subjective. The original version of the question in Table 7 would get assigned '0' for the positive outcome, thus we negated it.
• Is there vested interest? E.g., a tool comparison by authors that are working for a particular tool vendor. Are the conclusions free of bias? — 1, 1, 1, 1, 1. The original version of the question in Table 7 would get assigned '0' for the positive outcome, thus we negated it.
• Are the conclusions supported by the data? — 1, 1, 1, 1, 1.

Date
• Does the item have a clearly stated date? — 0, 1, 1, 1, 1. GL1 is a website and does not have a clearly stated date related to its content.

Position w.r.t. related sources
• Have key related GL or formal sources been linked to / discussed? — 0, 1, 0, 0, 1. For GL sources, references (bibliography) are usually the hyperlinks from the source (e.g., blog post).

Novelty
• Does it enrich or add something unique to the research? — 1, 1, 1, 1, 0. GL5 does not add any novel contribution in this area. Its focus is on test automation, but not on the when/what questions.
• Does it strengthen or refute a current position? — 1, 1, 1, 1, 0. GL5 does not add any novel contribution in this area. Its focus is on test automation, but not on the when/what questions.

Impact
• Normalize all the following impact metrics into a single aggregated impact metric (when data are available): number of citations, number of backlinks, number of social media shares (the so-called "alt-metrics"), number of comments posted for specific online entries like a blog post or a video, number of page or paper views — 0, 1, 0, 0, 0. For backlink counts, we used this online tool: http://www.seoreviewtools.com/valuable-backlinks-checker/. GL3 became a broken link at the time of this analysis. Only GL2 had two backlinks; the others had 0. All other metric values were 0 for all five sources. For counts of social media shares, we used www.sharedcount.com.

Outlet type
• 1st tier GL (measure = 1): High outlet control / high credibility: books, magazines, theses, government reports, white papers. 2nd tier GL (measure = 0.5): Moderate outlet control / moderate credibility: annual reports, news articles, videos, Q/A sites (such as StackOverflow), Wiki articles. 3rd tier GL (measure = 0): Low outlet control / low credibility: blog posts, presentations, emails, tweets. — 0, 1, 1, 0.5, 1. GL1: blog post; GL2, GL3: white papers; GL4: YouTube video; GL5: thesis.

Sum (out of 20): 13, 19, 16, 15.5, 12 — summation of the values in the previous rows.
Normalized (0–1): 0.65, 0.95, 0.80, 0.78, 0.60 — the sums divided by 20 (the number of factors).
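The "Sum" and "Normalized" rows of Table 9 can be reproduced with a few lines of code; the per-question scores below are transcribed from the table (20 questions per source, in the order of the Table 7 checklist). The threshold value of 10 follows the example discussed in the text.

```python
# Reproducing the "Sum (out of 20)" and "Normalized (0-1)" rows of Table 9
# from the per-question scores (20 checklist questions per source).
scores = {
    "GL1": [0, 1, 1, 1,  1, 1, 0, 0, 1, 1,  1, 1, 1, 1,  0,  0,  1, 1,  0,  0],
    "GL2": [0, 1, 1, 1,  1, 1, 1, 1, 1, 1,  1, 1, 1, 1,  1,  1,  1, 1,  1,  1],
    "GL3": [1, 0, 1, 1,  1, 1, 0, 1, 1, 1,  1, 1, 1, 1,  1,  0,  1, 1,  0,  1],
    "GL4": [0, 1, 1, 1,  1, 1, 0, 1, 1, 1,  1, 1, 1, 1,  1,  0,  1, 1,  0,  0.5],
    "GL5": [0, 1, 0, 1,  1, 0, 1, 0, 1, 1,  0, 1, 1, 1,  1,  1,  0, 0,  0,  1],
}

sums = {src: sum(vals) for src, vals in scores.items()}
# The paper reports these normalized to two decimals: 0.65, 0.95, 0.80, 0.78, 0.60
normalized = {src: total / 20 for src, total in sums.items()}

print(sums)  # {'GL1': 13, 'GL2': 19, 'GL3': 16, 'GL4': 15.5, 'GL5': 12}

# The example inclusion rule from the text: keep sources scoring >= 10 (20/2)
included = [src for src, total in sums.items() if total >= 10]
print(included)  # all five sources pass this threshold
```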


Table 9 shows the results of applying the quality assessment checklist (in Table 7) to the five GL sources of Table 8. We provide notes in each row to justify each assessment. The sum of the assessments and the final normalized values (each between 0 and 1) show the quality assessment outcome for each GL source. Out of a total quality score of 20 (the total number of individual criteria in Table 7), the five GL sources GL1, …, GL5 received the scores of 13, 19, 16, 15.5 and 12, respectively. If MLR-AutoTest were to conduct this type of systematic quality assessment for all the GL sources, it could, for example, set the quality score of 10 (20/2) as the "threshold". Any source above that threshold would be included in the pool, and any source with a score below it would be excluded.

5.4. Data extraction

For the data extraction phase, we discuss the following aspects:

• Design of data extraction forms
• Data extraction procedures and logistics
• Possibility of automated data extraction and synthesis

5.4.1. Design of data extraction forms

Most design aspects of data extraction forms for MLRs are similar to those in the SLR guidelines of Kitchenham and Charters [22]. We discuss next a few important additional considerations in the context of MLRs, based on our experience in conducting MLRs.

Since many SLR and MLR studies have, as a part of them, an SM step first, it is important to ensure rigor in the data extraction forms. Also, in a recent paper [102] we presented a set of experience-based guidelines for effective and efficient data extraction which can apply to all four types of systematic reviews in SE (SM, SLR, MLR and GLR).

Based on the suggestions in [102], to facilitate the design of data extraction forms, we have developed spreadsheets with direct traceability to the MLR's research questions in mind. For example, Table 10 shows the systematic map that we developed and used in MLR-AutoTest. In this table, column 1 is the list of RQs, and column 2 is the corresponding attribute/aspect. Column 3 is the set of all possible values for the attribute. Finally, column 4 indicates for an attribute whether multiple selections can be applied. For example, for RQ 1 (research type), the corresponding value in the last column is 'S' (Single). It indicates that one source can be classified under only one research type. In contrast, for RQ 1 (contribution type), the corresponding value in the last column is 'M' (Multiple). It indicates that one study can contribute more than one type of option (e.g., method, tool, etc.).

According to our experience, and because GL sources have a less standardized structure than formal literature, it is also useful to provide "traceability" links (i.e., comments) in the data extraction form to indicate the position in the GL source where the extracted information was found. This issue is revisited in the next subsection (see Fig. 8).

5.4.2. Data extraction procedures and logistics

Many authors have reported logistical and operational challenges in conducting SLRs, e.g., [63]. We suggest next a summary of best practices based on our survey of MLR guidelines (4 out of the 24 MLR guideline papers, see Section 3.2, provided experience-based advice for data extraction), as well as our own experiences.

The authors of [25] offered a worksheet sample to extract data from GL sources, including fields such as: database, organization, website, pathfinder, guide to topic/subject, date searched, number of hits, and observations.

In [79], the authors emailed and even called individuals to gather more detailed GL data. For GL, often only a subset of the original important data is made available in the GL source (to keep it short and brief), and the detailed information is only available in "people's heads" [79].

The authors of [80] found cases where both the grey and peer-reviewed documents described the same study. In those cases, the team decided the primary document would be the peer-reviewed one, with the GL documents as supplemental.

The guidelines in [12] suggested maintaining "chains of evidence (records of sources consulted and inferences drawn)". This is similar to what we call "traceability" links in SE, as highlighted before and as also suggested in our previous data extraction guidelines [102]. For instance, Fig. 8 shows a snapshot of the online repository (a spreadsheet hosted on Google Docs) for MLR-AutoTest, in which the contribution facets are shown; a classification of 'Model' for the contribution facet of [Source 2] is shown, along with the summary text from the source acting as the corresponding traceability link.

When traceability information (verbatim text from inside the primary studies) is not included in the data extraction sheets, peer reviewing of the data by other team members, and also finding the exact locations in the primary studies where the data actually come from, become challenging. We have experienced such a challenge on many occasions in our past MLRs and SLRs.

Furthermore, the authors of [12] also argue that, because documents in the GL are often written for non-academic purposes and audiences, and because documents often address different aspects of a phenomenon with different degrees of thoroughness, it is essential that researchers record the purpose and specify the coverage of each GL document. These items are also in our quality assessment checklist in Table 7. The authors of [12] also wanted to record the extent to which the implicit assumptions or causal connections were supported by evidence in the GL documents. For that purpose, they developed matrices that enabled them to systematically track and see what every document in every category of the dataset said about causal connections in every theory of action.

Our own experiences from our past SLRs and MLRs have been as follows. To extract data, the studies in our pool were reviewed with the focus of each specific RQ. Researchers should also extract and record as much quantitative/qualitative data as needed to sufficiently address each RQ. If not, answering the RQ under study will be impossible based on the inadequate extracted data, and further efforts would be required to review, read and extract the missing data from the primary studies again. We have experienced such a challenge on many occasions in our past MLRs and SLRs. During the analysis, each involved researcher extracted and analyzed data from the share of sources assigned to her/him, and then each researcher peer reviewed the results of the other's analyses. In the case of disagreements, discussions were conducted. We utilized this process to ensure the quality and validity of our results.

• Guideline 12: During the data extraction, systematic procedures and logistics, e.g., explicit "traceability" links between the extracted data and primary sources, should be utilized. Also, researchers should extract and record as much quantitative/qualitative data as needed to sufficiently address each RQ, to be used in the synthesis phase.

5.5. Data synthesis

There are various data synthesis techniques, as reported in Kitchenham and Charters' guidelines for SLRs [22] and elsewhere. For instance, a guideline paper for synthesizing evidence in SE research [103] distinguishes descriptive (narrative) synthesis, quantitative synthesis, qualitative synthesis, thematic analysis, and meta-analysis.

Based on the type of RQs and the type of data (primary studies), the right data synthesis techniques should be selected and used. We have observed that practitioners provide mainly three types of data in their reports:

• First, qualitative and experience-based evidence is very common in the GL, as practitioners share their reflections on topics such as when to automate testing (MLR-AutoTest). This requires qualitative data analysis techniques. Their reflection may occasionally include
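As an illustration of the traceability idea discussed above, a data extraction record can pair every extracted value with the verbatim text it came from. The record structure and the example values below are our own illustrative sketch, not the actual MLR-AutoTest sheet layout.

```python
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    """One extracted data point with a traceability link to its source."""
    source_id: str   # identifier of the primary source, e.g., "GL4"
    rq: str          # research question the extracted value addresses
    attribute: str   # attribute/aspect from the systematic map
    value: str       # extracted classification or datum
    evidence: str    # verbatim text locating the value inside the source

# Hypothetical example record (the evidence field would hold a real quote)
record = ExtractionRecord(
    source_id="GL4",
    rq="RQ1",
    attribute="Contribution type",
    value="Heuristics/guideline",
    evidence="verbatim quote copied from the source goes here",
)
print(record.source_id, "->", record.value)  # GL4 -> Heuristics/guideline
```

Keeping the `evidence` field mandatory makes peer review of the extracted data much easier, since every classification can be checked against the quoted text.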


Fig. 8. A snapshot of the publicly-available spreadsheet hosted on Google Docs for MLR-AutoTest. The full final repository can be found in http://goo.gl/zwY1sj.

Table 10
Systematic map developed and used in MLR-AutoTest. Each entry lists the RQ, the attribute/aspect, its categories, and whether (M)ultiple selections or a (S)ingle selection applies.

• – / Source type: Formal literature, GL (S)
• RQ 1 / Contribution type: Heuristics/guideline, method (technique), tool, metric, model, process, empirical results only, other (M)
• RQ 1 / Research type: Solution proposal, validation research (weak empirical study), evaluation research (strong empirical study), experience studies, philosophical studies, opinion studies, other (S)
• RQ 2 / Factors considered for deciding when/what to automate: A list of pre-defined categories (maturity of SUT, stability of test cases, 'cost, benefit, ROI', and need for regression testing) and an 'other' category whose values were later qualitatively coded (by applying 'open' and 'axial' coding) (M)
• RQ 3 / Decision-support tools: Name and features (M)
• RQ 4 / Attributes of the software systems under test (SUT): Number of software systems (integer); SUT names (array of strings); domain (e.g., embedded systems); type of system(s) (academic experimental or simple code examples, real open-source, commercial); test automation cost/benefit measurements (numerical values) (M)
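A map like Table 10 can also be encoded directly in the extraction tooling, so that single-selection ('S') attributes reject multiple values at entry time. The sketch below is illustrative and uses only a subset of the attributes, with shortened category names.

```python
# Minimal sketch: validating extracted classifications against a systematic
# map in the style of Table 10 (subset; 'S' = single selection only,
# 'M' = multiple selections allowed).
SYSTEMATIC_MAP = {
    "Source type": ({"Formal literature", "GL"}, "S"),
    "Contribution type": ({"Heuristics/guideline", "Method", "Tool", "Metric",
                           "Model", "Process", "Empirical results only",
                           "Other"}, "M"),
    "Research type": ({"Solution proposal", "Validation research",
                       "Evaluation research", "Experience studies",
                       "Philosophical studies", "Opinion studies",
                       "Other"}, "S"),
}

def validate(attribute, values):
    """Check that `values` is a legal classification for `attribute`."""
    categories, mode = SYSTEMATIC_MAP[attribute]
    if not set(values) <= categories:
        return False            # unknown category name
    if mode == "S" and len(values) != 1:
        return False            # single-selection attribute got 0 or 2+ values
    return True

print(validate("Research type", ["Solution proposal"]))           # True
print(validate("Research type", ["Solution proposal", "Other"]))  # False ('S')
print(validate("Contribution type", ["Method", "Tool"]))          # True ('M')
```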

quantitative data; e.g., some presented quantitative data on ROI when automating software testing. However, we see that typically the quality and accuracy of the reporting do not allow conducting quantitative meta-analysis from practitioner GL reports.
• Second, quantitative evidence in the form of questionnaires is relatively common in GL, e.g., international surveys such as the State of Agile report by VersionOne, and the World Quality Survey by HP & Sogeti. More surveys can be found at national/regional levels, such as the survey of software companies in Finland [95], or the Turkish Software Quality report [96] by the Turkish Testing Board. If the same questionnaire is repeated in multiple or sequential surveys, this may allow meta-analysis. However, the GL surveys often fail to report standard deviations, which makes statistical meta-analysis impossible. Furthermore, we have seen virtually no controlled experiments or rigorously conducted quasi-experiments in GL; thus, we see limited possibilities in using meta-analytic procedures to combine experiment results from GL in SE.
• Third, using data from particular GL databases such as question/answer sites (such as the StackOverflow website) may allow both the use of quantitative and qualitative research methods. For example, a quantitative comparison of technology usage can be done on the StackOverflow website by extracting the numbers of questions and view counts, which can give an indication of the popularity of, for example, testing tools [104]. Qualitative analysis (such as open coding and grounded theory) [105] can also be conducted, and it can analyze, for example, the types of problems software engineers are facing with testing tools.

In MLR-AutoTest, the authors conducted qualitative coding [105] to derive the factors for deciding when to automate testing. We had some pre-defined factors (based on our past knowledge of the area), namely "regression testing", "maturity of SUT" and "ROI". During the process, we found out that our pre-determined list of factors was greatly limiting; thus, the rest of the factors emerged from the sources by conducting "open" and "axial" coding [105]. The creation of the new factors in the "coding" phase was an iterative and interactive process in which both researchers participated. Basically, we first collected all the factors affecting the when- and what-to-automate questions from the sources. Then we aimed at finding factors that would accurately represent all the ex-


tracted items but at the same time not be too detailed, so that it would still provide a useful overview; i.e., we chose the most suitable level of "abstraction", as recommended by qualitative data analysis guidelines [105].

Fig. 9. Phases of qualitative data extraction for factors considered for deciding when/what to automate (taken from the MLR-AutoTest paper).

Fig. 9 shows the phases of the qualitative data extraction for the factors, where the process started from a list of pre-defined categories (stability (maturity) of SUT, stability of test cases, 'cost, benefit, ROI', and need for regression testing) and a large number of raw factors phrased under the "Other" category. Through an iterative process, those phrases were qualitatively coded (by applying open and axial coding approaches [105]) to yield the final result, i.e., a set of cohesive, well-grouped factors.

For data synthesis from GL sources, utilizing argumentation theory can also be useful. As discussed in Section 2.3, there has been some recent work in SE on extracting SE practitioners' evidence and beliefs, e.g., the study by Rainer [51]. One of the interesting materials presented in [51] was a set of critical questions for using argumentation from expert opinions in GL, as follows:

1 Expertise: How credible is W ("Writer" of the GL source) as an expert source?
2 Field: Is W an expert in the field that P is in?
3 Opinion: What did W assert that implies P?
4 Trustworthiness: Is W personally reliable as a source?


5 Consistency: Is proposition P consistent with what other experts assert?
6 Backup evidence: Is W's assertion based on evidence?

In the above questions, P is a proposition in a GL source, and W refers to the writer of a GL source, e.g., an SE practitioner who writes her/his opinion in a blog. The questions were adopted from a book on argumentation theory [52]. While the above questions seem to be useful and rational, some of them appear slightly questionable; e.g., question #4 cannot be reliably assessed. Also, question #5 could be irrelevant, since experts should be allowed to hold opinions that are not consistent with each other.

Furthermore, researchers should carefully balance synthesis using sources with different levels of rigor. We can easily see that the rigor of a blog post is different from that of a research paper, and when synthesizing evidence from both types, their contributions to the combined evidence would ideally not carry the same "amount" (weight). Earlier, in Section 5.3, we discussed checklists for the quality assessment of formal and GL sources: the checklist presented by Host and Runeson [100] for case studies reported in formal literature, and the checklist in Table 7 for GL sources. By carefully combining the two chosen checklists, we may be able to objectively assign evidence (rigor) weights to different sources and thus synthesize evidence from all types in a more systematic manner.

• Guideline 13: A suitable data synthesis method should be selected. Many GL sources are suitable for qualitative coding and synthesis. Some GL sources allow the combination of survey results, but the lack of reporting rigor limits meta-analysis. Quantitative analysis is possible on GL databases such as StackOverflow. Also, argumentation theory can be beneficial for data synthesis from grey literature. Finally, the limitations of GL sources w.r.t. their depth of experimental evidence prevent meta-analysis.

6. Reporting the review

As shown in the MLR process (see Fig. 7), the last phase is reporting the review. Typical issues of the reporting phase of an MLR are similar to those in the SLR guidelines of Kitchenham and Charters [22]. From the experience of our past SLRs and MLRs, we have seen two important additional issues that we discuss next: (1) the reporting style for different audience types, and (2) ensuring usefulness to the target audience.

An MLR needs to provide benefits for both researchers and practitioners, since it contains a summary of both the state-of-the-art and the state-of-the-practice in a given area. Readers of MLR papers (both practitioners and researchers) are expected to benefit from the evidence-based overview and the index to the body of knowledge in the given subject area [62].

Furthermore, conveying and publishing the results of MLR and SLR studies to practitioners will "enhance the practical relevance of research" [106]. To enhance the practical relevance of research, [106] suggested to "convey relevant insights to practitioners", "present to practitioners" and "write for practitioners". We have followed that advice, and have reported the shortened (practitioner-oriented) versions of three of our MLR and SLR studies [9, 62, 107] in the IEEE Software magazine. When the reporting style of an SLR or an MLR is a "fit" for practitioners, they usually find such papers useful.

We have also found it useful to ask practitioners for feedback on the paper and the online spreadsheet of papers, and to let us know what they think about the potential benefits of that review paper. Their general opinion was that a review paper like [9] is a valuable resource and can actually serve as an index to the body of knowledge in this area.

Furthermore, the reporting styles for scientific journals and practitioners' magazines are quite different [106, 108]. While papers in scientific journals should provide all the details of the MLR (the planning and search process), papers in practitioner-oriented outlets (such as IEEE Software) should in general be shorter, succinct and "to the point". We have been aware of this issue and have followed slightly different reporting styles in our set of recent MLRs/SLRs. For example, we wrote [62, 107] and published them in the IEEE Software magazine targeting practitioners, while we wrote their extended scientific versions afterwards and published them in the Information and Software Technology journal [5, 43]. Table 11 shows our publication strategy for three sets of recent MLR/SLR studies, on three testing-related topics: test maturity and test process improvement, testing embedded software, and software test-code engineering.

With respect to MLR-AutoTest, we did not have an IEEE Software publication; however, our academic paper in Information and Software Technology includes a practitioner-oriented list of questions that can be used in deciding whether to automate testing or not in a particular context. An excerpt of that checklist, as taken from MLR-AutoTest, is shown in Table 12. Thus, a practitioner-oriented section can be included even if the authors do not wish to make two separate publications.

Another issue is choosing suitable and attractive titles for papers targeting practitioners [109]. In two of our review papers published in IEEE Software [9, 62], we entitled them starting with "What we know about …". This title pattern seems to be attractive to practitioners, and has also been used by other authors of IEEE Software papers [110–114]. Another IEEE Software paper [115] showed, by textual analysis, that practitioners usually prefer simpler phrases for the titles of their talks at conferences or their (grey literature) reports, compared with the more complex titles used in the formal literature.

A useful resource that the authors of an MLR/SLR should publish is a public online version of the repository of the review studies included in the MLR, which many researchers will find a useful add-on to the MLR itself. Ideally, the online repository comes with additional export, search and filter functions to support further processing of the data. Fig. 8 from Section 5.4 shows an example of an online paper repository implemented as a Google Docs spreadsheet, i.e., the list of included sources of MLR-AutoTest. Such repositories provide various benefits, e.g., transparency on the full dataset, replication and repeatability of the review, support when updating the study in the future by the same or a different team of researchers, and easy access to the full "index" of sources.

One of the earliest online repositories serving as a companion to its corresponding survey papers [116, 117], and showing the usefulness of such online repositories, is the one on the subject of Search-Based Software Engineering (SBSE) [118], which was first published in 2009 and has been actively maintained since then.

• Guideline 14: The writing style of an MLR paper should match its target audience, i.e., researchers and/or practitioners.
○ If targeting practitioners, a plain and to-the-point writing
make the results even more communicative. We recommend including style with clear suggestion and without details about the
in review papers a section about the implications of the results, as we research methodology should be chosen. Asking feedback
from practitioners is highly recommended.
reported in [43, 44], and if possible, a section on the benefits of the
○ If the MLR paper targets researchers, it should be trans-
review. For example, in our SLR on testing embedded software [9], we
parent by covering the underlying research methodology
included a section on “Benefits of this review”. To further assess the ben- as well as an online repository and highlight the research
efits of our review study in [9], we asked several active test engineers findings while providing directions to future work.
in the Turkish embedded software industry to review the review paper
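The "export, search and filter" functions recommended above for online source repositories can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming a hypothetical CSV export of an MLR's included-source list; the column names and rows are invented for illustration and are not the actual MLR-AutoTest schema.

```python
import csv
import io

# Hypothetical CSV export of an MLR's included-sources repository.
# Columns and rows are invented for illustration only.
CSV_EXPORT = """id,title,source_type,year,url
S1,Test automation ROI,blog post,2015,http://example.com/roi
S2,When to automate,journal paper,2016,http://example.com/when
S3,GUI testing tips,white paper,2014,http://example.com/gui
S4,Test maintenance costs,blog post,2017,http://example.com/maint
"""

def filter_sources(csv_text, source_type=None, min_year=None):
    """Return the repository rows matching a source type and minimum year."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if source_type and row["source_type"] != source_type:
            continue
        if min_year and int(row["year"]) < min_year:
            continue
        result.append(row)
    return result

# Example query: all grey-literature blog posts published in 2015 or later.
recent_blogs = filter_sources(CSV_EXPORT, source_type="blog post", min_year=2015)
print([r["id"] for r in recent_blogs])  # ['S1', 'S4']
```

A spreadsheet-based repository (as in Fig. 8) offers the same filtering interactively; a scripted export like this additionally supports replication and future updates of the review by other teams.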

Table 11
Publication strategy of three sets of MLR/SLR studies.

MLR/SLR topic | Paper title | Ref. | Journal/magazine | Main audience
Test maturity and test process improvement | What we know about software test maturity and test process improvement | [62] | IEEE Software | Practitioners
Test maturity and test process improvement | Software test maturity assessment and test process improvement: a multivocal literature review | [43] | Information and Software Technology | Researchers
Testing embedded software | What we know about testing embedded software | [9] | IEEE Software | Practitioners
Testing embedded software | Testing embedded software: a systematic literature review | – | Submitted to a journal | Researchers
Software test-code engineering | Developing, verifying and maintaining high-quality automated test scripts | [107] | IEEE Software | Practitioners
Software test-code engineering | Software test-code engineering: a systematic mapping | [5] | Information and Software Technology | Researchers

Table 12
Excerpt of a practitioner-oriented checklist of whether to automate testing or not (taken from MLR-AutoTest).

Category | Area (weight, i.e., num. of sources) | Situation | +/-
SUT-related factors | Maturity of SUT (39) | SUT or the targeted components will experience major modifications in the future. | –
SUT-related factors | Maturity of SUT (39) | The interface through which the tests are conducted is unlikely to change. | +
SUT-related factors | Other SUT aspects (6) | SUT is an application with a long life cycle. | +
SUT-related factors | Other SUT aspects (6) | SUT is a generic system, i.e., not a tailor-made or heavily customized system. | +
SUT-related factors | Other SUT aspects (6) | SUT is tightly integrated into other products, i.e., not independent. | –
SUT-related factors | Other SUT aspects (6) | SUT is complex. | –
SUT-related factors | Other SUT aspects (6) | SUT is mission critical. | +
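A checklist like Table 12 can be turned into a simple tally: each situation that holds for a given SUT contributes the weight of its area (the number of supporting sources), positively or negatively, and the sign of the total hints at whether automation is worthwhile. The sketch below is a hypothetical operationalization of such a checklist, not a scoring model prescribed by MLR-AutoTest; the weights are the two area weights from the excerpt.

```python
# Hypothetical tally over a Table-12-style checklist: each factor carries the
# weight of its area (number of supporting sources) and a sign (+ favors
# automation, - disfavors it). This scoring scheme is an illustration only,
# not a model prescribed by MLR-AutoTest.
FACTORS = {
    "major_modifications_expected": (39, -1),  # Maturity of SUT (39 sources)
    "stable_test_interface":        (39, +1),
    "long_life_cycle":              (6, +1),   # Other SUT aspects (6 sources)
    "generic_system":               (6, +1),
    "tightly_integrated":           (6, -1),
    "complex_sut":                  (6, -1),
    "mission_critical":             (6, +1),
}

def automation_score(observed):
    """Sum the signed weights of the checklist factors observed for a SUT."""
    return sum(w * sign for name, (w, sign) in FACTORS.items() if name in observed)

# Example: a stable test interface outweighs two negative factors here.
score = automation_score({"stable_test_interface", "complex_sut", "tightly_integrated"})
print(score)  # 39 - 6 - 6 = 27
```

A positive total would suggest that test automation is likely beneficial for the SUT at hand; in practice, practitioners would of course weigh the individual situations rather than rely on the raw sum alone.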

7. Conclusions and future works

We think that software engineering research can improve its relevance by accepting and analyzing input from practitioner literature. Currently, books and consultancy reports are considered valid evidence while relevant input found in blogs and in social media discussions is often ignored. Furthermore, practitioner interviews done and reported by researchers have, for long, been considered as academic evidence in empirical software engineering, while grey literature produced by the very same individuals may have been ignored as unscientific. This paper wants to lift such a double standard by allowing rigorously conducted analysis of practitioners' writings to enter the scientific literature.

As existing guidelines for performing systematic literature studies in SE provide limited coverage for including practitioners' sources and conducting multivocal literature reviews, this paper filled this gap by developing and presenting a set of experience-based guidelines for planning, conducting and presenting MLR studies in SE.

To develop the MLR guidelines, we benefited from three inputs: (1) existing SLR and SM guidelines in SE, (2) a survey of MLR guidelines and experience papers in other fields, and (3) our own experiences in conducting several MLRs in SE. We took the popular SLR guidelines of Kitchenham and Charters as the baseline and extended them to the conduct of MLR studies. The presented guidelines covered all phases of MLRs in SE, from the planning phase, to conducting the review, to reporting the review. In particular, we believe that incorporating and adopting a vast set of experience-based recommendations from MLR guidelines and experience papers in other fields enabled us to propose a set of guidelines with solid foundations.

We should also note the limitations of the guidelines that we have developed and presented in this paper: (1) although they are based on our previous experience and the guidelines in other fields, they still need to be empirically evaluated in future studies; and (2) similar to any set of guidelines, our guidelines are based on our experience and on a synthesis of other studies, and thus personal researcher bias could be involved, but we have mitigated such bias to the best of our ability.

We recommend that researchers apply the guidelines when conducting MLR studies, and then share their lessons learned and experiences with the community. Guidelines like the ones reported in this paper are living entities, and have to be assessed and improved in several iterations.

We suggest future works in the following directions. First, based on guidelines such as [119] in the educational sciences field, we suggest developing specific guidelines and considerations for different types of reviews (systematic review, best-evidence synthesis, narrative review) and for different objectives (integrative research review, theoretical review, methodological review, thematic review, state-of-the-art review, historical review, comparison of two perspectives review, and review complement). Second, improving the guidelines based on experiences in applying them. Third, refining the guidelines for specific types of grey literature sources, such as blog articles, or for specific SE areas.

Acknowledgments

The third author has been partially supported by the Academy of Finland Grant no 298020 (Auto-Time) and by TEKES Grant no 3192/31/2017 (ITEA3: 16032 TESTOMAT project).

References

[1] B.A. Kitchenham, T. Dybå, M. Jørgensen, Evidence-based software engineering, in: Proceedings of International Conference on Software Engineering, IEEE Computer Society, 2004, pp. 273–281.
[2] B. Kitchenham, O.P. Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering – a systematic literature review, Inf. Softw. Technol. 51 (1) (2009) 7–15.
[3] K. Petersen, S. Vakkalanka, L. Kuzniarz, Guidelines for conducting systematic mapping studies in software engineering: an update, Inf. Softw. Technol. 64 (2015) 1–18.
[4] P.-M. Daigneault, S. Jacob, M. Ouimet, Using systematic review methods within a Ph.D. dissertation in political science: challenges and lessons learned from practice, Int. J. Soc. Res. Methodol. 17 (3) (2014) 267–283.
[5] V. Garousi, Y. Amannejad, A. Betin-Can, Software test-code engineering: a systematic mapping, Inf. Softw. Technol. 58 (2015) 123–147.
[6] C. Olsson, A. Ringnér, G. Borglin, Including systematic reviews in PhD programmes and candidatures in nursing – 'Hobson's choice'? Nurse Educ. Pract. 14 (2) (2014) 102–105.
[7] S. Gopalakrishnan, P. Ganeshkumar, Systematic reviews and meta-analysis: understanding the best evidence in primary healthcare, J. Family Med. Prim. Care 2 (1) (2013) 9–14.
[8] R.W. Schlosser, The role of systematic reviews in evidence-based practice, research, and development, FOCUS 15 (2006).
[9] V. Garousi, M. Felderer, Ç.M. Karapıçak, U. Yılmaz, What we know about testing embedded software, IEEE Softw. (2018) In press.
[10] R.L. Glass, T. DeMarco, Software Creativity 2.0, developer.* Books, 2006.


[11] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, Springer Publishing Company, Incorporated, 2012, p. 259.
[12] R.T. Ogawa, B. Malen, Towards rigor in reviews of multivocal literatures: applying the exploratory case study method, Rev. Educ. Res. 61 (3) (1991) 265–286.
[13] M.Q. Patton, Towards utility in reviews of multivocal literatures, Rev. Educ. Res. 61 (3) (1991) 287–292.
[14] V. Alberani, P. De Castro Pietrangeli, A.M. Mazza, The use of grey literature in health sciences: a preliminary survey, Bull. Med. Libr. Assoc. 78 (4) (1990) 358–363.
[15] A.A. Saleh, M.A. Ratajeski, M. Bertolet, Grey literature searching for health sciences systematic reviews: a prospective study of time spent and resources utilized, Evid. Based Libr. Inf. Pract. 9 (3) (2014) 28–50.
[16] S. Hopewell, S. McDonald, M.J. Clarke, M. Egger, Grey literature in meta-analyses of randomized trials of health care interventions, Cochrane Database Syst. Rev. (2) (2007).
[17] R.J. Adams, P. Smart, A.S. Huff, Shades of grey: guidelines for working with the grey literature in systematic reviews for management and organizational studies, Int. J. Manag. Rev. (2016).
[18] E. Tom, A. Aurum, R. Vidgen, An exploration of technical debt, J. Syst. Softw. 86 (6) (2013) 1498–1516.
[19] J.P. Higgins, S. Green, Including unpublished studies in systematic reviews, in: Cochrane Handbook for Systematic Reviews of Interventions, http://handbook.cochrane.org/chapter_10/10_3_2_including_unpublished_studies_in_systematic_reviews.htm. Last accessed: Dec. 2017.
[20] L. McAuley, B. Pham, P. Tugwell, D. Moher, Does the inclusion of grey literature influence estimates of intervention effectiveness reported in meta-analyses? The Lancet 356 (9237) (2000) 1228–1231.
[21] K. Petersen, R. Feldt, S. Mujtaba, M. Mattsson, Systematic mapping studies in software engineering, in: International Conference on Evaluation and Assessment in Software Engineering (EASE), 2008.
[22] B. Kitchenham, S. Charters, Guidelines for performing systematic literature reviews in software engineering, EBSE Technical Report EBSE-2007-01, 2007.
[23] J. Schöpfel, D.J. Farace, Grey literature, in: M.J. Bates, M.N. Maack (Eds.), Encyclopedia of Library and Information Sciences, 3rd ed., CRC Press, 2010, pp. 2029–2039.
[24] C. Lefebvre, E. Manheimer, J. Glanville, Searching for studies, in: J.P.T. Higgins, S. Green (Eds.), Cochrane Handbook for Systematic Reviews of Interventions, Wiley-Blackwell, Chichester, 2008.
[25] D. Giustini, Finding the hard to finds: searching for grey (gray) literature, UBC, hlwiki.slais.ubc.ca/images/5/5b/Greylit_manual_2012.doc. Last accessed: Dec. 2017.
[26] B.M. Mathers, et al., Global epidemiology of injecting drug use and HIV among people who inject drugs: a systematic review, The Lancet 372 (9651) (2008) 1733–1745.
[27] B. Kitchenham, D. Budgen, P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, CRC Press, 2015.
[28] J.A.M. Santos, A.R. Santos, M.G. De Mendonça, Investigating bias in the search phase of software engineering secondary studies, in: Ibero-American Conference on Software Engineering, 2015, pp. 488–501.
[29] V. Garousi, M.V. Mäntylä, A systematic literature review of literature reviews in software testing, Inf. Softw. Technol. 80 (2016) 195–216.
[30] V. Garousi, M.V. Mäntylä, Citations, research topics and active countries in software engineering: a bibliometrics study, Comput. Sci. Rev. 19 (2016) 56–77.
[31] V. Garousi, A bibliometric analysis of the Turkish software engineering research community, Scientometrics 105 (1) (2015) 23–49.
[32] V. Garousi, J.M. Fernandes, Quantity versus impact of software engineering papers: a quantitative study, Scientometrics 112 (2) (2017) 963–1006.
[33] V. Garousi, M. Felderer, Ç.M. Karapıçak, U. Yılmaz, What we know about testing embedded software, IEEE Softw. (2017) In press.
[34] V. Garousi, A. Mesbah, A. Betin-Can, S. Mirshokraie, A systematic mapping study of web application testing, Inf. Softw. Technol. 55 (8) (2013) 1374–1396.
[35] I. Banerjee, B. Nguyen, V. Garousi, A. Memon, Graphical User Interface (GUI) testing: systematic mapping and repository, Inf. Softw. Technol. 55 (10) (2013) 1679–1694.
[36] V. Garousi, S. Shahnewaz, D. Krishnamurthy, UML-driven software performance engineering: a systematic mapping and trend analysis, in: V.G. Díaz, J.M.C. Lovelle, B.C.P. García-Bustelo, O.S. Martínez (Eds.), Progressions and Innovations in Model-Driven Software Engineering, IGI Global, 2013.
[37] C. Escoffery, K.C. Rodgers, M.C. Kegler, M. Ayala, E. Pinsker, R. Haardörfer, A grey literature review of special events for promoting cancer screenings, BMC Cancer 14 (1) (2014) 454.
[38] M. Favin, R. Steinglass, R. Fields, K. Banerjee, M. Sawhney, Why children are not vaccinated: a review of the grey literature, Int. Health 4 (4) (2012) 229–238.
[39] J. Kennell, N. MacLeod, A grey literature review of the Cultural Olympiad, Cultural Trends 18 (1) (2009) 83–88.
[40] G. Tomka, Reconceptualizing cultural participation in Europe: grey literature review, Cultural Trends 22 (3-4) (2013) 259–264.
[41] V. Garousi, M. Felderer, M.V. Mäntylä, The need for multivocal literature reviews in software engineering: complementing systematic literature reviews with grey literature, in: International Conference on Evaluation and Assessment in Software Engineering, Limerick, Ireland, 2016, pp. 171–176.
[42] P. Raulamo-Jurvanen, M.V. Mäntylä, V. Garousi, Choosing the right test automation tool: a grey literature review, in: International Conference on Evaluation and Assessment in Software Engineering, Karlskrona, Sweden, 2017, pp. 21–30.
[43] V. Garousi, M. Felderer, T. Hacaloğlu, Software test maturity assessment and test process improvement: a multivocal literature review, Inf. Softw. Technol. 85 (2017) 16–42.
[44] V. Garousi, M.V. Mäntylä, When and what to automate in software testing? A multivocal literature review, Inf. Softw. Technol. 76 (2016) 92–117.
[45] R.F. Elmore, Comment on "Towards rigor in reviews of multivocal literatures: applying the exploratory case study method", Rev. Educ. Res. 61 (3) (1991) 293–297.
[46] V. Garousi, E. Yıldırım, Introducing automated GUI testing and observing its benefits: an industrial case study in the context of law-practice management software, in: Proceedings of IEEE Workshop on NEXt level of Test Automation (NEXTA), 2018, pp. 138–145.
[47] Z. Sahaf, V. Garousi, D. Pfahl, R. Irving, Y. Amannejad, When to automate software testing? Decision support based on system dynamics – an industrial case study, in: Proceedings of International Conference on Software and Systems Process, 2014, pp. 149–158.
[48] Y. Amannejad, V. Garousi, R. Irving, Z. Sahaf, A search-based approach for cost-effective software test automation decision support and an industrial case study, in: Proceedings of International Workshop on Regression Testing, co-located with the IEEE International Conference on Software Testing, Verification, and Validation, 2014, pp. 302–311.
[49] V. Garousi, D. Pfahl, When to automate software testing? A decision-support approach based on process simulation, J. Softw. Evol. Process 28 (4) (2016) 272–285.
[50] A. Yasin, M.I. Hasnain, Master Thesis, Blekinge Institute of Technology, Sweden, 2012.
[51] A. Rainer, Using argumentation theory to analyse software practitioners' defeasible evidence, inference and belief, Inf. Softw. Technol. 87 (2017) 62–80.
[52] D. Walton, C. Reed, F. Macagno, Argumentation Schemes, Cambridge University Press, 2008.
[53] I. Kulesovs, iOS applications testing, in: Proceedings of the International Scientific and Practical Conference, 2015, pp. 138–150.
[54] M.V. Mäntylä, K. Smolander, Gamification of software testing – an MLR, in: International Conference on Product-Focused Software Process Improvement, 2016, pp. 611–614.
[55] L.E. Lwakatare, P. Kuvaja, M. Oivo, Relationship of DevOps to agile, lean and continuous deployment: a multivocal literature review study, in: Proceedings of International Conference on Product-Focused Software Process Improvement, Springer International Publishing, 2016, pp. 399–415.
[56] B.B.N. de França, J. Helvio Jeronimo, G.H. Travassos, Characterizing DevOps by hearing multiple voices, in: Proceedings of the Brazilian Symposium on Software Engineering, 2016, pp. 53–62.
[57] C. Sauerwein, C. Sillaber, A. Mussmann, R. Breu, Threat intelligence sharing platforms: an exploratory study of software vendors and research perspectives, in: International Conference on Wirtschaftsinformatik, 2017.
[58] A. Calderón, M. Ruiz, R.V. O'Connor, A multivocal literature review on serious games for software process standards education, Comput. Stand. Inter. 57 (2018) 36–48.
[59] V. Garousi, B. Küçük, Smells in software test code: a survey of knowledge in industry and academia, J. Syst. Softw. 138 (2018) 52–81.
[60] S.S. Bajwa, X. Wang, A.N. Duc, P. Abrahamsson, How do software startups pivot? Empirical results from a multiple case study, in: Proceedings of International Conference on Software Business, 2016, pp. 169–176.
[61] M. Sulayman, E. Mendes, A systematic literature review of software process improvement in small and medium web companies, in: D. Ślęzak, T.-h. Kim, A. Kiumi, T. Jiang, J. Verner, S. Abrahão (Eds.), Advances in Software Engineering, Communications in Computer and Information Science, vol. 59, Springer Berlin Heidelberg, 2009, pp. 1–8.
[62] V. Garousi, M. Felderer, T. Hacaloğlu, What we know about software test maturity and test process improvement, IEEE Softw. 35 (1) (2018) 84–92.
[63] M. Kuhrmann, D.M. Fernández, M. Daneva, On the pragmatic design of literature studies in software engineering: an experience-based guideline, Empir. Softw. Eng. 22 (6) (2017) 2852–2891.
[64] S. Doğan, A. Betin-Can, V. Garousi, Web application testing: a systematic literature review, J. Syst. Softw. 91 (2014) 174–201.
[65] V.G. Yusifoğlu, Y. Amannejad, A. Betin-Can, Software test-code engineering: a systematic mapping, Inf. Softw. Technol. (2014).
[66] R. Farhoodi, V. Garousi, D. Pfahl, J.P. Sillito, Development of scientific software: a systematic mapping, bibliometrics study and a paper repository, Int'l J. Softw. Eng. Knowl. Eng. 23 (04) (2013) 463–506.
[67] V. Garousi, Classification and trend analysis of UML books (1997-2009), Softw. Syst. Modeling (SoSyM) (2011).
[68] J. Zhi, V.G. Yusifoğlu, B. Sun, G. Garousi, S. Shahnewaz, G. Ruhe, Cost, benefits and quality of software development documentation: a systematic mapping, J. Syst. Softw. (2014) In press.
[69] R.T. Ogawa, B. Malen, A response to commentaries on "Towards rigor in reviews of multivocal literatures...", Rev. Educ. Res. 61 (3) (1991) 307–313.
[70] J. Tyndall, AACODS checklist, Arch. Flinders Acad. Commons (2017). https://dspace.flinders.edu.au/jspui/bitstream/2328/3326/4/AACODS_Checklist.pdf. Last accessed: Dec. 2017.
[71] R.K. Yin, Advancing rigorous methodologies: a review of "Towards rigor in reviews of multivocal literatures...", Rev. Educ. Res. 61 (3) (1991) 299–305.


[72] K. Godin, J. Stapleton, S.I. Kirkpatrick, R.M. Hanning, S.T. Leatherdale, Applying systematic review search methods to the grey literature: a case study examining guidelines for school-based breakfast programs in Canada, Syst. Rev. 4 (2015) 138–148.
[73] S.P. Bellefontaine, C.M. Lee, Between black and white: examining grey literature in meta-analyses of psychological research, J. Child Family Stud. 23 (8) (2014) 1378–1388.
[74] M. Banks, Blog posts and tweets: the next frontier for grey literature, in: D. Farace, J. Schöpfel (Eds.), Grey Literature in Library and Information Studies, Walter de Gruyter, 2010.
[75] S. Hopewell, M. Clarke, S. Mallett, Grey literature and systematic reviews, in: H.R. Rothstein, A.J. Sutton, M. Borenstein (Eds.), Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, John Wiley & Sons, 2006.
[76] V.S. Conn, J.C. Valentine, H.M. Cooper, M.J. Rantz, Grey literature in meta-analyses, Nurs. Res. 52 (4) (2003) 256–261.
[77] Penn Libraries, Grey literature in the health sciences: evaluating it, http://guides.library.upenn.edu/c.php?g=475317&p=3254241. Last accessed: Dec. 2017.
[78] Y. McGrath, H. Sumnall, K. Edmonds, J. McVeigh, M. Bellis, Review of grey literature on drug prevention among young people, Natl. Inst. Health Clin. Excell. (2006).
[79] J. Adams, et al., Searching and synthesising 'grey literature' and 'grey information' in public health: critical reflections on three case studies, Syst. Rev. 5 (1) (2016) 164.
[80] Q. Mahood, D. Van Eerd, E. Irvin, Searching for grey literature for systematic reviews: challenges and benefits, Res. Synth. Methods 5 (3) (2014) 221–234.
[81] K.M. Benzies, S. Premji, K.A. Hayden, K. Serrett, State-of-the-evidence reviews: advantages and challenges of including grey literature, Worldviews Evid. Based Nurs. 3 (2) (2006) 55–61.
[82] A. Lawrence, J. Houghton, J. Thomas, P. Weldon, Where is the evidence: realising the value of grey literature for public policy and practice, Swinburne Inst. Soc. Res. (2014). http://apo.org.au/node/42299.
[83] V. Garousi, M. Felderer, M.V. Mäntylä, Mini-SLR on MLR experience and guideline papers, https://goo.gl/b2u1E5. Last accessed: Dec. 2017.
[84] J. Gargani, S.I. Donaldson, What works for whom, where, why, for what, and when? Using evaluation evidence to take action in local contexts, New Dir. Eval. 2011 (130) (2011) 17–30.
[85] K. Petersen, C. Wohlin, Context in industrial software engineering research, in: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, 2009.
[86] L. Briand, D. Bianculli, S. Nejati, F. Pastore, M. Sabetzadeh, The case for context-driven software engineering research: generalizability is overrated, IEEE Softw. 34 (5) (2017) 72–75.
[87] K. Petersen, C. Wohlin, Context in industrial software engineering research, in: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, IEEE Computer Society, 2009, pp. 401–404.
[88] T. Dybå, D.I. Sjøberg, D.S. Cruzes, What works for whom, where, when, and why? On the role of context in empirical software engineering, in: Empirical Software Engineering and Measurement (ESEM), 2012 ACM-IEEE International Symposium on, IEEE, 2012, pp. 19–28.
[89] C. Potts, Software-engineering research revisited, IEEE Softw. 10 (5) (1993) 19–28.
[90] V.R. Basili, Software modeling and measurement: the Goal/Question/Metric paradigm, Technical Report, University of Maryland at College Park, 1992.
[91] S. Easterbrook, J. Singer, M.-A. Storey, D. Damian, Selecting empirical methods for software engineering research, in: F. Shull, J. Singer, D.I.K. Sjøberg (Eds.), Guide to Advanced Empirical Software Engineering, Springer London, London, 2008, pp. 285–311.
[92] C. Wohlin, Guidelines for snowballing in systematic literature studies and a replication in software engineering, in: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, London, England, United Kingdom, 2014.
[93] Capgemini Corp., World quality report 2016–17, https://www.capgemini.com/thought-leadership/world-quality-report-2016-17. Last accessed: Dec. 2017.
[94] VersionOne, State of Agile report, http://stateofagile.versionone.com. Last accessed: Dec. 2017.
[95] Survey of Software Companies in Finland (Ohjelmistoyrityskartoitus), http://www.softwareindustrysurvey.fi. Last accessed: May 2017.
[96] Turkish Testing Board, Turkish quality report, http://www.turkishtestingboard.org/en/turkey-software-quality-report/. Last accessed: May 2017.
[97] P. Bourque, R.E. Fairley, Guide to the software engineering body of knowledge (SWEBOK), version 3.0, IEEE Comput. Soc. Press (2014).
[98] ISTQB, Standard glossary of terms used in software testing, Version 3.1, https://www.istqb.org/downloads/send/20-istqb-glossary/104-glossary-introduction.html. Last accessed: Dec. 2017.
[99] A.N. Langville, C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press.
[100] M. Höst, P. Runeson, Checklists for software engineering case study research, in: International Symposium on Empirical Software Engineering and Measurement, 2007, pp. 479–481.
[101] F.Q.B. da Silva, A.L.M. Santos, S. Soares, A.C.C. França, C.V.F. Monteiro, F.F. Maciel, Six years of systematic literature reviews in software engineering: an updated tertiary study, Inf. Softw. Technol. 53 (9) (2011) 899–913.
[102] V. Garousi, M. Felderer, Experience-based guidelines for effective and efficient data extraction in systematic reviews in software engineering, in: International Conference on Evaluation and Assessment in Software Engineering, Karlskrona, Sweden, 2017, pp. 170–179.
[103] D.S. Cruzes, T. Dybå, Synthesizing evidence in software engineering research, in: Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 2010.
[104] P. Raulamo-Jurvanen, K. Kakkonen, M. Mäntylä, Using surveys and web-scraping to select tools for software testing consultancy, in: Proceedings of the International Conference on Product-Focused Software Process Improvement, 2016, pp. 285–300.
[105] M.B. Miles, A.M. Huberman, J. Saldana, Qualitative Data Analysis: A Methods Sourcebook, third ed., SAGE Publications Inc, 2014.
[106] M.W. Toffel, Enhancing the practical relevance of research, Prod. Oper. Manag. 25 (9) (2016) 1493–1505.
[107] V. Garousi, M. Felderer, Developing, verifying and maintaining high-quality automated test scripts, IEEE Softw. 33 (3) (2016) 68–75.
[108] Emerald Publishing Limited, What should you write for a practitioner journal?, http://www.emeraldgrouppublishing.com/authors/guides/write/practitioner.htm?part=3. Last accessed: Dec. 2017.
[109] Unknown freelance authors, Titles that talk: how to create a title for your article or manuscript, https://www.freelancewriting.com/creative-writing/titles-that-talk/. Last accessed: Sept. 2016.
[110] T. Dingsøyr, F.O. Bjørnson, F. Shull, What do we know about knowledge management? Practical implications for software engineering, IEEE Softw. 26 (3) (2009) 100–103.
[111] C. Giardino, M. Unterkalmsteiner, N. Paternoster, T. Gorschek, P. Abrahamsson, What do we know about software development in startups? IEEE Softw. 31 (5) (2014) 28–32.
[112] F. Shull, G. Melnik, B. Turhan, L. Layman, M. Diep, H. Erdogmus, What do we know about test-driven development? IEEE Softw. 27 (6) (2010) 16–19.
[113] T. Dybå, T. Dingsøyr, What do we know about agile software development? IEEE Softw. 26 (5) (2009) 6–9.
[114] T. Hall, H. Sharp, S. Beecham, N. Baddoo, H. Robinson, What do we know about developer motivation? IEEE Softw. 25 (4) (2008) 92–94.
[115] V. Garousi, M. Felderer, Worlds apart: industrial and academic focus areas in software testing, IEEE Softw. 34 (5) (2017) 38–45.
[116] M. Harman, A. Mansouri, Search based software engineering: introduction to the special issue of the IEEE Transactions on Software Engineering, IEEE Trans. Softw. Eng. 36 (6) (2010) 737–741.
[117] M. Harman, S.A. Mansouri, Y. Zhang, Search-based software engineering: trends, techniques and applications, ACM Comput. Surv. 45 (1) (2012) 1–61.
[118] M. Harman, A. Mansouri, Y. Zhang, Search-based software engineering online repository, http://crestweb.cs.ucl.ac.uk/resources/sbse_repository/. Last accessed: Dec. 2017.
[119] Author unknown, A guide for writing scholarly articles or reviews for the Educational Research Review, https://www.elsevier.com/__data/promis_misc/edurevReviewPaperWriting.pdf. Last accessed: Dec. 2017.
