Principles of Data Management and Presentation
Ebook, 496 pages, 18 hours

About this ebook

The world is saturated with data. We are regularly presented with data in words, tables, and graphics. Students from many academic fields are now expected to be educated about data in one form or another. Yet the typical sequence of courses—introductory statistics and research methods—does not provide sufficient information about how to focus in on a research question, how to access data and work with datasets, or how to present data to various audiences.
  
Principles of Data Management and Presentation addresses this gap. Assuming only that students have some familiarity with basic statistics and research methods, it provides a comprehensive set of principles for understanding and using data as part of a research project, including:
• how to narrow a research topic to a specific research question
• how to access and organize data that are useful for answering a research question
• how to use software such as Stata, SPSS, and SAS to manage data
• how to present data so that they convey a clear and effective message
 
A companion website includes material to enhance the learning experience—specifically statistical software code and the datasets used in the examples, in text format as well as Stata, SPSS, and SAS formats. Visit www.ucpress.edu/go/datamanagement, Downloads tab. 
Language: English
Release date: Jul 3, 2017
ISBN: 9780520964327
Author

Dr. John P. Hoffmann

John P. Hoffmann is Professor of Sociology at Brigham Young University. Before arriving at BYU, he was a senior research scientist at the National Opinion Research Center (NORC), a nonprofit firm affiliated with the University of Chicago. He received a master’s in Law and Justice from American University and a doctorate in Criminal Justice from SUNY–Albany. He also received a master’s in Public Health with emphases in Epidemiology and Behavioral Sciences at Emory University. His research addresses drug use, juvenile delinquency, and the sociology of religion.  

    Book preview

    Principles of Data Management and Presentation

    JOHN P. HOFFMANN

    UNIVERSITY OF CALIFORNIA PRESS

    University of California Press, one of the most distinguished university presses in the United States, enriches lives around the world by advancing scholarship in the humanities, social sciences, and natural sciences. Its activities are supported by the UC Press Foundation and by philanthropic contributions from individuals and institutions. For more information, visit www.ucpress.edu.

    University of California Press

    Oakland, California

    © 2017 by The Regents of the University of California

    Library of Congress Cataloging-in-Publication Data

    Names: Hoffmann, John P. (John Patrick), 1962—author.

    Title: Principles of data management and presentation / John P. Hoffmann.

    Description: Oakland, California: University of California Press, [2017] | Includes bibliographical references and index. | Description based on print version record and CIP data provided by publisher; resource not viewed.

    Identifiers: LCCN 2017005895 (print) | LCCN 2017012555 (ebook) | ISBN 9780520964327 (epub and ePDF) | ISBN 9780520289956 (cloth : alk. paper) | ISBN 9780520289949 (pbk. : alk. paper)

    Subjects: LCSH: Research—Methodology. | Research—Data processing—Management.

    Classification: LCC Q180.55 (ebook) | LCC Q180.55 .H64 2017 (print) | DDC 001.4/2—dc23

    LC record available at https://lccn.loc.gov/2017005895

    Stata® is a registered trademark of StataCorp LP, 4905 Lakeway Drive, College Station, TX 77845 USA. SAS® and all other SAS Institute Inc. products or service names are registered trademarks in the USA and other countries. ® indicates USA registration. SAS Institute, Inc., 100 SAS Campus Drive, Cary, NC 27513 USA. SPSS® is a registered trademark of SPSS, Inc., 233 S. Wacker Drive, 11th Floor, Chicago, IL 60606 USA. SPSS, Inc. is an IBM company. Their use herein is for informational and instructional purposes only.

    Manufactured in United States of America

    25  24  23  22  21  20  19  18  17

    10  9  8  7  6  5  4  3  2  1

    To Curtis

    CONTENTS

    Preface

    Acknowledgments

    1 Why Research?

    Why Research?

    What Is Research?

    Classifying Research

    Impediments to Conducting Sound Research

    How Can We Make Research Interesting and Persuasive?

    The Research Process

    Final Words

    Exercises for Chapter 1

    2 Developing Research Questions

    Selecting a Topic

    From Topic to Research Question

    Refining Research Questions

    An Example

    Final Words

    Exercises for Chapter 2

    3 Data

    What Are Data?

    Sources of Data

    From Concepts to Variables

    Forms of Data

    Final Words

    Exercises for Chapter 3

    4 Principles of Data Management

    Codebooks

    Documentation

    Coding

    Data Cleaning and Screening

    Naming Conventions

    Principles of File Management

    Final Words

    Exercises for Chapter 4

    5 Finding and Using Secondary Data

    Types of Secondary Data

    Why Use Secondary Data?

    Sources of Secondary Data

    Examples of Searching for, Downloading, and Importing Data

    A Simple Test of the Conceptual Model

    The Pew Research Center Data

    Final Words

    Exercises for Chapter 5

    6 Primary and Administrative Data

    Principles for Primary Data

    Administrative Data and Linking Datasets

    Final Words

    Exercises for Chapter 6

    7 Working with Missing Data

    Why Are Missing Data a Problem?

    Reasons for Missing Data

    Types of Missing Data

    Forms and Patterns of Missing Data

    Addressing Missing Data in the Analysis Stage

    Final Words

    Exercises for Chapter 7

    8 Principles of Data Presentation

    Presenting Data

    Visual Images

    First Principles: Clarity, Precision, and Efficiency

    Why Words Are Not Enough

    Types of Tables and Graphics

    Principles of Data Presentation

    Final Words

    Exercises for Chapter 8

    9 Designing Tables for Data Presentations

    Table or Graphic?

    Tables

    Examples of Tables

    Final Words

    Exercises for Chapter 9

    10 Designing Graphics for Data Presentations

    Graphics

    Examples of Graphics

    Where to Next?

    Final Words

    Exercises for Chapter 10

    Appendix: Introduction to Statistical Software

    Stata

    SPSS

    SAS

    R

    Final Words

    References

    Index

    PREFACE

    The world is saturated with data. Readers of newspapers, magazines, blogs, and other media are exposed on a regular basis to data presented in words, tables, pictures, diagrams, and graphics. Customers of large retail establishments, patients who visit medical centers, students attending public and private schools, and many other people provide data—often unwittingly—that are used to predict their behaviors, medical conditions, and scores on various types of tests. You may not know what data you have provided in your life, but there is a high likelihood that information about you is part of a few large databases.

    I have been teaching courses on data analysis and research methods to university students for many years. These types of courses can be challenging for students in the social and behavioral sciences, especially in disciplines that tend to attract those who are not comfortable with quantitative methods. Although I have found that most students can handle the coursework—at least in my home disciplines of sociology and criminology—I’ve come to recognize a tangible gap in the way students are taught to do research. This involves the bridge between statistics courses and research methods courses.

    The typical approach for undergraduate students in the social and behavioral sciences (although this is also common in graduate programs) is to first complete an introductory course in statistics. This type of course typically teaches students about exploratory analyses, basic statistics (e.g., means, standard deviations), elementary statistical inference, and graphical representations of data (e.g., box plots, histograms). Introductory courses often conclude with units on correlations, analysis of variance (ANOVA), and simple linear regression models. In most of these courses, students are taught to use statistical software to conduct various types of analyses. Second, students take a course in research methods. This usually involves learning a little philosophy of science (epistemology, ontology), followed by a review of various research approaches, such as experiments, quantitative methods, and qualitative studies. In my department, students in this course are usually involved in a survey that requires them to become familiar with sampling, questionnaire design, interviewing techniques, and data entry. They are also exposed to theoretical issues in the field. Finally, in some programs, students build on this material by taking more advanced courses, such as regression analysis or qualitative methods, or by applying what they’ve learned in a capstone course.

    This sequence has existed for decades and may work relatively well, but I see the need to reconsider how students are taught to conduct research. I came to this conclusion after seeing the results of alumni surveys in which many of our students reported that they wished they had had more training in how to understand, manipulate, and use data to answer questions for their employers. This corresponds with a concern of some in the statistics education community that most students are not provided with "data habits of mind" that evolve mainly from working with data; these habits develop as students are taught to begin thinking about data at the very beginning of a research project, even before they examine a dataset (Baumer 2015; Finzer 2013, 5). I also noticed that many students who entered our graduate program could remember what means and correlations are used for and why one might choose a survey or an ethnographic approach to study some phenomenon, but they had a difficult time bridging the divide from understanding this material to conducting their own research. There seemed to be something missing, some gaps in their knowledge, skills, and experience.

    After speaking with some colleagues and mulling this over, I worked with one of my fellow faculty members to design a new course that concerns the research process, with a special emphasis on understanding and using data. The course is called Data Analysis, Management, and Presentation and has been a required course in my department for a little more than 10 years. However, the title does not include an important component: how to develop research questions. Although becoming familiar with data management and presentation is important and motivates this course, as well as this book, it should be clear that developing a good research project must begin with a good research question or problem to tackle. Thus, the first part emphasizes the general research process and how to develop good ideas and questions that guide subsequent data needs. In general, my goals for the course—and, derivatively, for this book—are to help students develop the skills learned in introductory statistics and research methods and complement them with a broader perspective on how to understand data and use them to conduct research projects. Given the strictures of a single semester and my own expertise, I’ve limited the course to working with quantitative data. This is not to say that other methods are not equally valuable, but given that most of my students will not go on to research careers, but often do take jobs that require familiarity with quantitative data and their uses, I thought it prudent to focus on one area.

    The reading material for such a course is not available in a single location. Thus, I have used many books and articles to address particular aspects of the curriculum. Yet my concern that many students graduate from a social or behavioral science program without some useful research skills has convinced me that a single source would be valuable, hence the book you are now reading. But what are its general purposes? First, I hope to provide readers with a general understanding of some important aspects of social and behavioral science research. This is motivated not only by my experiences conducting research and teaching undergraduate and graduate students but also by a recent emphasis on workflow. This term has been borrowed from organizational studies to address the steps that a researcher should take to initiate and complete a research project (Kirchkamp 2013; Long 2009). Workflow is typically concerned with the part of a research project that involves the data, including the following steps: (a) collecting, compiling, and organizing a dataset; (b) planning the method of analysis; (c) analyzing the data; and (d) writing a report that describes and interprets the results of the analysis. One of the emphases that sets workflow apart from research steps as generally understood is that, for efficient workflow, each step should be carefully and fully documented so that a researcher can repeat each of them (if needed), share what was done with colleagues, and allow other researchers to reproduce or replicate the work. Although this book does not repeat in detail what others have recommended regarding workflow, this way of understanding a key part of the research process does influence what follows.
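
    As a rough illustration of what documenting each workflow step might look like, the following Python sketch keeps a simple project log for steps (a) through (d). It is only one possible approach, not a method taken from the workflow literature cited above; the function name log_step and every file name are hypothetical.

```python
import json
from datetime import date


def log_step(log, step, description, files):
    """Append one documented workflow step to the project log."""
    log.append({
        "step": step,
        "description": description,
        "files": list(files),                # hypothetical file names
        "date": date.today().isoformat(),    # when the step was recorded
    })
    return log


log = []
log_step(log, "a", "Collect, compile, and organize the dataset", ["raw_survey.csv"])
log_step(log, "b", "Plan the method of analysis", ["analysis_plan.txt"])
log_step(log, "c", "Analyze the data", ["analysis.py"])
log_step(log, "d", "Write a report that describes and interprets the results", ["report.txt"])

# A plain-text (JSON) rendering of the log can be shared with colleagues.
print(json.dumps(log, indent=2))
```

    Writing the log as plain text means that a colleague can read it without any particular statistical package.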

    Second, although I find an emphasis on project workflow to be valuable because it reminds us to be well organized and carefully document each stage of the data and analysis work, a key feature of research studies is often omitted from discussions of workflow. That is, for a project to be timely and important, it is not enough to gather data and analyze them. There are key steps that must come first, especially identifying a research question or problem to guide the project. Thus, the first two chapters of this book discuss what research is in a general context and how to narrow down the scope of one’s interest to a research question, problem, or hypothesis that may be investigated within the structure of a social science study.

    Third, whereas many elementary statistics and research methods courses require students to use statistical software, they don’t typically teach much about data management, including labeling and coding practices, missing data, and data cleaning. Exploratory analysis, a typical part of elementary statistics courses, is rarely linked to data management and cleaning, yet it can play a crucial role in helping students understand data better. Hence, data management is an area emphasized in this book.
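
    To make the link between exploratory analysis and data cleaning a little more concrete, here is a minimal Python sketch that tabulates one variable and flags codes that fall outside a codebook. The valid codes, the missing-value code (-9), and the responses are all invented for illustration; they do not come from the datasets used in this book.

```python
from collections import Counter

# Hypothetical codebook entry: valid codes for a five-point item, plus a
# numeric code that marks "no answer".
VALID_CODES = {1, 2, 3, 4, 5}
MISSING_CODE = -9


def screen(values):
    """Tabulate a variable and separate missing and out-of-range codes."""
    freq = Counter(values)
    missing = freq.get(MISSING_CODE, 0)
    out_of_range = {
        code: count
        for code, count in freq.items()
        if code not in VALID_CODES and code != MISSING_CODE
    }
    return freq, missing, out_of_range


# Made-up responses; the 7 is a deliberate data-entry error.
responses = [1, 2, 2, 5, -9, 3, 7, 4, -9, 2]
freq, missing, out_of_range = screen(responses)

for code in sorted(freq):
    print(code, freq[code])
print("missing:", missing)
print("out of range:", out_of_range)
```

    A frequency table of this kind is a standard exploratory tool, and here it doubles as a screening device: the stray code stands out as soon as the variable is tabulated.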

    Fourth, even though research methods courses do a fine job of teaching students how to collect data, I fear that many of them fail to provide some critical skills in understanding and handling data. In addition, students are rarely exposed to administrative data and how they may be combined from different sources into a dataset designed for a particular research objective. A substantial number of studies in the social sciences also use secondary data: those that have been collected by other researchers and made available to the research community. Secondary datasets offer a cost-effective way to conduct research, yet it is the rare course that teaches students about them in any detail. Thus, this book discusses several ways to acquire data.

    Fifth, social science statistics and methods courses spend a substantial amount of time teaching students how to estimate models, but spend too little time, in my judgment, on how to present data. Many students are simply expected to learn about data presentation by preparing research posters and papers for courses, or, for some, to present at conferences. Much of their education about data presentation comes informally from mentors or from reading research articles in their field. This is an inefficient way to learn, though. Thus, one of my objectives is to provide some principles of data presentation, along with specific examples of good presentation practices. In other words, once we have identified a research question, gained access to and organized some data, and conducted an analysis, how can we effectively present the results so that various audiences can understand what we’ve done?

    Now that I’ve described a few things this book is designed to accomplish, it may be helpful to discuss what it is not. First, this is not a book, at least not as usually presented, on research methods. There are plenty of excellent books on how to conduct research in general, with an abundance of information on different methods, such as experiments, surveys, and ethnographies. Many of these books also discuss issues such as research ethics, developing good questions, validity and reliability, sampling, and measurement techniques.

    Second, the purpose of this book is not to teach readers about elementary or advanced statistics. Again, there are plenty of books that provide specific guidance on how to use statistical models to analyze data. Elementary statistics books present information on data collection, various exploratory techniques, graphical methods, hypothesis testing, Bayes’s theorem, probability distributions, estimation and inference, nonparametric statistics, ANOVA, and comparing population parameters. More advanced books cover topics such as linear regression analysis, generalized linear models, survival models, simultaneous equations, and multivariate statistics. I assume that readers are familiar only with the statistical tools taught in an elementary statistics course. Along these lines, I try not to take a position that favors frequentist or Bayesian statistics. Although most of my own work has been within the frequentist framework, there is little in this book that could not apply to either analytic approach.

    Third, this is not a book on statistical software. Software is used as a tool to illustrate the various concepts and principles used throughout, but the choice of software is secondary. Thus, readers will likely need to consult resources that fall outside this presentation for more information on using statistical and data management software. Nevertheless, Appendix A provides a brief introduction to three statistical software packages that are used in the examples sprinkled throughout the book: Stata®, SPSS®, and SAS®. It also points to resources for learning how to use each of these packages. Although the programming language and data analysis software R is not used for the examples, Appendix A provides a brief introduction to its capabilities as well.

    Finally, this book is not designed to teach readers how to write research reports or articles, nor how to present conference posters or papers. Although the principles and tools presented herein include some important aspects of preparing reports of research, there are other, more comprehensive resources that describe how to write about numbers (Miller 2004; Morgan et al. 2002), how to prepare research articles (Baglione 2016; Becker 2007; White 2005), and how to put together research presentations (Cohen et al. 2012; Miller 2007b).

    As suggested earlier, this book is designed to fill in some gaps that I see as especially stark in the way research is taught to undergraduate and graduate students in the social and behavioral sciences. I’ve found that their skill set in basic research methods and statistics tends to be sound, but there is more to the research process than the knowledge that is typically imparted in these courses. There are also some specific skills that are all too often missing from their education. I hope to fill in some of these gaps by offering this book.

    A BRIEF DESCRIPTION OF THE CHAPTERS

    The chapters are designed to be read in the order presented, although readers with research experience may wish to skip around to find topics that interest them.

    Chapter 1 addresses why we conduct research and some of the benefits of research to the individual, community, and society in general. It describes generally what research is and reviews various types of research, such as exploratory, descriptive, and analytical/explanatory research. There is also a discussion of impediments to sound research, some characteristics of research that make it interesting and persuasive, and the general research process. The goal of the chapter is to get readers thinking about research, why it is important, and how we can make it sound and useful.

    Chapter 2 focuses on how to develop research questions. It begins with an emphasis on narrowing in on interesting topics that others will also find stimulating and significant. This is followed by a discussion of the role of using various techniques to move from a topic to a reasonable research question. The next section addresses theories, concepts, and arguments. The idea of a conceptual model is utilized to outline this issue, along with concepts and statements that can be used to construct these models.

    Chapter 3 addresses some fundamental issues regarding data, such as what they are and how they are characterized in the social sciences. The next topic addressed in this chapter is measurement, in particular, some ways that researchers move from concepts to variables. Chapter 4 describes some principles of data documentation, as well as different coding strategies that are common in the social and behavioral sciences. This is followed by an overview of some principles of data management, which include data cleaning and screening practices, and naming conventions. Finally, some principles of file management are presented.

    Because many research projects in the social and behavioral sciences use secondary data (typically quantitative data that have been collected by another researcher and made available through a data repository or in some other way), Chapter 5 emphasizes some positive and negative aspects of this type of data. Examples of large data repositories that share data freely with researchers are listed. The next section describes some common ways to download data and import them into software useful for data management and analysis. The software examples include Stata, SPSS, and SAS, but also discuss how text files are particularly useful for conducting research with different software platforms.
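
    The portability of text files can be illustrated with a short Python sketch that reads comma-separated text using only the standard library. The variable names and values are made up, and the text is held in a string rather than a downloaded file so that the example is self-contained.

```python
import csv
import io

# Made-up comma-separated text standing in for a downloaded data file.
raw_text = """id,age,state
1,34,UT
2,,CA
3,51,NY
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(raw_text)

# Empty fields arrive as empty strings; convert them to None so that
# missing values are explicit before any analysis.
for row in rows:
    row["age"] = int(row["age"]) if row["age"] else None

print(rows)
```

    Because the source is plain text, the same file could be imported into Stata, SPSS, SAS, or a spreadsheet without conversion.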

    Since learning how to gather primary data is one of the principal goals of research methods courses, which may be a prerequisite for courses that use this type of book, some more general principles regarding the creation of datasets from primary data collection are provided in Chapter 6. The chief emphasis is on understanding some principles of creating quantitative datasets, including how to combine administrative data to good effect.

    The last few years have seen growing interest in missing data and their implications for social and behavioral science research. There are now many tools for handling missing data. Chapter 7 begins with a discussion of different types of missing data and their implications for research and data analysis. It then outlines several techniques for handling missing values, with an emphasis on providing a straightforward description of how to use state-of-the-art methods, maximum likelihood and multiple imputation, for attenuating missing data problems. This chapter presents material that is at a slightly more advanced level than the material in other chapters, so readers may wish to focus on its first three sections to gain a general understanding of missing data.

    Chapter 8 provides an overview of the principles used in effective data presentation, including those based on research on cognitive processing and recognition to illuminate how people tend to see and interpret data when they are presented to them in tables or graphics. This includes an emphasis on comparisons, pattern recognition/perception, the use of color, dimensionality, and data ordering and labeling that help tables and graphics meet the goals of clarity, precision, and efficiency. A key objective of this chapter is to get readers to think about how their research questions and the audience should guide the most effective ways that the data are presented.

    Chapter 9 addresses principles of designing tables to present research findings. It discusses various principles of organization, data ordering, and using labels, titles, legends, and notes. The aim is to help the reader design tables that convey the main message of the data and analysis.

    The final chapter is an overview of common types of graphics, their respective strengths and weaknesses (relative to the design principles discussed in Chapter 8), and how the choice of a particular graphic depends on the type of data or analysis that one wishes to present. Additional principles of creating effective graphics are discussed. The overarching goal of Chapters 9 and 10 is to provide principles and examples so that one’s audience or readers are most likely to reach an efficient level of understanding. The last section of the chapter mentions some innovative ways of presenting data, such as with dynamic and interactive graphics. It also provides readers with suggestions of where to go next to find good tools for data presentation through visualization.

    A BIT MORE ON SOFTWARE

    I have used many statistical and data management software platforms over the last 30 years. I began many years ago with SPSS, moved to SAS, shifted to SPlus and Stata, and have recently relied on R for many analytical tasks. I’ve also used MS Access and various spreadsheet software for data management tasks, and have dabbled with data visualization and presentation software (e.g., FlowVella, Visual.ly, Tableau, Prezi). However, I teach classes mainly with Stata because it is particularly efficient as a teaching tool. Given that the types of courses and readers who might benefit from the material in this book likely use a variety of software, though, I have tried to provide diversity by presenting some of the information that follows in Stata, SPSS, and SAS. The program R, which is growing rapidly in popularity because of its breadth of capabilities and its cost (free), is also discussed in Appendix A. I also show some data in spreadsheets, mainly imported from comma-separated values (csv) text files, which provide a good cross-platform format for downloading data and moving them into statistical software. Almost any spreadsheet software may be used to import and examine data, such as Numbers for Mac, Google Sheets, Apache OpenOffice, MS Excel, or LibreOffice Calc. In the discussion of program documentation and the preparation of program files, I rely mainly on Notepad++ since it is a widely used and easily accessible text editor. However, there are many other options available, such as Vim, Emacs, TextPad, TextEdit, or Sublime Text. Finally, although I use OS X, Linux, and MS Windows operating systems (depending on the computer), the following material was created on MS Windows and OS X based computers. With a few exceptions, most of the application software used herein may be run with a variety of operating systems.

    LEARNING RESOURCES

    This book is accompanied by a publisher’s website that includes material to enhance the learning experience of the reader. The website includes the statistical software code and the datasets used in the examples. Although Chapters 4 and 5 emphasize that original datasets should be in text format, the datasets on the website are also available in Stata, SPSS, and SAS formats (www.ucpress.edu/go/datamanagement).

    ACKNOWLEDGMENTS

    I owe so much to the many students I’ve taught over the years, especially those who have been in my data management, statistics, and research methods courses. They have taught me how to teach (although I have only myself to blame for the remaining limitations), and I’ve shared in their struggles to understand data and analysis. At the risk of forgetting some, I am especially indebted to Liz Warnick, Dallin Everett, Mandy Workman, Daehyeon Kim, Colter Mitchell, Bryan Johnson, and Scott Baldwin. Some of my colleagues who have taught similar courses have been especially gracious in sharing their experiences and material with me, including Brayden King, Lance Erickson, Carter Rees, and Eric Dahlin. I owe much to those at the University of California Press and IDS Infotech Ltd. who shepherded this work to print. This includes Seth Dobrin, Kate Hoffman, Chris Sosa Loomis, Renee Donovan, Jack Young, Mansi Gupta, S. Bhuvneshwari, and many others who work for these remarkable organizations. Finally, I thank Lynn, Brian, Christopher, Brandon, and Curtis for making family life the center of my existence. This book is dedicated, in particular, to Curtis, whose music and science will one day change the world.

    ONE

    Why Research?


    It seems that every day brings another report of some research finding. A quick Internet search of news articles that appeared on an otherwise ordinary day in June revealed, among other stories, that California just approved a publicly funded gun research center, the Netherlands began a campaign to identify and reduce research misconduct, a geological study discovered that some parts of the San Andreas Fault are sinking and others are rising, a nutrition study suggested that broccoli is healthier than previously thought because its phenolic compounds have notable antioxidant properties, and a survey revealed that about 31% of people admit to snooping on a friend or loved one by looking at their cell phones. As suggested by just a single day’s news coverage, research is a huge enterprise, employing millions of people worldwide and resulting in thousands of reports, articles, and books every year. The American Association for the Advancement of Science (2016) estimates that the US government spends about $70 billion per year on various forms of research.

    But many people have questioned the value of some of this funded research. We regularly see debates about the value of research on global warming, firearms, health-care systems, and many other topics. In addition, conservative politicians such as US Senator Tom Coburn of Oklahoma publish annual reports of federal government waste, taking particular glee in pointing out what are considered dubious scientific studies. For example, the 2014 Wastebook highlights studies of gambling monkeys, mountain lions on treadmills, and synchronized swimming by brine shrimp. Yet, there is clearly much to be gained from good research. Without it, there is little doubt that death, illness, and injury rates would be much higher. Food production would be substantially lower. The field of forensic science would be much more primitive, thus impeding efforts to solve crimes and catch criminals. Producing enough power to light homes, operate cars, and run businesses would be much more difficult. The list goes on and on.

    Social science often gets a particularly bad rap because some do not consider it a true science. But it has contributed not only to making the world a better place but also to increasing our understanding of the way people, social groups, communities, and institutions function and interact. Let’s examine a few examples of social science research to see what it has taught us. As you read the following illustrations, think of what broader implications each has for understanding the social world and perhaps even improving people’s lives.

    In the mid-1960s, the social psychologist Stanley Milgram wanted to determine how close or far apart people were socially. He devised a project in which he mailed a letter to random people who lived in several
