
Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game

2010


Business Intelligence Journal
Volume 15 • Number 2 • 2nd Quarter 2010
The leading publication for business intelligence and data warehousing professionals

In This Issue
From the Editor
BI-based Organizations (Hugh J. Watson)
Beyond Business Intelligence (Barry Devlin)
Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game (Sule Balkan and Michael Goul)
BI Case Study: SaaS Helps HR Firm Better Analyze Sales Pipeline (Linda L. Briggs)
Enabling Agile BI with a Compressed Flat Files Architecture (William Sunna and Pankaj Agrawal)
BI Experts' Perspective: Pervasive BI (Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom)
BI and Sentiment Analysis (Mukund Deshpande and Avik Sarkar)
Instructions for Authors
Dashboard Platforms (Alexander Chiang)
BI Statshots
From the Editor
In good economies and bad, the secret to success is to meet your customers' or clients' needs. Your enterprise has to respond to changing conditions and emerging trends, and it has to do so quickly. Your organization must be, in a word, agile.

"Agile" has been used to describe an application development methodology designed to help IT get more done in less time. We're expanding the meaning of agile to include the techniques and best practices that will help an organization as a whole be more responsive to the marketplace, especially as it relates to its business intelligence efforts.

In our cover story, William Sunna and Pankaj Agrawal note that rapid results in active data warehousing become vital if organizations are to manage and make optimal use of their data. Their compressed flat-file architecture helps an enterprise develop less costly solutions and do so faster—which is at the very heart of agile BI.

Sule Balkan and Michael Goul explain how in-database analytics advance predictive modeling processes. Such technology can significantly reduce cycle times for rebuilding and redeploying updated models. It will benefit analysts who are under pressure to develop new models in less time and help enterprises fine-tune their business rules and react in record time—that is, boost agility.

Barry Devlin notes that businesses need more from IT than just BI.
Transaction processing and social networking must be considered. Devlin points out how agility is a major driver of operational environment evolution, and how the need for agility in the face of change is driving the need for a new architecture.

Alexander Chiang looks at dashboard platforms (the technologies, business challenges, and solutions) and how rapid deployment of agile dashboard development reduces costs and puts dashboards into the hands of users quickly.

Also in this issue, senior editor Hugh J. Watson looks at enterprises that have immersed BI in the business environment, where work processes and BI intermingle and are highly interdependent. Mukund Deshpande and Avik Sarkar explain how sentiment data (opinions, emotions, and evaluations) can be mined and assessed as part of your overall business intelligence. In our Experts' Perspective column, Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom suggest best practices for correcting data quality issues.

We're always interested in your comments about our publication and specific articles you've enjoyed. Please send your comments to jpowell@1105media.com. I promise to be agile in my reply.

BI-based Organizations
Hugh J. Watson

Hugh J. Watson is Professor of MIS and C. Herman and Mary Virginia Terry Chair of Business Administration in the Terry College of Business at the University of Georgia. hwatson@terry.uga.edu

A growing number of companies are becoming BI-based. For these firms, business intelligence is not just nice to have; rather, it is a necessity for competing in the marketplace. These firms literally cannot survive without BI (Wixom and Watson, 2010).

In BI-based organizations, BI is immersed in the business environment.1 Work processes and BI intermingle, are highly interdependent, and influence one another. Business intelligence changes the way people work as individuals, in groups, and in the enterprise. People perform their work following business processes that have BI embedded in them. Business intelligence extends beyond organizational boundaries and is used to connect and inform suppliers and customers.

An Example of a BI-based Organization
I recently completed a case study of a major online retailer. (The well-known company asked that its name not be used in this article.) Business intelligence permeates its operations. The company has a data warehouse group that maintains the decision-support data repository and a decision-support team with analysts scattered throughout the business to help develop and implement BI applications. The applications include:

■■ Forecasting product demand
■■ Determining the selling price for products, both initially and later for products with sales below expectations
■■ Market basket analysis
■■ Customer segmentation analysis
■■ Product recommendations, both while customers are on the Web site and in follow-on communications
■■ Customer and product profitability analysis
■■ Campaign planning and management
■■ Supply chain integration
■■ Web analytics
■■ Fact-based decision making

1. The concept of a BI-based organization is similar to the "immersion view" of IT introduced in O.A. El Sawy, 2003.
Some of the details of these applications are interesting. For example, the customer profitability analysis considers whether a customer typically buys products at full price or only those that are discounted, and whether a customer has a history of returning products. When a product has an excessive return rate, it is pulled off the Web site and assigned to an investigative team to work with the vendor to identify and fix the problem. When the problem resides with the vendor, the vendor is expected to make good on the costs incurred. Hundreds of tests are also run each year to see what marketing approaches work best, such as the content of offers and the most effective e-mail subject lines.

When asked to describe his company, the retailer's CEO said he does not view it as a retailer or even an information-based company. Rather, he sees it as "a BI company." The use of analytics is key to its success.

Becoming a BI-based Organization
If companies are becoming increasingly BI-based, what are the drivers behind this trend and what are the requirements for success? Although there are exceptions, the following conditions are typical.

A Highly Competitive Business Environment
Nearly all firms face a competitive business environment and can benefit from BI in some way. This is especially true for firms that serve high-volume markets using standardized products and processes. Think of retailers (e.g., Walmart), telecommunications firms (AT&T), and financial institutions (Bank of America). They all use BI to understand and communicate with customers and to optimize their operations.

Consider the online retailer discussed earlier. It must compete against large pure-plays such as Amazon.com and the traditional brick-and-mortar companies that also have a strong online presence. To be successful, it must use analytics to understand and anticipate the needs and wants of its millions of customers, offer products at appealing yet profitable prices, communicate and make offers that are wanted (no spam), and acquire and deliver products in an efficient and effective way.

A Strategic Vision and Executive Support for BI
Several years ago I worked with a regional bank that was close to going under (Cooper et al, 2000). The new management team "stopped the bleeding" by cutting costs but knew that this was not a sustainable business strategy. For the long run, the bank implemented a strategy based on knowing its customers exceptionally well. The CEO had a vision for how the strategy would work. He wanted everyone on the project team to understand and buy into the strategy and to have the necessary skills to execute it. When the vice president of marketing proved unable to execute the vision, he was replaced, despite having been hired specifically for the job. Because time was of the essence and in-house data warehousing and BI skills were lacking, consultants were used extensively. It was an expensive, "bet-the-bank" approach, but it proved highly successful. With senior management's vision and support, the bank became a leader in the use of analytics and an emerging leader in the industry.

Smart companies have formal BI vision documents and use them to guide and communicate their plans for BI. Sponsors must understand their responsibilities, such as being visible users of the system and helping to handle political problems.

An Analytical Culture
The bank we mentioned had 12 marketing specialists prior to implementing its customer intimacy strategy, and had 12 different people afterwards. All of the original dozen employees had moved to other positions or left the bank.
The bank's CEO said their idea of marketing was handing out balloons and suckers at the teller line and running focus groups. The new marketing jobs were very analytical, and the previous people couldn't or didn't want to do that kind of work.

At Harrah's Entertainment, decisions used to be made based on Harrahisms—pieces of conventional wisdom that were believed to be true (Watson and Volonino, 2002). As Harrah's moved to fact-based decision making, these Harrahisms were replaced by analyses and tests of what worked best. Using this strategy, Harrah's evolved from a blue-collar casino into the industry leader.

In the short run, a company either has an analytical culture or it doesn't. Change needs to originate at the top, and it may require replacing people who don't have analytical skills.

A Comprehensive Data Infrastructure
A company's BI efforts cannot be any better than the available data. That is why so much time and effort is devoted to building data marts and warehouses, enhancing data quality, and putting data governance in place. Once these exist, however, it is relatively easy to realize the benefits of BI.

Continental Airlines has a comprehensive data warehouse that includes marketing, revenue, operations, flight and crew data, and more (Watson et al, 2006). Because the data is in place, and the BI team and business users are familiar with the data and have the ability to build applications, new applications can be developed in days rather than months, allowing Continental to be very agile.

Talented BI Professionals
BI groups need a mix of technical and business skills. Although good technical talent is a must, an enterprise must have people who can work effectively with users. I have been most impressed with those firms that have hybrid employees—that is, people with excellent technical and business skills. At Continental Airlines, it is not always clear whether you are talking with someone from the BI group or from one of the business units. Many people understand both BI and the business, and there is a good reason for this: Some of the people in BI used to work in the business, and vice versa. This approach can help eliminate the chasm that is so common between IT and business people.

Conclusion
My list of drivers and requirements for a BI-based organization is not all-inclusive, but if you get these things right, you are well on your way to creating a successful BI-based organization. ■

References
Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue [2000]. "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, December, pp. 547–567.

El Sawy, O.A. [2003]. "The IS Core—The 3 Faces of IS Identity: Connection, Immersion, and Fusion," Communications of the Association for Information Systems, Vol. 12, pp. 588–598.

Watson, H.J., B.H. Wixom, J.A. Hoffer, R. Anderson-Lehman, and A.M. Reynolds [2006]. "Real-time Business Intelligence: Best Practices at Continental Airlines," Information Systems Management, Winter, pp. 7–18.

———, and L. Volonino [2002]. "Customer Relationship Management at Harrah's Entertainment," Decision-Making Support Systems: Achievements and Challenges for the Decade, Forgionne, G.A., J.N.D. Gupta, and M. Mora (eds.), Idea Group Publishing.

Wixom, B.H., and H.J. Watson [2010]. "The BI-Based Organization," International Journal of Business Intelligence Research, January–March, pp. 13–25.
Beyond Business Intelligence
Barry Devlin

Barry Devlin, Ph.D., is a founder of the data warehousing industry and among the foremost worldwide authorities on business intelligence and the emerging field of business insight. He is a widely respected consultant, lecturer, and author of the seminal book, Data Warehouse: From Architecture to Implementation. He is founder and principal of 9sight Consulting (www.9sight.com). barry@9sight.com

Abstract
It has been almost 25 years since the original data warehouse was conceived. Although the term business intelligence (BI) has since been introduced, little has changed from the original architecture. Meanwhile, business needs have expanded dramatically and technology has advanced far beyond what was ever envisioned in the 1980s. These business and technology changes are driving a broader and more inclusive view of what the business needs from IT; not just in BI but across the entire spectrum—from transaction processing to social networking. If BI is to be at the center of this revolution, we practitioners must raise our heads above the battlements and propose a new, inclusive architecture for the future. Business integrated insight (BI2) is that architecture. This article focuses on the information component of BI2—the business information resource. I introduce a data topography and a new modeling approach that can support data warehouse implementers to look beyond the traditional hard information content of BI and consider new ways of addressing such diverse areas as operational BI and (so-called) unstructured content. This is an opportunity to take the next step beyond BI to provide complete business insight.

The Evolution of an Architecture
The first article describing a data warehouse architecture was published in 1988 in the IBM Systems Journal (Devlin and Murphy, 1988), based on work in IBM Europe over the previous three years. At almost 25 years old, data warehousing might thus be considered venerable. It has also been successful; almost all of that original architecture is clearly visible in today's approaches. The structure and main components of that first warehouse architecture are shown in Figure 1, inverted to match later bottom-to-top flows but otherwise unmodified.

[Figure 1. Data warehouse architecture, 1988. Components include: operational systems and local data, data interface, business data warehouse (raw detailed, enhanced detailed, and enhanced summary data), data dictionary and business process definitions, and end user interface.]
[Figure 2. The layered data warehouse architecture (Devlin, 1997). Components include: operational systems, enterprise data warehouse, data marts, metadata, reports, and end user workstation.]

Despite changes in nomenclature, all but one of the major components of the modern data warehouse architecture appear. The data interface clearly corresponds to ETL. The business data directory was later labeled metadata. The absence of data marts is more apparent than real. The business data warehouse explicitly described data at different levels of granularity, derivation, and usage—all the characteristics that later defined data marts. The only missing component, seen only recently in data warehouses, is enterprise information integration (EII) or federated access.

A key mutation occurred in the architecture in the early 1990s.
This mutation, shown in Figure 2, split the singular business data warehouse (and all informational data) into two horizontal layers—the enterprise data warehouse (EDW) and the data marts—and also vertically split the data mart layer into separate stovepipes of data for different informational needs. The realignment was driven largely by the need for better query performance in relational databases. The highly normalized tables in the EDW usually required extensive and expensive joins of such tables to answer user queries. Another driver was “slice-and-dice” analysis, which is most easily supported using dimensional models and even specialized data stores. Figure 1 is a logical architecture. It shows two distinct types of data—operational and informational—and recognizes the fundamental differences between them. Operational data was the ultimate source of all data in the warehouse, but was beyond the scope of the warehouse: fragmented, often unreliable, and in need of cleansing and conditioning before being loaded. The warehouse data, on the other hand, was cleansed, consistent, and enterprisewide. This dual view of data informed how decision support was viewed by both business and IT since its invention in the 1960s (Power, 2007). 8 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 This redrawing of the original, logical architecture picture has had significant consequences for subsequent thinking about data warehousing. First was a level of mental confusion about whether the architecture picture was supposed to be logical or physical. Such a basic architectural misunderstanding divides the community BEYOND BI into factions debating the “right” architecture—recall the Inmon versus Kimball battles of the 1990s. current mess in decision support that will be cured by data warehousing. Second, and more important, is the disconnect from a key requirement of the original architecture: that decision-support information must be consistent and integrated across the whole enterprise. When viewed as a physical picture, Figure 2 can encourage fragmentation of the information vertically (based on data granularity or structure) and horizontally (for different organizational/ user needs or divisions). The implication is that data should be provided to users through separate data stores, optimized for specific query types, performance needs, etc. Vendors of data mart tools thus promoted quick solutions to specific data and analysis needs, paying lip service—at best—to the EDW. In truth, most generalpurpose databases struggled to provide the performance required across all types of queries. The EDW is often little more than a shunting yard for data on its way to data marts or a basic repository for predefined reporting. This brief review of the evolution of data warehousing poses three questions: The third, and more subtle, consequence is that thinking about logical and physical data models and storage has also split into two camps. Enterprise architecture focuses on data consistency and integrity, often assuming that the model may never be physically instantiated. On the other hand are solution developers who focus on application performance at the expense of creating yet more copies of data. The result is dysfunctional IT organizations where corporate and departmental factions promote diametrically opposed principles to the detriment of the business as a whole. Of course, Figure 2 is not the end of the architecture evolution. Today’s pictures show even more data storage components. 
Metadata is split off into a separate layer or pillar. The EDW is complemented by stores such as master data management (MDM) and the operational data store (ODS). Data marts have multiplied into various types based on usage, function, and data type. The connectivity of EII has been added in recent years. In truth, these modern pictures have become more like graphical inventories of physical components than true logical architectures; they have begun to look like the spaghetti diagrams beloved by BI vendors to show the ■■ After 25 years of changing business needs, do we need a new architecture to meet the current and foreseen business demands? ■■ What would a new logical data architecture look like? ■■ What new modeling and implementation approaches are needed to move to the new architecture? What Business Needs from IT in the 21st Century The concepts of operational BI and unstructured content analytics point to the most significant changes in what business expects of IT over the past decade. The former reflects a huge increase in speed and agility required by modern business; the latter points to a fundamental shift in focus by decision makers and a significant expansion in the scope of their attention. Speed has become one of the key drivers of business success today. Decisions or processes that 20 years ago took days or longer must now be completed in hours or even minutes. The data required for such activities must now be up to the minute rather than days or weeks old. Increasing speed may require eliminating people from decision making, which drives automation of previously manual work and echoes the prior automation of “blue collar” work. As a result, the focus of data warehousing has largely shifted from consistency to speed of delivery. In truth, of course, delivering inconsistent data more quickly is actually worse in the long term than delivering it slowly, but this obvious consideration is often conveniently ignored. As the term “operational BI” implies, decision making is being driven into the operational environment by this trend. Participants from IT in operational BI seminars repeatedly ask: How is this different from what goes on in the operational systems? The answer is: not a lot. This response has profound implications for data warehouse BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 9 BEYOND BI architecture, disrupting the division that has existed between operational and informational data since the 1960s. If BI architects can no longer distinguish between operational and informational activities, how will users do so? Agility—how easily business systems cope with and respond to internal and external change—is a major driver of evolution in the operational environment. Current thinking favors service-oriented architecture (SOA) as a means of allowing rapid and easy modification of workflows and exchange of business-level services as business dictates. Such rapid change in the operational environment creates problems for data loading using traditional ETL tools with more lengthy development cycles. On the plus side, the message-oriented interfaces between SOA services can provide the means to load data continuously into the warehouse. Furthermore, the operational-informational boundary becomes even more blurred as SOA becomes pervasive, especially as it is envisaged that business users may directly modify business processes. Users simply do not distinguish between operational and informational functions. 
They require any and all services to operate seamlessly in a business workflow. In this environment, the old warehousing belief that operational data is inconsistent while warehouse data is reliable simply cannot be maintained. Operational data will have to be cleansed and made consistent at the source, and as this occurs, one rationale for the EDW—as the dependable source of consistent data—disappears. Turning to the growing interest in and importance of unstructured data, we encounter further fundamental challenges to our old thinking about decision making and how to support it. We are constantly reminded of the near-exponential growth in these data volumes and the consequent storage and performance problems. However, this is really not the issue. The real problem lies in the oxymoron “unstructured data.” All data is structured—by definition. “Structured” data, as it’s known, is designed to be internally consistent and immediately useful to IT systems that record and 10 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 analyze largely numerical and categorized information. Such hard information is modeled and usually stored in tabular or relational form. “Unstructured” information, in reality, has some other structure less amenable to numerical use or categorization. This soft information often contains or is related to hard information. For example, a business order can exist as: (1) a message on a voicemail system; (2) a scanned, handwritten note; (3) an e-mail message; (4) an XML document; and (5) a row in a relational database. As we proceed along this list, the information becomes harder, that is, more usable by a computer. On the other hand, we may lose some value inherent in the softer information: the tone of voice in the voicemail message may alert a person to the urgency of the order or some dissatisfaction of the buyer. Business decision makers, especially at senior levels, have always used soft information, often from beyond the enterprise, in their work. Such information was gleaned from the press and other sources, gathered in conversations with peers and competitors, and grafted together in face-to-face interactions between team members. Today, these less-structured decision-making processes are electronically supported and computerized. The basic content is digitized, stored, and used online. Conversations occur via e-mail and instant messaging. Conferences are remote and Web-based. For data warehousing, as a result, the implications extend far beyond the volumes of unstructured data that must be stored. These volumes would pose major problems—the viability of copying so much data into the data warehouse and management of potentially multiple copies—if we accepted the current architecture. However, of deeper significance is the question of how soft information and associated processes can be meaningfully and usefully integrated with existing hard information and processes. At its core, this is an architectural question. How can existing modeling and design approaches for hard information extend to soft information? Assuming they can, how can soft information, with its loose and fluid structure, be mined on the fly for the metadata inherent in its content? 
Although these questions are not new, there is little consensus so far about how this will be done. As was the case for enterprise data modeling, which matured in tandem with the data warehouse architecture, methods of dealing with soft information will surface as a new architecture for life beyond BI is defined. In the case of operational BI and SOA, the direction is clear and the path is emerging: The barrier between operational and informational data is collapsing, and improvements in database technology suggest that we can begin to envisage something of a common store. For the structured/unstructured divide, the direction is only now emerging and the path is yet unclear. However, the direction echoes that for operational/informational stores—the barriers we have erected between these data types no longer serve the business. We need to tear down the walls.

Business Integrated Insight and the Business Information Resource
Business integrated insight (BI2), a new architecture that shows how to break down the walls, is described elsewhere (Devlin, 2009). As Figure 3 shows, this is again a layered architecture, but one where the layers are information, process, and people, and all information resides in a single layer.

[Figure 3. The business integrated insight architecture: three layers comprising the personal action domain, business function, and the business information resource.]

As seen in the business directions described earlier, a single, consistent, and integrated set of all information used by the organization—from minute-to-minute operations to strategic decision making—is needed. At its most comprehensive, this comprises every disparate business data store on every computer in the organization, all relevant information on business partners' computers, and all relevant information on the Internet! It includes in-flight transaction data, operational databases, data warehouses and data marts, spreadsheets, e-mail repositories, and content stores of all shapes and sizes inside the business and on the Web.

This article focuses on the business information resource (BIR), the information layer in BI2, to provide an expanded and improved view of that component of Figure 3. The BIR provides a single, logical view of the entire information foundation of the business that aims to significantly reduce the physical tendency to separate and then duplicate data in multiple stores. This BIR is a unified information space with a conceptual structure that allows for reasoned decisions about where to draw boundaries of significant business interest or practical implementation viability. As business changes or technology evolves, the BIR allows boundaries to change in response without reinventing the logical architecture or defining new physical components to simply store alternative representations of the same information.

The structure of the BIR is based on data topography, with a set of three continuously variable axes characterizing the data space. Data topography refers to the type and use of data in a general sense—easy to recognize but often difficult to define. This corresponds to physical topography, where most people can easily recognize a hill or a mountain when they see one, but formal definitions of the difference between them seldom make much sense. Similarly, most business or IT professionals can distinguish between hard and soft information as discussed earlier, but creating definitions of the two and drawing a boundary between them can be problematic. The three axes of data topography, as shown in Figure 4, provide business and IT with a common language to understand information needs and technological possibilities and constraints. Placing data elements or sets along the axes of the data space defines their business usage and directs us to the appropriate technology.

[Figure 4. The axes of the business information resource: timeliness/consistency (in-flight, live, stable, reconciled, historical), knowledge density (atomic, derived, compound, multiplex), and reliance/usage (vague, personal, local, enterprise, global).]
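To make the data topography concrete before the individual axes are described below, here is a minimal illustrative sketch (not from the article) of how data sets might be tagged with coordinates along the three axes; the class name, axis values, and data set names are assumptions for illustration only.

```python
from dataclasses import dataclass

# Broad, descriptive phases along each axis, as named in the article.
TIMELINESS = ["in-flight", "live", "stable", "reconciled", "historical"]
KNOWLEDGE_DENSITY = ["atomic", "derived", "compound", "multiplex"]
RELIANCE_USAGE = ["vague", "personal", "local", "enterprise", "global"]

@dataclass
class DataTopography:
    """Hypothetical tag recording where a data set sits in the BIR data space."""
    timeliness: str          # TC axis: validity period versus consistency
    knowledge_density: str   # KD axis: how much knowledge per data instance
    reliance_usage: str      # RU axis: how much faith can be placed in it

    def __post_init__(self):
        assert self.timeliness in TIMELINESS
        assert self.knowledge_density in KNOWLEDGE_DENSITY
        assert self.reliance_usage in RELIANCE_USAGE

# Example: a single business process touches data spread across the space.
call_center_data = {
    "order_messages_on_esb": DataTopography("in-flight", "compound", "enterprise"),
    "crm_customer_records":  DataTopography("live", "atomic", "enterprise"),
    "complaint_emails":      DataTopography("live", "multiplex", "local"),
    "sales_history_mart":    DataTopography("historical", "derived", "enterprise"),
    "agent_spreadsheet":     DataTopography("stable", "derived", "personal"),
}
```

Tags of this kind could support the reasoned decisions the BIR calls for about where to draw boundaries, for example which stores might be federated and which might need to be physically consolidated.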
The Timeliness/Consistency Axis
The timeliness/consistency (TC) axis defines the time period over which data validly exists and its level of consistency with logically related data. These two factors reside on the same axis because there is a distinct, and often difficult, inverse technical relationship between them. From left to right, timeliness moves from data that is ephemeral to eternal; consistency moves from standalone to consistent, integrated data. When data is very timely (i.e., close to real time), ensuring consistency between related data items can be challenging. As timeliness is relaxed, consistency is more easily ensured. Satisfying a business need for high consistency in near-real-time data can be technically challenging and ultimately very expensive.

Along this axis, in-flight data consists of messages on the wire or the enterprise service bus; data is valid only at the instant it passes by. This data-in-motion might be processed, used, and discarded. However, it is normally recorded somewhere, at which stage it becomes live. Live data has a limited period of validity and is subject to continuous change. It also is not necessarily completely consistent with other live data. That is the characteristic of stable and reconciled data, which are stable over the medium term. In addition to its stability, reconciled data is also internally consistent in meaning and timing. Historical data is where the period of validity and consistency is, in principle, forever.

The TC axis broadly mirrors the lifecycle of data from creation through use to disposal or archival. Within its lifecycle, data traverses the TC axis from left to right, although some individual data items may traverse only part of the axis or may be transformed en route. A financial transaction, for example, starts life in-flight and exists unchanged right across the axis to the historical view. On the other hand, customer information usually appears first in live data, often in inconsistent subsets that are transformed into a single set of reconciled data and further expanded with validity time frame data in the historical stage.

It is vital to note that this axis (like the others) is a continuum. The words in-flight, live, and so on denote broad phases in the continuous progression of timeliness from shorter to longer periods of validity and consistency from less- to more-easily achieved. They are not discrete categories of data. Nor are there five data layers between
Placing data at the left end of the axis emphasizes the need for timeliness; at the right end, consistency is more important. It should be clear that the TC axis is the primary one along which data warehousing has traditionally operated. The current architecture splits data along this axis into discrete layers, assigning separate physical storage to each layer and distributing responsibility for the layers across the organization. Reuniting these layers, at first logically and perhaps eventually physically, is a key aim of BI2. The Knowledge Density Axis The knowledge density (KD) axis shows the amount of knowledge contained in a single data instance and reflects the ease with which meaning can be discerned in information. In principle, this measure could be numerical. For example, a single data item, such as Order Item Quantity, contains a single piece of information, while another data item, such as a Sales Contract, contains multiple pieces of information. In practice, however, counting and agreeing on information elements in more complex data items is difficult and, as with the TC axis, the KD axis is more easily described in terms of general, loosely bounded classes. At the lowest density level is atomic data, containing a single piece of information (or fact) per data item. Atomic data is extensively modeled and is most often structured according to the relational model. It is the most basic and simple form of data, and the most amenable to traditional (numerical) computer processing. The modeling process generates the separate descriptions of the data (the metadata) without which the actual data is meaningless. At the next level of density is derived data, which typically consists of multiple occurrences of atomic data that have been manipulated in some way. Such data may be derived or summarized from atomic data; the latter process may result in data loss. Derived data is usually largely modeled, and the metadata is also separate from the data itself. Compound data is the third broad class on the KD axis and refers to XML and similar data structures, where the descriptive metadata has been included (at least in part) with the data and where the combined data and metadata is stored in more complex or hierarchical structures. These structures may be modeled, but their inherent flexibility allows for less rigorous implementation. Although well suited to SOA and Web services approaches, such looseness can impact internal consistency and cause problems when combining with atomic or derived data. The final class is multiplex data, which includes documents, general content, image, video, and all sorts of binary large object (BLOB) data. In such data, much of the metadata about the meaning of the content is often implicit in the content itself. For example, in an e-mail message, the “To:” and “From:” fields clearly identify recipient and sender, but we need to apply judgment to the content of the fields and even the message itself to decide whether the sender is a person or an automated process. This axis allows us to deal with the concepts of hard and soft information mentioned earlier. The KD axis also relates to the much-abused terms “structured,” “semistructured,” and “unstructured.” Placing information on this axis is increasingly important in modern business as more soft information is used. Given that such data makes up 80 percent or more of all stored data, it makes sense that much useful information can be found here, for example, by text mining and automated modeling tools. 
Just as we have traditionally transformed and moved information along the TC axis in data warehousing, we now face decisions about whether and how to transform and move data along the KD axis. In this case, the direction of movement is likely to be from multiplex to compound, with further refinement into atomic or derived. The challenge is to do so with minimal copying. The Reliance / Usage Axis The final axis, reliance/usage (RU), has been largely ignored in traditional data warehousing, which confines itself to centrally managed and allegedly dependable data. However, the widespread use of personal data, such as spreadsheets, has always been problematic for data management (Eckerson and Sherman, 2008). Similarly, BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 13 BEYOND BI data increasingly arrives from external sources: from trusted business partners all the way to the “world wild west” of the Internet. All this unmanaged and undependable information plays an increasingly important role in running a business. It is becoming clear that centrally managed and certified information is only a fraction of the information resource of any business. The RU axis, therefore, classifies information according to how much faith can be placed in it and the uses to which it can be put. Global and enterprise information is strongly managed, either at an enterprise level or more widely by government, industry, or other regulatory bodies. It adheres to a well-defined and controlled information model, is highly consistent, and may be subject to audit. By definition, reconciled and historical information fall into these classes. Local information is also strongly managed, but only within a departmental or similar scope. Internal operational systems, with their long history of management and auditability, usually contain local or enterprise-class data. Information produced and managed by a single individual is personal and can be relied upon and used only within a very limited scope. A collaborative effort by a group of individuals produces information of higher reliability and wider usage and thus has a higher position on the RU axis. Vague information is the most unreliable and poorly controlled. Internet information is vague, requiring validation and verification before use. Information from other external sources, such as business partners, has varying levels of reliability and usage. The placement of information on this axis and the definition of rules and methods for handling different levels of reliance and usage are topics that are still in their infancy, but they will become increasingly important as the volumes and value of less closely managed and controlled data grow. A Note about Metadata The tendency of prior data warehouse architectures to carve up business information is also evident in their positioning of metadata as a separate layer or pillar. Such separation was always somewhat arbitrary and is no longer reasonable. 14 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 We have probably all encountered debates about whether timestamps, for example, are business data or metadata. This new architecture places metadata firmly and fully in the business information resource for three key reasons. First, as discussed earlier, metadata is actually embedded in the compound and multiplex information classes by definition. Second, metadata is highly valuable and useful to the business. 
This is obvious for business metadata, but even so-called technical metadata is often used by power users and business analysts as they search for innovative ways to combine and use existing data. Third, as SOA exposes business services to users, their metadata will become increasingly important in creating workflows. Integrating metadata into the BIR simply makes life easier for business and IT alike. Metadata, when extracted from business information, resides in the compound data class.

Introducing Data Space Modeling and Implementation
The data topography and data space described above recognize and describe a fact of life for the vast majority of modern business processes: Any particular business process (or, in many cases, a specific task) requires information that is distributed over the data space. A call center, for example, uses live, stable, and historical data along the TC axis; atomic, derived, and multiplex data along the KD axis; and local and enterprise data on the RU axis, as shown in Figure 5.

[Figure 5. Sample data space mapping for the call center process: the call center's data sets plotted along the timeliness/consistency, knowledge density, and reliance/usage axes.]

Although this data space illustration provides a valuable visual representation of the data needs of the process and their inherent complexity, a more formal method of describing the data relationships is required to support practical implementation: data space modeling. Its aim is to create a data model beyond the traditional scope of hard information. Data space modeling includes soft information and describes the data relationships that exist within and across all data elements used by a process or task, irrespective of where they reside in the data space. To do this, I introduce a new modeling construct, the information nugget, and propose that a new, dynamic approach to modeling is needed, especially for soft information. It should be noted that much work remains to bring data space modeling to fruition.

The Information Nugget
An information nugget is the smallest set of related data (wherever it resides in or is distributed through the data space) that is of value to a business user in a particular context. It is the information equivalent of an SOA service, also defined in terms of the smallest piece of business function from a user viewpoint. An information nugget can thus be as small as a single record when dealing with an individual transaction or as large as an array of data sets used by a business process at a particular time. As with SOA services, information nuggets may be composed of smaller nuggets or be part of many larger nuggets. They are thus granular, reusable, modular, composable, and interoperable. They often span traditional information types.

As modeled, an information nugget exists only once in the BIR, although it may be widely dispersed along the three axes. At a physical level, it ideally maps to a single data instantiation, although the usual technology performance and access constraints may require some duplication. However, the purpose of this new modeling concept is to ensure that information, as seen by business users, is uniquely and directly related to its use, while minimizing the level of physical data redundancy. When implemented, the information nugget leads to rational decisions about when and how data should be duplicated and to what extent federation/EII approaches can be used.

Modeling Soft Information
Traditional information modeling approaches focus on (and, indeed, define and create) hard information. It is a relatively small step from such traditional modeling to envision how the relationships between multiple sets of hard information used in a particular task can be represented through simple extensions of existing models to describe information nuggets. The real problem arises with soft information, particularly that represented by the multiplex data class on the KD axis. Such data elements are most often modeled simply as text or object entities at the highest level, with no recognition that more fundamental data elements exist within these high-level entities.

Returning to the call center example, consider the customer complaint information that is vital to interactions between agents and customers. When such information arrives in the form of an e-mail or voicemail message from the customer, we can be sure that within the content exists real, valuable, detailed information including product name, type of defect, failure conditions, where purchased, name of customer, etc. In order to relate such information to other data of interest, we must model the complaint information (multiplex data) at a lower level, internal to the usual text or object class.

Such modeling must recognize and handle two characteristics of soft information. First is the level of uncertainty about the information content and our ability to recognize the data items and values contained therein. For example, "the clutch failed when climbing a hill," and "I lost the clutch going up the St. Gotthard Pass," contain the same information about the conditions of a clutch failure, but may be difficult to recognize immediately. Second, because soft information may contain lower-level information elements in different instances of the same text/object entity, each instance must be individually modeled on the fly as it arrives in the store.

Automated text mining and semantic and structural analysis are key components in soft information modeling given the volumes and variety of information involved. Such tools essentially extract the tacit metadata from multiplex data and store it in a usable form. This enables multiplex data to be used in combination with the simpler atomic, reconciled, and derived classes on the KD axis. By storing this metadata in the BIR and using it as pointers to the actual multiplex data, we can avoid the need to transform, extract, and copy vast quantities of soft information into traditional warehouse data stores. We may also decide to extract certain key elements for performance or referential integrity needs. The important point is that we need to automatically model soft information at a lower level of detail to enable such decisions and to use this information class fully.
Conclusions
This article posed three questions: (1) Do we need a new architecture for data warehousing after 25 years of evolution of business needs and technology? (2) If so, what would such an architecture look like? and (3) What new approaches would we need to implement it? The answers are clear.

1. Business needs and technology have evolved dramatically since the first warehouse architecture. Speed of response, agility in the face of change, and a significantly wider information scope for all aspects of the business demand a new, extensive level of information and process integration beyond any previously attempted. We need a new data warehouse architecture as well as a new enterprise IT architecture of which data warehousing is one key part.

2. Business integrated insight (BI2) is a proposed new architecture that addresses these needs while taking into account current trends in technology. It is an architecture with three layers—information, process, and people. Contrary to the traditional data warehouse approach, all information is placed in a single layer—the business information resource—to emphasize the comprehensive integration of information needed and the aim to eliminate duplication of data.

3. An initial step toward implementing this architecture is to describe and model a new topography of data based on broad types and uses of information. A data space mapped along three axes is proposed and a new modeling concept, the information nugget, introduced. The architecture also requires dynamic, in-flight modeling particularly of soft information to handle the expanded data scope.

Although seemingly of enormous breadth and impact, the BI2 architecture builds directly on current knowledge and technology. Prior work to diligently model and implement a true enterprise data warehouse will contribute greatly to this important next step beyond BI to meet future enterprise needs for complete business insight. ■

References
Devlin, B. [1997]. Data Warehouse: From Architecture to Implementation, Addison-Wesley.

——— [2009]. "Business Integrated Insight (BI2): Reinventing enterprise information management," white paper, September. http://www.9sight.com/bi2_white_paper.pdf

———, and P. T. Murphy [1988]. "An architecture for a business and information system," IBM Systems Journal, Vol. 27, No. 1, p. 60.

Eckerson, Wayne W., and Richard P. Sherman [2008]. Strategies for Managing Spreadmarts: Migrating to a Managed BI Environment, TDWI Best Practices Report, Q1. http://tdwi.org/research/2008/04/strategies-for-managing-spreadmarts-migrating-to-a-managed-bi-environment.aspx

Power, D. J. [2007]. "A Brief History of Decision Support Systems," v 4.0, March 10. http://dssresources.com/history/dsshistory.html

Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game
Sule Balkan and Michael Goul

Sule Balkan is clinical assistant professor at Arizona State University, department of information systems. sule.balkan@asu.edu

Michael Goul is professor and chair at Arizona State University, department of information systems. michael.goul@asu.edu

Abstract
Organizations using predictive modeling will benefit from recent efforts in in-database analytics—especially when they become mainstream, and after the advantages evolve over time as adoption of these analytics grows. This article posits that most benefits will remain under-realized until campaigns apply and adapt these enhancements for improved productivity. Campaign managers and analysts will fashion in-database analytics (in conjunction with their database experts) to support their most important and arduous day-to-day activities. In this article, we review issues related to building and deploying analytics with an eye toward how in-database solutions advance the technology. We conclude with a discussion of how analysts will benefit when they take advantage of the tighter coupling of databases and predictive analytics tool suites, particularly in end-to-end campaign management.
Introduction
Decoupling data management from applications has provided significant advantages, mostly related to data independence. It is therefore surprising that many vendors are more tightly coupling databases and data warehouses with tool suites that support business intelligence (BI) analysts who construct and manage predictive models. These analysts and their teams construct and deploy models for guiding campaigns in areas such as marketing, fraud detection, and credit scoring, where unknown business patterns and/or inefficiencies can be discovered.

"In-database analytics" includes the embedding of predictive modeling functionalities into databases or data warehouses. It differs from "in-memory analytics," which is designed to minimize disk access. In-database analytics focuses on the movement of data between the database or data warehouse and analysts' workbenches. In the simplest form of in-database analytics, the computation of aggregates such as average, variance, and other statistical summaries can be performed by parallel database engines quickly and efficiently—especially in contrast to performing computations inside an analytics tool suite with comparatively slow file management systems. In tightly coupled environments, those aggregates can be passed from the data engine to the predictive modeling tool suite when building analytical models such as statistical regression models, decision trees, and even neural networks.
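As a simple illustration of this contrast (not taken from the article), the sketch below pushes aggregate computation into the database engine so that only summaries, rather than raw rows, travel to the modeling workbench. The table and column names are invented, and the standard-library sqlite3 module stands in for a parallel warehouse engine.

```python
import sqlite3

# Hypothetical customer table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (segment TEXT, annual_spend REAL, returns INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("web", 1200.0, 1), ("web", 300.0, 4), ("store", 750.0, 0), ("store", 90.0, 2)],
)

# In-database analytics: the engine computes the aggregates; only the
# summaries are handed to the modeling tool suite.
summaries = conn.execute(
    """
    SELECT segment,
           COUNT(*)          AS n,
           AVG(annual_spend) AS mean_spend,
           -- population variance via E[X^2] - E[X]^2
           AVG(annual_spend * annual_spend) - AVG(annual_spend) * AVG(annual_spend)
                             AS variance_spend
    FROM customers
    GROUP BY segment
    """
).fetchall()

for segment, n, mean_spend, variance_spend in summaries:
    print(segment, n, round(mean_spend, 1), round(variance_spend, 1))
```

In principle the same pattern applies when the query is issued against a parallel warehouse platform; only the connection changes.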
Predictive models help to scale responses because, for example, scoring models can be constructed to enable the embedding of decision rules into business processes. In-database analytics can streamline elements of the "sense, assess, and respond" cycle beyond those steps or phases in KDD, SEMMA, and CRISP-DM.

This article explains how basic in-database analytics will advance predictive modeling processes. However, we argue that the most important advancements will be discovered when actual campaigns are orchestrated and campaign managers access the new, more tightly coupled predictive modeling tool suites and database/data warehouse engines. We assert that the most important practical contribution of in-database analytics will occur when analysts are under pressure to produce models within time-constrained campaigns, and performances from earlier campaign steps need to be incorporated to inform follow-up campaign steps.

The next section discusses current impediments to predictive analytics and how in-database analytics will attempt to address them. We also discuss the benefits to be realized after more tightly coupled predictive analytics tool suites and databases/data warehouses become widely available. These benefits will be game-changers and will occur in such areas as end-to-end campaign management.

What is Wrong with Current Predictive Analytics Tool Suites?

Current analytics solutions require many steps and take a great deal of time. For analysts who build, maintain, deploy, and track predictive models, the process consists of many distributed processes (distributed among analysts, tool suites, and so on). This section discusses challenges that analysts face when building and deploying predictive models.

Time-Consuming Processes

To build a predictive model, an analyst may have to tap into many different data sources. Data sources must contain known values for target variables in order to be used when constructing a predictive model. All the attributes that might be independent variables in a model may reside in different tables or even different databases. It takes time and effort to collect and synthesize this data. Once all of the needed data is merged, each of the independent variables is evaluated to ascertain the relations, correlations, patterns, and transformations that will be required. However, most of the data is not ready to be analyzed unless it has been appropriately customized. For example, character variables such as gender need to be manipulated, as do numeric variables such as ZIP code. Some continuous variables may need to be converted into scales.

After all of this preparation, the modeling process continues through one of the many methodologies such as KDD, CRISP-DM, or SEMMA. For our purposes in this article, we will use SEMMA (see Figure 1).

Figure 1. SEMMA methodology supported by SAS Enterprise Mining environment: Sample (input data, sampling, data partition); Explore (ranks, plots, variable selection); Modify (transform variables, filter outliers, missing imputation); Model (regression, tree, neural network); Assess (assessment, score, report)

The first step of SEMMA is data sampling and data partitioning. A random sample is drawn from a population to prevent bias in the model that will be developed. Then, a modeling data set is partitioned into training and validation data sets. Next is the Explore phase, where each explanatory variable is evaluated and its associations with other variables are analyzed.
This is a time-consuming step, especially if the problem at hand requires evaluating many independent variables. In the Modify phase, variables are transformed; outliers are identified and filtered; and for those variables that are not fully populated, missing value imputation strategies are determined. Rectifying and consolidating different analysts’ perspectives with respect to the Modify phase can be arduous and confusing. In addition, when applying transformations and inserting missing values in large data sets, a tool suite must apply operations to all observations and then store the resulting transformations within the tool suite’s file management system. Many techniques can be used in the Model phase of SEMMA, such as regression analysis, decision trees, and neural networks. In constructing models, many tool suites suffer from slow file management systems, which can constrain the number and quality of models that an analyst can realistically construct. The last phase of SEMMA is the Assess phase, where all models built in the modeling phase are assessed based on validation results. This process is handled within tool suites, and it takes considerable time and many steps to complete. Multiple Versions and Sources of the Truth Another difficulty in building and maintaining predictive models, especially in terms of campaign management, is the risk that modelers may be basing their analysis on multiple versions and sources of data. That base data is often referred to as the “truth,” and the problem is often referred to as having “multiple versions of the truth.” To complete the time-consuming tasks of building predictive models as just described, each modeler extracts data from a data warehouse into an analytics workstation. This may create a situation where different modelers are working from different sources of truth, as modelers might extract data snapshots at different times (Gray and Watson, 2007). Also, having multiple modelers working on different predictive models can mean that each modeler is analyzing the data and creating different transformations from the same raw data without adopting a standardized method or a naming convention. This makes deploying BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 19 PREDICTIVE MODELING multiple models very difficult, as the same raw data may be transformed in different ways using different naming conventions. It also makes transferring or sharing models across different business areas challenging. Another difficulty relates to the computing resources on each modeler’s workbench when multiple modelers are going through similar, redundant steps of data preparation, transformation, segmentation, scoring, and all the other functions that can take a great deal of disk space and CPU time. The Challenges of Leveraging Unstructured Data and Web Data Mining in Modeling Environments Modelers often tap into readily available raw data in the database or data warehouse. However, unstructured data is rarely used during these phases because handling data in the form of text, e-mail documents, and images is computationally difficult and time consuming. Converting unstructured data into information is costly in a campaign management environment, so it isn’t often done. The challenges of creating reusable and repeatable variables for deployment make using unstructured data even more difficult. Web data mining spiders and crawlers are often used to gather unstructured data. 
Current analyst tool suite processes for unstructured data require that modelers understand archaic processing commands expressed in specialized, non-standard syntax. There are impediments to both gathering and manipulating unstructured data, and there are difficulties in capturing and applying predictive models that deal with unstructured data. For example, clustering models may facilitate identifying rules for detecting what cluster a new document is most closely aligned with. However, exporting that clustering rule from the predictive modeling workbench into a production environment is very difficult. Managing BI Knowledge Worker Training and Standardization of Processes In most organizations, there is a centralized BI group that builds, maintains, and deploys multiple predictive models for different business units. This creates economies of scale, because having a centralized BI group is definitely more 20 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 cost effective than the alternative. However, the economies of scale do not cascade into standardization of processes among analyst teams. Each individual contributor usually ends up with customized versions of codes. Analysts may not be aware of the latest constructs others have advanced. What Basic Changes Will In-Database Analytics Foster? In-database analytics’ major advantage is the efficiencies it brings to predictive model construction processes due to processing speeds made possible by harnessing parallel database/warehouse engine capabilities. Time savings are generated in the completion of computationally intensive modeling tasks. Faster transformations, missing-value imputations, model building, and assessment operations create opportunities by leaving more time available for fine-tuning model portfolios. Thanks to increasing cooperation between database/warehouse experts and predictive modeling practitioners, issues associated with non-standardized metadata may also be addressed. In addition, there is enhanced support for analyses of very large data sets. This couldn’t come at a better time, because data volumes are always growing. In-database analytics make it easier to process and use unstructured data by converting complicated statistical processes into manageable queries. Tapping into unstructured data and creating repeatable and reusable information—and combining this into the model-building process—may aid in constructing much better predictive models. For example, moving clustering rules into the database eliminates the difficulty of exporting these rules to and from tool suites. It also eliminates most temporary data storage difficulties for analyst workbenches. Shared environments created by in-database analytics may bring business units together under common goals. As different business units tap into the same raw data, including all possible versions of transformations and metadata, productivity can be enhanced. When new ways of building models are available, updates can be made in-database. All individual contributors have access to the latest developments, and no single business unit or individual is left behind. 
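As a concrete, hedged illustration of this kind of shared, in-database preparation, the sketch below defines a single view that applies a transformation and a missing-value imputation once, so every modeler and campaign reads the same derived columns. The view, table, and column names are invented for the example, and Python's sqlite3 module again stands in for the warehouse engine.

import sqlite3

conn = sqlite3.connect("warehouse.db")

# One shared definition of the Modify-phase logic, maintained in the database
# instead of being recoded on each analyst's workbench.
conn.execute(
    """
    CREATE VIEW IF NOT EXISTS modeling_base AS
    SELECT customer_id,
           CASE WHEN gender = 'F' THEN 1 ELSE 0 END AS gender_f,            -- recode character variable
           COALESCE(annual_income,
                    (SELECT AVG(annual_income) FROM customers)) AS income_imputed,  -- mean imputation
           CASE WHEN zip_code IS NULL THEN 'UNKNOWN'
                ELSE SUBSTR(zip_code, 1, 3) END AS zip3                      -- coarser geographic scale
    FROM customers
    """
)
conn.commit()

# Every analyst, business unit, and downstream campaign queries the same view.
sample = conn.execute("SELECT * FROM modeling_base LIMIT 5").fetchall()

Because the derived columns are defined once, a change to the transformation or the imputation rule is made in one place and is immediately visible to every team.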
Saving time in the labor-intensive steps of model building, working from a single source of truth, having access to repeatable and reusable structured and unstructured data, and making sure all the business units are working with the same standards and updates—all this makes it easier to transfer knowledge as new analysts join or move across business units. Table 1 summarizes the preliminary benefits of in-database analytics for modelers.

Table 1. Preliminary benefits of in-database analytics
■■ Data set creation and preparation: Reduce cycle time by parallel-processing multiple functions; accurate and timely completion of tasks by functional embedding
■■ Data processing and model building by multiple analysts: Eliminate multiple versions of truth and large data set movements to and from analytical tool suites
■■ Unstructured data management: Broaden analytics capability by streamlining repeatability and reusability
■■ Training and standardization: Create operational and analytical efficiencies; access to latest developments; automatically update metadata

Context for In-Database Analytics Innovation

To drive measurable business results from predictive models, SEMMA (or a similar methodology) is followed by a deployment cycle. That cycle may involve the continued application of models in a (recurring) campaign, refinement when model performance results are used to revise other models, making decisions on whether completely new models are required given model performance, and so on. We distinguish deployment from the SEMMA-supported phase (intelligence) because deployment often engages the broader organization and requires a predictive model (or models) to be put into actual business use. This section introduces a new methodology we created to describe deployment: "DEEPER" (Design, Embed, Empower, Performance-measurement, Evaluate, and Re-target). Figure 2 depicts the iterative relationship between SEMMA and DEEPER.

Figure 2. DEEPER phases guide the deployment, adoption, evaluation, and recalibration of predictive models. The diagram depicts the SEMMA intelligence cycle (Sample, Explore, Modify, Model, Assess) nested within the DEEPER deployment cycle (Design, Embed, Empower, Measure performance, Evaluate, Re-target).

The DEEPER phases delineate, in sequential fashion, the types of activities involved in model deployment with a special emphasis on campaign management. The design phase involves making plans for how to transition a scoring model (or models) from the tool suite (where it was developed) to actual application in a business context. It also involves thinking about how to capture the results of applying the model and storing those results for subsequent analysis. There may also be other data that a campaign manager wishes to capture, such as the time taken before seeing a response from a target. A proper design can eliminate missteps in a campaign. For example, if a targeted catalog mailing is enabled by a scoring model developed using SEMMA, then users must choose which deciles to target first, how to capture the results of the campaign (e.g., actual purchases or requests for new service), and what new data might be appropriate to capture during the campaign.

Once designed, the model must be accurately embedded into business processes. Model score views must be secured; developers must ensure scores appear in user interfaces at the right time; and process managers must be able to insert scores into automated business process logic.
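The following sketch illustrates, under assumed table names, how such design decisions might be recorded directly in the warehouse: scores produced during SEMMA are ranked into deciles, the top deciles are selected for the first mailing, and a response table is created up front so campaign results land next to the scores. It is an illustration of the idea rather than a prescribed implementation, and it assumes a warehouse that supports SQL window functions.

import sqlite3

conn = sqlite3.connect("warehouse.db")

conn.executescript(
    """
    -- Decile assignment over the scored population (hypothetical model_scores table;
    -- NTILE requires window-function support in the engine).
    CREATE TABLE IF NOT EXISTS campaign_targets AS
    SELECT customer_id,
           score,
           NTILE(10) OVER (ORDER BY score DESC) AS decile
    FROM model_scores;

    -- Capture plan decided at design time: responses are stored beside the scores.
    CREATE TABLE IF NOT EXISTS campaign_responses (
        customer_id   INTEGER,
        mailed_at     TEXT,     -- ISO-8601 date the treatment was sent
        responded_at  TEXT,     -- NULL until a response arrives
        purchase_amt  REAL
    );
    """
)

# Per the design, the first mailing targets deciles 1 and 2.
mailing_list = conn.execute(
    "SELECT customer_id FROM campaign_targets WHERE decile <= 2"
).fetchall()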
Embedding a predictive model may require safeguards to exceptions. If there are exceptions to applications of a model, other safeguards need to be considered. Making the results of a predictive model (e.g., a score) available to people and systems is just the first step in ensuring it is used. In the empower phase, employees may need to be trained to interpret model results; they may have to learn to look at data in a certain way using new interfaces; or they may need to learn the benefits of evidence-based management approaches as supported by predictive modeling. Similarly, if people are involved, testing may be required to ensure that training approaches are working as intended. The empower step ensures appropriate behaviors by both systems and people as they pertain to the embedding of the predictive model into business processes. A campaign begins in earnest after the empower phase. Targets receive their model-prescribed treatments, and reactions are collected as planned for in the design phase of DEEPER. This reactions-directed phase, performance measurement, involves ensuring the reactions and events subsequent to a predictive model’s application are captured and stored for later analysis. The results may also be captured and made available in real-time support for campaign managers. Dashboards may be appropriate for monitoring campaign progress, and alerts may support managers in making corrections should a campaign stray from an intended path. If there is an anomaly, or when a campaign has reached a checkpoint, campaign managers take time to evaluate the effectiveness or current progress of the campaign. The objective is to address questions such as: ■■ Are error levels acceptable? ■■ Were campaign results worth the investment in the predictive analytics solution? ■■ How is actual behavior different from predicted behavior for a model or a model decile? 22 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 This is the phase when the campaign’s effectiveness and current progress are assessed. The results of the evaluate phase of DEEPER may lead to a completely new modeling effort. This is depicted in Figure 3 by the gray background arrow leading from evaluate to the sample phase of SEMMA. This implies a transition from deployment back to what we have referred to as intelligence. However, there is not always time to return to the intelligence cycle, and minor alterations to a model might be deemed more appropriate than starting over. The latter decision is most prevalent in time-pressured, recurring campaigns. We refer to this phase as re-target, which requires analysts to take into account new information gathered as part of the performance management deployment phase. It also takes advantage of the plans for how this response information was encoded per the design phase of deployment. The most important consideration involves interpreting results from the campaign and managing non-performing targets. A non-performing target is one that scored high in a predictive model, for example, but that did not respond as predicted. In a recurring campaign, there may be an effort to re-target that subset. There could also be an effort to re-target the campaign to another set of targets, e.g., those initially scored into other deciles. Re-targeting can be a time-consuming process; new data sets with response results need to be made available to predictive modeling tool suites, and findings from tracking need to be incorporated into decisions. 
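A hedged sketch of such a re-target pull appears below. Because scores and captured responses already live in warehouse tables (the hypothetical campaign_targets and campaign_responses tables from the earlier sketch), one query can return the high-scoring targets that have not responded while honoring a resting period, with no snapshot exported back to the modeling workbench.

import sqlite3

conn = sqlite3.connect("warehouse.db")

# Assumes mailed_at/responded_at are ISO-8601 text dates, as in the earlier sketch.
retarget_list = conn.execute(
    """
    SELECT t.customer_id, t.score, t.decile
    FROM campaign_targets t
    LEFT JOIN campaign_responses r
           ON r.customer_id = t.customer_id
          AND r.responded_at IS NOT NULL
    WHERE t.decile <= 2                      -- predicted to respond
      AND r.customer_id IS NULL              -- but no response captured
      AND NOT EXISTS (                       -- resting period: no mailing in the last 30 days
            SELECT 1
            FROM campaign_responses m
            WHERE m.customer_id = t.customer_id
              AND m.mailed_at >= DATE('now', '-30 days'))
    ORDER BY t.score DESC
    """
).fetchall()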
DEEPER provides the context for considering how improvements to in-database analytics can be game-changers. In-database analytics can make significant inroads to DEEPER processes that take time and are under-supported by predictive modeling tool suites. However, these improvements will be driven by analysts who work closely with their organizations’ database experts. This combination of analyst and data management skills, experience, and knowledge will spur innovation significantly beyond current expectations. PREDICTIVE MODELING How Might In-Database Analytics for DEEPER Evolve? Extending in-database analytics to DEEPER processes requires considering how each DEEPER phase might be streamlined given tighter coupling between predictive modeling tool suites and databases/data warehouses. Although many of the advantages of this tighter coupling may be realized differently by different organizations, there are generic value streams to guide efforts. Here the phrase “value stream” refers to process flows central to DEEPER. This section discusses these generic value streams: (1) intelligence-to-plan, (2) plan-to-implementation, (3) implementation-to-use, (4) use-to-results, (5) results-toevaluation, and (6) evaluation-to-decision. In the design phase of DEEPER, planning can be facilitated by examining possible end-user database views that could be augmented with predictive intelligence. Instead of creating new interfaces, it is possible that Web pages equipped with embedded database queries can quickly retrieve and display predictive model scores to decision makers or front-line employees. Many of these displays are already incorporated into business processes, so opportunities to use the tables and queries to supply model results can streamline implementation. When additional data items need to be captured, that data may be captured at the point of sale or other customer touch points. A review of current metadata may speed up the design of a suitable deployment strategy. In addition to “pushing” model intelligence to interfaces, there may also be ways of “pulling” data from the database/warehouse to facilitate re-targeting or for initiating new SEMMA cycles. For example, it may be possible to design queries to automate the retrieval of data items such as target response times from operational data stores. Similarly, it may be possible to use SQL to aggregate the information needed for this type of next-step analysis. For example, total sales to a customer within a specified time period can be aggregated using a query and then used in the re-targeting phase to reflect whether a target performed as predicted. In-database analytics can support the design phase because it eliminates many of the traditional bottlenecks such as complex requirements gathering and the creation of formal specification documents (including use cases). Instead, existing use cases can be reviewed and augmented, and database/warehouse–supported metadata facilities can support the design of schema for capturing new target response data. We refer to this as an intelligence-to-plan value stream for the in-database analytics supported design deployment phase. In the embed phase, transferring scored model results to tables is a first step in considering ways to make use of database/warehouse capabilities to support DEEPER. Once the scores are appropriately stored in tables, there are many opportunities to use queries to embed the scores into people-supported and automated business processes. 
For example, coding to retrieve scores for inclusion in front-line employee interfaces can be done in a manner consistent with other embedded SQL applications. This saves time in training interface developers because it implies that the same personnel who implemented the interfaces can effectively alter them to include new intelligence. There is also no need for additional project governance functions or specialized software. In fact, database/warehouse triggers and alerts can be used to ensure that predictive analytics are used only when model deployment assumptions are relevant. As the database/warehouse is the same place where analytic model results reside, there are numerous implementation advantages. We refer to this as a plan-to-implementation value stream for the in-database analytics supported embed deployment phase.

After implementation, testing will ensure that model results/scores are understandable to decision makers (the empower phase) and that their performance can scale when production systems are at high capacity. Such stress tests can be conducted in a manner similar to database view tests. Because of the inherent speed of database/warehouse systems, their performance will likely exceed separate, isolated workbench performance. Global roll-out can be eased by tried-and-true database/warehouse roll-out processes. We refer to this as an implementation-to-use value stream for the in-database analytics supported empower deployment phase.

Similarly, the use-to-results value stream is that part of a campaign when actions are taken and targets respond. In this performance management phase of deployment, dashboards can be used to track performance, database tables can automatically collect and store ongoing campaign results, queries can aggregate responses over time as part of automating responses, and many other in-database solutions can help to streamline related processes.

This information is central to the evaluate phase, where the results-to-evaluation value stream can enable careful scrutiny of the predictive analytics model portfolio. Queries can be written to compare actual results to those predicted during SEMMA phases. When more than one model has been constructed in the SEMMA processes, all can be re-examined in light of the new information about responses. If-then statements can be embedded in queries to identify target segments that have responded according to business goals, and remaining non-responders can be quickly identified. Such analysis can be done for each analytical model in the portfolio and for each decile of predicted respondents associated with those models. This has been an enormously time-consuming process in the past, but the database/warehouse query engine can conduct this type of post-analysis efficiently.
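For example, a single query of the kind sketched below can compare predicted and actual response rates decile by decile. The predicted_response_rates table and the other names are illustrative assumptions for this sketch rather than part of any specific tool suite, and for simplicity it assumes at most one response row per target.

import sqlite3

conn = sqlite3.connect("warehouse.db")

rows = conn.execute(
    """
    SELECT t.decile,
           p.predicted_rate,
           AVG(CASE WHEN r.responded_at IS NOT NULL THEN 1.0 ELSE 0.0 END) AS actual_rate
    FROM campaign_targets t
    LEFT JOIN campaign_responses r ON r.customer_id = t.customer_id
    JOIN predicted_response_rates p ON p.decile = t.decile
    GROUP BY t.decile, p.predicted_rate
    ORDER BY t.decile
    """
).fetchall()

for decile, predicted, actual in rows:
    flag = "UNDER-PERFORMING" if actual < predicted else "ok"
    print(f"decile {decile}: predicted {predicted:.1%}, actual {actual:.1%} {flag}")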
Queries can also identify subsets of respondents that outperformed the predicted model performance—and those that significantly under-performed. This type of analysis can be quickly supported through queries, and it can provide significant insight for the re-target phase.

Following the results-to-evaluation value stream of the deployment cycle, the evaluation-to-decision value stream focuses on whether a new intelligence cycle (a repeat of SEMMA processes) is required. If performance results indicate major model failures, then a repeat is likely necessary to resurrect and continue a campaign. Even if there weren't major failures, environmental changes such as economic conditions may have rendered models outdated. Data collected in the performance evaluation phase may help to streamline the decision process. If costs aren't being recovered, then it is likely that either the campaign will cease or a new intelligence cycle is necessary.

Often a portfolio of models is created in the initial intelligence cycle. It may be possible to use queries to automate the process of recalculating the prior and anticipated performance of the models in the portfolio. If models exist that were not used but appear to perform better, those models may be used in the next DEEPER cycle. Alternatively, a combination or pooling of models might be most appropriate. Again, automated queries might be able to provide decision support for such pooling options, and they can aid in scheduling the appropriate model for the data sets as the DEEPER cycle progresses. In addition, it may be possible to use queries to apply business rules to manage data sets, and prior results could inform the scheduling of resting periods for targets such that each target isn't inundated with catalog mailings, for example.

Conclusion

Table 2 summarizes key generic value streams that can be supported by in-database analytics and briefly describes the possibilities discussed in this section. Opportunities to evolve in-database analytics are likely to be numerous. In-database analytics create an environment where functions are embedded and processed in parallel, thereby streamlining the steps of both intelligence (e.g., SEMMA) and deployment (e.g., DEEPER) cycles. As data sources are updated, attribute names and formats may change, yet they are sharable. In-database analytics can support quality checks and create warning messages if the range, format, and/or type of data differ from a previous version or model assumptions. If external data has attributes that were not in the data dictionary, metadata can be updated automatically. Data conversions can be handled in-database and only once instead of being repeated by multiple modelers. In-database analytics fosters stability, enhances efficiency, and improves productivity across business units.

In-database analytics will be critical to a company's bottom line when models are deployed and there is time pressure for multiple, successive campaigns where ongoing results can be used to build updated, improved predictive models. Enhancements can be realized in a host of value streams. For example, in-database analytics can significantly reduce cycle times for rebuilding and redeploying updated models to meet campaign deadlines. As multiple models are constructed, in-database analytics will enable managing them as a portfolio.
Table 2. Generic value streams and areas for innovation with in-database analytics
■■ Intelligence-to-plan: Planning is streamlined; push and pull strategies are feasible; schema design can support planning
■■ Plan-to-implementation: Scores maintained in-database; embedded SQL in HTML can facilitate view deployment; triggers and alerts can be used to guard for exceptions
■■ Implementation-to-use: Stress testing and global rollout follow database/warehouse methodologies and rely on common human and physical resources
■■ Use-to-results: Dashboards can be readily adapted; database/warehouse tables can be used as response aggregators
■■ Results-to-evaluation: Re-examine all created models efficiently in light of response information; embed if-then logic to re-target nonresponders
■■ Evaluation-to-decision: Consider applying different models; allow targeted respondents to "rest"; use database to provide decision support for deciding to re-target or re-enter the intelligence cycle

Timely responses, tracking, and fast interpretation of early responders to campaigns will enable companies to fine-tune business rules and react in record time. As the fine line between intelligence and deployment cycles fades because of the fast-paced environment supported by in-database analytics, businesses may move away from the concept of campaign management into trigger-based, "lights-out" processing, where all data feeds are automatically updated and processed, and there is no need to compile data into periodic campaigns. There will be real-time decision making with instant scoring each time there is an update in one of the important independent variables. Analysts will spend their time fine-tuning model performance, building business rules, analyzing early results, monitoring data movements, and optimizing the use of multiple models—instead of dealing with the manual tasks of data preparation, data cleansing, and managing file movements and basic statistical processes that have been moved into the database/warehouse. Although lights-out processing is not on the near-term horizon, the evolution of in-database analytics promises to move organizations in that direction. Once in the hands of analysts and their database/warehouse teams, in-database analytics will be a game-changer. ■

References

Azevedo, Ana, and Manuel Felipe Santos [2008]. "KDD, SEMMA AND CRISP-DM: A Parallel Overview." IADIS European Conference Data Mining, pp. 182–185.

Fayyad, U. M., Gregory Piatetski-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy [1996]. Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press.

Gray, Paul, and Hugh J. Watson [2007]. "What Is New in BI," Business Intelligence Journal, Vol. 12, No. 1.

Houghton, Bob, Omar A. El Sawy, Paul Gray, Craig Donegan, and Ashish Joshi [2004]. "Vigilant Information Systems for Managing Enterprises in Dynamic Supply Chains: Real-Time Dashboards at Western Digital," MIS Quarterly Executive, Vol. 3, No. 1.

Pfeffer, Jeffrey, and Robert I. Sutton [2006]. "Evidence Based Management," Harvard Business Review, January.

BI Case Study

SaaS Helps HR Firm Better Analyze Sales Pipeline

By Linda L. Briggs

When Tom Svec joined Taleo as marketing operations manager, he immediately ran up against what he calls "The Beast," a massive, 100-MB-plus sales and marketing report in Microsoft Excel. Ugly as it was, the monster Excel report, created weekly from Salesforce.
com data, served a critical function in helping with basic sales trend analysis. Each Monday, data imported from Salesforce.com offered snapshots of the previous week’s patterns to provide guidance on upcoming sales opportunities. The information was critical to Taleo’s sales managers. The publicly traded company, with 900 employees and just under $200 million in reported revenue in 2009, provides software-as-a-service (SaaS) solutions for talent management. Its products are designed to help HR departments attract, hire, and retain talent; they range from recruiting and performance management functions to compensation and succession planning tools. Given Taleo’s current needs and projected continuing rapid growth, Svec says he realized that along with the need for more sales visibility—especially for senior managers—the risks of manipulating such critical data in Excel had increased to an unacceptable level. He also needed a tool that could manipulate data and provide information faster than Excel could. “I needed to look for a scalable solution, a reliable solution, and a low-risk solution,” Svec notes. He thus began a search for a BI tool to help manage the sales opportunity data, particularly entry and pipeline metrics, for the demand-generation group as well as for Taleo’s sales organization overall. The tool would need to work with Salesforce.com initially, but eventually might be used with other data as well. For example, Taleo uses a front-end marketing automation and demand-generation platform called Eloqua to execute and measure marketing activity. In time, Svec says, the company may want to import and manipulate Eloqua data directly in its BI solution. As the company’s only marketing operations expert—with lots of overlap with the sales operations team as well—Svec needed a complete lifecycle view of both sales and marketing data. “The demand-generation team and I are very, very focused on everything from the top of the funnel all the way through to close of business,” he explains. That includes involvement in 26 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 BI CASE STUDY the sales pipeline side of things—all of which made “The Beast” a challenge when users needed to glean useful information quickly. The Search for SaaS Embarking on the search for a BI solution, Svec turned to Salesforce. com’s AppExchange, an online marketplace that lists more than 800 cloud-based applications that work with Salesforce.com. From that list, Svec selected and considered vendors including Cloud9 Analytics, LucidEra (which closed in 2009), and other SaaS BI vendors. “Initially, this was supposed to be just a departmental solution” for specific use in managing marketing demand generation and sales pipeline data, Svec says. Rather than a large ERP system or on-premises solution, “we were looking for a solution to solve a specific issue.” With limited technology resources to call on within the company, he wanted a quick, easy implementation that he could accomplish without IT involvement and that could be ramped up quickly while providing rock-solid security. As a SaaS company itself, Taleo was in a good position to understand and appreciate the SaaS concept of on-demand software hosted offsite by the vendor. In that vein, the company eventually selected PivotLink, which offers an on-demand BI solution that includes technologies such as in-memory analytics and columnar data storage. Svec says a key PivotLink feature was its ability to handle data from any source. 
That helped it stand out from the many other solutions he found on AppExchange that seemed geared specifically toward working with Salesforce. With more anticipated growth ahead, both organic and through acquisitions, the company needed something more versatile. "Today our [focus] is Salesforce," Svec says, "but looking down the road 6, 12, however many months, we wanted something built to accommodate other data sources."

During a relatively quick six- to eight-week implementation, Svec worked closely with PivotLink in a collaborative process, pushing them a bit, he says, to integrate more deeply with Salesforce. He was pleased overall with how the integration proceeded, in particular with the vendor relationship: "I think [PivotLink] was discovering new things along the way, particularly some of the historical snapshot requirements we had." The end result: A master set of locked-down sales reports built in PivotLink that sales and marketing managers can use for a detailed view of the demand-generation "funnel" and analysis of the sales pipeline for historical trending.

Looking under the Hood

A key concern during the selection process was the long-term financial viability of any SaaS provider. "We were very cognizant of financial viability," Svec says (and in fact, SaaS company LucidEra closed its doors just weeks after Taleo signed its deal with PivotLink). To avoid potential problems, Taleo examined PivotLink closely, weighing factors including funding history: When was the last round of funding? When is the next round of funding scheduled? Where is the company in the fundraising process? They also considered number of customers and growth rate.

Taleo also conducted an extensive security review. "As a SaaS company, we're very, very serious about security," Svec stresses. Having conducted a SAS-70 compliance process himself as VP of operations at a SaaS company earlier in his career, Svec was highly conscious of what he required in terms of security from any hosted software vendor. The focus was especially sharp because PivotLink is a relatively new company, he says.
Second, he envisions incorporating additional data sources—and this is where PivotLink’s ability to handle disparate data sources will be important—thus giving him 28 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 and his team a more unified view of marketing and demand-generation activity. He wants “to see how we can better leverage our marketing metrics” by bringing in data from other sources beyond Salesforce. Those sources include the company’s marketing automation system, as well as large data sets that are sent to outside companies for cleanup and validation, which then need to be re-imported and manipulated by Svec and his team. The ability to examine a before-and-after view of the data can reveal what changes have been made and how data quality has been increased. Taleo’s finance department is enviously eyeing the PivotLink-produced reports coming from Svec’s group and is thus a candidate for adoption. Although the return on investment from making better decisions is always an elusive measure, Svec says that PivotLink’s pricing model has proven “economical” for a company the size of Taleo. Certainly, issues such as better risk management from avoiding the manipulation of copious amounts of data in an Excel spreadsheet is part of the savings equation. Svec also sees everyday time savings gleaned from simply making it more efficient for users to dig into data. “Those are things that you can’t really measure,” he says. “[PivotLink] is definitely allowing us to measure and see things in different ways [or more efficiently] than we were able to do in the past.” In a nutshell, what the BI tool really does, Svec says, is allow sales and marketing management “to zero in on circumstance”—that is, to identify situations where trending patterns are evident from the data in the sales pipeline. Exposing that data much more clearly in order to find patterns and anomalies allows users to drill down and perform further comparative analysis. Excel, while still put to everyday use throughout the company for one-off data extracts, manipulations, validations, and the like, is no longer the primary analysis tool. “Our reliance on it isn’t as great,” Svec says. “We’ve mitigated risk and improved our scalability by using PivotLink instead.” Linda L. Briggs writes about technology in corporate, education, and government markets.She is based in San Diego. lbriggs@lindabriggs.com AGILE BI Enabling Agile BI with a Compressed Flat Files Architecture William Sunna and Pankaj Agrawal Abstract Dr. William Sunna is a principal consultant with Compact Solutions. william.sunna@compactsolutionsllc.com As data volumes explode and business needs continually change in large organizations, the need for agile business intelligence (BI) becomes crucial. Furthermore, business analysts often need to perform studies on granular data for strategic and tactical decision making such as risk or fraud analysis and pricing analysis. Rapid results in active data warehousing become vital in order for organizations to better manage and make optimal use of their data. All of this triggers the need for new approaches to data warehousing that can support both agility and access to granular data. This article presents a new approach to agile BI: the compressed flat files (CFF) architecture, a file-based analytics solution in which large amounts of core enterprise transactional data are stored in compressed flat files instead of an RDBMS. 
The data is accessed via a metadata-driven, high-performance query engine built using a standard ETL or software tool. Pankaj Agrawal is CTO of Compact Solutions. pankaj.agrawal@compactsolutionsllc.com When compared to traditional solutions, the CFF architecture is substantially faster and less costly to build thanks to its simplicity. It does not use any commercial database management systems; is quick and easy to maintain and update (making it highly agile); and could potentially become the single version of truth in an organization and therefore act as an authoritative data source for downstream applications. Introduction Large enterprises often find themselves unable to use their core data effectively to perform BI. This is mainly due to a lack of agility in their information systems and the delays required to update their data warehouses with new information. As business climates change rapidly, new dimensions, key performance indicators, and derived facts need to be added quickly to the data warehouse so the BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 29 AGILE BI business can stay competitive. In addition, access to historical, low-granularity transaction data is vital for tactical and strategic decision making. Traditional data warehouse solutions that use relational databases and implement complicated models may not be sufficient to satisfy the agility needs of such BI environments. Introducing new data into a warehouse often involves relatively long development and testing cycles. Furthermore, the traditional data warehouse architectures do not adequately cope with many years of transactional data while meeting the performance expectations of end users. Enterprises often settle for summarized data in the warehouse, but this severely compromises their ability to perform advanced analytics that require access to vast amounts of low-level transactional data. With all of these inconveniences, the need for an agile solution that can handle these challenges has become acute. This article presents an innovative architecture that As business climates change rapidly, new dimensions, key performance indicators, and derived facts need to be added quickly to the data warehouse so the business can stay competitive. offers a cost-effective solution to create large transactional repositories to support complex data analytics and has agile development and maintenance phases. In this architecture, the core enterprise data is extracted from operational sources and stored in a denormalized form on a more granular level in compressed flat files. The data is then extracted using a high-performance extraction engine that performs SQL-like queries including selection, filtering, aggregation, and join operations. Power users 30 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 can extract transactional or aggregated data using a simple graphical interface. More casual business users can use a standard OLAP tool to access data from the compressed flat files. The benefits of CFF architecture are manifold. The infrastructure cost of a CFF solution is substantially lower, as RDBMS license costs are eliminated. Storing data in standard compressed flat files can reduce disk storage requirements by an order of magnitude. This not only reduces cost, but also allows an organizationto provide many more years of transactional data to its analysts, allowing for much richer analysis. In addition, the simplified architecture can be built and supported by a much smaller team. 
This article will explain the CFF architecture through a simple case study; discuss the metadata-driven feature of the architecture; and compare the CFF architecture to traditional architecture, with an emphasis on agility. Case Study We will use a simple case study to demonstrate the CFF architecture. Suppose researchers and pricing analysts in a major retail chain want to study the sales trends and profitability of the products sold at their stores located in all 50 states. To support their analyses, they need 10 years of detailed sales transaction data available online. Let’s assume the chain sells more than 30 categories of products such as automotive and hardware. Each category contains a wide range of products. For example, the automotive category contains engine oil, windshield washer fluid, and wiper blades; each of these products has a unique product code. Once a day, all the stores send a flat file containing point-of-sale (POS) transactions to headquarters. In addition to product and geographical information, the transactions also contain other information such as the manufacturer code, sales channel, cost of the product, and sale price. Assume that most users’ analysis is based on the geographic location, product category, and the accounting month in which the products are sold. Let’s refer to such attributes as “major key attributes.” For example, a business analyst may request a profitability report for a selected number of products in a given category in Illinois in the first quarter of 2009. AGILE BI Operational Data Sources Data Data ta ETL Da Data Files Compressed Flat Files High-Performance Query Engine Query Business Analysts Figure 1. Overview of CFF architecture The Compressed Flat Files Architecture The CFF architecture (Figure 1) is best characterized by its simplicity, yet it delivers many invaluable benefits. The architecture is highly metadata-driven to allow flexibility and agility in development and maintenance. The architecture also allows the enterprise to implement a security layer to regulate data access. This section describes how the data is generated, organized, and extracted, along with the CFF metadata-driven characteristics. Data Generation and Organization In this step, the data is extracted from operational data sources using a standard ETL tool. The data can be extracted from legacy systems, operational databases, flat files, or any other available data sources. The extracted data can be on a very granular level, such as POS transactions for a certain retail chain, as described in the case study. For the widest possible use, we recommend storing the data at the most granular level. The data is then cleansed and transformed according to a set of business rules, then partitioned, compressed, and stored in multiple files on hard disk. A set of key performance indicators (KPIs) is also calculated at this stage, again at the most granular level. The way the data is organized and distributed in compressed flat files is a key factor in the success of this architecture. Similar to commercial database partition elimination mechanisms, the compressed flat files should be organized to optimize extraction as much as possible. In other words, the main goal is to read as few files as possible to satisfy any given query. To ensure this is the case, some extraction patterns should be analyzed before finalizing the organization of the files. For our case study, it makes sense to split the data files by their major key attributes. 
For example, we can split the files by product category, state, and accounting month because these three attributes are used in almost all the extractions. If we are storing 10 years of data for 50 states and 30 product categories, then the number of compressed flat files will be 10 years x 12 months x 50 states x 30 categories = 180,000 files. Each compressed file should be named in a way that describes its contents. For example, given a compressed flat file, a user should be able to identify what product categories it contains for what state and what accounting month. If the file names do not describe the major key attributes, then there should be a mapping file to link the file name to its major key attributes.

Data Extraction

The extraction process starts with the end users, who compose their requests in a simple, standard user interface that can be developed using Java or .NET. The user interface should allow users to specify the data attributes they would like to see and what measures or metrics they would like to calculate. In addition, it should provide filters to further refine the data.

Let's take a data extraction request for our case study: A user wishes to perform a profitability analysis for four products (with codes 01, 02, 03, and 04) in the automotive category, which has a code of 01, for the state of Illinois (IL) in the first quarter of 2009. The user interface allows the user to select attributes (category code, state, product code, transaction date, number of items sold, sales amount, and cost amount) and specify the relevant filters, as shown in Figure 2.

Figure 2. Example of a query user interface

The query engine then reads the data in the relevant files and applies additional data filters such as the product code. The next step will be aggregating the measures requested (sales amount and cost) by product code and presenting the results to the analyst. The resulting data sets can be produced in any format, such as comma-separated or SAS-formatted files. Note that the user interface presented here is to be used as a data extraction interface, as opposed to a standard reporting or presentation interface. Standard BI tools such as MicroStrategy and Business Objects are also supported by this architecture.

Once the user submits the request, the query details are passed to the high-performance query engine that is responsible for extracting data directly from the compressed flat files. The query engine will first build a list of the compressed flat files needed for the extraction based on the major key attributes selected. In our example, only one category code has been requested for one state during a three-month period. Therefore, only three compressed flat files out of the 180,000 total files are needed to satisfy the request. This early selection of files represents a huge up-front performance gain in query processing and is one of the major strengths of the CFF architecture. The high-performance query engine can be implemented with any commercial or open source ETL tool, or it can be built using any programming language. If the organization uses such tools and software, then there will be no need to purchase additional licenses for a database management system.
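To make the early file selection concrete, the sketch below shows one way a query engine could map a request's major key attributes to file names and then read only those files, applying the residual product-code filter and aggregation as it goes. The naming convention, directory layout, gzip-compressed CSV format, and column headers are illustrative assumptions for this sketch; the architecture prescribes the approach, not this particular code.

import csv
import gzip

def candidate_files(category: str, state: str, months: list[str]) -> list[str]:
    # Assumed convention: one file per category/state/accounting month,
    # e.g., cff/cat01_IL_200901.csv.gz
    return [f"cff/cat{category}_{state}_{month}.csv.gz" for month in months]

def extract(files: list[str], product_codes: set[str]) -> dict:
    # product_code -> [items_sold, sales_amt, cost_amt]; assumes a header row
    # with columns product_code, items_sold, sales_amt, cost_amt.
    totals = {}
    for path in files:
        with gzip.open(path, mode="rt", newline="") as fh:
            for row in csv.DictReader(fh):
                if row["product_code"] in product_codes:      # residual filter
                    t = totals.setdefault(row["product_code"], [0, 0.0, 0.0])
                    t[0] += int(row["items_sold"])
                    t[1] += float(row["sales_amt"])
                    t[2] += float(row["cost_amt"])
    return totals

# Profitability request from the case study: 3 files opened instead of 180,000.
files = candidate_files("01", "IL", ["200901", "200902", "200903"])
result = extract(files, {"01", "02", "03", "04"})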
Metadata-driven Approach

The CFF architecture is highly metadata-driven to allow for maximum agility in both the initial build of the application and any required maintenance in the future. Due to the simplicity of the data model manifested in the CFF, the data layouts (schema files) of the CFF are leveraged to generate the contents of the user interface via the metadata management module, as shown in Figure 3. Therefore, the addition of new fields or modifications to existing fields are reflected in the user interface unit without requiring any programming effort.

Figure 3. Metadata-driven architecture

The metadata management module also takes into consideration the classification of attributes in the data as specified in the schema files; it distinguishes major key attributes from other dimensional attributes and measures. Furthermore, it provides user privileges information to the interface by consulting the security grid module, which contains privileges and security rules for data access. The user interface builds custom data extraction menus for different users depending on what they are allowed to query or extract.

All user requests are deposited in the requests configuration repository, a standard, secure location that contains the specifics of each request. This allows users to access any requests they submitted in the past, modify them if needed, and resubmit them. The query control process gathers new requests from the request configuration repository and submits them to the high-performance query engine. Queuing of requests, priorities, and other scheduling considerations are implemented in the query control process.

Traditional Architectures, CFF, and Agility

The CFF presents an alternate way to implement complex data analytics solutions with huge gains. Compared to traditional architectures, it is significantly faster to build due to its simplicity. It is far easier to maintain due to its metadata-driven characteristics. Systems based on this architecture can provide very rich information to analysts because very large amounts of highly granular data can be kept online at a fraction of the cost of traditional architectures. In one implementation of this architecture, more than 100 power users at a large insurance company perform complex analytics on 22 years' worth of claims and premium transactions.

Figure 4. Traditional data warehouse architecture

Traditional data warehousing solutions based on relational databases require many layers of data models with corresponding ETL processes, making the architecture very complex, as shown in Figure 4. The traditional data architectures usually require separate models to be built for staged data, conformed data, the operational data store, a data warehouse or data mart, and presentation layers. These models are populated by multiple ETL processes. Because this architecture depends heavily on an RDBMS for storing data, data is often aggregated to provide better performance and manage data growth.
Because of the very large data volumes involved, it is extremely expensive to store many years of transactional data in such data warehouses. Therefore, most such solutions keep a small amount of granular data (say a few months) in base tables, and rely heavily on aggregated data to meet user demands. Such aggregated data is often of limited use for applications such as risk and fraud analysis, price modeling, and other analytics that require a longer historical perspective.

If we compare the CFF solution to a traditional data warehousing solution on basic development and maintenance activities, we can easily recognize the agility gains offered by the CFF architecture. Table 1 compares the CFF architecture with the traditional architecture along some key criteria. During the development phase of the CFF architecture, adding new attributes or deleting/updating existing attributes requires making changes to only one repository and one ETL application, whereas the traditional architecture requires changes in many places. In fact, this simple difference can save substantial time, money, and resources because it eliminates the need to build many sophisticated models, whether dimensional or normalized in a relational database. Since data is stored in only one repository (the set of compressed flat files), only one set of ETL routines needs to be developed, saving time and money. Thus, the architecture is intrinsically agile.

During the maintenance phase, inserting new attributes in the data is easier in CFF because of its metadata-driven nature. Once the CFF layout is modified, the rest of the updates are done automatically all the way to the user interface. In a traditional solution, the new attributes have to be propagated from one process to another and from one data model to another, requiring significant development and testing. Making a small change requires the involvement of data modelers, database administrators, ETL developers, and testers.

Table 1. Comparison of CFF solution and a traditional data warehouse solution
Development phase:
■■ New attribute. Traditional architecture: updates to several layouts, data models, and ETL processes. CFF: updates to only one layout and one ETL process
■■ Delete attribute. Traditional architecture: updates to several layouts, data models, and ETL processes. CFF: updates to only one layout and one ETL process
■■ Update attributes. Traditional architecture: updates to several ETL processes. CFF: updates to only one ETL process
Maintenance phase:
■■ Insert attributes. Traditional architecture: NULL for historical data and layouts; updates to several ETL processes going forward. CFF: easier with metadata automation; updates to only one ETL process
■■ Delete attributes. Traditional architecture: nullify column; updates to several ETL processes going forward. CFF: easier with metadata automation; updates to only one ETL process going forward
■■ Update attributes. Traditional architecture: updates to several ETL processes. CFF: updates to only one ETL process

Summary

According to Forrester Research principal analyst Boris Evelson, the slightest change in a traditional data warehouse solution can trigger massive amounts of work involving changing multiple ETL routines, operational data store attributes, facts, dimensions, major key performance indicators, filters, reports, cubes, and dashboards. Such changes cost time and money. This frustrates IT managers and business users alike. The need for agile data management has, therefore, become acute. Such solutions should not be driven by what tools are available but by smart strategies and architectures.
Summary

According to Forrester Research principal analyst Boris Evelson, the slightest change in a traditional data warehouse solution can trigger massive amounts of work involving changing multiple ETL routines, operational data store attributes, facts, dimensions, major key performance indicators, filters, reports, cubes, and dashboards. Such changes cost time and money. This frustrates IT managers and business users alike. The need for agile data management has, therefore, become acute. Such solutions should not be driven by what tools are available but by smart strategies and architectures.

In response to business needs for agility and lower cost, we have presented a new but proven data management architecture, the compressed flat files architecture. We have demonstrated the simplicity of this architecture and how it can be used to satisfy business needs in an agile environment. We have shown how this architecture is independent of any technologies or tools. We also demonstrated how it allows business users to analyze vast amounts of data at the most granular level without any loss of detail, a feature that would be prohibitively expensive to build using a traditional solution. We compared the CFF architecture with traditional architectures to demonstrate the agility of CFF in multiple activities in the development and maintenance phases. We have shown that the CFF architecture offers important benefits: reduced development time due to simplicity and metadata-driven architecture; reduced cost from eliminating the need to use a relational database management system; and the ability to store much larger amounts of data on smaller storage devices.

A solution based on the CFF architecture has already proved its value at a large corporation where it handles more than 50 TB of raw historical transactional data. Furthermore, the CFF architecture has been recognized by data warehousing experts such as Bill Inmon and industry analysts such as Forrester as an important evolutionary step in data management and BI. Today's BI challenges require non-traditional solutions to rein in the cost and complexity of managing data, as well as more agile responses to business changes. The CFF architecture meets these requirements. ■

BI Experts' Perspective: Pervasive BI
Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom

Jonathan G. Geiger, CBIP, is an executive vice president with Intelligent Solutions, Inc. He presents frequently at national and international conferences, has written more than 30 articles, and is a co-author of three books: Data Stores, Data Warehousing and the Zachman Framework: Managing Enterprise Knowledge; Building the Customer-Centric Enterprise; and Mastering Data Warehouse Design. jggeiger@earthlink.net

Arkady Maydanchik is a recognized practitioner, author, and educator in the field of data quality and information integration. Arkady's data quality methodology and breakthrough ARKISTRA technology were used to provide services to numerous organizations. He is co-author of Data Quality Assessment for Practitioners. arkadym@dataqualitygroup.com

Philip Russom is the senior manager of TDWI Research at The Data Warehousing Institute (TDWI), where he oversees many of TDWI's research-oriented publications, services, and events. Before joining TDWI in 2005, Russom was an industry analyst covering business intelligence (BI) at Forrester Research, Giga Information Group, and Hurwitz Group. prussom@tdwi.org

Kelsey Graham has recently taken over as business intelligence (BI) director at Omega, a manufacturer of office products. She inherits a BI staff that has been in place for four years and boasts many accomplishments, including an enterprise data warehouse, performance dashboards, forecasting models, and pricing models. There are eight BI professionals on staff; they perform roles and tasks that vary: planning the BI architecture, developing and maintaining the warehouse, and developing enterprisewide applications.

One of Kelsey's charges is to make BI more pervasive. Senior management wants decision support data, tools, and applications available to more employees and trading partners along the supply chain. Although Kelsey is on board with this initiative, she is concerned about the quality of both the data in the warehouse and the metadata.

Her predecessor didn't make much progress in working with some of the business units to correct the data quality problems originating in the source systems, and there is limited metadata that informs users about the quality of the data they are accessing. Kelsey knows that as BI becomes more pervasive, these data quality issues will demand more attention.
She needs to think through what actions to take.

1. How should Kelsey start a dialogue with senior management about correcting the data quality problems in the source systems? Her sense is that she needs senior management's help to get the business units to allocate the necessary resources to address the problems.

2. What metadata about data quality does Kelsey need to provide to users? Should she use categorical indicators such as "excellent, good, fair, or poor," or specific numerical indicators such as "90 percent accurate"?

3. Should the indicators of quality be placed at the warehouse or the application level? Kelsey knows that data quality is related to the data's intended use, but providing data quality metrics at the application level would be much more labor intensive for her staff.

Jonathan G. Geiger

Kelsey is dealing with a BI program that is perceived to be sufficiently successful to be widely adopted but that has some gaps under the covers. In addition, her team seems to be oblivious to the data quality issues. Fortunately, she recognizes that she needs to address the deficiencies before providing wider access to the data. She needs to address the team's attitude, get a realistic assessment of the situation, gain senior management support, and provide information on the actual data quality.

Team Attitude

Kelsey's team is proud of its accomplishments, and probably with good reason. They have, after all, implemented a data warehouse that provides data to its intended audience, and this information is used to provide benefits to the organization. If Kelsey is to address her data quality concerns, she must first discuss these concerns with her team.

Kelsey should speak with her team, individually and collectively, to discuss the strengths and risks of the existing environment. If there is any merit to her suspicions, at least some of the team members will mention concerns about data quality. Being careful to give the team credit for its accomplishments, as the manager, Kelsey is in a position to determine which deficiencies need to be addressed first. If she feels that the most significant issue to be addressed prior to widespread implementation is data quality, she should communicate this to the team and gain its understanding and support. (There may be other high-priority issues, but they are outside the scope of this article.)

Data Quality Assessment

Kelsey is not in a position to start a dialogue with senior management until she can substantiate her concerns about the data's quality. If she were to simply approach management with her concerns, she would probably be perceived as a naysayer and would lose the support of both senior management and her proud team. By the same token, she does not have the luxury of time to conduct a full data profiling effort.
Once she has enlisted the team's understanding and support, Kelsey should solicit input from the team about the areas in which they are most concerned about data quality. The team should then conduct some quick analysis to identify specific examples and possible root causes. The root causes are likely to include aspects of both business processes and operational systems. Kelsey should accumulate this information and project how these deficiencies might impact the quality of the decisions people make if they are using the poor-quality data.

Kelsey should develop a realistic plan for providing a pervasive BI environment (one that includes addressing the major issues such as data quality). This places her in a position of presenting management with a solution that addresses its objectives.

Management Support and Commitment

Kelsey is now prepared to have a dialogue with senior management. She should structure her discussion as a plan for meeting the goal of having a more pervasive BI environment. Within that plan, she needs to point out the need to address data quality deficiencies and the business involvement that will be needed to make it happen. It's probably premature to establish a formal data stewardship program, but her presentation should lay the foundation for the subsequent introduction of such a program. Key business roles to be discussed include setting the quality expectations, ensuring that the business processes support the desired levels, and ultimate responsibility for the data quality. In addition to describing the business roles, Kelsey should discuss how the data quality can be measured and reported.

Data Quality Metrics and Metadata

There are three basic sets of data quality metrics that should be developed. These involve:

■■ The data quality of the source systems: This implicitly involves the business processes. The information should be initially collected during the data profiling (if conducted), and then through the ETL process on an ongoing basis. It will be useful when deficiencies identified by the third set of metrics must be addressed.

■■ The audit and control metadata: This is the measurement of the quality of the ETL process, and it should confirm that no errors were introduced during that process. It is of primary interest to the BI team, as it must address any deficiencies.

■■ The business-facing set of metrics: These are measures of the quality of the delivered data. This is where information needs to be provided to the business community so it can determine if the data is good enough for its intended purpose. The metrics should yield an indication of how well each relevant quality expectation is being met. (Supporting, lower-level measures could also be available to guide preventive and corrective actions.)

Kelsey correctly recognizes that if the business users don't trust the data, the program will ultimately fail. By addressing her data quality concerns head on, she will be better positioned to ensure the program's success.
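As a rough illustration of the business-facing metric set, the sketch below computes a completeness score for one delivered column and maps it onto the kind of categorical indicator Kelsey is weighing against numerical ones. The thresholds, column names, and sample rows are invented for illustration only.

def completeness(rows, column):
    """Share of rows whose value for the column is populated."""
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows) if rows else 0.0

def label(score):
    """Map a numeric score onto a categorical indicator (illustrative cutoffs)."""
    if score >= 0.98:
        return "excellent"
    if score >= 0.95:
        return "good"
    if score >= 0.90:
        return "fair"
    return "poor"

if __name__ == "__main__":
    delivered = [
        {"customer_id": "C1", "region": "NE"},
        {"customer_id": "C2", "region": ""},
        {"customer_id": "C3", "region": "SW"},
    ]
    score = completeness(delivered, "region")
    print(f"region completeness: {score:.0%} ({label(score)})")

Publishing both the numeric score and the category lets each audience choose the level of detail it can act on.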
Arkady Maydanchik

Data quality in data warehousing and BI is a common problem because the data comes to data warehouses from numerous source systems and through numerous interfaces. Existing source data problems migrate to the data warehouse and mutate along the way. New problems are inevitably created in ETL processes because of inconsistencies between the data in various source applications. As a result, data quality in the data warehouse is often the lowest among all databases.

In theory, given a known data problem, the best course of action would be for Kelsey to perform a root cause analysis and fix the problems at the source. This way she does not just reactively cleanse individual erroneous data elements, but rather proactively prevents all future problems of the same kind before they occur. Regardless of the ideal, however, it is not practical to expect that data quality will be ensured at the source. There are several reasons for this.

First, in most organizations, comprehensive data quality management is a distant dream. Many source systems lack adequate controls. Source system stewards and data owners often do not know that their data is bad, or at least do not have any specific knowledge of which data are bad and what impact the data quality problem has on the business.

Second, data warehouses obtain data from multiple source systems. Oftentimes, the data coming from each source seems consistent and accurate when examined independently from the other sources. It is only when data from multiple sources is put together that the inconsistencies and inaccuracies can be discovered.

Finally, lack of data quality controls is sometimes a conscious financial decision. Data quality management is not free! Thus, it is often decided that existing data quality is adequate for the purposes for which the data is used within the source system and investing in data quality improvement is not worth the investment. Of course, such calculations typically ignore the impact of poor source data quality on downstream systems such as data warehouses.

To attack the problem, Kelsey must start by assessing data quality in the data warehouse. A systemic data quality assessment project can be executed with limited resources and in a short time period. Data quality assessment produces a detailed data quality metadata warehouse that shows all identified individual data errors, as well as a data quality scorecard that allows for aggregating the results and estimating the financial impact of the bad data on various data uses and business processes. One important category of aggregate scores is by data source. These scores indicate where the bad data came from. Another important category incorporates the time dimension, showing the trends in data quality overall and by the data source.

Armed with this information, Kelsey can go to the source data stewards and discuss the financial implications of their data quality problems. Hopefully, understanding the downstream implications of bad data would allow for an adequate argument for data quality management at the source. Also, such findings may give the source data stewards a glimpse into their own data quality and thus a better understanding of what it may cost them directly.

The next step is setting up data quality monitoring solutions for the data interfaces through which source data comes to the data warehouse. This is necessary even if the source data systems have adequate data quality controls in place. The reality is that it is simply impossible to completely ensure data quality at the source and guarantee that all data coming via interfaces to downstream systems is accurate. Monitoring data quality in each interface is a necessary part of any data integration solution.
There are different types of data quality monitors. Error monitors look for individual erroneous data elements. Change monitors look for unexpected changes in data structure and meaning. Of course, monitoring data quality in data interfaces is not free. Advanced monitors require greater investment of time and money. The desired level of data quality monitoring in the interfaces is a financial decision and requires analysis of the ramifications of bad data.

The final question is how much Kelsey wants to expose data quality metadata to the data users. There is no right answer to this question. A good guideline is that any information must be actionable. Providing too much detail is of little value to someone who cannot act upon it. Another factor to consider is that in the data warehouse that gets large volumes of data from numerous sources at breakneck speed, it may be impossible to ensure that the data quality metadata are always current. In that case, providing detailed information to the users may be counterproductive, as it may lead to a perception that the information about data quality is absolutely accurate and up-to-date. In any case, this is the decision that can be made and changed many times as Kelsey's data quality management program matures.

Once she sets up the processes for data quality assessment at the data warehouse, data quality monitoring for the interfaces, and root-cause analysis and data quality management at the source, she has all the ingredients to fine-tune data quality reporting to the individual needs of the users.

Philip Russom

I envy Kelsey. Then again, I don't.

Kelsey's position is strong because the BI team has an impressive track record of producing a wide range of successful BI solutions. More strength comes from senior management's direct support of an expansion of BI solutions to more employees and partnering companies. Kelsey and team have useful and exciting work ahead, backed up by an executive mandate. I envy them shamelessly.

That's the good news. Here comes the bad. Kelsey is contemplating crossing the line by sticking her nose into another team's business so she can tell them their hard work isn't good enough. Not only is this a tall hurdle, but Kelsey will face fearsome opposition on the other side. Her chances of success are slim if she goes it alone. Frankly, I don't envy this part of her job.

You see, fixing data quality problems isn't really a technology problem—at least, not in the beginning. Getting a data quality program started and organized is 90 percent organizational dynamics. That's a euphemism for turf wars, office politics, and "my IT system ain't broke so it don't need fixin'." You have to work through these barriers and build a big foundation for your data quality program. Way down the road, you eventually get to fix something. I'm exaggerating for dramatic effect, but you get the point.

Kelsey cannot—and should not—lead the campaign for data quality. After all, it's not her job, and—to emphasize my point—she'd probably drown in the torrent of organizational dynamics anyway. Instead, the march into a data quality program should be led in a way that defuses most organizational dynamics.
Essentially, any initiative that involves coordination and change across multiple teams and business units (as does data quality) will need a strong executive sponsor who’s placed high enough in the organizational chart to be impossible to ignore. The sponsor needs to carry a big stick and speak softly. The stick is a firmly stated executive mandate, the kind that limits your career should you fail to deliver on it. To avoid insurgencies, however, there 40 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 must be soft speaking that clearly defines goals for data’s quality and how improving data will improve the business for everyone. The sponsor needs to parachute in unannounced and repeat this pep talk occasionally. Furthermore, the soft speaking needs to avoid blame. If getting a data quality campaign started depends on a unilateral pardon of all data-related crimes, then so be it. By this point, you’re probably sick of hearing about organizational dynamics relative to data quality, but there’s more. An executive mandate forms a required foundation, but you (and Kelsey) still have to build a team or organizational structure on top of it. This is an immutable truth, not just my assertion. The fact that data quality work is almost always identified and approved via a data stewardship program corroborates my assertion. In recent years, data stewardship has evolved into (or been swallowed by) data governance. Kelsey needs to pick one of these team types, based on Omega’s corporate culture and pre-existing organizational structures. Next, she’ll need staffing that’s appropriate to the team type. For example, data governance is often overseen by a committee that’s populated part-time by people who have day jobs in (or influenced by) data management. Finally, the team must institute a process for proposing changes. Yes, effective data quality improvements come down to a credible, non-ignorable process for change orders. Why didn’t I just jump to this conclusion earlier, and save us all a lot of time? It’s because the change management process only works when built atop a strong foundation. The foundation is required because the changes that are typical of data quality improvements reach across multiple lines of business, plus their managers, technical staff, application users, and others. As Kelsey will soon discover, that’s quite a number of people, technologies, and businesses to coordinate. She’s right to start with a conversation with senior management, not the owners of offending applications. She’s also right not to go it alone. Kelsey has a firm conviction that data quality is a critical success factor for BI. With any luck, she’ll convince the right business sponsor, who’ll start building the big foundation that cross-business-unit data quality solutions demand. ■ SENTIMENT ANALYSIS BI and Sentiment Analysis Mukund Deshpande and Avik Sarkar Overview Dr. Mukund Deshpande is senior architect at the business intelligence competency center of Persistent Systems. He has helped enterprises, e-commerce companies, and ISVs make better business decisions for the past 10 years by using machine learning and data mining techniques. mukund_deshpande@persistent.co.in Dr. Avik Sarkar is technical lead at the analytics competency center at Persistent Systems and has over nine years of experience using analytics, data mining, and statistical modeling techniques across different industry vertical markets. 
avik_sarkar@persistent.co.in

Over the past two decades, there has been explosive growth in the volume of information and articles published on the Internet. With this enormous increase in online content came the challenge of quickly finding specific information. Google, AltaVista, MSN, Yahoo, and other search sites stepped in and developed novel technologies to efficiently search and harness the massive amount of Internet information. Some search engines indexed keywords; others used information hierarchies, arranging Web pages in a structured way for easy browsing and for quickly locating requested information. Text classification, also known as text categorization, and text-clustering-based techniques advanced, allowing Web pages to be automatically organized into relevant hierarchies.

Web sites frequently discuss consumer products or services—from movies and restaurants to hotels and politics. These shared opinions, termed the "voice of the customer," have become highly valuable to businesses and organizations large and small. In fact, a recent study by Deloitte found that "82 percent of purchase decisions have been directly influenced by reviews." The rapid spread of information over the Internet and the heightened impact of the media have broken down physical and geographical boundaries and caused organizations to become increasingly cautious about their reputations.

Businesses and market research firms have carried out traditional sentiment analysis (also referred to as opinion analysis or reputation analysis) for some time, but it requires significant resources (travel to a given location; staffing the survey process; offering survey respondents incentives; and collecting, aggregating, and analyzing results). Such analysis is cumbersome, time-consuming, and costly.

Automated sentiment analysis based on text mining techniques offers a simpler, more cost-effective solution by providing timely and focused analysis of huge, ever-increasing volumes of content. The concept of automated sentiment analysis is gaining prominence as companies seek to provide better products and services to capture market share and increase revenues, especially in a challenging global economy. Understanding market trends and "buzz" enables enterprises to better target their campaigns and determine the degree to which sentiment is positive, negative, or neutral for a given market segment.

Text Mining

Research and business communities are using text mining to harness large amounts of unstructured textual information and transform it into structured information. Text mining refers to a collection of techniques and algorithms from multiple domains, such as data mining, artificial intelligence, natural language processing (NLP), machine learning, statistics, linguistics, and computational linguistics. The objective of text mining is to put the already accumulated data to better use and enhance an organization's profitability. With a variety of customer trends and behavior and increasing competition in each market segment, the better the quality of the intelligence, the better the chances of increasing profitability.

The major text mining techniques include:

■■ Text clustering: The automated grouping of textual documents based on their similarity—for example, clustering documents in an enterprise to understand its broad areas of focus

■■ Text classification or categorization: The automated assignment of documents into some specific topics or categories—for example, assigning topics such as politics, sports, or business to an incoming stream of news articles

■■ Entity extraction: The automated tagging or extraction of entities from text—for example, extracting names of people, organizations, or locations

■■ Document summarization: An automated technique for deriving a short summary of a longer text document

Sentiment analysis applies these techniques to assign sentiment or opinion information to certain entities within text.
Sentiment evaluation is another step in the process of converting unstructured content into structured content so that data can be tracked and analyzed to identify trends and patterns.

Sentiment Analysis

Sentiment analysis broadly refers to the identification and assessment of opinions, emotions, and evaluations, which, for the purposes of computation, might be defined as written expressions of subjective mental states. For example, consider this unstructured English sentence in the context of a digital camera review:

Canon PowerShot A540 had good aperture combined with excellent resolution.

Consider how sentiment analysis breaks down the information. First, the entities of interest are extracted from the sentence:

■■ Digital camera model: Canon PowerShot A540

■■ Camera dimensions or features: aperture, resolution

Sentiments are further extracted and associated for each entity, as follows:

■■ Digital camera model = Canon PowerShot A540; Dimension = aperture, Sentiment = good (positive)

■■ Digital camera model = Canon PowerShot A540; Dimension = resolution, Sentiment = excellent (positive)

Based on the individual sentence-level sentiments, aggregated and summarized sentiment about the digital camera is obtained and stored in the database for reporting purposes.

[Figure 1. Sentiment analysis steps: fetch/crawl and cleanse, text classification, entity extraction, sentiment extraction, sentiment summary, and reports/charts.]

The following sections delve into the technical details and algorithms used for this type of sentiment analysis.

Sentiment Analysis Steps

Suppose we are interested in deriving the sentiment or opinion of various digital cameras across dimensions such as price, usability, and features. Figure 1 illustrates the steps we will follow in this analysis.

Step 1: Fetch, Crawl, and Cleanse

Comments about digital cameras might be available on gadget review sites or in discussion forums about digital cameras, as well as in specialized blogs. Data from all of these sources needs to be collected to give a holistic view of all the ongoing discussions about digital cameras. Web crawlers—simple applications that grab the content of a Web page and store it on a local disk—fetch data from the targeted sites. The downloaded Web pages are in HTML format, so they need to be cleansed to retain only the textual content and remove the HTML tags used for rendering the page on the Web site.
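A minimal sketch of this fetch-and-cleanse step, using only the Python standard library; a production crawler would also handle robots.txt, throttling, retries, and character encodings, and the URL below is a placeholder.

import re
import urllib.request

def fetch(url):
    """Grab the raw HTML of a page and return it as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def cleanse(html):
    """Strip script/style blocks and remaining tags, keeping the visible text."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    page = fetch("http://example.com/")   # placeholder URL
    print(cleanse(page)[:200])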
Step 2: Text Classification

The sites from which data is fetched might contain extra information and discussions about other electronic gadgets, but our current interest is limited to digital cameras. A text classifier determines whether the page or discussions on it are related to digital cameras; based on the decision of the classifier, the page is either retained for further analysis or discarded from the system.

The text classifier is provided with a list of relevant (positive) and irrelevant (negative) words. This list consists of a base list of words supplied by the software provider, which is typically enhanced by the user (the enterprise) to make it relevant to the particular domain. A simple rule-based classifier determines the polarity of the page based on the proportion of positive or negative words it contains. You can "train" complex and robust classifiers by feeding them samples of positive and negative pages. These samples allow you to build probabilistic models based on machine-learning principles. Then, these models are applied on unknown pages to determine the pages' relevance.

Commercial forums, blog aggregation services, and search engines (such as BoardReader and Moreover) have become popular recently, eliminating the need to build in-house text classifiers. You can use these services to specify keywords or a taxonomy of interest (in this case, digital camera models), and they will fetch the matching forums or blog articles.

Step 3: Entity Extraction

Entity extraction involves extracting the entities from the articles or discussions. In this example, the most important entity is the name or model of the digital camera—if the name is incorrectly extracted, the entire sentiment or opinion analysis becomes irrelevant. There are three major approaches for entity extraction:

■■ Dictionary or taxonomy: A dictionary or taxonomy of available and known models of digital cameras is provided to the system. Whenever the system finds a name in the article, it tags it as a digital camera entity. This technique, though simple to set up, needs frequent updates on every subsequent model launch, so it's not robust.

■■ Rules: A digital camera model name has a certain pattern, such as Canon PowerShot A540. Therefore, a rule may be written to tag any alphanumeric token following the string "Canon PowerShot" as a digital camera model. Such techniques are more robust than the dictionary-based method, but if Canon decides to launch a new model, say the SuperShot, such rules must be updated manually.

■■ Machine learning: This approach learns the extraction rules automatically based on a sample of articles with the entities properly tagged. The rules are learned by forming graphical and probabilistic models of the entities and the arrangement of other terms adjoining them. Popular machine learning models for entity extraction are based on hidden Markov models (HMM) and conditional random fields (CRF).

Step 4: Sentiment Extraction

Sentiment extraction involves spotting sentiment words within a particular sentence. This is typically achieved using a dictionary of sentiment terms and their semantic orientations. There are obvious limitations to the dictionary-based approach. For example, the sentiment word "high" in the context of "price" might have a negative polarity, whereas "high" in the context of "camera resolution" will be of positive polarity. (Approaches to dealing with varying and domain-specific sentiment words and their semantic orientation are discussed in the next section.) Once an entity of interest (for example, the digital camera model or sentiment word) is identified, structured sentiment is extracted from the sentence in the form of {model name, score}, where score is the positive or negative polarity value of the identified sentiment word in the sentence.
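The rule-based extraction and dictionary lookup just described might be sketched as follows; the dimension matching and negation handling discussed next are omitted, and the word list and pattern are illustrative only.

import re

# Illustrative sentiment dictionary with polarity scores.
SENTIMENT = {"good": 1, "excellent": 2, "poor": -1, "terrible": -2}

# A rule of the kind described above: tag the token following "Canon PowerShot".
MODEL_RULE = re.compile(r"\bCanon PowerShot [A-Z]?\d+\b")

def extract(sentence):
    """Return {model name, score} pairs for one sentence."""
    models = MODEL_RULE.findall(sentence)
    score = sum(SENTIMENT.get(tok.lower(), 0)
                for tok in re.findall(r"[A-Za-z]+", sentence))
    return [{"model": m, "score": score} for m in models]

if __name__ == "__main__":
    print(extract("Canon PowerShot A540 had good aperture combined with excellent resolution."))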
If some dimension (such as “price” or “resolution”) is also found in the sentence, then the sentiment is extracted in the form of {model name, dimension, score}. We may also choose to report the source name or source ID to associate the extracted sentiment back to that source. 44 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 The presence of negation words, such as “not,” “no,” “didn’t,” and “never,” require special attention. These keywords lead to a transformation in the polarity value of the sentiment words and hence in their reported score. Natural-language techniques are used to detect the effect of the negation word on the adjoining sentiment word. If the negation effect is detected, then the polarity of the sentiment word is inverted. The extracted sentiment data is now in a structured format that can be loaded into relational databases for further transformation and reporting. Step 5: Sentiment Summary The raw sentiments extracted in Step 4 come from individual sentences that are specific to certain entities. To make the data meaningful for reporting, it must be aggregated. One of the obvious aggregations in the context of digital cameras will be model-name-based aggregation—in this case, all of the positive, negative, or neutral entries in the database are grouped together. Again, model- and dimension-based sentiment aggregation would allow the discovery of detailed, dimension-wise sentiment distribution for every model. Based on the reporting needs, different levels of aggregation and summarization need to be carried out and stored in a database or data warehouse. Step 6: Reports/Charts Reports and charts can be generated directly from the database or data warehouse where the aggregated data is stored in a structured format. Such reporting falls under the purview of traditional BI and reporting, and is not related to the core sentiment analysis steps. The steps described above have been used to transform the unstructured textual data in blogs and forums to structured, quantifiable numeric sentiment data related to the entity of interest. Sentiment Analysis Challenges There are challenges in sentiment analysis, but fortunately some simple tactics can help you overcome them. The challenges discussed in this section are related to sentiment assignment, co-reference resolution, and assigning domainspecific polarity values to sentiment words. SENTIMENT ANALYSIS Sentiment Assignment Suppose a sentence mentions digital camera features such as resolution, usage, and megapixels; the sentence also mentions a sentiment word, say, “good.” Should we relate all or only some of the features to the sentiment word? The issue becomes even more challenging when multiple sentiment words or model names are mentioned in the same sentence. Limited accuracy can be achieved by using simple heuristics, such as assigning the model name or feature to the nearest occurring sentiment word (this yields acceptable accuracy). Deep NLP techniques may be used to identify the model names or features (nouns) that are related to the sentiment word (adjective or adverb) in the context of that sentence. Reviews often include comparative comments about multiple digital camera models within single sentences. 
For example: ■■ “Kodak V570 is better than the Canon Power-Shot A460.” ■■ “Kodak V570 scores more points than Canon PowerShot A460 in terms of resolution.” ■■ “In comparing the Kodak V570 and Canon PowerShot A460, the latter wins in terms of resolution.” ■■ “Nikon D200 is good in terms of resolution, while Kodak V570 and Canon PowerShot A460 have better usability.” Dealing with such comparative sentences requires building complex natural-language rules to understand the impact and span of every word. For example, the word “better” would signal a positive sentiment extraction for one camera model or feature and negative sentiment data for another. Co-reference Resolution Suppose a discussion about a digital camera mentions the model in the beginning of the article, but subsequent references use pronouns such as “it” or phrases such as “the camera.” Referring to a proper noun by using a pronoun is called co-reference. Co-reference is a common feature of the English language. Ignoring sentences that use it will lead to a loss in data and incorrect reporting. Co-reference resolution, also referred to as anaphora resolution, is a vast area of research in the NLP and computational linguistics communities. It is achieved using rule-based methods or machine-learning-based techniques. Open source co-reference resolution systems such as GATE (General Architecture for Text Engineering) provide the accuracy required for sentiment analysis. Domain-specific polarity values and sentiment words As discussed earlier, sentiment words have different interpretations in different contexts. For example, “long” in the context of movies might convey a negative sentiment, whereas in sports it would indicate positive polarity. Similarly, “unpredictable” might convey positive sentiment for movies, but would indicate negative polarity when used to describe digital cameras or mobile phones. This problem can be tackled by using a domain-specific sentiment word list. Such a list is created by analyzing all the adjectives, adverbs, and phrases in the domain-specific document collection. The analysis calculates the proximity of these words to generic positive words such as “good” and generic negative words such as “bad.” Another calculation is called point-wise mutual information, which provides a measure of whether two terms are related and hence jointly occurring, rather than showing up together by chance. These calculations can be performed for the word across all documents to determine whether a word occurs more often in the positive sense than in the negative sense. These techniques work well if a certain sentiment word has a fixed polarity interpretation within a certain domain. Now, suppose we have the sentiment word “high,” which in the digital camera domain could indicate negative sentiment for “price” but positive sentiment for “camera resolution.” Such cases are a bit more difficult to handle and can often lead to errors in sentiment analysis. To tackle such scenarios, the system has to store some mapping of entity, the sentiment word, and its associated polarity—for example, {high, price, –ve} and {high, resolution, +ve}. Creating and verifying such mappings involves considerable manual work on top of automated techniques. BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 45 SENTIMENT ANALYSIS Examples Sentiment Analysis of Digital Camera Reviews There are many Web sites that contain reviews related to digital cameras. 
Suppose a consumer is looking to buy a particular digital camera and would like to get a complete understanding of the camera's different features, strengths, and weaknesses. She would then compare this information to other contemporary digital camera models of the same or competing brands. This would involve manual research across all related Web sites, which might require days or even months of research. Rather than doing this, the consumer is more likely to gather incomplete information by visiting just a few sites. Automated sentiment analysis and BI-based reporting can come to the rescue by providing a complete overview of the many discussions about digital camera models and their features.

First, a list of available digital camera models is collected from the various companies' catalogs to create a comprehensive taxonomy of digital camera models. An initial list of digital camera features or dimensions is also collected from these catalogs. All online discussion pages are collected from the digital camera review Web sites.

One important consideration during taxonomy creation is the grouping of synonymous entities. For example, "Canon PowerShot A540" may also be referred to as "PowerShot 540" or "Canon A540." All of these should be grouped as a single entity. Again, the dimension "camera resolution" may be referred to as "resolution," "megapixel," or simply "MP"; all should be aligned to the single entity "resolution." The presence of the camera model name on a given page indicates that it should be considered for further analysis.

The next challenge is to extract the entities of interest from the text—that is, the digital camera model names and features. A taxonomy-based method is used to extract those that are known. Machine-learning-based approaches can extract the others. Here, documents tagged with existing model names and features are provided as training to the machine-learning algorithm, which uses the data to learn the extraction rules. These rules are then used to extract entities from other incoming articles.

Raw sentiment is extracted from the sentiment-bearing sentences using the approaches described above. A list of sentiment-bearing words, along with their polarity values, is provided as input. Based on the raw sentiments, sentiment aggregation is carried out on two dimensions: digital camera model and digital camera features. Further aggregation can be carried out for each Web site to identify any site-specific bias in the extracted sentiments. These aggregated values are then stored in the data warehouse for reporting purposes.
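A small sketch of the synonym grouping and model/feature aggregation described in this example; the alias tables below are illustrative stand-ins for a real taxonomy.

from collections import defaultdict

# Illustrative alias groups from the taxonomy.
MODEL_ALIASES = {"powershot 540": "Canon PowerShot A540",
                 "canon a540": "Canon PowerShot A540",
                 "canon powershot a540": "Canon PowerShot A540"}
FEATURE_ALIASES = {"megapixel": "resolution", "mp": "resolution",
                   "resolution": "resolution"}

def normalize(raw_sentiments):
    """Map model and feature mentions onto canonical taxonomy entries."""
    for s in raw_sentiments:
        s["model"] = MODEL_ALIASES.get(s["model"].lower(), s["model"])
        s["feature"] = FEATURE_ALIASES.get(s["feature"].lower(), s["feature"])
    return raw_sentiments

def aggregate(raw_sentiments):
    """Average score by (model, feature), ready to load into the warehouse."""
    totals = defaultdict(list)
    for s in normalize(raw_sentiments):
        totals[(s["model"], s["feature"])].append(s["score"])
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}

if __name__ == "__main__":
    raw = [{"model": "PowerShot 540", "feature": "MP", "score": 2},
           {"model": "Canon A540", "feature": "megapixel", "score": 1}]
    print(aggregate(raw))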
Sentiment Analysis of Election Campaigns

The most recent U.S. presidential election saw a large number of online Web sites discussing the post-election policies and agendas of Democratic nominee Barack Obama and Republican John McCain. These discussions come from people who are very likely to be legitimate American voters (rather than, say, children or people residing outside the U.S.). Political parties such as Democrats and Republicans employ armies of people across the U.S. to survey people about their opinions on the policies of the presidential candidates. These surveys incur huge costs and delays in information collection and analysis. Automated BI and sentiment analysis can work magic here by continuously analyzing the comments posted on Web sites and providing prompt, sentiment-based reporting.

For example, a popular presidential debate on television one evening will lead to comments on the Web. Sentiment analysis performed on the comments can be completed in real time, and the political parties can gauge the response to the debate and to the policy matters discussed. Smart technology use and intelligent data collection can provide in-depth, state-wide sentiment analysis of the comments. Such analysis would be extremely powerful in determining the future election campaigning strategy in each state.

Considering the sensitivity and impact of the analysis, careful attention must be paid to generating the taxonomy, which consists of two main entities: the presidential nominees and the policies or issues discussed. The presidential nominees list is finite, corresponding to the major political parties. Variations in the names, acronyms, or synonyms should also be carefully studied and collated.

Generating the taxonomy of issues or policies is far more challenging. Each issue is defined in terms of keywords or phrases; some of these will appear in multiple issues or policies. Variations among keywords and phrases can be quite large, and capturing them requires considerable time and effort. Automated methods may be used for many of these steps, but manual verification and editing is required to remove discrepancies.

Another challenge is determining the location of each person entering comments. This can be done by capturing their Internet protocol (IP) addresses, then associating them with physical and geographical locations. Comments from outside the country are ignored. Other comments are associated with states (or cities, as available). Finally, carefully selected, election-specific sentiment words are added to the taxonomy.

Once the taxonomy is in place, the raw sentiments may be extracted from the comments. They are in two primary forms:

■■ {Presidential Nominee, Location, Sentiment}, which captures generic sentiment about the presidential candidate regardless of issue

■■ {Presidential Nominee, Issue or Policy, Location, Sentiment}, which captures the sentiment or opinion about the particular issue for the presidential candidate

A single comment may lead to the extraction of more than one raw sentiment, as shown above. Next, the data is aggregated along dimensions such as presidential nominee, policy issue, or location. The aggregated results are stored in a warehouse for quick access and reporting.

In the future, many Web sites will likely collect further details about the people making the comments, including age group, income, education, religion, race, ethnic origin, and number of family members. This would allow more detailed analysis and drill-down of the sentiment results, which would aid in advanced campaign management such as micro-targeting specific groups of voters.

[Figure 2. Sample election campaign—voter sentiment report: a U.S. map showing positive and negative voter sentiment toward Obama and McCain by state.]
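Rolling the raw {Presidential Nominee, Issue or Policy, Location, Sentiment} records up to the state-level view summarized in Figure 2 might look like the following sketch; the sample records and scoring are illustrative.

from collections import defaultdict

SCORE = {"positive": 1, "negative": -1}   # neutral or unknown counts as 0

def state_summary(raw_sentiments):
    """Net sentiment per (nominee, state), aggregated from raw records."""
    net = defaultdict(int)
    for rec in raw_sentiments:
        net[(rec["nominee"], rec["state"])] += SCORE.get(rec["sentiment"], 0)
    return dict(net)

if __name__ == "__main__":
    comments = [
        {"nominee": "Obama",  "issue": "health care", "state": "OH", "sentiment": "positive"},
        {"nominee": "Obama",  "issue": "economy",     "state": "OH", "sentiment": "negative"},
        {"nominee": "McCain", "issue": "economy",     "state": "OH", "sentiment": "positive"},
    ]
    print(state_summary(comments))

The same records can be regrouped by issue or by finer-grained location, which is the drill-down the article anticipates.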
These contributed to the development of sentiment analysis. Product reviews are analyzed to provide an overall idea about the features of the product along with its strengths and weaknesses. ■■ Online movie reviews. These are available in abundance, which led to the discovery of a new domain of sentiment analysis that analyzes people’s opinions about movies. ■■ Company news. Analyzing news articles and discussions related to a company can provide detailed sentiment analysis about an organization’s performance, along with criteria such as profit, customer satisfaction, and products. ■■ Online videos. Sentiment analysis helps to capture opinions about both video quality and the events portrayed. ■■ Hotels, vacation homes, holiday destinations, and restaurants. Sentiment analysis helps people make informed decisions about holiday plans or where to dine out. ■■ Movie stars, popular sports figures, and television personalities. Sentiment analysis can capture the sentiments and opinions of large groups of people by analyzing discussions or articles related to such public figures. Existing Research in Sentiment Analysis Sentiment/opinion analysis is an emerging area of research in text mining. Early researchers rated movie reviews on a positive/negative scale by treating each review as a “bag of words” and applying machine-learning algorithms like Naïve Bayes. Successive research progressed to detecting sentence-level sentiment and hence reporting higher accuracy figures. In contrast to the research on movie reviews, experts from the finance domain analyzed the 48 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 sentiment in published news articles to predict the price of a certain stock for the following day. Experts also discovered new techniques for using Web search to determine the semantic orientation of words, which is at the core of quantifying the sentiment expressed in a sentence. See the bibliography at the end of this article for additional studies and reports. Final Thoughts In closing, we would like to spotlight two observations that highlight the growing need for sentiment analysis: With the explosion of Web 2.0 platforms such as blogs, discussion forums, peer-to-peer networks, and various other types of social media all of which continue to proliferate across the Internet at lightning speed, consumers have at their disposal a soapbox of unprecedented reach and power by which to share their brand experiences and opinions, positive or negative, regarding any product or service. As major companies are increasingly coming to realize, these consumer voices can wield enormous influence in shaping the opinions of other consumers—and, ultimately, their brand loyalties, their purchase decisions, and their own brand advocacy. Companies can respond to the consumer insights they generate through social media monitoring and analysis by modifying their marketing messages, brand positioning, product development, and other activities accordingly. —Jeff Zabin and Alex Jefferies [2008]. “Social Media Monitoring and Analysis: Generating Consumer Insights from Online Conversation,” Aberdeen Group Benchmark Report. Marketers have always needed to monitor media for information related to their brands—whether it’s for public relations activities, fraud violations, or competitive intelligence. But fragmenting media and changing consumer behavior have crippled traditional monitoring methods. 
Technorati estimates that 75,000 new blogs are created daily, along with 1.2 million new posts SENTIMENT ANALYSIS each day, many discussing consumer opinions on products and services. Tactics [of the traditional sort] such as clipping services, field agents, and ad hoc research simply can’t keep pace. —Peter Kim [2006]. “The Forrester Wave: Brand Monitoring, Q3 2006,” white paper, Forrester Wave. Bibliography Baeza-Yates, Ricardo, and B. Ribeiro-Neto [1999]. Modern Information Retrieval. Addison-Wesley Longman Publishing Company. Cunningham, Hamish, Diana Maynard, Kalina Bontcheva, and Valentin Tablan [2002]. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Philadelphia, PA. Pang, Bo, and Lillian Lee [2005]. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. Proceedings of the ACL, pp. 115–124. ——— [2004]. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the ACL, pp. 271–278. ———, and Shivakumar Vaithyanathan [2002]. Thumbs up? Sentiment Classification Using Machine Learning Techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10, pp. 79–86. Rabiner, Lawrence R. [1989]. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286. Das, Sanjiv Ranjan, and Mike Y. Chen [2001]. Yahoo! for Amazon: Sentiment Parsing from Small Talk on the Web. Proceedings of the 8th Asia Pacific Finance Association Annual Conference. Sebastiani, Fabrizio [2002]. “Machine Learning in Automated Text Categorization.” ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47. Esuli, Andrea, and Fabrizio Sebastiani [2006]. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of LREC-06, 5th Conference on Language Resources and Evaluation, Genova, Italy, pp. 417–422. Turney, Peter D. [2002]. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417–424. Philadelphia, PA. Hurst, Matthew, and Nigam Kamal [2004]. “Retrieving Topical Sentiments from Online Document Collections.” Document Recognition and Retrieval XI, pp. 27–34. ———, and Michael L. Littman [2003]. “Measuring praise and criticism: Inference of semantic orientation from association.” ACM Transactions on Information Systems, Vol. 21, No. 4, pp. 315–346. Lafferty, John, Andrew McCallum, and Fernando Pereira [2001]. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. Nigam, Kamal, and Matthew Hurst [2004]. Towards a Robust Metric of Opinion. AAAI Spring Symposium on Exploring Attitude and Affect in Text. BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 49 AUTHOR INSTRUCTIONS Editorial Calendar and Instructions for Authors The Business Intelligence Journal is a quarterly journal that focuses on all aspects of data warehousing and business intelligence. 
It serves the needs of researchers and practitioners in this important field by publishing surveys of current practices, opinion pieces, conceptual frameworks, case studies that describe innovative practices or provide important insights, tutorials, technology discussions, and annotated bibliographies. The Journal publishes educational articles that do not market, advertise, or promote one particular product or company. Editorial Acceptance ■ All articles are reviewed by the Journal’s editors before they are accepted for publication. ■ The publisher will copyedit the final manuscript to conform to its standards of grammar, style, format, and length. ■ Articles must not have been published previously without the knowledge of the publisher. Submission of a manuscript implies the authors’ assurance that the same work has not been, will not be, and is not currently submitted elsewhere. ■ Authors will be required to sign a release form before the article is published; this agreement is available upon request (contact journal@tdwi.org). ■ The Journal will not publish articles that market, advertise, or promote one particular product or company. Editorial Topics for 2010 Journal authors are encouraged to submit articles of interest to business intelligence and data warehousing professionals, including the following timely topics: ■ Agile business intelligence ■ Project management and planning ■ Architecture and deployment ■ Data design and integration Submissions ■ Data management and infrastructure ■ Data analysis and delivery tdwi.org/journalsubmissions Materials should be submitted to: Jennifer Agee, Managing Editor E-mail: journal@tdwi.org ■ Analytic applications Upcoming Submissions Deadlines ■ Selling and justifying the data warehouse Volume 15, Number 4 Submission Deadline: September 3, 2010 Distribution Date: December 2010 Volume 16, Number 1 Submission Deadline: December 17, 2010 Distribution Date: March 2011 50 BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 DASHBOARD PLATFORMS Dashboard Platforms Alexander Chiang Introduction This article discusses the importance of a platformbased dashboard solution for business professionals responsible for developing a digital dashboard. The first two sections focus on business users and information workers such as business analysts. The latter sections speak to technologists, including software developers. Alexander Chiang is director of consulting services for Dundas Data Visualization, Inc. alexanderc@dundas.com We will take a brief look at the technologies in the context of the BI stack to help readers put the significance of dashboard platforms into perspective. Next, we will present the business challenges of the dashboards, followed by an explanation of how these challenges can be addressed with a dashboard solution that is based on a platform. A Brief History of the BI Stack The business intelligence (BI) community has mature technologies for several components of the BI stack. In particular, the data and the analytics layers have been focal points for most BI solution vendors for the last few decades. This makes sense, considering those layers represent the basic foundations of storing and analyzing data. The data layer has received the most attention, and technologists have implemented the majority of features necessary to address the challenges of storing, consolidating, and retrieving data pertinent to organizations. The analytics layer has been revitalized since the dotcom boom. 
As more information was brought online, massive amounts of unstructured data began floating in cyberspace, and analysts realized the value proposition of mining and disseminating all this useful data. Content analysis tools were built to solve the challenges of making sense of all the data; they came from a ready supply of analysis tools on which to build. Other sectors still maturing in this area include predictive analysis, which allows organizations to analyze historical data in search of insight about future trends. BUSINESS INTELLIGENCE JOURNAL • VOL. 15, NO. 2 51 DASHBOARD PLATFORMS Finally, there is the presentation layer. This is the mechanism for delivering to end users all the information provided by the data and analytics layers. Traditionally, the information is presented in the form of scorecards, reports, and/or dashboards. This article discusses the ideal presentation layer solution for dashboards. In general, the concepts covered here can be applied to other areas (such as scorecards and reporting) as well as the newer advances in analytics. The Dashboard Platform A dashboard platform is a software framework designed to address the visualization needs of businesses. The platform must provide interfaces and common functionality to help users address common business cases with minimum involvement from technologists. It must also be both highly customizable and extensible to address complex needs. These software concepts can be summarized by a statement from computer scientist Alan Kay: “Simple things should be simple and complex things should be possible” (Leuf and Cunningham, 2001). A dashboard platform serves specifically to develop and deploy dashboards. Out of the box, the ideal dashboard platform should provide: ■■ An accelerated development and deployment timeline ■■ A collaborative workflow ■■ An open application programming interface (API) dashboard—assuming there are enough resources to execute this concurrent development. From a technology perspective, a dashboard solution should facilitate defining business metrics without requiring the underlying data to be prepared first. The personnel responsible for finding the data can use these business metric definitions as a communication medium; that is, they can start looking for the columns and preparing the calculations necessary to satisfy the definitions. Simultaneously, those responsible for designing and creating the dashboard can use these business metric definitions to begin choosing appropriate visualizations and adding interactivity. Once the data and the design are ready, the dashboard can be deployed. This approach will accelerate production of the dashboard solution thanks to the concurrent workflow between the data team and the dashboard designers. Collaborative Dashboard Development I discussed the key players and processes of a dashboard initiative in detail in a previous Business Intelligence Journal article (see References). To summarize, the following resources are generally needed: ■■ Business users to utilize the dashboards and confirm the business metrics needed ■■ Business analysts to determine the business metrics and design the dashboards ■■ Database administrators to discover the underlying data used in the business metrics ■■ IT workers to maintain and integrate any technology needed for delivering a BI solution Rapid Dashboard Development and Deployment There are many ways to develop and deploy a dashboard. The best place to start is to define the business metrics needed. 
Collaborative Dashboard Development

I discussed the key players and processes of a dashboard initiative in detail in a previous Business Intelligence Journal article (see References). To summarize, the following resources are generally needed:

■ Business users to utilize the dashboards and confirm the business metrics needed
■ Business analysts to determine the business metrics and design the dashboards
■ Database administrators to discover the underlying data used in the business metrics
■ IT workers to maintain and integrate any technology needed for delivering a BI solution

A dashboard solution should take advantage of all the players participating in a dashboard initiative. This goal can be accomplished by implementing interfaces and functionality specific to each audience. Business users should have a portal to access the dashboards; business analysts should have a work area where they can define business metrics and design dashboards. Database administrators should have a work area that allows them to connect to their data sources and manipulate the data so it can satisfy the business metric definitions. Finally, IT should have interfaces for administering who has access to particular dashboards and interfaces within the system. By providing work areas, tools, and functionality specific to particular tasks and resources, expertise is leveraged to achieve maximum resource efficiency.

Open Application Programming Interface

An API allows technologists to leverage all the services and functionality within a platform. A well-designed dashboard platform will have been developed with this paradigm in mind; such a platform should leverage its own API to add new features. Furthermore, dashboard platforms (in fact, any software platform) won't necessarily have all the features an organization needs at the time they are evaluated, but the organization should understand that the platform will allow for customization so it can meet future needs.

Before examining the details of the open API, it is important to recognize the technology challenges dashboard solutions need to address.

Dashboard Technology Challenges

The three key technical challenges in leveraging dashboards as an information delivery mechanism are:

■ System integration
■ Data source compatibility
■ Specialized data visualizations

System Integration

In general, organizations have an existing IT infrastructure in place, including corporate Web portals. Ideally, the chosen dashboard solution should be easy to integrate within this infrastructure. Traditionally, most dashboard solutions and their respective tools were standalone desktop applications, and it is difficult to couple a corporate Web portal with such applications because they are two different types of technologies. Dashboard vendors recognized this and moved their tools toward Web-based solutions. The full benefits of this move are beyond the scope of this article, but scalability and maintainability are the two major advantages.

An IT infrastructure usually has a security subsystem. The dashboard solution should leverage this existing subsystem so the IT team won't have to maintain two different security systems.
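As a simple illustration of this kind of reuse, the sketch below shows how a dashboard tier might delegate both authentication and role checks to a portal's existing security subsystem rather than keeping a second user store. The PortalSecurity interface, DashboardGateway class, and their methods are assumptions made for this example; they stand in for whatever identity service an organization already runs.

    # Hypothetical sketch: reusing an existing portal security subsystem for
    # dashboard access instead of maintaining a second user/role store.

    from typing import Protocol


    class PortalSecurity(Protocol):
        """Stand-in for the organization's existing identity service (assumed)."""
        def validate_session(self, session_token: str) -> bool: ...
        def roles_for(self, session_token: str) -> set[str]: ...


    class DashboardGateway:
        def __init__(self, security: PortalSecurity, required_roles: dict[str, set[str]]):
            self.security = security
            # Map of dashboard id -> roles allowed to view it.
            self.required_roles = required_roles

        def can_view(self, session_token: str, dashboard_id: str) -> bool:
            """Delegate authentication and authorization to the existing subsystem."""
            if not self.security.validate_session(session_token):
                return False
            allowed = self.required_roles.get(dashboard_id, set())
            return bool(allowed & self.security.roles_for(session_token))


    # Example with an in-memory fake standing in for the portal's real service.
    class FakeSecurity:
        def validate_session(self, session_token: str) -> bool:
            return session_token == "valid-token"

        def roles_for(self, session_token: str) -> set[str]:
            return {"sales_manager"}


    gateway = DashboardGateway(FakeSecurity(), {"sales-pipeline": {"sales_manager", "cfo"}})
    print(gateway.can_view("valid-token", "sales-pipeline"))    # True
    print(gateway.can_view("expired-token", "sales-pipeline"))  # False

The design point is that the dashboard layer holds no credentials of its own; it only asks the existing subsystem questions, so IT continues to manage users and roles in one place.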
Data Source Compatibility

Data source neutrality is important for dashboard vendors; their solutions must be able to connect to multiple data sources to feed the dashboards. Although most dashboard products provide connectivity to popular databases and analytics packages, the challenge arises when an organization has to use a homegrown analytics engine or a more specialized database. For businesses investing in complete BI solutions from the bigger vendors, this is a non-issue, as they can leverage their consolidation technologies. For the mid-market, however, choosing an end-to-end solution may not be practical or within the budget. This makes it important for the dashboard solution to provide a way to connect to various types of data sources.

Specialized Data Visualizations

There are various dashboard types (e.g., strategic, tactical, operational) as well as dashboards targeted at specific verticals. Generally, vertical dashboards require particular types of visualizations. For example, a media company may be interested in a dashboard that analyzes social networks so the company can target specific individuals or groups with many ties to other individuals and groups. This requires a specific type of network diagram that is not found in most dashboard products. As a result, the media company might consider creating a custom solution.

These challenges make it difficult for an organization to choose a vendor and understand what effect the choice has on its long-term strategy and growth prospects. For example, the same media company may decide to provide television broadcasting services and may require real-time dashboards to monitor ratings. This scenario would require visualizations specifically created for real-time presentation, which typically entails performance challenges. The point is that dashboard solutions provide basic visualizations such as standard chart types, gauges, and maps, but they do not generally provide more specialized visualizations. How do we address these technology issues?

The Dashboard Platform API

A dashboard platform should address the technology problems previously described: system integration, data source compatibility, and specialized visualizations. These can be resolved by an API that affords the following:

■ A standalone dashboard viewer
■ Data source connectors
■ Third-party data visualization integration

Standalone Dashboard Viewer

A standalone dashboard viewer is a separate control that allows developers to integrate dashboards into other applications. Most organizations have a Web-based portal, which suggests that a dashboard platform should, at a minimum, include a Web-based viewer. Although rare, company portals built around desktop technologies are not necessarily out of luck: most thick-client development tools have a standalone browser control that allows the viewer to be embedded.

Many businesses display sensitive data on dashboards, and the viewer should take this into consideration. The viewer should leverage a company's security system to allow dashboard access using existing role-based credentials. This allows role- and parameter-specific dashboards to be shown rather than generic dashboards. With adequate integration, further supporting data and files can be paired and shared with dashboards. In addition, files can be created from dashboard data and exported to a variety of file formats using export APIs. With these areas exposed for customization, the majority of the integration requirements typical of a BI infrastructure are addressed.

Data Source Connectors

A good data source connector API should provide standard data schemas for consumption by the platform. Developers then build an adapter that connects to the unsupported data source and manipulates the data it contains until it maps to a dashboard platform data schema. Once that is complete, the platform can consume the data source. There are many types of data sources, such as Web services and database engines. The importance of connectors is apparent: they facilitate connecting dashboards to new data sources without third-party consolidation software, which keeps the door wide open for emerging data technologies. Each newly supported data type should be accessible through an appropriate user interface, either by reuse of an existing screen or the creation of a custom one.
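The adapter idea can be sketched as follows. Here a hypothetical platform schema is simply a list of (label, value) rows, and the connector's job is to reshape whatever an otherwise unsupported source returns into that shape. The PlatformRow and RestApiConnector names are illustrative assumptions, not part of any vendor's connector API.

    # Hypothetical sketch: a data source connector that adapts an unsupported
    # source (here, a REST service returning JSON) to a simple platform schema.

    import json
    from typing import NamedTuple
    from urllib.request import urlopen


    class PlatformRow(NamedTuple):
        """Illustrative stand-in for the platform's standard data schema."""
        label: str
        value: float


    class RestApiConnector:
        """Adapter: pulls from a Web service and maps its payload to PlatformRow."""

        def __init__(self, url: str, label_field: str, value_field: str):
            self.url = url
            self.label_field = label_field
            self.value_field = value_field

        def fetch(self) -> list[PlatformRow]:
            with urlopen(self.url) as response:       # source-specific retrieval
                payload = json.load(response)
            return [                                  # source-specific mapping
                PlatformRow(str(item[self.label_field]), float(item[self.value_field]))
                for item in payload
            ]


    # The platform only ever consumes list[PlatformRow], whatever sits behind it.
    # connector = RestApiConnector("https://example.com/api/sales", "region", "revenue")
    # dashboard_feed = connector.fetch()

Because every connector returns the same row shape, a new source (a homegrown analytics engine, a specialized database, a Web service) only requires a new adapter, not changes to the dashboards that consume it.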
Third-Party Data Visualization Integration

A dashboard designer interface generally comes with a set of standard charts, gauges, and maps for visualizing data. However, there are many additional types of visualizations for dashboards, and a dashboard solution may not have all that are needed to satisfy an organization's requirements. A good plug-in API should provide a standard interface for developers to integrate third-party visualizations into the platform's dashboard designer. This interface should allow KPIs defined in the platform to be hooked up to the visualization. In addition, it should define common events associated with dashboard interactivity (such as mouse clicks) so developers can customize any interaction associated with the visualization. One example is a workflow diagram: when the dashboard user clicks on a particular block of the workflow, the visualization may zoom in and show the sub-workflows of that block.

The standard data visualizations (DVs) that come with the platform should also be exposed through an extensible API. For example, a chart type not provided by the platform may share many properties with one that is, such as X and Y axes. Consider a real-time line chart: it has properties similar to a standard line chart, but the key difference is that it changes with time and should slide its time window forward as new data points are received. With a DV API, developers can leverage the basic functionality and properties of the platform's charts and customize them to their organization's needs. Choosing a platform that allows third-party visualizations to be integrated into the dashboard design provides comfort to a company that is unsure what types of DVs it will need in the future.
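To illustrate both ideas, the sketch below shows a hypothetical plug-in contract with a KPI hookup and a click event, implemented by a real-time line chart that keeps a sliding time window as new points arrive. The VisualizationPlugin interface and RealTimeLineChart class are assumptions made for this example, not the designer API of any specific dashboard product.

    # Hypothetical sketch: a third-party visualization plug-in with a KPI hookup,
    # a click event, and a sliding time window for real-time data.

    from abc import ABC, abstractmethod
    from collections import deque


    class VisualizationPlugin(ABC):
        """Illustrative plug-in contract a dashboard designer might expose."""

        @abstractmethod
        def bind_kpi(self, kpi_name: str) -> None: ...

        @abstractmethod
        def on_click(self, x: float, y: float) -> None: ...

        @abstractmethod
        def render(self) -> str: ...


    class RealTimeLineChart(VisualizationPlugin):
        def __init__(self, window_size: int = 60):
            self.kpi_name = ""
            self.points = deque(maxlen=window_size)  # old points fall off automatically

        def bind_kpi(self, kpi_name: str) -> None:
            self.kpi_name = kpi_name

        def push(self, timestamp: float, value: float) -> None:
            """Called as new data arrives; points beyond the window are discarded."""
            self.points.append((timestamp, value))

        def on_click(self, x: float, y: float) -> None:
            # A real plug-in might drill down here; this sketch just reports it.
            print(f"clicked near ({x}, {y}) on {self.kpi_name}")

        def render(self) -> str:
            return f"{self.kpi_name}: {len(self.points)} points in window"


    chart = RealTimeLineChart(window_size=3)
    chart.bind_kpi("Live Ratings")
    for t, v in [(1, 4.2), (2, 4.5), (3, 4.4), (4, 4.8)]:
        chart.push(t, v)
    print(chart.render())  # Live Ratings: 3 points in window (oldest point dropped)

The plug-in only needs to honor the contract (bind a KPI, handle events, render); the sliding window is an internal detail the developer adds on top of the platform's basic chart behavior.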
Final Note

A dashboard solution should facilitate accelerated dashboard production, infuse a sense of collaboration among the personnel involved in development, and provide an open API to allow for a customized solution. Companies choosing a flexible and customizable dashboard solution should be looking for these features; the benefits are apparent and should be realized quickly. Rapid dashboard development and deployment decreases development costs and gets dashboards into the hands of decision makers more quickly. Interfaces and workflows designed for specific resources reduce the learning curve and increase the likelihood of corporate adoption, so the software doesn't just sit on a shelf. Finally, an open API allows an organization to customize a solution to its specific requirements, lowering the risk of choosing a product that is inappropriate for its immediate and long-term needs. Viewing these areas as checkboxes during a product evaluation will help an organization select the right solution.

References

Chiang, Alexander [2009]. "Creating Dashboards: The Players and Collaboration You Need for a Successful Project," Business Intelligence Journal, Vol. 14, No. 1, pp. 59–63.

Leuf, Bo, and Ward Cunningham [2001]. The Wiki Way: Quick Collaboration on the Web, Addison-Wesley.

BI StatShots

Unified Data Management Barriers. According to our research survey, unified data management (UDM) is most often stymied by turf issues. These include a corporate culture based on silos, data ownership, and other politics. UDM also suffers when there's a lack of governance or stewardship, a lack of business sponsorship, or unclear business goals for data.

Strategic Value. To test perceptions of UDM's strategic status, this report's survey asked respondents to rate UDM's possible strategic value. In the perceptions of survey respondents, UDM has strong potential for high strategic impact. By extension, UDM is indeed strategic (despite its supporting role) when it is fully aligned with, and satisfying, the data requirements of strategic business initiatives and goals. A whopping 59 percent reported that it could be highly strategic, and an additional 22 percent felt it could be very highly strategic. Few survey respondents said that UDM is not very strategic (5 percent), and no one felt it is not strategic at all (0 percent).

—Philip Russom

In your organization, what are the top potential barriers to coordinating multiple data management practices? (Select six or fewer.)

Corporate culture based on silos: 61%
Data ownership and other politics: 60%
Lack of governance or stewardship: 44%
Lack of business sponsorship: 42%
Poor master data or metadata: 32%
Inadequate budget for data management: 31%
Data management over multiple organizations: 28%
Inadequate data management infrastructure: 28%
Unclear business goals for data: 28%
Poor quality of data: 24%
Independence of data management teams: 23%
Consolidation/reorganization of data management teams: 20%
Existing tools not conducive to UDM: 20%
Lack of compelling business case: 19%
Poor integration among data management tools: 14%
Other: 4%

Figure 1. Based on 857 responses from 179 respondents (4.8 average responses per respondent). Source: Unified Data Management, TDWI Best Practices Report, Q2 2010.