DISCOVERING KNOWLEDGE IN DATA
An Introduction to Data Mining

DANIEL T. LAROSE
Director of Data Mining
Central Connecticut State University

A JOHN WILEY & SONS, INC., PUBLICATION


Copyright © 2005 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose
p. cm.
Includes bibliographical references and index.
ISBN 0-471-66657-2 (cloth)
1. Data mining. I. Title.
QA76.9.D343L38 2005
006.3'12—dc22
2004003680

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Dedication

To my parents,
And their parents,
And so on.

For my children,
And their children,
And so on.

2004 Chantal Larose

CONTENTS

PREFACE

1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
[...]
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering

2 DATA PREPROCESSING
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min–Max Normalization
Z-Score Standardization
Numerical Methods for Identifying [...]

3 EXPLORATORY DATA ANALYSIS
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further [...]

4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises

5 k-NEAREST NEIGHBOR ALGORITHM
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias–Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing [...]

6 DECISION TREES
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises

7 NEURAL NETWORKS
Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
References
Exercises

8 HIERARCHICAL AND k-MEANS CLUSTERING
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict [...]

9 KOHONEN NETWORKS
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises

10 ASSOCIATION RULES
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global [...]

11 MODEL EVALUATION TECHNIQUES
Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Techniques for the Classification Task
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of [...]

EPILOGUE: “WE’VE ONLY JUST BEGUN”

INDEX

PREFACE

WHAT IS DATA MINING?

Data mining is predicted to be “one of the most revolutionary developments of the next decade,” according to the online technology magazine ZDNET News (February 8, 2001). In fact, the MIT Technology Review chose data mining as one of ten emerging technologies that will change the world. According to the Gartner Group, “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

Because data mining represents such an important field, Wiley-Interscience and Dr. Daniel T. Larose have teamed up to publish a series of volumes on data mining, consisting initially of three volumes. The first volume in the series, Discovering Knowledge in Data: An Introduction to Data Mining, introduces the reader to this rapidly growing field of data mining.

WHY IS THIS BOOK NEEDED?

Human beings are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that not enough trained human analysts are available who are skilled at translating all of the data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed; it provides readers with:

• Models and techniques to uncover hidden nuggets of information
• Insight into how data mining algorithms work
• The experience of actually performing data mining on large data sets

Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are using data mining and are thereby gaining the competitive edge. In Discovering Knowledge in Data, the step-by-step hands-on solutions of real-world business problems using widely available data mining techniques applied to real-world data sets
will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing return on investment.

DANGER! DATA MINING IS EASY TO DO BADLY

The plethora of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these GUI-based applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black-box software currently available, makes their misuse proportionally more hazardous.

Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on large data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built upon wholly specious assumptions. If deployed, these errors in analysis can lead to very expensive failures.

“WHITE BOX” APPROACH: UNDERSTANDING THE UNDERLYING ALGORITHMIC AND MODEL STRUCTURES

The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to apply instead a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. Discovering Knowledge in Data applies this white-box approach by:

• Walking the reader through the various algorithms
• Providing examples of the operation of the algorithm on actual large data sets
• Testing the reader’s level of understanding of the concepts and algorithms
• Providing an opportunity for the reader to do some real data mining on large data sets

Algorithm Walk-Throughs

Discovering Knowledge in Data walks the reader through the operations and nuances of the various algorithms, using small-sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm. For example, in Chapter 8, we see the cluster centers being updated, moving toward the center of their respective clusters. Also, in Chapter 9 we see just which type of network weights will result in a particular network node “winning” a particular record.

Applications of the Algorithms to Large Data Sets

Discovering Knowledge in Data provides examples of the application of various algorithms on actual large data sets. For example, in Chapter 7 a classification problem
is attacked using a neural network model on a real-world data set. The resulting neural network topology is examined along with the network connection weights, as reported by the software. These data sets are included at the book series Web site, so that readers may follow the analytical steps on their own, using data mining software of their choice.

Chapter Exercises: Checking to Make Sure That You Understand It

Discovering Knowledge in Data includes over 90 chapter exercises, which allow readers to assess their depth of understanding of the material, as well as to have a little fun playing with numbers and data. These include conceptual exercises, which help to clarify some of the more challenging concepts in data mining, and “tiny data set” exercises, which challenge the reader to apply the particular data mining algorithm to a small data set and, step by step, to arrive at a computationally sound solution. For example, in Chapter 6 readers are provided with a small data set and asked to construct by hand, using the methods shown in the chapter, a C4.5 decision tree model, as well as a classification and regression tree model, and to compare the benefits and drawbacks of each.

Hands-on Analysis: Learn Data Mining by Doing Data Mining

Chapters 2 to 4 and 6 to 11 provide the reader with hands-on analysis problems, representing an opportunity for the reader to apply his or her newly acquired data mining expertise to solving real problems using large data sets. Many people learn by doing. Discovering Knowledge in Data provides a framework by which the reader can learn data mining by doing data mining. The intention is to mirror the real-world data mining scenario. In the real world, dirty data sets need cleaning; raw data needs to be normalized; outliers need to be checked. So it is with Discovering Knowledge in Data, where over 70 hands-on analysis problems are provided. In this way, the reader can “ramp up” quickly and be “up and running” his or her own data mining analyses relatively shortly.

For example, in Chapter 10 readers are challenged to uncover high-confidence, high-support rules for predicting which customer will be leaving a company’s service. In Chapter 11 readers are asked to produce lift charts and gains charts for a set of classification models using a large data set, so that the best model may be identified.

DATA MINING AS A PROCESS

One of the fallacies associated with data mining implementation is that data mining somehow represents an isolated set of tools, to be applied by some aloof analysis department, and is related only inconsequentially to the mainstream business or research endeavor. Organizations that attempt to implement data mining in this way will see their chances of success greatly reduced. This is because data mining should be viewed as a process.

Discovering Knowledge in Data presents data mining as a well-structured standard process, intimately connected with managers, decision makers, and those
involved in deploying the results. Thus, this book is not only for analysts but also for managers, who need to be able to communicate in the language of data mining. The particular standard process used is the CRISP–DM framework: the Cross-Industry Standard Process for Data Mining. CRISP–DM demands that data mining be seen as an entire process, from communication of the business problem through data collection and management, data preprocessing, model building, model evaluation, and finally, model deployment. Therefore, this book is not only for analysts and managers but also for data management professionals, database analysts, and decision makers.

GRAPHICAL APPROACH, EMPHASIZING EXPLORATORY DATA ANALYSIS

Discovering Knowledge in Data emphasizes a graphical approach to data analysis. There are more than 80 screen shots of actual computer output throughout the book, and over 30 other figures. Exploratory data analysis (EDA) represents an interesting and exciting way to “feel your way” through large data sets. Using graphical and numerical summaries, the analyst gradually sheds light on the complex relationships hidden within the data. Discovering Knowledge in Data emphasizes an EDA approach to data mining, which goes hand in hand with the overall graphical approach.

HOW THE BOOK IS STRUCTURED

Discovering Knowledge in Data provides a comprehensive introduction to the field. Case studies are provided showing how data mining has been utilized successfully (and not so successfully). Common myths about data mining are debunked, and common pitfalls are flagged, so that new data miners do not have to learn these lessons themselves.

The first three chapters introduce and follow the CRISP–DM standard process, especially the data preparation phase and data understanding phase. The next seven chapters represent the heart of the book and are associated with the CRISP–DM modeling phase. Each chapter presents data mining methods and techniques for a specific data mining task.

• Chapters 5, 6, and 7 relate to the classification task, examining the k-nearest neighbor (Chapter 5), decision tree (Chapter 6), and neural network (Chapter 7) algorithms.
• Chapters 8 and 9 investigate the clustering task, with hierarchical and k-means clustering (Chapter 8) and Kohonen network (Chapter 9) algorithms.
• Chapter 10 handles the association task, examining association rules through the a priori and GRI algorithms.
• Finally, Chapter 11 covers model evaluation techniques, which belong to the CRISP–DM evaluation phase.

DISCOVERING KNOWLEDGE IN DATA AS A TEXTBOOK

Discovering Knowledge in Data naturally fits the role of textbook for an introductory course in data mining. Instructors may appreciate:

• The presentation of data mining as a process
• The “white-box” approach, emphasizing an understanding of the underlying algorithmic structures:
    algorithm walk-throughs
    application of the algorithms to large data sets
    chapter exercises
    hands-on analysis
• The graphical approach, emphasizing exploratory data analysis
• The logical presentation, flowing naturally from the CRISP–DM standard process and the set of data mining tasks

Discovering Knowledge in Data is appropriate for advanced undergraduate or graduate courses. Except for one section in Chapter 7, no calculus is required. An introductory statistics course would be nice but is not required. No computer programming or database expertise is required.

ACKNOWLEDGMENTS

Discovering Knowledge in Data would have remained unwritten without the assistance of Val Moliere, editor, Kirsten Rohsted, editorial program coordinator, and Rosalyn Farkas, production editor, at Wiley-Interscience and Barbara Zeiders, who copyedited the work. Thank you for your guidance and perseverance.

I wish also to thank Dr. Chun Jin and Dr. Daniel S. Miller, my colleagues in the Master of Science in Data Mining program at Central Connecticut State University; Dr. Timothy Craine, the chair of the Department of Mathematical Sciences; Dr. Dipak K. Dey, chair of the Department of Statistics at the University of Connecticut; and Dr. John Judge, chair of the Department of Mathematics at Westfield State College. Your support was (and is) invaluable.

Thanks to my children, Chantal, Tristan, and Ravel, for sharing the computer with me. Finally, I would like to thank my wonderful wife, Debra J. Larose, for her patience, understanding, and proofreading skills. But words cannot express. . . .

Daniel T. Larose, Ph.D.
Director, Data Mining @CCSU
www.ccsu.edu/datamining

CHAPTER 1

INTRODUCTION TO DATA MINING

WHAT IS DATA MINING?
WHY DATA MINING?
NEED FOR HUMAN DIRECTION OF DATA MINING
CROSS-INDUSTRY STANDARD PROCESS: CRISP–DM
CASE STUDY 1: ANALYZING AUTOMOBILE WARRANTY CLAIMS: EXAMPLE OF THE CRISP–DM INDUSTRY STANDARD PROCESS IN ACTION
FALLACIES OF DATA MINING
WHAT TASKS CAN DATA MINING ACCOMPLISH?
CASE STUDY 2: PREDICTING ABNORMAL STOCK MARKET RETURNS USING NEURAL NETWORKS
CASE STUDY 3: MINING ASSOCIATION RULES FROM LEGAL DATABASES
CASE STUDY 4: PREDICTING CORPORATE BANKRUPTCIES USING DECISION TREES
CASE STUDY 5: PROFILING THE TOURISM MARKET USING k-MEANS CLUSTERING ANALYSIS

Discovering Knowledge in Data: An Introduction to Data Mining, by Daniel T. Larose. ISBN 0-471-66657-2. Copyright © 2005 John Wiley & Sons, Inc.

About 13 million customers per month contact the West Coast customer service call center of the Bank of America, as reported by CIO Magazine’s cover story on data mining in May 1998 [1]. In the past, each caller would have listened to the same marketing advertisement, whether or not it was relevant to the caller’s interests. However, “rather than pitch the product of the week, we want to be as relevant as possible to each customer,” states Chris Kelly, vice president and director of database marketing at Bank of America in San Francisco. Thus, Bank of America’s customer service representatives have access to individual customer profiles, so that the customer can be informed of new products or services that may be of greatest
interest to him or her. Data mining helps to identify the type of marketing approach for a particular customer, based on the customer’s individual profile.

Former President Bill Clinton, in his November 6, 2002 address to the Democratic Leadership Council [2], mentioned that not long after the events of September 11, 2001, FBI agents examined great amounts of consumer data and found that five of the terrorist perpetrators were in the database. One of the terrorists possessed 30 credit cards with a combined balance totaling $250,000 and had been in the country for less than two years. The terrorist ringleader, Mohammed Atta, had 12 different addresses, two real homes, and 10 safe houses. Clinton concluded that we should proactively search through this type of data and that “if somebody has been here a couple years or less and they have 12 homes, they’re either really rich or up to no good. It shouldn’t be that hard to figure out which.”

Brain tumors represent the most deadly cancer among children, with nearly 3000 cases diagnosed per year in the United States, nearly half of which are fatal. Eric Bremer [3], director of brain tumor research at Children’s Memorial Hospital in Chicago, has set the goal of building a gene expression database for pediatric brain tumors, in an effort to develop more effective treatment. As one of the first steps in tumor identification, Bremer uses the Clementine data mining software suite, published by SPSS, Inc., to classify the tumor into one of 12 or so salient types. As we shall learn in Chapter 5, classification is one of the most important data mining tasks.

These stories are examples of data mining.

WHAT IS DATA MINING?

According to the Gartner Group [4], “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” There are other definitions:

• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” (Hand et al. [5]).
• “Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases” (Evangelos Simoudis in Cabena et al. [6]).

Data mining is predicted to be “one of the most revolutionary developments of the next decade,” according to the online technology magazine ZDNET News [7]. In fact, the MIT Technology Review [8] chose data mining as one of 10 emerging technologies that will change the world. “Data mining expertise is the most sought after . . .” among information technology professionals, according to the 1999 Information Week National Salary Survey [9]. The survey reports: “Data mining skills
are in high demand this year, as organizations increasingly put data repositories online. Effectively analyzing information from customers, partners, and suppliers has become important to more companies. ‘Many companies have implemented a data warehouse strategy and are now starting to look at what they can do with all that data,’ says Dudley Brown, managing partner of BridgeGate LLC, a recruiting firm in Irvine, Calif.”

How widespread is data mining? Which industries are moving into this area? Actually, the use of data mining is pervasive, extending into some surprising areas. Consider the following employment advertisement [10]:

STATISTICS INTERN: SEPTEMBER–DECEMBER 2003

Work with Basketball Operations

Responsibilities include:

• Compiling and converting data into format for use in statistical models
• Developing statistical forecasting models using regression, logistic regression, data mining, etc.
• Using statistical packages such as Minitab, SPSS, XLMiner

Experience in developing statistical models a differentiator, but not required.

Candidates who have completed advanced statistics coursework with a strong knowledge of basketball and the love of the game should forward your résumé and cover letter to:

Boston Celtics
Director of Human Resources
151 Merrimac Street
Boston, MA 02114

Yes, the Boston Celtics are looking for a data miner. Perhaps the Celtics’ data miner is needed to keep up with the New York Knicks, who are using IBM’s Advanced Scout data mining software [11]. Advanced Scout, developed by a team led by Inderpal Bhandari, is designed to detect patterns in data. A big basketball fan, Bhandari approached the New York Knicks, who agreed to try it out. The software depends on the data kept by the National Basketball Association, in the form of “events” in every game, such as baskets, shots, passes, rebounds, double-teaming, and so on. As it turns out, the data mining uncovered a pattern that the coaching staff had evidently missed. When the Chicago Bulls double-teamed Knicks’ center Patrick Ewing, the Knicks’ shooting percentage was extremely low, even though double-teaming should open up an opportunity for a teammate to shoot. Based on this information, the coaching staff was able to develop strategies for dealing with the double-teaming situation. Later, 16 of the 29 NBA teams also turned to Advanced Scout to mine the play-by-play data.
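
The kind of pattern described above, a shooting percentage that drops under one game condition, can be illustrated with a small sketch. This is not how Advanced Scout works internally, which the text does not describe; it simply compares the fraction of shots made with and without double-teaming. The event records, the names EVENTS and shooting_pct_by_condition, and all values are hypothetical and invented for illustration.

```python
from collections import defaultdict

# Hypothetical play-by-play records: (center_was_double_teamed, shot_made).
# These values are invented for illustration; they are not NBA data.
EVENTS = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, False), (False, True), (False, True),
]


def shooting_pct_by_condition(events):
    """Return {condition: fraction of shots made} for each condition value."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for double_teamed, shot_made in events:
        attempts[double_teamed] += 1
        made[double_teamed] += int(shot_made)
    return {cond: made[cond] / attempts[cond] for cond in attempts}


if __name__ == "__main__":
    for cond, pct in shooting_pct_by_condition(EVENTS).items():
        label = "double-teamed" if cond else "not double-teamed"
        print(f"{label}: {pct:.0%} of shots made")
```

With real play-by-play data, a large gap between the two percentages would be the sort of finding a coaching staff might then investigate further.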

WHY DATA MINING?

While waiting in line at a large supermarket, have you ever just closed your eyes and listened? What do you hear, apart from the kids pleading for candy bars? You might hear the beep, beep, beep of the supermarket scanners, reading the bar codes on the grocery items, ringing up on the register, and storing the data on servers located at the supermarket headquarters. Each beep indicates a new row in the database, a new “observation” in the information being collected about the shopping habits of your family and the other families who are checking out.

Clearly, a lot of data is being collected. However, what is being learned from all this data? What knowledge are we gaining from all this information? Probably, depending on the supermarket, not much. As early as 1984, in his book Megatrends [12], John Naisbitt observed that “we are drowning in information but starved for knowledge.” The problem today is not that there is not enough data and information streaming in. We are, in fact, inundated with data in most fields. Rather, the problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.

The ongoing remarkable growth in the field of data mining and knowledge discovery has been fueled by a fortunate confluence of a variety of factors:

• The explosive growth in data collection, as exemplified by the supermarket scanners above
• The storing of the data in data warehouses, so that the entire enterprise has access to a reliable current database
• The availability of increased access to data from Web navigation and intranets
• The competitive pressure to increase market share in a globalized economy
• The development of off-the-shelf commercial data mining software suites
• The tremendous growth in computing power and storage capacity

NEED FOR HUMAN DIRECTION OF DATA MINING

Many software vendors market their analytical software as being plug-and-play out-of-the-box applications that will provide solutions to otherwise intractable problems without the need for human supervision or interaction. Some early definitions of data mining followed this focus on automation. For example, Berry and Linoff, in their book Data Mining Techniques for Marketing, Sales and Customer Support [13], gave the following definition for data mining: “Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules” (emphasis added). Three years later, in their sequel, Mastering Data Mining [14], the authors revisit their definition of data mining and state: “If there is anything we regret, it is the phrase ‘by automatic or semi-automatic means’ . . . because we feel there has come to be too much focus on the automatic techniques and not enough on the exploration and analysis. This has

CROSS-INDUSTRY STANDARD PROCESS: CRISP
