DATA:
File Format: PDF
Size: 4.7MB
Pages: 293
Making Sense of Data – A Practical Guide to Exploratory Data Analysis and Data Mining
By Glenn J. Myatt
CONTENTS
Preface xi
1. Introduction 1
1.1 Overview 1
1.2 Problem definition 2
1.3 Data preparation 2
1.4 Implementation of the analysis 2
1.5 Deployment of the results 5
1.6 Book outline 5
1.7 Summary 7
1.8 Further reading 7
2. Definition 8
2.1 Overview 8
2.2 Objectives 8
2.3 Deliverables 9
2.4 Roles and responsibilities 10
2.5 Project plan 11
2.6 Case study 12
2.6.1 Overview 12
2.6.2 Problem 12
2.6.3 Deliverables 13
2.6.4 Roles and responsibilities 13
2.6.5 Current situation 13
2.6.6 Timetable and budget 14
2.6.7 Cost/benefit analysis 14
2.7 Summary 14
2.8 Further reading 16
3. Preparation 17
3.1 Overview 17
3.2 Data sources 17
3.3 Data understanding 19
3.3.1 Data tables 19
3.3.2 Continuous and discrete variables 20
3.3.3 Scales of measurement 21
3.3.4 Roles in analysis 22
3.3.5 Frequency distribution 23
3.4 Data preparation 24
3.4.1 Overview 24
3.4.2 Cleaning the data 24
3.4.3 Removing variables 26
3.4.4 Data transformations 26
3.4.5 Segmentation 31
3.5 Summary 33
3.6 Exercises 33
3.7 Further reading 35
4. Tables and graphs 36
4.1 Introduction 36
4.2 Tables 36
4.2.1 Data tables 36
4.2.2 Contingency tables 36
4.2.3 Summary tables 39
4.3 Graphs 40
4.3.1 Overview 40
4.3.2 Frequency polygrams and histograms 40
4.3.3 Scatterplots 43
4.3.4 Box plots 45
4.3.5 Multiple graphs 46
4.4 Summary 49
4.5 Exercises 52
4.6 Further reading 53
5. Statistics 54
5.1 Overview 54
5.2 Descriptive statistics 55
5.2.1 Overview 55
5.2.2 Central tendency 56
5.2.3 Variation 57
5.2.4 Shape 61
5.2.5 Example 62
5.3 Inferential statistics 63
5.3.1 Overview 63
5.3.2 Confidence intervals 67
5.3.3 Hypothesis tests 72
5.3.4 Chi-square 82
5.3.5 One-way analysis of variance 84
5.4 Comparative statistics 88
5.4.1 Overview 88
5.4.2 Visualizing relationships 90
5.4.3 Correlation coefficient (r) 92
5.4.4 Correlation analysis for more than two variables 94
5.5 Summary 96
5.6 Exercises 97
5.7 Further reading 100
6. Grouping 102
6.1 Introduction 102
6.1.1 Overview 102
6.1.2 Grouping by values or ranges 103
6.1.3 Similarity measures 104
6.1.4 Grouping approaches 108
6.2 Clustering 110
6.2.1 Overview 110
6.2.2 Hierarchical agglomerative clustering 111
6.2.3 K-means clustering 120
6.3 Associative rules 129
6.3.1 Overview 129
6.3.2 Grouping by value combinations 130
6.3.3 Extracting rules from groups 131
6.3.4 Example 137
6.4 Decision trees 139
6.4.1 Overview 139
6.4.2 Tree generation 142
6.4.3 Splitting criteria 144
6.4.4 Example 151
6.5 Summary 153
6.6 Exercises 153
6.7 Further reading 155
7. Prediction 156
7.1 Introduction 156
7.1.1 Overview 156
7.1.2 Classification 158
7.1.3 Regression 162
7.1.4 Building a prediction model 166
7.1.5 Applying a prediction model 167
7.2 Simple regression models 169
7.2.1 Overview 169
7.2.2 Simple linear regression 169
7.2.3 Simple nonlinear regression 172
7.3 K-nearest neighbors 176
7.3.1 Overview 176
7.3.2 Learning 178
7.3.3 Prediction 180
7.4 Classification and regression trees 181
7.4.1 Overview 181
7.4.2 Predicting using decision trees 182
7.4.3 Example 184
7.5 Neural networks 187
7.5.1 Overview 187
7.5.2 Neural network layers 187
7.5.3 Node calculations 188
7.5.4 Neural network predictions 190
7.5.5 Learning process 191
7.5.6 Backpropagation 192
7.5.7 Using neural networks 196
7.5.8 Example 197
7.6 Other methods 199
7.7 Summary 204
7.8 Exercises 205
7.9 Further reading 209
8. Deployment 210
8.1 Overview 210
8.2 Deliverables 210
8.3 Activities 211
8.4 Deployment scenarios 212
8.5 Summary 213
8.6 Further reading 213
9. Conclusions 215
9.1 Summary of process 215
9.2 Example 218
9.2.1 Problem overview 218
9.2.2 Problem definition 218
9.2.3 Data preparation 220
9.2.4 Implementation of the analysis 227
9.2.5 Deployment of the results 237
9.3 Advanced data mining 237
9.3.1 Overview 237
9.3.2 Text data mining 239
9.3.3 Time series data mining 240
9.3.4 Sequence data mining 240
9.4 Further reading 240
Appendix A Statistical tables 241
A.1 Normal distribution 241
A.2 Student’s t-distribution 241
A.3 Chi-square distribution 245
A.4 F-distribution 249
Appendix B Answers to exercises 258
Glossary 265
Bibliography 273
Index 275
Preface
Almost every field of study is generating an unprecedented amount of data. Retail companies collect data on every sales transaction, organizations log each click made on their web sites, and biologists generate millions of pieces of information related to genes daily. The volume of data being generated is leading to information overload and the ability to make sense of all this data is becoming increasingly important. It requires an understanding of exploratory data analysis and data mining as well as an appreciation of the subject matter, business processes, software deployment, project management methods, change management issues, and so on.
The purpose of this book is to describe a practical approach for making sense out of data. A step-by-step process is introduced that is designed to help you avoid some of the common pitfalls associated with complex data analysis or data mining projects. It covers some of the more common tasks relating to the analysis of data including (1) how to summarize and interpret the data, (2) how to identify nontrivial facts, patterns, and relationships in the data, and (3) how to make predictions from the data.
The process starts by understanding what business problems you are trying to solve, what data will be used and how, who will use the information generated, and how it will be delivered to them. A plan should be developed that includes this problem definition and outlines how the project is to be implemented. Specific and measurable success criteria should be defined and the project evaluated against them.
The relevance and the quality of the data will directly impact the accuracy of the results. In an ideal situation, the data has been carefully collected to answer the specific questions defined at the start of the project. Practically, you are often dealing with data generated for an entirely different purpose. In this situation, it will be necessary to prepare the data to answer the new questions. This is often one of the most time-consuming parts of the data mining process, and numerous issues need to be thought through.
Once the data has been collected and prepared, it is ready for analysis. The methods you use to analyze the data depend on many factors, including the problem definition and the type of data that has been collected. There may be many methods that could potentially solve your problem and you may not know which one works best until you have experimented with the different alternatives. Throughout the technical sections, issues relating to when you would apply the different methods along with how you could optimize the results are discussed.
Once you have performed an analysis, it needs to be delivered to your target audience. This could be as simple as issuing a report. Alternatively, the delivery may involve implementing and deploying new software. In addition to any technical challenges, the solution could change the way its intended audience operates on a daily basis, which may need to be managed. It will be important to understand how well the solution implemented in the field actually solves the original business problem.
Any project is ideally implemented by an interdisciplinary team, involving subject matter experts, business analysts, statisticians, IT professionals, project managers, and data mining experts. This book is aimed at the entire interdisciplinary team and addresses issues and technical solutions relating to data analysis or data mining projects. The book could also serve as an introductory textbook for students of any discipline, both undergraduate and graduate, who wish to understand exploratory data analysis and data mining processes and methods.
The book covers a series of topics relating to the process of making sense of data, including
Problem definitions
Data preparation
Data visualization
Statistics
Grouping methods
Predictive modeling
Deployment issues
Applications
The book is focused on practical approaches and contains information on how the techniques operate as well as suggestions for when and how to use the different methods. Each chapter includes a further reading section that highlights additional books and online resources that provide background and other information. At the end of selected chapters is a set of exercises designed to help in understanding the respective chapter’s materials.
Accompanying this book is a web site (http://www.makingsenseofdata.com/) containing additional resources including software, data sets, and tutorials to help in understanding how to implement the topics covered in this book.
In putting this book together, I would like to thank the following individuals for their considerable help: Paul Blower, Vinod Chandnani, Wayne Johnson, and Jon Spokes. I would also like to thank all those involved in the review process for the book. Finally, I would like to thank the staff at John Wiley & Sons, particularly Susanne Steitz, for all their help and support throughout the entire project.