Python book introduction: Python Data Analysis Cookbook

Python Data Analysis Cookbook

Ivan Idris

Table of Contents
Preface vii
Chapter 1: Laying the Foundation for Reproducible Data Analysis 1

Introduction 2
Setting up Anaconda 2
Installing the Data Science Toolbox 4
Creating a virtual environment with virtualenv and virtualenvwrapper 6
Sandboxing Python applications with Docker images 8
Keeping track of package versions and history in IPython Notebook 10
Configuring IPython 13
Learning to log for robust error checking 16
Unit testing your code 19
Configuring pandas 22
Configuring matplotlib 24
Seeding random number generators and NumPy print options 28
Standardizing reports, code style, and data access 30

Chapter 2: Creating Attractive Data Visualizations 35
Introduction 36
Graphing Anscombe's quartet 36
Choosing seaborn color palettes 39
Choosing matplotlib color maps 42
Interacting with IPython Notebook widgets 43
Viewing a matrix of scatterplots 47
Visualizing with d3.js via mpld3 49
Creating heatmaps 51
Combining box plots and kernel density plots with violin plots 54
Visualizing network graphs with hive plots 55
Displaying geographical maps 58
Using ggplot2-like plots 60
Highlighting data points with influence plots 62

Chapter 3: Statistical Data Analysis and Probability 67
Introduction 68
Fitting data to the exponential distribution 68
Fitting aggregated data to the gamma distribution 71
Fitting aggregated counts to the Poisson distribution 72
Determining bias 75
Estimating kernel density 78
Determining confidence intervals for mean, variance, and standard deviation 81
Sampling with probability weights 83
Exploring extreme values 87
Correlating variables with Pearson's correlation 91
Correlating variables with the Spearman rank correlation 94
Correlating a binary and a continuous variable with the point biserial correlation 97
Evaluating relations between variables with ANOVA 99

Chapter 4: Dealing with Data and Numerical Issues 103
Introduction 103
Clipping and filtering outliers 104
Winsorizing data 107
Measuring central tendency of noisy data 109
Normalizing with the Box-Cox transformation 112
Transforming data with the power ladder 114
Transforming data with logarithms 116
Rebinning data 118
Applying logit() to transform proportions 120
Fitting a robust linear model 122
Taking variance into account with weighted least squares 125
Using arbitrary precision for optimization 128
Using arbitrary precision for linear algebra 131

Chapter 5: Web Mining, Databases, and Big Data 135
Introduction 136
Simulating web browsing 136
Scraping the Web 139
Dealing with non-ASCII text and HTML entities 142
Implementing association tables 144
Setting up database migration scripts 147
Adding a table column to an existing table 148
Adding indices after table creation 150
Setting up a test web server 151
Implementing a star schema with fact and dimension tables 153
Using HDFS 159
Setting up Spark 160
Clustering data with Spark 161

Chapter 6: Signal Processing and Timeseries 167
Introduction 167
Spectral analysis with periodograms 168
Estimating power spectral density with the Welch method 170
Analyzing peaks 172
Measuring phase synchronization 174
Exponential smoothing 177
Evaluating smoothing 180
Using the Lomb-Scargle periodogram 183
Analyzing the frequency spectrum of audio 185
Analyzing signals with the discrete cosine transform 188
Block bootstrapping time series data 191
Moving block bootstrapping time series data 193
Applying the discrete wavelet transform 197

Chapter 7: Selecting Stocks with Financial Data Analysis 201
Introduction 202
Computing simple and log returns 202
Ranking stocks with the Sharpe ratio and liquidity 204
Ranking stocks with the Calmar and Sortino ratios 206
Analyzing returns statistics 208
Correlating individual stocks with the broader market 211
Exploring risk and return 214
Examining the market with the non-parametric runs test 216
Testing for random walks 219
Determining market efficiency with autoregressive models 221
Creating tables for a stock prices database 223
Populating the stock prices database 225
Optimizing an equal weights two-asset portfolio 230

Chapter 8: Text Mining and Social Network Analysis 235
Introduction 235
Creating a categorized corpus 236
Tokenizing news articles in sentences and words 239
Stemming, lemmatizing, filtering, and TF-IDF scores 240
Recognizing named entities 244
Extracting topics with non-negative matrix factorization 246
Implementing a basic terms database 248
Computing social network density 252
Calculating social network closeness centrality 254
Determining the betweenness centrality 255
Estimating the average clustering coefficient 257
Calculating the assortativity coefficient of a graph 258
Getting the clique number of a graph 259
Creating a document graph with cosine similarity 261

Chapter 9: Ensemble Learning and Dimensionality Reduction 265
Introduction 266
Recursively eliminating features 266
Applying principal component analysis for dimension reduction 269
Applying linear discriminant analysis for dimension reduction 271
Stacking and majority voting for multiple models 272
Learning with random forests 276
Fitting noisy data with the RANSAC algorithm 279
Bagging to improve results 283
Boosting for better learning 286
Nesting cross-validation 289
Reusing models with joblib 292
Hierarchically clustering data 294
Taking a Theano tour 296

Chapter 10: Evaluating Classifiers, Regressors, and Clusters 299
Introduction 300
Getting classification straight with the confusion matrix 300
Computing precision, recall, and F1-score 303
Examining a receiver operating characteristic and the area under a curve 306
Visualizing the goodness of fit 309
Computing MSE and median absolute error 310
Evaluating clusters with the mean silhouette coefficient 313
Comparing results with a dummy classifier 316
Determining MAPE and MPE 319
Comparing with a dummy regressor 321
Calculating the mean absolute error and the residual sum of squares 324
Examining the kappa of classification 326
Taking a look at the Matthews correlation coefficient 329

Chapter 11: Analyzing Images 333
Introduction 333
Setting up OpenCV 334
Applying Scale-Invariant Feature Transform (SIFT) 337
Detecting features with SURF 339
Quantizing colors 341
Denoising images 343
Extracting patches from an image 345
Detecting faces with Haar cascades 348
Searching for bright stars 351
Extracting metadata from images 355
Extracting texture features from images 357
Applying hierarchical clustering on images 360
Segmenting images with spectral clustering 361

Chapter 12: Parallelism and Performance 365
Introduction 365
Just-in-time compiling with Numba 367
Speeding up numerical expressions with Numexpr 369
Running multiple threads with the threading module 370
Launching multiple tasks with the concurrent.futures module 374
Accessing resources asynchronously with the asyncio module 377
Distributed processing with execnet 380
Profiling memory usage 384
Calculating the mean, variance, skewness, and kurtosis on the fly 385
Caching with a least recently used cache 390
Caching HTTP requests 393
Streaming counting with the Count-min sketch 395
Harnessing the power of the GPU with OpenCL 398

Appendix A: Glossary 401
Appendix B: Function Reference 407

IPython 407
Matplotlib 408
NumPy 409
pandas 410
Scikit-learn 411
SciPy 412
Seaborn 412
Statsmodels 413

Appendix C: Online Resources 415
IPython notebooks and open data 415
Mathematics and statistics 416

Appendix D: Tips and Tricks for Command-Line and Miscellaneous Tools 419

IPython notebooks 419
Command-line tools 420
The alias command 420
Command-line history 421
Reproducible sessions 421
Docker tips 422

Index 425