Python+Machine Learning tutorial – Data munging for predictive modeling with pandas and scikit-learn

Building predictive models first requires shaping the data into a format that meets the mathematical assumptions of machine learning algorithms. In this session we introduce the pandas DataFrame, a data structure for munging heterogeneous data into a representation suitable for most scikit-learn models. In particular, we address common problems such as missing value imputation and the encoding of categorical variables. We illustrate these concepts by combining pandas-based feature engineering with scikit-learn's Logistic Regression, Random Forests and Gradient Boosted Trees.
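
As a taste of the workflow, here is a minimal sketch of the kind of munging the session covers. It is not the tutorial's actual material: the toy table, its column names (age, fare, embarked, survived) and the choice of median and most-frequent imputation are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: column names and values are invented for this sketch.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46, 11.13, 30.07],
    "embarked": ["S", "C", "S", np.nan, "S", "Q", "S", "C"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

features = df.drop(columns="survived")
target = df["survived"]

# Missing value imputation: fill numeric gaps with the column median
# (one simple strategy among several).
features["age"] = features["age"].fillna(features["age"].median())

# Fill missing categories with the most frequent value.
features["embarked"] = features["embarked"].fillna(features["embarked"].mode()[0])

# Categorical variables: expand the string column into one 0/1 column per category,
# giving the purely numeric matrix that scikit-learn estimators expect.
X = pd.get_dummies(features, columns=["embarked"], dtype=float)

# Any scikit-learn classifier can now be fit on the numeric frame.
model = LogisticRegression(max_iter=1000)
model.fit(X, target)
predictions = model.predict(X)
```

The same numeric frame X could be passed unchanged to a RandomForestClassifier or GradientBoostingClassifier from sklearn.ensemble. On real data the imputation statistics should be computed on the training split only, so that no information leaks from the test set.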