Machine Learning with a Data-Unfriendly Stack | Wrangle Conference

Presenter: Michael Manapat, Stripe

Stripe processes billions of dollars in payments a year on behalf of tens of thousands of businesses, using machine learning to detect and stop fraudulent transactions and fraudulent merchants. Our modeling workflow involves the typical “data science” tools: R and IPython for exploratory analysis, Hadoop for batch data processing, and scikit-learn for model building. However, Stripe’s production backend is written in Ruby and uses MongoDB as its data store, which has introduced difficulties for both model training and production scoring. In this talk, I’ll describe the choices we’ve made to bridge “main land” and “data land” and how, along the way, our model development process has gone from terrible to “ok.”