Machine learning at Stripe is built on Python and the PyData stack: scikit-learn and pandas remain core components of an ML pipeline that feeds a production system written in Scala. This talk covers the ML Infra team’s work to bridge the serialization and scoring gap between Python and the JVM, as well as how ML engineers ship models to production.
The Stripe Machine Learning Infrastructure team exists to help engineers, data scientists, and analysts at Stripe develop and ship models to production. The team owns and operates the primary service that provides an API for scoring models for applications such as fraud and NLP, and is always looking for ways to help internal Stripe customers ship ML for new applications or model types.
ML models at Stripe are trained and evaluated in Python, with scikit-learn as an integral piece of our pipeline. However, the primary scoring service is written in Scala, which presents us with a problem: how do we serialize and export models from Python to the JVM? This talk will discuss our framework for serializing and packaging machine learning components; by the end you will have learned how we export models, transformers, encoders, and pipelines from the world of scikit-learn to that of our Scala service.
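To make the export problem concrete, here is a minimal sketch of one possible approach, not the framework described in the talk: dump the fitted parameters of a scaler and a linear classifier to a language-neutral JSON document that a JVM service could parse. The function names, the JSON layout, and the feature names are all hypothetical. The stand-in objects only mimic the `mean_`, `scale_`, `coef_`, and `intercept_` attributes that fitted scikit-learn `StandardScaler` and binary `LogisticRegression` objects expose, so the block runs without scikit-learn installed.

```python
import json
import math
from types import SimpleNamespace


def export_linear_pipeline(scaler, model, feature_names):
    """Serialize fitted parameters to JSON (hypothetical layout, for illustration)."""
    spec = {
        "format_version": 1,
        "features": list(feature_names),
        "scaler": {
            "mean": [float(x) for x in scaler.mean_],
            "scale": [float(x) for x in scaler.scale_],
        },
        "model": {
            "type": "logistic_regression",
            # Binary classifiers store coefficients with shape (1, n_features).
            "coef": [float(x) for x in model.coef_[0]],
            "intercept": float(model.intercept_[0]),
        },
    }
    return json.dumps(spec, sort_keys=True)


def score(spec_json, row):
    """Reference scorer: the arithmetic a JVM-side loader would have to reproduce."""
    spec = json.loads(spec_json)
    z = spec["model"]["intercept"]
    for name, mean, scale, coef in zip(
        spec["features"],
        spec["scaler"]["mean"],
        spec["scaler"]["scale"],
        spec["model"]["coef"],
    ):
        z += coef * (row[name] - mean) / scale
    return 1.0 / (1.0 + math.exp(-z))


# Stand-ins with the same attribute names as the fitted scikit-learn objects.
scaler = SimpleNamespace(mean_=[10.0, 0.5], scale_=[2.0, 0.25])
model = SimpleNamespace(coef_=[[0.8, -1.2]], intercept_=[0.1])

spec_json = export_linear_pipeline(scaler, model, ["amount", "risk"])
probability = score(spec_json, {"amount": 10.0, "risk": 0.5})
```

Keeping a reference scorer next to the exporter gives a simple parity check: the same JSON and the same input row should produce the same probability on both sides of the language boundary.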
We’ll then cover what happens after the model has been loaded by our Scala service, namely how we name models uniquely and use metadata we call “tags” to keep track of which model is currently running in production and which models ran there before it. We’ll also discuss how we score candidate models in parallel with the production model to evaluate them for promotion to production.
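The tagging and parallel-scoring idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Scala service's design: the `ModelRegistry` class, its methods, the tag names, and the model identifiers are all hypothetical. Each model is registered once under a unique name, a tag points at exactly one model, and a scoring request returns the tagged model's score while also collecting scores from any shadow tags for later comparison.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    models: dict = field(default_factory=dict)   # unique model id -> scoring callable
    tags: dict = field(default_factory=dict)     # tag name -> current model id
    history: dict = field(default_factory=dict)  # tag name -> every id the tag has held

    def register(self, model_id, scorer):
        # Model ids are unique: re-registering an existing id is an error.
        if model_id in self.models:
            raise ValueError(f"model id already registered: {model_id}")
        self.models[model_id] = scorer

    def set_tag(self, tag, model_id):
        if model_id not in self.models:
            raise KeyError(f"unknown model id: {model_id}")
        self.history.setdefault(tag, []).append(model_id)
        self.tags[tag] = model_id

    def score(self, tag, features, shadow_tags=()):
        # The primary score is what the caller acts on; shadow scores are
        # only collected so candidates can be evaluated on the same input.
        primary = self.models[self.tags[tag]](features)
        shadow = {
            t: self.models[self.tags[t]](features)
            for t in shadow_tags
            if t in self.tags
        }
        return primary, shadow


registry = ModelRegistry()
registry.register("fraud-model-a", lambda features: 0.2)  # hypothetical ids and scorers
registry.register("fraud-model-b", lambda features: 0.7)
registry.set_tag("production", "fraud-model-a")
registry.set_tag("candidate", "fraud-model-b")

primary, shadow = registry.score(
    "production", {"amount": 10.0}, shadow_tags=["candidate"]
)

# Promotion is just moving the tag; the history keeps the earlier model id.
registry.set_tag("production", registry.tags["candidate"])
```

Because callers address a tag instead of a model id, promoting a candidate does not require any change on the calling side.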
By the end of the talk you should have a clear idea of how we serialize, package, promote, and evaluate candidate models across the entire machine learning infra stack, from the start of training in Python to the final scoring in Scala.
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.