Google and Uber’s Best Practices for Deep Learning

阿新 • Published: 2018-12-23


There is more to building a sustainable Deep Learning solution than what is provided by Deep Learning frameworks like TensorFlow and PyTorch. These frameworks are good enough for research, but they don’t take into account the problems that crop up with production deployment. I’ve written previously about technical debt and the need for more adaptive, biological-like architectures. To support a viable business using Deep Learning, you absolutely need an architecture that supports sustainable improvement in the presence of frequent and unexpected changes in the environment. Current Deep Learning frameworks provide only a single part of a complete solution.

Fortunately, Google and Uber have provided a glimpse of their internal architectures. The architectures of these two giants make excellent base camps if you need to build your own production-ready Deep Learning solution.

The primary motivation for Uber’s system, named Michelangelo, was that “there were no systems in place to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale.” In their paper, they describe the limitations of existing frameworks with respect to deployment and managing technical debt. The paper has enough arguments to convince any skeptic that existing frameworks are insufficient for production.

I’m not going to go through Uber’s paper with you in its entirety. Rather, I’m just going to highlight some important points about their architecture. The Uber system is not strictly a Deep Learning system, but rather a Machine Learning system that can employ many ML methods depending on suitability. It is built on the following open source components: HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow. So, it’s a conventional Big Data system that incorporates Machine Learning components for its analytics:

Michelangelo is built on top of Uber’s data and compute infrastructure, providing a data lake that stores all of Uber’s transactional and logged data, Kafka brokers that aggregate logged messages from all Uber’s services, a Samza streaming compute engine, managed Cassandra clusters, and Uber’s in-house service provisioning and deployment tools.

The architecture supports the following workflow:

Manage data
Train models
Evaluate models
Deploy, predict and monitor

Uber’s Michelangelo architecture is depicted as follows:

I am going to skip over the usual Big Data architecture concerns and point out some notable ideas that relate more to machine learning.

Michelangelo divides the management of data between online and offline pipelines. In addition, to permit knowledge sharing and reuse across the organization, a “feature store” is made available:

At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily.
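To make the feature store idea concrete, here is a minimal in-memory sketch. It is not Michelangelo’s implementation; the `FeatureStore` class, its methods and the feature names are all hypothetical, chosen only to show how named features can be published once and then looked up by several models.

```python
from typing import Any, Dict, List


class FeatureStore:
    """Toy in-memory feature store: named features shared across models."""

    def __init__(self) -> None:
        # feature name -> {entity id -> latest value}
        self._features: Dict[str, Dict[str, Any]] = {}

    def publish(self, name: str, values: Dict[str, Any]) -> None:
        """Add a feature or refresh its values, e.g. from a daily batch job."""
        self._features.setdefault(name, {}).update(values)

    def lookup(self, entity_id: str, names: List[str]) -> Dict[str, Any]:
        """Return the requested features for one entity (None when absent)."""
        return {n: self._features.get(n, {}).get(entity_id) for n in names}


store = FeatureStore()
# Hypothetical feature names, used for illustration only.
store.publish("restaurant.avg_prep_time_min", {"r1": 12.5, "r2": 20.0})
store.publish("restaurant.orders_last_7d", {"r1": 340})

# Different models can request overlapping features by name.
row = store.lookup("r1", ["restaurant.avg_prep_time_min", "restaurant.orders_last_7d"])
print(row)
```

A production store would add persistence, freshness guarantees and separate online and offline access paths; the sketch only shows the sharing-by-name contract.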

Uber created a Domain Specific Language (DSL) for modelers to select, transform and combine features prior to sending a model to training and prediction. Currently supported ML methods are decision trees, linear and logistic models, k-means, time-series and deep neural networks.

The model configuration specifies the model type, hyper-parameters, data source references, the feature DSL expressions and compute resource requirements (e.g. CPUs, memory, use of GPUs). Training is performed in either a YARN or Mesos cluster.
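A configuration of this kind could be captured in a small typed structure. The sketch below is an assumption about shape only: the field names, the placeholder data path and the feature names are hypothetical, and the feature entries are plain column names rather than Uber’s actual DSL syntax.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ComputeResources:
    cpus: int = 4
    memory_gb: int = 16
    use_gpu: bool = False


@dataclass
class ModelConfig:
    model_type: str                    # e.g. "decision_tree", "logistic"
    hyperparameters: Dict[str, float]
    data_source: str                   # reference to the training data
    feature_expressions: List[str]     # stand-ins for feature DSL expressions
    resources: ComputeResources = field(default_factory=ComputeResources)


config = ModelConfig(
    model_type="decision_tree",
    hyperparameters={"max_depth": 8, "min_samples_leaf": 20},
    data_source="hdfs://example/training/trips",  # placeholder path
    feature_expressions=["trip_distance", "hour_of_day"],
)

# A plain dict is easy to store next to the trained model and its report.
config_dict = asdict(config)
print(config_dict)
```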

After model training, performance metrics are calculated and provided in an evaluation report. All of this information (the model configuration, the learned model and the evaluation report) is stored in a versioned model repository for analysis and deployment. The model information contains:

Who trained the model
Start and end time of the training job
Full model configuration (features used, hyper-parameter values, etc.)
Reference to training and test data sets
Distribution and relative importance of each feature
Model accuracy metrics
Standard charts and graphs for each model type (e.g. ROC curve, PR curve, and confusion matrix for a binary classifier)
Full learned parameters of the model
Summary statistics for model visualization
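The list above maps naturally onto a record type plus a repository that assigns version numbers. This is a minimal sketch under my own assumptions: `ModelRecord`, `ModelRepository` and every field name are hypothetical, and only a subset of the listed items is modeled.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ModelRecord:
    owner: str                     # who trained the model
    started_at: datetime           # start of the training job
    finished_at: datetime          # end of the training job
    config: Dict[str, Any]         # full model configuration
    dataset_refs: Dict[str, str]   # references to training and test data
    metrics: Dict[str, float]      # model accuracy metrics
    parameters: Any                # full learned parameters
    version: int = 0               # assigned by the repository


class ModelRepository:
    """Keeps every registered record; nothing is overwritten."""

    def __init__(self) -> None:
        self._history: Dict[str, List[ModelRecord]] = {}

    def register(self, name: str, record: ModelRecord) -> ModelRecord:
        history = self._history.setdefault(name, [])
        stored = replace(record, version=len(history) + 1)
        history.append(stored)
        return stored

    def get(self, name: str, version: Optional[int] = None) -> ModelRecord:
        history = self._history[name]
        return history[-1] if version is None else history[version - 1]


repo = ModelRepository()
base = ModelRecord(
    owner="alice",
    started_at=datetime(2018, 1, 1, 9, 0),
    finished_at=datetime(2018, 1, 1, 10, 30),
    config={"model_type": "decision_tree"},
    dataset_refs={"train": "placeholder/train", "test": "placeholder/test"},
    metrics={"auc": 0.81},
    parameters=[0.1, 0.2],
)
first = repo.register("eta_model", base)
second = repo.register("eta_model", replace(base, metrics={"auc": 0.84}))
print(first.version, second.version, repo.get("eta_model").metrics)
```

Keeping old versions around is what makes comparison and rollback possible later.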

The idea is to democratize access to ML models, sharing them with others to improve organizational knowledge. The unique feature of Uber’s approach is the surfacing of a “Feature Store” that allows many different parties to share their data across different ML models.

Google’s TFX paper is structured similarly to Uber’s paper in that it covers the same workflow:

Manage data — Data Analysis, Transformation and Validation
Train models — Model Training: Warm-Starting and Model Specification
Evaluate models — Model Evaluation and Validation
Deploy, predict and monitor — Model Serving

Google’s architecture is driven by the following stated high-level guidelines:

Capture Data Anomalies early.
Automate data validation.
Treat data errors with the same rigor as code.
Support continuous training.
Uniform configuration to improve sharing.
Reliable and scalable production deployment and serving.

Let’s dig a little deeper into the unique capabilities of Google’s TFX. There are plenty of tidbits of wisdom as well as an introduction of several unique capabilities.

TFX provides several capabilities in the scope of data management. Data analysis computes statistics on each dataset, providing information about value distributions, quantiles, mean, standard deviation, etc. The idea is that this allows users to quickly gain insight into the shape of a dataset. This automated analysis is used to improve the continuous training and serving environment.
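As a rough illustration of the kind of per-feature summary described here, the following sketch uses only Python’s standard `statistics` module. It is not TFX code; the `describe` helper is a hypothetical name.

```python
import statistics
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, object]:
    """Summarize one numeric feature: count, range, mean, spread, quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "quartiles": (q1, median, q3),
    }


summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

Computing the same summary for every new batch makes it possible to compare today’s data against yesterday’s before any training starts.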

TFX handles the data wrangling and stores the transformations to maintain consistency. Furthermore, the system provides a uniform and consistent framework for managing feature-to-integer mappings.
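A feature-to-integer mapping only stays consistent if the exact mapping built at training time is saved and reloaded at serving time. Here is a minimal sketch of that idea; the `Vocabulary` class is hypothetical and is not the TFX API.

```python
import json
from typing import Dict, Iterable, Optional


class Vocabulary:
    """Maps string feature values to integers; 0 is reserved for unseen values."""

    OOV = 0

    def __init__(self, index: Optional[Dict[str, int]] = None) -> None:
        self._index: Dict[str, int] = dict(index or {})

    @classmethod
    def fit(cls, values: Iterable[str]) -> "Vocabulary":
        # Sorting makes the mapping deterministic for a given set of values.
        return cls({v: i for i, v in enumerate(sorted(set(values)), start=1)})

    def encode(self, value: str) -> int:
        return self._index.get(value, self.OOV)

    def to_json(self) -> str:
        return json.dumps(self._index, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Vocabulary":
        return cls(json.loads(text))


# Training time: build the mapping and persist it with the model.
train_vocab = Vocabulary.fit(["sf", "nyc", "sf", "la"])
saved = train_vocab.to_json()

# Serving time: reload the stored mapping instead of rebuilding it.
serving_vocab = Vocabulary.from_json(saved)
print(serving_vocab.encode("sf"), serving_vocab.encode("tokyo"))
```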

TFX provides a versioned schema that specifies the expectations on the data. This schema is used to flag any anomalies found and also to recommend actions such as blocking training or deprecating features. The tooling can auto-generate this schema, which makes it easy to adopt in new projects. This is a unique capability that draws inspiration from the static type checking found in programming languages.
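The analogy with static type checking can be sketched in a few lines: infer a schema from known-good data, then check new data against it. The functions below are hypothetical and far simpler than what TFX does (no versioning, no value ranges, no recommended actions), but they show the inference and validation steps.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Schema = Dict[str, Dict[str, Any]]


def infer_schema(rows: List[Row]) -> Schema:
    """Record each feature's type and whether it appears in every row."""
    schema: Schema = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, {"type": type(value), "required": True})
    for name, spec in schema.items():
        spec["required"] = all(name in row for row in rows)
    return schema


def find_anomalies(rows: List[Row], schema: Schema) -> List[str]:
    """List every way the rows deviate from the schema."""
    anomalies: List[str] = []
    for i, row in enumerate(rows):
        for name, spec in schema.items():
            if name not in row:
                if spec["required"]:
                    anomalies.append(f"row {i}: missing required feature '{name}'")
            elif not isinstance(row[name], spec["type"]):
                anomalies.append(
                    f"row {i}: feature '{name}' has type {type(row[name]).__name__}, "
                    f"expected {spec['type'].__name__}"
                )
        for name in row:
            if name not in schema:
                anomalies.append(f"row {i}: unexpected feature '{name}'")
    return anomalies


training = [{"age": 30, "city": "sf"}, {"age": 41, "city": "nyc"}]
schema = infer_schema(training)

incoming = [{"age": "30", "city": "sf"}, {"city": "la", "zip": "94107"}]
anomalies = find_anomalies(incoming, schema)
for a in anomalies:
    print(a)
```

A pipeline could then refuse to train when the anomaly list is non-empty, which is the “blocking training” action mentioned above.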

TFX uses TensorFlow as its model description. TFX has a notion of ‘warm-starting’ that is inspired by the transfer learning technique found in Deep Learning. The idea is to reduce the amount of training by leveraging existing training. Unlike transfer learning, which employs an existing pre-trained network, warm-starting selectively identifies a general features network as its starting point. The network that is trained on general features is used as the basis for training more specialized networks. This feature appears to be implemented in TF-Slim.
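One way to read warm-starting is as selective parameter initialization: parameters belonging to the general part of the network are copied from an already trained model, and everything else starts fresh. The sketch below uses plain dictionaries instead of TensorFlow variables; the `warm_start` function and the parameter names are hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def warm_start(new_params: Params, trained_params: Params,
               prefixes: Sequence[str]) -> Params:
    """Copy trained values for parameters whose names match a prefix.

    Parameters that do not match, are missing from the trained model,
    or have a different size keep their fresh initial values.
    """
    result: Params = {}
    for name, value in new_params.items():
        reuse = (
            name.startswith(tuple(prefixes))
            and name in trained_params
            and len(trained_params[name]) == len(value)
        )
        result[name] = list(trained_params[name]) if reuse else list(value)
    return result


trained = {"embedding/general": [0.5, -0.3], "head/old_task": [9.9]}
fresh = {"embedding/general": [0.0, 0.0], "head/new_task": [0.1]}

# Reuse only the general embedding; the new head trains from scratch.
initial = warm_start(fresh, trained, prefixes=["embedding/"])
print(initial)
```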

TFX uses a common high-level TensorFlow specification (see: TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks) to provide uniformity and encode best practices across different implementations. See this article on Estimators for more detail.

TFX uses the TensorFlow Serving framework for deployment and serving. The framework allows different models to be served while keeping the same architecture and API. TensorFlow Serving provides “soft model-isolation” to allow multi-tenant deployment of models. The framework is also designed to support scalable inference.
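The “same architecture and API for different models” point can be shown with a toy server. This sketch does not use TensorFlow Serving and does not attempt model isolation; `ModelServer`, its methods and the model names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Predictor = Callable[[List[float]], float]


class ModelServer:
    """Serves any number of named, versioned models behind one predict() call."""

    def __init__(self) -> None:
        self._models: Dict[str, Dict[int, Predictor]] = {}

    def load(self, name: str, version: int, predictor: Predictor) -> None:
        self._models.setdefault(name, {})[version] = predictor

    def predict(self, name: str, features: List[float],
                version: Optional[int] = None) -> float:
        versions = self._models[name]
        chosen = max(versions) if version is None else version  # default: latest
        return versions[chosen](features)


server = ModelServer()
server.load("eta", 1, lambda x: sum(x))        # stand-in for a trained model
server.load("eta", 2, lambda x: 2 * sum(x))    # newer version of the same model
server.load("fraud", 1, lambda x: max(x))      # unrelated model, same API

print(server.predict("eta", [1.0, 2.0]))             # latest version
print(server.predict("eta", [1.0, 2.0], version=1))  # pinned version
print(server.predict("fraud", [0.1, 0.9]))
```

Because callers only ever see `predict`, a model can be retrained and redeployed without changing client code.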

The TFX paper mentions the need to optimize the deserialization of models. Apparently, a customized protocol buffer parser was created that improved performance by a factor of 2–5.

Dissecting Uber and Google’s internal architectures provides good insight into the pain points and solutions involved in building your own internal platform. Compared to available open source DL frameworks, there is a greater emphasis on managing and sharing meta-information. Google’s approach also demands additional effort to ensure uniformity as well as automated validation. These are practices that we have seen previously in conventional software engineering projects.

Software engineering practices such as Test Driven Development (TDD), continuous integration, rollback and recovery, change control, etc. are being introduced into advanced machine learning practices. It is not enough for a specialist to develop in a Jupyter notebook and throw it over the wall to a team to make operational. The same end-to-end DevOps practices that we find today in the best engineering companies are also going to be demanded in machine learning endeavors. We see this today in both Uber and Google, and thus we should expect it in any sustainable ML/DL practice.
