Cloud ML Super Quick Tour

cloud.google.com

Background

Google Cloud ML is now available as a Beta release (as of 2016/10/11). Put very simply, Cloud ML lets you do the following two things:

(1) Train your custom TensorFlow models on GCP.
(2) Serve a prediction API with your custom models.

Custom model training also offers useful features such as hyper-parameter tuning and distributed training, but in this post I will show only the minimum steps needed to migrate an existing TensorFlow model to Cloud ML. As an example, I will use the following code, which classifies the MNIST dataset with a single layer neural network.

・MNIST single layer network.ipynb

Modification to the existing code

First, you have to create a library by putting all files in a single directory. If you have a single executable file 'task.py', your library directory is something like this:

trainer/
├── __init__.py   # Empty file
└── task.py       # Executable file

The names of the library directory and the executable file can be arbitrary.
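As a rough sketch, the same layout can be created from Python. The helper name create_trainer_package and its parameters are hypothetical, and the sketch assumes Python 3 (for the exist_ok argument); it only creates empty files and does not write any training code.

```python
import os
import tempfile


def create_trainer_package(base_dir, package_name="trainer", module_name="task.py"):
    """Create <base_dir>/<package_name>/ containing __init__.py and the module file."""
    package_dir = os.path.join(base_dir, package_name)
    os.makedirs(package_dir, exist_ok=True)
    for filename in ("__init__.py", module_name):
        path = os.path.join(package_dir, filename)
        # Append mode creates the file if it is missing and keeps existing content.
        with open(path, "a"):
            pass
    return package_dir


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()  # throwaway directory for the demonstration
    print(sorted(os.listdir(create_trainer_package(demo_root))))
```

The empty __init__.py is what makes the directory importable as a package, which is why a module inside it can be run with 'python -m trainer.task'.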

Then you will add the following code at the end of the executable file:

if __name__ == '__main__':
    tf.app.run()

The run() method at the end implicitly calls the main() function.

You also need to use Cloud Storage to exchange files with the runtime environment, which you do by specifying Cloud Storage URIs ("gs://...") as file paths in your code. Since you will usually test the code on your local machine before submitting it to Cloud ML, it is a good idea to make the file paths configurable through command line options. The following are the typical directories you need to consider:

Directory to store checkpoint files during the training.
Directory to store the trained model binary (The filename should be 'export'.)
Directory to store log data for TensorBoard.
Directory to store training data.

Note that you don't necessarily have to use Cloud Storage for the training data. You can use other data sources such as Cloud Dataflow as training data.
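Because the same option may receive either a local path or a "gs://..." URI, small path helpers can be handy. The following is a minimal sketch using only string operations; the function names is_gcs_path, split_gcs_path, and join_path are hypothetical and not part of TensorFlow or the Cloud ML SDK.

```python
def is_gcs_path(path):
    """Return True if the path is a Cloud Storage URI."""
    return path.startswith("gs://")


def split_gcs_path(path):
    """Split 'gs://bucket/a/b' into ('bucket', 'a/b')."""
    if not is_gcs_path(path):
        raise ValueError("not a Cloud Storage URI: %r" % path)
    bucket, _, object_name = path[len("gs://"):].partition("/")
    return bucket, object_name


def join_path(base, *parts):
    """Join components with '/', which suits both gs:// URIs and POSIX paths."""
    return "/".join([base.rstrip("/")] + [p.strip("/") for p in parts])


print(join_path("gs://project01-mldata/job01", "model", "export"))
print(join_path("/tmp/model", "export"))
```

On Linux, os.path.join also joins with '/', so it produces the same results for these inputs.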

In this example, I will make the entry point of my code 'task.py' like this:

def main(_):
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')  # Checkpoint directory
    parser.add_argument('--model_dir', type=str, default='/tmp/model')  # Model directory
    parser.add_argument('--train_step', type=int, default=2000)         # Training steps
    args, _ = parser.parse_known_args()
    run_training(args)

if __name__ == '__main__':
    tf.app.run()

This enables users to specify directories for checkpoint files and model binary with the command line options '--train_dir' and '--model_dir'. In addition, the users can specify the number of training iterations with '--train_step'. In this example, the training data is directly fetched from the Internet using the TensorFlow library.
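Note that main() uses parse_known_args() rather than parse_args(). The difference is that unrecognized options are returned in a separate list instead of causing an error. The sketch below shows this behavior with plain argparse; the flag '--some_other_flag' is made up for illustration, and whether the Cloud ML runtime actually passes extra flags is an assumption on my part.

```python
import argparse


def parse_flags(argv):
    """Parse the three options used in task.py and collect anything else."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')
    parser.add_argument('--model_dir', type=str, default='/tmp/model')
    parser.add_argument('--train_step', type=int, default=2000)
    # Returns (namespace, list_of_unrecognized_arguments).
    return parser.parse_known_args(argv)


args, unknown = parse_flags(['--train_step=200', '--some_other_flag=1'])
print(args.train_step, args.train_dir, unknown)
# 200 /tmp/train ['--some_other_flag=1']
```

With parse_args(), the same command line would exit with an "unrecognized arguments" error.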

In addition, as a requirement specific to Cloud ML, you have to specify the input/output objects for the prediction API service using TensorFlow's Collection object. A Collection is a generic object for storing arbitrary key-value style data. In Cloud ML, you store Placeholders as API inputs under the key 'inputs', and prediction value objects as API outputs under the key 'outputs', like this:

input_key = tf.placeholder(tf.int64, [None,])
x = tf.placeholder(tf.float32, [None, 784])

inputs = {'key': input_key.name, 'image': x.name}
tf.add_to_collection('inputs', json.dumps(inputs))

p = tf.nn.softmax(tf.matmul(hidden1, w0) + b0)
output_key = tf.identity(input_key)

outputs = {'key': output_key.name, 'scores': p.name}
tf.add_to_collection('outputs', json.dumps(outputs))

More precisely, you create dictionaries containing the name attributes of input/output objects and store JSON serialization of them in the Collection object using the tf.add_to_collection() method. The keys in the dictionaries are used as the name attributes in the API request/response. In this case, in addition to the input image 'x' and the prediction result 'p' (list of probabilities for each category), 'input_key' and 'output_key' are included in the input/output objects. The 'output_key' simply returns the same value as the 'input_key'. When you send multiple entries to the prediction API, you can match the entries in the response using these key values.
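To make the alias-to-name mapping concrete, here is a sketch in plain Python without TensorFlow. The tensor names ('key:0', 'Placeholder:0', and so on) are illustrative stand-ins for the real name attributes, and resolve_feed is a hypothetical helper: how the prediction service consumes the collections internally is not described in this post.

```python
import json

# Illustrative names; the real values come from input_key.name, x.name, etc.
inputs = {'key': 'key:0', 'image': 'Placeholder:0'}
outputs = {'key': 'Identity:0', 'scores': 'Softmax:0'}

# Each collection stores one JSON string, as passed to tf.add_to_collection().
inputs_json = json.dumps(inputs)
outputs_json = json.dumps(outputs)


def resolve_feed(alias_map_json, instance):
    """Turn a request instance keyed by API alias into a dict keyed by tensor name."""
    alias_map = json.loads(alias_map_json)
    missing = set(alias_map) - set(instance)
    if missing:
        raise KeyError('instance is missing fields: %s' % sorted(missing))
    return {alias_map[alias]: instance[alias] for alias in alias_map}


feed = resolve_feed(inputs_json, {'key': 0, 'image': [0.0] * 784})
print(sorted(feed))
```

The point is that API clients only ever see the aliases 'key', 'image', and 'scores', while the stored JSON ties each alias to a concrete object in the graph.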

That's all. The following is the modified code considering what I have explained so far:

task.py

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
import argparse, os, json
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

def run_training(args):
    # Define filepath for checkpoint and final model
    checkpoint_path = os.path.join(args.train_dir, 'checkpoint')
    model_path = os.path.join(args.model_dir, 'export') # Filename should be 'export'.
    num_units = 1024
    
    x = tf.placeholder(tf.float32, [None, 784])
    
    w1 = tf.Variable(tf.truncated_normal([784, num_units]))
    b1 = tf.Variable(tf.zeros([num_units]))
    hidden1 = tf.nn.relu(tf.matmul(x, w1) + b1)
    
    w0 = tf.Variable(tf.zeros([num_units, 10]))
    b0 = tf.Variable(tf.zeros([10]))
    p = tf.nn.softmax(tf.matmul(hidden1, w0) + b0)
    
    t = tf.placeholder(tf.float32, [None, 10])
    loss = -tf.reduce_sum(t * tf.log(p))
    train_step = tf.train.AdamOptimizer().minimize(loss)
    correct_prediction = tf.equal(tf.argmax(p, 1), tf.argmax(t, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # Define key element
    input_key = tf.placeholder(tf.int64, [None,], name='key')
    output_key = tf.identity(input_key)

    # Define API inputs/outputs object
    inputs = {'key': input_key.name, 'image': x.name}
    outputs = {'key': output_key.name, 'scores': p.name}
    tf.add_to_collection('inputs', json.dumps(inputs))
    tf.add_to_collection('outputs', json.dumps(outputs))
    
    saver = tf.train.Saver()
    sess = tf.InteractiveSession()
    sess.run(tf.initialize_all_variables())

    for i in range(1, args.train_step + 1):
        batch_xs, batch_ts = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, t: batch_ts})
        if i % 100 == 0:
            loss_val, acc_val = sess.run([loss, accuracy],
                feed_dict={x:mnist.test.images, t: mnist.test.labels})
            print ('Step: %d, Loss: %f, Accuracy: %f'
                   % (i, loss_val, acc_val))
            saver.save(sess, checkpoint_path, global_step=i)

    # Export the final model.
    saver.save(sess, model_path)


def main(_):
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')  # Checkpoint directory
    parser.add_argument('--model_dir', type=str, default='/tmp/model')  # Model directory
    parser.add_argument('--train_step', type=int, default=2000)         # Training steps
    args, _ = parser.parse_known_args()
    run_training(args)


if __name__ == '__main__':
    tf.app.run()
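The loss in task.py is a cross-entropy summed over all examples and classes, not averaged. The following sketch reproduces the same formula in plain Python, without TensorFlow, to show how the value scales with the number of examples. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax for one example."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summed_cross_entropy(probs_batch, onehot_batch):
    """Compute -sum(t * log(p)) over every example and class."""
    return -sum(t * math.log(p)
                for probs, onehot in zip(probs_batch, onehot_batch)
                for p, t in zip(probs, onehot)
                if t)  # classes with t == 0 contribute nothing, so skip them


# With w0 and b0 initialized to zeros, the untrained model outputs a uniform
# distribution over the 10 classes, whatever the input is.
uniform = softmax([0.0] * 10)
label = [0.0] * 7 + [1.0] + [0.0] * 2   # one-hot label for digit 7
print(summed_cross_entropy([uniform] * 100, [label] * 100))  # about 100 * ln(10)
```

Because the loss is a sum, the values printed during training (which are evaluated on the 10,000 test images) look large; dividing by 10,000 gives the per-image value.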

Running the code on Cloud ML

To submit a job to Cloud ML, you need a local machine with the Cloud ML SDK installed, or you can use Cloud Shell as your local environment. I will use Cloud Shell here. Please refer to the official documentation for other environments.

First, create a new project and enable the Cloud ML API through the API Manager. Then launch Cloud Shell and install the SDK.

$ curl https://storage.googleapis.com/cloud-ml/scripts/setup_cloud_shell.sh | bash
$ export PATH=${HOME}/.local/bin:${PATH}
$ curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python
Success! Your environment is configured correctly.

The following command grants the project's 'editor' permission to a service account. This is necessary to submit jobs using the service account.

$ gcloud beta ml init-project

Prepare the TensorFlow code (which I explained in the previous section) in the 'trainer' directory under your home directory.

trainer/
├── __init__.py   # Empty file
└── task.py       # Executable file

Before submitting a job, run the code in the local environment with a small number of iterations to check that there are no obvious mistakes.

$ mkdir -p /tmp/train /tmp/model
$ cd $HOME
$ python -m trainer.task --train_step=200
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Step: 100, Loss: 3183.995850, Accuracy: 0.903500
Step: 200, Loss: 2237.709229, Accuracy: 0.934500

$ ls -l /tmp/train /tmp/model/
/tmp/model/:
total 9584
-rw-r--r-- 1 enakai enakai     203 Oct  5 17:14 checkpoint
-rw-r--r-- 1 enakai enakai 9770436 Oct  5 17:14 export
-rw-r--r-- 1 enakai enakai   35514 Oct  5 17:14 export.meta
/tmp/train:
total 28744
-rw-r--r-- 1 enakai enakai     163 Oct  5 17:14 checkpoint
-rw-r--r-- 1 enakai enakai 9770436 Oct  5 17:14 checkpoint-100
-rw-r--r-- 1 enakai enakai   35514 Oct  5 17:14 checkpoint-100.meta
-rw-r--r-- 1 enakai enakai 9770436 Oct  5 17:14 checkpoint-200
-rw-r--r-- 1 enakai enakai   35514 Oct  5 17:14 checkpoint-200.meta

Looks good. Now let's run the code on the cloud. First, create a Cloud Storage bucket to store data. The bucket name can be arbitrary, but by convention it is a good idea to include the project ID.

$ PROJECT_ID=project01 # your project ID
$ TRAIN_BUCKET="gs://$PROJECT_ID-mldata"
$ gsutil mb $TRAIN_BUCKET

Choose a job name ('job01' in this example), and submit the job to Cloud ML.

$ JOB_NAME="job01"
$ touch .dummy
$ gsutil cp .dummy $TRAIN_BUCKET/$JOB_NAME/train/
$ gsutil cp .dummy $TRAIN_BUCKET/$JOB_NAME/model/
$ gcloud beta ml jobs submit training $JOB_NAME \
  --region=us-central1 \
  --package-path=trainer --module-name=trainer.task \
  --staging-bucket=$TRAIN_BUCKET \
  -- \
  --train_dir="$TRAIN_BUCKET/$JOB_NAME/train" \
  --model_dir="$TRAIN_BUCKET/$JOB_NAME/model"

createTime: '2016-10-05T08:53:35Z'
jobId: job01
state: QUEUED
trainingInput:
  args:
  - --train_dir=gs://project01/job01/train
  - --model_dir=gs://project01/job01/model
  packageUris:
  - gs://project01/cloudmldist/1475657612/trainer-0.0.0.tar.gz
  pythonModule: trainer.task
  region: us-central1

A folder named 'cloudmldist' is created under the bucket specified with '--staging-bucket', and your code is placed under it. Cloud ML then starts executing the code. In the steps above, the folders for the checkpoint files and the model binary are created explicitly by copying a dummy file with the gsutil command. You can automate this in your code if you prefer.
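The Cloud Storage layout built by the shell variables above can be summarized with a small helper. This is only a sketch: the function job_paths and the dictionary keys are hypothetical, and the '-mldata' suffix simply mirrors the bucket name chosen in this post.

```python
def job_paths(project_id, job_name, bucket_suffix="-mldata"):
    """Derive the bucket and per-job URIs from the project ID and job name."""
    bucket = "gs://%s%s" % (project_id, bucket_suffix)
    job_root = "%s/%s" % (bucket, job_name)
    return {
        "bucket": bucket,                         # value of --staging-bucket
        "train_dir": job_root + "/train",         # value of --train_dir
        "model_dir": job_root + "/model",         # value of --model_dir
        "staging_dir": bucket + "/cloudmldist",   # folder created for the uploaded package
    }


for name, uri in sorted(job_paths("project01", "job01").items()):
    print("%-12s %s" % (name, uri))
```

Keeping every job under its own folder means that checkpoints and exported models from different jobs never overwrite each other.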

Monitor the job execution with the following command:

$ watch -n1 gcloud beta ml jobs describe --project $PROJECT_ID $JOB_NAME
createTime: '2016-10-05T08:53:35Z'
jobId: job01
startTime: '2016-10-05T08:53:45Z'
state: RUNNING
trainingInput:
  args:
  - --train_dir=gs://project01/job01/train
  - --model_dir=gs://project01/job01/model
  packageUris:
  - gs://project01/cloudmldist/1475657612/trainer-0.0.0.tar.gz
  pythonModule: trainer.task
  region: us-central1

The 'state' becomes 'SUCCEEDED' when the job has been completed. You can see the stdout/stderr logs on the Stackdriver's log management console by selecting the 'Cloud Machine Learning' log.
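If you prefer a script over 'watch', the same describe command can be polled from Python. The sketch below parses the 'state:' line from output in the format shown above. The function names are hypothetical, and treating FAILED and CANCELLED as terminal states alongside SUCCEEDED is my assumption; this post only shows QUEUED, RUNNING, and SUCCEEDED.

```python
import subprocess
import time

# Assumed terminal states; only SUCCEEDED is mentioned in this post.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def parse_job_state(describe_output):
    """Return the value of the top-level 'state:' line, or None if absent."""
    for line in describe_output.splitlines():
        if line.startswith("state:"):
            return line.split(":", 1)[1].strip()
    return None


def describe_job(project_id, job_name):
    """Run the describe command used in this post and return its stdout."""
    return subprocess.check_output(
        ["gcloud", "beta", "ml", "jobs", "describe",
         "--project", project_id, job_name],
        universal_newlines=True)


def wait_for_job(fetch_output, interval_sec=10, sleep=time.sleep):
    """Call fetch_output() repeatedly until the parsed state is terminal."""
    while True:
        state = parse_job_state(fetch_output())
        if state in TERMINAL_STATES:
            return state
        sleep(interval_sec)
```

A typical call would be wait_for_job(lambda: describe_job("project01", "job01")). Passing the fetch function as an argument keeps the polling loop testable without the gcloud command.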

On successful job completion, the model binary 'export' is created as below:

$ gsutil ls $TRAIN_BUCKET/$JOB_NAME/model/export*
gs://project01/job01/model/export
gs://project01/job01/model/export.meta

Serving your model with the prediction API

Now you can start the prediction API service using the trained model binary 'export' by executing the following commands:

$ MODEL_NAME="MNIST"
$ gcloud beta ml models create $MODEL_NAME
$ gcloud beta ml versions create \
  --origin=$TRAIN_BUCKET/$JOB_NAME/model --model=$MODEL_NAME v1
$ gcloud beta ml versions set-default --model=$MODEL_NAME v1

The shell variable 'MODEL_NAME' holds the model name. You can manage multiple versions of a model; in this case, you created a service with the 'v1' version of the model and made it the default version.

You need to wait a few minutes until the service becomes available. In the meantime, let's create a test dataset with the following Python script:

import json
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
with open("data.json", "w") as file:
    for i in range(10):
        data = {"image": mnist.test.images[i].tolist(), "key": i}
        file.write(json.dumps(data)+'\n')

It generates a JSON file 'data.json' containing one image and key pair per line. You can submit the data to the prediction API with the following command:

$ gcloud beta ml predict --model=${MODEL_NAME} --json-instances=data.json
predictions:
- key: 0
  scores:
  - 2.53733e-08
  - 6.47722e-09
  - 2.23573e-06
  - 5.32844e-05
  - 3.08012e-10
  - 1.33022e-09
  - 1.55983e-11
  - 0.99991
  - 4.39428e-07
  - 3.38841e-05
- key: 1
  scores:
  - 1.98303e-08
  - 2.84799e-07
  - 0.999985
  - 1.47131e-05
  - 1.45546e-13
  - 1.90945e-09
  - 3.50033e-09
  - 2.24941e-18
  - 2.60025e-07
  - 1.45738e-14
- key: 2
  scores:
  - 3.63027e-09
...

You can see the response on the command line. Please refer to the official document for URLs to directly submit REST requests.
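To turn the scores into predicted digits, you can take the index of the highest score for each entry and look it up by its key. The following sketch uses the first two entries of the response above, written out as Python data; the function name predicted_labels is hypothetical, and parsing the actual command output is left out.

```python
def predicted_labels(predictions):
    """Map each 'key' to the index of its highest score."""
    return {p["key"]: max(range(len(p["scores"])), key=p["scores"].__getitem__)
            for p in predictions}


# The first two entries of the response shown above.
response = [
    {"key": 0, "scores": [2.53733e-08, 6.47722e-09, 2.23573e-06, 5.32844e-05,
                          3.08012e-10, 1.33022e-09, 1.55983e-11, 0.99991,
                          4.39428e-07, 3.38841e-05]},
    {"key": 1, "scores": [1.98303e-08, 2.84799e-07, 0.999985, 1.47131e-05,
                          1.45546e-13, 1.90945e-09, 3.50033e-09, 2.24941e-18,
                          2.60025e-07, 1.45738e-14]},
]
print(predicted_labels(response))  # {0: 7, 1: 2}
```

Since 'key' is the index i used when writing 'data.json', the result can be compared directly with the label of mnist.test.labels[i].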

Note on the distributed training

In this example, I used sample code written with the low level TensorFlow APIs, so you need to modify the code further, following the Distributed TensorFlow documentation, if you want to distribute the training job on Cloud ML. Unfortunately, it's not a trivial change. Some basic points are explained in the following article.

enakai00.hatenablog.com

But don't worry. The TensorFlow team is planning to provide high level TensorFlow APIs so that you can write TensorFlow code that is automatically executed in a distributed manner on Cloud ML.

Stay tuned!

Disclaimer: All code snippets are released under Apache 2.0 License. This is not an official Google product.