Background
Google Cloud ML is now available as a Beta release (as of 2016/10/11). Put simply, Cloud ML lets you do the following two things.
(1) Train your custom TensorFlow models on GCP.
(2) Serve a prediction API with your custom models.
For custom model training, Cloud ML offers features such as hyperparameter tuning and distributed training, but in this post I will show only the minimum steps needed to migrate an existing TensorFlow model to Cloud ML. As an example, I will use the following code. It classifies the MNIST dataset with a neural network that has a single hidden layer.
Modification to the existing code
First, you have to create a library by putting all files in a single directory. If you have a single executable file 'task.py', your library directory looks something like this:
trainer/
├── __init__.py  # Empty file
└── task.py      # Executable file
The names of the library directory and the executable file are arbitrary.
Then add the following code at the end of the executable file:
if __name__ == '__main__':
    tf.app.run()
The run() method implicitly calls the main() function. You also need to use Cloud Storage to exchange files with the runtime environment, which you do by specifying Cloud Storage URIs ("gs://...") as file paths in your code. Since you will probably test the code on your local machine before submitting it to Cloud ML, it is best to write the code so that the file paths can be specified through command line options. These are the typical directories you need to consider:
- Directory to store checkpoint files during the training.
- Directory to store the trained model binary (The filename should be 'export'.)
- Directory to store log data for TensorBoard.
- Directory to store training data.
Note that you don't necessarily have to use Cloud Storage for the training data. You can also feed training data from other sources such as Cloud Dataflow.
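Because the same option may carry either a local path or a Cloud Storage URI, it can help to build file names in a way that works for both. The following is a minimal sketch and not part of the original code; the helper names `is_gcs_uri` and `join_path` and the bucket name `my-bucket` are hypothetical.

```python
import os
import posixpath


def is_gcs_uri(path):
    """Return True if the path is a Cloud Storage URI such as gs://bucket/dir."""
    return path.startswith('gs://')


def join_path(base, name):
    """Join a directory and a file name for local paths and gs:// URIs alike.

    Cloud Storage URIs always use '/', so posixpath is used for them
    regardless of the local operating system.
    """
    if is_gcs_uri(base):
        return posixpath.join(base, name)
    return os.path.join(base, name)


# 'my-bucket' is a made-up bucket name used only for illustration.
print(join_path('gs://my-bucket/job01/train', 'checkpoint'))
print(join_path('/tmp/model', 'export'))
```

On Linux and macOS, plain `os.path.join` gives the same result for both kinds of path; the distinction only matters where the local separator is not '/'.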
In this example, I will make the entry point of my code 'task.py' like this:
def main(_):
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')  # Checkpoint file
    parser.add_argument('--model_dir', type=str, default='/tmp/model')  # Model file
    parser.add_argument('--train_step', type=int, default=2000)         # Training steps
    args, _ = parser.parse_known_args()
    run_training(args)

if __name__ == '__main__':
    tf.app.run()
This enables users to specify directories for checkpoint files and model binary with the command line options '--train_dir' and '--model_dir'. In addition, the users can specify the number of training iterations with '--train_step'. In this example, the training data is directly fetched from the Internet using the TensorFlow library.
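The entry point calls `parse_known_args()` rather than `parse_args()`, so options the parser does not define are returned separately instead of causing an error. Here is a minimal sketch of that behavior using only the standard library; the wrapper `parse_options` and the extra flag `--some_other_flag` are made up for illustration.

```python
import argparse


def parse_options(argv):
    """Parse the three options used in this post; return (args, unknown)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')
    parser.add_argument('--model_dir', type=str, default='/tmp/model')
    parser.add_argument('--train_step', type=int, default=2000)
    return parser.parse_known_args(argv)


# '--some_other_flag' is not defined above, so it ends up in 'unknown'.
args, unknown = parse_options(['--train_step=200', '--some_other_flag=1'])
print(args.train_dir, args.model_dir, args.train_step)
print(unknown)
```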
One more point is specific to Cloud ML: you have to specify the input/output objects for the prediction API service using TensorFlow's Collection object. A Collection is a generic container for arbitrary key-value style data. In Cloud ML, you store the Placeholders that serve as API inputs under the key 'inputs', and the prediction value objects that serve as API outputs under the key 'outputs', like this:
input_key = tf.placeholder(tf.int64, [None,])
x = tf.placeholder(tf.float32, [None, 784])

inputs = {'key': input_key.name, 'image': x.name}
tf.add_to_collection('inputs', json.dumps(inputs))

p = tf.nn.softmax(tf.matmul(hidden1, w0) + b0)
output_key = tf.identity(input_key)

outputs = {'key': output_key.name, 'scores': p.name}
tf.add_to_collection('outputs', json.dumps(outputs))
More precisely, you create dictionaries containing the name attributes of the input/output objects and store their JSON serialization in the Collection object using the tf.add_to_collection() method. The keys in the dictionaries are used as the field names in the API request/response. In this case, in addition to the input image 'x' and the prediction result 'p' (a list of probabilities for each category), 'input_key' and 'output_key' are included in the input/output objects. The 'output_key' simply returns the same value as the 'input_key'. When you send multiple entries to the prediction API, you can match the entries in the response to the request using these key values.
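To make the stored value concrete, here is a sketch of the JSON strings that end up in the collections, written without TensorFlow. The tensor names such as `Placeholder:0` and `Softmax:0` are assumptions for illustration; the real names depend on how the graph is built.

```python
import json

# Assumed tensor names, for illustration only. In the real graph these
# come from the .name attribute of each placeholder or operation output.
inputs = {'key': 'key:0', 'image': 'Placeholder:0'}
outputs = {'key': 'Identity:0', 'scores': 'Softmax:0'}

# What is passed to tf.add_to_collection() is a JSON string, not a dict.
inputs_json = json.dumps(inputs)
outputs_json = json.dumps(outputs)

# The dictionary keys become the field names of an API request and response.
request_fields = sorted(json.loads(inputs_json))
response_fields = sorted(json.loads(outputs_json))
print(request_fields)
print(response_fields)
```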
That's all. The following is the modified code considering what I have explained so far:
task.py
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
import argparse, os, json

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

def run_training(args):
    # Define filepath for checkpoint and final model
    checkpoint_path = os.path.join(args.train_dir, 'checkpoint')
    model_path = os.path.join(args.model_dir, 'export')  # Filename should be 'export'.

    num_units = 1024

    x = tf.placeholder(tf.float32, [None, 784])

    w1 = tf.Variable(tf.truncated_normal([784, num_units]))
    b1 = tf.Variable(tf.zeros([num_units]))
    hidden1 = tf.nn.relu(tf.matmul(x, w1) + b1)

    w0 = tf.Variable(tf.zeros([num_units, 10]))
    b0 = tf.Variable(tf.zeros([10]))
    p = tf.nn.softmax(tf.matmul(hidden1, w0) + b0)

    t = tf.placeholder(tf.float32, [None, 10])
    loss = -tf.reduce_sum(t * tf.log(p))
    train_step = tf.train.AdamOptimizer().minimize(loss)
    correct_prediction = tf.equal(tf.argmax(p, 1), tf.argmax(t, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Define key element
    input_key = tf.placeholder(tf.int64, [None,], name='key')
    output_key = tf.identity(input_key)

    # Define API inputs/outputs object
    inputs = {'key': input_key.name, 'image': x.name}
    outputs = {'key': output_key.name, 'scores': p.name}
    tf.add_to_collection('inputs', json.dumps(inputs))
    tf.add_to_collection('outputs', json.dumps(outputs))

    saver = tf.train.Saver()
    sess = tf.InteractiveSession()
    sess.run(tf.initialize_all_variables())

    i = 0
    for _ in range(args.train_step):
        i += 1
        batch_xs, batch_ts = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, t: batch_ts})
        if i % 100 == 0:
            # Evaluate on the test set and save a checkpoint every 100 steps.
            loss_val, acc_val = sess.run([loss, accuracy],
                feed_dict={x:mnist.test.images, t: mnist.test.labels})
            print ('Step: %d, Loss: %f, Accuracy: %f'
                   % (i, loss_val, acc_val))
            saver.save(sess, checkpoint_path, global_step=i)

    # Export the final model.
    saver.save(sess, model_path)


def main(_):
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_dir', type=str, default='/tmp/train')  # Checkpoint directory
    parser.add_argument('--model_dir', type=str, default='/tmp/model')  # Model directory
    parser.add_argument('--train_step', type=int, default=2000)         # Training steps
    args, _ = parser.parse_known_args()
    run_training(args)


if __name__ == '__main__':
    tf.app.run()
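The loss in run_training() is the cross-entropy between the one-hot labels 't' and the softmax output 'p', summed over the batch. As a sketch in plain Python for a single example (the function names `softmax` and `cross_entropy` are illustrative and not part of the original code):

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(t, p):
    """Return -sum(t * log(p)) for one example, skipping zero labels."""
    return -sum(ti * math.log(pi) for ti, pi in zip(t, p) if ti > 0)


# With all-zero logits the prediction is uniform over the 10 digits,
# so the loss for any one-hot label equals log(10).
p = softmax([0.0] * 10)
t = [0.0] * 7 + [1.0] + [0.0] * 2  # One-hot label for the digit 7.
print(cross_entropy(t, p))
```

Since 'w0' and 'b0' are initialized to zeros in task.py, the model also starts from a uniform prediction, which is why the loss begins at about log(10) per example.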
Running the code on Cloud ML
To submit a job to Cloud ML, you need a local machine with the Cloud ML SDK installed, or you can use Cloud Shell as a local environment. I will use Cloud Shell here. Please refer to the official documentation for other environments.
First, create a new project and enable the Cloud ML API through the API Manager. Then launch Cloud Shell and install the SDK.
$ curl https://storage.googleapis.com/cloud-ml/scripts/setup_cloud_shell.sh | bash
$ export PATH=${HOME}/.local/bin:${PATH}
$ curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python
Success! Your environment is configured correctly.
The following command grants the 'editor' role on the project to a service account. This is necessary to submit jobs using the service account.
$ gcloud beta ml init-project
Prepare the TensorFlow code (which I explained in the previous section) in the 'trainer' directory under your home directory.
trainer/
├── __init__.py  # Empty file
└── task.py      # Executable file
Before submitting a job, run the code in the local environment with a small number of iterations to check that there are no obvious mistakes.
$ mkdir -p /tmp/train /tmp/model
$ cd $HOME
$ python -m trainer.task --train_step=200
Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Step: 100, Loss: 3183.995850, Accuracy: 0.903500
Step: 200, Loss: 2237.709229, Accuracy: 0.934500

$ ls -l /tmp/train /tmp/model/
/tmp/model/:
total 9584
-rw-r--r-- 1 enakai enakai     203 Oct 5 17:14 checkpoint
-rw-r--r-- 1 enakai enakai 9770436 Oct 5 17:14 export
-rw-r--r-- 1 enakai enakai   35514 Oct 5 17:14 export.meta

/tmp/train:
total 28744
-rw-r--r-- 1 enakai enakai     163 Oct 5 17:14 checkpoint
-rw-r--r-- 1 enakai enakai 9770436 Oct 5 17:14 checkpoint-100
-rw-r--r-- 1 enakai enakai   35514 Oct 5 17:14 checkpoint-100.meta
-rw-r--r-- 1 enakai enakai 9770436 Oct 5 17:14 checkpoint-200
-rw-r--r-- 1 enakai enakai   35514 Oct 5 17:14 checkpoint-200.meta
Looks good. Now let's run the code on the cloud. First, create a Cloud Storage bucket to store data. The bucket name can be arbitrary, but by convention it is a good idea to include the project name.
$ PROJECT_ID=project01 # your project ID
$ TRAIN_BUCKET="gs://$PROJECT_ID-mldata"
$ gsutil mb $TRAIN_BUCKET
Decide on a job name ('job01' in this example), and submit the job to Cloud ML.
$ JOB_NAME="job01"
$ touch .dummy
$ gsutil cp .dummy $TRAIN_BUCKET/$JOB_NAME/train/
$ gsutil cp .dummy $TRAIN_BUCKET/$JOB_NAME/model/
$ gcloud beta ml jobs submit training $JOB_NAME \
  --region=us-central1 \
  --package-path=trainer --module-name=trainer.task \
  --staging-bucket=$TRAIN_BUCKET \
  -- \
  --train_dir="$TRAIN_BUCKET/$JOB_NAME/train" \
  --model_dir="$TRAIN_BUCKET/$JOB_NAME/model"
createTime: '2016-10-05T08:53:35Z'
jobId: job01
state: QUEUED
trainingInput:
  args:
  - --train_dir=gs://project01/job01/train
  - --model_dir=gs://project01/job01/model
  packageUris:
  - gs://project01/cloudmldist/1475657612/trainer-0.0.0.tar.gz
  pythonModule: trainer.task
  region: us-central1
A folder named 'cloudmldist' is created in the bucket specified with '--staging-bucket', and your code is placed under it. Cloud ML then starts executing the code. In the steps above, you explicitly created the folders for the checkpoint files and the model binary with the gsutil command. You can automate this in your code if you prefer.
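The per-job layout used in the commands above ($TRAIN_BUCKET/$JOB_NAME/train and $TRAIN_BUCKET/$JOB_NAME/model) can also be computed in Python, for example when a script prepares several jobs. This is a minimal sketch; `job_dirs` and `trainer_args` are hypothetical helper names, and the bucket name follows the naming used in this post.

```python
def job_dirs(bucket, job_name):
    """Return the checkpoint and model directories for one training job."""
    base = '%s/%s' % (bucket.rstrip('/'), job_name)
    return {'train_dir': base + '/train', 'model_dir': base + '/model'}


def trainer_args(bucket, job_name):
    """Build the options passed to trainer.task after the '--' separator."""
    dirs = job_dirs(bucket, job_name)
    return ['--train_dir=' + dirs['train_dir'],
            '--model_dir=' + dirs['model_dir']]


for arg in trainer_args('gs://project01-mldata', 'job01'):
    print(arg)
```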
Monitor the job execution with the following command:
$ watch -n1 gcloud beta ml jobs describe --project $PROJECT_ID $JOB_NAME
createTime: '2016-10-05T08:53:35Z'
jobId: job01
startTime: '2016-10-05T08:53:45Z'
state: RUNNING
trainingInput:
  args:
  - --train_dir=gs://project01/job01/train
  - --model_dir=gs://project01/job01/model
  packageUris:
  - gs://project01/cloudmldist/1475657612/trainer-0.0.0.tar.gz
  pythonModule: trainer.task
  region: us-central1
The 'state' becomes 'SUCCEEDED' when the job has completed. You can see the stdout/stderr logs in Stackdriver's log management console by selecting the 'Cloud Machine Learning' log.
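If you want to check the state from a script instead of watching the terminal, the 'state' line can be picked out of the describe output as text. This is a minimal sketch that assumes the output has the same layout as shown above; `parse_job_state` is a hypothetical helper, and the sample text is abbreviated from that output.

```python
def parse_job_state(describe_output):
    """Return the value of the top-level 'state:' line, or None if absent."""
    for line in describe_output.splitlines():
        if line.startswith('state:'):
            return line.split(':', 1)[1].strip()
    return None


# Abbreviated copy of the 'describe' output shown in this post.
sample_output = """createTime: '2016-10-05T08:53:35Z'
jobId: job01
startTime: '2016-10-05T08:53:45Z'
state: RUNNING
"""

state = parse_job_state(sample_output)
print(state)
print(state == 'SUCCEEDED')
```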
On successful job completion, the model binary 'export' is created as below:
$ gsutil ls $TRAIN_BUCKET/$JOB_NAME/model/export*
gs://project01/job01/model/export
gs://project01/job01/model/export.meta
Serving your model with the prediction API
Now you can start the prediction API service using the trained model binary 'export' by executing the following commands:
$ MODEL_NAME="MNIST"
$ gcloud beta ml models create $MODEL_NAME
$ gcloud beta ml versions create \
  --origin=$TRAIN_BUCKET/$JOB_NAME/model --model=$MODEL_NAME v1
$ gcloud beta ml versions set-default --model=$MODEL_NAME v1
The model name is held in the shell variable 'MODEL_NAME'. A model can have multiple versions; in this case, you created a service with version 'v1' of the model and made it the default version.
You need to wait a few minutes until the service becomes available. In the meantime, let's create a test dataset with the following Python script:
import json
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
with open("data.json", "w") as file:
    for i in range(10):
        data = {"image": mnist.test.images[i].tolist(), "key": i}
        file.write(json.dumps(data)+'\n')
It generates a JSON file 'data.json' containing one image and key pair per line. You can submit the data to the prediction API with the following command:
$ gcloud beta ml predict --model=${MODEL_NAME} --json-instances=data.json
predictions:
- key: 0
  scores:
  - 2.53733e-08
  - 6.47722e-09
  - 2.23573e-06
  - 5.32844e-05
  - 3.08012e-10
  - 1.33022e-09
  - 1.55983e-11
  - 0.99991
  - 4.39428e-07
  - 3.38841e-05
- key: 1
  scores:
  - 1.98303e-08
  - 2.84799e-07
  - 0.999985
  - 1.47131e-05
  - 1.45546e-13
  - 1.90945e-09
  - 3.50033e-09
  - 2.24941e-18
  - 2.60025e-07
  - 1.45738e-14
- key: 2
  scores:
  - 3.63027e-09
...
You can see the response on the command line. Please refer to the official documentation for the URLs to which you can submit REST requests directly.
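To read the response, take the index of the largest score as the predicted digit and use 'key' to tie each prediction back to the request entry. The sketch below assumes the response has already been parsed into a list of dictionaries (that parsing step is not shown) and uses the first two entries from the output above; `predicted_digit` is a hypothetical helper.

```python
def predicted_digit(scores):
    """Return the index of the highest score, i.e. the predicted digit."""
    return max(range(len(scores)), key=lambda i: scores[i])


# First two entries of the response shown above, written as Python data.
predictions = [
    {'key': 0, 'scores': [2.53733e-08, 6.47722e-09, 2.23573e-06, 5.32844e-05,
                          3.08012e-10, 1.33022e-09, 1.55983e-11, 0.99991,
                          4.39428e-07, 3.38841e-05]},
    {'key': 1, 'scores': [1.98303e-08, 2.84799e-07, 0.999985, 1.47131e-05,
                          1.45546e-13, 1.90945e-09, 3.50033e-09, 2.24941e-18,
                          2.60025e-07, 1.45738e-14]},
]

# Map each request key to its predicted digit.
digits_by_key = {p['key']: predicted_digit(p['scores']) for p in predictions}
print(digits_by_key)  # {0: 7, 1: 2}
```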
Note on the distributed training
In this example, I used sample code written with the low-level TensorFlow APIs, so if you want to distribute the training job on Cloud ML, you need to modify the code further, following the Distributed TensorFlow documentation. Unfortunately, this is not a trivial change. Some basic points are explained in the following article.
But don't worry. The TensorFlow team is planning to provide high-level TensorFlow APIs so that you can write TensorFlow code that is automatically executed in a distributed manner on Cloud ML.
Stay tuned!
Disclaimer: All code snippets are released under Apache 2.0 License. This is not an official Google product.