From 64b8af0a0859f1729e66649b5f84da508566d09a Mon Sep 17 00:00:00 2001
From: Elson Rodriguez
Date: Sun, 15 Apr 2018 19:10:50 -0700
Subject: [PATCH] Improving S3 documentation. (#18406)

* Improving S3 documentation.

Added a copy-pastable guide on the variables, and also provided usable examples that give immediate feedback.

* Updating docs based on feedback.

Added back old configuration section, moved s3 implementations to bottom of document.

* Rearranged documentation before example, renamed sections to be more clear.
---
 tensorflow/docs_src/deploy/s3.md | 81 +++++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 14 deletions(-)

diff --git a/tensorflow/docs_src/deploy/s3.md b/tensorflow/docs_src/deploy/s3.md
index 38f8428..ef3b030 100644
--- a/tensorflow/docs_src/deploy/s3.md
+++ b/tensorflow/docs_src/deploy/s3.md
@@ -1,22 +1,13 @@
 # How to run TensorFlow on S3
 
-This document describes how to run TensorFlow on S3 file system.
+TensorFlow supports reading and writing data to S3. S3 is an object storage API that is nearly ubiquitous and can help in situations where data must be accessed by multiple actors, such as in distributed training.
 
-## S3
+This document guides you through the required setup and provides usage examples.
 
-We assume that you are familiar with @{$reading_data$reading data}.
-
-To use S3 with TensorFlow, change the file paths you use to read and write
-data to an S3 path. For example:
-
-```python
-filenames = ["s3://bucketname/path/to/file1.tfrecord",
-             "s3://bucketname/path/to/file2.tfrecord"]
-dataset = tf.data.TFRecordDataset(filenames)
-```
+## Configuration
 
 When reading or writing data on S3 with your TensorFlow program, the behavior
-could be controlled by various environmental variables:
+can be controlled by various environment variables:
 
 * **AWS_REGION**: By default, regional endpoint is used for S3, with region
   controlled by `AWS_REGION`. If `AWS_REGION` is not specified, then
@@ -28,7 +19,7 @@ could be controlled by various environmental variables:
 * **S3_VERIFY_SSL**: If HTTPS is used, SSL verification could be disabled
   with `S3_VERIFY_SSL=0`.
 
-To read or write objects in a bucket that is no publicly accessible,
+To read or write objects in a bucket that is not publicly accessible,
 AWS credentials must be provided through one of the following methods:
 
 * Set credentials in the AWS credentials profile file on the local system,
@@ -38,3 +29,65 @@ AWS credentials must be provided through one of the following methods:
   variables.
 * If TensorFlow is deployed on an EC2 instance, specify an IAM role and then
   give the EC2 instance access to that role.
+
+## Example Setup
+
+Using the above information, we can configure TensorFlow to communicate with an S3 endpoint by setting the following environment variables:
+
+```bash
+AWS_ACCESS_KEY_ID=XXXXX # Credentials only needed if connecting to a private endpoint
+AWS_SECRET_ACCESS_KEY=XXXXX
+AWS_REGION=us-east-1 # Region for the S3 bucket; this is not always needed. Default is us-east-1.
+S3_ENDPOINT=s3.us-east-1.amazonaws.com # The S3 API endpoint to connect to. This is specified in a HOST:PORT format.
+S3_USE_HTTPS=1 # Whether or not to use HTTPS. Disable with 0.
+S3_VERIFY_SSL=1 # If HTTPS is used, controls whether SSL certificates are verified. Disable with 0.
+```
+
+## Usage
+
+Once setup is complete, TensorFlow can interact with S3 in a variety of ways. Anywhere there is a TensorFlow IO function, an S3 URL can be used.
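+
+For example, here is a minimal sketch of a round trip through S3 using `tf.gfile` (the bucket name and object path below are placeholders; you need write access to the bucket):
+
+```python
+import tensorflow as tf
+
+# Write a small object to S3, then read it back.
+with tf.gfile.GFile("s3://bucketname/path/to/example.txt", "w") as f:
+    f.write("example data")
+
+with tf.gfile.GFile("s3://bucketname/path/to/example.txt", "r") as f:
+    print(f.read())
+```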
+
+### Smoke Test
+
+To test your setup, stat a file:
+
+```python
+from tensorflow.python.lib.io import file_io
+print(file_io.stat('s3://bucketname/path/'))
+```
+
+You should see output similar to this:
+
+```console
+<tensorflow.python.pywrap_tensorflow_internal.FileStatistics; proxy of <Swig Object of type 'tensorflow::FileStatistics *' at 0x10c2171b0> >
+```
+
+### Reading Data
+
+When @{$reading_data$reading data}, change the file paths you use to read and write
+data to an S3 path. For example:
+
+```python
+filenames = ["s3://bucketname/path/to/file1.tfrecord",
+             "s3://bucketname/path/to/file2.tfrecord"]
+dataset = tf.data.TFRecordDataset(filenames)
+```
+
+### TensorFlow Tools
+
+Many TensorFlow tools, such as TensorBoard or model serving, can also take S3 URLs as arguments:
+
+```bash
+tensorboard --logdir s3://bucketname/path/to/model/
+tensorflow_model_server --port=9000 --model_name=model --model_base_path=s3://bucketname/path/to/model/export/
+```
+
+This enables an end-to-end workflow using S3 for all data needs.
+
+## S3 Endpoint Implementations
+
+S3 was invented by Amazon, but the S3 API has spread in popularity and has several implementations. The following implementations have passed basic compatibility tests:
+
+* [Amazon S3](https://aws.amazon.com/s3/)
+* [Google Storage](https://cloud.google.com/storage/docs/interoperability)
+* [Minio](https://www.minio.io/kubernetes.html) (Standalone mode only)
-- 
2.7.4