S3 to HDFS Sync App

Summary

Ingest and back up Amazon S3 data to Hadoop HDFS. This application copies files from the configured S3 location to the destination path in HDFS. The source code is available at: https://github.com/DataTorrent/app-templates/tree/master/s3-to-hdfs-sync.

Please send feedback or feature requests to: feedback@datatorrent.com

This document is a step-by-step guide to configuring, customizing, and launching this application.

Steps to launch application

Click on the AppHub tab from the top navigation bar.
A page listing the applications available on AppHub is displayed. Search for S3 to see all applications related to S3. Click the import button for the S3 to HDFS Sync App.
A notification is displayed in the top right corner after the application package is successfully imported.
Click the link in the notification to navigate to the page for this application package. Detailed information about the application package, such as version, last modified time, and short description, is available on this page. Click the launch button for the S3-to-HDFS-Sync application.
The Launch S3-to-HDFS-Sync dialog is displayed. You can configure the name of this instance of the application from this dialog.
Select the Use saved configuration option. This displays a list of pre-saved configurations. Select sandbox-memory-conf.xml or cluster-memory-conf.xml depending on whether your environment is the DataTorrent sandbox or another cluster.
Select the Specify custom properties option. Click the add default properties button.
This expands a key-value editor pre-populated with the mandatory properties for this application. Change the values as needed. For example, suppose we wish to copy all files in directory1 of the bucket com.example.s3test, using the ACCESS_KEY_ID:SECRET_KEY combination, to /user/appuser/output on the host cluster (the cluster on which the app is running). The properties should be set as follows:

name	value
dt.operator.HDFSFileCopyModule.prop.outputDirectoryPath	/user/appuser/output
dt.operator.S3InputModule.prop.files	s3n://ACCESS_KEY_ID:SECRET_KEY@com.example.s3test/directory1

Details about configuration options are available in the Configuration options section.
Click the Launch button in the lower right corner of the dialog to launch the application. A notification is displayed in the top right corner after the application is launched successfully. It includes the Application ID, which can be used to monitor this instance and to find its logs.
Click the Monitor tab in the top navigation bar.
A page listing all running applications is displayed. Search for the current application by name, application ID, or any other relevant field. Click the application name or ID to navigate to the application instance details page.
The application instance details page shows key metrics for monitoring the application status. The logical tab shows the application DAG, Stram events, operator status based on logical operators, stream status, and a chart with key metrics.
Click the physical tab to view the status of the physical instances of the operators, containers, and so on.
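
The two property values from the launch steps can also be kept in a file for review or reuse. The following is a minimal sketch, assuming the Hadoop-style configuration XML layout (configuration / property / name / value). The helper name render_properties is hypothetical and is not part of the application.

```python
from xml.sax.saxutils import escape

# Example values taken from the launch steps; replace the placeholders
# ACCESS_KEY_ID and SECRET_KEY with real credentials before use.
example_props = {
    "dt.operator.HDFSFileCopyModule.prop.outputDirectoryPath": "/user/appuser/output",
    "dt.operator.S3InputModule.prop.files": "s3n://ACCESS_KEY_ID:SECRET_KEY@com.example.s3test/directory1",
}


def render_properties(props):
    """Render a dict of property names and values as Hadoop-style configuration XML.

    Hypothetical helper for illustration; names and values are XML-escaped.
    """
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(example_props))
```

Because the S3 URL embeds the secret key, any file produced this way should be treated as sensitive.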

Configuration options

Mandatory properties

The end user must specify values for these properties. All of them are strings.

Property	Description	Example
dt.operator.HDFSFileCopyModule.prop.outputDirectoryPath	HDFS path for destination directory	/user/appuser/output/directory1 or hdfs://node1.company1.com/user/appuser/output
dt.operator.S3InputModule.prop.files	Access URL for S3 source	s3n://ACCESS_KEY_ID:SECRET_KEY@BUCKET_NAME/DIRECTORY
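
Before launching, it can help to check that the S3 access URL has the shape shown in the table. The following is a minimal sketch using only the Python standard library. The function name parse_s3_source and the returned keys are hypothetical, and the check covers only the s3n://ACCESS_KEY_ID:SECRET_KEY@BUCKET_NAME/DIRECTORY form from the table, not every URL the application may accept.

```python
from urllib.parse import urlsplit


def parse_s3_source(url):
    """Split an s3n:// access URL into its parts and reject malformed input.

    Hypothetical helper for illustration. The secret key is deliberately
    not returned so that it does not end up in logs.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3n":
        raise ValueError("expected an s3n:// URL")
    if not parts.username or not parts.password:
        raise ValueError("expected ACCESS_KEY_ID:SECRET_KEY before '@'")
    if not parts.hostname:
        raise ValueError("expected a bucket name after '@'")
    return {
        "access_key_id": parts.username,
        # urlsplit lowercases the host part, which is used here as the bucket name.
        "bucket": parts.hostname,
        "directory": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(parse_s3_source("s3n://ACCESS_KEY_ID:SECRET_KEY@com.example.s3test/directory1"))
```

A secret key that contains a "/" character does not survive this parse: the URL is cut at the first "/", so the function reports missing credentials.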

Advanced properties

There are pre-saved configurations based on the application environment. Recommended settings for the DataTorrent sandbox edition are in sandbox-memory-conf.xml, and those for a cluster environment are in cluster-memory-conf.xml.

Property	Description	Type	Default for cluster-memory-conf.xml	Default for sandbox-memory-conf.xml
dt.operator.S3InputModule.prop.maxReaders	Maximum number of BlockReader partitions for parallel reading.	int	16	1
dt.operator.S3InputModule.prop.blocksThreshold	Rate at which block metadata is emitted per second	int	16	1

You can override the default values of advanced properties by specifying custom values for them in the Specify custom properties step described in Steps to launch application.
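
The override behavior described above can be pictured as a simple merge in which custom values win over the saved configuration. The following is a minimal sketch of that idea, not the platform's actual resolution logic. The defaults are copied from the table, and the function name effective_properties is hypothetical.

```python
# Defaults copied from the advanced properties table.
DEFAULTS = {
    "cluster-memory-conf.xml": {
        "dt.operator.S3InputModule.prop.maxReaders": 16,
        "dt.operator.S3InputModule.prop.blocksThreshold": 16,
    },
    "sandbox-memory-conf.xml": {
        "dt.operator.S3InputModule.prop.maxReaders": 1,
        "dt.operator.S3InputModule.prop.blocksThreshold": 1,
    },
}


def effective_properties(saved_conf, custom=None):
    """Return the saved configuration's defaults with custom values applied on top.

    Hypothetical helper for illustration: custom properties take precedence,
    and the DEFAULTS table itself is never modified.
    """
    merged = dict(DEFAULTS[saved_conf])
    merged.update(custom or {})
    return merged


if __name__ == "__main__":
    overrides = {"dt.operator.S3InputModule.prop.maxReaders": 4}
    print(effective_properties("cluster-memory-conf.xml", overrides))
```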

Steps to customize the application

Make sure you have the following utilities installed on your machine and available on the PATH environment variable:
- Java: 1.7.x
- Maven: 3.0+
- Git: 1.7+
- Hadoop: Apache 2.2+
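
Whether these utilities are on PATH can be checked with a short script. The following is a minimal sketch. The command names java, mvn, git, and hadoop are assumed to be the usual executable names for the tools listed, and the script checks only presence, not versions.

```python
import shutil

# Assumed executable names for the utilities listed above.
REQUIRED_TOOLS = ["java", "mvn", "git", "hadoop"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    The lookup function is a parameter so the check can be exercised
    without depending on the local machine.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```
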
Use the following command to clone the app-templates repository:

git clone git@github.com:DataTorrent/app-templates.git
Change directory to app-templates/s3-to-hdfs-sync:

cd app-templates/s3-to-hdfs-sync
Import this Maven project into your favorite IDE (e.g. Eclipse).
Change the source code as per your requirements. This application only copies files from source to destination, so Application.java does not involve any processing operator in between.
Make the corresponding changes in the test case and properties.xml based on your environment.
Compile this project using Maven:

mvn clean package

This generates the application package, with the .apa extension, in the target directory.
Open the DataTorrent UI management console in a web browser. Click the Develop tab in the top navigation bar.
Click the upload package button and upload the generated .apa file.
The application package page is shown with a listing of all packages. Click the Launch button for the uploaded application package. Follow the steps for launching an application.
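
After the build step above, the generated package can be located with a short script before uploading it. The following is a minimal sketch, assuming the build was run in the project directory so that packages land in its target subdirectory. The function name find_app_packages is hypothetical.

```python
from pathlib import Path


def find_app_packages(project_dir="."):
    """Return the .apa files in <project_dir>/target, newest first.

    Hypothetical helper for illustration; returns an empty list when the
    target directory does not exist (for example, before the first build).
    """
    target = Path(project_dir) / "target"
    if not target.is_dir():
        return []
    return sorted(target.glob("*.apa"), key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for package in find_app_packages():
        print(package)
```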
