Setting up your development environment
To begin with, please follow the steps outlined in: Apache Apex Development Environment Setup to setup your development environment; you can skip the sandbox download and installation if you already have a Hadoop cluster with Datatorrent RTS installed where you can deploy applications.
Sample input files
For this tutorial, you need some sample text files to use as input to the application.
Binary files such as PDF or DOCX files are not suitable since they contain a
lot of meaningless strings that look like words (for example, “Wqgi”).
Similarly, files using markup languages such as XML or HTML files are also not
suitable since the tag names such as div
, td
and p
dominate the word
counts. The RFC (Request for Comment) files that are used as de-facto
specifications for internet standards are good candidates since they contain
pure text; download a few of them as follows:
Open a terminal and run the following commands to create a directory named
data
under your home directory and download 3 files there:
cd; mkdir data; cd data
wget http://tools.ietf.org/rfc/rfc1945.txt
wget https://www.ietf.org/rfc/rfc2616.txt
wget https://tools.ietf.org/rfc/rfc4844.txt