In this step we’ll launch our first cluster. It will be a transient cluster that runs solely on Spot Instances and shuts down once it finishes running the application we submit at cluster creation time. The application is a simple word count that runs against a public data set of Amazon product reviews (the Amazon Customer Reviews Dataset), located in an Amazon S3 bucket in the N. Virginia region.
Normally, our dataset on S3 would be located in the same region as our EMR clusters. In this workshop, for educational purposes, it is fine to run EMR in a different region; the Spark application will still work against the dataset located in the N. Virginia region.
To launch the cluster, follow these steps:
--executor-memory 18G --executor-cores 4
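import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Amazon reviews word count').getOrCreate()
# Read the public Amazon reviews dataset (Parquet) from S3.
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/")
# Lowercase each review, split it on spaces, count each word, and write the
# counts as Parquet to the output location passed as the first argument.
df.selectExpr("explode(split(lower(review_body), ' ')) as words").groupBy("words").count().write.mode("overwrite").parquet(sys.argv[1])
exit()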
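To make the transformation in the script easier to follow, here is a minimal pure-Python sketch of the same counting logic applied to a tiny in-memory sample. It is an illustration only, not part of the workshop: the function name `word_count` and the sample reviews are hypothetical, and it does not use Spark.

```python
from collections import Counter


def word_count(review_bodies):
    """Mimic the Spark expression: lowercase each review, split it on single
    spaces, and count how often each resulting token appears."""
    counts = Counter()
    for body in review_bodies:
        if body is None:
            # In Spark, a NULL review_body yields a NULL array and
            # explode() emits no rows for it.
            continue
        # Splitting on a single space keeps punctuation attached to words
        # and produces empty tokens for consecutive spaces.
        counts.update(body.lower().split(" "))
    return counts


# Hypothetical sample data, standing in for the review_body column.
sample = ["Great product", "great value, GREAT price", None]
for word, count in word_count(sample).items():
    print(word, count)
```

Note that the split is deliberately naive: tokens such as `value,` keep their punctuation, which is also how the Spark job behaves.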
Then add the location of the file under the Application location field, for example: s3://<your-bucket-name>/script.py
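If you later want to script this step instead of using the console, the following is a rough sketch of how the same Spark step could be described as a step definition for the EMR API (for example, as an entry in the `Steps` list accepted by boto3's `run_job_flow`). It only builds the dictionary and makes no AWS calls. The helper name `build_spark_step`, the `CONTINUE` failure action, and the output path are assumptions made for illustration; the spark-submit options and the script location come from the steps above.

```python
def build_spark_step(script_uri, output_uri):
    """Build an EMR step definition that runs the word count via spark-submit.

    output_uri is passed to the script as its first argument, which the
    script uses as the location to write results.
    """
    return {
        "Name": "Amazon reviews word count",
        # Assumption: keep the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--executor-memory", "18G",
                "--executor-cores", "4",
                script_uri,
                output_uri,
            ],
        },
    }


# The results path below is a hypothetical placeholder.
step = build_spark_step(
    "s3://<your-bucket-name>/script.py",
    "s3://<your-bucket-name>/results/",
)
print(step["HadoopJarStep"]["Args"])
```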
Check the Auto-terminate cluster after the last step is completed option. Since we want a transient cluster that exists only to run our Spark application, this option terminates the cluster once our submitted step (the Spark application) has completed.
If you are not running through the workshop in one sitting, don’t check Auto-terminate cluster after the last step is completed; otherwise your cluster will be terminated before you can examine it later in the workshop.
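For readers who create clusters through the API rather than the console, the auto-terminate choice roughly corresponds to the `KeepJobFlowAliveWhenNoSteps` flag inside the `Instances` parameter of boto3's `run_job_flow`. The sketch below only builds that fragment of the request; the helper name `instances_fragment` is hypothetical, and a real request needs many more settings (instance fleets or groups, roles, release label, and so on).

```python
def instances_fragment(auto_terminate):
    """Return the part of the Instances config that controls auto-termination.

    KeepJobFlowAliveWhenNoSteps=False means the cluster terminates once all
    submitted steps have finished, i.e. a transient cluster.
    """
    return {"KeepJobFlowAliveWhenNoSteps": not auto_terminate}


# Transient cluster: terminate after the last step completes.
print(instances_fragment(auto_terminate=True))

# Working through the workshop over several sittings: keep the cluster alive.
print(instances_fragment(auto_terminate=False))
```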
Click Next to continue setting up the EMR cluster and move from “Step 1: Software and steps” to “Step 2: Hardware”.