In this step we’ll launch our first cluster, which will run solely on Spot Instances. We will also submit an EMR step for a simple wordcount Spark application, which will run against a public dataset of Amazon product reviews located in an Amazon S3 bucket in the N. Virginia region. If you want to know more about the Amazon Customer Reviews Dataset, [click here](https://s3.amazonaws.com/amazon-reviews-pds/readme.html).
Normally our dataset on S3 would be located in the same region where we are going to run our EMR clusters. In this workshop, it is fine if you are running EMR in a different region: the Spark application will still work against the dataset in the N. Virginia region, and the impact on price and performance will be negligible.
To launch the cluster, follow these steps:
--executor-memory 18G --executor-cores 4
import sys
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Amazon reviews word count').getOrCreate()
df = spark.read.parquet("s3://amazon-reviews-pds/parquet/")
df.selectExpr("explode(split(lower(review_body), ' ')) as words").groupBy("words").count().write.mode("overwrite").parquet(sys.argv[1])
exit()
Then add the location of the file under the Application location field, for example: s3://<your-bucket-name>/script.py
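To make it clearer what the Spark application computes, here is a minimal plain-Python sketch of the same word count logic, run on a few in-memory strings instead of the S3 dataset. The function name `word_counts` and the sample reviews are made up for illustration. The sketch only approximates the Spark behavior: it lowercases each review body, splits it on single spaces, and skips missing bodies, since `explode` produces no rows for a null value.

```python
from collections import Counter


def word_counts(review_bodies):
    """Count words roughly the way the Spark job does.

    Each body is lowercased and split on single spaces, so consecutive
    spaces produce empty-string "words", just as split(..., ' ') would.
    """
    counts = Counter()
    for body in review_bodies:
        if body is None:
            # A null review_body contributes no rows after explode()
            continue
        counts.update(body.lower().split(" "))
    return counts


# Hypothetical sample data standing in for the review_body column
sample_reviews = ["Great product", "great value", None]
counts = word_counts(sample_reviews)
print(counts["great"], counts["product"], counts["value"])
```

The real job does the same grouping in a distributed way and writes the result as Parquet to the output path passed as the first argument of the script.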
Under After last step completes, make sure that the “Cluster enters waiting state” option is selected. We want to examine the cluster during and after the Spark application run; if we opted to auto-terminate the cluster after our step is completed, we might end up with a terminated cluster before we complete the next steps in the workshop.
“Auto-terminate cluster after the last step is completed” is a powerful EMR feature that is used for running transient clusters. This is an effective model for clusters that perform periodic processing tasks, such as a daily data processing run or event-driven ETL workloads. We will not be running a transient cluster, since it might terminate before we complete some of the next steps in the workshop.
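For readers who later script cluster creation instead of using the console, the same choice can be sketched as request parameters. The example below only builds Python dictionaries in the shape that the boto3 EMR `run_job_flow` call accepts and does not call AWS. It assumes that `KeepJobFlowAliveWhenNoSteps` (inside `Instances`) is the API counterpart of the console option, and the helper names, the step name, and the output path are hypothetical.

```python
def build_spark_step(script_uri, output_uri):
    """Describe an EMR step that runs the wordcount script via spark-submit."""
    return {
        "Name": "Spark application",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--executor-memory", "18G",
                "--executor-cores", "4",
                script_uri,
                output_uri,  # becomes sys.argv[1] inside the script
            ],
        },
    }


def build_instances_settings(transient):
    """Return the part of the Instances settings that controls auto-termination.

    transient=True  -> cluster terminates after the last step completes
    transient=False -> cluster enters the waiting state (used in this workshop)
    """
    return {"KeepJobFlowAliveWhenNoSteps": not transient}


step = build_spark_step(
    "s3://<your-bucket-name>/script.py",
    "s3://<your-bucket-name>/results/",  # hypothetical output location
)
workshop_settings = build_instances_settings(transient=False)
print(workshop_settings)
```

A daily batch job would typically pass `transient=True` so that the cluster is only billed while the step runs, whereas this workshop keeps the cluster alive for inspection.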
Click Next to continue setting up the EMR cluster and move from “Step 1: Software and steps” to “Step 2: Hardware”.