Running ETL Spark Job through Dataproc (an ephermal cluster)with Workflow Templates

Let’s automate our daily ETL job!

As we know today that a Data Engineer should give a right dimension data into business user. A product from Data Engineer could be a table that accessible for doing the next processing step. The challenges part as a Data engineer is providing a right data to the business user based on their needs.

The next questions are, How do we create an ETL for large-scale data? or maybe how can we manage an ETL performance every day?

I have been asking to several person which as a user from Data Engineer such as, Data scientist, Data Analyst or maybe Business Intelligence about “how often do they need data from a data engineer” and surprisingly they need the data everyday! So, from here we can conclude that as a data engineer we not only manage the data but also providing the data continuously

In this case, i assume that all of you guys already understand about an ETL Concept. Before i give you a solution how we make a daily ETL, its better that you familiar with these following cloud technology (we are going to use Google Cloud Platform services):

  • Dataproc included dataproc workflow-templates
  • Google Cloud Storage
  • BigQuery as our Data Warehouse
Dataproc Use-case

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud, at a fraction of the cost. I recommend that you watch this video for getting better understanding. Dataproc also integrates with other Google Cloud services.

Referring to the documentation, Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. All buckets are associated with a project, and you can group your projects under an organization.

After you create a project, you can create Cloud Storage buckets, upload objects to your buckets, and download objects from your buckets. You can also grant permissions to make your data accessible to members you specify, or — for certain use cases such as hosting a website — accessible to everyone on the public internet. Here, i also giving you a good article about terminology of Google Cloud Storage (GCS)

The figure bellow tell us about the Cloud Storage structure looks like.

Cloud Storage Structure

I recommend you to read my previous article about BigQuery.

Here is the source:

  • Configuration Google Cloud for dealing with BigQuery Part (I)
  • Configuration Google Cloud for dealing with BigQuery Part (II)

Here is my solution

Yups, from the figure above there are a several general step :

  • Upload the spark job into GCS
  • Create Dataproc workflow-templates
  • Create Dataproc workflow-templates set-managed-cluster
  • add spark job to the workflow-templates
  • run the workflow-templates
  • load to the BigQuery

But, there are a few of step that you should do before running the dataproc workflow-templates:

  • Make an output directory in GCS called “/flights_data_output” for receiving a spark job result
  • Make a BigQuery table

In this case i assume that you already make an output directory called “/flights_data_output” and 2 table in BigQuery. So, we can focus on the general step and start it!

  • Upload the spark job into GCS
gsutil cp /Users/macbookpro/Documents/GCP/dataproc_project/spark-etl-job-1.py gs://bigdata-etl-2_flights/spark-job/
  • Create a spark job
  • Create bash file for executing a daily ETL

Yups, with this solution as a Data engineer we don’t need anymore to make a script or modified it. We just only running this script locally through our terminal.

Let’s try it!

  • First, go to your “spark-workflow-template.sh” file directory. In this case my PATH is
cd /Users/macbookpro/Documents/GCP/dataproc_project
  • Then, execute the bash file by
bash spark-workflow-template.sh

If your spark job success you will get this message from your terminal

(base) macbookpro@MacBooks-MacBook-Pro Downloads % bash spark-workflow-template.shcreateTime: '2021-03-12T05:16:17.976Z'id: flights_etljobs:- pysparkJob:mainPythonFileUri: gs://bigdata-etl-2_flights/spark-job/spark-etl-job-1.pystepId: flight_delays_etl_jobname: projects/bigdata-etl-2/regions/asia-southeast2/workflowTemplates/flights_etlplacement:managedCluster:clusterName: spark-job-flightsconfig:gceClusterConfig:serviceAccountScopes:- https://www.googleapis.com/auth/devstorage.read_only- https://www.googleapis.com/auth/logging.write- https://www.googleapis.com/auth/monitoring.write- https://www.googleapis.com/auth/pubsub- https://www.googleapis.com/auth/service.management.readonly- https://www.googleapis.com/auth/servicecontrol- https://www.googleapis.com/auth/trace.appendmasterConfig:diskConfig:bootDiskSizeGb: 20machineTypeUri: n1-standard-2softwareConfig:imageVersion: '1.3'workerConfig:diskConfig:bootDiskSizeGb: 20machineTypeUri: n1-standard-2numInstances: 2updateTime: '2021-03-12T05:16:19.865Z'version: 3Waiting on operation [projects/bigdata-etl-2/regions/asia-southeast2/operations/c0a0a7cf-6f72-388a-bcd0-a472d657df15].WorkflowTemplate [flights_etl] RUNNINGCreating cluster: Operation ID [projects/bigdata-etl-2/regions/asia-southeast2/operations/6c7c5beb-b85e-478b-a0c0-c5c815119639].Created cluster: spark-job-flights-lkszi7despdvi.Job ID flight_delays_etl_job-lkszi7despdvi RUNNINGJob ID flight_delays_etl_job-lkszi7despdvi COMPLETEDDeleting cluster: Operation ID [projects/bigdata-etl-2/regions/asia-southeast2/operations/bb7614d3-ed3a-4db5-bd23-cb96352ec3fc].WorkflowTemplate [flights_etl] DONEDeleted cluster: spark-job-flights-lkszi7despdvi.Waiting on bqjob_r6d705d1fc04b87bd_0000017824e0a9e5_1 ... (1s) Current status: DONEWaiting on bqjob_r27984d086e53dc95_0000017824e0be19_1 ... (1s) Current status: DONE

If you are a data engineer you just run this bash script daily whenever you want. The spark job script will read into your files in GCS automatically based on ‘today’s date’. In the further development, this script will run automatically every 1 AM Jakarta Local Time. I will use an Apache Airflow as an orchestration tools. Wait for it!

I love Data Science!

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store