Running an ETL Spark Job through Dataproc (an ephemeral cluster) with Workflow Templates
Let’s automate our daily ETL job!
“Why should we write a script to automate our daily ETL?”
As we know, a Data Engineer should deliver the right data to business users. A typical product of a Data Engineer is a table that users can access for their next processing step. The challenging part of being a Data Engineer is providing the right data to business users based on their needs.
The next questions are: how do we create an ETL for large-scale data, and how do we manage an ETL that has to run every day?
I have asked several people who consume a Data Engineer’s output, such as Data Scientists, Data Analysts, and Business Intelligence folks, how often they need data from a Data Engineer, and surprisingly they need the data every day! So we can conclude that as Data Engineers we not only manage the data but also have to provide it continuously.
“How can we provide data automatically without rewriting a script over and over?”
In this case, I assume that all of you already understand the ETL concept. Before I give you a solution for building a daily ETL, it is better that you are familiar with the following cloud technologies (we are going to use Google Cloud Platform services):
- Dataproc, including Dataproc Workflow Templates
- Google Cloud Storage
- BigQuery as our Data Warehouse
Dataproc
Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks. Use Dataproc for data lake modernization, ETL, and secure data science, at planet scale, fully integrated with Google Cloud, at a fraction of the cost. I recommend watching this video to get a better understanding. Dataproc also integrates with other Google Cloud services.
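To give you a feel for what Dataproc automates for us, this is roughly how you would spin up and tear down a cluster by hand (a sketch only; the cluster name, machine types, and GCS path mirror the ones used later in this article). The workflow template we build below performs these steps, plus the job submission, automatically:
gcloud dataproc clusters create spark-job-flights \
  --region=asia-southeast2 \
  --image-version=1.3 \
  --master-machine-type=n1-standard-2 \
  --worker-machine-type=n1-standard-2 \
  --num-workers=2
# submit a PySpark job to the running cluster
gcloud dataproc jobs submit pyspark gs://bigdata-etl-2_flights/spark-job/spark-etl-job-1.py \
  --cluster=spark-job-flights \
  --region=asia-southeast2
# delete the cluster when the job is done so it stops costing money
gcloud dataproc clusters delete spark-job-flights --region=asia-southeast2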
Google Cloud Storage
Referring to the documentation, Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. All buckets are associated with a project, and you can group your projects under an organization.
After you create a project, you can create Cloud Storage buckets, upload objects to your buckets, and download objects from your buckets. You can also grant permissions to make your data accessible to members you specify, or, for certain use cases such as hosting a website, accessible to everyone on the public internet. Here is also a good article about Google Cloud Storage (GCS) terminology.
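As a quick sketch, creating a bucket, copying an object into it, and listing its contents from the terminal looks like this (the bucket name is the one we will use for the flight data; the local file name is just an example):
gsutil mb -l asia-southeast2 gs://bigdata-etl-2_flights/
gsutil cp ./flights.csv gs://bigdata-etl-2_flights/
gsutil ls gs://bigdata-etl-2_flights/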
The figure below shows what the Cloud Storage structure looks like.
Yup, from the figure above, there are several general steps:
- Upload the Spark job into GCS
- Create a Dataproc workflow template
- Set a managed (ephemeral) cluster on the workflow template
- Add the Spark job to the workflow template
- Run the workflow template
- Load the result into BigQuery
But there are a few steps you should do before running the Dataproc workflow template:
- Create an output directory in GCS called “/flights_data_output” to receive the Spark job result
- Create the BigQuery tables
In this case, I assume you have already created an output directory called “/flights_data_output” and two tables in BigQuery, so we can focus on the general steps. Let’s start!
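As a sketch of that preparation: the GCS output “folder” does not need a special command, because it appears as soon as the first object is written under that prefix (you can also create it from the Cloud Console), while the BigQuery dataset and tables can be created with bq mk. The dataset name, table names, and schemas below are placeholders of my own, so replace them with yours:
# hypothetical dataset and tables -- adjust names and schemas to your data
bq mk --dataset bigdata-etl-2:flights_dataset
bq mk --table flights_dataset.flight_delays_departure origin:STRING,avg_departure_delay:FLOAT
bq mk --table flights_dataset.flight_delays_arrival destination:STRING,avg_arrival_delay:FLOAT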
- Upload the Spark job into GCS
gsutil cp /Users/macbookpro/Documents/GCP/dataproc_project/spark-etl-job-1.py gs://bigdata-etl-2_flights/spark-job/
- Create a Spark job
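Here is a minimal sketch of what “spark-etl-job-1.py” could look like. The input path layout, column names, and aggregation are my own illustrative assumptions (only the “/flights_data_output” prefix and the read-by-today’s-date behaviour come from this article), so adapt them to your data:
# spark-etl-job-1.py -- minimal illustrative sketch of the ETL job
from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

BUCKET = "gs://bigdata-etl-2_flights"
today = date.today().strftime("%Y-%m-%d")

spark = SparkSession.builder.appName("flight_delays_etl_job").getOrCreate()

# Extract: read today's raw flight files from GCS (hypothetical input prefix).
flights = spark.read.csv(
    "{}/flights_data/{}/*.csv".format(BUCKET, today),
    header=True, inferSchema=True
)

# Transform: average departure delay per origin airport (hypothetical columns).
departure_delays = (
    flights.groupBy("origin")
           .agg(F.avg("departure_delay").alias("avg_departure_delay"))
)

# Load (stage): write the result to the GCS output directory for bq load to pick up.
# A similar aggregation for arrival delays would be written under .../arrival as well.
departure_delays.write.mode("overwrite").csv(
    "{}/flights_data_output/{}/departure".format(BUCKET, today),
    header=True
)

spark.stop()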
- Create a bash file for executing the daily ETL
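And here is a sketch of “spark-workflow-template.sh”. The template name, region, cluster shape, and job step match what you will see in the run output below; the delete-before-create step, the BigQuery dataset/table names, and the bq load arguments are my own assumptions, so adjust them to however your Spark job writes its output:
#!/bin/bash
# spark-workflow-template.sh -- illustrative sketch of the daily ETL driver

REGION=asia-southeast2
TEMPLATE=flights_etl
BUCKET=gs://bigdata-etl-2_flights

# Recreate the workflow template (delete first so a previous day's template
# does not make the create step fail; skip the delete on the very first run).
gcloud dataproc workflow-templates delete $TEMPLATE --region=$REGION --quiet
gcloud dataproc workflow-templates create $TEMPLATE --region=$REGION

# Attach an ephemeral (managed) cluster: it is created when the workflow
# starts and deleted as soon as the job finishes.
gcloud dataproc workflow-templates set-managed-cluster $TEMPLATE \
  --region=$REGION \
  --cluster-name=spark-job-flights \
  --image-version=1.3 \
  --master-machine-type=n1-standard-2 --master-boot-disk-size=20GB \
  --worker-machine-type=n1-standard-2 --worker-boot-disk-size=20GB \
  --num-workers=2

# Add the PySpark job we uploaded to GCS as a step of the workflow.
gcloud dataproc workflow-templates add-job pyspark \
  $BUCKET/spark-job/spark-etl-job-1.py \
  --step-id=flight_delays_etl_job \
  --workflow-template=$TEMPLATE \
  --region=$REGION

# Run the workflow: create cluster -> run the job -> delete the cluster.
gcloud dataproc workflow-templates instantiate $TEMPLATE --region=$REGION

# Load today's output from GCS into the two BigQuery tables
# (placeholder dataset/table names and CSV format -- adjust to your job).
TODAY=$(date +%Y-%m-%d)
bq load --source_format=CSV --skip_leading_rows=1 \
  flights_dataset.flight_delays_departure \
  "$BUCKET/flights_data_output/$TODAY/departure/*.csv"
bq load --source_format=CSV --skip_leading_rows=1 \
  flights_dataset.flight_delays_arrival \
  "$BUCKET/flights_data_output/$TODAY/arrival/*.csv"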
Yup, with this solution we, as Data Engineers, no longer need to rewrite or modify a script every day. We just run this one script locally from our terminal.
Let’s try it!
- First, go to the directory where your “spark-workflow-template.sh” file lives. In my case the path is
cd /Users/macbookpro/Documents/GCP/dataproc_project
- Then, execute the bash file:
bash spark-workflow-template.sh
If your Spark job succeeds, you will get a message like this in your terminal:
(base) macbookpro@MacBooks-MacBook-Pro Downloads % bash spark-workflow-template.sh
createTime: '2021-03-12T05:16:17.976Z'
id: flights_etl
jobs:
- pysparkJob:
    mainPythonFileUri: gs://bigdata-etl-2_flights/spark-job/spark-etl-job-1.py
  stepId: flight_delays_etl_job
name: projects/bigdata-etl-2/regions/asia-southeast2/workflowTemplates/flights_etl
placement:
  managedCluster:
    clusterName: spark-job-flights
    config:
      gceClusterConfig:
        serviceAccountScopes:
        - https://www.googleapis.com/auth/devstorage.read_only
        - https://www.googleapis.com/auth/logging.write
        - https://www.googleapis.com/auth/monitoring.write
        - https://www.googleapis.com/auth/pubsub
        - https://www.googleapis.com/auth/service.management.readonly
        - https://www.googleapis.com/auth/servicecontrol
        - https://www.googleapis.com/auth/trace.append
      masterConfig:
        diskConfig:
          bootDiskSizeGb: 20
        machineTypeUri: n1-standard-2
      softwareConfig:
        imageVersion: '1.3'
      workerConfig:
        diskConfig:
          bootDiskSizeGb: 20
        machineTypeUri: n1-standard-2
        numInstances: 2
updateTime: '2021-03-12T05:16:19.865Z'
version: 3
Waiting on operation [projects/bigdata-etl-2/regions/asia-southeast2/operations/c0a0a7cf-6f72-388a-bcd0-a472d657df15].
WorkflowTemplate [flights_etl] RUNNING
Creating cluster: Operation ID [projects/bigdata-etl-2/regions/asia-southeast2/operations/6c7c5beb-b85e-478b-a0c0-c5c815119639].
Created cluster: spark-job-flights-lkszi7despdvi.
Job ID flight_delays_etl_job-lkszi7despdvi RUNNING
Job ID flight_delays_etl_job-lkszi7despdvi COMPLETED
Deleting cluster: Operation ID [projects/bigdata-etl-2/regions/asia-southeast2/operations/bb7614d3-ed3a-4db5-bd23-cb96352ec3fc].
WorkflowTemplate [flights_etl] DONE
Deleted cluster: spark-job-flights-lkszi7despdvi.
Waiting on bqjob_r6d705d1fc04b87bd_0000017824e0a9e5_1 ... (1s) Current status: DONE
Waiting on bqjob_r27984d086e53dc95_0000017824e0be19_1 ... (1s) Current status: DONE
As a Data Engineer, you can now just run this bash script daily whenever you want. The Spark job will read the right files in GCS automatically based on today’s date. In a further development, this script will run automatically every day at 1 AM Jakarta local time, using Apache Airflow as the orchestration tool. Wait for it!