Exploring Google Cloud Storage
After configuring a particular project in your Google Cloud Platform account, we will first explore Google Cloud Storage. We can't load our data directly into BigQuery; the file must first be loaded into Cloud Storage. So we have to create a bucket first, then load the file into that bucket by following the commands below.
The file system on your local computer stores data in files and organizes files using directories. Cloud Storage stores data in objects and collects objects inside buckets. Buckets have a lot in common with directories, but there's one major difference:
- Buckets can't be nested.
That is, you can't organize buckets into a hierarchy the way you can organize directories.
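In other words, the bucket namespace is flat: what looks like a nested directory inside a bucket is really just part of the object's name. Splitting a GCS URL makes this visible (the names below reuse examples from this article):

```shell
# A gs:// URL is just scheme + bucket + object name; the "directories"
# after the bucket are part of the object name, not real folders.
URL="gs://bigdata-etl-2_flights/json-files/2019-04-27.json"
NO_SCHEME="${URL#gs://}"       # strip the gs:// scheme
BUCKET="${NO_SCHEME%%/*}"      # bucket: everything before the first slash
OBJECT="${NO_SCHEME#*/}"       # object: the rest, slashes included
echo "$BUCKET"                 # bigdata-etl-2_flights
echo "$OBJECT"                 # json-files/2019-04-27.json
```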
When working with buckets, we should be familiar with three points:
- All load/store/delete operations involving Cloud Storage must identify at least one target bucket.
- Every bucket has a globally unique name, a storage class, and a geographic location. A bucket's name must be unique across all GCP projects, not just your own. It's therefore a good idea to prepend your project ID to the bucket name: if the project ID is bigdata-etl-2, the bucket name could be bigdata-etl-2_flights.
- A project can create/delete buckets at most once every two seconds.
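The naming convention from the second point can be sketched as a couple of shell variables (values taken from this article):

```shell
# Prepend the project ID so the bucket name is globally unique.
PROJECT_ID="bigdata-etl-2"
SUFFIX="flights"
BUCKET_NAME="${PROJECT_ID}_${SUFFIX}"
echo "$BUCKET_NAME"   # bigdata-etl-2_flights
```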
This command creates a bucket:
gsutil mb -p PROJECT_ID -c STORAGE_CLASS -l BUCKET_LOCATION -b on gs://BUCKET_NAME
In this case, I create a bucket in my project bigdata-etl-2 with storage class STANDARD and bucket location ASIA-SOUTHEAST2 (Jakarta). See the Cloud Storage documentation for details about storage classes and bucket locations.
gsutil mb -p bigdata-etl-2 -c STANDARD -l ASIA-SOUTHEAST2 -b on gs://bigdata-etl-2_flights
I then get a confirmation message, which means bigdata-etl-2_flights has been created as a bucket in the bigdata-etl-2 project.
Upload files to Buckets
We use gsutil cp to upload (copy) our .json file from the local machine to the bucket.
gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/
In this case, I upload 2019-04-27.json from my local directory (the local path) to my bucket gs://bigdata-etl-2_flights/ (the cloud path).
gsutil cp /Users/macbookpro/Documents/GCP/005_resources/json-files/2019-04-27.json gs://bigdata-etl-2_flights/
You will get this message after running the above command:
Copying file:///Users/macbookpro/Documents/GCP/005_resources/json-files/2019-04-27.json [Content-Type=application/json]...
- [1 files][ 23.1 MiB/ 23.1 MiB] 252.6 KiB/s
Operation completed over 1 objects/23.1 MiB.
Exploring Google BigQuery
Google BigQuery has a serverless architecture that operates at scale and speed, providing incredibly fast SQL analytics over large datasets. Since its inception, numerous features and improvements have been made to improve performance, security, and reliability, and to make it easier for users to discover insights. We can also use BigQuery as a data warehouse. The files in our bucket can be loaded into BigQuery, but first we should create an empty table with the schema we want.
So, let's take a look at how to create a dataset in BigQuery and make an empty table with a particular schema.
bq --location=ASIA-SOUTHEAST2 mk -d \
--description "This is my flight dataset." \
data_flights
Make an empty table with a particular schema.
To create an empty table using a schema defined in a .json file, go to the directory that contains your schema file. In my case, the schema .json file is located in the load_json folder. Once you are in that directory, you can easily create the empty table.
bq mk -t --schema schema.json \
data_flights.flight_non_partitioned_json
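The schema.json file itself is not shown above, so here is a hypothetical sketch of what a BigQuery JSON schema for this flight data could look like. The field names are assumptions, except flight_date, which is used later as the partitioning column:

```shell
# Hypothetical schema file; only flight_date is confirmed by the article.
cat > schema.json <<'EOF'
[
  {"name": "flight_date", "type": "DATE",   "mode": "REQUIRED"},
  {"name": "origin",      "type": "STRING", "mode": "NULLABLE"},
  {"name": "destination", "type": "STRING", "mode": "NULLABLE"},
  {"name": "dep_delay",   "type": "FLOAT",  "mode": "NULLABLE"}
]
EOF
# Sanity-check that the file is valid JSON before handing it to bq.
python3 -c "import json; json.load(open('schema.json')); print('schema OK')"
```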
Let's load the data into the empty table called flight_non_partitioned_json. The source argument of bq load accepts either a local file path or a gs:// URI; since we already copied our .json files into the bucket, we can load from there, and a wildcard lets us load multiple files at once:
bq load --source_format=NEWLINE_DELIMITED_JSON \
data_flights.flight_non_partitioned_json \
"gs://bigdata-etl-2_flights/*.json" schema.json
Let's try to make another empty table, this time partitioned, and load the data into it.
bq mk -t \
--schema schema.json \
--time_partitioning_field flight_date \
data_flights.flight_partitioned_json

bq load --source_format=NEWLINE_DELIMITED_JSON \
data_flights.flight_partitioned_json \
"gs://bigdata-etl-2_flights/*.json" schema.json
Now, check the difference between the non-partitioned table and the partitioned table. Try these queries:
select * from `bigdata-etl-2.data_flights.flight_non_partitioned_json` where flight_date = '2019-04-27'
select * from `bigdata-etl-2.data_flights.flight_partitioned_json` where flight_date = '2019-04-27'
Both queries produce the same result. However, the first query processed 34 MiB, while the second, against the partitioned table, processed only 8.5 MiB. When using cloud services this is a real concern, because the more data a query processes, the larger the bill. Under the hood, the second query scans only the partition where flight_date is 2019-04-27, while the first query scans every row.
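Since bytes processed drive the bill, it can be worth checking a query's scan size before running it. The bq tool's --dry_run flag reports how much data a query would process without actually executing (or billing for) it. Running bq needs gcloud credentials, so this sketch only assembles and prints the command:

```shell
# Dry-run sketch: bq validates the query and reports bytes that would be
# processed, without running it. Names are taken from this article.
QUERY="SELECT * FROM \`bigdata-etl-2.data_flights.flight_partitioned_json\` WHERE flight_date = '2019-04-27'"
echo "bq query --use_legacy_sql=false --dry_run \"$QUERY\""
```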
This is the end of this article; perhaps you can explore more about BigQuery on your own. In a further article, I will explain how to create Spark jobs using another GCP service, Dataproc, with Apache Airflow as the scheduler.