Configuring Google Cloud for Working with BigQuery (Part II)

Running commands through the terminal!

Exploring Google Cloud Storage

After configuring a particular project on Google Cloud Platform, we are going to explore Google Cloud Storage first. In this workflow we don’t load our data directly into BigQuery; we upload the files to Cloud Storage first. So, we have to create a bucket, then load the files into that bucket by following the commands below.

Buckets

The file system on your local computer stores data in files and organizes files using directories. Cloud Storage stores data in objects and collects objects inside Buckets. Buckets have a lot in common with directories, but there’s one major difference:

  • Buckets can’t be nested.

That is, you can’t organize Buckets into a hierarchy in the way that you can organize directories.
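
For example, an object name can contain slashes, and the Cloud Console will display them as folders, but underneath there is still just one flat bucket. A minimal sketch (BUCKET_NAME is a placeholder, as in the templates below; the .json file is the one used later in this article):

gsutil cp 2019-04-27.json gs://BUCKET_NAME/raw/2019-04-27.json   # "raw/" is only a name prefix, not a nested bucket
gsutil ls gs://BUCKET_NAME/raw/                                  # lists every object that shares the prefix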

When working with buckets, we should keep this point in mind:

  • All load/store/delete operations involving Cloud Storage must identify at least one target bucket (see the short example after this list).
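
For example, gsutil always needs a gs:// bucket (or object) path to know where to operate. A small sketch, assuming gsutil is already authenticated for the project used in this article; old-file.json is just a hypothetical object name:

gsutil ls -p bigdata-etl-2                 # list the buckets in a project
gsutil ls gs://BUCKET_NAME/                # list the objects inside one bucket
gsutil rm gs://BUCKET_NAME/old-file.json   # delete a single object (the bucket must still be named)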

This command will create a bucket:

gsutil mb -p PROJECT_ID -c STORAGE_CLASS -l BUCKET_LOCATION -b on gs://BUCKET_NAME

In this case, I create a bucket in my project named bigdata-etl-2 with the storage class STANDARD and the bucket location ASIA-SOUTHEAST2 (Jakarta). Detailed information about storage classes and bucket locations is available in the Cloud Storage documentation.

gsutil mb -p bigdata-etl-2 -c STANDARD -l ASIA-SOUTHEAST2 -b on gs://bigdata-etl-2_flights

I then get the message below, which means I have created bigdata-etl-2_flights as my bucket in the bigdata-etl-2 project.

Creating gs://bigdata-etl-2_flights/...
The bigdata-etl-2_flights bucket was created (UI)
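
If you want to double-check the bucket’s location and storage class from the terminal as well, gsutil can print the bucket’s metadata (a quick check on the bucket created above):

gsutil ls -L -b gs://bigdata-etl-2_flights/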

Upload files to Buckets

We use gsutil cp to upload (copy) our .json file from the local machine to the bucket.

gsutil cp OBJECT_LOCATION gs://DESTINATION_BUCKET_NAME/

In this case, I upload 2019-04-27.json from my local directory (the local path) to my bucket gs://bigdata-etl-2_flights/ (the cloud path).

gsutil cp /Users/macbookpro/Documents/GCP/005_resources/json-files/2019-04-27.json gs://bigdata-etl-2_flights/

You will get the message below after running the above command:

Copying file:///Users/macbookpro/Documents/GCP/005_resources/json-files/2019-04-27.json [Content-Type=application/json]...
- [1 files][ 23.1 MiB/ 23.1 MiB]  252.6 KiB/s
Operation completed over 1 objects/23.1 MiB.
The 2019-04-27.json file was uploaded into the bucket (UI)
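
Since the load steps later in this article use four daily files, you can also copy them all in one go; the -m flag tells gsutil to copy in parallel. A sketch, assuming the other daily .json files sit in the same local folder used above:

gsutil -m cp /Users/macbookpro/Documents/GCP/005_resources/json-files/*.json gs://bigdata-etl-2_flights/
gsutil ls gs://bigdata-etl-2_flights/   # verify that the objects arrived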

Exploring Google BigQuery

Google BigQuery has a serverless architecture that lets it operate at scale and speed, providing incredibly fast SQL analytics over large datasets. Since its inception, numerous features and improvements have been made to improve performance, security, and reliability, and to make it easier for users to discover insights. We can also use BigQuery as a data warehouse. The files in our bucket can be loaded into BigQuery, but first we should create an empty table with the schema that we want.

So, let’s take a look at how we create a dataset in BigQuery and make an empty table with a particular schema.

bq --location=ASIA-SOUTHEAST2 mk -d \
--description "This is my flight dataset." \
data_flights
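
You can verify that the dataset exists and inspect its properties with bq (a quick check, assuming the default project is already set to bigdata-etl-2):

bq ls                                      # list the datasets in the current project
bq show --format=prettyjson data_flights   # show the dataset's location and description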

Make an empty table with a particular schema.

To create an empty table using the schema we built in a .json file, go to your schema file’s directory. In this case, my schema .json file is located in the load_json folder, so navigate there:

cd Documents/GCP/005_resources/load-tables-bash/load_json
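
The contents of schema.json are not shown in this article; for reference, a BigQuery JSON schema file is simply an array of column definitions. Below is a minimal, hypothetical sketch: only flight_date comes from this article (it is used as the partitioning column and in the queries later), while the other columns are made up for illustration.

[
  {"name": "flight_date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "carrier", "type": "STRING", "mode": "NULLABLE"},
  {"name": "origin", "type": "STRING", "mode": "NULLABLE"},
  {"name": "destination", "type": "STRING", "mode": "NULLABLE"}
]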

After that, we can easily make an empty table there.

bq mk -t --schema schema.json \
data_flights.flight_non_partitioned_json
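
To confirm that the table was created with the expected columns, you can print its schema back (a quick check on the table just created):

bq show --schema --format=prettyjson data_flights.flight_non_partitioned_json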

Let’s load the data into the empty table called flight_non_partitioned_json. Here I load multiple files from the bucket at once by listing their gs:// URIs, separated by commas.

bq load --source_format=NEWLINE_DELIMITED_JSON \
data_flights.flight_non_partitioned_json \
gs://bigdata-etl-2_flights/2019-04-27.json,gs://bigdata-etl-2_flights/2019-04-28.json,gs://bigdata-etl-2_flights/2019-04-29.json,gs://bigdata-etl-2_flights/2019-04-30.json
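
As a quick sanity check after the load finishes, you can count the rows that landed in the table (a sketch, assuming the load job above completed successfully):

bq query --use_legacy_sql=false \
'SELECT COUNT(*) AS row_count FROM `bigdata-etl-2.data_flights.flight_non_partitioned_json`'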

Let’s try to make another empty table, this time partitioned, and load the same data into it.

bq mk -t \
--schema schema.json \
--time_partitioning_field flight_date data_flights.flight_partitioned_json
bq load --source_format=NEWLINE_DELIMITED_JSON \
data_flights.flight_partitioned_json \
gs://bigdata-etl-2_flights/2019-04-27.json,gs://bigdata-etl-2_flights/2019-04-28.json,gs://bigdata-etl-2_flights/2019-04-29.json,gs://bigdata-etl-2_flights/2019-04-30.json
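
If you want to see which partitions were created, the INFORMATION_SCHEMA.PARTITIONS view lists one row per partition together with its row count (a sketch against the dataset used in this article):

bq query --use_legacy_sql=false \
'SELECT partition_id, total_rows
FROM `bigdata-etl-2.data_flights.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = "flight_partitioned_json"
ORDER BY partition_id'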

Now, check the difference between the non-partitioned table and the partitioned table. Try these queries:

select * from `bigdata-etl-2.data_flights.flight_non_partitioned_json` where flight_date = '2019-04-27'
Non_partitioned table
select * from `bigdata-etl-2.data_flights.flight_partitioned_json` where flight_date = '2019-04-27'

From these two queries, we can see that each produces the same result. However, the first query processed 34 MiB while the second query, against the partitioned table, processed only 8.5 MiB. When using cloud services this really matters, because processing more data creates a larger bill. Behind the scenes, the second query only scans the partition where flight_date is 2019-04-27, while the first query scans every value.
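
You can also estimate how many bytes a query will scan before actually running (and paying for) it, by adding the dry-run flag to bq query; a sketch using the partitioned query from above:

bq query --use_legacy_sql=false --dry_run \
'SELECT * FROM `bigdata-etl-2.data_flights.flight_partitioned_json` WHERE flight_date = "2019-04-27"'

A dry run validates the query and reports the estimated bytes processed without executing it, so it costs nothing.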

This is the end of this article; perhaps you can explore more about BigQuery on your own. In a further article, I will explain how to create Spark jobs using other GCP services such as Dataproc, with Apache Airflow as the scheduler.

I love Data Science!