Eduardo R. B. Marques, DCC/FCUP
Getting to know BigQuery: the bq utility

We will use Colab notebooks for a user-friendly experience.
Google Colab lets you run notebooks for free and integrates easily with Google Cloud services. You may inspect the runtime characteristics using this notebook.
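For example, a few shell commands in a couple of notebook cells are enough to get a rough picture of the runtime (a hypothetical illustration, not necessarily what the provided notebook contains):
!nproc      # number of CPU cores
!free -h    # available memory
!df -h /    # available disk space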
Notebooks are nice to illustrate how things work, but let’s now do the same in a standard Python program using the Cloud Shell.
Open the Cloud Shell and install some required Python libraries: vega_datasets, pyarrow, and db-dtypes.
pip3 install vega_datasets pyarrow db-dtypes
(the BigQuery client API is already installed in the Cloud Shell machine)
Upload bigquery_table_creation.py to Cloud Shell, and open it in the Cloud Shell editor.
You’ll see that it resembles the code in the Colab notebook. Change the PROJECT_ID variable in the code, then run it in the command line:
python3 bigquery_table_creation.py
The execution should create a new BigQuery dataset called vega2 and produce output similar to:
Creating client for project bdcc2023
Creating dataset vega2
Creating table bdcc2023.vega2.airports
Loading data into bdcc2023.vega2.airports
waiting for the load job to complete
waiting for the load job to complete
Load complete!
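For reference, the core of a program along these lines could look as follows. This is only a minimal sketch, not the exact contents of bigquery_table_creation.py; it assumes the airports data is read from vega_datasets into a Pandas dataframe and pushed to BigQuery with the client's load_table_from_dataframe method.

import time

from google.cloud import bigquery
from vega_datasets import data

PROJECT_ID = 'your-project-id'   # change this to your own project id

print('Creating client for project ' + PROJECT_ID)
client = bigquery.Client(project=PROJECT_ID)

print('Creating dataset vega2')
client.create_dataset('vega2', exists_ok=True)

# Load the airports dataframe from vega_datasets into a BigQuery table.
table_id = PROJECT_ID + '.vega2.airports'
print('Loading data into ' + table_id)
load_job = client.load_table_from_dataframe(data.airports(), table_id)
while not load_job.done():
    print('waiting for the load job to complete')
    time.sleep(1)
print('Load complete!')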
To create the flights table, you should be able to reuse the code you wrote in the Colab notebook. Adapt the program so that it also creates the flights table.
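A minimal sketch of such an adaptation is given below; it assumes the flights data also comes from vega_datasets (the flights_2k sample is used here, but your notebook may rely on a different variant).

from google.cloud import bigquery
from vega_datasets import data

PROJECT_ID = 'your-project-id'   # change this to your own project id
client = bigquery.Client(project=PROJECT_ID)

# Load a flights dataframe from vega_datasets into the vega2 dataset.
load_job = client.load_table_from_dataframe(
    data.flights_2k(), PROJECT_ID + '.vega2.flights')
load_job.result()   # blocks until the load job completes
print('Flights table loaded!')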
Write a Python program that executes a few queries on the BigQuery tables and saves the results to CSV files. You can use Pandas dataframes for that purpose (check the documentation of DataFrame.to_csv).
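A minimal sketch of such a program follows. The queries and output file names are merely illustrative; results are fetched into Pandas dataframes with the query job's to_dataframe() method (this is what pyarrow and db-dtypes are needed for) and then written out with DataFrame.to_csv.

from google.cloud import bigquery

PROJECT_ID = 'your-project-id'   # change this to your own project id
client = bigquery.Client(project=PROJECT_ID)

# Output file name -> SQL query (illustrative examples only).
queries = {
    'airports_per_state.csv':
        'SELECT state, COUNT(*) AS n FROM vega2.airports GROUP BY state ORDER BY n DESC',
    'first_airports.csv':
        'SELECT iata, name FROM vega2.airports ORDER BY iata LIMIT 10',
}

for filename, sql in queries.items():
    df = client.query(sql).to_dataframe()   # run the query, fetch results as a dataframe
    df.to_csv(filename, index=False)        # save the results to a CSV file
    print('Wrote %d rows to %s' % (len(df), filename))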
bq command-line tool

bq is a Google Cloud SDK command-line tool to interface with BigQuery.
Reference: Using the bq command-line tool
Open your Google Cloud Shell and issue the following commands to get familiar with the bq tool.
Type
bq help
to get general help, or
bq help command_name
to get help on a specific command designated by command_name, e.g., try
bq help ls
Execute
bq ls
to list all datasets for the active project, and
bq ls dataset_name
to list all tables in the dataset with the given name.
You can also get more specific information about a dataset or a table using the show command. For instance, considering the dataset we created using Colab, you can run bq show vega to get something like:
Dataset bdcc2023:vega
Last modified ACLs Labels Type
----------------- ------------------------------- -------- ---------
20 Mar 10:55:19 Owners: DEFAULT
eduardorbmarques@gmail.com,
projectOwners
Writers:
projectWriters
Readers:
projectReaders
Executing bq show vega.airports, you’ll get:
Table bdcc2023:vega.airports
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ --------
20 Mar 10:56:23 |- iata: string 3376 198272
|- name: string
|- city: string
|- state: string
. . . [ remaining output omitted] . . .
You can run queries using the query command. For instance,
bq query "SELECT iata,name FROM vega.airports ORDER BY iata LIMIT 5"
will execute the query and display the results:
Waiting on bqjob_r606aada88ef5764f_0000017fa7351c10_1 ... (0s) Current status: DONE
+------+---------------------------------+
| iata | name |
+------+---------------------------------+
| 0R3 | Abbeville Chris Crusta Memorial |
| 0J0 | Abbeville Municipal |
| U36 | Aberdeen Municipal |
| ABR | Aberdeen Regional |
| GZS | Abernathy |
+------+---------------------------------+
You can copy and remove datasets and tables using the cp and rm commands, respectively. Executing
bq cp vega.airports vega.airports2
will create vega.airports2 as a copy of vega.airports. You can verify that the new table exists using bq ls vega, for instance.
You can remove the vega.airports2 table using
bq rm vega.airports2