BigQuery tutorial

Big Data & Cloud Computing

Eduardo R. B. Marques, DCC/FCUP

0. Summary

Getting to know BigQuery:

1. Creating datasets and running queries

We will use Colab notebooks for a user-friendly experience.

Google Colab lets you run notebooks for free and integrates easily with Google Cloud services. You may inspect the runtime characteristics using this notebook.

1.1. Table creation

BigQuery - table creation

1.2. Running queries

BigQuery - running queries

2. Doing the same in Cloud Shell …

Notebooks are nice to illustrate how things work, but let’s now do the same in a standard Python program using the Cloud Shell.

2.1. Install complementary libraries

Open the Cloud Shell and install some required Python libraries: vega_datasets, pyarrow, and db-dtypes.

  pip3 install vega_datasets pyarrow db-dtypes

(the BigQuery client API is installed in the Cloud Shell machine already)

2.2. Table creation

Upload bigquery_table_creation.py to Cloud Shell, and open it in the Cloud Shell editor.

You’ll see that it resembles the code in the Colab notebook. Change the PROJECT_ID variable in the code then run it in the command line:

  python3 bigquery_table_creation.py

The execution should create a new BigQuery dataset vega2

  Creating client for project bdcc2023
  Creating dataset vega2
  Creating table bdcc2023.vega2.airports
  Loading data into bdcc2023.vega2.airports
  waiting for the load job to complete
  waiting for the load job to complete
  Load complete!

To create the flights table you should be able to reuse the code you wrote in the Colab notebook. Do it to also create the flights table.

2.3. Run queries

Write a Python program that executes a few queries on the BigQuery tables and saves the results to CSV files. You can use Pandas dataframes for that purpose (check the documentation of Dataframe.to_csv).

3. Using the bq command-line tool

bq is a Google Cloud SDK command-line tool to interface with BigQuery.

Reference: Using the bq command-line tool

Open your Google Cloud Shell and issue the following commands to get familiar with bq commands.

3.1. Get help!

Type

  bq help

to get general help or

  bq help command_name

to get help on a specific command designated by command_name, e.g., try

  bq help ls

3.2. Get information on datasets and tables

Execute

  bq ls 
  

to list all datasets for the active project, and

  bq ls dataset_name

to list all tables in the dataset with the given name.

You can also get more specific information about a dataset or a table using the show command. For instance, considering the dataset we created using Colab, you can run bq show vega to get something like

  Dataset bdcc2023:vega

   Last modified                ACLs                Labels    Type
 ----------------- ------------------------------- -------- ---------
  20 Mar 10:55:19   Owners:                                  DEFAULT
                      eduardorbmarques@gmail.com,
                      projectOwners
                    Writers:
                      projectWriters
                    Readers:
                      projectReaders

Executing bq show vega.airports, you’ll get:

Table bdcc2023:vega.airports

   Last modified          Schema          Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
 ----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ --------
  20 Mar 10:56:23   |- iata: string       3376         198272                             
                    |- name: string                                                       
                    |- city: string                                                       
                    |- state: string           
. . . [ remaining output omitted] . . .

3.3. Run queries

You can run queries using the query command.

For instance

  bq query "SELECT iata,name FROM vega.airports ORDER BY iata LIMIT 5"  

will execute the query and display the results

  Waiting on bqjob_r606aada88ef5764f_0000017fa7351c10_1 ... (0s) Current status: DONE   
  +------+---------------------------------+
  | iata |              name               |
  +------+---------------------------------+
  | 0R3  | Abbeville Chris Crusta Memorial |
  | 0J0  | Abbeville Municipal             |
  | U36  | Aberdeen Municipal              |
  | ABR  | Aberdeen Regional               |
  | GZS  | Abernathy                       |
  +------+---------------------------------+

3.4. Copy and remove tables

You can copy and remove datasets and tables using the cp and rm commands respectively.

Executing

  bq cp vega.airports vega.airports2 

will create vega.airports2 as a copy of vega.airports. You can verify that the new table exists using bq ls vega for instance.

You can remove the vega.airports2 table using

  bq rm vega.airports2