This post briefly introduces BigQuery's architecture and offers a few tips to ingest data into BigQuery more effectively.
BigQuery is an outstanding cloud Data Warehouse. There is far more to say about its architecture and ingestion than fits in one post, so I will mention a couple of sources where you can find further information.
As for BigQuery's features, they are vast, and some of them are unique; the possibilities they offer are nearly endless. For example, I once retrieved meteorological information to load geospatial data into BigQuery and explained how to represent it, and I used BigQuery ML to predict customer churn.
BigQuery Architecture
Most databases are single-tenant: the database servers are physically segregated from those of other customers. An organisation (a company, a division, etc.) has its own database and grants permissions to its users.
BigQuery, however, is a multi-tenant database: organisations share the platform. Data is organised into data sets, logical groups of data that contain related database objects such as tables and views, and that serve to organise and control access to data. You can have several data sets in a project, and projects belong to organisations, as per Google Cloud's Resource Hierarchy.
When you create a data set, you are the only one who has access to it. You can then grant permissions to users in the same organisation or in different ones. E.g., organisations share public data sets for demos, learning or research; everyone has read access to the public data sets.
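As a minimal sketch, assuming the Python client library and a hypothetical data set and user e-mail, granting read access could look like this:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Fetch the data set and its current access control list.
dataset = client.get_dataset("my_project.my_dataset")  # hypothetical data set

# Append a read grant for a user, possibly from another organisation.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # hypothetical user
    )
)
dataset.access_entries = entries

# Persist only the modified access_entries field.
dataset = client.update_dataset(dataset, ["access_entries"])
```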
BigQuery is a huge system. Its components are spread over Google Cloud's data centres, and storage and compute are decoupled: every Google Cloud data centre has disks and computers that support BigQuery's processing. As a result, your data may be stored in several data centres, and you may be using compute power in different places to access it or to load more data.
The main components of BigQuery are:
- Storage: BigQuery uses many disks, and it relies on Google’s proprietary file system called Colossus.
- Compute: BigQuery is built from many computers that let you access data or load more. The query engine, Google's proprietary Dremel, is spread over these computers. It creates an abstraction layer over the files stored on the file system, making them look like tables that users can query with SQL.
- Network: The compute tier needs to access data frequently, and to organise and group it to retrieve or load it. Since data and compute power are spread over different locations, BigQuery relies on a robust network for this. It is the shuffle tier, and it is built on Google's proprietary network fabric called Jupiter.
As already mentioned, BigQuery's components are spread over several data centres. However, for regulatory and performance reasons, you need some control over where your data is allocated. You usually want your data to lie as near as possible to where your users are. Equally importantly, some institutions require that data reside where its owners live, as under the European Union's GDPR legal framework. So, when you create a data set, you choose a location where you want your data to be.
You can locate your BigQuery data set in a Google Cloud region. Every region is made up of several zones, which are distinct data centres.
In addition, you can choose a BigQuery multi-region to place your data. A multi-region is a group of several regions that cover a continent. Presently, there are two multi-regions: the European Union (EU) and the United States (US). Customers usually choose to place their data set in a multi-region because it is cheaper than selecting a single region, and it is convenient when their users access the data from different places within a continent.
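As a minimal sketch, assuming the Python client library (the project and data set names are hypothetical), this is how you pin a data set to the EU multi-region:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Define the data set and pin it to the EU multi-region;
# the location cannot be changed after creation.
dataset = bigquery.Dataset("my-project.eu_sales")  # hypothetical data set
dataset.location = "EU"

dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Created {dataset.dataset_id} in {dataset.location}")
```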
You can't copy data between data sets that are on different continents with a plain query; i.e., you can't use INSERT … SELECT across locations. You need the BigQuery Data Transfer Service to move data from a data set in the EU multi-region to the US one, or vice versa.
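As a sketch of the latter, assuming the Data Transfer Service's Python client and hypothetical project and data set names, a scheduled cross-region copy can be created like this:

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Copy a data set from the EU multi-region into one in the US.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="us_sales",      # hypothetical US data set
    display_name="EU to US data set copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "my-project",  # hypothetical project
        "source_dataset_id": "eu_sales",    # hypothetical EU data set
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```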
On a separate note, BigQuery stores data column-wise, using a columnar representation of the Protocol Buffers (Protobuf) storage format. The column store reduces the amount of data BigQuery reads for queries that process all rows but only some columns. This is one of the reasons why Dremel can process terabytes' worth of logs in seconds.
The other reason why Dremel can process data so fast is that its query engine uses distributed computing. Dremel scales to thousands of workers by structuring the computation as a tree.
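Coming back to the column store, you can observe its effect yourself with a dry run, which reports how many bytes a query would scan without actually running it. A small sketch against a public data set:

```python
from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Scanning every column versus a single column of the same table.
full = client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare`",
    job_config=dry_run,
)
narrow = client.query(
    "SELECT word FROM `bigquery-public-data.samples.shakespeare`",
    job_config=dry_run,
)

# With a column store, the narrow query scans far fewer bytes.
print(f"All columns: {full.total_bytes_processed} bytes")
print(f"One column:  {narrow.total_bytes_processed} bytes")
```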
Another advantage of decoupling storage and compute is that each can scale on demand independently. This offers immense flexibility and cost control for customers' organisations, as they don't need to keep their compute resources up and running all the time.
Finally, BigQuery is fully managed and serverless, i.e., you don't need to perform administration tasks such as collecting statistics or building indexes.
If you want to know more about BigQuery's architecture, you can read this article. In addition, the book "Google BigQuery: The Definitive Guide" by Valliappa Lakshmanan, Jordan Tigani and others, published by O'Reilly, is one of the best resources for BigQuery. Chapter 6 explains BigQuery's architecture.
Slots
A BigQuery slot is a unit of computational capacity required to execute SQL queries. Each query is broken up into stages, which slots then process. Every step of a query or a load job requires a different number of slots.
To put it differently, a slot is a thread of execution on a query worker shard; it generally represents half of a CPU core and about 1 GB of RAM. This amount is imprecise because slots can grow or shrink if they need more or fewer resources and as computers in a Google Cloud data centre are upgraded.
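For example, you can estimate the average number of slots a finished job used by dividing its total slot-milliseconds by its run time. A sketch using the INFORMATION_SCHEMA jobs view, assuming your jobs run in the US region (adjust the region qualifier otherwise):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Average slots per job = total_slot_ms / job duration in milliseconds.
sql = """
SELECT
  job_id,
  total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND state = "DONE"
  AND end_time > start_time
ORDER BY avg_slots DESC
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.job_id, round(row.avg_slots, 1))
```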
BigQuery also uses slots as an abstraction to indicate how many physical compute resources are available.
By default, a BigQuery project running on-demand can use up to 2,000 slots. However, you can purchase more through BigQuery Reservations.
Optimising the ingestion of data in BigQuery
Ingestion formats
Ingesting files into BigQuery is fast and straightforward, especially if they are not landing in an existing table. You can simply upload them, and BigQuery auto-detects the best data type to assign to every field. However, if you load a table into BigQuery with Qlik Replicate for the first time, Qlik Replicate defines the table (DDL) first and then loads the data.
For performance reasons, Google Cloud recommends the following file formats for ingestion, from fastest to slowest to load: Avro (compressed or uncompressed), Parquet or ORC, CSV, and JSON; compressed CSV and JSON files are the slowest to load.
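As a minimal sketch of a load job, assuming hypothetical bucket and table names, loading Avro files could look like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Avro files are self-describing, so no schema needs to be supplied.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/sales-*.avro",  # hypothetical bucket
    "my-project.my_dataset.sales",          # hypothetical table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(f"Loaded {load_job.output_rows} rows")
```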
Connection to BigQuery
Google Cloud provides several ways to connect to BigQuery: ODBC, JDBC and API.
- The BigQuery API is recommended for ingesting small to medium volumes of data (see the sketch after this list).
- JDBC is preferred for ingesting large amounts of data.
- You could also connect with ODBC, but it's not my favourite option. There is an external benchmark that compares the performance of JDBC vs ODBC.
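For small to medium volumes, a minimal sketch of streaming rows through the API with the Python client (the table and rows are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stream a few JSON rows into an existing table via the API.
rows = [
    {"customer_id": 1, "country": "ES"},
    {"customer_id": 2, "country": "FR"},
]
errors = client.insert_rows_json(
    "my-project.my_dataset.customers",  # hypothetical table
    rows,
)

if errors:
    print(f"Insert failed: {errors}")
else:
    print("Rows streamed successfully")
```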
Slots
If you have many concurrent users or several processes which consume many slots, your load jobs will be competing for them.
Google Cloud's documentation proposes several queries that can help you to find out how many slots you are using in your BigQuery project (the first one is sketched after this list):
- Average slot utilisation,
- The load job history,
- The most expensive jobs, and
- The pending and running jobs.
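A sketch of the first of these, average slot utilisation over the last day: the total slot-milliseconds consumed divided by the milliseconds in the period. It assumes your jobs run in the US region (adjust the region qualifier otherwise):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Average slots = total slot-milliseconds / milliseconds in the period.
sql = """
SELECT
  SUM(total_slot_ms) / (1000 * 60 * 60 * 24) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = "QUERY"
"""
avg = list(client.query(sql).result())[0].avg_slots
print(f"Average slot utilisation over the last day: {avg:.1f} slots")
```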
Additionally, you can monitor the BigQuery reservations in the Google Cloud console.