Many people are familiar with Amazon AWS cloud, but Google Cloud Platform (GCP) is another interesting cloud provider. For Cloud DB storage option on GCP, Google provides the options like Cloud SQL, Cloud Datastore, Google BigTable, Google Cloud BigQuery, and Google Spanner. In this blog, I am going to discuss all of these five options, but mainly focusing on last three as I am more interested in the options that handle large amount of data.
If you want to have full relational database in supporting customized table views, stored procedures, tons of indexes and ACID compliance, Cloud SQL is probably your potential choice here. Google Cloud SQL is the database service that support two types of databases: MySQL and PostgreSQL. Both support High Available (HA) and Pay Per Use without Lock-in. It can scale up to 32 processor cores and more than 200GB RAM. Although this option might make your life easier in migrating your data to cloud, it does have all the limitations in MySQL and PostgreSQL, and not scaling well for huge data volume. There are many blogs about performance limitation on MySQL and PostgreSQL. I am not going to repeat here.
For more information about Cloud SQL, please visit https://cloud.google.com/sql/.
Google Cloud DataStore is a cloud-based NoSQL database for web and mobile applications. It’s scalable NoSQL database and can automatically handles sharding and replication. It also supports ACID transaction, SQL-like queries and REST API. Unlike BigTable, Datastore is optimized for smaller set of data. Although Cloud Datastore is a NoSQL db and you don’t need to define a schema before storing a row, it actually uses more for ad hoc storage of structured data. Cloud Datastore does not have SQL, but have an API called GQL to perform some kind of queries. Here is one example of the query.
// List Google companies with less than 400 employees. var companies = query.filter('name =', 'Google').filter('size <', 400);
Someone mentioned that Cloud Datastore actually originated from Google’s internal-use database, Megastore. Megastore is widely used inside Google. I couldn’t find Google’s official statement about the link between these two products. But from Google’s publication about Megastore, it does look quite similar to Cloud Datastore.
For more information about Cloud Datastore, please visit https://cloud.google.com/datastore/.
Google BigTable is Google’s cloud storage solution for low latency data access. It was originally developed in 2004 and was built on Google File System (GFS). There is one paper about BigTable: Bigtable: A Distributed Storage System for Structured Data. Now It is widely used in many Google’s core services like Google Search, Google Maps, and Gmail. It is designed in NoSQL architecture, but can still use row-based data format. With data read/write under 10 milliseconds, it is good for applications that have frequent data ingestion. It can be scaleable to hundreds of petabytes and handle millions of operations per second.
BigTable is compatible with HBase 1.0 API via extensions. Any move from HBase will be easier. BigTable has no SQL interface and you can only use API go Put/Get/Delete individual rows or run scan operations. BigTable can be easily integrated with other GCP tools, like Cloud Dataflow and Dataproc. BigTable is also the foundation for Cloud Datastore.
Unlike other clouds, GCP compute and storage are separate. You need to consider the following three parts when calculating the cost.
1. The type of Cloud instance, and the number of nodes in the instance.
2. The total amount of storage your tables use.
3. The amount of network bandwidth used. Please note: some part of network traffic is free.
It’s good and bad. The good part is that you don’t need to pay for the compute cost if your system is idle and you pay only the storage cost. The bad part is that it is not easy to forecast your compute usage if you have very large dataset.
This is the pricing as of June 10, 2017, it could change in the future without notice. There are some interesting observations:
- The compute cost is the same no matter you choose SSD and HDD storage type. It makes sense as storage and compute are separated in GCP.
- The Writes are the same for both cases. The Reads is about 20 times slower in HDD compared with SSD. But the scans for HDD drops just 20%. If you know your access pattern is mostly scan, HDD option seem not bad. Although I would not recommend HDD, just feel this observation is interesting and puzzling.
- The cost to go with HDD storage is only 15% of the cost of going with SSD.
For more information , please visit http://cloud.google.com/bigtable/.
BigQuery is Google’s Cloud-based data warehousing solution. Unlike BigTable, it targets data in big picture and can query huge volume of data in a short time. As the data is stored in columnar data format, it is much faster in scanning large amounts of data compared with BigTable. BigQuery allows you to scale to petabyte and is great enterprise data warehouse for analytics. BigQuery is serverless. Serverless computing means computing resource can be spun up on-demand. It benefits users from zero server usage to full-scale usage without involving administrators and managing infrastructure. According to Google, BigQuery can scan Terabytes of data in seconds and Petabytes of data in minutes. For data ingestion, BigQuery allows you to load data from Google Cloud Storage, or Google Cloud DataStore, or stream into BigQuery storage.
However, BigQuery is really for OLAP type of query and scan large amount of data and is not designed for OLTP type queries. For small read/writes, it takes about 2 seconds while BigTable takes about 9 milliseconds for the same amount of data. BigTable is much better off for OLTP type of queries. Although BigQuery support atomic single-row operations, it lacks cross-row transaction support.
For pricing, there are some free operations. I won’t discuss more about the free operation, but just about standard pricing for the most important components in the BigQuery. There are two major components in the cost of using BigQuery: Storage Cost and Query Cost
For storage cost, it is $0.02 per GB/month. However, Google has a long term storage pricing, which is 50% off to $0.01 per GB/month. The definition of long term storage is the table that is not edited (APPEND, OVERWRITE or STEAMING) for 90 days. Each partition in the table is considered separate storage. So you could have standard pricing for some recent partitions while have long term storage pricing for some historical partitions. Even the data is in long term storage, there is no degradation of performance, durability and availability.
For query cost, the first 1 TB of data processed in a month is free, then it is $5 per TB. No charge for cached queries. As BigQuery is stored in columnar data format, the query cost is based on the columns selected. For enterprise with large amount of data and tons of applications, although the bill for data storage is predictable, the bill for query cost is not. The good news is that Google does offer a flat rate monthly cost model instead of on-demand pricing. For example, you can pay $10,000 for 500 BigQuery Slots and BigQuery automatically manages these slot quota.
For more documentation about BigQuery, please visit http://cloud.google.com/bigquery/docs. A sample of BigQuery screen is shown below:
Cloud Spanner is a globally distributed database and was just officially released last month in May 2017. It is a versioned key-value store. From this perspective, it is similar to BigTable. However, it support general-purpose transactions and provide SQL-based query language.
Spanner was developed in 2011 and used internally for Google’s advertising backend, which is called F1. F1 was initially based on a MySQL database. As Google grows rapidly in its advertising revenue, so is the F1 MySQL database. The uncompressed dataset is in tens of TB. It’s definitely way beyond the comfort zone for MySQL. Even with tremendous effort in sharding scheme, the management of the database became very complex and costly. The last resharding of this MySQL database took two years of intense effort. Please note, even for a great company like Google with so many talented people, it still took two years efforts. I can’t image how other companies can survive this size of MySQL database. This is why I am usually cautious about using MySQL database with large size footprint.
Two features I like most in Cloud Spanner:
1. Replication Configuration
Data replication is handled automatically and transparently. But user application can control the way how data is stored. For example, if user data has the requirement to stay in USA only, you could specify to store data in US data centers only. If you want to improve the read performance and availability, you could increase the number of replicas used and geographic placement of replicas to make the data is close to the users as much as possible. If you want to have fast write throughput, you could decide how far replicas are from each other.
2. Globally-distributed database allowing consistent reads and writes.
This feature is critical if I want to have a consistent backup, or have consistent reads at global scale level. The implementation of this feature is using Google’s TrueTime. In stead of using only one source for the time reference, TrueTime is based on the time references from both GPS and atomic clocks. Google indicates the reason to use two different kinds of time reference because they have different failure models. Atomic clocks can fail over long periods of time like drift significantly while GPS can fail when receivers fail or radio interference. Usually you won’t see both time reference fails at the same time because they have different failure models.
Spanner is organized in a set of zones. Each zone has one zone master and 100 to 1000 spanserver. Each table is split into multiple tablets. A table’s state is stored in set of B-tree like structure files and Write-Ahead Log on a file system called Colossus. Colossus is a global distributed file system and the successor to Google File System (GFS). Spanner’s data model is not purely relational, but semi-relational. Each rows must have names, and each table is required to have an ordered set of one or more primary-key columns. Google publishes a Best Practice for Spanner Schema Design. For more information about the Spanner architecture, please check out Google’s research paper: Spanner: Google’s Globally-Distributed Database.
The storage cost is $0.30 per GB/month and $9 per hour per node. Each Spanner node can provide up to 10,000 QPS of reads or 2000 QPS of writes (writing single rows at 1KB data per row), and 2 TB disk storage. Google also recommend provision more spanner nodes to keep CPU utilization below 75%.
Among these five database storage options in GCP, I like Cloud Spanner the most as I feel it is the best to meet the requirement of supporting Relational databases and great scalability and availability. But if you want to find which option is really for you, Google has a nice decision tree to help you to determine the best option for you.
For more information about Cloud Spanner, please visit https://cloud.google.com/spanner/docs/.