28 February 2018

TiDB: Performance-tuning a distributed NewSQL database


A brief introduction to the open source hybrid transactional and analytical database and how to tune write operations for maximum performance

Performance tuning on distributed systems is no joke. It is much more complicated than on a single-node server, and bottlenecks can pop up anywhere: in the system resources of a single node or subcomponent, in the cooperation between nodes, or even in network bandwidth. Performance tuning is a practice that aims to find these bottlenecks and address them, in order to reveal still more bottlenecks and address them as well, until the system reaches an optimal performance level.

In this article, I introduce you to TiDB, a distributed NewSQL database, and share some best practices on how to tune write operations in TiDB to achieve maximum performance. TiDB is an open source, hybrid transactional/analytical processing (HTAP) database, designed to support both OLTP and OLAP scenarios.

One TiDB cluster has several TiDB servers, several TiKV servers, and a group of Placement Drivers (PDs), usually three or five nodes. The TiDB server is a stateless SQL layer, the TiKV server is the key-value storage layer, and each PD is a manager component with a “god view” that is responsible for storing metadata and doing load balancing. Below is the architecture of a TiDB cluster. You can find more details on each component in the official TiDB documentation.

[Figure: TiDB cluster architecture (PingCAP)]

Gathering TiDB metrics

We gather a lot of metrics inside each TiDB component. These are periodically sent to Prometheus, an open source system monitoring solution. You can easily observe the behaviors of these metrics in Grafana, an open source platform for time series analytics. If you deploy the TiDB cluster using Ansible, Prometheus and Grafana will be installed by default. By observing various metrics, we can see how each component is working, pinpoint the bottlenecks, and address them via tuning. Let’s see an example.

Write flow of a SQL INSERT statement

Let’s use the following SQL statement to insert a record into table t.

mysql> INSERT INTO t(id, name, address) VALUES (1, 'Jack', 'Sunnyvale');

[Figure: Write flow of a SQL INSERT statement in TiDB (PingCAP)]

Above is a simplified overview of how this SQL statement is processed in TiDB. After the TiDB server receives this instruction, it will transform the statement into one or more key-value pairs, depending on the number of indexes. These key-value pairs will then be sent to an associated TiKV server, where they will be replicated to other TiKV servers in the form of a Raft log. Finally, when the Raft log has been committed, these key-value pairs will be written to the engine for storage.

Thus there are three key steps to processing this statement in TiDB: transforming SQL to key-value pairs, replication by region, and two-phase commit. Let’s dive into each in more detail.

Step 1: From SQL to key-value pairs

Unlike other database systems, TiDB stores only key-value pairs, in order to provide infinite horizontal scalability with strong consistency. So how do we implement high-level concepts, such as database, table, and index? In TiDB, each table has an associated globally unique number called the table-id. The keys of all data in a table, both records and indexes, begin with the eight-byte table-id. Each index has a table-scoped unique number called the index-id. The following two lines show the encoding rules for record keys and index keys.

[Figure: TiDB key encoding rules (PingCAP)]
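
The encoding rules can be sketched as follows. This is a simplified illustration in Python; the real implementation uses a binary, memory-comparable encoding, and the function names here are my own.

```python
def encode_record_key(table_id: int, row_id: int) -> bytes:
    # Record keys: 't' + 8-byte table-id + '_r' + 8-byte row-id
    return b"t" + table_id.to_bytes(8, "big") + b"_r" + row_id.to_bytes(8, "big")

def encode_index_key(table_id: int, index_id: int, index_value: bytes) -> bytes:
    # Index keys: 't' + 8-byte table-id + '_i' + 8-byte index-id + encoded value
    return (b"t" + table_id.to_bytes(8, "big")
            + b"_i" + index_id.to_bytes(8, "big") + index_value)

# Because every key starts with the table-id, all keys of one table sort
# next to each other, and adjacent row-ids produce adjacent keys.
k1 = encode_record_key(42, 1)
k2 = encode_record_key(42, 2)
assert k1 < k2
```

Using big-endian byte order keeps the byte-wise sort order of keys consistent with the numeric order of table-ids and row-ids, which is what makes range scans over a table efficient.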

Step 2: Replication by region

In TiDB, a region represents a continuous, left-closed, right-open key range, [start_key, end_key). Each region has several replicas, and each replica is called a peer. Each region also belongs to an individual Raft group, which guarantees strong data consistency among all peers. (For more information on how the Raft consensus algorithm is implemented in TiKV, see this blog post by Liu Tang, distinguished engineer at PingCAP.) Nearby records of the same table are most likely in the same region, because of the encoding rules I mentioned above.

When the cluster is first initialized, there is only one region. A region will dynamically split into two nearby regions when it reaches a certain size (currently the default is 96MB), automatically distributing data across the system to provide horizontal scaling.
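The split behavior can be modeled with a toy sketch. The class and threshold below are illustrative only; real TiKV tracks approximate region sizes and picks split keys itself.

```python
REGION_SPLIT_SIZE = 96 * 1024 * 1024  # default split threshold (96MB)

class Region:
    """A region covers the left-closed, right-open key range [start_key, end_key)."""
    def __init__(self, start_key: bytes, end_key: bytes):
        self.start_key, self.end_key = start_key, end_key
        self.size = 0  # approximate bytes stored in this region

def maybe_split(region: Region, split_key: bytes) -> list:
    # Below the threshold, the region is left alone.
    if region.size < REGION_SPLIT_SIZE:
        return [region]
    # Otherwise split into two adjacent regions sharing the split key
    # as one's end and the other's start.
    left = Region(region.start_key, split_key)
    right = Region(split_key, region.end_key)
    left.size = right.size = region.size // 2  # rough halves
    return [left, right]
```

Since the two halves remain adjacent key ranges, PD can later move either one to another TiKV node to balance load without breaking the global key ordering.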

Step 3: Two-phase commit

Our transaction model is inspired by Percolator, but with some additional optimizations. Essentially, it is a two-phase commit protocol with prewrite and commit.
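A heavily simplified sketch of the prewrite/commit flow is shown below. Real TiDB uses timestamps issued by PD and persists locks in dedicated column families; the in-memory class here is only an illustration of the two phases.

```python
class TwoPhaseCommit:
    """Toy Percolator-style two-phase commit over an in-memory store."""
    def __init__(self):
        self.data = {}    # committed key -> value
        self.locks = {}   # key -> (txn id, prewritten value)

    def prewrite(self, txn_id, mutations):
        # Phase 1: lock every key; fail if any key is already locked
        # by another transaction (a write conflict).
        for key in mutations:
            if key in self.locks:
                return False  # caller must retry
        for key, value in mutations.items():
            self.locks[key] = (txn_id, value)
        return True

    def commit(self, txn_id, keys):
        # Phase 2: make the prewritten values visible and release the locks.
        for key in keys:
            holder, value = self.locks.pop(key)
            assert holder == txn_id
            self.data[key] = value
```

The key property is that nothing becomes visible until every key in the transaction has been successfully prewritten, so a conflict during phase 1 aborts the whole transaction cleanly.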

There is a lot more under the hood in each of these components, but a macro-level understanding is enough to set the scene for performance tuning. Now, let’s dig into four specific tuning techniques.

Tuning the TiDB scheduler

All write commands are dispatched to the scheduler model before replication. The scheduler model consists of one scheduler thread and several worker threads. Why do we need a scheduler model? Before writing to the database, we need to check whether the write command is allowed and whether it satisfies the constraints of a transaction. All of this checking, which might require reading from the underlying storage engine, is handled by the scheduler workers.

When your metrics show that the total CPU usage of the scheduler workers exceeds scheduler-worker-pool-size * 80%, the way to tune your system is to increase the number of scheduler workers.

You can change the scheduler worker count by modifying the scheduler-worker-pool-size item in the storage section of the configuration file. There are four scheduler workers by default on machines with fewer than 16 CPU cores; otherwise the default is eight. The relevant configuration item is scheduler-worker-pool-size = 4.
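
In the TiKV configuration file this looks like the fragment below (the value 8 is just an example for a machine with 16 or more cores):

```toml
[storage]
# Default: 4 on machines with fewer than 16 CPU cores, 8 otherwise.
scheduler-worker-pool-size = 8
```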

[Figure: TiKV scheduler CPU usage (PingCAP)]
[Figure: TiKV scheduler worker CPU usage (PingCAP)]

Tuning the TiKV raftstore and apply threads

As I mentioned above, we use Raft to implement strong consistency between multiple peers. Before we write one key-value pair into the database, this operation must first be replicated by Raft in the form of the Raft log, which also needs to be written to disk in each peer to maintain durability. Only when the Raft log has been committed can the associated key-value pair of the write command be written to the database.

Thus, there are two types of write: writing the Raft log and writing the key-value pairs to the database. To process each type separately inside TiKV, there is one thread named raftstore that handles all Raft messages and writes Raft logs to disk, and another thread named apply worker that writes key-value pairs to storage. These two threads play significant roles in the write workload, and you can easily track how busy they are in the Thread CPU sub-panel of the TiKV panel in Grafana (see below).

What should you watch for? If the apply or raftstore threads of some TiKV servers are very busy while those of other TiKV servers are not, the write load is unbalanced, and the few busy TiKV servers become bottlenecks for the whole cluster. You can accidentally create such a hotspot by using AUTO_INCREMENT on the primary key, or by creating an index on a column whose value keeps increasing, such as a last-access timestamp.

To tune this scenario and remove the bottleneck, you must avoid designing primary keys and indexes on columns whose values increase monotonically.

In a traditional, single-node database system, the AUTO_INCREMENT keyword can offer a lot of benefit for sequential writing. But in a distributed database system such as TiDB, the most important consideration is balancing load across all components.
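
The hotspot effect is easy to see in a toy model of range-based routing. The region boundaries below are made up for the example; only the routing-by-range idea mirrors TiKV.

```python
import hashlib
from bisect import bisect_right

# Four regions, split at these (made-up) boundary keys.
boundaries = [b"k2500", b"k5000", b"k7500"]

def region_of(key: bytes) -> int:
    # Range routing: the region index is determined by where the key
    # falls among the sorted boundaries.
    return bisect_right(boundaries, key)

# Monotonically increasing keys (e.g. AUTO_INCREMENT ids) all land in
# the last region, so one TiKV server absorbs every write:
hot = {region_of(b"k%04d" % i) for i in range(8000, 8100)}
assert hot == {3}

# Hashing the key first spreads the same writes across regions:
spread = {region_of(b"k" + hashlib.md5(str(i).encode()).hexdigest()[:4].encode())
          for i in range(8000, 8100)}
assert len(spread) > 1
```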

[Figure: TiKV raft store CPU usage (PingCAP)]
[Figure: TiKV async apply CPU usage (PingCAP)]

Tuning the RocksDB storage engine

RocksDB is a persistent key-value store with high performance and many useful features. TiKV uses RocksDB as its underlying storage engine and takes advantage of many of its features including column families, delete range, prefix seek, MemTable prefix bloom filter, and SST (Static Sorted Table) user-defined properties. RocksDB provides its own detailed performance tuning documentation.

Underneath each TiKV server are two RocksDB instances, which we call the kv-engine and the raft-engine: The former stores the data, and the latter stores the Raft log. The kv-engine has four column families: default, lock, write, and raft. Most records are stored in the default column family, and all indexes are stored in the write column family. You can tune these RocksDB instances by modifying the block-cache-size value in the associated sections of the configuration file to achieve top performance. The relevant sections are [rocksdb.defaultcf] block-cache-size = "1GB" and [rocksdb.writecf] block-cache-size = "1GB".

The reason we tune block cache size is that the TiKV server frequently reads data from the write column family to check whether the transaction constraints are satisfied when inserting. It is therefore important to set an appropriate size for the write column family's block cache. When the block cache hit rate of the write column family falls below 90 percent, you should increase its block cache size. The default value of block-cache-size for the write column family is 15 percent of total memory, and the default for the default column family is 25 percent.

For example, if we deploy a TiKV node on a machine with 32GB of memory, then the values for block-cache-size would be about 4.8GB for the write column family and 8GB for the default column family. Under heavy write workloads, the data in the default column family is rarely accessed, so when we see a cache hit rate below 90 percent, for example 50 percent, then we know the frequently accessed data in the write column family is about twice as large as the default cache size of 4.8GB. To tune this scenario for better performance, we could explicitly set block-cache-size for the write column family to 9GB. However, we would also need to decrease block-cache-size for the default column family to 4GB to avoid out-of-memory errors. You can find detailed statistics for RocksDB in the RocksDB-kv panel in Grafana to help you make these adjustments.
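
For the 32GB example above, the explicit overrides would look like this in the configuration file:

```toml
# Write-heavy workload on a 32GB machine: grow the write column family's
# block cache and shrink the default column family's to compensate.
[rocksdb.writecf]
block-cache-size = "9GB"

[rocksdb.defaultcf]
block-cache-size = "4GB"
```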

[Figure: RocksDB block cache hit rate (PingCAP)]
[Figure: RocksDB block cache size (PingCAP)]

Take advantage of batch insert

You can achieve better write performance by using batch insert. Batch insert reduces not only the RPC latency between clients and TiDB servers, but also the SQL parsing time. Inside TiKV, batch insert reduces the total number of Raft messages by combining multiple records into one Raft log entry. Based on our experience, we recommend keeping the batch size between 50 and 100 rows. When a table has more than 10 indexes, you should decrease the batch size, because inserting one row will create more than 10 key-value pairs, per the encoding rules I described above.
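
A simple way to apply this from client code is to chunk rows into multi-value INSERT statements. The helper below is a sketch (the function name and parameters are my own); it yields parameterized statements that a MySQL-protocol driver such as PyMySQL can execute against TiDB.

```python
def batched_inserts(table, columns, rows, batch_size=100):
    """Group rows into multi-value INSERT statements of at most batch_size rows.

    Values are rendered as %s placeholders so the driver can bind them safely.
    """
    placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
    head = f"INSERT INTO {table}({', '.join(columns)}) VALUES "
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        sql = head + ", ".join([placeholders] * len(batch))
        params = [value for row in batch for value in row]
        yield sql, params

# Usage sketch: each (sql, params) pair goes to cursor.execute(sql, params).
for sql, params in batched_inserts("t", ["id", "name"], [(1, "Jack"), (2, "Jill")]):
    pass  # cursor.execute(sql, params)
```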

I hope this article helped you get a good sense of some common bottleneck scenarios and how to tune these issues away to achieve optimal performance on “write” procedures when using TiDB. In summary:

  1. Don’t let a few TiKV nodes handle the majority of a “write” workload. Avoid designing primary keys and indexes on a column whose value monotonically increases.
  2. When the total CPU usage of the scheduler workers in TiKV’s scheduler model exceeds scheduler-worker-pool-size * 80%, increase the scheduler worker pool size.
  3. When writing tasks read the write column family frequently and the block cache hit rate is below 90 percent, increase its block cache size in RocksDB.
  4. Use batch insert to improve performance on “write” operations.

Many of our customers—from ecommerce marketplaces and gaming to fintech, media, and travel—are already applying these tuning techniques in production to get the full benefits of TiDB’s design, architecture, and optimizations. We look forward to sharing their use cases and experiences in the near future.