TiDB: Performance-tuning a distributed NewSQL database
A brief introduction to the open source hybrid transactional and analytical database and how to tune write operations for maximum performance
Doing performance tuning on distributed systems is no joking matter. It’s much more complicated than on a single node server, and bottlenecks can pop up anywhere, from system resources in a single node or subcomponent, to cooperation between nodes, to even network bandwidth. Performance tuning is a practice that aims to find these bottlenecks and address them, in order to reveal more bottlenecks and address them as well, until the system reaches an optimal performance level.
In this article, I introduce you to TiDB, a distributed NewSQL database, and share some best practices on how to tune write operations in TiDB to achieve maximum performance. TiDB is an open source, hybrid transactional/analytical processing (HTAP) database, designed to support both OLTP and OLAP scenarios.
One TiDB cluster has several TiDB servers, several TiKV servers, and a group of Placement Drivers (PDs), usually three or five nodes. The TiDB server is a stateless SQL layer, the TiKV server is the key-value storage layer, and each PD is a manager component with a “god view” that is responsible for storing metadata and doing load balancing. Below is the architecture of a TiDB cluster. You can find more details on each component in the official TiDB documentation.
Gathering TiDB metrics
We gather a lot of metrics inside each TiDB component. These are periodically sent to Prometheus, an open source system monitoring solution. You can easily observe the behaviors of these metrics in Grafana, an open source platform for time series analytics. If you deploy the TiDB cluster using Ansible, Prometheus and Grafana will be installed by default. By observing various metrics, we can see how each component is working, pinpoint the bottlenecks, and address them via tuning. Let’s see an example.
Writeflow of a SQL INSERT statement
Let’s use the following SQL statement to insert a record into table t.
mysql >> INSERT INTO t(id, name, address) values(1, “Jack”, “Sunnyvale”);
Above is a simplified overview of how this SQL statement is processed in TiDB. After the TiDB server receives this instruction, it will transform the statement into one or more key-value pairs, depending on the number of indexes. These key-value pairs will then be sent to an associated TiKV server, where they will be replicated to other TiKV servers in the form of a Raft log. Finally, when the Raft log has been committed, these key-value pairs will be written to the engine for storage.
Thus there are three key steps to processing this statement in TiDB: transforming SQL to key-value pairs, replication by region, and two-phase commit. Let’s dive into each in more detail.
Step 1: From SQL to key-value pairs
Unlike other database systems, TiDB stores key-value pairs only to provide infinite horizontal scalability with strong consistency. So how do we implement high-level concepts, such as database, table, and index? In TiDB each table has an associated global unique number called
table-id. The keys of all data in a table include records and indexes and begin with the eight-byte
table-id. Each index has a table-scope unique number called
index-id. The following two lines show the encoding rules of record keys and index keys.
Step 2: Replication by region
In TiDB, a region represents a continuous, left-close-right-open key range,
[start_key, end_key). Each region has several replicas, and each replica is called a peer. Every region also belongs to an individual Raft group to guarantee strong data consistency among all peers. (For more information on how the Raft consensus algorithm is implemented in TiKV, see this blog post by Liu Tang, distinguished engineer at PingCAP.) Nearby records of the same table are most likely in the same region, because of the encoding rules that I mentioned above.
When the cluster is first initialized, there is only one region. A region will dynamically split into two nearby regions when it reaches a certain size (currently the default is 96MB), automatically distributing data across the system to provide horizontal scaling.
Step 3: Two-phase commit
Our transaction model is inspired by Percolator, but with some additional optimizations. Essentially, it is a two-phase commit protocol with prewrite and commit.
There is a lot more under the hood in each of these components, but a macro-level understanding is enough to set the scene for performance tuning. Now, let’s dig into four specific tuning techniques.
Tuning the TiDB scheduler
All write commands are dispatched to the scheduler model before replicating. The scheduler model is comprised of one scheduler thread and several worker threads. Why do we need a scheduler model? Before writing to the database, we need to check whether the specific write command is allowed and whether it satisfies the constraints of a transaction. All this checking, which might require reading from the underlying storage engine, is handled by scheduler workers.
When you see in your metric that the total CPU usage of the scheduler workers
exceeds scheduler-worker-pool-size * 80%, the way to tune your system is to increase the number of scheduler workers to improve performance.
You can change the scheduler workers count by modifying the
scheduler-worker-pool-size item in the
storage section of the configuration file. There are four scheduler workers by default on machines whose CPU core count is less than 16; otherwise the default is eight scheduler workers. See the relevant code section:
scheduler-worker-pool-size = 4.
Tuning the TiKV raftstore and apply threads
As I mentioned above, we use Raft to implement strong consistency between multiple peers. Before we write one key-value pair into the database, this operation must first be replicated by Raft in the form of the Raft log, which also needs to be written to disk in each peer to maintain durability. Only when the Raft log has been committed can the associated key-value pair of the write command be written to the database.
Thus, there are two types of write: writing the Raft log and writing the key-value pair to the database. To process each type separately inside TiKV, there is one thread named
raftstore that handles all Raft messages and writes Raft logs to the disk, and another thread named
apply workerthat writes key-value pairs to storage. These two threads, which you can view in the Thread CPU sub-panel of the TiKV panel in Grafana (see below), play significant roles in the write workload. We can easily track how busy these threads are through Grafana.
What should you watch for? If the
raftstore threads of some TiKV servers are very busy, while those of other TiKV servers are not, that means the write load is unbalanced. A few busy TiKV servers thus become bottlenecks for the whole cluster. You could accidentally create such a hotspot situation by using
AUTOINCREMENT on the primary key or creating an index on a column whose value keeps increasing, like
last access timestamp.
To tune this scenario and remove the bottleneck, you must avoid designing primary keys and indexes on columns whose values increase monotonically.
In traditional, single-node database systems, using the
AUTOINCREMENTkeyword can offer a lot of benefit for sequential writing. But when it comes to a distributed database system such as TiDB, the most important consideration is balancing load across all components.