In my previous article, Introduction to Azure Cosmos DB, I mentioned partitioning and throughput only briefly. Adopting a good partition scheme is essential to setting up your Cosmos DB container for elastic scaling and blazing performance. This article takes a closer look at these two aspects to help you fully utilize the storage and performance offerings of Cosmos DB.
Azure Cosmos DB containers store documents, graphs or tables. Containers (a.k.a. collections in the context of documents) are logical entities that could be distributed across multiple physical partitions or servers.
Physical and Logical Partitions
A physical partition is an internal Cosmos DB concept, essentially a fixed amount of SSD storage combined with a variable amount of compute power (CPU, memory and IO). The number of physical partitions of a container depends on its storage and throughput. For containers with shared throughput, the number of partitions depends on the RU/s assigned to the set of containers.
Request Unit (RU) – the unit of throughput, provisioned per second (RU/s). 1 RU serves a get by self-link (internal property) or id of a 1 KB item.
When a collection is created, we can specify a fixed storage capacity of 10 GB or unlimited capacity. A fixed storage collection is limited in performance to a max of 10,000 RU/s. If we choose unlimited capacity, the collection created potentially has no max RU/s limit. Collections are supposedly unlimited in terms of storage and throughput, and physical partition management is handled by Cosmos DB behind the curtains. Note that for a multi-partition collection, we need to specify a partition key.
Data within a container having the same partition key value form a logical partition. The max storage limit of a logical partition is 10 GB, which means that if the data associated with a certain partition key value goes beyond 10 GB, the logical partition will be full and cannot grow any further. This is why adopting a good partition scheme is very important to benefit from the storage and performance guarantees of Azure Cosmos DB.
Azure Cosmos DB internally has a limit for the max throughput that can be provided by a physical partition – PRUmax. This value keeps changing based on factors such as hardware used and platform upgrades. For now, keep in mind that this happens behind the scenes.
Let us assume PRUmax = 10,000 RU/s. We create an unlimited Collection product at 20,000 RU/s initial throughput and productid as the partition key. Cosmos DB has to create at least 2 physical partitions to support the 20,000 RU/s throughput requested. Currently, the default seems to be 5. So, Cosmos DB creates a new collection with 5 physical partitions. The throughput requested will be equally assigned to these physical partitions. This means, the max throughput limit for each partition is 20,000/5 = 4000 RU/s.
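The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reasoning in this article, not Cosmos DB's actual allocation logic: PRUmax = 10,000 RU/s is the assumed value from the example, the default of 5 partitions is what I observed at the time, and the function name is made up for the sketch.

```python
import math

def initial_layout(requested_ru, pru_max, default_partitions=5):
    """Return (partition_count, ru_per_partition) for a new collection.

    pru_max and default_partitions are assumptions taken from the
    example in the text, not values published by Cosmos DB.
    """
    # Minimum number of physical partitions able to serve the request.
    required = math.ceil(requested_ru / pru_max)
    # Use the observed default when it exceeds the minimum.
    count = max(required, default_partitions)
    # Provisioned throughput is divided equally among the partitions.
    return count, requested_ru / count

count, per_partition = initial_layout(20_000, pru_max=10_000)
print(count, per_partition)  # 5 4000.0
```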
As we add new documents, Cosmos DB allocates the key space of partition key hashes evenly and consistently across the 5 physical partitions. If the partition key is well chosen, writes will be distributed evenly across the partitions, each partition serving nearly 4000 RU/s and a cumulative of nearly 20,000 RU/s. This is ideal. In the real world, it is possible that we chose a bad partition key.
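To picture how a hashed key space spreads partition key values over physical partitions, here is a minimal sketch. It uses MD5 and a 32-bit key space purely as stand-ins; the hash function and range layout that Cosmos DB really uses are internal, and the names below are hypothetical.

```python
import hashlib
from collections import Counter

PARTITIONS = 5
HASH_SPACE = 2 ** 32  # size of the illustrative hash key space

def partition_for(key, partitions=PARTITIONS):
    """Map a partition key value to a physical partition index."""
    # Hash the key value to an integer in [0, HASH_SPACE).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")
    # Each physical partition owns one contiguous, equal-sized hash range.
    return h * partitions // HASH_SPACE

# A high-cardinality key such as productid spreads out almost evenly.
counts = Counter(partition_for(f"product-{i}") for i in range(10_000))
for p in range(PARTITIONS):
    print(p, counts[p])
```

The same key value always lands on the same partition, which is what keeps a logical partition together on one physical partition.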
What can go wrong?
- Performance impact: If the majority of the concurrent writes/reads pertain to a specific partition key value, we could have 1 physical partition maxing out the 4000 RU/s allocated to it (hot partition), while the other 4 partitions idle. When this happens, requests are bound to get rate-limited and we will get an HTTP 429 response code.
- Storage impact: Earlier in the article, I mentioned the concept of logical partition. All data having the same partition key form a logical partition. Logical partitions cannot be split across physical partitions. For the same reason, if the partition key chosen is of bad cardinality, we could potentially have skewed storage distribution. Say, 1 logical partition becomes fatter faster and hits the max limit of 10 GB, while the others are nearly empty. The physical partition housing the maxed out logical partition cannot split and could thus cause application downtime.
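The performance impact is easy to see with numbers. The sketch below is a simplified model, not how Cosmos DB meters requests: it assumes each physical partition can serve at most its equal share of the provisioned throughput, and treats any demand above that share as rate-limited. The demand figures are invented for illustration.

```python
def throttled_ru(demand, ru_per_partition):
    """Per partition, the RU/s of demand above its allocation.

    In this simplified model, that excess is what would come back
    as HTTP 429 responses.
    """
    return {p: max(0.0, d - ru_per_partition) for p, d in demand.items()}

RU_PER_PARTITION = 20_000 / 5  # 4000 RU/s each, as in the example above

# Hypothetical demand in RU/s per physical partition.
even = {p: 3_900.0 for p in range(5)}
skewed = {0: 15_000.0, 1: 1_000.0, 2: 1_000.0, 3: 1_000.0, 4: 1_000.0}

print(sum(throttled_ru(even, RU_PER_PARTITION).values()))    # 0.0
print(sum(throttled_ru(skewed, RU_PER_PARTITION).values()))  # 11000.0
```

Note that the skewed workload asks for only 19,000 RU/s in total, which is below the 20,000 RU/s provisioned, yet most of the traffic to the hot partition is rejected.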
Physical partition split
Azure Cosmos DB manages physical partitions seamlessly behind the scenes, provided you chose your partition key smartly. The following are two scenarios in which Cosmos DB will split a physical partition.
- Storage limit of 10 GB: When a physical partition is full, Cosmos DB will split it into 2 new partitions, assigning data corresponding to nearly half of the keys to each new partition. As mentioned previously, the split cannot happen if all the data in the physical partition in question have the same partition key value.
- Increasing throughput: When the throughput assigned is increased such that the existing number of physical partitions is insufficient to support it, Cosmos DB will add new physical partitions. In the above example, if the throughput is increased to 100,000 RU/s, Cosmos DB would add 5 new physical partitions.
With the assumed PRUmax of 10,000 RU/s, Cosmos DB needs 100,000/PRUmax = 10 physical partitions to support the throughput setting, and 5 already exist.
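Both split scenarios can be sketched as follows. This is a toy model under stated assumptions: the real service splits along hash ranges, whereas the sketch simply sorts key values and cuts the list in half, and PRUmax = 10,000 RU/s is the assumed value from the example. The function names are hypothetical.

```python
import math

def split_partition(size_by_key):
    """Split one physical partition's logical partitions into two halves.

    size_by_key maps partition key value -> stored GB. Returns None
    when everything shares a single key value, since a logical
    partition cannot be divided.
    """
    if len(size_by_key) < 2:
        return None
    keys = sorted(size_by_key)  # stand-in for ordering by hash range
    mid = len(keys) // 2
    left = {k: size_by_key[k] for k in keys[:mid]}
    right = {k: size_by_key[k] for k in keys[mid:]}
    return left, right

def partitions_to_add(current, new_ru, pru_max):
    """Physical partitions to add so that new_ru can be served."""
    required = math.ceil(new_ru / pru_max)
    return max(0, required - current)

print(split_partition({"p1": 3.0, "p2": 2.5, "p3": 2.0, "p4": 2.5}))
print(split_partition({"p1": 10.0}))               # None
print(partitions_to_add(5, 100_000, 10_000))       # 5
```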
What makes Cosmos DB an attractive high volume transaction database is the ease of scaling. When request rates are low, throughput can be lowered to keep costs down. Cosmos DB’s performance is predictable. For example, a read of a 1-KB document with session consistency always consumes 1 RU, regardless of the number of concurrent requests or the amount of data stored.
There are, however, two major design considerations to facilitate elastic scaling of Azure Cosmos DB.
Distribute requests and storage
An ideal candidate property for the partition key will allow writes to be distributed across many distinct values. Requests to the same partition key should remain lower than the max throughput limit allocated to a partition. A good partition key will evenly distribute writes across all physical partitions and not cause hot partitions. In our example, productid is a good partition key, because it is unlikely that all concurrent requests will be focused on a specific product. If we were to choose the property productcategory as the partition key, that could potentially cause hot partitions, since many concurrent requests could target the same category.
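One rough way to compare candidate keys is to look at what share of requests would hit the single busiest key value. The sketch below does this on made-up sample data; the 70/30 category split and the helper name are assumptions for illustration, not measurements.

```python
from collections import Counter

def hottest_share(requests, key):
    """Fraction of requests that target the busiest value of `key`."""
    counts = Counter(r[key] for r in requests)
    return max(counts.values()) / len(requests)

# Hypothetical workload: 1000 requests, each for a different product,
# with 70% of the products in one category.
requests = [
    {"productid": f"p{i}",
     "productcategory": "phones" if i % 10 < 7 else "laptops"}
    for i in range(1000)
]

print(hottest_share(requests, "productid"))        # 0.001
print(hottest_share(requests, "productcategory"))  # 0.7
```

A share close to 1 means most traffic lands on one logical partition, and therefore on one physical partition.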
Partition scope for queries and transactions
At one extreme, we can use the same partition key for all documents. At the other extreme, we can have a unique partition key for each document. Both approaches have their limitations. Using the same partition key for all documents will limit scalability and cause a hot partition and inefficient utilization of throughput. Using unique partition keys will support high scalability, but result in a lot of cross-partition queries and prevent use of cross-document transactions. Occasional fan-out of queries is not too bad, but frequent fan-out will incur high RU consumption and result in rate-limiting.
Throughput can be estimated based on the number of expected reads/writes per second. 1 Request Unit (RU) corresponds to a read of a 1-KB document containing 10 unique property values by self-link or id. A write, replace or delete will consume more RUs.
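As a back-of-the-envelope version of that estimate, here is a small sketch. Only the 1 RU read cost comes from the text above; the 5 RU write cost and the rounding up to a multiple of 100 RU/s are my assumptions for illustration, so measure the real charge of your own operations before relying on the result.

```python
import math

def estimate_ru_per_second(reads_per_sec, writes_per_sec,
                           read_ru=1.0, write_ru=5.0):
    """Rough RU/s to provision for 1-KB documents.

    read_ru follows the definition of a Request Unit; write_ru is an
    assumed placeholder for the higher cost of writes.
    """
    raw = reads_per_sec * read_ru + writes_per_sec * write_ru
    # Assumption: throughput is provisioned in steps of 100 RU/s.
    return math.ceil(raw / 100) * 100

print(estimate_ru_per_second(500, 100))  # 1000
print(estimate_ru_per_second(120, 30))   # 300
```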
Microsoft provides a Request Unit calculator that serves to arrive at a base throughput to assign when creating a new collection. Be prepared to fine tune the RU setting as you trot along, but this is a good starting point.
This URL ignites nostalgia 🙂
Azure Cosmos DB is a lot more versatile compared to the initial DocumentDB days. With added support for MongoDB, Graph, Cassandra and Table APIs and multi-master and global distribution support, Cosmos DB is definitely the most exciting product in database technology at the moment. With the new Azure data products such as Azure Stream Analytics, Azure Databricks and HDInsight supporting out-of-the-box integration with Cosmos DB, it is fast becoming a good candidate for Big Data solutions.
Please feel free to reach out if you have questions. I’m always happy to discuss technology 🙂
9 thoughts on “Azure Cosmos DB – Partition and Throughput”
“Cosmos DB creates a new collection with 5 physical partitions”.
If I create a new collection and allocate the minimum of 400 RU/s, does this mean that 5 underlying physical partitions will be created, each contributing approximately 80 RU/s?
Can you please cite your sources?
I am finding it hard to believe that 5 physical partitions are blindly assigned on every new collection created
Hi Herbey, thanks for the comment.
My blog says –
“Azure Cosmos DB internally has a limit for the max throughput that can be provided by a physical partition – PRUmax. This value keeps changing based on factors such as hardware used and platform upgrades. For now, keep in mind that this happens behind the scenes.
Let us assume PRUmax = 10,000 RU/s. We create an unlimited Collection product at 20,000 RU/s initial throughput and productid as the partition key. Cosmos DB has to create at least 2 physical partitions to support the 20,000 RU/s throughput requested. Currently, the default seems to be 5. So, Cosmos DB creates a new collection with 5 physical partitions.”
You quoted only a part of it – “Cosmos DB creates a new collection with 5 physical partitions” – which might be taking it out of context. And I might be wrong in saying that the default “seemed” to be 5. It is the default (in cases where initial throughput was higher than PRUmax) that was probably 5. I was ambiguous there with the wording, as I had never come across an unlimited collection that had fewer than 5 physical partitions upon creation.
Please note that the article was written back in the day when fixed containers were still around. At the time, the min throughput for an “unlimited container” was 1000 RU/s. There could be a change in behavior since then. I will have to test out some scenarios to ascertain current behavior.
1. Create container with new lowest RU limit of 400 – in this case, I expect only 1 physical partition.
2. Create container with 1000 RU/s (lower than PRUmax).
3. Create container with 20,000 RU/s (higher than PRUmax, assuming PRUmax is 10,000 RU/s).
After creation of container in scenario 1, if you increase its throughput to say 50,000 RU/s, Cosmos DB will have to create additional physical partitions to support it. If you then try to scale down, the lowest RU you can assign would no longer be 400.
Nice pick there Herbey.
Here is the documentation concerning physical partitions.
Here, they state something different from your claim of a default of 5 partitions per container:
Most small Cosmos containers have many logical partitions but only require a single physical partition
Yes, that makes sense with fixed containers getting deprecated. Please refer to my comment above.