Azure Cosmos DB – HTAP using Azure Synapse Link

I haven’t done much blogging in the last 6 months. My last article was posted in January, on Data Governance in the Covid era. After changing jobs in March, I prioritized getting settled into my new role. That said, I have had the writer’s itch for a while and thought what better topic to break the rut with than Cosmos DB. I am a big fan of Cosmos DB – Microsoft’s multi-model NoSQL database service in Azure, and I have written a few articles on the topic.

Recent stats of my blog show more interest in data movement out of Cosmos DB, possibly for reporting purposes. Check out my 2019 article on real-time data movement from Cosmos DB using Change Feed and Azure Functions, where I explained how to persist container changes to SQL Server. Well, that is one way to enable real-time analytics. Cosmos DB has evolved a lot since then, and there are new and smarter ways to achieve similar results. Last year (Dec 2020), Microsoft announced the general availability of Azure Synapse Analytics – the unified analytics service in Azure.

sqlroadie – 2021 blog stats

HTAP capabilities of Cosmos DB using Synapse Analytics

In this article, we will explore the HTAP capabilities of Cosmos DB. The goal is to derive real-time insights from transactional changes made to Cosmos DB, in a cost-effective manner with minimal overhead. I want the solution to be scalable, reliable and overall simple to maintain.

What is HTAP?

HTAP stands for hybrid transaction/analytical processing. Traditionally, data systems can be classified as either OLTP – Online Transaction Processing or OLAP – Online Analytical Processing. Amazon shopping website is a good example of an OLTP system, as it manages millions of customer orders daily in real-time. However, Amazon would have billions of historical customer order details in their OLAP data warehouse. This helps them to create product recommendations by analyzing historical order data, and serve it to the OLTP system through analytics APIs.

Ordinarily, transaction data has to be copied or replicated from OLTP -> OLAP databases through ETL processes, and reporting/insights/recommendations generated in the OLAP system at a latency that made business sense. This is a tedious process and involves a lot of overhead. Majority of enterprises still use this approach. However, advancements in computing have enabled a new age HTAP architecture.

Gartner defines HTAP as “a hybrid transaction/analytical processing (HTAP) architecture is best enabled by in-memory computing techniques and technologies to enable analytical processing on the same data store that is used to perform transaction processing. By removing the latency associated with moving data from operational databases to data warehouses and data marts for analytical processing, this architecture enables real-time analytics and situation awareness on live transaction data as opposed to after-the-fact analysis on stale data.

This is exciting stuff, and quite a game-changer in the world of databases. Enough said, how do we go ahead and design a solution using Cosmos DB’s HTAP capabilities!

Synapse Link for Cosmos DB

Synapse Link for Cosmos DB is a cloud-native HTAP capability that creates a tight seamless integration between Cosmos DB and Synapse Analytics. It gives you the ability to gain near real-time analytics and insights over operational data in Cosmos DB with no ETL and no impact to the performance of OLTP transactions in Cosmos DB containers.

Let us break that down. Cloud-native (if you didn’t click the link) is the modern approach, where you design systems using cloud services to scale and perform well consistently. Synapse Analytics (if you are not familiar) can be thought of as Microsoft’s cloud data warehouse, for starts. And that is because Synapse is a lot more than that, but we have to start somewhere, don’t we?

So, Synapse Link creates a seamless, automagic integration between Cosmos DB and Synapse Analytics, thus removing the need for ETL. How does that work? Because a picture is worth a thousand words, I have referred to a diagram from Microsoft’s official doco on Synapse Link. The highlighted portion is new and works behind the scenes to deliver near-real time analytics with no additional ETL. When we create a container in an Azure Cosmos DB account that has Synapse Link enabled, an analytical store is automatically provisioned. This analytical store helps to ensure that there is no impact to OLTP transactions against the Cosmos DB container.

Image courtesy: Microsoft

This is a robust mechanism to perform large-scale analytics over data in Cosmos DB and implement an archival policy efficiently to keep the containers sleek without losing access to valuable historical data. It is worth noting that the transactional and analytical stores get their own, separate TTL (Time to Live) properties. TTL is one of my favorite features of Cosmos DB. Self-destruction can be so powerful sometimes! I will cover this in more detail in my sample solution.

Sample Solution

Confucius says, “I hear and I forget. I see and I remember. I do and I understand”. So, I built a sample solution that would ingest data into Cosmos DB at scale, and I was impressed at just how quickly data was available downstream in Synapse Analytics. The sample solution features a Python Notebook, running on Databricks and ingesting the popular New York City taxi trip data into a Cosmos DB container using Cosmos DB Spark 3 OLTP connector for SQL API. The trip data that became available in Synapse Analytics was used to build a Power BI dashboard, thus effectively producing near real-time analytics with no ETL and minimal code. I will give a step-by-step breakdown of the sample solution. Refer to this guide by Microsoft for more details.

Sample Cosmos DB HTAP solution
Sample solution using NYC Taxi trip data

1. Cosmos DB container

Firstly, we want to enable Synapse Link in the Cosmos DB account using the Settings blade. It is worth noting that at the time of writing this article (Oct 2021), Synapse Link and analytical store is supported only for SQL and MongoDB APIs.

Enable Azure Synapse Link

At the moment, analytical store can be turned ON only for new containers. If you wish to make use of Cosmos DB’s HTAP capabilities for an existing container, you will need to migrate data to a new container. Note that containers cannot be renamed. Auto-sync latency is within 2 minutes, as per Microsoft documentation. In my tests, I did not see a delay of more than a minute or so for data to be available downstream in Synapse Analytics.

New Cosmos DB container with Analytical store turned ON

Analytical store does not need request units (RU/throughput) to be allocated, nor does it consume the RU assigned to the transactional store – 1000 in the snip above. Analytical store follows a consumption-based pricing model, which is based on data storage and analytical read/write operations and queries executed. Storage can be optimized by setting the Analytical Store Time To Live property in the Data Explorer blade under Scale & Settings option. This dictates for how long data would be retained in the analytical store, and is independent from the container TTL property. You should set the Analytical Store TTL to a value that gives you sufficient time to process the data and persist or derive insights from it as required. At the time of writing this article (Oct 2021), analytical store does not support backup and restore, i.e. if a Cosmos DB backup is restored, only the transactional store will be recovered. It is also worth noting that schema representation in the analytical store could be different from the container schema.

Analytical Store Time to Live

2. Databricks – Python notebook to write NYC Taxi trip data to Cosmos DB

In the sample solution, I used a Python notebook to write data to Cosmos DB using the Spark 3 OLTP Connector for SQL API. This was just to simulate incoming OLTP application requests.

Python notebook
Sample data

3. Synapse Analytics

In the Synapse Workspace, Cosmos DB analytical store may be accessed using the Spark pool or built-in serverless SQL pool. I used the serverless SQL pool, which is a pay per query distributed data processing system built for big data and computational functions.

In the Synapse Studio, I created a linked service by connecting to the external Cosmos DB resource.

Connect to Azure Cosmos DB

In this case, I chose Managed Identity as the authentication method.

New Linked Service – Cosmos DB SQL API

In a little while, the linked service was added. At this stage, we are almost ready to query the underlying analytical store of GreenTaxiTrip container.

The next step was to create a user database in the serverless SQL pool, so that a CREDENTIAL could be created to access the Cosmos DB database securely. You cannot create the credential in master database.

4. Data Ingestion

I scheduled a Databricks job to execute the Python notebook to ingest NYC Taxi trip data from DBFS to Cosmos DB using the Spark 3 OLTP connector for SQL API. You may also execute it manually. Once data became available in Synapse Analytics through the analytical store, I wrote a couple of queries over the data using familiar T-SQL and converted them to views.

Databricks Job
Cosmos DB query

Note the number of records, and the corresponding timestamp in the transactional store. When the record count query completed in Synapse Analytics 9 seconds later, the number of records were out by only 545! Auto-sync latency is 2 minutes, as per Microsoft doco. It is worth noting that even if a document had a TTL set to lower than 2 minutes, it would still appear in the analytical store however, I have not tested it.

Cosmos DB – transactional store record count

My test dataset had 7 million records, and I left the job running to see how well the analytical store would perform. I was quite satisfied with the performance of Synapse Link. In this case, I used only one container but it would be interesting to see how well auto-sync holds up at scale.

5. Near real-time analytics using Power BI

Having validated that trip data written to the Cosmos DB container were visible in Synapse Analytics in near real-time, I proceeded to create 2 simple views using the OPENROWSET function, and a Power BI dashboard that used the views as its data source. Please excuse my lazy visualization efforts, but the point is you can easily get near real-time insights over operational data at scale using Synapse Link. All without any ETL!

Views created in Synapse serverless SQL pool over data in analytical store

Other alternatives

When it comes to enabling analytics over operational data in Cosmos DB, I cannot think of a better way to do it. The Change Feed way of doing it is not as scalable or easily maintainable as using Synapse Link. The obvious difference is that you will need a Synapse Analytics account to make use of Synapse Link. If you are a Microsoft house, this would make sense as Synapse is the unified analytics service and integrates directly with several services such as Azure Machine Learning, Cognitive Services and Power BI.

Limitations

At the time of writing this article (Oct 2021), Synapse Link for Cosmos DB is only supported for the SQL and Mongo DB APIs. I think it will be really cool to have it enabled for the Gremlin API. Graph databases are getting increasingly popular, and graph-enabled analytics such as fraud detection and money laundering is bound to become more common place.

More limitations are available here. The most glaring one is the lack of support for backup and restore of data in the analytical store.

The road ahead

HTAP is made possible by the power of cloud computing. I believe it will not be too long before HTAP capability is available for Azure SQL databases. That will be quite a game changer!

Aside from the basic Business Intelligence feature I demonstrated in the sample solution, Synapse Analytics also enables batch scoring using native scoring for models trained using RevoscalePy and RevoscaleR packages, AutoML to develop a Regression/Classification/Time Series Forecasting model using the Azure Machine Learning linked service, and multi language support for advanced analytics using the Spark pool.

Cosmos DB’s HTAP capability is fascinating, and I can think of several use cases to enable real-time decision intelligence at scale. I hope you had a good read. Please leave a comment if you have questions or any feedback.

References

https://docs.microsoft.com/en-us/azure/cosmos-db/synapse-link
https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
https://docs.microsoft.com/en-us/azure/cosmos-db/configure-synapse-link
https://www.youtube.com/watch?v=H6Ei_t34BGs – Jovan Popovic
https://www.youtube.com/watch?v=yd8vynwA36A – Microsoft Mechanics
https://devblogs.microsoft.com/cosmosdb/spark-3-connector-databricks/
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/create-sql-api-spark
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-api-sdk-java-spark-v3
https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#write-config
https://docs.microsoft.com/en-us/sql/relational-databases/security/authentication-access/credentials-database-engine?view=sql-server-ver15
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset
https://docs.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-automl

Azure Cosmos DB: real-time data movement using Change Feed and Azure Functions

Many people who read my Cosmos DB articles are looking for an effective way to export data to SQL, either on-demand or in real-time. After performing a search term analysis for my blog earlier this year, I had made up my mind about posting a solid article on exporting data from Cosmos DB to SQL Server.

Note that this serverless and event-based architecture may be used to not only persist Cosmos DB changes to SQL Server, but trigger alternate actions such as stream processing or loading to blob/Data Lake.

CosmosDBSearches_SQLRoadie.jpg

Real-time ETL using Cosmos DB Change Feed and Azure Functions

In this article, we will focus on creating a data pipeline to ETL (Extract, Transform and Load) Cosmos DB container changes to a SQL Server database. My main requirements or design considerations are:

  • Fault-tolerant and near real-time processing
  • Incur minimum additional cost
  • Simple to implement and maintain

Cosmos DB Change Feed

Cosmos DB Change Feed listens to Cosmos DB containers for changes and outputs the list of items that were changed in the chronological order of their modification. Cosmos DB Change Feed enables building efficient and scalable solutions for the following use cases:

  • Triggering a notification or calling an API
  • Real-time stream processing
  • Downstream data movement or archiving

AzureCosmosDBChangeFeedOverview

Types of operations

  • Change feed tracks inserts and updates. Deletes are not tracked yet
  • Cannot control change feed to track only one kind of operation, for example only inserts
  • For tracking deletes in the Change Feed, workaround is to soft-delete and assign a small TTL (Time To Live) value of “n” to automatically delete the item after “n” seconds
  • Change Feed can be read for historic items, as long as the items have not been deleted
  • Change Feed items are available in order of their modification time (_ts system attribute), per logical partition key, and tagged with the same _lsn (system attribute) value for all items modified in the same transaction

Read more about Azure Cosmos DB Change Feed from Microsoft docs to gain a thorough understanding. Change Feed can be processed using Azure Functions or Change Feed Processor Library. In this article, we will use Azure Functions.

Azure Functions

Azure Functions is an event-driven, serverless compute platform for easily running small pieces of code in Azure. Key points to note are:

  • Write specific code for a problem without worrying about an application or the infrastructure to run it
  • Use either C#, F#, Node.js, Java, or PHP for coding
  • Pay only for the time your code runs and trust Azure to scale
  • As of July 2019, the Azure Functions trigger for Cosmos DB is supported for use with the Core (SQL) API only

Read more from Microsoft docs to understand full capabilities of Azure Functions.

If you use Consumption plan pricing, it includes a monthly free grant of 1 million requests and 400,000 GBs of resource consumption per month per subscription in pay-as-you-go pricing across all function apps in that subscription, as per MS docs.

Compare hosting plans and check out pricing details for Azure Functions at the Functions pricing page to gain a thorough understanding of pricing options.

Real-time data movement using Change Feed and Azure Functions

The following architecture will allow us to listen to a Cosmos DB container for inserts and updates, and copy changes to a SQL Server Table. Note that Change Feed is enabled by default for all Cosmos DB containers.

I will create a Cosmos DB container and add an Azure Function to listen to the Cosmos DB container. I will then modify the Azure Function code to parse modified container items and save them to a SQL Server table.

1. First, I navigated to Azure portal, Cosmos DB blade and created a container called reservation in my Cosmos DB database. As it is purely for the purposes of this demo, I assigned lowest throughput of 400 RU/s

01_ContainerCreation02_ContainerCreated

 

2. Now that the container is ready, proceed to create an Azure Function App. The Azure Function will be hosted in the Azure Function app

03_AddFunctionApp.png

04_AddFunctionApp.png

 

3. Add an Azure Function within the newly created Azure Function App. Azure Function trigger for Cosmos DB utilizes the scaling and event-detection functionalities of Change Feed processor, to allow creation of small reactive Azure Functions that will be triggered on each new input to the Cosmos DB container.

055_AzureFunction.png

05_FunctionAppCreated.png

06_AddAzureFunction

 

4. Configure the trigger. Leases container may be manually created. Alternately, check the box that says “Create lease collection if it does not exist”. Please note that you would incur cost for storage and compute for leases container.

07_AzureFunctionConfig

I got this error that read – “The binding type(s) ‘cosmosDBTrigger’ are not registered. You just need to install the relevant extension. I saw many posts about this, so it will most likely be fixed soon.

08_AzureFunctionBindingError.png

Sort out the error by installing the extension for Azure Cosmos DB trigger.

09_AzureCosmosDBTriggerExtensionInstall

 

5. Once the function is up and running, add an item to the reservations container that we are monitoring. And we have a working solution!

10_CosmosDBContainer_AddEntries11_AzureFunctionRunning

 

6. Trigger definition may be modified to achieve different things, in our case we will parse the feed output and persist changes to SQL server. You can download the csx file I used.

12_AzureFunctionDefinitionModify.png

13_AzureCosmosDBContainer_ModifyItem14_AzureFunction_SavingToDatabase15_SavedInDatabase

Summary

We have successfully implemented a serverless, event-based low cost architecture that is built to scale. Bear in mind that you would still end up paying for Azure Function and the underlying leases collection, but there will be minimum additional RU cost incurred from reading your monitored container(s) as you are tapping into the Change Feed.

You can monitor the function and troubleshoot errors.

17_Funcion_Control20_Monitor_AzureFunction

I hope you found the article useful. Add a comment if you have feedback for me. If you have any question, drop me a line on LinkedIn. I’ll be happy to help 🙂 Happy coding!

Resources:

https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed
https://docs.microsoft.com/en-us/azure/cosmos-db/changefeed-ecommerce-solution
https://azure.microsoft.com/en-au/services/functions/
https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview
https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-cosmosdb-v2
https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-functions
https://azure.microsoft.com/en-au/resources/videos/azure-cosmosdb-change-feed/
https://h-savran.blogspot.com/2019/03/introduction-to-change-feed-in-cosmos.html

Azure Cosmos DB Free Trial – walk through of Gremlin API for Graph

After the 2018 Microsoft Ignite event, Microsoft announced a free trial of Azure Cosmos DB. For those who are eager to check out Cosmos DB, this is a great opportunity to familiarize yourself with the hottest No SQL database in the market. Cosmos DB is currently sitting pretty at Rank 29 in the DB Engines Ranking page. Quite an achievement considering that the product is “only 1 year old”.

MS recently went public with the support for Cassandra API. A few years from now, Cosmos DB will be the most popular database offering from Microsoft. If you want to get a gentle intro to Cosmos DB, check out my previous posts – Introduction to Azure Cosmos DB and Azure Cosmos DB – Partition and Throughput.

This article is a walk through of using the Free Trial to get started with Cosmos DB.

Limited Time Free Trialhttps://azure.microsoft.com/en-au/try/cosmosdb/

There is no need for a credit card or subscription to avail this free trial. Please do note though that MS is likely to withdraw trial in a few months.

Step 1: Choosing an API/data model
Once you click the above URL, you will be asked to pick an API and data model. Go ahead and pick SQL to check out document (JSON) data model if you are unsure. I want to check out Gremlin API for Graph, so I am choosing Graph in this example.
You will be asked to login with your Microsoft account. If you don’t have one, create a new account.

CosmosDB Trial

Step 2: Choose default options and click on Create container.
Note that container will be created free of cost at a low throughput of 400 RU/s. If you are not familiar with throughput, read my article about Azure Cosmos DB – Partition and Throughput.
You now have a brand new container and the familiar Azure portal access. My trial container has read/write location of Central US.

Azure Portal Database

Closer look: There is a lot going on here, so let us take a closer look.
My container is called Persons and it is in graphdb database. Cosmos DB uses Apache Tinkerpop Gremlin API for graph traversal. Currently, my container Persons is empty, so let us connect a sample application. If you are new to Graph DB, I recommend reading this free ebook from another popular Graph DB product – Neo4j.
https://neo4j.com/graph-databases-book/

Graph DB - Persons

Step 3: Sample project
Click on the Quick start blade to download a sample project and explore graph data.
The sample project already has the connection string set to the trial database, so we can execute the project right away. How awesome is that!

Quickstart

In Step 1, I chose Graph API and model, so my sample project has Gremlin API queries. Read more about Gremlin API here – https://aka.ms/gremlin.

Gremlin API Queries

Step 4: Execute sample project
Now it’s time to execute the project!

Sample Program Execution

Step 5: Data Explorer
Looks like the program has added a number of graph documents to the container. Let us head over to Azure Portal and explore using Data Explorer blade. Click on the Execute Gremlin Query button to take a look at the data added by sample program. Now let us take a closer look at the output.

Graph DB Data Explorer.jpg

Closer look: A quick look shows that the query g.V() returned 4 nodes. Result can be viewed in either JSON or Graph mode.

Output - closer look 1

If we click on the second node – Ben – and zoom in, we get a nice graphical view of ben’s relationships. Click and explore to view related nodes.

Output - closer look 3

Switch to JSON mode to view result in JSON format. Pretty self-explanatory there 🙂

Output - closer look 2

Step 6: Monitor activity
Head over to Activity Log blade to take a look at activity in the Cosmos DB account. This could be downloaded as a .csv or exported to Event Hub for further analysis.

Activity Log

Summary

We took a look at the Cosmos DB Trial offered by Microsoft and got started with Graph API. Graph DB has a number of attractive use cases such as Fraud Detection and Recommendation Engine. Another popular Graph DB is Neo4j. I hope this article helped you understand how powerful Cosmos DB’s multi model support is. If you have any questions, drop me an email or add a comment. I will be happy to help. Add me on LinkedIn to stay connected – https://www.linkedin.com/in/arjunsivadasan/

Further Reading

  1. Introduction to Azure Cosmos DB and Azure Cosmos DB – Partition and Throughput
  2. Intro to Cassandra API in Cosmos DB – https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction
  3. Intro to Graph DB – https://neo4j.com/graph-databases-book/
  4. Explore Gremlin API – https://aka.ms/gremlin
  5. Cosmos DB playground – https://www.documentdb.com/sql/demo
  6. Microsoft Ignite 2018 Updates – https://news.microsoft.com/uploads/prod/sites/507/2018/09/IGNITEBOOKOFNEWS-5ba95469d658b.pdf

Azure Cosmos DB – Partition and Throughput

In my previous article Introduction to Azure Cosmos DB, I mentioned Partition and Throughput only briefly. Adopting a good partition scheme is quintessential to setting up your Cosmos DB container for elastic scaling and blazing performance. This article will take a closer look at these two aspects to help fully utilize the storage and performance offerings of Cosmos DB.

Partition

Azure Cosmos DB containers store documents, graphs or tables. Containers (a.k.a. collections in the context of documents) are logical entities that could be distributed across multiple physical partitions or servers.

Physical and Logical Partitions

A physical partition is an internal Cosmos DB concept, essentially a fixed amount of SSD storage combined with a variable amount of compute power (CPU, memory and IO). The number of physical partitions of a container depends on its storage and throughput. For containers with shared throughput, number of partitions depends on RU/s assigned to the set of containers.

Request Unit (RU/s) – is the unit of throughput. 1 RU/s serves a get by self-link (internal property) or id of a 1 KB item.

When a collection is created, we can specify a fixed storage capacity of 10 GB or unlimited capacity. A fixed storage collection is limited in performance to a max of 10,000 RU/s. If we choose unlimited capacity, the collection created potentially has no max RU/s limit. Collections are supposedly unlimited in terms of storage and throughput, and physical partition management is handled by Cosmos DB behind the curtains. Note that for a multi-partition collection, we need to specify a partition key.

CosmosDB Container - Partitions

Data within a container having the same partition key value form a logical partition. The max storage limit of a logical partition is 10 GB, which means if data associated with a certain partition key value goes beyond 10 GB, the logical partition will be full and cannot grow any further. This is why adopting a good partition scheme is very important to avail the storage and performance guarantees of Azure Cosmos DB.

Partitioning example

Azure Cosmos DB internally has a limit for the max throughput that can be provided by a physical partition – PRUmax. This value keeps changing based on factors such as hardware used and platform upgrades. For now, keep in mind that this happens behind the scenes.

Let us assume PRUmax = 10,000 RU/s. We create an unlimited Collection product at 20,000 RU/s initial throughput and productid as the partition key. Cosmos DB has to create at least 2 physical partitions to support the 20,000 RU/s throughput requested. Currently, the default seems to be 5. So, Cosmos DB creates a new collection with 5 physical partitions. The throughput requested will be equally assigned to these physical partitions. This means, the max throughput limit for each partition is 20,000/5 = 4000 RU/s.

Partitioning-Example-Product collection

As we add new documents, Cosmos DB allocates the key space of partition key hashes evenly and consistently across the 5 physical partitions. If the partition key is well chosen, writes will be distributed evenly across the partitions, each partition serving nearly 5000 RU/s and a cumulative of nearly 20,000 RU/s. This is ideal. In real world, it is possible that we chose a bad partition key.

What can go wrong?

  • Performance impact: If majority of the concurrent writes/reads pertain to a specific partition key value, we could have 1 physical partition maxing out the 5000 RU/s allocated to it (hot partition), while the other 4 partitions idling. When this happens, requests are bound to get rate-limited and we will get Http 429 response code.
  • Storage impact: Earlier in the article, I mentioned the concept of logical partition. All data having the same partition key form a logical partition. Logical partitions cannot be split across physical partitions. For the same reason, if the partition key chosen is of bad cardinality, we could potentially have skewed storage distribution. Say, 1 logical partition becomes fatter faster and hits the max limit of 10 GB, while the others are nearly empty. The physical partition housing the maxed out logical partition cannot split and could thus cause an application downtime.

Physical partition split

Azure Cosmos DB manages physical partitions seamlessly behind the scenes, if you chose your partition key smartly that is. Following are two scenarios when Cosmos DB will split a physical partition.

  • Storage limit of 10 GB: When a physical partition is full, Cosmos DB will split it into 2 new partitions assigning data corresponding to nearly half of the keys to each new partition. As mentioned previously, the split cannot happen if data in the physical partition in question have the same partition key value.
  • Increasing throughput: When throughput assigned is increased such that the existing number of physical partitions are insufficient to support it, Cosmos DB will add new physical partitions. In the above example, if the throughput is increased to 100,000 RU/s, Cosmos DB would add 5 new physical partitions.
    Cosmos DB needs 100,000/PRUmax = 10 physical partitions to support the throughput setting.

Throughput

What makes Cosmos DB an attractive high volume transaction database is the ease of scaling. When request rates are low, throughput could be lowered to keep costs down. Cosmos DB’s performance is predictable. For example, a read of a 1-KB document with session consistency always consumes 1 RU, regardless of number of concurrent requests or amount of data stored.

There are, however, two major design considerations to facilitate elastic scaling of Azure Cosmos DB.

Distribute requests and storage

Ideal candidate property for partition key will allow writes to be distributed across various distinct values. Requests to the same partition key should remain lower than the max throughput limit allocated to a partition. A good partition key will evenly distribute writes across all physical partitions and not cause hot partitions. In our example, productid is a good partition key, because it is unlikely that all concurrent requests will be focused on a specific product. If we were to chose the property productcategory as partition key, that could potentially cause hot partitions

Partition scope for queries and transactions

At one extreme, we can use the same partition key for all documents. At the other extreme, we can have unique partition key for each document. Both approaches have their limitations. Using the same partition key for all documents will limit scalability and cause a hot partition and inefficient utilization of throughput. Using unique partition keys will support high scalability, but result in a lot of cross-partition queries and prevent use of cross-document transactions. Occasional fan-out of queries is not too bad, but frequent fan-out will incur high RU consumption and result in rate-limiting.

Estimating throughput

Throughput can be estimated based on the number of expected reads/writes per second. 1 Request Unit (RU) corresponds to read of a 1-KB document containing 10 unique property values by self-link or id. Write, replace or delete will consume more RU/s.

RU calculator

Microsoft provides a Request Unit calculator that serves to arrive at a base throughput to assign when creating a new collection. Be prepared to fine tune the RU setting as you trot along, but this is a good starting point.

This URL ignites nostalgia 🙂

Request Unit Calculator
Pic courtesy: Microsoft

Conclusion

Azure Cosmos DB is a lot more versatile compared to the initial Document DB days. With added support for Mongo DB, Graph, Cassandra and Table APIs and multi-master and global distribution support, Cosmos DB is definitely the most exciting product in database technology at the moment. With the new Azure data products such as Azure Stream Analytics, Azure Data Bricks and HDInsight supporting out-of-the-box integration with Cosmos DB, it is fast becoming a good candidate for Big Data solutions.

Please feel free to reach out if you have questions. I’m always happy to discuss technology 🙂

Introduction to Azure Cosmos DB

What is Azure Cosmos DB?

Cosmos DB started as an internal project – “Project Florence” – at Microsoft in 2010. It is a globally distributed, highly scalable and performant multi-model database service built on Azure. It was built ground up with horizontal scalability and global distribution in view, so you can expect seamless geo-replication to Azure regions by clicking on a map and performance to the order of milliseconds.

CosmosDBIcon

Why should you use Cosmos DB?

Cosmos DB is the future of storage. Below are some reasons you may want to use it.

  • You use Microsoft stack already and find your favourite RDBMS – SQL Server – too slow
  • Your applications need to process large amounts of data at blazing speeds
  • Your data is not relational anymore, you want to use JSON document store or Graph DB or one of the many popular APIs Cosmos DB supports
  • Big Data applications – This is an evolving use case. Many Azure products such as Azure Stream Analytics, HDInsight and Data Bricks integrate with Cosmos DB, making it an ideal choice for persisting big data.
  • You are having to “over-provision” specs to cover peak loads on the database server
  • Your business spans multiple continents and you want easy and reliable replication
  • You trust Microsoft to do a better job with availability using Azure

CosmosDBGeoReplication

Hello Collections!

Collections are to Cosmos DB what tables are to an RDBMS. This analogy is for easy understanding; don’t get hung up on details. There are many differences between collections and tables. A more generic term is Container, especially as Cosmos DB has evolved to support more APIs since the Document DB days.

After you create a Database in your Cosmos DB account, you can create your first collection. Below are a few aspects of a collection you need to be familiar with.

1Request Units / Throughput: Request Unit (RU) is the currency of Cosmos DB.

1 RU = capacity to read* a single 1 KB item consisting of 10 unique property values
*by self-link or id

A collection need not be a set of documents of the same schema. As Cosmos DB is schema-agnostic, you can very well store different document types in the same collection. That said, you may want to store document types that need the same amount of scalability and storage in the same collection. Cosmos DB is cost effective for high volume transaction systems because RU/s for a collection can be ramped up at peak hours and reduced to a minimum during hours of low trade.

When the provisioned number of RU/s is used up, client requests may experience throttling. Prolonged throttling indicates an insufficient throughput setting.

CosmosDBThroughput

2. Partition: Each document in a collection has a partition key and an id

Partition key helps to identify the logical partition, and the id field helps to identify a specific document.

When you create a collection with an initial throughput more than 1000 RU/s, the resulting collection has unlimited scalability. Depending on the initial throughput assigned, Cosmos DB computes the number of physical partitions required to support the provisioned throughput. Logical partitions have a storage limit of 10GB. Due to this, a single logical partition (housing documents having a certain partition key value, such as Product Type = ‘Motorbike helmet’) should not amount to more than 10GB. Theoretically, the number of partitions in a collection can be infinite. Therefore, a good partition key must be a field having a suitable cardinality. With a bad partition key, you are bound to have uneven distributions of data and are likely to hit the 10 GB storage limit per partition.

CosmosDBPartitions3. Time to Live: TTL can be set at a collection or document level. Document level TTL takes precedence, obviously.

Simply put, if TTL for a document is set to 3600 seconds, the document expires when it is older than 3600 seconds. System field _ts (_ts or timestamp is set to Epoch time) indicates the last modified time of a document. Be careful when you set this at collection level. However, this feature helps to implement a sound purge policy.

Global Distribution:

Geo-replication becomes as simple as clicking on a graph with Cosmos DB’s global distribution feature. With more than 50 Azure regions, you could not only set multiple Read regions for global customers, but also specify several failover regions to help with disaster recovery.

CosmosDBAutomaticFailover

There is a lot more to Cosmos DB than can be covered in a single post. With a whole heap of new Azure features that it can easily integrate with, such as SQL Managed Instance, SQL Data warehouse, HDInsight, Data bricks and Data lakes, Cosmos DB is here to stay.

Check out my post about Cosmos DB migration tool if you are migrating data to or from Cosmos DB.

Export data from Cosmos DB to SQL Server

Cosmos DB is Microsoft’s latest NoSQL database offering low latency, high scalability and geo-distribution to any Azure region. Read Microsoft documentation about Cosmos DB here or check out my blog about Introduction to Azure Cosmos DB. In its infancy, Cosmos DB was known as Document DB. Renaming was inevitable as Document DB evolved beyond just a document store.

Following is a simple note on how to export data from Cosmos DB to SQL Server. You may want a dump of data from a Cosmos DB collection to SQL Server for analysis, data integrity checks, troubleshooting a production issue or to derive insights. If you want to set up a real-time data pipeline from Cosmos DB to SQL Server, check out this post: – https://sqlroadie.wordpress.com/2019/07/21/azure-cosmos-db-real-time-data-movement-using-change-feed-and-azure-functions/

There are a few methods to export data from Cosmos DB. The quickest one is to use Document DB / Cosmos DB Migration Tool. This is a tool provided by Microsoft  to migrate data TO/FROM various sources such as MongoDB, JSON, csv and SQL Server to Cosmos DB.

1. Use Azure Cosmos DB Migration tool to export data to json files:

  • Install the tool and launch as Administrator (use an Azure VM for best performance). Please be mindful of spike in RU costs when exporting data from your collection. To avoid throttling, scale your collection up as required just before you do the export or export in off-peak hours, if any.  Leave a comment if you need any pointer on this.LaunchCosmosDBMigrationTool
  • Choose source – DocumentDB (CosmosDB aka DocumentDB)

CosmosDBMigrationTool-Source

  • Specify connection string. You can find endpoint and key from Keys section of your CosmosDB account in Azure portal. Please note that CosmosDB is case-sensitive.

CosmosDBMigrationTool-ConnectionString

  • A peek into Advanced options
    • Include internal fields – for each document, Cosmos DB maintains a set of auto-generated internal/system fields such as _ts and _self. Their names start with an underscore, making it easy to differentiate them from user fields. This option lets you include internal fields in the output. This is handy, especially the _ts field, which indicates when the document was last updated.
    • Number of retries on failure & Retry interval – Set a reasonable number of retries. In this case, I have used the value 1.
    • Connection Mode you want to use Gateway to get best performance and to bypass firewall rules.

CosmosDBMigrationTool-AdvancedOptions

  • Specify target information. In this case, we want to export to a json document. We could either output to a local file or a blob.

CosmosDBMigrationTool-TargetInformation

  • Error Logging. Set these options to enable error logging.

CosmosDBMigrationTool-ErrorLogging

Hit Import and if there are no errors, you will soon have a new json file with all the data from your collection. If you want only a subset of data, all you need to do is modify your source query.

2. Import json files to SQL Server:

SQL Server 2016 introduced a JSON parse function called OPENJSON. If none of your user databases are upgraded to 2016 yet, but you have a 2016 engine, context-switch to a system database to use OPENJSON.

DECLARE @productNutrition varchar(max);

--Read from the json file using openrowset
SELECT @productNutrition = BulkColumn
FROM OPENROWSET(BULK'C:\Users\Smruthi\Downloads\Arjun\productnutrition_20180427.json', SINGLE_BLOB) JSON;

--Pass the variable containing json as parameter to OPENJSON function
SELECT *
FROM OPENJSON (@productNutrition)
WITH
(
ProductID varchar(20) '$.id',
ProductDescription varchar(100) '$.description',
ProductGroup varchar(200) '$.foodGroup',
ServingAmount float '$.servings[1].amount',
ServingUnit varchar(10) '$.servings[1].description',
nutrients nvarchar(max) as json --note that json is case sensitive
)

Give it a go yourself. If you have any question, leave a comment and I will be happy to assist.

If you are exploring Cosmos DB, consider reading my blog about Azure Cosmos DB – Partition and Throughput to get an overview of partitioning and scaling concepts.

jsonfileSample

openjson Example