Many people who read my Cosmos DB articles are looking for an effective way to export data to SQL, either on-demand or in real-time. After analyzing the search terms that bring readers to my blog earlier this year, I made up my mind to post a solid article on exporting data from Cosmos DB to SQL Server.
Note that this serverless, event-based architecture may be used not only to persist Cosmos DB changes to SQL Server, but also to trigger alternate actions such as stream processing or loading to blob storage/Data Lake.
Real-time ETL using Cosmos DB Change Feed and Azure Functions
In this article, we will focus on creating a data pipeline to ETL (Extract, Transform and Load) Cosmos DB container changes to a SQL Server database. My main requirements or design considerations are:
- Fault-tolerant and near real-time processing
- Incur minimum additional cost
- Simple to implement and maintain
Cosmos DB Change Feed
Cosmos DB Change Feed listens to Cosmos DB containers for changes and outputs the list of items that were changed in the chronological order of their modification. Cosmos DB Change Feed enables building efficient and scalable solutions for the following use cases:
- Triggering a notification or calling an API
- Real-time stream processing
- Downstream data movement or archiving
Types of operations
- The change feed tracks inserts and updates. Deletes are not tracked yet
- The change feed cannot be configured to track only one kind of operation, for example only inserts
- To track deletes in the change feed, the workaround is to soft-delete (for example, by setting an isDeleted flag) and assign a small TTL (Time To Live) value of "n" so the item is automatically deleted after "n" seconds; see the sketch after this list
- The change feed can be read for historic items, as long as the items have not been deleted
- Change feed items are available in order of their modification time (the _ts system attribute), per logical partition key; all items modified in the same transaction are tagged with the same _lsn system attribute value
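Here is a minimal sketch of the soft-delete workaround using the .NET SDK v3. The Reservation class, its properties, the 60-second TTL and the /id partition key are illustrative assumptions, not from the original article; also note that per-item TTL only takes effect when a default TTL is enabled on the container.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

// Hypothetical item shape; adjust to your own schema.
public class Reservation
{
    [JsonProperty("id")]
    public string Id { get; set; }

    [JsonProperty("isDeleted")]
    public bool IsDeleted { get; set; }

    // Per-item TTL in seconds; only honored when the container has a default TTL configured.
    [JsonProperty("ttl", NullValueHandling = NullValueHandling.Ignore)]
    public int? Ttl { get; set; }
}

public static async Task SoftDeleteAsync(Container container, Reservation item)
{
    // "Delete" by updating the item: the update flows through the change feed,
    // and Cosmos DB hard-deletes the item once the TTL expires.
    item.IsDeleted = true;
    item.Ttl = 60; // purge 60 seconds after this write

    // Assumes the container is partitioned on /id; use your actual partition key otherwise.
    await container.ReplaceItemAsync(item, item.Id, new PartitionKey(item.Id));
}
```

Downstream consumers then treat items with the soft-delete flag set as deletions when they appear on the change feed.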
Read more about the Azure Cosmos DB change feed in the Microsoft docs to gain a thorough understanding. The change feed can be processed using Azure Functions or the Change Feed Processor library. In this article, we will use Azure Functions.
Azure Functions
Azure Functions is an event-driven, serverless compute platform for easily running small pieces of code in Azure. Key points to note are:
- Write specific code for a problem without worrying about an application or the infrastructure to run it
- Code in C#, F#, JavaScript (Node.js), Java, Python or PowerShell, among other languages
- Pay only for the time your code runs and trust Azure to scale
- As of July 2019, the Azure Functions trigger for Cosmos DB is supported for use with the Core (SQL) API only
Read more from Microsoft docs to understand the full capabilities of Azure Functions.
If you use Consumption plan pricing, it includes a monthly free grant of 1 million requests and 400,000 GB-s (gigabyte-seconds) of resource consumption per month, per subscription, in pay-as-you-go pricing across all function apps in that subscription, as per MS docs. For example, a function that uses 512 MB of memory and runs for one second consumes 0.5 GB-s, so the free grant covers 800,000 such executions.
Compare hosting plans and check out pricing details for Azure Functions at the Functions pricing page to gain a thorough understanding of pricing options.
Real-time data movement using Change Feed and Azure Functions
The following architecture will allow us to listen to a Cosmos DB container for inserts and updates, and copy the changes to a SQL Server table. Note that the change feed is enabled by default for all Cosmos DB containers.
I will create a Cosmos DB container and add an Azure Function to listen to the Cosmos DB container. I will then modify the Azure Function code to parse modified container items and save them to a SQL Server table.
1. First, I navigated to the Azure portal, Cosmos DB blade, and created a container called reservation in my Cosmos DB database. As it is purely for the purposes of this demo, I assigned the lowest throughput of 400 RU/s.
2. Now that the container is ready, proceed to create an Azure Function App. The Azure Function will be hosted in the Function App.
3. Add an Azure Function within the newly created Function App. The Azure Functions trigger for Cosmos DB utilizes the scaling and event-detection functionality of the Change Feed Processor to allow small, reactive Azure Functions to be triggered on each insert or update to the Cosmos DB container.
4. Configure the trigger. The leases container may be created manually. Alternatively, check the box that says "Create lease collection if it does not exist". Please note that you will incur storage and request unit (RU) costs for the leases container. The portal's trigger settings map to code as shown in the sketch below.
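If you build the function as a precompiled C# project rather than in the portal, the same trigger settings are expressed as attribute properties. A minimal sketch; the database, container and connection setting names are placeholders:

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ReservationChangeFeed
{
    [FunctionName("ReservationChangeFeed")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "mydatabase",               // placeholder
            collectionName: "reservation",
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases",
            CreateLeaseCollectionIfNotExists = true)] // equivalent of the portal checkbox
        IReadOnlyList<Document> input,
        ILogger log)
    {
        if (input != null && input.Count > 0)
        {
            log.LogInformation($"Documents modified: {input.Count}");
            log.LogInformation($"First document Id: {input[0].Id}");
        }
    }
}
```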
I got an error that read: "The binding type(s) 'cosmosDBTrigger' are not registered". You just need to install the relevant extension. I saw many posts about this, so it will most likely be fixed soon.
Sort out the error by installing the extension for the Azure Cosmos DB trigger.
5. Once the function is up and running, add an item to the reservation container that we are monitoring; one way to do this from code is sketched below. And we have a working solution!
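A minimal sketch of inserting a test item with the .NET SDK v3, in case you prefer code over the portal's Data Explorer. The connection string, database name and item shape are placeholders, and the container is assumed to be partitioned on /id:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class TestItemWriter
{
    public static async Task AddTestReservationAsync()
    {
        // Connection string and database name are placeholders.
        using CosmosClient client = new CosmosClient("<cosmos-connection-string>");
        Container container = client.GetContainer("mydatabase", "reservation");

        // Any insert or update lands on the change feed and fires the function.
        // The item shape is illustrative; the partition key is assumed to be /id.
        await container.CreateItemAsync(new
        {
            id = Guid.NewGuid().ToString(),
            guestName = "Jane Doe",
            nights = 2
        });
    }
}
```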
6. The trigger definition may be modified to achieve different things; in our case, we will parse the feed output and persist the changes to SQL Server. You can download the csx file I used; a trimmed-down sketch of the idea follows.
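The following C# script (csx) is a minimal sketch of that idea, not the exact file linked above. It upserts each changed document into a SQL Server table; the dbo.Reservation table, its columns and the SqlConnectionString app setting are assumptions to adapt to your own schema.

```csharp
#r "Microsoft.Azure.DocumentDB.Core"
#r "System.Data.SqlClient"

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using Microsoft.Azure.Documents;
using Microsoft.Extensions.Logging;

public static void Run(IReadOnlyList<Document> input, ILogger log)
{
    if (input == null || input.Count == 0) return;

    // The connection string comes from an app setting (the setting name is an assumption).
    var connStr = Environment.GetEnvironmentVariable("SqlConnectionString");

    using (var conn = new SqlConnection(connStr))
    {
        conn.Open();
        foreach (Document doc in input)
        {
            // Upsert each changed item; the table and columns are illustrative.
            using (var cmd = new SqlCommand(
                @"MERGE dbo.Reservation AS t
                  USING (SELECT @id AS Id) AS s ON t.Id = s.Id
                  WHEN MATCHED THEN
                      UPDATE SET Body = @body, ModifiedUtc = SYSUTCDATETIME()
                  WHEN NOT MATCHED THEN
                      INSERT (Id, Body, ModifiedUtc)
                      VALUES (@id, @body, SYSUTCDATETIME());", conn))
            {
                cmd.Parameters.AddWithValue("@id", doc.Id);
                cmd.Parameters.AddWithValue("@body", doc.ToString()); // full document JSON
                cmd.ExecuteNonQuery();
            }
        }
    }

    log.LogInformation($"Persisted {input.Count} change(s) to SQL Server.");
}
```

Storing the full document JSON in a single column keeps the function schema-agnostic; alternatively, map individual document properties to typed columns.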
Summary
We have successfully implemented a serverless, event-based, low-cost architecture that is built to scale. Bear in mind that you will still pay for the Azure Function and the underlying leases container, but only minimal additional RU cost is incurred from reading your monitored container(s), as you are tapping into the change feed.
You can monitor the function and troubleshoot errors using the function's Monitor tab, or through Application Insights if it is enabled for the Function App.
I hope you found the article useful. Add a comment if you have feedback for me. If you have any questions, drop me a line on LinkedIn. I'll be happy to help 🙂 Happy coding!
Resources:
https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed
https://docs.microsoft.com/en-us/azure/cosmos-db/changefeed-ecommerce-solution
https://azure.microsoft.com/en-au/services/functions/
https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview
https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-cosmosdb-v2
https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-functions
https://azure.microsoft.com/en-au/resources/videos/azure-cosmosdb-change-feed/
https://h-savran.blogspot.com/2019/03/introduction-to-change-feed-in-cosmos.html
I want to push the data from my Cosmos DB to Azure Data Lake Storage (Gen2) using the change feed. Is that possible?
You should be able to push to Data Lake. Set up a Data Factory pipeline for this and invoke the pipeline from within an Azure Function.
Do you have any reference for this type of scenario or any example?
No mate, it is pretty straightforward. Use the sample code in this article as a reference and invoke the URL to trigger the Data Factory pipeline.
Hey Arjun, I am new to Data Factory and its concepts; if you can explain it a little more, that would be a great help. I wrote an Azure Function with a time trigger based on the change feed, and I have a Data Factory pipeline consisting of a copy activity. Now, how can I trigger the Data Factory pipeline from the Azure Function?
Thank you in advance.
Hi Vishal, good stuff, you are almost there.
Please refer to the following documentation to understand how to trigger a data pipeline run using API. I would recommend testing the API manually before incorporating it into your Azure Function.
I hope this helps. I am time-poor at the moment, but I would consider writing a blog post on this topic in the near future.
https://docs.microsoft.com/en-us/rest/api/datafactory/pipelines/createrun
https://stackoverflow.com/questions/56672959/how-to-trigger-azure-datafactory-v2-via-rest-api
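For anyone landing on this thread later, here is a minimal sketch of calling the Data Factory createRun REST API from C#. The subscription, resource group, factory and pipeline names are placeholders, and acquiring the Azure AD access token for https://management.azure.com/ (for example, via the Function App's managed identity) is left out:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class DataFactoryRunner
{
    // Reuse a single HttpClient instance across invocations in a real function.
    private static readonly HttpClient Http = new HttpClient();

    public static async Task TriggerPipelineRunAsync(string accessToken)
    {
        // All identifiers in the URL are placeholders.
        var url = "https://management.azure.com/subscriptions/<sub-id>" +
                  "/resourceGroups/<rg-name>/providers/Microsoft.DataFactory" +
                  "/factories/<factory-name>/pipelines/<pipeline-name>" +
                  "/createRun?api-version=2018-06-01";

        using var request = new HttpRequestMessage(HttpMethod.Post, url)
        {
            // An empty JSON body starts the run with default pipeline parameters.
            Content = new StringContent("{}", Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

        var response = await Http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // The response body contains the runId of the new pipeline run.
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```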
Awesome, thank you, I will try this out. Looking forward to your blog post as well.
Hi Arjun, I was going through your articles in search of what is best for my solution. I have Cosmos DB as an application database which stores operational data, and this data is further cloned in Cosmos DB only for analytics in Power BI. In Power BI, I have built a data model for my reports. This model takes a long time to refresh and is also complex to maintain. I was thinking about using SQL Server instead of Cosmos DB for reporting, as I have all the operational data. Could you please share your thoughts on what would be the best solution?
Hi Salma, that’s a great question and I’m happy you asked.
To confirm that I understand right, let me repeat what I gathered – you have an operational Cosmos DB database and you use Power BI for analytics on this data. To achieve this, you have cloned the Cosmos DB database and defined a Power BI model – which is now becoming complex and slow.
The best solution I can think of – both from a cost and efficiency perspective – is to use Azure Synapse Analytics as source for your Power BI reports. I will be blogging about Synapse soon, but let me give you a brief outline of steps involved and what the outcome of each of those steps will be.
But first, what is Synapse Analytics? It is an evolution of Azure SQL Data Warehouse, but more than just that. It gives you cool capabilities like generating reports within the Synapse workspace, and querying your Cosmos DB or blob storage data for stream or batch analytics.
1. Enable Synapse Link in your Cosmos DB account. If this is not already enabled in your existing operational Cosmos DB account, you will need to create a new account and migrate your database.
Outcome: This will make your Cosmos DB data available for querying or transformation in the Synapse workspace in near real-time. I recommend using the Synapse serverless option (pay per use) to start with.
2. Create a Synapse workspace, and define the Cosmos DB connection to your operational Cosmos DB account.
Outcome: Once you define the connection string, your live Cosmos DB data will be available in Synapse. At this point, you will be able to see the containers, and you can define tables in Synapse which could store transformed/aggregated information on your Cosmos DB data. If you need to define a data warehouse model, that will go here.
3. Use the Synapse endpoint (similar to a SQL Server connection string) as the source for your Power BI reports. Do the heavy data modeling in Synapse so that Power BI remains lightweight.
Sounds like you have a cool project to work on. Let me know how you go. Good luck!
Hi Salma, a few months too late but here is the article on using Synapse Link for Cosmos DB.
https://sqlroadie.com/2021/10/18/azure-cosmos-db-htap-using-azure-synapse-link/