Importance of data governance in the Covid era

One theme that stood out last year was the sheer volume of data at our fingertips. In my line of work as a data professional, I produce, consume and explain datasets and data visualizations on a regular basis. Outside of work, though, I did not discuss histograms and rolling averages at the dinner table until Covid struck. 2020 forced the general public to read and understand data visualizations and metrics, and terms such as “growth factor”, “flattening the curve” and “logarithmic scale” became part of everyday conversation.

Data Management in public and private sectors

When confronted with complex problems, organizations in both the public and private sectors store, analyze and utilize existing data at scale to predict future trends. Each of these steps is critical; only organizations with effective, mature data management strategies can traverse this kind of turbulence successfully. Defining and implementing a robust data management strategy therefore uplifts an organization’s value proposition. In the private sector, organizations have been investing heavily in big data platforms and predictive capabilities for years; accelerated digitization has recently forced governments to invest heavily in these areas as well.

In pandemic times, it takes monumental effort to churn out daily stats and projections about the disease with high accuracy. I have huge respect for the public service officials and researchers who work tirelessly to achieve this while we navigate our way out of this predicament. In the private sector, we rely heavily on past trends to measure the current state and calibrate efforts and priorities against annual or periodic targets. Metrics such as “YoY – Year on Year”, “STLY – Same Time Last Year” and “Budget Variance” are the usual indicators we track to report on revenue risk position. They usually serve well, until there are anomalies – and Covid-19 is one big anomaly! Depending on which industry you are in, current trends may sit so far from previous years that they leave you fuming or beaming. As my financial advisor said, one thing is for sure – if anyone claims they know how the markets are going to trend in the coming months, they are lying!

Data Governance

Covid-driven disruption calls for a higher degree of consistency and clarity around the data assets handled by any organization, because in the months to come there will be a lot of flux in the data utilized by various departments and in how it is interpreted. Without sound governance, this could erode trust in the analytics measures used as performance indicators. While data governance was important before, it is paramount in the current business climate. So, what is data governance really? In simple terms, it is the effective handling, or management, of data assets across an organization.

Satya Nadella’s words during Microsoft’s timely announcement of Azure Purview – their unified data governance tool – captured this well. Data governance forms one part of enterprise data management, which may be defined as using data to make informed decisions. A data governance framework is often the invisible thread that connects all organizational roles that deal with data. Let me briefly explain the key principles of data governance and what it takes to achieve data governance maturity. Hopefully, this will help you understand why it is important to focus on data governance and the enterprise-wide benefits it brings.

Key principles of Data Governance

  1. Transparency: Data governance is one of those efforts that needs buy-in from all departments within an organization. It is impossible to get that buy-in if departments are not well informed of the processes involved in achieving and maintaining data governance maturity, and of their impact. Being transparent means clearly explaining why things are done in a certain way and how they fit into the data management life cycle.
  2. Accountability: Accountability is about identifying, communicating and acknowledging the roles and responsibilities of all staff who are involved in data governance. They are to be held accountable for taking action at the times dictated by their roles. Good data is everyone’s responsibility, however accountability lies with designated staff members.
  3. Standardization: Standardization enables data consistency and quality. It is about labeling, describing, classifying and certifying data assets in a standardized or consistent manner. This results in removing barriers to data consumption and makes the available data more valuable and explainable.
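Standardized labeling and classification can start as a shared vocabulary enforced in code. Here is a minimal Python sketch of the idea – the classification levels, fields and sample asset are illustrative assumptions, not taken from any particular governance tool:

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative classification vocabulary -- in practice this would come
# from the organization's agreed data governance standards.
class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"

@dataclass
class DataAsset:
    name: str
    description: str
    owner: str                      # accountability: a named data steward
    classification: Classification  # standardization: one shared vocabulary

    def label(self) -> str:
        """Produce a consistent, human-readable label for catalog listings."""
        return f"{self.name} [{self.classification.value}] owned by {self.owner}"

# A hypothetical asset entered into the catalog.
asset = DataAsset(
    name="sales_monthly",
    description="Monthly revenue aggregates",
    owner="finance-team",
    classification=Classification.INTERNAL,
)
print(asset.label())  # sales_monthly [internal] owned by finance-team
```

The point is not the code itself, but that every asset carries the same agreed labels and a named owner – which is exactly what removes barriers to consumption.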

While several intangible benefits can flow from data governance initiatives, the immediate ones are improved data quality, better compliance and security, and increased visibility of data assets and their lineage, which in turn drives up the value of data. Once policies and processes are well defined, accountability is established, and decision structures and rules are in play, we can say that data governance has improved. To measure the level of maturity, though, we should ask ourselves more questions. Before I delve into that, I want to touch upon the data governance domains. All of these are focused on creating a data-centric culture and developing a more data-mature organization.

Data Governance domains

  1. Data Principles: One of the initial steps in data governance is clarifying the role of data as an asset for the organization. This ties closely with organizational goals and business context. This also highlights the importance of engaging all departments, because context is subjective.
  2. Data Quality: There is nothing more embarrassing to an Analytics team than presenting bad or erroneous data. Data quality involves determining requirements or intended use of data assets. Based on these, rules are to be defined and data quality reports to be produced at regular intervals to maintain a high degree of trust.
  3. Metadata, Glossary and Master Data Management: Semantics or description of data enables exploration. Presence of a data glossary helps both technical and non-technical audiences understand and adopt data assets for their use. Master data or reference data refers to data points on various subjects of interest such as customers, products and vendors that could be scattered across applications and functional areas in an enterprise. Master data management is about ensuring the accuracy, completeness and consistency of enterprise-wide reference data.
  4. Data Access and Compliance: It is great to curate data assets and remove barriers to their adoption, but who should have access to what data, and for how long? These are important questions, with regulations concerning data management such as GDPR becoming more stringent over the years. The granting of data access permissions needs to be well controlled, and both local and international laws must be taken into account throughout the data life cycle.
  5. Analytics and Business Intelligence: Depending on the data analytics maturity of your organization, you may have business intelligence (BI) solutions or self-service BI capabilities. If you are further down the track, you’d have predictive capabilities that open up new avenues of business for you. No matter at what stage you are, you should have good quality data and solid data literacy in the organization to reap maximum rewards. Data governance should support analytics and BI efforts in areas such as deciding what data to use for exploration or to generate insights, who gets to decide what data is collected and measured, who gets access to this data and so forth.
Data Analytics Maturity – stages
Courtesy: https://info.microsoft.com/rs/157-GQE-382/images/EN-CNTNT-SQL-Data%20Analytics%20Maturity%20Model-en-us.pdf
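The data quality domain above talks about defining rules and producing quality reports at regular intervals. A minimal pure-Python sketch of that idea – the rules and records below are invented for illustration:

```python
# Each quality rule is a (name, predicate) pair applied to every record.
rules = [
    ("revenue is non-negative", lambda r: r["revenue"] >= 0),
    ("month is populated", lambda r: bool(r.get("month"))),
]

# Hypothetical incoming records -- two of them violate a rule each.
records = [
    {"month": "2020-01", "revenue": 120.0},
    {"month": "", "revenue": 95.5},         # fails "month is populated"
    {"month": "2020-03", "revenue": -1.0},  # fails "revenue is non-negative"
]

def quality_report(records, rules):
    """Count rule failures across a batch -- the kind of summary a
    governance council might review at regular intervals."""
    report = {name: 0 for name, _ in rules}
    for record in records:
        for name, predicate in rules:
            if not predicate(record):
                report[name] += 1
    return report

print(quality_report(records, rules))
# {'revenue is non-negative': 1, 'month is populated': 1}
```

Real deployments would push such rules into a dedicated tool or pipeline, but the shape stays the same: explicit rules, applied consistently, with failures surfaced for review.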

Data Governance Council

A data governance council is a body responsible for the strategic guidance of the data governance program. It is one piece of organizational governance. I believe every organization should have one in order to treat data as a valuable business asset and create a data-centric culture where good data is everyone’s responsibility.


A data governance council is formed of business unit heads, representatives and nominated data stewards from each business team. It is a collaborative effort that brings together key players from all departments within an organization. The council oversees data projects and initiatives, data policies and standards, and aims to improve data awareness. Data quality issues are reviewed, and corrective measures are agreed on and implemented with a view to always maintaining data validity. The grand vision of a data governance council could be building a more data-mature organization.

Implementation/Tools

Data governance implementation requires a concerted effort that needs to be announced, endorsed and supported by business leaders. Depending on the size of the organization, it may prove to be quite a herculean task. There are several enterprise tools available to make this journey easier; however, a tool is not always the right answer. It is perfectly okay to kick off the assessment and policy definition phases without adopting a dedicated tool – data governance is more about people and how they use data, and less about tooling. That said, a few popular tools are Collibra, Informatica and Talend. As I mentioned earlier in this blog, Microsoft recently announced their new governance tool – Azure Purview – which looks promising if you already use Azure data products.

Measure of Progress

I hope this has been an interesting read so far and you are convinced that data governance is an area that needs more attention in the months to come. If your organization already has a data governance program, how can you measure its progress? There are a few key questions you could try to answer, and they might help you evaluate your data governance maturity.

  1. Is data that is available for consumption complete, and is it available on time?
  2. Are you happy with the quality of data, are there data quality reports available and are they reviewed?
  3. Is there a data glossary available and is it up to date, is data lineage clearly defined?
  4. Are compliance and security measures satisfactory?
  5. Is your Business Intelligence platform helping teams achieve their goals, and are your analytics solutions adding value?
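One lightweight way to track these questions over time is to record the answers each review cycle and turn them into a score. A tiny sketch – the equal weighting of the questions is an arbitrary assumption, purely for illustration:

```python
# The five maturity questions above, answered True/False for one review cycle.
answers = {
    "data complete and on time": True,
    "quality reports available and reviewed": False,
    "glossary and lineage up to date": False,
    "compliance and security satisfactory": True,
    "BI and analytics adding value": True,
}

# Fraction of questions answered positively, tracked cycle over cycle.
score = sum(answers.values()) / len(answers)
print(f"Governance maturity: {score:.0%}")  # Governance maturity: 60%
```

The absolute number matters far less than its trend from one quarter to the next.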

No business is perfect or has answers to all of these questions. The idea is to be prepared and ready to solve complex business problems by leveraging your curated data assets, so that better decisions can be made quicker to make a difference.

Hello 2021

Last year threw a number of obstacles at us. It is no longer possible to keep doing what we have always done and expect great results. The rules of the game have changed, and we need to be quick to adapt, for change is the only constant. If you are a data professional, now is an exciting time to review your standard practices and tweak and refine them to ensure that your business keeps going from strength to strength. As the old adage goes, when the going gets tough, the tough get going!

Happy new year! I wish you a wonderful year and decade ahead.

References:

https://www.linkedin.com/learning/learning-data-governance/welcome
https://profisee.com/data-governance-what-why-how-who/
https://www.leighpartnership.com/putting-the-data-strategy-to-work/
https://www.leighpartnership.com/2834-2/
https://www.lightsondata.com/data-governance-maturity-models-dataflux/

Azure Databricks – Introduction (Free Trial)

Microsoft’s Azure Databricks is an advanced Apache Spark platform that brings data and business teams together. In this introductory article, we will look at what the use cases for Azure Databricks are, and how it really manages to bring technology and business teams together.

Databricks

Before we delve deeper into Databricks, it is good to have a general understanding of Apache Spark.

Apache Spark is an open-source, unified analytics engine for big data processing, maintained by the Apache Software Foundation. Spark and its resilient distributed datasets (RDDs) were developed in response to the limitations of the MapReduce model, with the RDD paper published in 2012.

Key factors that make Spark ideal for big data processing are:

  • Speed – up to 100x faster than Hadoop MapReduce for certain in-memory workloads
  • Ease of use – code in Java, Scala, Python, R and SQL
  • Generality – use SQL, streaming and complex analytics
Apache Spark ecosystem
Pic courtesy: Microsoft

Databricks – the company – was founded by the creators of Apache Spark. Databricks provides a web-based platform for working with Spark, with automated cluster management and IPython-style notebooks. It is aimed at unifying data science and engineering across the Machine Learning (ML) life cycle, from data preparation to experimentation and deployment of ML applications. Databricks, by virtue of its big data processing capabilities, also facilitates big data analytics. Databricks, as the name implies, thus lets you build solutions using bricks of data.

Azure Databricks

Azure Databricks combines Databricks and Azure to allow easy set-up of streamlined workflows and an interactive workspace that lets data teams and the business collaborate. If you’ve been following data products on Azure, you’d be nodding your head along, imagining where Microsoft is going with this 🙂

Azure Databricks enables integration across a variety of Azure data stores and services such as Azure SQL Data Warehouse, Azure Cosmos DB, Azure Data Lake Store, Azure Blob storage, and Azure Event Hubs. Add rich integration with Power BI, and you have a complete solution.

Azure Databricks Overview
Pic courtesy: Microsoft

Why use Azure Databricks?

By now, we understand that Azure Databricks is an Apache Spark-based analytics platform that has big data processing capabilities and brings data and business teams together. How exactly does it do that, and why would someone use Azure Databricks?

  1. Fully managed Apache Spark clusters: With the serverless option, create clusters easily without having to set up your own data infrastructure. Dynamically auto-scale clusters up and down, and auto-terminate inactive clusters after a predefined period of inactivity. Share clusters with your teams, reduce time spent on infrastructure management and improve iteration time.

  2. Interactive workspace: Streamline data processing using secure workspaces, assign relevant permissions to different teams. Mix languages within a notebook – use your favorite out of R, Python, Scala and SQL. Explore, model and execute data-driven applications by letting Data Engineers prepare and load data, Data Scientists build models, and business teams analyze results. Visualize data in a few clicks using familiar tools like Matplotlib, ggplot or take advantage of the rich integration with Power BI.

  3. Enterprise security: Use single sign-on (SSO) through Azure Active Directory integration to run complete Azure-based solutions. Role-based access control enables fine-grained user permissions for notebooks, clusters, jobs, and data.

  4. Schedule notebook execution: Build, train and deploy AI models at scale using GPU-enabled clusters. Schedule notebooks as jobs, using runtime for ML that comes preinstalled and preconfigured with deep learning frameworks and libraries such as TensorFlow and Keras. Monitor job performance and stay on top of your game.

  5. Scale seamlessly: Target any amount of data or any project size using a comprehensive set of analytics technologies including SQL, Streaming, MLlib and GraphX. Configure the number of threads, select the number of cores and enable autoscaling to dynamically scale processing capabilities, leveraging a Spark engine made faster and more performant through optimizations at the I/O and processing layers (Databricks I/O).

Of course, all of this comes at a price. If this article has piqued your interest, hop over to the Azure Databricks homepage and take advantage of the 14-day free trial!


Suggested learning path:

  1. Read more about Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
  2. Create a Spark cluster and run a Spark job on Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal#clean-up-resources
  3. ETL using Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
  4. Stream data into Azure Databricks using Event Hubs – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs
  5. Sentiment analysis on streaming data using Azure Databricks – https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-sentiment-analysis-cognitive-services

I hope you found the article useful. Share your learning experience with me. My next article will be on Real-time analytics using Azure Databricks.


Resources:

https://azure.microsoft.com/en-au/services/databricks/
https://databricks.com/product/azure
https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
https://docs.microsoft.com/en-us/azure/azure-databricks/quickstart-create-databricks-workspace-portal#clean-up-resources
https://databricks.com/blog/2019/02/07/high-performance-modern-data-warehousing-with-azure-databricks-and-azure-sql-dw.html