Here we look at what Azure Synapse Analytics is, why it was created, and what problems it solves. We then introduce the full range of its tools and capabilities.
Organizations want to use analytics creatively and intelligently to fulfill their business needs. Azure Synapse Analytics combines data integration, enterprise data warehousing, and big data analytics in one unified service. It enables organizations to query their data on their own terms. It relieves organizations of complex data loading and preparation while delivering tools for big data analytics and accelerated time-to-insight.
Azure Synapse Analytics was built on the tenets of the modern data warehouse: it combines heterogeneous data sources and provides business insights through analytics tools, including reporting, dashboards, and visualizations. These capabilities significantly shorten time-to-insight for organizations, which can use the insights to improve their decision-making.
In this series, we explore Azure Synapse Analytics, a limitless analytics service that merges data integration, data warehousing, and big data analytics into an Azure-based unified environment. Azure Synapse Analytics ingests, stores, analyzes, visualizes, and serves data for business intelligence (BI) and machine learning (ML).
Data Ingestion
With Azure Synapse Analytics, we can work with data sources stored in various environments, including on-premises, Azure, or other clouds. Data may come from business applications, customer relationship management (CRM) software, banking databases, or social media. After defining data sources, Azure Synapse Analytics takes them to the ingestion and preparation step, where the Azure Data Factory service loads and orchestrates their data.
Note that the data is still raw at this point and not ready for users to consume. We can therefore store it in Azure Data Lake Storage Gen2 and later explore, prepare, train, model, and serve it in a format that data scientists can consume.
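To make the ingestion step more concrete, here is a minimal sketch of how a pipeline with a single copy step might be described. Data Factory pipelines are defined as JSON documents, and the Python below only assembles such a document as a dictionary; it does not call any Azure service. The pipeline name, dataset names, and the choice of a delimited-text source and Parquet sink are hypothetical, chosen for illustration only.

```python
import json


def copy_pipeline(name, source_dataset, sink_dataset):
    """Build a pipeline definition containing one Copy activity.

    The dataset names are references to datasets assumed to be
    defined elsewhere in the workspace (hypothetical here).
    """
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": "CopyRawData",
                    "type": "Copy",
                    "inputs": [
                        {"referenceName": source_dataset, "type": "DatasetReference"}
                    ],
                    "outputs": [
                        {"referenceName": sink_dataset, "type": "DatasetReference"}
                    ],
                    "typeProperties": {
                        # Assumed formats: CSV export in, Parquet out.
                        "source": {"type": "DelimitedTextSource"},
                        "sink": {"type": "ParquetSink"},
                    },
                }
            ]
        },
    }


# Hypothetical names, for illustration only.
pipeline = copy_pipeline("IngestCrmExport", "CrmCsvExport", "RawZoneParquet")
print(json.dumps(pipeline, indent=2))
```

In practice we would usually build this kind of pipeline visually rather than by hand, but the JSON form shows what the visual designer produces: a named activity that reads from one dataset and writes to another.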
Azure Data Lake Storage Gen2
As a modern data warehouse, Azure Synapse Analytics can ingest raw, unstructured data from a data lake. Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage, that lets us build enterprise data lakes. It provides file system semantics, file-level security, and scale, and it handles massive amounts of unstructured data at low cost with tiered storage, high availability, and resilience.
Once Azure Data Factory ingests the data, it can store and centralize it in Azure Data Lake Storage Gen2. This data lake spans the entire Azure Synapse Analytics architecture and is always available to other components within the ecosystem.
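Other components typically address files in the data lake through an `abfss://` URI that names the container (file system), the storage account, and the path. The helper below is a small sketch of how such a URI is put together; the account, container, and file names are hypothetical.

```python
def abfss_uri(account, container, path=""):
    """Build an abfss:// URI for a path in Azure Data Lake Storage Gen2.

    account   -- storage account name
    container -- container (file system) name
    path      -- path inside the container; leading and trailing
                 slashes are ignored
    """
    if not account or not container:
        raise ValueError("account and container are required")
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    clean_path = path.strip("/")
    return f"{base}/{clean_path}" if clean_path else f"{base}/"


# Hypothetical storage account, container, and file.
raw_orders = abfss_uri("contosolake", "raw", "/sales/2021/orders.parquet")
print(raw_orders)
```

Keeping the URI construction in one place makes it easier to point the same code at a different account or container later.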
Data Exploration, Training, and Service
Because Azure Data Lake Storage Gen2 stores ingested data in its raw form, Azure Synapse Analytics must first transform the data before serving it to the data warehouse's end users.
Databricks is a leading cloud solution that bridges the gap between a data lake and data warehouse, a combination known as “lakehouse.” Azure has its own implementation, Azure Databricks. This is the underlying cloud tool that enables Azure Synapse Analytics to explore, prepare, train, and transform data. Azure Databricks provides data engineers and scientists a collaborative platform. It also allows Azure Synapse Analytics to process and transform massive amounts of data while exploring the data with machine learning models.
Data Query Services
Azure Synapse Analytics supports three types of query services: dedicated SQL pools, serverless SQL pools (formerly SQL on-demand), and Apache Spark pools.
The dedicated SQL pool, formerly Azure SQL Data Warehouse (SQL DW), provides the enterprise data warehousing features in Azure Synapse Analytics. It represents a collection of analytic resources provisioned when we use Azure Synapse SQL. It works like a traditional SQL data warehouse and is the usual destination for big data solutions. We can import big data into a dedicated SQL pool to run high-performance analytics, and the pool then becomes the single source of truth for fast, robust business insights.
The serverless SQL pool, formerly known as SQL on-demand, is a query service that lets us run SQL queries on CSV, Parquet, and JSON files in Azure Storage. It gives us access to our data through familiar T-SQL syntax. We can run queries and get fresh results, or load and copy data into another, specialized store for later use.
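As a rough illustration of querying files in place, the sketch below builds the kind of T-SQL statement we could submit to the serverless pool, using the `OPENROWSET(BULK ...)` function to read a file directly from storage. The Python only composes the query text and does not connect to anything. The storage URL is hypothetical, and the helper deliberately supports only Parquet and CSV to stay short.

```python
def openrowset_query(file_url, file_format="PARQUET", top=10):
    """Compose a T-SQL query that reads a file in place with OPENROWSET.

    file_url    -- URL of the file or folder in Azure Storage
    file_format -- 'PARQUET' or 'CSV' (this sketch supports only these)
    top         -- maximum number of rows to return
    """
    fmt = file_format.upper()
    if fmt not in {"PARQUET", "CSV"}:
        raise ValueError(f"unsupported format: {file_format}")
    if int(top) <= 0:
        raise ValueError("top must be a positive integer")
    # Double any single quotes so the URL stays a valid T-SQL string literal.
    escaped_url = file_url.replace("'", "''")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{escaped_url}',\n"
        f"    FORMAT = '{fmt}'\n"
        ") AS [result];"
    )


# Hypothetical storage account, container, and file.
query = openrowset_query(
    "https://contosolake.dfs.core.windows.net/raw/sales/2021/orders.parquet"
)
print(query)
```

The resulting text can be pasted into a SQL script connected to the serverless pool. Because nothing is loaded beforehand, the query always reads whatever is currently in the file.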
An Apache Spark pool provides distributed, in-memory computing to boost big data analytic processing. Spark jobs save time by keeping in memory data that would otherwise be loaded repeatedly, resulting in faster applications. Azure Synapse Analytics includes its own implementation of Apache Spark in the cloud. We can use it to process our Azure Storage and Azure Data Lake Storage Gen2 data. Azure Synapse Analytics makes it easy to create and configure a serverless Apache Spark pool in Azure.
The Apache Spark runtime in Azure Synapse Analytics provides faster processing than standard Spark through improvements including query and cluster optimization, autoscaling, intelligent caching, and indexing.
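The in-memory behavior described above can be sketched with two standard PySpark calls: `spark.read.parquet` to load the data and `DataFrame.cache` to mark it for reuse in memory. The function below takes the Spark session as a parameter, so it assumes a session such as the one a notebook attached to a Spark pool provides. The function name is hypothetical.

```python
def load_cached(spark, path):
    """Read a Parquet dataset and mark it for in-memory caching.

    spark -- an active SparkSession (assumed to be supplied by the
             environment, for example a notebook attached to a Spark pool)
    path  -- location of the Parquet data, such as an abfss:// URI
    """
    df = spark.read.parquet(path)
    # cache() is lazy: the data is kept in memory after the first action
    # that computes it, so later actions avoid re-reading the files.
    return df.cache()
```

With a session available, a call such as `orders = load_cached(spark, path)` followed by several aggregations on `orders` reads the underlying files only once.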
Azure Synapse Studio
Azure Synapse Studio is the core management tool for controlling Azure Synapse Analytics’ many features. It comes with a sleek user interface (UI) that Microsoft designed with data engineers and data scientists in mind. Rather than being just another UI tool, it unifies the end-to-end experience of existing Azure data services, using a central UI to ingest, explore, analyze, and visualize data. For example, Azure Synapse Studio enables us to query data using serverless or dedicated SQL pools.
The Knowledge Center is part of Synapse Studio, and its goal is to guide beginner developers effortlessly. With Knowledge Center, we are only a few clicks away from practical, immersive learning material.
When we choose “Use samples immediately,” we access a set of ready-to-use samples to quickly learn concepts and practice analytics with scripts, notebooks, pools, and data. We can explore data with Apache Spark, query data with SQL, and create an external table with SQL. Another option shows us how to use the serverless SQL pool to execute a query on Parquet file data.
Choosing “Browse gallery” takes us to a complete list of sample code, Azure Open Datasets, and templates. It includes sample notebooks, SQL scripts, and templates for pipelines that automate data integration and transformation.
We can also tour Synapse Studio from the Knowledge Center. It guides us as we get started with Azure Synapse Analytics features.
As we get started, Synapse Studio offers helpful tips on filling in the UI fields and alerts us about any mistakes. This saves us time with later troubleshooting.
Azure Synapse Notebooks
Data engineers and data scientists are likely familiar with the widespread interactive computing provided by Jupyter notebooks. Azure Synapse Studio’s notebooks feature provides a consistent notebook experience for your analytics needs, using the same file format as Jupyter notebooks so you can get started quickly.
A Synapse Studio notebook is a web interface where we can experiment with data to demonstrate, get insights into, and validate our ideas. A notebook is a readable, human-friendly document you create by freely adding blocks of text and code snippets. We can use the formatted text blocks to write a rich narrative.
At the same time, the code snippets between them can run instantly at any time to query our data sources and render results. Notebooks illustrate our text with reports, charts, and other data visualizations, as well as machine learning insights and Big Data scenarios.
Synapse notebooks enable us to add code to query, manipulate, and analyze data from our unstructured and structured data sources using languages like Python, Scala, Spark SQL, and C#.
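Since these notebooks use the Jupyter file format, it can help to see what that format looks like. A notebook file is a JSON document holding a list of cells, each either a markdown (text) cell or a code cell. The sketch below assembles a minimal two-cell notebook with the Python standard library only; the cell contents, including the `path` variable inside the code cell, are hypothetical.

```python
import json


def build_notebook(cells):
    """Build a minimal Jupyter-format (nbformat 4) notebook document.

    cells -- list of (kind, source) pairs, where kind is
             'markdown' or 'code' and source is the cell text
    """
    notebook_cells = []
    for kind, source in cells:
        if kind not in {"markdown", "code"}:
            raise ValueError(f"unknown cell type: {kind}")
        cell = {"cell_type": kind, "metadata": {}, "source": source}
        if kind == "code":
            # Code cells also record their execution state and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        notebook_cells.append(cell)
    return {
        "cells": notebook_cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = build_notebook([
    ("markdown", "# Order analysis\nHow many rows does the sample contain?"),
    ("code", "df = spark.read.parquet(path)\ndf.count()"),
])
print(json.dumps(notebook, indent=2))
```

Saved with an `.ipynb` extension, a document of this shape is what notebook tools read and write: the markdown cell carries the narrative and the code cell carries the snippet that runs against our data.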
Machine Learning and Business Intelligence
Azure Synapse Analytics includes many popular libraries for those interested in machine learning, like Spark MLlib and libraries within the Anaconda Python distribution platform.
We can use the Azure Machine Learning pipeline to target Apache Spark pools in the data preparation and data training steps. Apache Spark allows Azure Synapse Analytics to perform machine learning over big data and draw valuable insights from massive quantities of structured and unstructured data.
After it models and serves data, Azure Synapse Analytics can combine with Power BI to produce insights and actions from large volumes of structured or unstructured data. We can create and manage Power BI datasets and reports without leaving Azure Synapse Analytics. This deep integration with Power BI allows Azure Synapse Analytics to create high-performance big data queries and intelligent materialized views based on usage patterns.
Conclusion
As we have seen, Microsoft released Azure Synapse Analytics as the next generation of Azure SQL Data Warehouse. It delivers high performance over large data volumes, with built-in capabilities for data ingestion, data preparation, machine learning, and visualization.
In the upcoming articles of this series, we will explore how Azure Synapse Analytics helps with data preparation and management, eliminating the need for custom extract, transform, and load (ETL) code. Then, we will demonstrate how Azure Synapse Analytics drives data science and business intelligence.
To learn more, continue to the second article of this series, which explores data preparation and management with Azure Synapse Analytics.
Also, check out Microsoft’s Hands-on Training Series for Azure Synapse Analytics. Each 60-minute webinar offers a deep dive into Azure Synapse. You can start your first Synapse workspace, build code-free ETL pipelines, natively connect to Power BI, connect and process streaming data, and use serverless and dedicated query options.