Getting Started with Data Analysis in Azure Synapse Analytics Part 1: Overview

DaveNoderer

5 Jul 2021

In this article, we learn how Azure Synapse Analytics helps you analyze, understand, and report your big data to drive business insights.

Here we explore how Azure Synapse Analytics helps you analyze, understand, and report your data. We also take a quick look at two of its core components: SQL pools and Apache Spark pools.

Azure Synapse Analytics provides advanced, scalable tools for analyzing data from many sources. It does so in a single, easy-to-manage environment. Its reporting functions help you to understand your data, driving business insights. This series of articles introduces Synapse workspaces, Synapse Studio, the Synapse Studio Knowledge Center gallery (which features sample datasets, notebooks, SQL scripts, and pipelines), serverless SQL, and Spark notebooks.

We’ll take you step-by-step through creating a workspace, working with existing sample data from the gallery, and using some of the available tools to get you started in this exciting space.

We’ll begin by examining the shift in how organizations work with data, and then explore SQL pools and Apache Spark.

Shifting from ETL to ELT

In the past, data engineers, data scientists, business analysts, and developers relied on data warehouses. In general, creating a data warehouse involved defining measures and key performance indicators based on a well-defined schema.

Developing and maintaining a data warehouse took extensive effort. We needed to perform extract, transform, and load (ETL) operations. ETL involved working with every data source and coercing its data into a standard format, then validating that data and loading it into the data warehouse.

Mapping between systems, file types, fields, data types, and lookups was complicated. You could invest millions of dollars in a data warehouse before it yielded any results. Changes to the business, its systems, or the types of data ingested could also force modifications that rippled through the existing data.

While much of this work is still required, organizations are increasingly moving to extract, load, and transform (ELT) operations. Systems like the Microsoft Azure Data Lake are built on this approach. Here, we can dump any data we have into the data lake in its raw form and have it available for any process to use. Essentially, we’re delaying the transform step until we need the data. Different data applications may apply different transforms to the same raw data.
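To make the ELT idea concrete, here is a minimal sketch in plain Python. It is not Azure code: the raw records, field names, and helper functions are all hypothetical, invented for illustration. It shows records being loaded untouched, with two different consumers each applying their own transform only when they read the data.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (the "load" step).
RAW_LAKE = [
    '{"order_id": 1, "amount": "19.99", "country": "us"}',
    '{"order_id": 2, "amount": "5.00", "country": "DE"}',
    '{"order_id": 3, "amount": "12.50", "country": "us"}',
]


def load_raw(lines):
    """Load step: keep every record as-is, with no cleanup and no schema."""
    return list(lines)


def revenue_by_country(raw):
    """Transform for a reporting consumer: parse, normalize, and aggregate."""
    totals = {}
    for line in raw:
        record = json.loads(line)
        country = record["country"].upper()  # normalize casing at read time
        totals[country] = totals.get(country, 0.0) + float(record["amount"])
    return {country: round(total, 2) for country, total in totals.items()}


def order_ids(raw):
    """A different transform over the same raw data, for another consumer."""
    return [json.loads(line)["order_id"] for line in raw]


lake = load_raw(RAW_LAKE)
print(revenue_by_country(lake))
print(order_ids(lake))
```

Because the raw records are never rewritten, adding a third consumer later means writing a new transform, not reworking the load step.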

Another challenge organizations face is dealing with big data. Big data may be large (petabytes of data), complex, or arrive in high-speed streams. Scaling to handle big data in an on-premises data warehouse can be problematic. As the data expands, new servers must be ordered, added to the data center, and then configured, all to work on a single problem.

Serverless SQL pools (formerly SQL on-demand) can query across multiple data sources, including Azure Data Lake Storage Gen2, Spark tables, and Cosmos DB, which allows flexible exploration of diverse data.

With on-premises hardware, it takes time to respond to a business change that requires expansion, and there are no savings if you need to scale back. Moving to the cloud and using software like Apache Spark can save time, but it still requires administration, care, and feeding.

Although these problems are complex and time-consuming, Microsoft Azure with HDInsight has addressed some of these scaling challenges. With Azure Synapse Analytics, organizations can bring development and production into a single tool and significantly reduce resource administration.

SQL Pools and Apache Spark pools are two essential features of Azure Synapse Analytics. We describe them below and explore hands-on examples in subsequent articles. After you have worked through the examples in this article series, you will be all set to start working with your data.

SQL Pool

Every Azure Synapse workspace includes a built-in serverless SQL pool. This pool, formerly called SQL on-demand, enables you to query data in your data lake using Microsoft Transact-SQL (T-SQL) syntax and related tools and functions.

You can also create one or more dedicated SQL pools (formerly called SQL Data Warehouse, or SQL DW), which might be useful depending on your project, configuration, and authentication requirements.
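As a rough illustration of what querying the data lake with T-SQL can look like, the sketch below builds a query string that uses the `OPENROWSET` function to read Parquet files in place. The helper name `build_openrowset_query` and the storage URL are hypothetical, made up for this example; substitute the path to your own files. The code only produces the query text and does not connect to Azure.

```python
def build_openrowset_query(storage_url, top=10):
    """Return T-SQL that reads Parquet files in place via OPENROWSET.

    This helper is hypothetical; it only assembles a query string.
    """
    if top < 1:
        raise ValueError("top must be a positive integer")
    # Double any single quotes so the URL is a valid T-SQL string literal.
    safe_url = storage_url.replace("'", "''")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


# Hypothetical storage account, container, and folder names.
query = build_openrowset_query(
    "https://contosolake.dfs.core.windows.net/sales/orders/*.parquet",
    top=5,
)
print(query)
```

You could paste the printed query into a SQL script in Synapse Studio and run it against the built-in serverless pool, assuming your account has access to the files.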

Apache Spark

Apache Spark is a popular open-source system for handling big data. Big data projects benefit from extensive parallel processing of different parts of the data and from keeping intermediate results in memory to minimize round trips to permanent storage. Apache Spark facilitates this by distributing processing and memory across a cluster to analyze large datasets efficiently.

You can use Apache Spark for batch processing: filtering, aggregating, and transforming data into usable datasets. Another application is feeding machine learning models for training and execution. Data scientists need to process large amounts of data to create models, find trends, and predict future scenarios. Apache Spark is also helpful for real-time data processing, where applications need to quickly ingest a stream of data and analyze it on the fly. There are many other uses for Apache Spark as well.

An Apache Spark pool has a configured set of execution resources and scales within those limits. When you submit your program, Spark divides the work into tasks. It schedules the tasks to run on executors managed by a cluster manager. The cluster manager spins resources up and down within the constraints you set.
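The divide-and-schedule pattern described above can be sketched with Python's standard library. This is an analogy, not the Spark API: a thread pool stands in for the executors, and the function names and the `max_executors` cap are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Divide the input into roughly equal slices, one slice per task."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares_task(partition):
    """One task: process a single partition independently of the others."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, max_executors=2):
    """Run one task per partition on a capped pool, then combine results."""
    partitions = split_into_partitions(data, num_partitions)
    # max_workers plays the role of the configured resource limit.
    with ThreadPoolExecutor(max_workers=max_executors) as pool:
        partial_results = list(pool.map(sum_of_squares_task, partitions))
    return sum(partial_results)


print(run_job(list(range(1, 11))))  # 385
```

Here four tasks share two workers, so tasks queue until a worker is free. Raising `max_executors` lets more tasks run at once without changing the job itself.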

Apache Spark is flexible and supports many languages, including Scala, Python, Java, SQL, R, and the .NET languages C# and F#. APIs for the various languages give you more control over your programs.

Next Steps

Azure Synapse Analytics provides focused tooling to handle big data jobs. This first article summarizes its capabilities, while the remaining two articles cover hands-on examples of SQL Pools and Apache Spark Pools.

Although the examples in this article series are simple, they introduce you to the steps for setting up Apache Spark on Azure. Note that if you run a heavy load on Apache Spark, resource use and charges can climb quickly.

We will use notebooks with Python, which enable easy data exploration and experimentation. Notebooks are also available for a variety of other languages and environments.

For more training, continue to the second article in this series to learn how to use SQL Pool.

To learn more about using Azure Synapse to drive business intelligence and machine learning, you can also attend the Hands-on Training Series for Azure Synapse Analytics.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here