Understanding Table Distribution & Index Types in Azure Synapse Analytics

I’d like to consider myself as being somewhat knowledgeable when it comes to Microsoft SQL Server. However, when recently studying for the DP-200: Implementing an Azure Data Solution exam, I did find myself doubting my abilities. The cause of this? The topic of Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse). This product introduces lots of new concepts, features and capabilities that you feel you should know already. It’s very similar to SQL Server and uses the same T-SQL query language; therefore, it can’t be that different, right? 😅 Well, it turns out, the underlying architecture of the platform, alongside several core capabilities, differ vastly from what you would expect traditional SQL Server to support, and even it’s corresponding cloud offering, Azure SQL Database. The main challenge I found was understanding how data distribution works on the platform, and also on the most suitable index type to configure for a table, depending on its usage scenario. So what I thought I’d do in today’s blog is discuss these topics in detail, and if, like me, you found yourself struggling to understand them, hopefully, this post will help you to gain a better understanding of these core concepts.

Understanding Table Distribution

When working with a traditional SQL Server database, we anticipate our database will reside within a file and, in most cases, stored on a single disk drive. Azure Synapse Analytics adopts a far more complex, but arguably scalable and cost-efficient, means of storing your data. As noted in this Microsoft Docs article, Azure Storage is utilised in a distributed manner, to hold your data securely. When you connect to an Azure Synapse Analytics service via SSMS, it’s not a SQL Server database engine in the traditional sense. Instead, you’re hooking yourself up to a control node, which then leverages a compute node and Data Movement Service (DMS) to process your queries, find your data and return this in the format you’ve requested. How the platform distributes your data is dictated by you when you build your table out for the first time. So the best way to understand this is to take a look at an example. The query below resembles a traditional CREATE TABLE script, with an additional modification in the second portion of the code snippet. Can you spot what it is?

CREATE TABLE MySchema.MySynapseAnalyticsTable
(
	MyColumn1 NVARCHAR(250) NOT NULL,
	MyColumn2 MONEY NULL,
	MyColumn3 INT NULL
)
WITH
(
	DISTRIBUTION = HASH(MyColumn1),
	CLUSTERED COLUMNSTORE INDEX
);
GO

The first line within the WITH options section is what controls table distribution, and we have three options to choose from:

Hash: In this option, the platform assigns each row in the table to its own distribution set, with a corresponding column set as the distribution column. As you add new rows to the table, Synapse Analytics evaluates the value within the distribution column and, if a distribution for this exists, then it is assigned to that; otherwise, a brand new one gets created. Distributions of this type will typically be most effective when you need to perform joins or aggregate queries targeting the distribution column, and for situations where you’re working with large fact/dimension tables that are over 2GB in size.
Round Robin: Data within tables distributed in this manner are done so evenly, with no control over how the initial distribution set is defined. Queries involving joins that target tables of this type will typically suffer from poor performance. You should only really use this distribution type for temporary tables or staging tables.
Replicated: Designed for very small, (ideally) dimension tables, that are less than 2GB size, rows in tables of these types are copied out to every compute node. As such, you can expect your storage space and costs to balloon if you’re not too careful with this distribution type. The performance of your queries also suffers over time, given that writes to the table will need to occur across multiple compute nodes simultaneously.

So in the example above, we are specifying that the table should use the Hash distribution method, with MyColumn1 set as the distribution column. We would anticipate the value of this column to be used as part of joins further down the road.

Indexes: What’s available and what’s best for your particular needs

Returning to the example query above yet again, you will also notice that we are supplying a second option as part of our WITH options; in this case, CLUSTERED COLUMNSTORE INDEX. Here we are instructing the table on which index type to leverage and, similar to our distribution types, we have three options available:

Clustered/Nonclustered: For those coming from a SQL Server background, these are perhaps the most self-explanatory index types on offer. They are designed for situations where you may need to return a single or very few rows in the table. Often, these will be the types of queries that your front-end reporting application will be running most regularly. Given their precise nature, they will not be suitable when you anticipate all manner of different queries hitting your tables.
Clustered Columnstore: Synapse Analytics creates this as the default index when you don’t specify an option for your table and are generally the first option you should consider when you are unsure what’s needed. As well as providing a high level of compression and overall best query performance, they are also well suited for larger tables as well. However, they don’t support any tables that contain maximum length string/binary columns (NVARCHAR(MAX) etc.). Also, for tables with fewer than 60 million rows, you may notice that you are not getting the best overall compression for your data. In this situation…
Heap: …consider using this index type. Tables configured with this will typically benefit from faster load times. As such, they are the best type of index to use for temporary or staging tables.

So, at this stage, it should start to become apparent how you can best mix & match distribution and index types, to achieve the best overall performance for your applications. Round robin tables with a heap index will be your best choice for any table that is used to load or hold data temporarily. For a table smaller than 2GB that are targeting precise, laser-focused queries, a replicated table with the required clustered/nonclustered indexes will be best. Flipping this on this head, finally, consider a hash table with a clustered columnstore when you have a huge table, and you want to ensure Synapse Analytics compresses your data most effectively.

Although your particular needs may vary when it comes to the Azure Synapse Analytics, this should hopefully cover most scenarios and give you a flavour of what will work best for your needs. I could spend many hours and weeks with Azure Synapse Analytics, as it looks to be a competent product, that combines the old and new in some pretty exciting ways. If you have any questions about Azure Synapse Analytics at all, let me know in the comments below. 😀