Data Engineering within Microsoft Azure is a vast subject. Long gone are the days when we only had to concern ourselves with traditional tools such as SQL databases, or stick to a single vendor or stack of our choosing. Now, to build the modern, cloud-first applications organisations require on Microsoft Azure, which typically involve processing lots of data, a solid grasp of the following tools becomes necessary:
- Synapse Analytics - For situations where we need a data warehousing solution, Synapse Analytics is the natural candidate to consider. In addition, it very much behaves just like a traditional SQL Server database, meaning we know what to expect “out of the box”.
- Cosmos DB - Typically best for semi-structured data, or data with no structure at all, Cosmos DB has numerous APIs available, such as ones for SQL and MongoDB. This means it becomes straightforward to migrate our existing applications across into it and, in the process, scale them out to run globally.
- Data Factory - A tool that I have a lot of fondness for, Data Factory is designed to help implement your Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) processes at scale. As the natural successor to SQL Server Integration Services (SSIS), it also has native support for running SSIS packages.
- Databricks - For when Data Factory can’t perform the type of processing or transformation you need for your application, we can spin up fully managed compute environments using Databricks and execute simple or complex scripts using Python, Scala, SQL or R.
- Storage - Typically, data engineers need to utilise blob, file or Data Lake storage alongside the aforementioned tools. Therefore, having a complete understanding of what’s available across the Azure stack is essential.
- Stream Analytics - Designed for situations where you have consistent application data that you need to process in real-time, Stream Analytics has various options available to help get your data where it needs to be.
Previously, if we were looking to validate our skills in the above technologies, we would have to sit two exams, DP-200 and DP-201, to earn our cred and, in the process, a shiny new Azure Data Engineer Associate certification.
To help simplify this journey, Microsoft will retire these exams at the end of June 2021 and replace them with a single new exam instead - DP-203: Data Engineering on Microsoft Azure. The exam has just recently come out of public beta, meaning now is an excellent opportunity to go for it. As well as grasping the topics above, candidates also need to demonstrate expertise in:
- Core design concepts, including designing data storage, partitioning your data, and the types of physical data storage structures to leverage.
- Data ingestion mechanisms, leveraging a wide array of potential data formats, such as JSON, Parquet, CSV files and more.
- Identifying the best tool to use for the type of data you wish to consume. For example, Stream Analytics is the natural tool to consider if you are processing data from a set of Internet of Things (IoT) devices or similar.
- Securing the various storage and ingestion tools we may leverage on Azure, using role-based access controls, data encryption and Azure Active Directory (where appropriate).
- Choosing and utilising the best tool to monitor the various services we deploy out to Azure and manage common issues, such as pipeline failures within Azure Data Factory.
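To make the partitioning and data format points a little more concrete, here’s a minimal sketch in plain Python. It groups some made-up IoT readings into date-based folders and serialises them as JSON lines or CSV. The `year=/month=/day=` folder layout, the function names and the sample readings are all assumptions of mine for illustration; none of this calls an Azure SDK, and a real solution would likely lean on Data Factory, Databricks or Synapse to do the heavy lifting.

```python
import csv
import io
import json
from collections import defaultdict
from datetime import datetime


def partition_path(timestamp: str) -> str:
    """Build a date-based folder path (a hypothetical layout) from an ISO timestamp."""
    ts = datetime.fromisoformat(timestamp)
    return f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"


def partition_records(records):
    """Group records by the partition path derived from their 'timestamp' field."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_path(record["timestamp"])].append(record)
    return dict(partitions)


def to_json_lines(records) -> str:
    """Serialise records as JSON lines: one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def to_csv(records) -> str:
    """Serialise records as CSV, with a header row taken from the first record's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical IoT readings, made up purely for illustration.
readings = [
    {"device_id": "sensor-1", "timestamp": "2021-06-29T23:59:00", "temperature": 21.5},
    {"device_id": "sensor-2", "timestamp": "2021-06-30T00:01:00", "temperature": 19.0},
    {"device_id": "sensor-1", "timestamp": "2021-06-30T08:30:00", "temperature": 22.1},
]

partitions = partition_records(readings)
for path, rows in partitions.items():
    # Each partition could then be written out as its own file, in either format.
    print(path, len(rows))
```

Splitting data by date like this is a common way to keep queries over a narrow time window from scanning everything, which is the kind of trade-off the design questions tend to probe.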
To give you some perspective, I found a lot of this a real struggle when I first took the DP-200 and DP-201 exams last year. My comfort zone ends just as we start to move away from Data Factory so, while it was interesting to learn about new things like Synapse Analytics, it was a lot to take in. So much so that I busted out and failed the DP-200 exam the first time 😥. I managed to pass it in the end and, through some miracle, got a passing grade in the DP-203 exam too. 😅
But don’t let any of this dissuade you. We’ve got some excellent learning tools at our disposal via the Microsoft Learn site, which provides up-to-date and relevant content to help you understand each topic, put it into practice and be in a great position to pass the exam.
If you’re going for the DP-203 exam yourself, I hope you’ve found this post helpful. Feel free to leave a comment below if you have any questions about it. 😀