PolyBase is one of the new interesting features of the SQL Server 2016/17. With this feature, we can query both relational (RDBMS) and non-relational databases (NoSQL). This Feature provides access to external data via the traditional t-SQL language.
What can we do with PolyBase Feature?
We can use this feature to query tables and files in Hadoop or in Azure Blob Storage. For example, we can directly query a .csv file from the Azure Storage and import or export data to/from Hadoop.
Why use PolyBase?
Earlier it was more difficult to connect SQL Server data with external data. To achieve this we had the following unpleasant options:
- Transfer half the data to align all data in one format
- Query these both sources of data and write custom query logic to join and integrate the data at the client level.
We can avoid this by PolyBase T-SQL to join the data.
Also, PolyBase does not require to install any additional software to the Hadoop environment. By using the traditional T-SQL commands we can query an external data.
We can use PolyBase to achieve the following:
- Access data stored in Azure Blob Storage.
- Access data stored in Hadoop from SQL Server or PDW.
- Import/Export data from Hadoop, Azure Blob Storage, or Azure Data Lake Store.
Performance Impact
- To improve query performance, the SQL query optimizer makes a cost-based decision to push computation to Hadoop. The process of pushing computation creates MapReduce jobs and leverages Hadoop’s distributed computational resources.
- Also, Performance can be improved by using SQL Server PolyBase scale-out groups. This enables simultaneous data transfer between SQL Server instances and Hadoop nodes adding compute resources for operating on the external data.
Similar to scaling out Hadoop to multiple compute nodes, we can use SQL Server PolyBase scale-out groups to improve query performance. This enables parallel data transfer between SQL Server instances and Hadoop nodes, and also adds compute resources for operating on the external data.
The external data sources that a Polybase Engine can access are:
- Azure blob storage account.
- Azure Data Lake(Not yet but in the near future)
- Hadoop clusterHortonworks HDP 1.3 on Linux/Windows Server
- Hortonworks HDP 2.0 – 2.3 on Linux/Windows Server
- Cloudera CDH 4.3 on Linux
- Cloudera CDH 5.1 – 5.5 on Linux
Microsoft has taken an important step in bringing together structured, unstructured and semi-structured data by including PolyBase feature directly in SQL Server 2016/17. Now, the data can be made available to Microsoft BI tools such as Power Query and Power Pivot, also to the third-party BI products such as Cognos and Tableau thereby providing more flexibility to conduct analytics and run comprehensive reports.
References:-
Author
Shrutha Sekharaiah brings over 13 years of experience in delivering innovative, scalable solutions. His broad expertise in technology and focus on collaboration and mentorship drive the creation of robust systems enhancing efficiency and performance.