Hadoop, Spark and related application frameworks for big data rely more on Java programing skills and less on SQL skills, thus freezing out many SQL veterans — be they Microsoft T-SQL adepts or others.
While continuing its push into Azure cloud support for Hadoop, Hive, Spark,R and the like, Microsoft is looking to enable T-SQL users to join the big data experience as well.
Its answer is U-SQL, a dialect of T-SQL meant to handle disparate data, while supporting C# extensions and, in-turn, .NET libraries. It is presently available as part of a public preview of Microsoft’s Azure Data Lake Analytics cloud service, first released last October.
U-SQL is a language intended to support queries on all kinds of data, not just relational data. It is focused solely on enhancements to the SQL SELECT statement, and it automatically deploys code to run in parallel. U-SQL was outlined in detail by Microsoft this week at the Data Science Summit it held in conjunction with its Ignite 2016 conference in Atlanta.
Beyond Hive and Pig
The Hadoop community has looked to address this by adding SQL-oriented query engines and languages, such as Hive and Pig. But there was a need for something more akin to familiar T-SQL, according to Alex Whittles, founder of the Purple Frog Systems Ltd. data consultancy in Birmingham, England, and a Microsoft MVP.
“Many of the big data tools — for example, MapReduce — come from a Hadoop background, and they tend to require [advanced] Java coding skills. Tools like Hive and Pig are attempts to bridge that gap to try to make it easier for SQL developers,” he said.
But, “in functionality and mindset, the tools are from the programming world and are not too appropriate for people whose job it is to work very closely with a database,” Whittles said.
This is an important way to open up Microsoft’s big data systems to more data professionals, he said.
“U-SQL gives data people the access to a big data platform without requiring as much learning,” he said. That may be doubly important, he added, as Hive-SQL developers are still a small group, compared with the larger SQL army.
U-SQL is something of a differentiator for Azure Data Lake Analytics, according to Warner Chaves, SQL Server principal consultant with The Pythian Group Inc. in Ottawa and also a Microsoft MVP.
“The feedback I have gotten from database administrators is that big data has seemed intimidating, requiring you to deploy and manage Hadoop clusters and to learn a lot of tools, such as Pig, Hive and Spark,” he said. Some of those issues are handled by Microsoft’s Azure cloud deployment — others by U-SQL.
“With U-SQL, the learning curve for someone working in any SQL — not just T-SQL — is way smaller,” he said. “It has a low barrier to entry.”
He added that Microsoft’s scheme for pricing cloud analytics is also an incentive for its use. The Azure Data Lake itself is divided into separate analytics and storage modules, he noted, and users only have to pay for the analytics processing resources when they’re invoked.
More in store
While it looks out for its traditional T-SQL developer base, Microsoft is also pursuing enhanced capabilities for Hive in the Azure Data Lake.
This week at the Strata + Hadoop World conference in New York, technology partner Hortonworks Inc. released its version of an Apache Hive update using LLAP, or Live Long and Process, which uses in-memory and other architectural enhancements to speed Hive queries. It’s meant to work with Microsoft’s HDInsight, a Hortonworks-based Hadoop and big data platform that is another member of the Azure Data Lake Analytics family.
Shining a light on SQL Server storage tactics
Meanwhile, there’s more in store for U-SQL. As an example, at Microsoft’s Data Science Summit, U-SQL driving force Michael Rys, a principal program manager at Microsoft, showed attendees how U-SQL can be extended, focusing on how queries in the R language can be exposed for use in U-SQL.
The R language has garnered more and more support within Microsoft since the company purchased Revolution Analytics in 2015. While R programmers dramatically lag SQL programmers in size of population, R is finding use in new analytics applications, including ones centered on machine learning.