Hyperspace: An Indexing Subsystem for Apache Spark

Hello everyone, I hope you're having a wonderful day so far. Today we're going to be talking about Hyperspace, an indexing subsystem for Apache Spark. Before I begin talking about the details of the project, let us introduce ourselves. My name is Rahul Potharaju, and I am joined here by my awesome colleague Terry Kim. Both of us are from Microsoft; we are part of a product that Microsoft launched recently called Azure Synapse Analytics. We are part of the Spark team at Microsoft, and it's probably obvious by now, but I'll just say it: we work on everything Spark. We offer Spark as a service to Microsoft customers, which includes both internal customers like Office and Bing and also external customers, and where possible we contribute back to Apache Spark and the open source community. Today I will be covering the background of the project, the vision, and some of the concepts that will help you understand what's happening inside this project and why it's something you should care about, and my colleague will showcase some real-world workloads to hopefully convince you by the end of this talk that
Hyperspace is an awesome tech. So before we dive into anything technical, let's start with the most foundational question: what is an index? I can give you the most obvious answer from a textbook; in fact, I lifted this straight out of a databases textbook. An index is a data structure that improves the speed of data retrieval, which in other words is query acceleration, at the cost of additional writes and storage space. But if you are anything like me, then a real-world analogy would be tremendously useful in understanding what an index really is, so let's take another crack at it. Ever since we were kids, you might remember going to the back of your textbook trying to figure out where exactly a particular key phrase appears, and at that point you might have come across an index at the back of your textbook. For instance, if I wanted to find where the phrase "nested loop join" appears, I would quickly go to the section that has every word starting with the letter N, I would look up "nested loop join", and I would immediately see that it appears between pages 718 and 722 of the textbook. And that's an index for you. In short, one can imagine an index as a shortcut to accelerating some of your queries.
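The back-of-the-book analogy can be captured in a few lines of Python. This is only an illustration of the idea of trading extra storage for faster lookups; the phrases and page ranges below are invented.

```python
# A toy "back-of-the-book" index: a small extra data structure that
# trades storage (the index itself) for much faster lookups.
# The page ranges here are made up for illustration.

book_index = {
    "hash join": (401, 409),
    "nested loop join": (718, 722),
    "sort-merge join": (730, 735),
}

def find_pages(phrase: str):
    """Return the (first, last) pages where `phrase` appears,
    or None if the phrase is not in the index."""
    return book_index.get(phrase)

print(find_pages("nested loop join"))  # (718, 722)
```

Without the index you would have to scan every page of the book; with it, the lookup is a single dictionary probe.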
So let's begin with an overview of Hyperspace. We have some pretty broad goals in Hyperspace. To begin with, our first and primary goal here is to be agnostic to data format. Our second goal is to offer a path towards low-cost index metadata management. What do we mean by this? Well, we want to make sure that all of the index contents, as well as the associated metadata, are stored on the lake, and that this does not assume any other service to operate correctly. Our third goal is to enable multi-engine interop. Fourth, we want to enable an extensible indexing infrastructure; in other words, what you see here today is more of a beginning than an end. And finally, we want to make sure that we satisfy all of the security, privacy, and compliance requirements. At the bottom-most layer, as you can see here, the only assumption is that there is a data lake, and on the data lake there are datasets that you already have, potentially structured like Parquet or even unstructured like CSV or PSV. The biggest assumption, as I laid out in my earlier slide, is that the index also lives on the data lake, both the contents and the metadata. Our first area of investment is the indexing infrastructure. This is pretty important, and you can think of this as the layer that has no intelligence but provides you with the raw APIs to do everything that you can with Hyperspace.
To begin with, one of the first pieces of work we're investing in is index creation and maintenance APIs, and to name a few things, log management APIs: how do you ensure that you're able to provide concurrency across multiple engines operating on the same index? Once we build all of the index creation and maintenance APIs, next comes the complementary dual of the indexing infrastructure, which is the query infrastructure. This work deals with how you ensure that an optimizer is index-aware; today's optimizer in Spark is not really aware of what an index is, for instance. So as part of this work, what we're trying to enable is a transparent query rewrite in the background, which is to say: if there is an index, and we figure out that the index will potentially provide some kind of acceleration, then we transparently rewrite the user's query. One of the most elegant benefits you end up getting from this is that users do not have to change a single line of code, other than setting a configuration in the Spark conf. And finally, do we want to do all of this in a manual way? Obviously not. I think the holy grail of the whole world of indexing is index recommendation, which takes a user's query workload and provides them with a recommendation of the top-k indexes they can build in order to expect some kind of acceleration in their workloads. Let me take a moment to show you how simple using Hyperspace is going to be. Today we offer Hyperspace in
three languages: one, Scala; two, Python; and three, .NET. Although this slide shows you the APIs in Scala, the first set of APIs we provide is the usage APIs, and as expected, using these you'll be able to create, delete, restore, vacuum, and refresh indexes. The second set of APIs we offer is the smarts. If you've used Spark, then you probably already know about df.explain on a DataFrame: what it does is take your query and produce the logical plan, the analyzed plan, the optimized plan, and the physical plan. hyperspace.explain is built with a similar philosophy. What do I mean by that? It can take your query, and it will give you a very cool diff showing how it has figured out which indexes to use, exactly which indexes were used, and how the plan ended up changing. The two other APIs that we plan on introducing in a couple of weeks are the what-if API, which allows you to hypothetically test whether an index will benefit your query or not, and the final holy grail of the smarts, the recommend API, which will take a workload from you and provide you with a list of top-k index recommendations that you can then go ahead and build in the background. One question that might be lingering in your mind at this stage is: as I'm using all of these indexes, where exactly are they getting
created? Now, the cool thing about what I mentioned and laid out in the goals and vision of the project is that the index lives on the lake. Why is this important? It's very interesting, because now you can just look at your file system, like HDFS, or maybe ABFS, or even AWS S3, for instance. Imagine you have your data laid out as I have shown at the bottom; it's basically just some path to data files. Now, the moment you start creating an index, what you'll observe is that the indexing API createIndex takes a name for the index, and once you provide the name of the index, it creates a folder, and the folder contains all of the information, in a self-describing way, to point back to the original dataset from which it was created. Why is this important? You need to capture pretty much every detail of what exactly you considered on the original dataset at the time of indexing, because you want to be able to detect whether the dataset has changed from the time you created the index, to ensure that you're not providing any incorrect results.
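The staleness check described above can be sketched in a few lines: record a fingerprint of the source files when the index is created, and compare it later before trusting the index. The (path, size, mtime) triple is an assumption for illustration; any detail that changes when the data changes would do.

```python
import hashlib

def dataset_signature(files):
    """Fingerprint a dataset from its file listing.
    files: iterable of (path, size_bytes, modified_time) tuples."""
    h = hashlib.sha256()
    for path, size, mtime in sorted(files):
        h.update(f"{path}|{size}|{mtime}".encode())
    return h.hexdigest()

# Captured at index-creation time (toy file listing):
at_index_time = [("data/part-0.parquet", 1024, 100),
                 ("data/part-1.parquet", 2048, 100)]
sig = dataset_signature(at_index_time)

# Later, a file was rewritten: the signature no longer matches,
# so the index must be refreshed rather than trusted.
now = [("data/part-0.parquet", 4096, 200),
       ("data/part-1.parquet", 2048, 100)]
print(dataset_signature(now) == sig)  # False
```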
And I'm sure you must have noticed at this stage the presence of something called the Hyperspace log; obviously, it's in a different color, so it's probably something of interest. The Hyperspace log is a progress log, or an operation log, and it is the heart of our entire design. When you initially create an index, it captures the operation CREATE. If, let's say, the underlying data ended up changing, then perhaps you might want to run a refresh. Once you do a refresh, and the refresh completes, the index moves back into an active state, and a pointer gets laid out, in this case to index directories two and three. And you'll notice: hey, wait a second, why are we seeing index directories one, two, and three? That's because we maintain different snapshots and different versions of the index, to enable multi-reader and multi-writer concurrency. And one of the very interesting aspects of having the index
on the lake is that it provides several benefits. Just to name a few: one, index scans now scale. The second biggest benefit is that our index today is laid out in an open format, and that open format is Parquet. If you've been hanging around the Spark community, then you probably already know that there's close to a decade's worth of investment in making sure Parquet as a format is highly optimized, and you don't need to worry about it any more: whatever optimizations are present for Parquet, you automatically end up leveraging all of those optimizations for the index itself. And finally, as you can tell, we have enabled a serverless access protocol. What do I mean by this? It's simple: Hyperspace does not really depend on any external service. So before we move on to my colleague doing a demo, I'd like to introduce Azure Synapse Analytics, which is a product that Microsoft released recently, and Hyperspace is built out of the box inside Azure Synapse Analytics. So now I'll switch over to my colleague Terry Kim, who is going to show you the end-to-end experience of what Hyperspace looks like and how you can use it in production.

Okay, in this demo I'll go over the Hyperspace APIs for creating and using indexes. I'll be using Azure Synapse Analytics, which comes with Hyperspace out of the box, and we will share this notebook after the session. Okay, let's get started.
First, I'm going to create two data frames, one for employees and one for departments, and here are the contents of the data frames. The employees data frame has employee ID, employee name, and department ID; the departments data frame has department ID, department name, and location. These are the two data frames I'll be using throughout this demo. Now that we have data frames to work with, let me create a Hyperspace object, which has methods for creating and managing indexes. Next, I'm going to create a few indexes. To create an index, you need to have an index config object and specify a name, indexed columns, and included columns. Indexed columns are the columns used for joins or filters; included columns are the ones used for select or project. Once you have this index config object, you can just say hyperspace.createIndex and pass the data frame from which you want to build the index, along with the index config object. In this example, I'm going to create one index for the employees data frame and another one for the departments data frame, and the reason I chose department ID as an indexed column is that I'll be joining the employees data frame with the departments data frame on this column.
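The indexed-column versus included-column split that the index config captures can be sketched with a toy in-memory covering index. This is plain Python, not the Hyperspace API; the column and row values are invented, and the real Hyperspace index is Parquet files on the lake rather than Python lists.

```python
import bisect

def build_covering_index(rows, indexed_col, included_col):
    """Re-sort the rows by the indexed column, carrying the included
    column along so a lookup never touches the original data.
    Returns (sorted keys, payloads)."""
    pairs = sorted((r[indexed_col], r[included_col]) for r in rows)
    keys = [k for k, _ in pairs]
    payloads = [v for _, v in pairs]
    return keys, payloads

employees = [
    {"deptId": 30, "name": "Ava"},
    {"deptId": 10, "name": "Bo"},
    {"deptId": 20, "name": "Cy"},
]
keys, names = build_covering_index(employees, "deptId", "name")

# Point lookup on the indexed column via binary search;
# the answer comes entirely from the index.
i = bisect.bisect_left(keys, 20)
print(names[i])  # Cy
```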
We also have an API for listing the indexes we created; to do that, you call hyperspace.indexes.show, and this will display all the indexes that have been created. These are the two indexes we just created, and this also shows more information about each index, such as its indexed columns, included columns, index location, and so on. The object returned by indexes is a Spark data frame, so if you want, you can perform any data frame operations on it. Now let's start utilizing the indexes we just created. Before doing that, I'm going to disable broadcast join, because the indexes we created are applicable to sort-merge join. This is the join query we'll be executing: I'm joining the employees data frame with the departments data frame on department ID and selecting employee name and department name, and I'll be showing the physical plan and then the results of the query. And this is the output: this is the physical plan of joining the two data frames, so you will see the sort-merge join, with its exchange and sort stages first, and then these are the results
of the join. Now, finally, let's enable Hyperspace. How do we do it? We just say spark.enableHyperspace, and what it does is inject the Hyperspace optimizer rules and start utilizing the available indexes. We'll be executing exactly the same join query, and this is the output. However, I think it's a little hard to see what really changes after enabling Hyperspace, and that's where the explain API comes into play. So here I call hyperspace.explain and pass the data frame, and it will give you an overview of what changes if Hyperspace is enabled. And this is the output: this is the output from the join query we just executed, this is the physical plan with Hyperspace on, and this is the physical plan with Hyperspace off, and the highlighted portion is the difference. You'll see that sort and exchange have been removed from the physical plan with Hyperspace on. This also lists the indexes that were used, and these are the two indexes we created in this demo. And finally, it also shows you the physical operator stats: for example, what it's saying is that with Hyperspace disabled there were two shuffle exchange physical nodes, but with Hyperspace enabled the number of these nodes goes to zero, and the same is true for
the sort physical operator. Hyperspace also works with SQL commands, and this is the equivalent of the join using the data frame APIs: basically, this is joining these two tables on the department ID, and this should give you the same result, as expected. Okay, let me switch gears a little bit and talk about some of the APIs for managing indexes. First is delete index: if you want to delete your index, you can do hyperspace.deleteIndex and pass in the name of the index you want to delete. This is a soft delete, meaning that it's not going to remove the index from the file system. If I look at the output here, you will see that the state has changed from active to deleted, and once the index goes to the deleted state, it is not going to be picked up by Hyperspace. This API is useful if you want to disable some of the indexes temporarily.
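Taken together with the restore and vacuum APIs described next, this soft-delete behavior acts like a small state machine. The sketch below is a toy illustration under assumed state names, not the actual Hyperspace implementation.

```python
class ToyIndex:
    """Toy model of an index lifecycle: create -> ACTIVE,
    soft delete -> DELETED, restore -> ACTIVE, vacuum -> gone."""

    def __init__(self, name):
        self.name, self.state = name, "ACTIVE"

    def delete(self):
        # Soft delete: only the metadata state changes; files stay put.
        if self.state == "ACTIVE":
            self.state = "DELETED"

    def restore(self):
        # Bring a soft-deleted index back into use.
        if self.state == "DELETED":
            self.state = "ACTIVE"

    def vacuum(self):
        # Hard delete: requires a prior soft delete.
        if self.state != "DELETED":
            raise ValueError("vacuum requires the index to be deleted first")
        self.state = "VACUUMED"

idx = ToyIndex("deptIndex")
idx.delete()
idx.restore()
print(idx.state)  # ACTIVE
idx.delete()
idx.vacuum()
print(idx.state)  # VACUUMED
```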
Next is restore index. This API is used if you want to bring back a deleted index. Let me show you the output: if you perform restore index, it will change the state of the index from deleted to active, and once the index goes to the active state, it will be used by Hyperspace once again. Next is vacuum. Vacuum is a hard delete operation, meaning that it will physically remove the index from the file system. The prerequisite is that you have to delete the index first, so this is the sequence: you first delete the index, and then you vacuum the index. And here is the output: the index goes from the active state to the deleted state, and when you vacuum the index, it is removed from the file system. And last on the list, we have the refresh index API. Suppose new employees were added to our original data; now the index and the data are out of sync, and Hyperspace is smart enough not to use the stale index,
because it would give you a wrong result. However, you can use the refresh index API to rebuild the index, and once the data and the index are in sync, Hyperspace will use the index again. And that's the end of the demo; we can switch back to the slides.

Thank you so much, Terry. Wasn't that demo exciting? I'm itching to actually try this out; in fact, I'll provide you with pointers in the later half of my presentation as to how you can try this out on your own boxes. What I'll be focusing on today is the covering index. The covering index, in other words, creates a copy of the original data in a different sort order, so there is no black magic: like I mentioned at the beginning of my talk, an index provides query acceleration at the cost of additional writes and storage. So let's take one example, just to make sure we convey the concept clearly. Imagine a table having three columns, and you're trying to execute a query. This query is super simple: it says select B where A equals red. The only way to execute this query would be to do a full scan on top of this table, and this full scan would end up taking linear time. Obviously, I'm not considering optimizations like min/max here; I've tried to dumb it down and simplify just to get the core concept across. Now, you can build a covering index on column A and include column B. When you re-execute the query that I've shown you here, select B where A equals red, you can now resort to doing a binary search, and this takes logarithmic time.
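The example above can be written out directly: a table with columns (a, b, c), the query `SELECT b WHERE a = 'red'`, a linear full scan, and a logarithmic lookup against a covering index on column a that includes column b. The table contents are invented for illustration.

```python
import bisect

table = [("blue", 1, "x"), ("red", 2, "y"), ("green", 3, "z"), ("red", 4, "w")]

def full_scan(rows, value):
    """SELECT b WHERE a = value, by scanning every row: O(n)."""
    return [b for a, b, _ in rows if a == value]

# Covering index: (a, b) pairs sorted by a. Note it is literally a copy
# of part of the data in a different sort order, which is the extra
# storage cost the talk mentions.
index = sorted((a, b) for a, b, _ in table)

def index_lookup(idx, value):
    """Same query via binary search on the sorted index: O(log n) to
    find the matching run of keys."""
    lo = bisect.bisect_left(idx, (value,))
    hi = bisect.bisect_right(idx, (value, float("inf")))
    return [b for _, b in idx[lo:hi]]

print(full_scan(table, "red"))      # [2, 4]
print(index_lookup(index, "red"))   # [2, 4]
```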
The concept is pretty simple, but it aids tremendously in doing things like filter acceleration. Not convinced that it works? Well, let's go into something more complex. Imagine now you're joining two different tables, table A and table B, on column A, and assume that you're running this query: select B comma C, joining on column A from both tables. Without an index, Spark will decide to shuffle both sides of the join into the same number of partitions, and remember that the data is not sorted, so as part of step 2, Spark ends up sorting both sides before it goes ahead and does a merge in step 3. With an index, that is, with the right index built, the Hyperspace extensions we have built enable the Spark optimizer to pick the right index, and remember, the index is already pre-shuffled and pre-sorted, so Spark can happily move straight to step 2, which is the merge step. And I'm pretty sure you've noticed by now that the shuffle got eliminated. Since shuffle is potentially the most expensive step in a query, this query can now scale to much larger datasets and potentially run even faster. Now, having armed you with these two particular concepts, let me switch back to my colleague to show you some deep dives into real-world complex workloads and hopefully convince you that you can really get some nice acceleration.
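The reason a pre-shuffled, pre-sorted index removes work from a sort-merge join can be sketched in a toy merge: once both sides are already sorted on the join key, you can go straight to the merge step with no shuffle or sort. The table contents are invented, and for brevity this sketch assumes unique keys on each side.

```python
def merge_join(left, right):
    """left, right: lists of (key, value) ALREADY sorted by key.
    Returns (left_value, right_value) pairs for matching keys.
    Assumes unique keys per side to keep the sketch short."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

# With covering indexes, both inputs already come back sorted by the
# join column, so the merge can start immediately:
emp = [(10, "Bo"), (20, "Cy"), (30, "Ava")]   # sorted by deptId
dept = [(10, "Sales"), (30, "Eng")]           # sorted by deptId
print(merge_join(emp, dept))  # [('Bo', 'Sales'), ('Ava', 'Eng')]
```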
Okay, now I'll do a deep dive on two queries in TPC-DS, which is one of the well-known big data benchmarks, and I'll show you how much Hyperspace can speed up those queries. First, this is query 93, which is a fairly simple query: you are joining two fact tables and a dimension table, followed by an aggregation. Now let's take a look at how the execution plan looks in the Spark UI. This is the plan executed by vanilla Spark, and as you can see, this is the join of two fact tables, followed by a join with a dimension table, followed by an aggregation. For the first join, we see that there were about 140 gigabytes of data being shuffled and sorted, so this is a good opportunity to utilize the Hyperspace covering indexes to eliminate the shuffle. Now, the window on your right is the plan executed by Spark with the right set of indexes created. As you can see, the shuffle and sort have been removed, but the rest of the plan remains the same. Let's compare the execution time. This benchmark was run with scale factor 1000, and the data is stored in Parquet. The execution time has gone down significantly, from 6 minutes to 32 seconds, giving you about a 10x speedup. Next, let's take a look at query 64. This is a fairly complex query, with multiple joins and aggregations across about 15 different tables.
So let's take a look at the Spark UI for the executed plan. Let me zoom out a bit to show you the overall plan, and sure enough, it's a complex one. If I scroll down a bit, I see about 150 gigabytes of shuffle and sort. After creating and applying the indexes, this is the new plan on your right; let me make it full screen. The big shuffle and sort are gone. This is an interesting scenario: if you cannot create a covering index on one side of the join, because that side involves aggregations, joins, and so on, you can still have the covering index on just the other side, if applicable, and it can eliminate the shuffle for that side.
Finally, let's compare the execution time: it went down from ten minutes to 3.4 minutes, giving you about a 3x speedup. So we just took a brief look at the speedups in two of the TPC-DS queries; now Rahul will talk about the improvements we observed across all the TPC-DS queries.

Thank you so much, Terry. Wasn't that demo exciting? As Terry just demonstrated, Hyperspace accelerates a couple of non-trivial queries, and I'm pretty sure the moment he showed you that whole plan zoomed out, your eyes zoomed out too; at least mine did. Thanks a lot, Terry; you've made me even more curious to try this out. Now, as he has shown you a deep dive into two queries, let me talk about Hyperspace performance in general. The compute configuration that we used is on the left, for anybody who is interested in replaying some of our benchmarks. At the top is a graph that shows a workload evaluation derived from the TPC benchmark H, or in short, TPC-H. The scale factor we used is 1000, and the evaluation was performed using Apache Spark 2.4; we will be adding support for Apache Spark 3.0 subsequently. And remember one very interesting thing: all of the base tables of TPC-H are laid out in Parquet, and as you already know, Parquet is a highly efficient columnar format, so all of the acceleration I'm about to show you, in this workload as well as the next one, is on top of Parquet data. In the graph here, the x-axis shows the TPC-H query number; TPC-H has 22 queries, and that's what you see here. The y-axis shows the duration it took to execute each query, and the dotted line indicates the gain that we achieved by using the indexes. I want you to observe this and just digest how much gain we were able to get. Obviously, you can see a spectrum
of gains: you see gains as low as about 1.1x, but you can also see gains as high as 9x. Now let's look at the second workload, derived from the TPC benchmark DS, or in short, TPC-DS. For brevity, we just show the top 20 queries of TPC-DS; our workload consists of a total of 103 queries. What you can notice here is a similar pattern: again, there are no regressions, and in fact there are gains of up to 11x to be had. The overall benchmark is really key because it conveys one other important aspect, which is that we don't regress on any of the queries. So if you look at both TPC-H and TPC-DS, one thing that will immediately stand out is that we're getting pretty high overall gains of 2x and 1.88x, respectively. I am also pretty thrilled to announce that we are open-sourcing Hyperspace as of today. I agree, it's version 0.1; it's a start.
To recap: Hyperspace is an extensible indexing subsystem for Apache Spark. It's simple; we have not changed anything in Spark core. And rest assured, you can feel thrilled that it's the same technology that powers the indexing engine inside Azure Synapse Analytics. It works out of the box with Apache Spark 2.4, and support for Apache Spark 3.0 is going to come in the next few weeks. We offer Hyperspace in three languages, no compromises; you can choose whatever language you feel comfortable with. Give us feedback on what you would like to see in the upcoming versions, and we'll build it for you. Today you can access, download, build, and use Hyperspace from one of these two URLs; they both take you to the same place. Hyperspace is currently on GitHub; feel free to explore the codebase and give us suggestions on what you would like to see. Having said that, I would also like to take a brief moment to thank everybody who made us reach this point in time.
In this slide, I'm going to discuss a couple of areas that we are planning on investing in, and probably also areas where we could collaborate. The first area is metadata management and lifecycle: do you like what you're seeing? Did we capture enough metadata, for instance? The second area of investment is indexing enhancements. What we demonstrated today is indexing over immutable datasets, or static data. How would we do the same thing with incremental indexing? What are the things that would change, and what are the performance implications? How would you add support for updatable formats like Delta Lake or Uber's Hudi, for instance? The third area of investment is optimizer enhancements. Like I mentioned, this is a very humble beginning in this space; we are by no means experts at optimizers, and this is where we could definitely use your help in making sure we are taking the right decisions while optimizing queries. One other area of active investment that we're exploring is query explainability: why did you use a particular
index, and why did you not use a particular index? Another area of investment is more index types; more on this in the next slide. And finally, documentation and tutorials. Indexing is not a silver bullet, and in all honesty, we would love to tell you that indexes are always going to help, but I can't really tell you that. You should take care when using indexes, making sure that they really accelerate your workload; there are lots and lots of questions you need to ask yourself before you go ahead and decide to build an index. This is what we want to capture as part of our best practices and gotchas, and we want to be able to run more reproducible experiments for our community to try out. And finally, the holy grail: when my awesome colleague Terry was showing you some of these cool demos, and I was showing you the performance evaluation across all these complex queries, I'm sure one thought must have struck your mind, which is: wait a second, how did you
decide which indexes to even build in the first place, right? That was done through a prototype we have implemented over the last few months, in the form of an index recommender. It's quite simple today: what it does is take your workload and recommend the top-k indexes that you can go ahead and build. It's not perfect, but we'll be open-sourcing that over the next few weeks. Let me answer the final question, which is: what types of hyperspaces can be built? Today, in this talk, we covered one type of hyperspace, which is the covering index, but in Hyperspace it is very important to remember that the term index is used extremely broadly, to refer to a more general class called a derived dataset. In short, a derived dataset, as the name states, is derived from the original dataset, but it aids in providing hints to the optimizer to accelerate queries; that's the simple definition. So what are some other types of hyperspaces that can be built? Today we only have covering indexes, but we plan on implementing more types, like the chunk elimination index, which is something you can use to do pointed lookups, or materialized views, which is an utterly complex area but also very, very useful, especially when you have common queries that you're running and multiple users who are running them. And finally, we could also potentially provide statistics and maintain them, so that a cost-based optimizer can utilize them to provide you with higher query acceleration.
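The chunk elimination idea mentioned above can be sketched in a few lines: keep tiny (min, max) statistics per data chunk, so a pointed lookup can skip chunks that cannot possibly contain the key. The chunk layout and values below are made up for illustration.

```python
# Toy data split into chunks (think: Parquet files or row groups).
chunks = [
    {"file": "part-0", "rows": [3, 7, 9]},
    {"file": "part-1", "rows": [12, 15, 20]},
    {"file": "part-2", "rows": [21, 30, 42]},
]

# The "index": per-chunk min/max stats, tiny compared to the data itself.
stats = [(min(c["rows"]), max(c["rows"]), c["file"]) for c in chunks]

def chunks_to_scan(key):
    """Return only the files whose [min, max] range can contain `key`;
    every other chunk is eliminated without being read."""
    return [f for lo, hi, f in stats if lo <= key <= hi]

print(chunks_to_scan(15))  # ['part-1']
print(chunks_to_scan(10))  # []  (no chunk can contain 10)
```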
With that, I'd like to conclude my talk by mentioning a couple of key things. Number one: today I am super thrilled to announce the open-sourcing of Hyperspace, a tech that we have built as part of Azure Synapse Analytics. It's only at version 0.1, and it's obviously not ready for you to use in production; it's just a humble attempt at giving back something to the community. Point number two: the tech was built without really modifying a single line of Spark code, which gives you some tremendous powers in terms of putting it on your own Spark clusters without having to worry about redeploying Spark itself. And because the tech is also deployed as part of Azure Synapse Analytics, rest assured, you can feel safe that there is a tremendous amount of product investment going into it. And point number three: we have demonstrated some very preliminary acceleration on some key workloads. Of course, I'd like to emphasize that this is more of a beginning; we might be making some mistakes, there's definite scope for improvement, and it's not perfect, but that's where we really need your guidance, in making sure that we evolve Hyperspace into something that our users would truly love and use in their day-to-day work. And last but not least, I do have the URL right here at the bottom right. The whole Hyperspace project today is on GitHub; you can try it out, and if you run into any issues, please feel free to open an issue on GitHub. So I, on behalf of my entire team, look forward to seeing you folks on our