Hello everyone, I hope you're having a wonderful day so far. Today we're going to be talking about Hyperspace, an indexing subsystem for Apache Spark. Before I begin talking about the details of the project, let us introduce ourselves. My name is Rahul Potharaju, and I am joined here by my awesome colleague Terry Kim. Both of us are from Microsoft; we are part of a product that Microsoft has launched recently called Azure Synapse Analytics. We are part of the Spark team at Microsoft, and it's probably obvious by now, but I'll just say it: we work on everything Spark. We offer Spark as a service to Microsoft customers, which includes both internal customers, like Office and Bing, and also external customers. Where possible, we contribute back to Apache Spark and open source.
For the majority of our talk today, I will be covering the background of the project, the vision, and some of the concepts that will help you understand what's happening inside this project and why it is something that you should care about, and my colleague will showcase some real-world workloads to hopefully convince you, by the end of this talk, that Hyperspace is an awesome tech.
So before we dive into anything technical, let's just start with the most foundational question, which is: what is an index? I can give you the most obvious answer from the textbook; in fact, I just lifted this out of a textbook. In databases, an index is a data structure that improves the speed of data retrieval, which in other words is query acceleration, at the cost of additional writes and storage space. But if you are anything like me, then a real-world analogy would be tremendously useful in understanding what an index really is.
So let's take another crack at it. Ever since we were kids, you might remember going to the back of your textbook trying to figure out where exactly a particular key phrase appears in the textbook, and at that point in time you might have come across an index at the back of your textbook. For instance, if I wanted to find where in the textbook the phrase "nested loop join" appears, I would quickly go to the section which has every word starting with the letter N, I would look up "nested loop join", and I would immediately see that it appears between pages 718 and 722 of the textbook. And that's an index for you.
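To make the analogy concrete, here is a tiny Python sketch (purely an illustration, not anything from Hyperspace itself): a back-of-the-book index is a sorted mapping from terms to page ranges, so looking up a term is a binary search rather than a scan of the whole book.

```python
import bisect

# A back-of-the-book index: terms kept in sorted order, each pointing
# to the page range where the term appears. Entries are made up.
terms = ["hash join", "merge sort", "nested loop join", "query plan"]
pages = [(301, 305), (120, 127), (718, 722), (512, 530)]

def lookup(term):
    """Binary-search the sorted term list: O(log n) instead of a full scan."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return pages[i]
    return None

print(lookup("nested loop join"))  # (718, 722)
```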
In short, one can imagine an index as a shortcut for accelerating some of your queries.

So let's begin with an overview of Hyperspace. We have some pretty broad goals in Hyperspace. To begin with, our first and primary goal is to be agnostic to data format. Our second goal is to offer a path towards low-cost index metadata management; what do we mean by this? Well, we want to make sure that all of the index contents, as well as the associated metadata, are stored on the lake, and this does not assume any other service to operate correctly. Our third goal is to enable multi-engine interoperability. Our fourth goal is to enable an extensible indexing infrastructure; in other words, what you see here today is more of a beginning than an end. And finally, we want to make sure that we satisfy all of the security, privacy, and compliance requirements.

At the bottom-most layer, as you can see here, the only assumption is that there is a data lake, and on the data lake there are datasets that you already have, potentially structured, like Parquet, or even unstructured, like CSV or PSV. The biggest assumption, as I laid out on my earlier slide, is that the index is also living on the data lake, both the contents and the metadata.

Our first area of investment is the indexing infrastructure. This is pretty important, and you can think of it as the layer that has no intelligence but provides you with the raw APIs to be able to do everything that you can with Hyperspace. To begin with, one of the first pieces of work that we're investing in is the index creation and maintenance APIs, and, to name a few other things, the log management APIs: how do you ensure that you're able to provide concurrency across multiple engines on the same index? Once we build all of the index creation and maintenance APIs,
next comes the complementary dual of the indexing infrastructure, which is the query infrastructure. This whole work deals with how you ensure that an optimizer is index-aware; today's optimizer in Spark is not really aware of what an index is, for instance. So as part of this work, what we're trying to enable is a transparent query rewrite in the background, which is to say: if there is an index, and we figure out that the index will potentially provide some kind of acceleration, then we transparently rewrite the user's query. One of the most elegant benefits that you end up getting with this is that users do not have to change a single line of code, other than setting a configuration in the Spark conf.
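To illustrate what a transparent rewrite means, here is a deliberately tiny Python sketch. This is not Hyperspace's actual rule (which plugs into Spark's optimizer); the plan shape, table, column, and index names below are all invented: a rule inspects the plan, and if a matching index exists for the filtered column, it swaps the table scan for an index scan without the user changing the query.

```python
# Toy model of a transparent query rewrite; all names are illustrative.
available_indexes = {("sales", "region"): "idx_sales_region"}

def optimize(plan):
    """If the filtered column has an index, rewrite Scan -> IndexScan."""
    op, table, column = plan          # e.g. ("Filter+Scan", "sales", "region")
    index = available_indexes.get((table, column))
    if index:
        return ("Filter+IndexScan", index, column)   # rewritten plan
    return plan                                      # unchanged: no index helps

print(optimize(("Filter+Scan", "sales", "region")))
# -> ('Filter+IndexScan', 'idx_sales_region', 'region')
```

The user's query stays the same; only the plan the engine executes changes.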
And finally, do we want to do all of this in a manual way? Obviously not. I think the holy grail of the entire world of indexing is index recommendation, which takes a user's query workload and provides them with a recommendation of the top-k indexes that they can build in order to expect some kind of acceleration in their workloads.

Let me take a moment to just show you how simple using Hyperspace is going to be. Today we offer Hyperspace in three languages:
first Scala, second Python, and third .NET, although this slide shows you the APIs in Scala. The first set of APIs we provide is the usage APIs; as expected, using these you'll be able to create, delete, restore, vacuum, and refresh indexes. The second set of APIs that we offer is the smarts. If you have used Spark, then you probably already know about df.explain; what a DataFrame's explain does is take your query and produce the logical plan, the optimized plan, the analyzed plan, and the physical plan. Hyperspace's explain is built with a similar philosophy. What do I mean by that? It can take your query, and it will give you a very cool diff showing what indexes it has figured out to use, what exactly those indexes are used for, and how the plan ended up changing. The two other APIs that we plan on introducing in a couple of weeks are the what-if API, which allows you to hypothetically test whether an index will benefit your query or not, and the final holy grail of the smarts, the recommend API, which will take a workload from you and provide you with a list of top-k index recommendations that you can then go ahead and build in the background.
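The diff idea behind explain can be sketched in a few lines of Python (this is just the philosophy, not the real output format or operator names): compare the list of physical operators with and without indexes, and report what disappeared.

```python
def plan_diff(plan_off, plan_on):
    """Report operators removed once indexes are in play (multiset diff)."""
    removed = []
    remaining = list(plan_on)
    for op in plan_off:
        if op in remaining:
            remaining.remove(op)      # operator survives in both plans
        else:
            removed.append(op)        # operator eliminated by the index
    return removed

plan_off = ["Scan", "Exchange", "Sort", "SortMergeJoin", "Project"]
plan_on  = ["IndexScan", "SortMergeJoin", "Project"]
print(plan_diff(plan_off, plan_on))  # ['Scan', 'Exchange', 'Sort']
```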
One question that might be lingering in your mind at this stage is: as I'm using all of these indexes, where exactly are they getting created? The cool thing about what I mentioned and laid out in the goals and vision of the project is that the index is also on the lake. Why is this important? It's very interesting, because now if you start looking at your file system, like HDFS, or maybe ABFS, or even AWS S3 for instance, imagine you have your data laid out as I have shown at the bottom: it's basically just some path to data files. Now, the moment you start creating an index, what you'll observe is that the indexing API, create index, takes a name for the index, and once you provide the name of the index, it creates a folder, and the folder contains all of the information, in a self-describing way, to point back to the original dataset on which you created the index. Why is this important? You need to capture pretty much every detail of what exactly you considered about the original dataset at the time of indexing, because you want to be able to detect if the dataset changed from the time you created the index, to ensure that you're not returning any incorrect results.
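That staleness check can be sketched like this (a simplified stand-in: Hyperspace records a signature of the source data in its self-describing metadata, but the exact hashing scheme below is invented for illustration). At index-creation time you capture a fingerprint of the dataset's files, and later you compare it before trusting the index.

```python
import hashlib

def fingerprint(files):
    """Hash (path, size, mtime) of every data file: a stand-in for the
    lineage an index would record about its source dataset."""
    h = hashlib.sha256()
    for path, size, mtime in sorted(files):
        h.update(f"{path}:{size}:{mtime}".encode())
    return h.hexdigest()

at_index_time = [("data/part-0.parquet", 1024, 111),
                 ("data/part-1.parquet", 2048, 112)]
now = at_index_time + [("data/part-2.parquet", 512, 200)]  # new file appeared

stale = fingerprint(now) != fingerprint(at_index_time)
print(stale)  # True -> don't use the index, results could be incorrect
```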
And I'm sure you must have noticed at this stage the presence of something called the Hyperspace log; obviously, it's in a different color, so it's probably something of interest. The Hyperspace log is a progress log, or an operation log, which is the heart of our entire design. When you initially create an index, it captures the operation CREATE. If, let's say, the underlying data ended up changing, then perhaps you might want to run a refresh. Once you do a refresh, and the refresh completes, what happens is that the index is moved back into an active state, and there is a pointer that gets laid out, in this case to index directories two and three. And you'll notice: hey, wait a second, why are we seeing index directories one, two, and three? That's because we maintain different snapshots and different versions of the index, to enable multi-reader and multi-writer concurrency.
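A minimal Python sketch of such an operation log (the real Hyperspace log is a set of files on the lake; the entry structure, state names, and directory names below are invented for illustration): each operation appends an entry, and readers resolve the newest ACTIVE entry to find which immutable index directory version to scan, so readers on an old version are never broken by a refresh writing a new one.

```python
# Toy progress log: append-only operations, each pointing at an
# immutable index directory version.
log = []

def append(op, state, version):
    log.append({"op": op, "state": state, "dirs": [f"index_dir_v{version}"]})

def latest_active():
    """Readers resolve the newest ACTIVE entry; old versions stay on disk."""
    for entry in reversed(log):
        if entry["state"] == "ACTIVE":
            return entry
    return None

append("CREATE", "ACTIVE", 1)             # initial build
append("REFRESH_STARTED", "CREATING", 2)  # data changed, rebuild begins
append("REFRESH_DONE", "ACTIVE", 2)       # refresh completes
print(latest_active()["dirs"])  # ['index_dir_v2']; v1 kept for in-flight readers
```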
And one of the very interesting aspects of having the index on the lake is that it provides several benefits, just to name a few. One: index scans now scale. The second biggest benefit is that our index today is laid out in an open format, and that open format is Parquet; if you've been hanging around the Spark community, then you probably already know that there's close to a decade's worth of investment in making sure Parquet, as a format, is highly optimized, so you don't need to worry about it anymore: whatever optimizations are present for Parquet, you automatically end up leveraging all of those optimizations for the index itself. And finally, as you can tell, we have enabled a serverless access protocol. What do I mean by this? It's simple: Hyperspace does not really depend on any external service.
So before we move on to my colleague doing a demo, I'd like to introduce Azure Synapse Analytics, which is a product that Microsoft has released recently; Hyperspace is built in, out of the box, inside Azure Synapse Analytics. So now I'll switch over to my colleague Terry Kim, who's going to show you the end-to-end experience of what Hyperspace looks like and how you can use it in production.
Okay, in this demo I'll go over the Hyperspace APIs for creating and using indexes. I'll be using Azure Synapse Analytics, which comes with Hyperspace out of the box, and we will share this notebook after the session. Okay, let's get started. First, I'm going to create two data frames, one for employees and one for departments, and here are the contents of the data frames. The employees data frame has employee ID, employee name, and department ID; the departments data frame has department ID, department name, and location. These are the two data frames I'll be using throughout this demo. Now that we have data frames to work with, let me create a Hyperspace object, which has methods for creating and managing indexes.
Next, I'm going to create a few indexes. To create an index, you need to have an index config object and specify a name, index columns, and included columns. Index columns are the columns used for joins or filters; included columns are the ones used for select or project. Once you have this index config object, you can just call hyperspace.createIndex and pass the data frame on which you want to build the index, along with the config object. In this example, I'm going to create one index for the employees data frame and another one for the departments data frame, and the reason I chose department ID as an index column is that I'll be joining the employees data frame with the departments data frame on this column. We also have an API for listing the indexes we created; to do that, you call hyperspace.indexes.show, and this will display all the indexes that were created. These are the two indexes we just created, and this also shows more information about each index, such as indexed columns, included columns, index location, and so on. The object returned by indexes is a Spark data frame, so if you want, you can perform any data frame operations on it. Now let's try utilizing the indexes we just created. Before doing that, I'm going to disable broadcast join, because the indexes we created are applicable to sort-merge join.
This is the join query we'll be executing: I'm joining the employees data frame with the departments data frame on department ID and selecting employee name and department name, and I'll be showing the physical plan and then the results of the query. And this is the output. This is the physical plan of joining the two data frames; you can see the sort-merge join, with the exchange and sort in the first stage, and then these are the results of the join. Now, finally, let's enable Hyperspace. How do we do it? We just call spark.enableHyperspace, and what it does is inject the Hyperspace optimizer rules and start utilizing the available indexes. We'll be executing exactly the same join query, and this is the output. However, I think it's a little hard to see what really changes after enabling Hyperspace, and that's where the explain API comes into play. You call hyperspace.explain and pass the data frame, and it will give you an overview of what changes if Hyperspace is enabled. And this is the output: this is the output of the join query that we just executed, this is the physical plan with Hyperspace on, this is the physical plan with Hyperspace off, and the highlighted portion is the difference. You'll see that sort and exchange have been removed from the physical plan with Hyperspace on. It also lists the indexes that were used, and these are the two indexes we created in this demo. And finally, it also shows you the physical operator stats; for example, what it is saying is that with Hyperspace disabled there were two shuffle exchange physical nodes, but with Hyperspace enabled the number of these nodes goes to zero, and it is the same for the sort physical operator. Hyperspace also works with SQL commands, and this is the equivalent of the join using the data frame APIs; basically this is joining the two tables on the department ID, and this should give you the same result, as expected. Okay, let me switch gears a little bit and talk about some of the APIs for managing indexes. First is delete index.
If you want to delete your index, you can call hyperspace.deleteIndex, passing the name of the index you want to delete. This will be a soft delete, meaning that it's not going to remove the index from the file system. If I look at the output here, you will see that the state has changed from active to deleted, and once the index goes to the deleted state, it is not going to be picked up by Hyperspace. This API is useful if you want to disable some of the indexes temporarily. Next is restore index. This API is used if you want to bring back a deleted index. Let me show you the output: if you perform restore index, it will change the state of the index from deleted to active, and once the index goes to the active state, it will be used by Hyperspace once again. Next is vacuum. Vacuum is a hard delete operation, meaning that it will physically remove the index from the file system. The prerequisite is that you have to first delete the index, so this is the sequence: you first delete the index, and then you vacuum the index. And here is the output: the index goes from the active state to the deleted state, and when you then vacuum the index, it is removed from the file system.
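The delete/restore/vacuum lifecycle just described can be modeled as a small state machine. The Python mock below only illustrates the semantics; the class and method names are invented, and the real APIs are the Hyperspace delete, restore, and vacuum index calls shown in the demo.

```python
class ToyIndexManager:
    """Mock of the soft-delete lifecycle: ACTIVE -> DELETED -> vacuumed away."""
    def __init__(self):
        self.states = {}

    def create_index(self, name):
        self.states[name] = "ACTIVE"

    def delete_index(self, name):       # soft delete: files stay on disk
        if self.states.get(name) == "ACTIVE":
            self.states[name] = "DELETED"

    def restore_index(self, name):      # bring a soft-deleted index back
        if self.states.get(name) == "DELETED":
            self.states[name] = "ACTIVE"

    def vacuum_index(self, name):       # hard delete: must be DELETED first
        if self.states.get(name) == "DELETED":
            del self.states[name]

hs = ToyIndexManager()
hs.create_index("deptIndex")
hs.delete_index("deptIndex")    # no longer picked up by the optimizer
hs.restore_index("deptIndex")   # usable again
hs.delete_index("deptIndex")
hs.vacuum_index("deptIndex")    # physically removed
print("deptIndex" in hs.states)  # False
```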
Last on the list, we have the refresh index API. Suppose new employees were added to our original data; now that the index and the data are out of sync, Hyperspace is smart enough not to use the index, because it would give you a wrong result. However, you can use the refresh index API to rebuild the index, and once the data and the index are in sync, Hyperspace will use the index again. And that's the end of the demo; we can switch back to the slides.
Thank you so much, Terry. Wasn't that demo exciting? I'm itching to actually try this out; in fact, I'll provide you with pointers in the later half of my presentation on how you can try this on your boxes today. What Terry was showing you is the covering index; a covering index, in other words, creates a copy of the original data in a different sort order. So there is no black magic: like I mentioned at the beginning of my talk, an index provides query acceleration at the cost of additional writes and storage. So let's take one example, just to make sure the concept is clear.
Imagine a table having three columns, and you're trying to execute a query. This query is super simple; it says: select B where A equals red. The only way to execute this query would be to do a full scan on top of this table, and this full scan would end up taking linear time. Obviously, I'm not considering optimizations like min-max, but I've tried to dumb it down and simplify, just to get the core concept across. Now, you can technically build a covering index on column A and include column B. When you re-execute the query that I have shown you here, select B where A equals red, you can resort to doing a binary search, and this takes logarithmic time.
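Here is a Python sketch of that covering index (an illustration of the idea only; Hyperspace's actual on-disk layout is different): a copy of columns A and B, sorted by A, so the lookup binary-searches on A and answers entirely from the index, never touching the base table.

```python
import bisect

# Base table with three columns; a full scan is O(n).
table = [("blue", 1, "x"), ("red", 2, "y"), ("green", 3, "z"), ("red", 4, "w")]

# Covering index on A, including B: a copy of (A, B) sorted by A.
index = sorted((a, b) for a, b, _c in table)
keys = [a for a, _b in index]

def select_b_where_a(value):
    """SELECT B WHERE A = value, answered from the index in O(log n + k)."""
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [b for _a, b in index[lo:hi]]

print(select_b_where_a("red"))  # [2, 4]
```

Because column B is included in the index, the query is "covered": no second lookup into the base table is needed.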
Well, the concept is pretty simple, but it aids tremendously in doing things like filter acceleration. Not convinced that it works? Well, let's go into something more complex. Imagine now you're joining two different tables, table A and table B, on column A, and assume that you're running this query: select B, C, joining on column A from both tables. Without an index, Spark will decide to shuffle both sides of the join into the same number of partitions, and remember that the data is not sorted, so as part of step 2, Spark ends up sorting both sides before it goes ahead and does a merge in step 3. With the right index built, the Hyperspace extensions that we have built enable the Spark optimizer to pick the right index, and remember, the index is already pre-shuffled and pre-sorted, so Spark can happily move to step 2, which is the merge step. And I'm pretty sure you must have noticed by now that the shuffle got eliminated, and since the shuffle is potentially the most expensive step in a query, this query can now scale to much larger datasets and potentially run even faster.
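The difference between the two plans can be sketched in Python (a big simplification: real Spark shuffles data across machines, and here "pre-shuffled and pre-sorted" just means the inputs arrive already sorted on the join key): without an index, both sides must be sorted first; with the right indexes, the merge runs directly.

```python
def merge_join(left, right):
    """Merge two inputs that are already sorted on the join key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        ka, kb = left[i][0], right[j][0]
        if ka == kb:
            out.append((ka, left[i][1], right[j][1]))
            i += 1
            j += 1                     # assumes unique keys, for brevity
        elif ka < kb:
            i += 1
        else:
            j += 1
    return out

t1 = [(3, "b3"), (1, "b1"), (2, "b2")]  # unsorted base tables
t2 = [(2, "c2"), (1, "c1"), (3, "c3")]

# Without an index: shuffle + sort both sides first, then merge.
plain = merge_join(sorted(t1), sorted(t2))

# With covering indexes: the data was sorted once, at indexing time,
# so the per-query sort (and the shuffle it rides on) disappears.
idx1, idx2 = sorted(t1), sorted(t2)
indexed = merge_join(idx1, idx2)

print(plain == indexed)  # True: same result, cheaper plan
```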
Now, armed with these two concepts, let me switch back to my colleague to show you some deep dives into real-world complex workloads, and hopefully convince you that we can really get some nice acceleration. Okay, now I'll do a deep dive on two queries in TPC-DS, which is one of the well-known big data benchmarks, and I'll show you how much Hyperspace can speed up those queries.
First, this is query 93, which is a fairly simple query: you are joining two fact tables and a dimension table, followed by an aggregation. Now let's take a look at how the execution plan looks in the Spark UI. This is the plan executed by vanilla Spark, and as you can see, this is the join of the two fact tables, followed by a join with the dimension table, followed by an aggregation. For the first join, we see that there were about 140 gigabytes of data being shuffled and sorted, so this is a good opportunity to utilize the Hyperspace covering indexes to eliminate the shuffle. Now, the window on your right is the plan executed by Spark with the right set of indexes created; as you can see, the shuffle and sort have been removed, but the rest of the plan remains the same. Let's compare the execution time. This benchmark was run with scale factor 1000, and the data is stored in Parquet. The execution time has significantly gone down, from 6 minutes to 32 seconds, giving you about a 10x speedup.
Next, let's take a look at query 64. This is a fairly complex query, with multiple joins and aggregations across about 15 different tables, so let's take a look at the Spark UI for the executed plan. Let me zoom out a bit to show you the overall plan, and sure enough, it's a complex one. If I scroll down a bit, I see shuffles and sorts of about 110 and 50 gigabytes, and so on. After creating and applying the indexes, this is the new plan on your right; I'll make it full screen. The big shuffle is gone, and this is an interesting scenario: if you cannot create a covering index on one side of the join, because it involves aggregations, joins, and so on, you can just have the covering index on the other side, if applicable, and it can still eliminate the shuffle for that side. Finally, let's compare the execution time: it went down from ten minutes to three point four minutes, giving you about a 3x speedup. So we just took a brief look at the speedups in two of the TPC-DS queries; now Rahul will talk about the improvements we observed across all the TPC-DS queries.
Thank you so much, Terry. As you've seen, Terry just demonstrated Hyperspace acceleration on a couple of non-trivial queries, and I'm pretty sure the moment he showed you the whole plan zoomed out, your eyes glazed over; at least mine did. In fact, thanks a lot, Terry, you've now made me even more curious to try this out. Now, as he has shown you a deep dive into two queries, let me talk about Hyperspace performance in general. The compute configuration that we used to test this is on the left, for anybody who is interested in replaying some of our benchmarks. At the top is a graph that shows a workload evaluation derived from TPC Benchmark H, or in short, TPC-H. The scale factor that we used is 1000, and the evaluation was performed using Apache Spark 2.4; we will be getting support for Apache Spark 3.0 subsequently.
And remember one very interesting thing: all of the base tables of TPC-H are laid out in Parquet, and as you already know, Parquet is a highly efficient columnar format; all of the acceleration that I'm about to show you, in this workload as well as the next one, is on top of Parquet data. In the graph here, the x-axis shows the TPC-H query number (TPC-H has 22 queries, and that's what you see here), and the y-axis shows the duration that it took to execute each query. The dotted line that you see here indicates the gain that we achieved by using the indexes. I want you to observe this and just digest how much gain we were able to get. Obviously, you can see a spectrum of gains: you see gains as low as about 1.1x, but you can also see gains as high as 9x. Now let's look at the second workload, derived from TPC Benchmark DS, or in short, TPC-DS. For brevity, we just show the top 20 queries of TPC-DS; our workload consists of a total of 103 queries, and what you can notice here is a similar pattern: again, there are no regressions, and in fact there are gains of up to 11x to be had. The overall benchmark is really key, because it conveys one other important aspect, which is that we don't regress on any of the queries. So if you look at both TPC-H and TPC-DS, one thing that will immediately stand out is that we're getting a pretty high overall gain of 2x and 1.88x, respectively.
I am also pretty thrilled to announce that we are open-sourcing Hyperspace as of today. I agree, it's version 0.1; it's a start. To recap: Hyperspace is an extensible indexing subsystem for Apache Spark. It's simple: we have not changed anything in Spark core, and rest assured, you can feel thrilled that it's the same technology that powers the indexing engine inside Azure Synapse Analytics. It works out of the box with Apache Spark 2.4, and support for Apache Spark 3.0 is going to come in the next few weeks. And we offer Hyperspace in three languages, no compromises; you can choose whatever language you feel comfortable with. Give us feedback on what you would like to see in the upcoming versions, and we'll build it for you.
And today, you can access, download, build, and use Hyperspace from one of these two URLs; they both take you to the same place. Hyperspace is currently on GitHub; feel free to explore the codebase and give us suggestions in terms of what you would like to see. Having said that, I would also like to take a brief moment to acknowledge everybody who made us reach this point. In this slide, I'm going to discuss a couple of areas that we are planning on investing in, and probably also areas
where we could collaborate. The first area is metadata management and lifecycle: do you like what you're seeing, did we capture enough metadata, for instance? The second area of investment is indexing enhancements. What we demonstrated today is indexing of immutable datasets, or static data; how would we do the same thing with incremental indexing? What are the things that would change, and what are the performance implications? How would you add support for updatable formats like Delta Lake or Uber's Hudi, for instance? The third area of investment is optimizer enhancements. Like I mentioned, this is a very humble beginning in this space; we are by no means experts at optimizers, and this is where we could definitely use your help in making sure that we are taking the right decisions while optimizing queries. One other area of active investment that we're exploring is query explainability: why did you use a particular index, why did you not use a particular index? And the fourth area of investment is more index types; more on this in the next slide.
The first one is documentation and tutorials. Well, indexing is not a silver bullet. In all honesty, we would like to tell you that indexes are always going to help, but I can't really tell you that; you should take care when using indexes, making sure that they really accelerate your workload. There are lots and lots of questions you need to ask yourself before you go ahead and decide to build an index, and this is what we want to be able to capture as part of our best practices and gotchas. We also want to be able to run more reproducible experiments for our community to try out.
And finally, the holy grail: when my awesome colleague Terry was showing you some of these cool demos, and I was showing you the performance evaluation across all these complex queries, I'm sure one thought must have struck your mind, which is: wait a second, how did you decide which indexes to even build in the first place? That was done through a prototype that we have implemented over the last few months, in the form of an index recommendation engine. It's quite simple today: what it does is take your workload and recommend the top-k indexes that you can go ahead and build. It's not perfect, but we will be open-sourcing that over the next few weeks. Let me answer the final question,
which is: what types of Hyperspace indexes can be built altogether? Today, in this talk, we covered one type, which is the covering index, but in Hyperspace it is very important to remember that the term "index" is used extremely broadly, to refer to a more general class called a derived dataset. In short, a derived dataset, as the name states, is derived from the original dataset, but it aids in providing some hints to the optimizer to accelerate queries; that's the simple definition. So what are some other types of Hyperspace indexes that can be built? Today we only have covering indexes, but we plan on implementing more types, like the chunk-elimination index, which is something you can use to do pointed lookups, or materialized views, which is an utterly complex area but also very, very useful, especially when you have common queries that multiple users are running. And finally, we could also provide statistics and maintain them, so that a cost-based optimizer can utilize them to provide you with higher query acceleration.
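To give a flavor of what a chunk-elimination index could look like, here is a hedged Python sketch. This index type is only planned; the per-chunk min/max scheme below is one plausible design for illustration, not a committed one: keep the min and max of the lookup column for each data chunk, and skip every chunk whose range cannot contain the value.

```python
# Hypothetical chunk-elimination index: per-chunk (min, max) of a column.
# File names and ranges are made up.
chunks = {
    "part-0.parquet": (10, 99),
    "part-1.parquet": (100, 499),
    "part-2.parquet": (500, 901),
}

def chunks_to_scan(value):
    """Point lookup: keep only chunks whose [min, max] can contain value."""
    return [name for name, (lo, hi) in chunks.items() if lo <= value <= hi]

print(chunks_to_scan(123))  # ['part-1.parquet']: two of three chunks skipped
```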
With that, I'd like to conclude my talk by mentioning a couple of key things. Number one: today, I am super thrilled to announce the open-sourcing of Hyperspace, a tech that we have built as part of Azure Synapse Analytics. It's only at version 0.1, so it's obviously not ready for you to use in production; it's just a humble attempt at giving back something to the community. Point number two: the tech was built without really modifying a single line of Spark code, which gives you some tremendous powers in terms of putting it on your own Spark clusters without having to worry about redeploying Spark itself. And because the tech is also deployed as part of Azure Synapse Analytics, rest assured, you can feel safe that there is a tremendous amount of product investment going on in it. And point number three: we have demonstrated some very preliminary acceleration on some of the key workloads. Of course, I'd like to emphasize that this is more of a beginning; we might be making some mistakes, there's definite scope for improvement, and it's not perfect, but that's where we really need your guidance in making sure that we evolve Hyperspace into something that our users would truly love and use in their day-to-day workloads. And last but not least, I do have the URL right here on the bottom right; the whole Hyperspace project today is on GitHub. You can try it out, and if you run into any issues, please feel free to open an issue on the GitHub repo. So, on behalf of my entire team, I look forward to seeing you folks on our