SQL Catalyst Optimizer.
being in the wrong end
realized that I was trying
never fix at the wrong end
AI Group in Bloomberg,
large clusters in the cloud
does in Apache Spark,
problem with that app,
planning is the problem?
about all of the green spaces
very simple schema inference
thing about Spark SQL is,
barrier to entry is, see?
like SQL, SQL, right?
running the application
SQL turn into something
fashion across a cluster.
of time to come back,
is what takes you from
to run on the cluster.
sites in New York city.
are planted on the street,
Catalyst goes to work.
this query is highlighted,
query plan in the Spark UI,
some really exciting options,
it's probably not exciting
this presentation, trust me,
to be turning into something
color highlighting here?
three arm filters, okay?
gonna be really interesting.
the analyzed logical plan.
be a little bit interesting
what projection this query
thing happened, optimization.
check for all the tasks,
together in one stage,
the execution of the query,
a preconceived decision
bunch of different decisions.
cost-based optimization
to order these joins is.
of a really good start.
can apply various rules
like after logical optimization
bit different, right?
underneath the projection?
interesting has happened here.
asterisk that's highlighted red?
is what we were talking about
can be fused together.
is as fast as it is.
in one logical unit.
an acyclic graph of RDDs
interested in the details of this,
end of the presentation
into very great detail.
that it does some of its own
now there is a partial count
gonna run back on the driver.
TakeOrderedAndProject.
without the highlighting,
big wall of text, right?
whole-stage Codegen stages,
here the filter is now
correlate with each other.
nothing happens.
happen in production.
out on a large Spark cluster
cluster is idle, hung, whatever,
that when you need
to be persistent here
with the smoking gun is going
persistence, timing, and luck,
first time thread dump, okay?
led us back to the problem.
highly repetitive calculations.
this developer's involved,
different column names here,
for different periods
this kind of Cartesian
frame with a full block, okay?
expands out like that
do something with full left,
exactly the wrong way to do it
that the query optimizer
of repetitive process
analysis of the query plan.
plan before we do anything
plan, just expanding out.
set that was running
Day, it was much larger,
to this to start with.
to David Copperfield's,
with the Cartesian product
might wanna catch that
licorice that is common
European countries.
like it a whole lot, okay?
clear why it's called Salmiak
was a data set that had,
that had a very heavy hitter.
high frequency of key A,
see that this workload
into a synthetic key
that would use a UDF,
UDF was both deterministic,
unpleasant text processing,
calculations on the data set
about Spark and UDFs,
either more or less fumes
in the original query.
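The synthetic-key (salting) idea can be sketched in plain Python; the constants and function name here are invented, but the key property matches what the talk describes: the salt must be deterministic so retried tasks land on the same synthetic key.

```python
import hashlib

NUM_SALTS = 8  # assumption: spread each hot key across 8 synthetic keys

def salted_key(key: str, row_id: int) -> str:
    # Deterministic salt derived from a stable row identifier, so
    # repeated runs and retried tasks produce the same synthetic key.
    salt = int(hashlib.md5(str(row_id).encode()).hexdigest(), 16) % NUM_SALTS
    return f"{key}#{salt}"
```

In Spark this logic would live in a UDF applied before the skewed aggregation, with a second pass to combine the per-salt partial results.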
will be great." Right?
hadn't thought through very well
really gonna kind of
really high cache hit rate
just zip right through,
hate play it as it lays,
to make it worth my time
bogged down and died.
two things at cross purposes.
to amortize the time
results in the cache.
transformations like filter,
pulled again and again,
was not populated for that.
the benefit of caching,
Databricks Knowledge Base,
people to have been here,
of the Catalyst optimizer
of a UDF that is expensive
tried to read those before."
don't need to be a genius
wrong with Spark queries
a method of more than 64KB,
that are participating
that you should investigate
new data set to be created
dig a bit more deeply?
running through Spark Jira
you're the first person ever
really well documented
from core developers.
Spark optimizer will try
the query graph as possible.
with a thousand nodes
at the end of your query,
could lead to a cross join.
also lead to trouble.
you want running in production
on decimal columns.
a certain number of places,
included as an illustration,
JVM, blah, blah, blah,
message in your Spark logs,
optimize your queries.
running the physical plan,
a large application.
Spark driver more memory
code cache to ensure
into the driver option
on every single Spark driver
available code cache?
the reserved code cache size
to make the code cache
to keep the code cache
of the code cache is free
options that will be available
still living in JDK eight,
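A hedged sketch of what those driver options can look like; the sizes are placeholders to tune for your workload, but `-XX:ReservedCodeCacheSize` and `-XX:+PrintCodeCache` are standard HotSpot flags:

```shell
# Illustrative only: give the driver more memory and a larger JIT code
# cache; PrintCodeCache reports code cache usage when the JVM exits.
spark-submit \
  --driver-memory 8g \
  --conf "spark.driver.extraJavaOptions=-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache" \
  my_app.py
```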
code is greater than 64K.
generated query grows so large,
hard limit on the JVM.
that's absolutely massive.
at the generated code
easy to hit that limit.
debug logging on Spark SQL loggers,
analyzed over and over again.
have to really dig in
would be either cache it
truncate query lineage.
impact that, okay,
impact from checkpointing
you have hit some type
really want to do, though.
generated code in logs.
some of the code generated.
one to log the entire error,
query that's the problem
carbon monoxide detector,
that specific class,
the on call engineer
not halt the application.
explain with Codegen mode."
count to reduce it to just two
query had seven of them,
at a really small one, okay?
correlate back to the output
on the individual classes,
Catalyst optimizer section.
that the in-memory table scan,
Codegen fuses these together,
artificially captured here,
see exchange in Spark UI
data across a network.
this is not a map operation.
of that really the exit,
showing you is kind of almost,
this up in your head,
two, what you're seeing,
is kind of stage one
that's a shuffle.
highlighted, stage two,
single TakeOrderedAndProject
count of street sites?
finger at a specific rule.
is partially rule-based.
what rules are being applied
RuleExecutor collects metrics,
to kind of reset the metrics,
the logical optimization
easiest output to show
redacted all the rules that
not applicable here.
runs did this actually apply
on the optimizer spent,
the smallest code.
smallest code size possible
the size of the code,
through trying to JIT it.
the smallest code possible,
left to do optimization later.
specific optimization,
real Join based optimizations
something like a rule
interesting in here
the cause of your bug.
impact of disabling Codegen?
take a look at that,
again, to the first stage
not been fused together,
highlighted in red,
fusing the operations together,
class of optimizations,
Codegen was 489 milliseconds.
Codegen was 256 kilobytes.
more subjective example.
totally unrelated to Codegen.
makes this query faster,
data set of this size?
that query planning
down to one at the end.
back in the physical plan
there were 200 partitions.
a data set of this size.
partitions to only two,
aggregate now takes place,
memory as compared to 4.1 Gigs.
why a query could be slow
of partitions involved
that your query is slow.
at some exciting changes
more efficient query code.
to catalyst optimizer
Partition Pruning.
other side of the Join?
to be New Join hints,
something that's been a thorn