SQL Catalyst Optimizer.
being in the wrong end
realized that I was trying
never fix at the wrong end
AI Group in Bloomberg,
large clusters in the cloud
does in Apache Spark,
problem with that app,
planning is the problem?
about all of the green spaces
very simple schema inference
thing about Spark SQL is,
barrier to entry is, see?
like SQL, SQL, right?
running the application
SQL turn into something
fashion across a cluster.
of time to come back,
is what takes you from
to run on the cluster.
sites in New York city.
are planted on the street,
Catalyst goes to work.
this query is highlighted,
query plan in the Spark UI,
some really exciting options,
it's probably not exciting
this presentation, trust me,
to be turning into something
color highlighting here?
three arm filters, okay?
gonna be really interesting.
the analyzed logical plan.
be a little bit interesting
what projection this query
thing happened, optimization.
check for all the tasks,
together in one stage,
the execution of the query,
a preconceived decision
bunch of different decisions.
cost-based optimization
to order these joins is.
of a really good start.
can apply various rules
like after logical optimization
bit different, right?
underneath the projection?
interesting has happened here.
asterisk that's highlighted red?
is what we were talking about
can be fused together.
is as fast as it is.
in one logical unit.
an acyclic graph of RDDs
interested in the details of this,
end of the presentation
into very great detail.
that it does some of its own
now there is a partial count
gonna run back on the driver.
TakeOrderedAndProject.
without the highlighting,
big wall of text, right?
whole-stage Codegen stages,
here the filter is now
correlate with each other.
nothing happens.
happen in production.
out on a large Spark cluster
cluster is idle, hung, whatever,
that when you need
to be persistent here
with the smoking gun is going
persistence, timing, and luck,
first time thread dump, okay?
led us back to the problem.
highly repetitive calculations.
this developer's involved,
different column names here,
for different periods
this kind of Cartesian
frame with a full block, okay?
expands out like that
do something with full left,
exactly the wrong way to do it
that the query optimizer
of repetitive process
analysis of the query plan.
plan before we do anything
plan, just expanding out.
set that was running
Day, it was much larger,
to this to start with.
to David Copperfield's,
with the Cartesian product
might wanna catch that
licorice that is common
European countries.
like it a whole lot, okay?
clear why it's called Salmiak
was a data set that had,
that had a very heavy hitter.
high frequency of key A,
see that this workload
into a synthetic key
that would use a UDF,
UDF was both deterministic,
unpleasant text processing,
calculations on the data set
about Spark and UDFs,
either more or less fumes
in the original query.
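The synthetic-key (salting) idea can be sketched in plain Python; the constants and function name here are invented, but the key property matches what the talk describes: the salt must be deterministic so retried tasks land on the same synthetic key.

```python
import hashlib

NUM_SALTS = 8  # assumption: spread each hot key across 8 synthetic keys

def salted_key(key: str, row_id: int) -> str:
    # Deterministic salt derived from a stable row identifier, so
    # repeated runs and retried tasks produce the same synthetic key.
    salt = int(hashlib.md5(str(row_id).encode()).hexdigest(), 16) % NUM_SALTS
    return f"{key}#{salt}"
```

In Spark this logic would live in a UDF applied before the skewed aggregation, with a second pass to combine the per-salt partial results.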
will be great." Right?
hadn't thought through very well
really gonna kind of
really high cache hit rate
just zip right through,
hate play it as it lays,
to make it worth my time
bogged down and died.
two things at cross purposes.
to amortize the time
results in the cache.
transformations like filter,
pulled again and again,
was not populated for that.
the benefit of caching,
Databricks Knowledge Base,
people to have been here,
of the Catalyst optimizer
of a UDF that is expensive
tried to read those before."
don't need to be a genius
wrong with Spark queries
a method of more than 64KB,
that are participating
that you should investigate
new data set to be created
dig a bit more deeply?
running through Spark Jira
you're the first person ever
really well documented
from core developers.
Spark optimizer will try
the query graph as possible.
with a thousand nodes
at the end of your query,
could lead to a cross join.
also lead to trouble.
you want running in production
on decimal columns.
a certain number of places,
included as an illustration,
JVM, blah, blah, blah,
message in your Spark logs,
optimize your queries.
running the physical plan,
a large application.
Spark driver more memory
code cache to ensure
into the driver option
on every single Spark driver
available code cache?
the reserved code cache size
to make the code cache
to keep the code cache
of the code cache is free
options that will be available
still living in JDK eight,
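A hedged sketch of what those driver options can look like; the sizes are placeholders to tune for your workload, but `-XX:ReservedCodeCacheSize` and `-XX:+PrintCodeCache` are standard HotSpot flags:

```shell
# Illustrative only: give the driver more memory and a larger JIT code
# cache; PrintCodeCache reports code cache usage when the JVM exits.
spark-submit \
  --driver-memory 8g \
  --conf "spark.driver.extraJavaOptions=-XX:ReservedCodeCacheSize=512m -XX:+PrintCodeCache" \
  my_app.py
```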
code is greater than 64K.
generated query grows so large,
hard limit on the JVM.
that's absolutely massive.
at the generated code
easy to hit that limit.
debug logging on Spark SQL loggers,
analyzed over and over again.
have to really dig in
would be either cache it
truncate query lineage.
impact that, okay,
impact from checkpointing
you have hit some type
really want to do, though.
generated code in logs.
some of the code generated.
one to log the entire error,
query that's the problem
carbon monoxide detector,
that specific class,
the on call engineer
not halt the application.
explain with Codegen mode."
count to reduce it to just two
query had seven of them,
at a really small one, okay?
correlate back to the output
on the individual classes,
Catalyst optimizer section.
that the in-memory table scan,
Codegen fuses these together,
artificially captured here,
see exchange in Spark UI
data across a network.
this is not a map operation.
of that really the exit,
showing you is kind of almost,
this up in your head,
two, what you're seeing,
is kind of stage one
that's a shuffle.
highlighted, stage two,
single TakeOrderedAndProject
count of street sites?
finger at a specific rule.
is partially rule-based.
what rules are being applied
RuleExecutor collects metrics,
to kind of reset the metrics,
the logical optimization
easiest output to show
redacted all the rules that
not applicable here.
runs did this actually apply
on the optimizer spent,
the smallest code.
smallest code size possible
the size of the code,
through trying to JIT it.
the smallest code possible,
left to do optimization later.
specific optimization,
real Join based optimizations
something like a rule
interesting in here
the cause of your bug.
impact of disabling Codegen?
take a look at that,
again, to the first stage
not been fused together,
highlighted in red,
fusing the operations together,
class of optimizations,
Codegen was 489 milliseconds.
Codegen was 256 kilobytes.
more subjective example.
totally unrelated to Codegen.
makes this query faster,
data set of this size?
that query planning
down to one at the end.
back in the physical plan
there were 200 partitions.
a data set of this size.
partitions to only two,
aggregate now takes place,
memory as compared to 4.1 Gigs.
why a query could be slow
of partitions involved
that your query is slow.
at some exciting changes
more efficient query code.
to catalyst optimizer
Partition Pruning.
other side of the Join?
to be New Join hints,
something that's been a thorn