A Year in Flink - Flink SQL Live Coding

Created with glancer

Hi, welcome everyone to this "A Year in Flink" special edition. As announced, I'm doing live coding. First of all, I want to have a brief look at what happened over the past year. If we look at the website visitors of flink.apache.org, we see a steady stream: from 2016 to now there has been a steady increase in users. If you look at it closely, you see a quite steady stream up until, say, the second half of 2018, and then it ramps up a bit. So there seem to be quite a lot of users who look into Flink, who work with Flink, and who use our website as a resource. Then let's look at what happened on the development side of
things: just in September last year, Flink 1.9.0 was released. That's a bit outside the scope of this past year, but a lot has happened since then. There have been two major releases: Flink 1.10 came out in February, and Flink 1.11 came out in July. So, roughly speaking, a major Flink release every five months. And Flink's developers have not only been focusing on the core Flink distribution: they also released the Stateful Functions library into the open. In April this year, the Stateful Functions API became part of the Apache Software Foundation, and its first release under the ASF was created then. Stateful Functions is actually on a bit of a faster release pace, delivering more up-to-date, more recent features.
But wait — I collected those statistics manually, by hand. Why, though? It would be better to automate this, right? And that is what I want to do for the rest of the talk: let's look at something that gives us the statistics in a better way, so that I don't have to collect the data manually. For that I chose to have a look at Flink SQL, because its focus is on logic, not on implementation; it is useful for batch processing as well as stream processing; and it aims at maximizing developer speed and user autonomy. Thus it should be useful for getting quick results and exploring your data, and at some point also for getting to full-fledged streaming applications. So, what is Flink SQL, in case you haven't dug into it
yet? Well, SQL on its own is a declarative API. On the left-hand side you see this statement, and if you give Flink that statement, it will basically translate it into a job. A job, in Flink terms, is a composition of operators that exchange data: you start reading from something, you do transformations, and eventually you put something on the output. That's the streaming graph the SQL query will be translated into. So it defines a program — it's not a database, it's just a way of defining a program. It can also be used to get materialized views, and to get their maintenance for free. Take this example: there is an input table of transactions, and we want a Flink job that does some aggregation. This job goes over the transactions table, counts the transactions, groups them by type, and produces a count that you can then use to serve dashboards. That's a common use case for Flink SQL: you get your real-time dashboards updated by a running Flink job. And what is there at the moment to work with Flink SQL? Well, you can write your Java or Scala code and just embed SQL in there,
or use the Table API, and then send the query out, execute it, and the rest of the translation is done for you. You can also use the SQL CLI and run the queries you like right on your command line. But both of these are not really viable for production systems, and this is why we at Ververica chose to integrate SQL into our platform. At the core of the platform there is the strong Apache Flink streaming runtime that you already know, and around it we have very useful and easy tooling that helps you get your applications deployed, and helps you create your clusters and manage them over the life cycle of those applications. We started with support for Java and Scala jobs, and now — with the 2.3 version released yesterday — we extended it with support for SQL. So I thought, why not use it?
You can use SQL with our platform in basically two ways. One is the integrated editor: there's a SQL editor in your browser that interacts with the platform. It can look into the catalog that the platform comes with, you can upload UDFs — user-defined functions — or define tables, and those will be translated into queries that run in your cluster. All of that uses Kubernetes underneath, but that's nothing you need to worry about. The other way to use SQL with our platform is through Jupyter notebooks. I'm going to use the IDE in the browser, so let's have a look at what we can do to gather a few more statistics. Let's switch over to the platform. So, what
I did in preparation for this event: I imported a bit of data so that I can use it for the analysis. I went to the GitHub API and fetched the commits in the Flink repository, and I also fetched all the pull requests. And I imported the mailing lists — the user and dev mailing lists — so we can have a look at those and see what was happening there. Those I did beforehand,
so we have a bunch of Kafka topics that are now pre-filled with data that we can use for our analytics. So now we're going into SQL: this is the SQL editor of our platform. On the left you see the schema explorer, which you can use to get information about everything that is in the catalog that comes with the platform — or in a Hive catalog, if you connect one. I chose the internal catalog, and as you can see I have a bunch of tables available; we will make use of them throughout the rest of this webinar. There is also a bunch of functions that I will use at a later point. I want to talk about what happened in the Flink community, and as a starting point let's take the user mailing list as an example. I have this one table, the Flink user mailing list, and if I look into it, I see what I imported from the mbox archives that you can download: the date of each email that arrived on the mailing list, the from field — also parsed into a plain email address — and the subject. I don't need the body for this analysis, so I skipped it, but it could have been imported as well. So let's do our first query and try to gather some information.
We could look for the number of users on the mailing list. So we say SELECT star — I could just do that. Let me close that preview, run it, and see what's being read. Now I get all those email addresses, but I don't want all of them — it would be nice to know how those numbers changed over time. What I want is an overview of the user count per month, maybe. This is what you can use a tumbling window for; in SQL it goes with GROUP BY and then TUMBLE, and you specify a time attribute to work on — some
timestamp that you can use. If I look into the table again, there is this date column, and in it we have the date of the email, so we'll use that. I have a tumbling window over the date, and now I need to define the length of that tumbling window; in SQL that goes with INTERVAL — oops, typo — and let's make it a statistic over 30 days, so roughly a month. Now I do have a tumbling window, but since I'm grouping by it, I cannot just SELECT star. Let's make it a count: I want to count everything that happened in that window. And maybe I also want to know which month I counted this for, so I add TUMBLE_END, which gives me the end timestamp of that tumbling window of 30 days; for that to work, I need to use the same window specification that I have down here in the GROUP BY. I have this TUMBLE_END and I rename it to window_end, and the COUNT star becomes, maybe, num_users. All right, now we should be good to go — except, well, this is not actually counting users. So how is a user defined? If I just cheat and click into the table again, I can see what I have, and I can distinguish the users by their email addresses. That's what I'm going to do: by counting the distinct email addresses, I get the number of users. The email address is in the from field, so: COUNT DISTINCT from_email — oops, I'm missing a letter — and now I should be ready to go. Let's close that again and run it to see a preview.
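Assembled, the query as built up to this point looks roughly like the following sketch; the table and column names (flink_ml_user, `date`, from_email) are approximations of what the demo shows, not the exact schema:

```sql
-- Distinct senders per 30-day tumbling window; names are
-- approximations of the demo's catalog, not the exact schema.
SELECT
  TUMBLE_END(`date`, INTERVAL '30' DAY) AS window_end,
  COUNT(DISTINCT from_email)            AS num_users
FROM flink_ml_user
GROUP BY TUMBLE(`date`, INTERVAL '30' DAY);
```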
There we go — I see all the months aggregated. There are actually too many windows here: I imported everything from the mailing lists since their start in 2014, when Apache Flink became an Apache project. But the talk is just a year in Flink, so we want to do this only for the past year. Let's reduce the scope a bit: WHERE my date timestamp is greater than — let's use CURRENT_TIMESTAMP, which gives you the current timestamp — and I want to look back a year, so we'll subtract an interval. There we go. Now we should only get what happened in the past year. Let's run it again... and I do get everything starting roughly a year ago: that's the window ending on the 10th of November, then 30 days after that, and so on — so I have 12 results there. Cool, that's the first statistic: the number of distinct email addresses per month. Let's do something more and have a look at the emails themselves — how many emails did we get on this mailing list?
So instead of counting the distinct email addresses, I just count everything, and it's not num_users anymore — it's num_emails. And that's actually it already: I count everything, and that's the number of emails. Let's see... there we go. There's a steady stream: looking back, it's around 500; some periods have lower, some have higher numbers. There has been some more activity in the month up until the start of March — maybe that's related to the Flink release that came out then; we don't know, we could look further — and there's a second bump in May. But this is actually just the number of emails. I would like a better look at how engaging the discussions are — just looking at the number of emails, it could be that users are sending them but never getting a reply.
So I actually want to have a look at the email threads in each month. How do I distinguish those threads? Let's look at the table again to see what the data model is. I do have the subject in here, and usually when someone replies to an email, the subject is prefixed with "Re" — in other languages there may be other prefixes, but let's stay simple for the moment and assume it is prefixed with this "Re", a colon, and a space. So we want to count the distinct email threads: COUNT again, DISTINCT, and we try to remove that "Re" prefix from the subject — so we TRIM the LEADING "Re" — oops, R, e — colon, space, FROM subject. As you can see, there's already some syntax highlighting: when the whole SQL query is not correct, I get the errors live, so I can see what is wrong and act on it. So I'm counting the distinct subjects with every "Re: " removed; that should give me the email threads. Let's just rename this — it's not num_emails anymore but num_threads — and that should be fine. Let's run it again... close that... all right, there are some threads. It's a bit fewer than before, but we don't have both numbers at the same time — this is just a preview — so let's actually build a result that gets us both the emails and the threads. And I removed the previous statement, so I'll just undo all those changes.
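The thread count, using the talk's simplifying assumption that replies are prefixed with "Re: ", might be sketched as:

```sql
-- Distinct thread subjects per 30-day window: strip a leading
-- "Re: " so replies collapse onto their original subject.
SELECT
  TUMBLE_END(`date`, INTERVAL '30' DAY)             AS window_end,
  COUNT(DISTINCT TRIM(LEADING 'Re: ' FROM subject)) AS num_threads
FROM flink_ml_user
GROUP BY TUMBLE(`date`, INTERVAL '30' DAY);
```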
I will have two things here. At the moment I was looking at emails and, separately, at email threads; let's bring them together. I have two queries — one for the emails, the other for the email threads — and by bringing them together I get the overview. I'll do that by making each of them a temporary view: CREATE TEMPORARY VIEW, then I give it a name, maybe emails, AS, and I group the query in there like before; then a second view — CREATE TEMPORARY — oops — VIEW for the threads. Now I have two temporary views and I just need to bring them back together, and for that I can just join them. For example: let's make a SELECT star for the moment, and FROM emails — or maybe rename it to something short, e — as well as threads t. And I will join them on the month, because both of them aggregate everything on the same windows of 30 days, so I can join them on exactly that attribute, the window end, which is there in both: WHERE — oops — e.window_end equals t.window_end. Then let's rename a few things to make it easier. I want this window end — I know it's the same in both, so I'll just select it once as window_end. Then I also want the number of emails — that's the one from the topmost view — and, as the second thing, the number of threads. Now we can actually run that — oops, sorry, that should be e.window_end — and I can run the whole thing, and it gives me the emails as well as the threads.
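Together, the two temporary views and the join might look roughly like this (names approximate what the demo uses):

```sql
CREATE TEMPORARY VIEW emails AS
SELECT TUMBLE_END(`date`, INTERVAL '30' DAY) AS window_end,
       COUNT(*)                              AS num_emails
FROM flink_ml_user
GROUP BY TUMBLE(`date`, INTERVAL '30' DAY);

CREATE TEMPORARY VIEW threads AS
SELECT TUMBLE_END(`date`, INTERVAL '30' DAY)             AS window_end,
       COUNT(DISTINCT TRIM(LEADING 'Re: ' FROM subject)) AS num_threads
FROM flink_ml_user
GROUP BY TUMBLE(`date`, INTERVAL '30' DAY);

-- Join on the shared window end; both views aggregate over the
-- same 30-day tumbling windows, so the attribute lines up.
SELECT e.window_end, e.num_emails, t.num_threads
FROM emails e, threads t
WHERE e.window_end = t.window_end;
```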
But what I actually want to look at is how engaging those threads are — I want to know how many emails there were per thread. I'll just add that field: num_emails divided by num_threads. There we go — well, we didn't give it a good name, but there are some low numbers; sometimes I actually also saw zero. And there's one thing to be aware of: this is SQL, and both num_threads and num_emails are integers, so this division is actually integer division. What I want is floating-point division — the exact average — so I can cast one of the types to float, and then it will be floating-point division and I will get better results. We can name it maybe average, or emails_per_thread, so we have a nice name. We already saw in the preview that some values were not at 3 but rather at 3.9, so we want to see that distinction.
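Since both counts are integers, the average needs a cast before dividing; a sketch:

```sql
-- Without the cast this is integer division (3.9 would become 3).
SELECT e.window_end,
       CAST(e.num_emails AS DOUBLE) / t.num_threads AS emails_per_thread
FROM emails e, threads t
WHERE e.window_end = t.window_end;
```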
If we browse through those numbers, almost all of them round up to four — so on average, four emails per thread. That's fine. Let's maybe also look at another mailing list: I did import different mailing lists, so I also imported the Chinese user mailing list. Let's have a look at how things are there. I'll just change the FROM tables, because I also have this flink ml user china table; it has the same format, as I can see in here. If you look a bit further — for example at the DDL statement, which is how I created that table — I have the fields that we saw in the columns, and I also defined a watermark on that date, so that we are actually working in event time. All right, let's see what the result is for the Chinese mailing list. Behind the scenes Flink executes the query, reading through the Kafka cluster — we'll have a look at this later with a bit more complex query. For the Chinese mailing list it's almost the same, maybe one email lower: roughly three emails per thread. All right, let's get more details. We looked at the user mailing list and at the Chinese user mailing list; let's now get an overview over all three mailing lists that I imported, so those two plus the dev mailing list.
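The DDL mentioned a moment ago — defining the columns plus a watermark on the date so that windows run in event time — might look roughly like this; the table name and connector options are illustrative and abridged, not the demo's exact configuration:

```sql
CREATE TABLE flink_ml_user_zh (
  `date`     TIMESTAMP(3),
  from_raw   STRING,
  from_email STRING,
  subject    STRING,
  -- Event-time watermark: tolerate emails arriving somewhat out of order.
  WATERMARK FOR `date` AS `date` - INTERVAL '1' DAY
) WITH (
  'connector' = 'kafka',            -- topics were pre-filled beforehand
  'topic'     = 'flink-ml-user-zh', -- illustrative topic name
  'format'    = 'json'
  -- ...further connection properties omitted...
);
```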
And let's compare them all together. We do have a temporary view on the emails for the Chinese user mailing list; let's make one more for the others, and just rename them: this one is the Chinese emails, this is the normal — English — Flink user list, and then we have dev, that is flink dev, and we can have a look at the table. All right, now I have temporary views for all those mailing lists; I just need to join them together, and I'll do that below here. We already joined two tables, but — let's move that a bit further away — now I want to join those three email tables together. So let's start with a SELECT star, and I want to select FROM the Chinese emails zh, as well as emails e, as well as — what is the other one — the dev emails d. I want to join those three tables together, and I just need to define what they are joined on. So this is WHERE zh.window_end equals e.window_end — oops — because they're all the same, as you can see above; and then the second thing is e.window_end equals d.window_end. All right, then let's rename things to be a bit more telling: e.window_end first, so we always see which month this is for; then e.num_emails as num_user, or maybe a nicer name; then zh.num_emails as num_user_zh; as well as d.num_emails as num_dev. Oops — there we go, we combined all of them together. Except for this one thing here: I encountered a colon where there shouldn't be one. Let's see what I forgot... yep, that should not be a comma, that should be an AND. That was a weird condition.
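The three-way comparison, as assembled in the demo, might be sketched as follows (aliases and view names are approximations):

```sql
SELECT e.window_end,
       e.num_emails  AS num_user,     -- English user list
       zh.num_emails AS num_user_zh,  -- Chinese user list
       d.num_emails  AS num_dev       -- dev list
FROM emails e, emails_zh zh, emails_dev d
WHERE e.window_end = zh.window_end
  AND e.window_end = d.window_end;    -- an AND, not a comma
```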
How lucky we are — we actually got real-time query validation, so we see when something is wrong before the query even gets to the cluster. All right, let's execute this. I'll just execute that selection, because I only want to run that part; the other part is just stored away. Now we see all the months: something like 500 emails on the user mailing list; on the Chinese mailing list there have been around 800, so the Chinese mailing list looks quite popular by comparison; and the dev mailing list is about equally well used. So let's maybe have a deeper look at the dev mailing list to continue — or actually, this is a more complex query, so let's see how it is translated inside the session cluster. This is the cluster that the Ververica platform uses to execute those previews, so we'll have a look in here. Remember, we joined three tables together, and we joined them on the time attributes — that way Flink can internally also throw away state when it is not needed anymore. Oh — oops, sorry, I cancelled that preview, so I cannot see anything anymore. I actually need to run that query again, because it's just a preview, and if the preview isn't there anymore, the query doesn't need to keep running. Right — we actually still have that Flink session cluster here, so we can just refresh and we should see the new jobs.
And now maybe the live demo is appropriately coming to an end... oh no, there it is. Okay, let's see: this is the job that Flink translated the SQL query into, so we can browse, zoom in, and look at what it does. There is a table source that reads from — I don't see it immediately; there should be one for each of the tables: one is here, the other is here, and the third one is down here. Then there are some filters, some aggregations, and also some projections, because we don't need everything — we, for example, didn't make use of the email from field and other things, so those are being dropped. There's an optimizer running behind the scenes that reduces this graph so it performs better.
All right. Now that we have an overview of the mailing lists, I wanted to have a deeper look at the dev mailing list. So: email threads per month, which is what we used earlier, and we'll use the same thing for the dev mailing list. I have the emails view, which I can just adapt for the dev mailing list; I also have the threads for the dev list; and I can combine those two together just as we did before, joining the emails and threads. But we're coming to an end time-wise, so I will switch over and show you one more advanced statistic that I wanted to pull off, just to get you into the mood. I prepared some of the scripts that I wanted to show and stored them in the platform, so we can look at them immediately. What I wanted to look at is who had the most commits — who was the most active, in terms of commits into the Flink master branch, over the past month. I prepared that query, and I'll get it running first, because this one we cannot preview at the moment: it's not an append-only stream. For those who know Flink SQL: the preview is currently only capable of showing append-only streams, and — as you can see back in the editor — this is a retract stream, so it cannot be shown immediately, but it will be running behind the scenes. I just started a deployment, and this is what will be done by a long-running streaming application: it will continuously update something — in my case, an index in Elasticsearch. So I use a table that writes into Elasticsearch, and I can browse the result once the job is running.
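The prepared statement presumably has roughly this shape — a global, non-windowed aggregation produces a retract stream, so it is written to an updating Elasticsearch sink rather than shown in the preview. Table and column names here are assumptions, not the demo's exact schema:

```sql
INSERT INTO commits_per_author_es     -- Elasticsearch-backed sink table
SELECT author_email,
       COUNT(*) AS num_commits
FROM flink_commits
WHERE commit_ts > CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY author_email;                -- updates as new commits arrive
```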
I'll show that really quick — or, while it's running, maybe I can take a few questions until I have the result. If we still have a minute, let's have a look at the result of that query. I put this into Elasticsearch, so let's see who had the most commits last month. In the 30 days up to October 5th — I sorted this by author email address — this famous "null" email had 160 commits, so congratulations to null. That might be something from the GitHub API: maybe it's from those privacy settings where you can say you don't want to reveal your email address. So let's look at the second one: Stephan Ewen with 32, and then Dawid Wysakowicz — sorry — with 16, and then a few more. You can look at those; I just put the