Welcome, everyone, to this "A Year in Flink" special edition. Because I'm doing live coding, like I said, first of all I want to have a brief look at what happened over the past years.
So if we look at the website visitors of flink.apache.org, we see a steady stream: from 2016 to now there has been a steady increase in users. If you look at it closely, you will see a fairly steady stream up until, let's say, the second half of 2018, and then it ramped up a bit. So from that, there seem to be quite a lot of users that look into Flink, that work with Flink, and that are using our website as a useful resource.
Then, when we look at what happened on the development side of things: just in September last year, Flink 1.9.0 was released, so that's a bit outside this last year's scope, but since then a lot has happened. There have been two major releases: Flink 1.10 came out in February, and Flink 1.11 came out in July. So roughly speaking, there is a major Flink release every five months. And the Flink developers have not only been focusing on the core Flink distribution, but also released the Stateful Functions library into the open: in April this year, the Stateful Functions API became part of the Apache Software Foundation, and its first release was created then. Stateful Functions is actually on a faster release pace, delivering more recent, more up-to-date features.
But wait: I collected those statistics manually, by hand. But why should I? It would be better to automate this, right? And this is actually what I want to do for the rest of the talk. So let's look at something that gives us those statistics in a better way, so that I don't have to work on them by collecting data manually.
And for that, I chose to have a look at Flink SQL, because its focus is on logic, not on the implementation; it is useful for batch processing as well as stream processing; and it aims at maximizing developer speed and user autonomy. Thus it should be useful for getting quick results, for exploring your data, and, at some point, for getting into full-fledged streaming applications. So, what is Flink SQL, in case you haven't dug into it yet?
Flink SQL (well, SQL on its own) is a declarative API. On the left-hand side you see a statement, and if you give Flink that statement, it will basically translate it into a job. A job, in Flink terms, is a composition of operators that exchange data: you start by reading from something, then you do transformations, and eventually you have something that you put on the output. That is the streaming graph that the SQL query will be translated into. So it defines a program; it's not a database, it's just a way of defining a program. It can also be used to get materialized views, and to get their maintenance for free. Take this example:
It has an input table of transactions, and we want a Flink job that does some aggregation: the job goes over the transactions table, counts the transactions, and groups them by type, and then you get a counter that you can use to serve dashboards. That's a common use case for Flink SQL as well: your real-time dashboards get updated by a running Flink job.
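A minimal sketch of such an aggregation, assuming a `transactions` table with a `t_type` column (both names are illustrative, not from the demo):

```sql
-- Count transactions per type; the result continuously updates
-- as new transactions arrive, which can feed a live dashboard.
SELECT
  t_type,
  COUNT(*) AS num_transactions
FROM transactions
GROUP BY t_type;
```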
And what is there at the moment to work with Flink SQL? Well, you can write your Java or Scala code and just embed SQL in there, or use the Table API, and then send the query out for execution; the rest of the translation will be done for you. You can also use the SQL CLI and, on your command line, just run the queries that you like and print the results. Both of them are not really viable for production systems, and this is why we at Ververica chose to integrate SQL into our platform.
At the core of the platform there's this strong Apache Flink streaming runtime that you already know, and around that we have very useful, easy-to-use tooling that helps you get your applications deployed, and helps you create your clusters and manage them over the life cycle of those applications. We started with support for Java and Scala jobs; now, with the 2.3 version released yesterday, we extended it with support for SQL. So I thought, why not use it?
You can use SQL with our platform in basically two ways. One is the integrated editor: there is a SQL editor in your browser, served by the platform, and it can interact with the platform. It can look into the catalog that the platform comes with, you can upload UDFs (user-defined functions) or define tables, and then those will be translated into queries that will run in your cluster. All of that uses Kubernetes underneath, but that's nothing you should worry about. The other way to use SQL with our platform is via Jupyter notebooks. I'm going to use the editor in the browser, so let's have a look at what we can do to gather a few more statistics.
Let's switch over to the platform. What I did in preparation for this event is that I created, or imported, a bit of data so that I can use it for analysis. I went to the GitHub API and fetched the commits and the pull requests of the Flink repository, and I also imported the mailing lists, the user and dev mailing lists, so we can have a look at those and see what was happening there. Those I did beforehand, so we have a bunch of Kafka topics that are now pre-filled with data that we can use for our analytics.
So now we're going into SQL. This is the SQL editor of our platform. On the left you see the schema explorer, which you can use to get information about everything that is in the catalog that comes with the platform, or in a Hive catalog if you connect to one. I chose the internal platform catalog, and as you can see, I have a bunch of tables available; we will make use of them throughout the rest of this webinar. There are also a bunch of functions that I will use at a later point.
So, I want to talk about what happened in the Flink community, and just as a starting point, let's take the user mailing list as an example. I do have this one table for the Flink user mailing list, and if I look into it, I see a bunch of things that I imported from the mbox archives that you can download: the date of each email that arrived on the mailing list, the from field, also parsed as an email address, and the subject. I don't need the body for the rest of the analysis, so I just skipped that, but it could have been imported as well.
So let's do our first query and try to gather some information. We could look for the number of users on the mailing list. So we say SELECT, and say star, and I could just do that. Let me close that preview, and let's just run it and see what's being read. Now I get all those email addresses, but I don't want all of them; it would be nice to know how those numbers changed over time.
So what I want is maybe an overview of the user count per month, and this is what you can use a tumbling window for. In SQL, it goes with GROUP BY and then TUMBLE, and you specify a time attribute to work on, some timestamp that you can use. If I look into the table again, there is this date column, containing the date of the email, so we'll use that. So I have a tumbling window over the date, and now I need to define the length of that tumbling window; in SQL, that goes with an INTERVAL. Let's make it a statistic over 30 days, so roughly a month.
So now I do have a tumbling window, but since I'm grouping by it, I cannot just select star. Let's make it a count: I want to count everything that happened in that window. And maybe I also want to know which month I counted this for, so I take TUMBLE_END, which gives me the end timestamp of that tumbling window of 30 days; for that to happen, I need to use the same window specification that I used down in the GROUP BY. So I have this TUMBLE_END, and I will rename it to window_end, and the COUNT(*) to, maybe, num_users.
All right, now we should be good to go, except that this is not actually counting the users. So how is a user defined? If I just cheat and click into the table again, I can see what I have: I can distinguish the users by their email addresses, and this is what I'm going to do. The number of distinct email addresses is how I get the number of users; the email addresses are in the from field, so I count the distinct from emails. Oops, I'm missing a bracket there. Now I should be ready to go; let's close that again and just run it to see a preview.
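Put together, the query looks roughly like this (the table and column names, `flink_ml_user`, `mail_date`, and `from_email`, are my reconstruction from the demo, not verbatim):

```sql
-- Distinct mailing list users per 30-day tumbling window.
SELECT
  TUMBLE_END(mail_date, INTERVAL '30' DAY) AS window_end,
  COUNT(DISTINCT from_email)               AS num_users
FROM flink_ml_user
GROUP BY TUMBLE(mail_date, INTERVAL '30' DAY);
```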
There we go, I see all the months aggregated. There are actually too many windows here: I imported everything from the mailing lists since their start in 2014, when Apache Flink became an Apache project, but the talk is just a year in Flink, so we want to do this only for the past year. Let's actually reduce the scope a bit: WHERE my date timestamp is greater than the current timestamp (CURRENT_TIMESTAMP gives you the current timestamp), and since I want to look back a year, we'll subtract an interval of one year. There we go; now we should only get everything that happened in the past year, so let's run it again. And I do get everything starting roughly a year ago: that's the window ending on the 10th of November, and then 30 days after that, and so on, so I would have 12 results there.
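The filter just added, as a sketch (the column name is an assumption):

```sql
-- Keep only emails from the last year.
WHERE mail_date > CURRENT_TIMESTAMP - INTERVAL '1' YEAR
```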
Cool, so that's the first statistic that I get: the number of distinct email addresses per month. Let's do something more; let's maybe count, or have a look at, the emails. How many emails did we get on this mailing list? Instead of counting the distinct email addresses, I just count everything, and it's not users anymore, it will be num_emails. And that's actually all there is to it: I count everything, and that's the number of emails. Let's see.
There we go. There's a steady stream: if we look back, it's around 500; some periods have had lower, some higher numbers. There has been some more activity in the months up until the start of March; maybe that's related to the Flink release that came out then. We don't know; we could look further. There's a second peak in May. But this is actually just the number of emails, and I would like a better look at things like how engaging they are: just looking at the number of emails, it could be that users are just sending them but never got a reply. So I actually want to have a look at the email threads that happened in each month.
So how do I distinguish those threads? Let's have a look at the table again to see what the data model is. I do have the subject in here, and usually, when someone replies to an email, the subject is prefixed with "Re:". In other languages there may be other prefixes, but let's just stay simple for the moment and assume it is prefixed with this "Re", a colon, and then a space. So we want to count the distinct email threads: COUNT(DISTINCT ...) again, and we will try to remove that "Re: " prefix from the subject, by trimming the leading "Re: " (oops, R, e, colon, space) from the subject.
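The thread count then looks roughly like this (the subject column name is an assumption):

```sql
-- A thread is identified by its subject with any leading "Re: " stripped.
COUNT(DISTINCT TRIM(LEADING 'Re: ' FROM subject)) AS num_threads
```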
And as you can see, there's already some syntax highlighting: when the SQL query I'm typing is not correct, I get the errors live, so I can see what is wrong and then act on it. So I'm counting the distinct subjects, removing every "Re:"; that should give me the email threads. Let's just rename this: it's not num_emails but num_threads. That should be fine; let's run it again and close that. All right, there are some threads; it's a bit fewer than before, but we don't have both numbers at the same time. So let's actually build a result (this is just a preview) that gets us both the emails as well as the threads.
I removed the previous statement, so I will just undo all those changes. I will have two things here: at the moment I was looking at the emails and, separately, at the email threads, but let's bring them together. I have two queries, one for the emails and one for the email threads, and by bringing them together I get the overview. I'll do that by making each of them a temporary view: CREATE TEMPORARY VIEW, then I give it a name, maybe emails, then AS, and I group that together like here; and then a second view, CREATE TEMPORARY VIEW (oops, temporary) for the threads.
So now I have two temporary views, and I just need to bring them back together. For that, I can just join them: for example, let's make a select star for the moment, FROM emails (maybe renamed to something short, e) as well as threads t, and I will join them on the month, because both of them aggregate everything on windows of 30 days. So I can join them on exactly that attribute, the window end, which is there in both: WHERE e.window_end equals t.window_end. And then let's just rename a few things so that we have it easier: I want this window end, and I know it's the same in both, so I'll just select it once as window_end.
Then I also want the number of emails (that's the one I have in the topmost view), and as the second thing I want the number of threads in here as well. So now we can actually run that. Oops, sorry, that should be e.window_end. And I can run the whole thing, and it gives me the emails as well as the threads. But what I actually want to look at is how engaging those threads are: I want to know how many emails there were per thread, and I'll just add that field: num_emails divided by num_threads. There we go.
Well, we didn't give it a good name yet, and there are some low numbers; sometimes I actually also saw zero. There's one thing to be aware of: this is SQL, so both num_threads and num_emails are integers, and this division is actually integer division. But what I want is floating-point division, so that I get the exact average. So I can cast one of the types to FLOAT; then it will be floating-point division, and I will get better results. Let's name it, maybe average, or emails_per_thread, so we have a nice name. And we already saw in the preview that there were some values which were not at 3 but rather 3.9.
We do want to know that distinction. If we browse through those engagement numbers, we see that they are roughly four; almost all of them can be rounded up to four, so there are, on average, four emails per thread.
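Putting all the pieces together, the whole statement looks roughly like this (all table and column names are my reconstruction of the demo, not verbatim):

```sql
-- Two 30-day aggregations as temporary views, joined on the window end.
CREATE TEMPORARY VIEW emails AS
SELECT
  TUMBLE_END(mail_date, INTERVAL '30' DAY) AS window_end,
  COUNT(*)                                 AS num_emails
FROM flink_ml_user
WHERE mail_date > CURRENT_TIMESTAMP - INTERVAL '1' YEAR
GROUP BY TUMBLE(mail_date, INTERVAL '30' DAY);

CREATE TEMPORARY VIEW threads AS
SELECT
  TUMBLE_END(mail_date, INTERVAL '30' DAY)          AS window_end,
  COUNT(DISTINCT TRIM(LEADING 'Re: ' FROM subject)) AS num_threads
FROM flink_ml_user
WHERE mail_date > CURRENT_TIMESTAMP - INTERVAL '1' YEAR
GROUP BY TUMBLE(mail_date, INTERVAL '30' DAY);

SELECT
  e.window_end,
  e.num_emails,
  t.num_threads,
  -- Cast to FLOAT so the division is floating-point, not integer.
  CAST(e.num_emails AS FLOAT) / t.num_threads AS emails_per_thread
FROM emails e, threads t
WHERE e.window_end = t.window_end;
```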
That's fine. Let's maybe also look at another mailing list. I did import different mailing lists, so I also imported the Chinese user mailing list; let's have a look at how things are there. I'll just change those FROM tables, because I also have the corresponding table for the Chinese user list; it has the same format, as I can see in here. If you actually look a bit further, for example at the DDL statement, this is how I created that table: I have the fields that we saw in the columns, and I also defined a watermark on that date, so that we are actually working in event time.
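Such a DDL looks roughly like this (the table name, column names, watermark delay, and connector options are illustrative, not the exact ones from the demo):

```sql
-- Sketch of a table DDL with an event-time watermark on the mail date.
CREATE TABLE flink_ml_user_zh (
  mail_date  TIMESTAMP(3),
  from_email STRING,
  subject    STRING,
  WATERMARK FOR mail_date AS mail_date - INTERVAL '1' DAY
) WITH (
  'connector' = 'kafka'
  -- topic, broker, and format options omitted here
);
```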
All right, let's see what the result is for the Chinese mailing list. Behind the scenes, Flink will execute that; it will go to the Kafka cluster, and we'll have a look at this later with a bit more complex query. For the Chinese mailing list it's almost the same, maybe one email lower, so roughly three emails per thread. All right.
Let's get some more details. We have now looked at the user mailing list and at the user mailing list in Chinese; let's actually get an overview over all three mailing lists that I imported here, so those two as well as the dev mailing list, and compare them all together. We do have a temporary view on the emails for the Chinese user mailing list; let's make one more for the normal one. We can just rename them: this one is the Chinese emails, this the normal (English) Flink user list, and then we have dev, which is the Flink dev list. And we can have a look at the table.
All right, now I have temporary views of all those email tables; I just need to join them together, and I'll do that below here. We already joined two tables, but now (let's move that a bit further away) I want to join those three email tables together. So let's start with a select star, and I want FROM the Chinese emails zh, as well as the emails e, as well as the dev emails d, and I want to join those three tables together. I just need to define what they are joined on: WHERE zh.window_end equals e.window_end, because they're all the same, from what you can see above, and then the second condition is on d's window end.
All right, and then let's just rename things so they speak to us: I'll have e.window_end first, so we always see which window this is for; then e.num_emails as user, then zh.num_emails as user_zh, as well as d.num_emails as dev. Oops. There we go, we combined all of them together, except for this one thing here: I get an error marker where there shouldn't be one. Let's see what I forgot. Yep, that should not be a comma, that should be an AND; with the comma it's a weird condition.
How lucky we are that we actually got real-time query validation: we see when something is wrong before we even get the query to the cluster. All right, let's execute this. I'll just execute that selection, because I only want to execute that part; the other part is just stored away. All right, now we see all the months. There are something like 500 emails on the user mailing list; on the Chinese mailing list there have been 800s, so it also looks like the Chinese mailing list is quite popular compared to the numbers; and the dev mailing list is equally well used.
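The three-way join, sketched (the view names and aliases are my reconstruction):

```sql
-- Join the per-window email counts of all three lists on the window end.
SELECT
  e.window_end,
  e.num_emails  AS user_en,
  zh.num_emails AS user_zh,
  d.num_emails  AS dev
FROM emails e, emails_zh zh, emails_dev d
WHERE e.window_end = zh.window_end
  AND e.window_end = d.window_end;
```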
So let's maybe have a deeper look at the dev mailing list to continue. Or maybe, actually, since this is a more complex query, let's see how it is translated. Oops, sorry: the session cluster. This is the cluster that the Ververica Platform uses in order to execute those previews, so we'll have a look in here. Remember, we joined three tables together, and we joined all of them on the time attribute; that way, Flink can internally also throw away state when it is not needed anymore. Oh, I can't actually... oops, sorry, I cancelled that preview, so I cannot see anything anymore, and I actually need to run that query again. Because it's just a preview, and if the preview isn't there anymore, it doesn't need to be running.
Right, the session cluster. Actually, we still have that Flink session cluster here, so we can just refresh and we should see the new jobs. And now maybe the live demo is appropriately getting to an end, because... oh no, there it is.
Okay, let's see. So this is the job that Flink translated the SQL query to. We can browse, zoom in, and have a look at the things it does. There is a table source that will read from the... what is that? I don't see it immediately; there should be one for each of the tables: one is here, the other is here, and the third one is down here. Then there are some filters, some aggregations, and also some reductions, because we don't need everything: for example, we didn't make use of the email from fields and other things, so those are being dropped. There's an optimizer running behind the scenes that reduces this graph so it performs better.
Now that we have an overview of the mailing lists, I wanted to have a deeper look at the dev mailing list, so let's choose that. Email threads per month: this is what we used earlier, and we'll use the same thing for the dev mailing list.
I do have these emails; I can just adapt this for the dev mailing list. Once I have the threads for the dev list, I can just combine those two together, like we did before: join the emails and threads. But we're slowly getting to an end, so I will skip over that part and show you one more advanced statistic that I wanted to pull off, basically to get you into the mood.
So, I prepared a little bit: some of the scripts that I wanted to show, I stored in the platform, so we can have a look immediately. What I wanted to look at is who had the most commits, so who was the most active, in terms of commits, in the Flink master branch over the past month. I prepared that query, and I will actually get it running first, because this one we cannot preview at the moment: it's not an append-only stream, and for those who know Flink SQL, the preview is currently only capable of showing append-only streams.
So I'll just go back to the editor. It is a retract stream, so it cannot be shown immediately in the preview, but it will be running behind the scenes. I just started a deployment, and this is what will be done by a live streaming application: it will update something, in this case an index in Elasticsearch. So I use a table that puts the results into Elasticsearch, and I can browse that once the job is running. I'll just show that really quick, and while it's starting up, maybe I can take a few questions until I have the result.
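The query behind this deployment has roughly the following shape (the table names, column names, and the Elasticsearch-backed sink table are all my reconstruction, not the exact demo code):

```sql
-- Continuously maintain per-author commit counts for the last 30 days
-- in an Elasticsearch-backed table. The aggregation produces a
-- retract/upsert stream, which is why the preview cannot show it.
INSERT INTO top_committers
SELECT
  author_email,
  COUNT(*) AS num_commits
FROM flink_commits
WHERE commit_date > CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY author_email;
```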
And then I'll show you that. So let's have a look, if we still have a minute, at the result of that query. I put this into Elasticsearch, so let's see who had the most commits last month, that is, in the 30 days up to October the 5th. I sorted this by author email address, and this famous "null" email had 160 commits. So, congratulations to null!
That might be something from the GitHub API; maybe it's from those privacy settings where you can actually say you don't want to reveal your email address. So let's look at the second one: Stephan Ewen with 32, and then Dawid Wysakowicz (sorry for the pronunciation) with 16, and then a few more that you can look at; I just printed, or put the