whoami

  • Ruben Berenguel (@berenguel)
  • PhD in Mathematics
  • (big) data consultant
  • Lead data engineer using Python, Go and Scala
  • Right now at Affectv

What is Pandas?

  • Python Data Analysis library
  • Used everywhere data and Python appear in job offers
  • Efficient (columnar, with a C and Cython backend)

How does Pandas manage columnar data?
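A quick way to see it (a minimal sketch, any recent pandas): every column has a single dtype and is backed by NumPy storage, and same-dtype columns are consolidated internally into 2D blocks.

import pandas as pd

# Toy frame: two int64 columns and one float64 column
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "x": [0.1, 0.2, 0.3]})

# One dtype per column; "a" and "b" typically end up sharing an int64 block
print(df.dtypes)
print(df["x"].to_numpy())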

What is Arrow?

  • Cross-language in-memory columnar format library
  • Optimised for efficiency across languages
  • Integrates seamlessly with Pandas

How does Arrow manage columnar data?
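A minimal sketch with pyarrow: each column of a Table is typed and stored in contiguous buffers, organised into record batches.

import pyarrow as pa

# Small in-memory Arrow table built from Python lists
table = pa.Table.from_pydict({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})

print(table.schema)   # column names and types
print(table["x"])     # a chunked, typed column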

🏹 ❤️ 🐼

  • Arrow uses RecordBatches
  • Pandas uses blocks handled by a BlockManager
  • You can convert an Arrow Table into a Pandas DataFrame easily
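A sketch of that round trip (pandas and pyarrow only, no Spark involved): for plain numeric columns the conversion back to pandas can often avoid copying the data.

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})

# Pandas -> Arrow: the block-managed frame becomes a columnar Table
table = pa.Table.from_pandas(pdf)

# Arrow -> Pandas: back to a DataFrame
round_tripped = table.to_pandas()
print(round_tripped.equals(pdf))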

What is Spark?

  • Distributed Computation framework
  • Open source
  • Easy to use
  • Scales horizontally and vertically

How does Spark work?

Spark usually runs on top of a cluster manager

And a distributed storage layer

A Spark program runs in the driver

The driver requests resources from the cluster manager to run tasks
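A minimal sketch of those pieces from PySpark: the session below uses a local master, but on a real cluster .master(...) would point at the cluster manager (for example YARN) that hands the driver its executors.

from pyspark.sql import SparkSession

# The driver process: it builds the session and talks to the cluster manager
spark = (SparkSession.builder
         .appName("driver-example")
         .master("local[4]")   # stand-in for a real cluster manager URL
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)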

The main building block is the RDD:

Resilient Distributed Dataset
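A toy RDD pipeline (a sketch, assuming a local session): the collection is split into partitions, transformations run on the executors, and the reduce brings the result back to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection over 4 partitions, transform, then aggregate
rdd = sc.parallelize(range(10), 4)
squares = rdd.map(lambda n: n * n)
print(squares.reduce(lambda a, b: a + b))  # sum of squares of 0..9 = 285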

PySpark

PySpark offers a Python API to the Scala core of Spark

It uses the Py4J bridge

# Excerpt from PySpark's Py4J gateway setup (JavaGateway, GatewayParameters
# and java_import come from py4j.java_gateway)
gateway = JavaGateway(
    gateway_parameters=GatewayParameters(
        port=gateway_port,
        auth_token=gateway_secret,
        auto_convert=True))
java_import(gateway.jvm, "org.apache.spark.SparkConf")
java_import(gateway.jvm, "org.apache.spark.api.java.*")
java_import(gateway.jvm, "org.apache.spark.api.python.*")
...
return gateway

The main entry points are RDD and PipelinedRDD(RDD)

PipelinedRDD

builds in the JVM a

PythonRDD

The magic is in

compute

compute is run on each executor and starts a Python worker via PythonRunner

Workers act as standalone processors of streams of data

  • Connects back to the JVM that started it
  • Loads the included Python libraries
  • Deserializes the pickled function coming from the stream
  • Applies the function to the data coming from the stream
  • Sends the output back
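As a purely conceptual sketch of that loop (not the actual pyspark.worker code; read_stream and write_stream are made-up helpers standing in for Spark's framed socket protocol and serializers):

import pickle

def toy_worker_loop(read_stream, write_stream):
    # Conceptual only: the real worker uses Spark's own wire format.
    # 1. Deserialize the pickled user function shipped from the JVM
    func = pickle.loads(read_stream.read_function_bytes())
    # 2. Apply it to every deserialized record coming in...
    for record in read_stream.iter_records():
        result = func(record)
        # 3. ...and stream the results back to the JVM
        write_stream.write_record(result)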

But… wasn’t Spark magically optimising everything?

Yes, for Spark DataFrames

Spark will generate a plan

(a Directed Acyclic Graph)

to compute the result

And the plan will be optimised using Catalyst
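You can inspect that plan from PySpark with explain; a small sketch, assuming a running session:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("plan-example").getOrCreate()

df = spark.range(1000).withColumn("doubled", col("id") * 2)

# Prints the parsed, analysed, optimised (Catalyst) and physical plans
df.filter(col("id") > 10).explain(True)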

Depending on the function, the optimiser will choose

PythonUDFRunner

or

PythonArrowRunner

(both extend PythonRunner)

If we can define our functions using Pandas Series transformations, we can speed up PySpark code by 3x to 100x!
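For instance, a scalar Pandas UDF receives and returns pandas Series, and the data moves between the JVM and Python through Arrow in batches rather than row by row (a sketch, using the Spark 2.x-style PandasUDFType API as in the examples below):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("scalar-udf-example").getOrCreate()

# Series in, Series out: the transformation runs on whole Arrow batches
@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(s):
    return s + 1

df = spark.range(1 << 20)
df.select(plus_one(df["id"]).alias("id_plus_one")).show(3)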

Quick examples

The basics: toPandas

from pyspark.sql.functions import rand

df = spark.range(1 << 20).toDF("id").withColumn("x", rand())

# Direct conversion, Arrow disabled
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pandas_df = df.toPandas()

# Same conversion with the Arrow path enabled
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pandas_df = df.toPandas()
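A sketch of how one could measure the difference (assuming a running spark session; the config key is the Spark 2.x name, renamed to spark.sql.execution.arrow.pyspark.enabled in Spark 3). The table at the end of the deck comes from this kind of comparison.

import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("topandas-timing").getOrCreate()
df = spark.range(1 << 20).toDF("id").withColumn("x", rand())

def timed_to_pandas(arrow_enabled):
    # Toggle the Arrow conversion path, then time a full toPandas
    spark.conf.set("spark.sql.execution.arrow.enabled", arrow_enabled)
    start = time.time()
    df.toPandas()
    return time.time() - start

print("direct:    ", timed_to_pandas("false"))
print("with Arrow:", timed_to_pandas("true"))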

The fun: .groupBy

from pyspark.sql.functions import rand, randn, floor
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.range(20000000).toDF("row").drop("row") \
     .withColumn("id", floor(rand()*10000)).withColumn("spent", (randn()+3)*100)

@pandas_udf("id long, spent double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    spent = pdf.spent
    return pdf.assign(spent=spent - spent.mean())

df_to_pandas_arrow = df.groupby("id").apply(subtract_mean).toPandas()

Before, you may have done something like…

import numpy as np
from pyspark.sql.functions import collect_list

grouped = df2.groupby("id").agg(collect_list('spent').alias("spent_list"))
as_pandas = grouped.toPandas()
as_pandas["mean"] = as_pandas["spent_list"].apply(np.mean)
as_pandas["substracted"] = as_pandas["spent_list"].apply(np.array) - as_pandas["mean"]
df_to_pandas = as_pandas.drop(columns=["spent_list", "mean"]).explode("substracted")
import numpy as np
from pyspark.sql.functions import collect_list

grouped = df2.groupby("id").agg(collect_list('spent').alias("spent_list"))
as_pandas = grouped.toPandas()
as_pandas["mean"] = as_pandas["spent_list"].apply(np.mean)
as_pandas["substracted"] = as_pandas["spent_list"].apply(np.array) - as_pandas["mean"]
df_to_pandas = as_pandas.drop(columns=["spent_list", "mean"]).explode("substracted")

TLDR:

Use Arrow and Pandas UDFs¹

¹ in PySpark

Questions?

Thanks!

Get the slides from my GitHub:

github.com/rberenguel/

The repository is

pyspark-arrow-pandas

Further references

Table for toPandas

| Rows (2^x) | Direct (s) | With Arrow (s) | Factor |
|------------|------------|----------------|--------|
| 2^17       | 1.08       | 0.18           | 5.97   |
| 2^18       | 1.69       | 0.26           | 6.45   |
| 2^19       | 4.16       | 0.30           | 13.87  |
| 2^20       | 5.76       | 0.61           | 9.44   |
| 2^21       | 9.73       | 0.96           | 10.14  |
| 2^22       | 17.90      | 1.64           | 10.91  |
| 2^23       | (OOM)      | 3.42           |        |
| 2^24       | (OOM)      | 11.40          |        |

EOF
