This is a short introduction to pandas API on Spark, geared mainly toward new users. This notebook shows you some key differences between pandas and pandas API on Spark. You can run these examples yourself in ‘Live Notebook: pandas API on Spark’ on the quickstart page.
Customarily, we import pandas API on Spark as follows:
[1]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
Creating a pandas-on-Spark Series by passing a list of values, letting pandas API on Spark create a default integer index:
[2]:
s = ps.Series([1, 3, 5, np.nan, 6, 8])
[3]:
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Creating a pandas-on-Spark DataFrame by passing a dict of objects that can be converted to series-like objects:
[4]:
psdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])
[5]:
psdf
Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:
[6]:
dates = pd.date_range('20130101', periods=6)
[7]:
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
[8]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
[9]:
pdf
Now, this pandas DataFrame can be converted to a pandas-on-Spark DataFrame:
[10]:
psdf = ps.from_pandas(pdf)
[11]:
type(psdf)
pyspark.pandas.frame.DataFrame
It looks and behaves the same as a pandas DataFrame.
[12]:
Also, it is easy to create a pandas-on-Spark DataFrame from a Spark DataFrame.
Creating a Spark DataFrame from a pandas DataFrame:
[13]:
spark = SparkSession.builder.getOrCreate()
[14]:
sdf = spark.createDataFrame(pdf)
[15]:
sdf.show()
+--------------------+-------------------+--------------------+--------------------+
|                   A|                  B|                   C|                   D|
+--------------------+-------------------+--------------------+--------------------+
|    0.91255803205208|-0.7956452608556638|-0.28911463069772175| 0.18760566615081622|
|-0.05970271470242...| -1.233896949308984|  0.3166246451758431| -1.2268284000402265|
| 0.33287106947536615|-1.2620100816441786| -0.4348444277082644| -0.5799199651437185|
|  0.9240158461589916|-1.0220190956326003| -0.4052488880650239| -1.0360212104348547|
| -0.7722090016558953|-1.2280986385313222|  0.0689011451939635|  0.8966790729426755|
|  1.4855822995785612|-0.7093056426018517| -0.2026366848847041|-0.24876619876451092|
+--------------------+-------------------+--------------------+--------------------+
Creating a pandas-on-Spark DataFrame from a Spark DataFrame:
[16]:
psdf = sdf.to_pandas_on_spark()
[17]:
Having specific dtypes. Types that are common to both Spark and pandas are currently supported.
[18]:
psdf.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object
Here is how to show the top rows of the frame.
Note that the data in a Spark DataFrame does not preserve the natural order by default. The natural order can be preserved by setting the compute.ordered_head option, but this incurs a performance overhead because the data is sorted internally.
[19]:
psdf.head()
Displaying the index, columns, and the underlying numpy data.
[20]:
psdf.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
[21]:
psdf.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
[22]:
psdf.to_numpy()
array([[ 0.91255803, -0.79564526, -0.28911463,  0.18760567],
       [-0.05970271, -1.23389695,  0.31662465, -1.2268284 ],
       [ 0.33287107, -1.26201008, -0.43484443, -0.57991997],
       [ 0.92401585, -1.0220191 , -0.40524889, -1.03602121],
       [-0.772209  , -1.22809864,  0.06890115,  0.89667907],
       [ 1.4855823 , -0.70930564, -0.20263668, -0.2487662 ]])
Showing a quick statistical summary of your data
[23]:
psdf.describe()
Transposing your data
[24]:
psdf.T
Sorting by its index
[25]:
psdf.sort_index(ascending=False)
Sorting by value
[26]:
psdf.sort_values(by='B')
Pandas API on Spark primarily uses the value np.nan to represent missing data. It is by default not included in computations.
[27]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])
[28]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1
[29]:
psdf1 = ps.from_pandas(pdf1)
[30]:
psdf1
To drop any rows that have missing data.
[31]:
psdf1.dropna(how='any')
Filling missing data.
[32]:
psdf1.fillna(value=5)
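As a quick check of the statement that np.nan is excluded from computations by default, here is a minimal sketch. It uses plain pandas rather than pandas API on Spark so that it runs without a Spark session; that substitution is an assumption made for illustration, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: plain pandas stands in for pandas API on Spark here.
values = pd.Series([1.0, np.nan, 3.0])

# By default the missing value is skipped: (1.0 + 3.0) / 2.
mean_skipping_nan = values.mean()

# Asking pandas not to skip it makes the NaN propagate into the result.
mean_keeping_nan = values.mean(skipna=False)

# count() reports only the non-missing entries.
non_missing = values.count()

print(mean_skipping_nan, mean_keeping_nan, non_missing)
```

The same idea explains why dropna() and fillna() above change the results of later aggregations: they change which values are present to be counted.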
Performing a descriptive statistic:
[33]:
psdf.mean()
A    0.470519
B   -1.041829
C   -0.157720
D   -0.334542
dtype: float64
Various PySpark configurations can be applied internally in pandas API on Spark. For example, you can enable Arrow optimization to greatly speed up internal pandas conversion. See also the PySpark Usage Guide for Pandas with Apache Arrow in the PySpark documentation.
[34]:
prev = spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")  # Keep its default value.
ps.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to prevent overhead.
import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations.
[35]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
%timeit ps.range(300000).to_pandas()
900 ms ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[36]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
%timeit ps.range(300000).to_pandas()
3.08 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[37]:
ps.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", prev)  # Set its default value back.
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
[38]:
psdf = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                     'C': np.random.randn(8),
                     'D': np.random.randn(8)})
[39]:
Grouping and then applying the sum() function to the resulting groups.
[40]:
psdf.groupby('A').sum()
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
[41]:
psdf.groupby(['A', 'B']).sum()
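To make the split, apply, and combine steps concrete, the sketch below performs them by hand and compares the result with groupby(...).sum(). It uses plain pandas with a small fixed frame so that it runs without a Spark session; that choice, the data, and the variable names are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Assumption: plain pandas and hand-picked values, for a reproducible result.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Split: one sub-frame per distinct key in column A.
groups = {key: df[df["A"] == key] for key in df["A"].unique()}

# Apply: reduce each group independently.
sums = {key: part["C"].sum() for key, part in groups.items()}

# Combine: gather the per-group results into a single Series keyed by group.
manual = pd.Series(sums, name="C").sort_index()
manual.index.name = "A"

# The built-in groupby performs the same three steps in one call.
built_in = df.groupby("A")["C"].sum()

print(manual)
print(built_in)
```

The manual version is only meant to show what the one-line groupby call does conceptually; on distributed data the built-in call is the one to use.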
[42]:
pser = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
[43]:
psser = ps.Series(pser)
[44]:
psser = psser.cummax()
[45]:
psser.plot()
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
[46]:
pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index, columns=['A', 'B', 'C', 'D'])
[47]:
psdf = ps.from_pandas(pdf)
[48]:
psdf = psdf.cummax()
[49]:
psdf.plot()
For more details, see the Plotting documentation.
CSV is straightforward and easy to use. See DataFrame.to_csv to write a CSV file and read_csv to read a CSV file.
[50]:
psdf.to_csv('foo.csv')
ps.read_csv('foo.csv').head(10)
Parquet is an efficient and compact file format that is fast to read and write. See DataFrame.to_parquet to write a Parquet file and read_parquet to read a Parquet file.
[51]:
psdf.to_parquet('bar.parquet')
ps.read_parquet('bar.parquet').head(10)
In addition, pandas API on Spark fully supports Spark’s various data sources, such as ORC and external data sources. See DataFrame.to_spark_io to write to a specified data source and read_spark_io to read from it.
[52]:
psdf.to_spark_io('zoo.orc', format="orc")
ps.read_spark_io('zoo.orc', format="orc").head(10)
See the Input/Output documentation for more details.