Pandas is a widely used data processing tool for Python. Along with NumPy and Matplotlib, it provides high-performance, in-memory capabilities for data munging, analysis, and visualization. Although Python is an easy-to-learn programming language, it still takes time to learn the Pandas APIs and their idiomatic usage. For data engineers and analysts, SQL is the de facto standard language for data queries. This article will show how some common SQL queries can be rewritten with Pandas.
The installation and basic concepts of Pandas are not covered in this post. You can check out the official documentation, or read the book Python for Data Analysis. I recommend using the Anaconda Python distribution, which includes the Spyder IDE. Before diving into the code, please import Pandas and NumPy as follows:
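The conventional aliases used throughout this article:

```python
import numpy as np
import pandas as pd
```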
First of all, let’s read some data into the workspace (memory). Pandas supports a variety of formats, one of which is CSV. Take the following flight delay dataset as an example (link):
We can use pd.read_csv to load this file:
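A sketch of the call; in practice you would pass the path to the real flights.csv. The sample columns below are assumptions standing in for the dataset’s actual schema:

```python
import io
import pandas as pd

# In practice: df = pd.read_csv('flights.csv', dtype={'date': str})
# Here an in-memory sample (columns are made up) keeps the example runnable.
csv = io.StringIO(
    "date,flight_num,origin,destination\n"
    "02010830,1234,SFO,JFK\n"
    "02011445,5678,LAX,ORD\n"
)
df = pd.read_csv(csv, dtype={'date': str})  # keep the leading zero in 'date'
```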
This statement loads the flights.csv file into memory, uses the first line as column names, and tries to infer each column’s type. Since the date column is in %m%d%H%M format and we don’t want to lose the leading 0 of the month, we pass an explicit dtype for it, indicating that this column should stay unparsed.
df.head is a function to peek at the dataset. It accepts a single parameter to limit the number of rows, much like the LIMIT clause. To perform a LIMIT 10, 100 (skip 10 rows, then take 100), use df.iloc[10:110]. Besides, IPython defaults to showing only 60 rows, but we can increase this limit:
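A minimal sketch of all three operations on a throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'n': range(200)})

first10 = df.head(10)     # like LIMIT 10
page = df.iloc[10:110]    # skip 10 rows, take 100 (SQL: LIMIT 10, 100)
pd.set_option('display.max_rows', 100)  # raise IPython's display limit
```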
Another common loading technique is reading from a database. Pandas has built-in support for this as well:
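A sketch using an in-memory SQLite database as a stand-in for a real connection (the table and columns are made up):

```python
import sqlite3
import pandas as pd

# A throwaway in-memory database; in practice, connect to your real DB.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE flights (origin TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [('SFO', 10), ('JFK', -3)])

df = pd.read_sql('SELECT * FROM flights', conn)
```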
To save a DataFrame to a file or database, use the mirror-image methods df.to_csv and df.to_sql:
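A sketch of both writers, again using an in-memory SQLite database and made-up data:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'origin': ['SFO', 'JFK'], 'delay': [10, -3]})

# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = df.to_csv(index=False)

conn = sqlite3.connect(':memory:')
df.to_sql('flights', conn, index=False)  # write to a database table
```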
The SELECT clause in SQL is used to perform column projection and data transformation. SQL provides various functions to transform data; most of them can be replicated with Pandas, or you can simply write your own in Python. Here I’ll choose some commonly used functions to illustrate.
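A sketch of projection and transformation on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'origin': ['sfo', 'jfk'], 'delay': [10, -3]})

# SELECT origin, delay FROM df
projected = df[['origin', 'delay']]

# SELECT UPPER(origin), delay * 60 AS delay_seconds FROM df
transformed = pd.DataFrame({
    'origin': df['origin'].str.upper(),
    'delay_seconds': df['delay'] * 60,
})
```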
Pandas string functions can be invoked through a DataFrame or Series’ str attribute, for example:
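A few common SQL string functions and their str counterparts, on made-up data:

```python
import pandas as pd

s = pd.Series([' New York', 'chicago '])

trimmed = s.str.strip()   # like TRIM()
upper = s.str.upper()     # like UPPER()
lengths = s.str.len()     # like LENGTH() -- counts the untrimmed characters
```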
Pandas also has a feature called broadcasting, i.e. performing operations between lower-dimensional data (or a scalar value) and higher-dimensional data. For instance:
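A minimal sketch of broadcasting a scalar (and a string) across a Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

shifted = s + 10                     # scalar broadcast over every element
squared = s * s                      # elementwise between two Series
labeled = 'no. ' + s.astype(str)     # string concatenation broadcasts too
```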
There are many other string functions that Pandas supports out of the box; they work quite differently from their SQL counterparts, and are often more powerful. For a complete list, please check the Working with Text Data doc.
pd.to_datetime is used to convert various datetime representations into the standard datetime64[ns] type:
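A sketch parsing the %m%d%H%M strings from the flights example. Since the format carries no year, the parser falls back to a default year:

```python
import pandas as pd

s = pd.Series(['02011230', '02151700'])
dt = pd.to_datetime(s, format='%m%d%H%M')  # month, day, hour, minute
```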
dt is a property of datetime/period-like Series, from which you can extract information about date and time. Full documentation can be found in Time Series / Date functionality.
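A sketch of the dt accessor mirroring SQL’s MONTH(), HOUR(), and DATE():

```python
import pandas as pd

dt = pd.to_datetime(pd.Series(['2015-02-01 12:30', '2015-06-15 08:05']))

months = dt.dt.month   # like MONTH()
hours = dt.dt.hour     # like HOUR()
dates = dt.dt.date     # like DATE()
```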
Comparison operators in Pandas produce a boolean-typed Series, which can be used to filter rows:
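A sketch of a WHERE clause on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'origin': ['SFO', 'JFK', 'SFO'], 'delay': [10, -3, 25]})

mask = df['delay'] > 0   # boolean Series
positive = df[mask]      # SQL: WHERE delay > 0
```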
We can combine multiple conditions with bitwise operators:
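A sketch of AND/OR conditions; note the parentheses around each comparison, which are required because `&` and `|` bind tighter than `==` and `>`:

```python
import pandas as pd

df = pd.DataFrame({'origin': ['SFO', 'JFK', 'SFO'], 'delay': [10, -3, 25]})

# SQL: WHERE origin = 'SFO' AND delay > 0
both = df[(df['origin'] == 'SFO') & (df['delay'] > 0)]
# SQL: WHERE origin = 'JFK' OR delay > 20
either = df[(df['origin'] == 'JFK') | (df['delay'] > 20)]
```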
For IS NULL and IS NOT NULL, we can use the built-in functions:
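A sketch of null checks, using NaN as the missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

is_null = s.isnull()        # SQL: col IS NULL
not_null = s[s.notnull()]   # SQL: WHERE col IS NOT NULL
```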
There’s also a df.query method for writing filters as string expressions:
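A sketch of the same filter expressed as a query string:

```python
import pandas as pd

df = pd.DataFrame({'origin': ['SFO', 'JFK', 'SFO'], 'delay': [10, -3, 25]})

# Equivalent to df[(df['origin'] == 'SFO') & (df['delay'] > 0)]
result = df.query('origin == "SFO" and delay > 0')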
In fact, Pandas provides more powerful functionality for Indexing and Selecting Data, some of which cannot be expressed in SQL. You can find more usage examples in the docs.
There are two parts in an aggregation statement: the columns to group by and the aggregation function. It’s possible to pass multiple columns to df.groupby, as well as multiple aggregators.
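A sketch with two grouping columns and two aggregators on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['SFO', 'SFO', 'JFK', 'JFK'],
    'carrier': ['AA', 'AA', 'AA', 'UA'],
    'delay': [10, 20, 5, 7],
})

# SQL: SELECT origin, carrier, AVG(delay), MAX(delay)
#      FROM df GROUP BY origin, carrier
agg = df.groupby(['origin', 'carrier'])['delay'].agg(['mean', 'max'])
```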
We can also group by a function result. More usages can be found in Group By: split-apply-combine.
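A sketch of grouping by a function: when a callable is passed to groupby, it is applied to each index label to produce the group key:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['apple', 'avocado', 'banana', 'blueberry'])

# Group by the first letter of each index label.
by_initial = s.groupby(lambda label: label[0]).sum()
```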
There are two types of sorting: by index and by values. If you are not familiar with the concept of an index, please refer to the Pandas tutorials.
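A sketch of both kinds of sort:

```python
import pandas as pd

df = pd.DataFrame({'delay': [25, -3, 10]}, index=[2, 0, 1])

by_index = df.sort_index()                           # sort by row labels
by_value = df.sort_values('delay', ascending=False)  # SQL: ORDER BY delay DESC
```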
If the join keys have the same names in both tables, we can use on=['k1', 'k2']. The default join method (how) is inner join. Other options are left for left outer join, right for right outer join, and outer for full outer join.
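A sketch of pd.merge with the default inner join and a left join, on made-up tables:

```python
import pandas as pd

flights = pd.DataFrame({'origin': ['SFO', 'JFK', 'LAX'], 'delay': [10, -3, 25]})
airports = pd.DataFrame({'origin': ['SFO', 'JFK'],
                         'city': ['San Francisco', 'New York']})

inner = pd.merge(flights, airports, on='origin')             # INNER JOIN (default)
left = pd.merge(flights, airports, on='origin', how='left')  # LEFT OUTER JOIN
```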
pd.concat can be used to perform UNION. More usage examples can be found in Merge, join, and concatenate.
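A sketch: pd.concat alone behaves like UNION ALL; deduplicate afterwards to get UNION semantics:

```python
import pandas as pd

a = pd.DataFrame({'origin': ['SFO'], 'delay': [10]})
b = pd.DataFrame({'origin': ['JFK'], 'delay': [-3]})

union_all = pd.concat([a, b], ignore_index=True)  # SQL: UNION ALL
union = union_all.drop_duplicates()               # SQL: UNION (drops duplicates)
```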
Last but not least, it’s common to select the top n items within each group. In MySQL, we have to use variables. In Pandas, we can use the rank function on a grouped DataFrame:
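A sketch of top-n-per-group using rank on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'origin': ['SFO', 'SFO', 'SFO', 'JFK', 'JFK'],
    'delay': [10, 30, 20, 5, 15],
})

# Rank delays within each origin, largest first, then keep the top 2 per group.
df['rn'] = df.groupby('origin')['delay'].rank(method='first', ascending=False)
top2 = df[df['rn'] <= 2]
```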