Comparison with SQL¶
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and numpy as follows:
In [1]: import pandas as pd
In [2]: import numpy as np
Most of the examples will utilize the tips
dataset found within pandas tests. We’ll read
the data into a DataFrame called tips and assume we have a database table of the same name and
structure.
In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
In [4]: tips = pd.read_csv(url)
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-4-eb7031f85b0e> in <module>()
----> 1 tips = pd.read_csv(url)
/build/pandas-0.19.2/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
644 skip_blank_lines=skip_blank_lines)
645
--> 646 return _read(filepath_or_buffer, kwds)
647
648 parser_f.__name__ = name
/build/pandas-0.19.2/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
373 filepath_or_buffer, _, compression = get_filepath_or_buffer(
374 filepath_or_buffer, encoding,
--> 375 compression=kwds.get('compression', None))
376 kwds['compression'] = (inferred_compression if compression == 'infer'
377 else compression)
/build/pandas-0.19.2/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
236
237 if _is_url(filepath_or_buffer):
--> 238 req = _urlopen(str(filepath_or_buffer))
239 if compression == 'infer':
240 content_encoding = req.headers.get('Content-Encoding', None)
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
427 req = meth(req)
428
--> 429 response = self._open(req, data)
430
431 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
445 protocol = req.get_type()
446 result = self._call_chain(self.handle_open, protocol, protocol +
--> 447 '_open', req)
448 if result:
449 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result
/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
1239 def https_open(self, req):
1240 return self.do_open(httplib.HTTPSConnection, req,
-> 1241 context=self._context)
1242
1243 https_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1196 except socket.error, err: # XXX what error?
1197 h.close()
-> 1198 raise URLError(err)
1199 else:
1200 try:
URLError: <urlopen error [Errno 111] Connection refused>
In [5]: tips.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-9b5f9fb19d63> in <module>()
----> 1 tips.head()
NameError: name 'tips' is not defined
SELECT¶
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a *
to select all columns):
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
With pandas, column selection is done by passing a list of column names to your DataFrame:
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-ca70a422970c> in <module>()
----> 1 tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
NameError: name 'tips' is not defined
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s
*
).
WHERE¶
Filtering in SQL is done via a WHERE clause.
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
In [7]: tips[tips['time'] == 'Dinner'].head(5)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-51c36d28b166> in <module>()
----> 1 tips[tips['time'] == 'Dinner'].head(5)
NameError: name 'tips' is not defined
The above statement is simply passing a Series
of True/False objects to the DataFrame,
returning all rows with True.
In [8]: is_dinner = tips['time'] == 'Dinner'
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-ec2d078cf580> in <module>()
----> 1 is_dinner = tips['time'] == 'Dinner'
NameError: name 'tips' is not defined
In [9]: is_dinner.value_counts()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-6109931b85b3> in <module>()
----> 1 is_dinner.value_counts()
NameError: name 'is_dinner' is not defined
In [10]: tips[is_dinner].head(5)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-57b725cb2a9c> in <module>()
----> 1 tips[is_dinner].head(5)
NameError: name 'tips' is not defined
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
# tips of more than $5.00 at Dinner meals
In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-f9872f18820f> in <module>()
----> 1 tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
NameError: name 'tips' is not defined
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-12-89bface1daa4> in <module>()
----> 1 tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
NameError: name 'tips' is not defined
NULL checking is done using the notnull()
and isnull()
methods.
In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
....: 'col2': ['F', np.NaN, 'G', 'H', 'I']})
....:
In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records
where col2
IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]: frame[frame['col2'].isnull()]
Out[15]:
col1 col2
1 B NaN
Getting items where col1
IS NOT NULL can be done with notnull()
.
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [16]: frame[frame['col1'].notnull()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY¶
In pandas, SQL’s GROUP BY operations are performed using the similarly named
groupby()
method. groupby()
typically refers to a
process where we’d like to split a dataset into groups, apply some function (typically aggregation)
, and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
The pandas equivalent would be:
In [17]: tips.groupby('sex').size()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-17-b6efcfd549fa> in <module>()
----> 1 tips.groupby('sex').size()
NameError: name 'tips' is not defined
Notice that in the pandas code we used size()
and not
count()
. This is because
count()
applies the function to each column, returning
the number of not null
records within each.
In [18]: tips.groupby('sex').count()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-18-76e695bd2afb> in <module>()
----> 1 tips.groupby('sex').count()
NameError: name 'tips' is not defined
Alternatively, we could have applied the count()
method
to an individual column:
In [19]: tips.groupby('sex')['total_bill'].count()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-b60ce06ea317> in <module>()
----> 1 tips.groupby('sex')['total_bill'].count()
NameError: name 'tips' is not defined
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount
differs by day of the week - agg()
allows you to pass a dictionary
to your grouped DataFrame, indicating which functions to apply to specific columns.
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-b5e78ebd47a5> in <module>()
----> 1 tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
NameError: name 'tips' is not defined
Grouping by more than one column is done by passing a list of columns to the
groupby()
method.
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-21-6f224060382b> in <module>()
----> 1 tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
NameError: name 'tips' is not defined
JOIN¶
JOINs can be performed with join()
or merge()
. By default,
join()
will join the DataFrames on their indices. Each method has
parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the
columns to join on (column names or indices).
In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
....: 'value': np.random.randn(4)})
....:
In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
....: 'value': np.random.randn(4)})
....:
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
INNER JOIN¶
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [24]: pd.merge(df1, df2, on='key')
Out[24]:
key value_x value_y
0 B -0.318214 0.543581
1 D 2.169960 -0.426067
2 D 2.169960 1.138079
merge()
also offers parameters for cases when you’d like to join one DataFrame’s
column with another DataFrame’s index.
In [25]: indexed_df2 = df2.set_index('key')
In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[26]:
key value_x value_y
1 B -0.318214 0.543581
3 D 2.169960 -0.426067
3 D 2.169960 1.138079
LEFT OUTER JOIN¶
-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
# show all records from df1
In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]:
key value_x value_y
0 A 0.116174 NaN
1 B -0.318214 0.543581
2 C 0.285261 NaN
3 D 2.169960 -0.426067
4 D 2.169960 1.138079
RIGHT JOIN¶
-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
# show all records from df2
In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]:
key value_x value_y
0 B -0.318214 0.543581
1 D 2.169960 -0.426067
2 D 2.169960 1.138079
3 E NaN 0.086073
FULL JOIN¶
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
# show all records from both frames
In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]:
key value_x value_y
0 A 0.116174 NaN
1 B -0.318214 0.543581
2 C 0.285261 NaN
3 D 2.169960 -0.426067
4 D 2.169960 1.138079
5 E NaN 0.086073
UNION¶
UNION ALL can be performed using concat()
.
In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
....: 'rank': range(1, 4)})
....:
In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
....: 'rank': [1, 4, 5]})
....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [32]: pd.concat([df1, df2])
Out[32]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
In pandas, you can use concat()
in conjunction with
drop_duplicates()
.
In [33]: pd.concat([df1, df2]).drop_duplicates()
Out[33]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
Pandas equivalents for some SQL analytic and aggregate functions¶
Top N rows with offset¶
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [34]: tips.nlargest(10+5, columns='tip').tail(10)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-34-4d492e6230af> in <module>()
----> 1 tips.nlargest(10+5, columns='tip').tail(10)
NameError: name 'tips' is not defined
Top N rows per group¶
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
....: .groupby(['day'])
....: .cumcount() + 1)
....: .query('rn < 3')
....: .sort_values(['day','rn'])
....: )
....:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-35-6b63d0ae5566> in <module>()
----> 1 (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
2 .groupby(['day'])
3 .cumcount() + 1)
4 .query('rn < 3')
5 .sort_values(['day','rn'])
NameError: name 'tips' is not defined
the same using rank(method=’first’) function
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day','rnk'])
....: )
....:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-36-d33ea02e360f> in <module>()
----> 1 (tips.assign(rnk=tips.groupby(['day'])['total_bill']
2 .rank(method='first', ascending=False))
3 .query('rnk < 3')
4 .sort_values(['day','rnk'])
5 )
NameError: name 'tips' is not defined
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
Let’s find tips with (rank < 3) per gender group for (tips < 2).
Notice that when using rank(method='min')
function
rnk_min remains the same for the same tip
(as Oracle’s RANK() function)
In [37]: (tips[tips['tip'] < 2]
....: .assign(rnk_min=tips.groupby(['sex'])['tip']
....: .rank(method='min'))
....: .query('rnk_min < 3')
....: .sort_values(['sex','rnk_min'])
....: )
....:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-37-ae999baf4147> in <module>()
----> 1 (tips[tips['tip'] < 2]
2 .assign(rnk_min=tips.groupby(['sex'])['tip']
3 .rank(method='min'))
4 .query('rnk_min < 3')
5 .sort_values(['sex','rnk_min'])
NameError: name 'tips' is not defined
UPDATE¶
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-38-32b569e8c5fe> in <module>()
----> 1 tips.loc[tips['tip'] < 2, 'tip'] *= 2
NameError: name 'tips' is not defined
DELETE¶
DELETE FROM tips
WHERE tip > 9;
In pandas we select the rows that should remain, instead of deleting them
In [39]: tips = tips.loc[tips['tip'] <= 9]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-39-4ad2971a8844> in <module>()
----> 1 tips = tips.loc[tips['tip'] <= 9]
NameError: name 'tips' is not defined