与SQL的比较#
由于许多潜在的 pandas 用户对 SQL 有一些了解,本页面旨在提供一些示例,说明如何使用 pandas 执行各种 SQL 操作。
如果你是 pandas 的新手,你可能想先阅读 10 分钟入门 pandas 以熟悉这个库。
按照惯例,我们导入 pandas 和 NumPy 如下:
In [1]: import pandas as pd
In [2]: import numpy as np
大多数示例将使用在 pandas 测试中找到的 tips
数据集。我们将数据读入一个名为 tips
的 DataFrame,并假设我们有一个同名和结构的数据库表。
In [3]: url = (
...: "https://raw.githubusercontent.com/pandas-dev"
...: "/pandas/main/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [4]: tips = pd.read_csv(url)
In [5]: tips
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
复制 vs. 就地操作#
大多数 pandas 操作返回 Series
/DataFrame
的副本。要使更改生效,你需要将其分配给一个新变量:
sorted_df = df.sort_values("col1")
或者覆盖原始的那个:
df = df.sort_values("col1")
备注
你将会看到一些方法提供了 inplace=True
或 copy=False
关键字参数:
df.replace(5, inplace=True)
关于弃用和移除大多数方法(例如 dropna
)中的 inplace
和 copy
的讨论正在进行中,除了极少数方法(包括 replace
)。在写时复制(Copy-on-Write)的上下文中,这两个关键字将不再需要。提案可以在这里找到 here。
选择#
在 SQL 中,选择是通过使用逗号分隔的列列表来完成的(或者使用 *
来选择所有列):
SELECT total_bill, tip, smoker, time
FROM tips;
使用 pandas,列选择是通过将列名列表传递给您的 DataFrame 来完成的:
In [6]: tips[["total_bill", "tip", "smoker", "time"]]
Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
.. ... ... ... ...
239 29.03 5.92 No Dinner
240 27.18 2.00 Yes Dinner
241 22.67 2.00 Yes Dinner
242 17.82 1.75 No Dinner
243 18.78 3.00 No Dinner
[244 rows x 4 columns]
不带列名列表调用 DataFrame 将显示所有列(类似于 SQL 的 *
)。
在 SQL 中,您可以添加一个计算列:
SELECT *, tip/total_bill as tip_rate
FROM tips;
使用 pandas,你可以使用 DataFrame 的 DataFrame.assign()
方法来添加一个新列:
In [7]: tips.assign(tip_rate=tips["tip"] / tips["total_bill"])
Out[7]:
total_bill tip sex smoker day time size tip_rate
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
.. ... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3 0.203927
240 27.18 2.00 Female Yes Sat Dinner 2 0.073584
241 22.67 2.00 Male Yes Sat Dinner 2 0.088222
242 17.82 1.75 Male No Sat Dinner 2 0.098204
243 18.78 3.00 Female No Thur Dinner 2 0.159744
[244 rows x 8 columns]
WHERE#
在SQL中,过滤是通过WHERE子句完成的。
SELECT *
FROM tips
WHERE time = 'Dinner';
DataFrame 可以通过多种方式进行过滤;其中最直观的方式是使用 布尔索引。
In [8]: tips[tips["total_bill"] > 10]
Out[8]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[227 rows x 7 columns]
上述语句仅仅是将一个 True
/False
对象的 Series
传递给 DataFrame,返回所有 True
的行。
In [9]: is_dinner = tips["time"] == "Dinner"
In [10]: is_dinner
Out[10]:
0 True
1 True
2 True
3 True
4 True
...
239 True
240 True
241 True
242 True
243 True
Name: time, Length: 244, dtype: bool
In [11]: is_dinner.value_counts()
Out[11]:
time
True 176
False 68
Name: count, dtype: int64
In [12]: tips[is_dinner]
Out[12]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[176 rows x 7 columns]
就像 SQL 的 OR
和 AND
一样,可以使用 |
(OR
) 和 &
(AND
) 将多个条件传递给 DataFrame。
晚餐餐食超过 $5 的小费:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [13]: tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
Out[13]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
至少5位用餐者的团体 OR 账单总额超过$45的小费:
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [14]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[14]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
NULL 检查使用 notna()
和 isna()
方法完成。
In [15]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
....: )
....:
In [16]: frame
Out[16]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
假设我们有一个与上述 DataFrame 结构相同的表。我们可以通过以下查询仅查看 col2
为 NULL 的记录:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [17]: frame[frame["col2"].isna()]
Out[17]:
col1 col2
1 B NaN
获取 col1
不为空的项可以通过 notna()
完成。
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [18]: frame[frame["col1"].notna()]
Out[18]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY#
在 pandas 中,SQL 的 GROUP BY
操作使用同名的 groupby()
方法执行。groupby()
通常指的是一种我们希望将数据集分成组,应用某些函数(通常是聚合),然后将这些组组合在一起的过程。
一个常见的SQL操作是获取数据集中每个组的记录数。例如,一个查询获取按性别分组的提示数量:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
pandas 的等价物是:
In [19]: tips.groupby("sex").size()
Out[19]:
sex
Female 87
Male 157
dtype: int64
请注意,在 pandas 代码中我们使用了 DataFrameGroupBy.size()
而不是 DataFrameGroupBy.count()
。这是因为 DataFrameGroupBy.count()
将函数应用于每一列,返回每个列中 NOT NULL
记录的数量。
In [20]: tips.groupby("sex").count()
Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
或者,我们可以对单个列应用 DataFrameGroupBy.count()
方法:
In [21]: tips.groupby("sex")["total_bill"].count()
Out[21]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
多个函数也可以同时应用。例如,假设我们想看看小费金额如何随一周中的某天变化 - DataFrameGroupBy.agg()
允许你传递一个字典给分组后的DataFrame,指示要对特定列应用哪些函数。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thu 2.771452 62
*/
In [22]: tips.groupby("day").agg({"tip": "mean", "day": "size"})
Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
通过将列列表传递给 groupby()
方法,可以按多列进行分组。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thu 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thu 17 3.030000
*/
In [23]: tips.groupby(["smoker", "day"]).agg({"tip": ["size", "mean"]})
Out[23]:
tip
size mean
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
JOIN#
JOIN
s 可以使用 join()
或 merge()
进行。默认情况下,join()
将根据索引连接 DataFrame。每个方法都有参数允许你指定要执行的连接类型(LEFT
、RIGHT
、INNER
、FULL
)或要连接的列(列名或索引)。
警告
如果两个键列都包含键为空值的行,这些行将相互匹配。这与通常的SQL连接行为不同,可能会导致意外的结果。
In [24]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
In [25]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
假设我们有两个数据库表,它们的名称和结构与我们的 DataFrame 相同。
现在让我们来看看各种类型的 JOIN
s。
INNER JOIN#
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [26]: pd.merge(df1, df2, on="key")
Out[26]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
merge()
还提供了参数,用于当你想将一个DataFrame的列与另一个DataFrame的索引连接时的情况。
In [27]: indexed_df2 = df2.set_index("key")
In [28]: pd.merge(df1, indexed_df2, left_on="key", right_index=True)
Out[28]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
LEFT OUTER JOIN#
显示 df1
中的所有记录。
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
In [29]: pd.merge(df1, df2, on="key", how="left")
Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
RIGHT JOIN#
显示 df2
中的所有记录。
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In [30]: pd.merge(df1, df2, on="key", how="right")
Out[30]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
FULL JOIN#
pandas 还允许使用 FULL JOIN
s,它会显示数据集的两边,无论连接列是否找到匹配项。截至撰写本文时,并非所有 RDBMS(MySQL)都支持 FULL JOIN
s。
显示两个表中的所有记录。
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
In [31]: pd.merge(df1, df2, on="key", how="outer")
Out[31]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
UNION#
UNION ALL
可以使用 concat()
来执行。
In [32]: df1 = pd.DataFrame(
....: {"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)}
....: )
....:
In [33]: df2 = pd.DataFrame(
....: {"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]}
....: )
....:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [34]: pd.concat([df1, df2])
Out[34]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL 的 UNION
类似于 UNION ALL
,然而 UNION
会移除重复的行。
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
在 pandas 中,你可以结合使用 concat()
和 drop_duplicates()
。
In [35]: pd.concat([df1, df2]).drop_duplicates()
Out[35]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
限制#
SELECT * FROM tips
LIMIT 10;
In [36]: tips.head(10)
Out[36]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
7 26.88 3.12 Male No Sun Dinner 4
8 15.04 1.96 Male No Sun Dinner 2
9 14.78 3.23 Male No Sun Dinner 2
一些SQL分析和聚合函数的pandas等价物#
顶部 n 行带偏移#
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]: tips.nlargest(10 + 5, columns="tip").tail(10)
Out[37]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
每个组的顶部 n 行#
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2
使用 rank(method='first')
函数
In [39]: (
....: tips.assign(
....: rnk=tips.groupby(["day"])["total_bill"].rank(
....: method="first", ascending=False
....: )
....: )
....: .query("rnk < 3")
....: .sort_values(["day", "rnk"])
....: )
....:
Out[39]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
让我们在每个性别组中找到 (rank < 3) 的提示,条件是 (tips < 2)。注意,当使用 rank(method='min')
函数时,rnk_min
对于相同的 tip
保持不变(类似于 Oracle 的 RANK()
函数)
In [40]: (
....: tips[tips["tip"] < 2]
....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
....: .query("rnk_min < 3")
....: .sort_values(["sex", "rnk_min"])
....: )
....:
Out[40]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0
更新#
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [41]: tips.loc[tips["tip"] < 2, "tip"] *= 2
删除#
DELETE FROM tips
WHERE tip > 9;
在 pandas 中,我们选择应该保留的行,而不是删除应该删除的行:
In [42]: tips = tips.loc[tips["tip"] <= 9]