The previous lesson covered how to view data.
This lesson covers how to select data.
1.Series
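1.1 Indexing
Indexing a Series works much like indexing an ndarray. The difference is that a Series can be indexed not only by integer position but also by its own index labels. A few examples follow.
Single element: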
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series[3])
print(my_series['d'])
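In the code above, my_series[3] and my_series['d'] access the same element: one uses the integer position 3, the other uses the Series' own index label d. Note that recent pandas releases deprecate integer positional lookup through plain [] when the index is not an integer index; my_series.iloc[3] is the explicit positional form.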
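The same comparison can be written with .iloc, which stays positional regardless of how plain [] treats integer keys on a label-indexed Series. This is only a minimal sketch; the names by_position and by_label are illustrative and not part of the lesson.

```python
import pandas as pd

my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])

# Positional lookup: the element at integer position 3 (the fourth element).
by_position = my_series.iloc[3]
# Label lookup: the element whose index label is 'd'.
by_label = my_series["d"]

print(by_position, by_label)    # -5 -5
print(by_position == by_label)  # True
```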
Multiple elements:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series[[1, 3, 4]])  # positional lookup; my_series.iloc[[1, 3, 4]] is the explicit form
print(my_series[['b', 'd', 'e']])
Slicing:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series[0:5])
print(my_series['a':'e'])
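Integer slicing and label slicing differ in one respect: an integer slice excludes its end position, whereas a label slice includes its end label. Here my_series[0:5] returns the elements at positions 0 through 4, and my_series['a':'e'] returns the labels a through e, so both lines print the same five elements.
Boolean indexing: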
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series[my_series > my_series.median()])
The code above prints the elements that are greater than the median (not the mean) of the Series.
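Because the median and the mean are easy to mix up, the sketch below filters the same Series by each of them. The variable names above_median and above_mean are illustrative.

```python
import pandas as pd

my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])

# Sorted values are -7, -5, 2, 3, 4, 6, so the median is (2 + 3) / 2 = 2.5.
# The values sum to 3, so the mean is 3 / 6 = 0.5.
above_median = my_series[my_series > my_series.median()]
above_mean = my_series[my_series > my_series.mean()]

print(list(above_median.index))  # ['a', 'c', 'e']
print(list(above_mean.index))    # ['a', 'c', 'e', 'f']
```

The element f (value 2) lies between the two thresholds, which is why it appears only in the second result.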
1.2 Selecting data like a dictionary
import pandas as pd
import numpy as np
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series['b'])
print(my_series.get('b', np.nan))  # np.nan is the default returned when the key is missing
Checking whether a key is in the Series:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print('e' in my_series)
print('g' in my_series)
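One detail that the dictionary analogy implies: the in operator tests the index labels (the "keys"), not the stored values, and get returns None for a missing key unless a default is supplied. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])

label_found = "b" in my_series          # True: 'b' is an index label
value_as_key = 4 in my_series           # False: 4 is a value, not a label
value_found = 4 in my_series.values     # True: search the values explicitly

missing = my_series.get("g")            # no such label, so None is returned
fallback = my_series.get("g", 0)        # no such label, so the default 0 is returned

print(label_found, value_as_key, value_found)
print(missing, fallback)
```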
1.3 loc or iloc
The difference between loc and iloc is that loc selects by axis label, while iloc selects by integer position.
Single element:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series.iloc[1])
print(my_series.loc['b'])
Multiple elements:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series.iloc[[1, 3]])
print(my_series.loc[['b', 'd']])
Slicing:
import pandas as pd
my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])
print(my_series.iloc[1:5])
print(my_series.loc['b': 'e'])
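The same rule applies here: an iloc slice excludes its end position, whereas a loc slice includes its end label. Both lines above therefore return the elements labelled b through e.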
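The endpoint rule can be checked directly. The sketch below compares the two slices from the example and shows what happens when the integer end is lowered by one; the variable names are illustrative.

```python
import pandas as pd

my_series = pd.Series([4, -7, 6, -5, 3, 2], index=["a", "b", "c", "d", "e", "f"])

pos_slice = my_series.iloc[1:5]        # positions 1, 2, 3, 4 (end position 5 excluded)
label_slice = my_series.loc["b":"e"]   # labels b, c, d, e (end label e included)
shorter = my_series.iloc[1:4]          # positions 1, 2, 3, so e is not included

print(pos_slice.equals(label_slice))   # True
print(list(shorter.index))             # ['b', 'c', 'd']
```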
2.DataFrame
2.1 Indexing
Indexing a DataFrame with [] returns one or more columns.
Single column:
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df['Open'])
The same single column can also be accessed in another way, for example:
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df.Open)
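The code above uses attribute-style access. It only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method; df['Open'] works for any column name.
Multiple columns: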
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df[['Open', 'Close']])
Boolean indexing:
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df[df['Open'] > 140])
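In the code above, the Open values greater than 140 are in the last three rows, so the result is the last three rows of the DataFrame. Boolean indexing can also pick out every element of the whole DataFrame that is greater than 140, for example: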
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df[df > 140])
The code above keeps every element of the DataFrame that is greater than 140 and replaces all other elements with NaN.
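To make the NaN filling concrete, the sketch below counts the surviving elements and checks that the result matches DataFrame.where with the same mask. The DataFrame is rebuilt from plain lists with a shared index for brevity; the dates list and the variable names are illustrative.

```python
import pandas as pd

# Same prices as in the lesson, built from plain lists with a shared index.
dates = ["2021-07-01", "2021-07-02", "2021-07-06",
         "2021-07-07", "2021-07-08", "2021-07-09"]
df = pd.DataFrame(
    {
        "Open": [136, 137, 140, 143, 141, 142],
        "High": [137, 140, 143, 144, 144, 145],
        "Low": [135, 137, 140, 142, 140, 142],
        "Close": [137, 139, 142, 144, 143, 145],
    },
    index=dates,
)

masked = df[df > 140]

# Indexing with a boolean DataFrame gives the same result as where().
same_as_where = masked.equals(df.where(df > 140))
# count() ignores NaN, so this is the number of elements greater than 140.
kept = int(masked.count().sum())

print(same_as_where)         # True
print(kept)                  # 13
print(masked["Open"].dtype)  # float64, because NaN forces the integer column to float
```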
2.2 Slicing
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df[:3])
print(df[::2])
print(df[::-1])
print(df[1:2])
All of the slices above operate on rows, not columns.
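A quick way to confirm that integer slices in [] select rows is to compare them with the equivalent iloc calls. This is a sketch under the same data, rebuilt from plain lists; the dates list is illustrative.

```python
import pandas as pd

dates = ["2021-07-01", "2021-07-02", "2021-07-06",
         "2021-07-07", "2021-07-08", "2021-07-09"]
df = pd.DataFrame(
    {
        "Open": [136, 137, 140, 143, 141, 142],
        "High": [137, 140, 143, 144, 144, 145],
        "Low": [135, 137, 140, 142, 140, 142],
        "Close": [137, 139, 142, 144, 143, 145],
    },
    index=dates,
)

first_three = df[:3]      # rows at positions 0, 1, 2
every_other = df[::2]     # rows at positions 0, 2, 4
reversed_rows = df[::-1]  # all rows in reverse order
one_row = df[1:2]         # still a DataFrame, with a single row

print(first_three.equals(df.iloc[:3]))  # True
print(list(every_other.index))          # ['2021-07-01', '2021-07-06', '2021-07-08']
print(reversed_rows.index[0])           # 2021-07-09
print(one_row.shape)                    # (1, 4)
```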
2.3 loc or iloc
2.3.1 Accessing a single row or column
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df.iloc[0])
print(df.loc['2021-07-01'])
print(df.iloc[:, 0])
print(df.loc[:, 'Open'])
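As a rough sketch of what these calls return: a single row and a single column both come back as a Series. The row is indexed by the column names, and the column is indexed by the row labels. The dates list and variable names are illustrative.

```python
import pandas as pd

dates = ["2021-07-01", "2021-07-02", "2021-07-06",
         "2021-07-07", "2021-07-08", "2021-07-09"]
df = pd.DataFrame(
    {
        "Open": [136, 137, 140, 143, 141, 142],
        "High": [137, 140, 143, 144, 144, 145],
        "Low": [135, 137, 140, 142, 140, 142],
        "Close": [137, 139, 142, 144, 143, 145],
    },
    index=dates,
)

row = df.loc["2021-07-01"]   # one row, returned as a Series
col = df.loc[:, "Open"]      # one column, returned as a Series

print(type(row).__name__, row.name)  # Series 2021-07-01
print(list(row.index))               # ['Open', 'High', 'Low', 'Close']
print(type(col).__name__, col.name)  # Series Open
print(row.equals(df.iloc[0]))        # True
```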
2.3.2 Accessing multiple rows or columns
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df.iloc[[0, 1, 2]])
print(df.iloc[:, [0, 1, 2]])
print(df.loc[:, ['Open', 'High', 'Low']])
2.3.3 Elements at the intersection of rows and columns
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df.loc[['2021-07-01', '2021-07-02'], ['Open', 'High']])
print(df.iloc[[0, 1], [0, 1]])
2.3.4 Slicing
import pandas as pd
d = {
"Open": pd.Series([136, 137, 140, 143, 141, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"High": pd.Series([137, 140, 143, 144, 144, 145], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Low": pd.Series([135, 137, 140, 142, 140, 142], index=['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09']),
"Close": pd.Series([137, 139, 142, 144, 143, 145], index = ['2021-07-01', '2021-07-02', '2021-07-06', '2021-07-07', '2021-07-08', '2021-07-09'])
}
df = pd.DataFrame(d)
print(df.loc['2021-07-01':'2021-07-07', 'Open': 'Low'])
print(df.iloc[0:4, 0:3])
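The two slices above select the same block, because loc includes both end labels while iloc excludes both end positions. The sketch below checks this on the same data, rebuilt from plain lists; the dates list and variable names are illustrative.

```python
import pandas as pd

dates = ["2021-07-01", "2021-07-02", "2021-07-06",
         "2021-07-07", "2021-07-08", "2021-07-09"]
df = pd.DataFrame(
    {
        "Open": [136, 137, 140, 143, 141, 142],
        "High": [137, 140, 143, 144, 144, 145],
        "Low": [135, 137, 140, 142, 140, 142],
        "Close": [137, 139, 142, 144, 143, 145],
    },
    index=dates,
)

# Label slices: rows 2021-07-01 through 2021-07-07, columns Open through Low (ends included).
by_label = df.loc["2021-07-01":"2021-07-07", "Open":"Low"]
# Position slices: rows 0..3 and columns 0..2 (ends 4 and 3 excluded).
by_position = df.iloc[0:4, 0:3]

print(by_label.equals(by_position))  # True
print(by_label.shape)                # (4, 3)
```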
3.Summary
This lesson covered how to select data from Series and DataFrame objects.