The Python Data Analysis Trio, Pandas (Part 5): Statistical Computation and Descriptive Statistics

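This is anti-crawler text; readers may ignore it.
This article was originally published on CSDN. Author: TRHX.
Blog home page: https://itrhx.blog.csdn.net/
Article link: https://itrhx.blog.csdn.net/article/details/106788501
Do not repost without authorization. Malicious reposting is at your own risk. Respect original work and stay away from plagiarism.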

【01x00】Statistical computation

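Pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they have built-in handling for missing data: NaN values are skipped by default.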
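As a quick illustration of the missing-data handling described above, here is a minimal sketch. The sample values are made up for this example. The data is wrapped in a plain NumPy array for the NumPy call, because calling np.sum() directly on a Series would dispatch back to the pandas method.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]  # made-up sample data with one missing value

# NumPy propagates NaN through the reduction.
numpy_total = np.sum(np.array(values))

# pandas skips NaN by default (skipna=True).
pandas_total = pd.Series(values).sum()

# skipna=False makes pandas propagate NaN as well.
pandas_total_strict = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, pandas_total_strict)
```

If you do want NaN to propagate, passing skipna=False is usually simpler than converting to a NumPy array first.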

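【01x01】sum(): sum

The sum() method returns the sum along the specified axis. It is analogous to numpy.sum(), except that missing values are skipped by default.

The basic syntax for Series and DataFrame is as follows: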

Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
DataFrame.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

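Commonly used parameters:

Parameter  Description
axis       Axis to sum along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null) when summing. Defaults to True.
level      If the axis is a MultiIndex (hierarchical), sum along the specified level.

Usage with a Series: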

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.sum()
14
>>> 
>>> obj.sum(level='blooded')
blooded
warm    6
cold    8
Name: legs, dtype: int64
>>> 
>>> obj.sum(level=0)
blooded
warm    6
cold    8
Name: legs, dtype: int64
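A note on versions: the level argument used above was deprecated in pandas 1.3 and removed in pandas 2.0, so on recent versions obj.sum(level=...) raises an error. The following is a minimal sketch of an equivalent based on groupby(level=...). The variable names are chosen for this example, and sort=False is an assumption made here to keep the groups in their order of first appearance.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["warm", "warm", "cold", "cold"], ["dog", "falcon", "fish", "spider"]],
    names=["blooded", "animal"],
)
legs = pd.Series([4, 2, 0, 8], name="legs", index=idx)

# Group by the 'blooded' index level, then reduce each group.
# sort=False keeps the groups in order of first appearance (warm, cold).
per_group = legs.groupby(level="blooded", sort=False).sum()
print(per_group)
```

The same pattern should work for min(), max(), and mean(): replace .sum() with the reduction you need.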

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.sum()
one    9.25
two   -5.80
dtype: float64
>>> 
>>> obj.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
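Note that row c above sums to 0.00 even though every value in it is missing. The min_count parameter in the signature controls this: it sets how many non-NA values are required before a result is returned. A small sketch using the same data:

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Require at least one non-NA value per row; otherwise the result is NaN.
row_sums = obj.sum(axis=1, min_count=1)
print(row_sums)
```

With min_count=1 the all-NaN row c yields NaN instead of 0, which makes "no data" distinguishable from "the data sums to zero".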

【01x02】min(): minimum

The min() method returns the minimum value along the specified axis.

The basic syntax for Series and DataFrame is as follows:

Series.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

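Commonly used parameters:

Parameter  Description
axis       Axis to take the minimum along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null) when computing the minimum. Defaults to True.
level      If the axis is a MultiIndex (hierarchical), compute the minimum along the specified level.

Usage with a Series: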

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.min()
0
>>> 
>>> obj.min(level='blooded')
blooded
warm    2
cold    0
Name: legs, dtype: int64
>>> 
>>> obj.min(level=0)
blooded
warm    2
cold    0
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.min()
one    0.75
two   -4.50
dtype: float64
>>> 
>>> obj.min(axis=1)
a    1.4
b   -4.5
c    NaN
d   -1.3
dtype: float64
>>> 
>>> obj.min(axis='columns', skipna=False)
a    NaN
b   -4.5
c    NaN
d   -1.3
dtype: float64

【01x03】max(): maximum

The max() method returns the maximum value along the specified axis.

The basic syntax for Series and DataFrame is as follows:

Series.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

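Commonly used parameters:

Parameter  Description
axis       Axis to take the maximum along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null) when computing the maximum. Defaults to True.
level      If the axis is a MultiIndex (hierarchical), compute the maximum along the specified level.

Usage with a Series: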

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.max()
8
>>> 
>>> obj.max(level='blooded')
blooded
warm    4
cold    8
Name: legs, dtype: int64
>>> 
>>> obj.max(level=0)
blooded
warm    4
cold    8
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.max()
one    7.1
two   -1.3
dtype: float64
>>> 
>>> obj.max(axis=1)
a    1.40
b    7.10
c     NaN
d    0.75
dtype: float64
>>> 
>>> obj.max(axis='columns', skipna=False)
a     NaN
b    7.10
c     NaN
d    0.75
dtype: float64

【01x04】mean(): mean

The mean() method returns the mean along the specified axis.

The basic syntax for Series and DataFrame is as follows:

Series.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

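Commonly used parameters:

Parameter  Description
axis       Axis to take the mean along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null) when computing the mean. Defaults to True.
level      If the axis is a MultiIndex (hierarchical), compute the mean along the specified level.

Usage with a Series: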

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.mean()
3.5
>>> 
>>> obj.mean(level='blooded')
blooded
warm    3
cold    4
Name: legs, dtype: int64
>>> 
>>> obj.mean(level=0)
blooded
warm    3
cold    4
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.mean()
one    3.083333
two   -2.900000
dtype: float64
>>> 
>>> obj.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>> 
>>> obj.mean(axis='columns', skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

【01x05】idxmin(): index of the minimum

The idxmin() method returns the index label of the minimum value.

The basic syntax for Series and DataFrame is as follows:

Series.idxmin(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmin(self, axis=0, skipna=True)

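Commonly used parameters:

Parameter  Description
axis       Axis to search along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null). Defaults to True.

Usage with a Series: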

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.idxmin()
('cold', 'fish')

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.idxmin()
one    d
two    b
dtype: object

【01x06】idxmax(): index of the maximum

The idxmax() method returns the index label of the maximum value.

The basic syntax for Series and DataFrame is as follows:

Series.idxmax(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmax(self, axis=0, skipna=True)

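Commonly used parameters:

Parameter  Description
axis       Axis to search along: 0 or 'index', 1 or 'columns'. The value 1 or 'columns' is available only for DataFrame.
skipna     bool; whether to exclude missing values (NA/null). Defaults to True.

Usage with a Series: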

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> 
>>> obj.idxmax()
('cold', 'spider')

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
    [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>> 
>>> obj.idxmax()
one    b
two    d
dtype: object
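The examples above search down each column (axis=0). As a further sketch, the axis parameter can also be set to 1 to find, for each row, the column that holds the maximum. The scores table below is made-up data for illustration only. The example also contrasts idxmax(), which returns an index label, with argmax(), which returns an integer position.

```python
import pandas as pd

# Hypothetical data: one row per student, one column per subject.
scores = pd.DataFrame(
    {"math": [90, 60, 75], "art": [70, 85, 80]},
    index=["ann", "bob", "cy"],
)

# For each row, the column label holding that row's maximum.
best_subject = scores.idxmax(axis=1)

# On a single column: label of the maximum vs. its integer position.
top_math_label = scores["math"].idxmax()
top_math_position = scores["math"].argmax()

print(best_subject)
print(top_math_label, top_math_position)
```

idxmin() accepts the same axis argument and works the same way for minima.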

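【02x00】Descriptive statistics

The describe() method produces a quick summary of statistics: count, mean, standard deviation, minimum and maximum, quartiles, and so on. Its parameters control which percentiles are reported and which data types are included in or excluded from the result.

The basic syntax for Series and DataFrame is as follows: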

Series.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)
DataFrame.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)

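Parameter    Description
percentiles  Optional list of numbers; the percentiles to include in the output. All values should lie between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include      Data types to include in the result; a list of dtypes, default None. See the official documentation for the accepted values.
exclude      Data types to omit from the result; a list of dtypes, default None. See the official documentation for the accepted values.

Describing a numeric Series: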

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3])
>>> obj
0    1
1    2
2    3
dtype: int64
>>> 
>>> obj.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
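The percentiles parameter described earlier can replace the default quartiles with other cut points. A minimal sketch with made-up data follows. The percentile values assume pandas' default linear interpolation.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4, 5])  # made-up sample data

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = obj.describe(percentiles=[0.1, 0.9])
print(summary)

# Individual statistics are looked up by their row label.
print(summary["10%"], summary["90%"])
```

The result is itself a Series, so a single statistic can be picked out by label, for example summary["10%"].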

Describing a Series of categorical (string) values:

>>> import pandas as pd
>>> obj = pd.Series(['a', 'a', 'b', 'c'])
>>> obj
0    a
1    a
2    b
3    c
dtype: object
>>> 
>>> obj.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing timestamps:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([
    np.datetime64("2000-01-01"),
    np.datetime64("2010-01-01"),
    np.datetime64("2010-01-01")
    ])
>>> obj
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[ns]
>>> 
>>> obj.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame (by default only numeric columns are summarized):

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns regardless of data type:

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Including only the category columns:

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> 
>>> obj.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

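【03x00】Common statistical methods

Other commonly used statistical methods are listed in the table below:

Method    Description                                                 Official documentation
count     Number of non-NA values                                     Series | DataFrame
describe  Summary statistics for a Series or each DataFrame column    Series | DataFrame
min       Minimum value                                               Series | DataFrame
max       Maximum value                                               Series | DataFrame
argmin    Integer position of the minimum value                       Series
argmax    Integer position of the maximum value                       Series
idxmin    Index label of the minimum value                            Series | DataFrame
idxmax    Index label of the maximum value                            Series | DataFrame
quantile  Sample quantile (from 0 to 1)                               Series | DataFrame
sum       Sum of the values                                           Series | DataFrame
mean      Mean of the values                                          Series | DataFrame
median    Arithmetic median (50% quantile) of the values              Series | DataFrame
mad       Mean absolute deviation from the mean                       Series | DataFrame
var       Sample variance of the values                               Series | DataFrame
std       Sample standard deviation of the values                     Series | DataFrame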
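To make the table more concrete, here is a short sketch that calls several of these methods on one made-up Series. One caveat: mad() was deprecated in pandas 1.5 and removed in pandas 2.0, so the example computes the mean absolute deviation by hand rather than calling the method.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 10.0], index=["a", "b", "c", "d"])  # made-up data

count = s.count()          # number of non-NA values
position = s.argmin()      # integer position of the minimum
label = s.idxmin()         # index label of the minimum
median = s.median()        # 50% quantile
q25 = s.quantile(0.25)     # 25% quantile (linear interpolation by default)
variance = s.var()         # sample variance (ddof=1 by default)
std = s.std()              # sample standard deviation (ddof=1 by default)

# Manual replacement for the removed mad(): mean absolute deviation from the mean.
mad = (s - s.mean()).abs().mean()

print(count, position, label, median, q25, variance, std, mad)
```

Note that var() and std() use ddof=1 (the sample estimate) by default, whereas NumPy's counterparts default to ddof=0.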
