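Table of contents
- 4. Examples
- 4.1 Box plots with plotly.express
- 4.1.1 Basic box plot
- 4.1.2 Drawing one box per value of x
- 4.1.3 Displaying the underlying data
- 4.1.4 Choosing the algorithm for computing quartiles
- 4.1.5 Differences between the quartile algorithms
- 4.1.6 Styled box plots
- 4.1.7 Box plots in Dash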
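4. Examples
A box plot is a statistical representation of the distribution of a variable through its quartiles. The ends of the box mark the lower and upper quartiles, and the median (the second quartile) is drawn as a line inside the box. For other statistical representations of numerical data, see the other statistical charts.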
Alternatives to box plots for visualizing distributions include histograms, violin plots, ECDF plots, and strip charts.
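4.1 Box plots with plotly.express
Plotly Express is the easy-to-use, high-level interface to Plotly. It operates on a variety of data types and produces figures that are easy to style.
4.1.1 Basic box plot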
import plotly.express as px
df = px.data.tips()
print(df)
'''
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]
'''
fig = px.box(df, y="total_bill")
fig.show()
4.1.2 Drawing one box per value of x
If a column name is passed as the x argument, one box is drawn for each distinct value of x.
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x="time", y="total_bill")
fig.show()
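4.1.3 Displaying the underlying data
Use the points argument to control which underlying data points are drawn alongside the box: all points ("all"), only the outliers ("outliers", the default), or none (False).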
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x="time", y="total_bill", points="all")
fig.show()
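To make the "outliers" default more concrete, here is a minimal sketch of the conventional 1.5 × IQR rule for separating the whisker range from outlying points. It uses only the Python standard library. The helper name split_points, the sample values, and the use of statistics.quantiles with method="inclusive" are assumptions made for illustration; Plotly's own quartile computation may place the fences slightly differently.

```python
import statistics


def split_points(values, whisker_factor=1.5):
    """Split values into whisker range and outliers (hypothetical helper).

    Assumes the common rule: a point is an outlier when it lies more than
    whisker_factor * IQR below Q1 or above Q3.
    """
    data = sorted(values)
    # Standard-library quartiles; not necessarily identical to Plotly's.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr

    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    # Whiskers end at the most extreme points that are still inside the fences.
    return {"whiskers": (inside[0], inside[-1]), "outliers": outliers}


# Made-up sample with one obviously extreme value.
result = split_points([2, 3, 4, 5, 5, 6, 7, 8, 30])
print(result)  # {'whiskers': (2, 8), 'outliers': [30]}
```

With points="outliers" only the value 30 would be expected to appear as a separate marker for this sample, while points="all" would draw all nine values next to the box.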
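4.1.4 Choosing the algorithm for computing quartiles
By default, box-plot quartiles are computed with the linear method (for more on linear interpolation, see method #10 listed at http://www.amstat.org/publications/jse/v14n3/langford.html, and https://en.wikipedia.org/wiki/Quartile for further details).
However, you can also choose the exclusive or the inclusive algorithm to compute the quartiles.
The exclusive algorithm uses the median to split the ordered dataset into two halves. If the sample size is odd, the median is not included in either half. Q1 is the median of the lower half and Q3 is the median of the upper half.
The inclusive algorithm also uses the median to split the ordered dataset into two halves, but if the sample size is odd, the median is included in both halves. Q1 is the median of the lower half and Q3 is the median of the upper half.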
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x="day", y="total_bill", color="smoker")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()
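The exclusive and inclusive definitions given in this section can be written out directly. The following is a minimal sketch that follows the wording above step by step; the function name quartiles and the sample data are assumptions for illustration, and this is not Plotly's internal code.

```python
from statistics import median


def quartiles(values, method="exclusive"):
    """Return (Q1, median, Q3) using the median-of-halves definitions.

    Hypothetical helper: "exclusive" leaves the median out of both halves
    when the sample size is odd, "inclusive" puts it in both halves.
    """
    if method not in ("exclusive", "inclusive"):
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")

    half = n // 2
    if n % 2 == 0:
        # Even sample size: the two halves are the same for both methods.
        lower, upper = data[:half], data[half:]
    elif method == "exclusive":
        # Odd sample size: drop the middle value from both halves.
        lower, upper = data[:half], data[half + 1:]
    else:
        # Odd sample size: keep the middle value in both halves.
        lower, upper = data[:half + 1], data[half:]
    return median(lower), median(data), median(upper)


odd_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartiles(odd_sample, "exclusive"))  # (2.5, 5, 7.5)
print(quartiles(odd_sample, "inclusive"))  # (3, 5, 7)

even_sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(quartiles(even_sample, "exclusive"))  # (2.5, 4.5, 6.5)
print(quartiles(even_sample, "inclusive"))  # (2.5, 4.5, 6.5)
```

As the even-sized sample shows, the two algorithms differ only when the sample size is odd, because only then is there a single middle value to include or exclude.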
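4.1.5 Differences between the quartile algorithms
It can be hard to see the difference between the linear, inclusive, and exclusive algorithms for computing quartiles. In the following example, the same dataset is visualized with each of the three quartile algorithms.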
import plotly.express as px
import pandas as pd
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(dict(
    linear=data,
    inclusive=data,
    exclusive=data
)).melt(var_name="quartilemethod")
print(df)
'''
quartilemethod value
0 linear 1
1 linear 2
2 linear 3
3 linear 4
4 linear 5
5 linear 6
6 linear 7
7 linear 8
8 linear 9
9 inclusive 1
10 inclusive 2
11 inclusive 3
12 inclusive 4
13 inclusive 5
14 inclusive 6
15 inclusive 7
16 inclusive 8
17 inclusive 9
18 exclusive 1
19 exclusive 2
20 exclusive 3
21 exclusive 4
22 exclusive 5
23 exclusive 6
24 exclusive 7
25 exclusive 8
26 exclusive 9
'''
fig = px.box(df, y="value", facet_col="quartilemethod", color="quartilemethod",
             boxmode="overlay", points="all")
fig.update_traces(quartilemethod="linear", jitter=0, col=1)
fig.update_traces(quartilemethod="inclusive", jitter=0, col=2)
fig.update_traces(quartilemethod="exclusive", jitter=0, col=3)
fig.show()
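For comparison with the two median-of-halves algorithms, here is a minimal sketch of one linear-interpolation rule applied to the same nine values. It assumes that the p-th quantile sits at the fractional zero-based index n × p − 0.5 in the sorted data. Whether Plotly's linear method uses exactly this rule is an assumption here and should be checked against the rendered figure; the function name linear_quantile is hypothetical.

```python
import math


def linear_quantile(values, p):
    """Quantile by linear interpolation between neighbouring sorted values.

    Assumption: the quantile is located at fractional index
    len(values) * p - 0.5 (zero-based), clamped to the data range.
    """
    data = sorted(values)
    pos = p * len(data) - 0.5
    if pos <= 0:
        return float(data[0])
    if pos >= len(data) - 1:
        return float(data[-1])
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    # Weighted average of the two neighbours around the fractional index.
    return (1 - frac) * data[lo] + frac * data[hi]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, med, q3 = (linear_quantile(data, p) for p in (0.25, 0.5, 0.75))
print(q1, med, q3)  # 2.75 5.0 7.25
```

Under this assumption the box for the nine values spans 2.75 to 7.25, which lies between the exclusive result (2.5 to 7.5) and the inclusive result (3 to 7) obtained from the definitions in section 4.1.4.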
4.1.6 Styled box plots
For an explanation of notches, see https://en.wikipedia.org/wiki/Box_plot#Variations.
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x="time", y="total_bill", color="smoker",
             notched=True,  # draw the boxes with notches
             title="Box plot of total bill",
             hover_data=["day"]  # add the day column to the hover data
             )
fig.show()
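A notch is usually read as an approximate confidence interval around the median. The following is a minimal sketch of the commonly quoted formula median ± 1.57 × IQR / sqrt(n). The constant 1.57 (some tools use 1.58), the helper name notch_interval, the sample data, and the standard-library quartile method are all assumptions for illustration, so the result may not match the notch drawn by Plotly exactly.

```python
import math
import statistics


def notch_interval(values, factor=1.57):
    """Return (lower, upper) notch bounds around the median.

    Hypothetical helper based on median +/- factor * IQR / sqrt(n).
    """
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / math.sqrt(len(data))
    return med - half_width, med + half_width


lower, upper = notch_interval([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(round(lower, 3), round(upper, 3))
```

When the notches of two boxes do not overlap, this is commonly taken as informal evidence that the two medians differ.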
4.1.7 Box plots in Dash
import dash
from dash import html, dcc
from dash.dependencies import Input, Output
import plotly.express as px
df = px.data.tips()
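app = dash.Dash(__name__)

# Layout: a checklist for the x-axis columns, radio buttons for the y-axis
# column, and the graph that the callback below fills in.
app.layout = html.Div([
    html.P("x-axis:"),
    dcc.Checklist(
        id='x-axis',
        options=[{'value': x, 'label': x}
                 for x in ['smoker', 'day', 'time', 'sex']],
        value=['time'],
        labelStyle={'display': 'inline-block'}
    ),
    html.P("y-axis:"),
    dcc.RadioItems(
        id='y-axis',
        options=[{'value': x, 'label': x}
                 for x in ['total_bill', 'tip', 'size']],
        value='total_bill',
        labelStyle={'display': 'inline-block'}
    ),
    dcc.Graph(id="box-plot"),
])


@app.callback(
    Output("box-plot", "figure"),
    [Input("x-axis", "value"),
     Input("y-axis", "value")])
def generate_chart(x, y):
    # x is the list of columns ticked in the checklist;
    # y is the single column chosen with the radio buttons.
    fig = px.box(df, x=x, y=y)
    return fig


if __name__ == "__main__":
    # Recent Dash releases provide app.run(debug=True) in place of run_server.
    app.run_server(debug=True)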