PS:为了页面整洁，已隐藏代码，如需要查看代码请点击右上角的“ Fork ”然后点击 “运行” 选择右上角的 “查看” 选择 “隐藏/显示所有代码输入” 按钮，保存即可

In [2]:

# 加载必要的库
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
from warnings import filterwarnings
filterwarnings('ignore') 
import os
import datetime

import plotly.offline as py
from plotly.offline import init_notebook_mode,iplot
import plotly.graph_objs as go
from plotly import tools
init_notebook_mode(connected=True)
import plotly.figure_factory as ff

from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.datasets import make_blobs

In [3]:

# 导入数据
path='/home/kesci/input/7947606275/data.csv'
df=pd.read_csv(path,dtype={'CustomerID':str,'InvoiceID':str})

In [5]:

#invoiceNo：发票号码/货单号
#stockcode:存货码
#description:描述
#quantity:数量
#invoice:发票日期
#unitprice:单价
#customerid:客户id
#country:国家

In [6]:

df.head()

Out[6]:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom 1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom

In [4]:

df.describe()

Out[4]:

Quantity UnitPrice count 541909.000000 541909.000000 mean 9.552250 4.611114 std 218.081158 96.759853 min -80995.000000 -11062.060000 25% 1.000000 1.250000 50% 3.000000 2.080000 75% 10.000000 4.130000 max 80995.000000 38970.000000

In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null object
Country        541909 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 33.1+ MB

1）数据准备¶

1)去除重复数据¶

In [7]:

df=df.drop_duplicates()

2)处理异常数据¶

In [8]:

# 查看描述统计
df.describe()

Out[8]:

Quantity UnitPrice count 536641.000000 536641.000000 mean 9.620029 4.632656 std 219.130156 97.233118 min -80995.000000 -11062.060000 25% 1.000000 1.250000 50% 3.000000 2.080000 75% 10.000000 4.130000 max 80995.000000 38970.000000

In [9]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536641 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      536641 non-null object
StockCode      536641 non-null object
Description    535187 non-null object
Quantity       536641 non-null int64
InvoiceDate    536641 non-null object
UnitPrice      536641 non-null float64
CustomerID     401604 non-null object
Country        536641 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 36.8+ MB

3)统计UnitPrice有多少异常的¶

In [10]:

df.loc[df['UnitPrice']<0].UnitPrice.count()

Out[10]:

In [11]:

#查看这2行的description是什么
df.loc[df['UnitPrice']<0,['UnitPrice','Description']]

Out[11]:

UnitPrice Description 299983 -11062.06 Adjust bad debt 299984 -11062.06 Adjust bad debt

In [12]:

# 删除UnitPrice小于0的和Quantity小于0的数据
df=df[(df['UnitPrice']>=0) & (df['Quantity']>0)]

In [13]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 526052 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      526052 non-null object
StockCode      526052 non-null object
Description    525460 non-null object
Quantity       526052 non-null int64
InvoiceDate    526052 non-null object
UnitPrice      526052 non-null float64
CustomerID     392732 non-null object
Country        526052 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 36.1+ MB

In [21]:

df.loc[df['CustomerID'].isna()].CustomerID.count()

Out[21]:

4)缺失值处理¶

In [22]:

# 统计缺失值
df.isnull().sum()

Out[22]:

InvoiceNo           0
StockCode           0
Description       592
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     133320
Country             0
dtype: int64

In [23]:

# 统计缺失值得占比
df.isnull().sum()/df.shape[0]*100

Out[23]:

InvoiceNo       0.000000
StockCode       0.000000
Description     0.112536
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     25.343502
Country         0.000000
dtype: float64

In [26]:

df.shape[0]

Out[26]:

In [31]:

# 删除CustomerID为空的数据
df=df[~(df.CustomerID.isnull())]
df.head()

Out[31]:

In [34]:

# 把InvoiceDate转换为datetime类型
df['InvoiceDate']=pd.to_datetime(df['InvoiceDate'])
df["CustomerID"] = df["CustomerID"].astype("str")

In [35]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392732 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      392732 non-null object
StockCode      392732 non-null object
Description    392732 non-null object
Quantity       392732 non-null int64
InvoiceDate    392732 non-null datetime64[ns]
UnitPrice      392732 non-null float64
CustomerID     392732 non-null object
Country        392732 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 27.0+ MB

In [36]:

# 查看数据日期区间（需要用到最后的交易时间统计客户最后一次交易的时间距离现在的天数)
print('最大日期是：',df['InvoiceDate'].max())
print('最小日期是：',df['InvoiceDate'].min())

最大日期是： 2011-12-09 12:50:00
最小日期是： 2010-12-01 08:26:00

2）数据分析¶

1）数据准备¶

In [37]:

# 添加一列Sales,销售总额
df['Sales']=df['Quantity']*df['UnitPrice']

In [39]:

# 减少重复数据
df_f = df
df_f.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep="first", inplace=True)
df_f.head()

In [40]:

#计算购买频率
frequency_df = df_f.groupby(by=['CustomerID'],as_index=False)['InvoiceNo'].count()
frequency_df.columns = ['CustomerID','Frequency']
frequency_df.set_index('CustomerID',drop=True,inplace=True)
frequency_df.head()

Out[40]:

Frequency CustomerID 12346 1 12347 7 12348 4 12349 1 12350 1

In [41]:

#  按用户ID分组计算总销售金额、最后一次交易的日期
df_group = df.groupby('CustomerID')
df_rm=df_group.agg({'Sales':'sum','InvoiceDate':'max'})

'''
通过最后一次的交易日期计算出客户最近一次下单距离2012-01-01的天数（2012-01-01，一般会用当天的
日期但由于数据是12年以前的所以这里直接用2012-01-01减去最大日期得到想要距离天数）
'''
df_rm['DateDiff']=(pd.to_datetime('2012-01-01') - df_rm['InvoiceDate']).dt.days
# 删除InvoiceDate字段列
df_rm=df_rm.drop('InvoiceDate',axis=1)
df_rm.head()

Out[41]:

Sales DateDiff CustomerID 12346 77183.60 347 12347 163.16 24 12348 331.36 97 12349 15.00 40 12350 25.20 332

In [42]:

# 合并数据
df_rfm = df_rm.merge(frequency_df,on='CustomerID')

# 对df_rfm以Sales排序

df_rfm.head()

Out[42]:

Sales DateDiff Frequency CustomerID 12346 77183.60 347 1 12347 163.16 24 7 12348 331.36 97 4 12349 15.00 40 1 12350 25.20 332 1

In [46]:

df_rfm.sort_values('Sales',ascending=False).head()

Out[46]:

Sales DateDiff Frequency CustomerID 16446 168471.25 22 2 12346 77183.60 347 1 18102 42053.60 22 60 15098 39916.50 204 3 17949 29999.69 23 45

3)数据可视化¶

1）查看数据大概分布¶

In [49]:

df_rfm.columns.tolist()

Out[49]:

['Sales', 'DateDiff', 'Frequency']

In [50]:

sns.pairplot(df_rfm)

Out[50]:

<seaborn.axisgrid.PairGrid at 0x7fe49f691940>

2）单独拿出来看分布直方图¶

In [52]:

plt.figure(1,figsize=(12,6))
n=0
for x in ['Frequency','DateDiff','Sales']:
    n+=1
    plt.subplot(1,3,n)
    plt.subplots_adjust(hspace=0.5,wspace=0.5)
    sns.distplot(df_rfm[x],bins=30)
    plt.title('{} 直方图'.format(x))
plt.show()

3)R、F、M模型¶

1、R、F、M分别对应¶

Image Name

2）根据客户的消费间隔、频率和贡献金额把客户分为8个类型¶

In [53]:

k=8
clf=KMeans(n_clusters=k)
clf.fit(df_rfm)

Out[53]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

3)计算用于划分客户的阙值，R、F、M的均值（*通过分布直方图可以发现该份数据不适合用中位数来分层，因此这里用均值做分层）¶

In [54]:

rmd = df_rfm['DateDiff'].mean()
fmd = df_rfm['Frequency'].mean()
mmd = df_rfm['Sales'].mean()
rmd,fmd,mmd

Out[54]:

(114.0414842129523, 4.271952062687255, 296.7611131597142)

In [56]:

df_rfm.head()

Out[56]:

Sales DateDiff Frequency CustomerID 12346 77183.60 347 1 12347 163.16 24 7 12348 331.36 97 4 12349 15.00 40 1 12350 25.20 332 1

In [57]:

df_rfm.iloc[0,1]

Out[57]:

In [58]:

def customer_type(frame): 
    customer_type = []
    for i in range(len(frame)):
        if frame.iloc[i,1]<=rmd and frame.iloc[i,2]>=fmd and frame.iloc[i,0]>=mmd:
            customer_type.append('重要价值用户')
        elif  frame.iloc[i,1]>rmd and frame.iloc[i,2]>=fmd and frame.iloc[i,0]>=mmd:
            customer_type.append('重要唤回用户')
        elif  frame.iloc[i,1]<=rmd and frame.iloc[i,2]<fmd and frame.iloc[i,0]>=mmd:
            customer_type.append('重要深耕用户')
        elif  frame.iloc[i,1]>rmd and frame.iloc[i,2]<fmd and frame.iloc[i,0]>=mmd:
            customer_type.append('重要挽留用户')
        elif  frame.iloc[i,1]<=rmd and frame.iloc[i,2]>=fmd and frame.iloc[i,0]<mmd:
            customer_type.append('潜力用户')
        elif  frame.iloc[i,1]>rmd and frame.iloc[i,2]>=fmd and frame.iloc[i,0]<mmd:
            customer_type.append('一般维持用户')
        elif  frame.iloc[i,1]<=rmd and frame.iloc[i,2]<fmd and frame.iloc[i,0]<mmd:
            customer_type.append('新用户')
        elif frame.iloc[i,1]>rmd and frame.iloc[i,2]<fmd and frame.iloc[i,0]<mmd:
            customer_type.append('流失用户')
    frame['classification'] = customer_type

In [59]:

customer_type(df_rfm)

print('不同类型的客户总数：')
print('--------------------')
df_rfm.groupby(by='classification').size().reset_index(name='客户数')

不同类型的客户总数：
--------------------

Out[59]:

classification 客户数 0 一般维持用户 51 1 新用户 1803 2 流失用户 1298 3 潜力用户 686 4 重要价值用户 361 5 重要唤回用户 17 6 重要挽留用户 58 7 重要深耕用户 65

In [61]:

print('不同类型的客户消费总额：')
print('------------------------')
df_rfm.groupby('classification').Sales.sum().reset_index(name='金额')

不同类型的客户消费总额：
------------------------

Out[61]:

classification 金额 0 一般维持用户 6332.64 1 新用户 89209.31 2 流失用户 53116.41 3 潜力用户 93804.10 4 重要价值用户 617878.42 5 重要唤回用户 22002.41 6 重要挽留用户 168550.12 7 重要深耕用户 236753.06

4)可视化不同类型客户数量¶

In [62]:

plt.figure(1,figsize=(12,7))
sns.countplot(y="classification",order=df_rfm['classification'].value_counts().index ,
data=df_rfm,palette='viridis')
plt.title('不同类型的客户总数',fontsize=20)
plt.xlabel('')
plt.ylabel('')

con=list(df_rfm.groupby('classification').classification.count().values)
con=sorted(con,reverse=True)

for x,y in enumerate(con):
    plt.text(y+0.1,x,'%s' %y,va='center',size=12)
plt.show()

5）不同类型的客户消费份额¶

In [63]:

plt.figure(1,figsize=(10,10))
labels=df_rfm.groupby('classification').Sales.sum().index
size=df_rfm.groupby('classification').Sales.sum()

# explode=[0,0,0,0,0.1,0,0,0]

plt.pie(size,labels=labels,autopct='%.2f%%',wedgeprops={'width':0.35,'edgecolor':'w'},
pctdistance=0.85)

plt.title('不同类型的客户销售份额',fontsize=20)
plt.axis('off')
plt.show()

客户价值分析--用python实现RFM模型

1）数据准备¶

1)去除重复数据¶

2)处理异常数据¶

3)统计UnitPrice有多少异常的¶

4)缺失值处理¶

2）数据分析¶

1）数据准备¶

3)数据可视化¶

1）查看数据大概分布¶

2）单独拿出来看分布直方图¶

3)R、F、M模型¶

1、R、F、M分别对应¶

2）根据客户的消费间隔、频率和贡献金额把客户分为8个类型¶

3)计算用于划分客户的阙值，R、F、M的均值（*通过分布直方图可以发现该份数据不适合用中位数来分层，因此这里用均值做分层）¶

4)可视化不同类型客户数量¶

5）不同类型的客户消费份额¶

Recommend

Best Web Development Company

D1net阅闻：美国国安局顶级后门首次曝光

为什么超融合基础设施可以简化IT管理

京东企业业务开工季推出多场景物资解决方案保障企业高效开工

好未来2022财年前三季巨亏10亿美元 Q3亏损同比扩大128%

托管服务提供商成功的7个最佳实践

成功商业智能战略的8个关键因素

中国移联元宇宙产委会发布产业自律公约：坚决抵制利用元宇宙进行资本炒作

冬奥会带来“运动热” 闲鱼健身卡搜索量上涨730%

共建清洁美丽世界：2022年六五环境日主题发布

About Joyk