50

[Pandas]一个Pandas VS MySQL SQL 的性能对比

 5 years ago
source link: http://www.flyml.net/2019/03/25/pandas一个pandas-vs-mysql-sql-的性能对比/?amp%3Butm_medium=referral
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

引言

最近使用MySQL做一些数据统计比较多。 有时候为了省事直接写出了一个略复杂的SQL查询, 直接获得最后的结果。 但是总是觉得比较慢。 下面做了一个改动,简述如下:

* 原来:

纯粹使用SQL查询, 得到最后的结果

* 改进:

使用SQL查询原始数据+Pandas进行数据处理

* 结论:

改进方案的性能提升了 300%

原方案:

sql = """
select t2_id, name, col2
from (
    select sum(col1) as col1_sum, col2, t2.name, t2_id
    FROM t1
    left JOIN t2 on t2_id = t2.id
    WHERE date > (NOW() - INTERVAL 12 HOUR)  and col1 > 0 and price > 0 and bidfloor > 0 
    GROUP BY col2, t2_id
    ORDER BY col1_sum desc
) t
order by t.col1_sum desc
"""
 

思路还是比较清晰的, 就是比较慢。 一共花费了51s

新方案

  • Step 1: 使用SQL读取原始数据
select col1, col2, t2.name, t2_id
from t1
left JOIN t2 on t2_id = t2.id
WHERE date > (NOW() - INTERVAL 12 HOUR)  and col1 > 0 and price > 0 and bidfloor > 0 
 
  • Step2 : 使用Pandas进行Group / Count / Sort的操作
grpDF = df.groupby(by=["col2", "t2_id"])
datas = []
for name, subDF in grpDF :
    datas.append({
        "col1_sum" : subDF['col1'].sum(),
        "col2" : name[0],
        "name" : subDF['name'].values.tolist()[0],
        't2_id': name[1]
    })
 
df2 = pd.DataFrame(datas)
df2 = df2.sort_values(by=['col1_sum'], ascending=False)
df2 = df2[["t2_id", "name", "col2"]]
 

计时结果:

* step1: 16s

* step2: 0.7s

* total: 16.7s, (原来: 51s)

本文原创, 转载需要著名出处:www.flyml.net


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK