Pandas got 3x faster


A couple of weeks back, I came across a library that claims to scale up existing pandas code by changing just one line, making it at least 2x faster. Such a big claim gave me a reason to test it and see the results for myself. The project is Modin; check it out!

I will import two datasets of different sizes to compare the performance of both methods.

Dataset 1

Size: 445MB

#!/usr/bin/env python

# Baseline: plain pandas
import time
import pandas as pd

# Time three full reads of the file, then average them.
duration = []
for i in range(3):
    start = time.time()
    data_df = pd.read_csv(
        'data.txt',
        sep=r'\s\|\|\s'  # raw-string regex: "||" with whitespace on each side
    )
    elapsed = time.time() - start
    duration.append(elapsed)
    del data_df  # free the frame before the next run

final_time_pd = sum(duration) / len(duration)
print('Average time for 3 runs is {:.3f} sec'.format(final_time_pd))
>>> Average time for 3 runs is 12.120 sec

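# Same experiment with Modin: only the import line changes
import time
import modin.pandas as pd

# Time three full reads of the file, then average them.
duration = []
for i in range(3):
    start = time.time()
    data_df = pd.read_csv(
        'data.txt',
        sep=r'\s\|\|\s'  # raw-string regex: "||" with whitespace on each side
    )
    elapsed = time.time() - start
    duration.append(elapsed)
    del data_df  # free the frame before the next run

final_time_modin = sum(duration) / len(duration)
print('Average time for 3 runs is {:.3f} sec'.format(final_time_modin))
>>> Average time for 3 runs is 6.515 sec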

Clearly, Modin wins this case, reading the file roughly 1.9x faster. Let's try another dataset.
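Since the two experiments differ only in the import line, the timing loop could be factored into a small helper. This is just a sketch of that idea, not part of the original test: `average_runtime` is a hypothetical name, and it uses `time.perf_counter` instead of `time.time` because it is better suited to measuring intervals.

```python
import time

def average_runtime(func, runs=3):
    """Call func() `runs` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        result = func()
        durations.append(time.perf_counter() - start)
        del result  # release the result before the next run, like `del data_df` above
    return sum(durations) / len(durations)

# Usage (assumes `pd` is either pandas or modin.pandas, and that data.txt exists):
# avg = average_runtime(lambda: pd.read_csv('data.txt', sep=r'\s\|\|\s'))
# print('Average time for 3 runs is {:.3f} sec'.format(avg))
```

Passing the read as a callable keeps the helper independent of which library `pd` points to, so the same function can time both.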

Dataset 2

Size: 990MB

I used the same code and re-ran the experiment.

>>> Average time taken for pandas is approx 111.723 seconds

>>> Average time taken for modin pandas is approx 71.770 seconds
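As a quick sanity check, the speedup for each dataset can be worked out from the averages reported above. The snippet below only redoes that arithmetic; the dictionary layout and the `speedup` helper are illustrative names, not anything from Modin.

```python
# Average read times reported in this post (seconds, 3 runs each).
timings = {
    "Dataset 1 (445MB)": {"pandas": 12.120, "modin": 6.515},
    "Dataset 2 (990MB)": {"pandas": 111.723, "modin": 71.770},
}

def speedup(pandas_sec, modin_sec):
    """Return how many times faster the Modin run was than the pandas run."""
    return pandas_sec / modin_sec

speedups = {name: speedup(t["pandas"], t["modin"]) for name, t in timings.items()}

for name, ratio in speedups.items():
    print("{}: Modin is {:.2f}x faster".format(name, ratio))
# Dataset 1 (445MB): Modin is 1.86x faster
# Dataset 2 (990MB): Modin is 1.56x faster
```

So in these two tests the gain was somewhat under the advertised 2x, and smaller on the larger file.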

P.S. Unfortunately, Modin does not support the read_table method as of now, which is why I had to use read_csv.

Results are really impressive! Modin read the files roughly 1.9x faster on the first dataset and 1.6x faster on the second. This is surely going to help me handle a good amount of data in pandas, keeping the pandas magic while gaining speed. Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries, and at the same time integrates seamlessly with existing pandas code. It uses all 4 physical cores, whereas pandas can use only 1 core at a time for any kind of computation.

No doubt this is a really good contribution for Data Science / ML enthusiasts. Kudos! Do give it a try at least once for your use case.

Feel free to share your thoughts in the comments. Thanks!
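If you want to control which engine and how many cores Modin uses, one option is to set its environment variables before the import. This is a sketch under assumptions: it relies on the `MODIN_ENGINE` and `MODIN_CPUS` variables, whose behavior may differ between Modin versions, and `configure_modin` is a hypothetical helper name.

```python
import os

def configure_modin(engine="ray", cpus=None):
    """Set Modin's environment variables; call this before importing modin.pandas."""
    if cpus is None:
        # os.cpu_count() reports logical cores, which may exceed the physical count.
        cpus = os.cpu_count() or 1
    os.environ["MODIN_ENGINE"] = engine
    os.environ["MODIN_CPUS"] = str(cpus)
    return {"engine": engine, "cpus": cpus}

settings = configure_modin(engine="ray", cpus=4)  # 4 matches the cores mentioned above
print(settings)

# After this, the one-line swap from the post applies:
# import modin.pandas as pd
```

Leaving `cpus` unset falls back to whatever the operating system reports, which is usually what you want on a machine you are not sharing.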
