
How to split a large CSV file based on the number of rows 🔪


Jul 22 ・ Originally published at benjaminrancourt.ca on Jun 24 ・ 2 min read

Recently, I had to import CSV files into an external application at work, but the application could not process all the rows: it kept triggering timeouts after a few minutes. The files weren't exactly huge (only 8 MB, topping out around 200,000 lines), but that was still too much for this application... 😢

I had no control over the external software, but I needed smaller files. I could have adapted the application that creates them, but that would have meant fetching the data from several APIs again, which takes a long time. ⌛

Instead, I decided to try splitting the files directly. After a quick web search, I came across a Gist in Python that did exactly what I wanted. My CSV files used semicolons (;) instead of commas (,), so I had to adapt the script slightly to accommodate this (delimiter=';').
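
As a side note, if you are not sure which delimiter a file uses, the csv module can detect it for you. Here is a minimal sketch of my own (the data.csv name and the 4096-byte sample size are arbitrary), not something from the original Gist:

import csv

# Sniff the delimiter from a small sample of the file (hypothetical file name)
with open('data.csv', newline='') as sample_file:
    sample = sample_file.read(4096)

dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print('Detected delimiter:', dialect.delimiter)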

Here is my adapted version, also saved in a forked Gist, in case I need it later:

import csv
import sys
import os

# example usage: python split.py example.csv 200
# above command would split the `example.csv` into smaller CSV files of 200 rows each (with header included)
# if example.csv has 401 rows for instance, this creates 3 files in same directory:
# - `example_1.csv` (row 1 - 200)
# - `example_2.csv` (row 201 - 400)
# - `example_3.csv` (row 401)

CURRENT_DIR = os.path.dirname(os.path.realpath( __file__ ))
filename = sys.argv[1]

full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]

rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile, delimiter=';')
    header = reader.fieldnames
    # load every row into memory before slicing it into pages
    rows = list(reader)
    pages = []

    row_count = len(rows)
    start_index = 0
    # here, we slice the total rows into pages, each page having [row_per_csv] rows
    while start_index < row_count:
        pages.append(rows[start_index: start_index+rows_per_csv])
        start_index += rows_per_csv

    for i, page in enumerate(pages):
        with open('{}_{}.csv'.format(file_name, i+1), 'w', newline='') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=header, delimiter=';')
            writer.writeheader()
            for row in page:
                writer.writerow(row)

    print('DONE splitting {} into {} files'.format(filename, len(pages)))
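
One caveat: the script loads every row into memory before slicing it into pages, which was fine for my 200,000-line files. If your files are much larger, a streaming variant that writes each chunk as it reads it avoids holding everything in memory. Here is a sketch of my own (same semicolon delimiter and defaults as above), not part of the original Gist:

import csv
import os
import sys

# Streaming variant: write each chunk while reading, so the whole file
# never has to fit in memory.
filename = sys.argv[1]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
base_name = os.path.splitext(filename)[0]

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile, delimiter=';')
    outfile = None
    writer = None
    part = 0

    for index, row in enumerate(reader):
        # Start a new output file every rows_per_csv rows
        if index % rows_per_csv == 0:
            if outfile:
                outfile.close()
            part += 1
            outfile = open('{}_{}.csv'.format(base_name, part), 'w', newline='')
            writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, delimiter=';')
            writer.writeheader()
        writer.writerow(row)

    if outfile:
        outfile.close()

print('DONE splitting {} into {} files'.format(filename, part))

It is invoked the same way as the script above.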

Thanks to Kelvin Tay for his script; it saved me a lot of time! 🥳

