
How to split a large CSV file based on the number of rows 🔪


Jul 22 ・ Originally published at benjaminrancourt.ca on Jun 24 ・ 2 min read

Recently, I had to import CSV files into an external application at work, but the application could not process all the rows: it kept triggering timeouts after a few minutes. The files weren't exactly huge (only 8 MB, topping out around 200,000 lines), but that was still too much for this application... 😢

I had no control over the external software, but I needed smaller files. I could have adapted the application that creates them, but that would have meant fetching the data from several APIs again, which takes a long time. ⌛

Instead, I decided to try splitting the files directly. After a quick web search, I came across a Gist in Python that did exactly what I wanted. My CSV files used semicolons (;) instead of commas (,), so I had to adapt the script slightly to accommodate this (delimiter=';').
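
As a side note, if you are not sure which delimiter a file uses, the csv module can detect it for you. Here is a minimal sketch of my own (the data.csv name and the 4096-byte sample size are arbitrary), not something from the original Gist:

import csv

# Sniff the delimiter from a small sample of the file (hypothetical file name)
with open('data.csv', newline='') as sample_file:
    sample = sample_file.read(4096)

dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print('Detected delimiter:', dialect.delimiter)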

Here is my adapted version, also saved in a forked Gist, in case I need it later:

import csv
import sys
import os

# example usage: python split.py example.csv 200
# above command would split the `example.csv` into smaller CSV files of 200 rows each (with header included)
# if example.csv has 401 rows for instance, this creates 3 files in same directory:
# - `example_1.csv` (row 1 - 200)
# - `example_2.csv` (row 201 - 400)
# - `example_3.csv` (row 401)

CURRENT_DIR = os.path.dirname(os.path.realpath( __file__ ))
filename = sys.argv[1]

full_file_path = os.path.join(CURRENT_DIR, filename)
file_name = os.path.splitext(full_file_path)[0]

rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile, delimiter=';')
    header = reader.fieldnames
    # load every row into memory before slicing it into pages
    rows = list(reader)
    pages = []

    row_count = len(rows)
    start_index = 0
    # here, we slice the total rows into pages, each page having [row_per_csv] rows
    while start_index < row_count:
        pages.append(rows[start_index: start_index+rows_per_csv])
        start_index += rows_per_csv

    for i, page in enumerate(pages):
        with open('{}_{}.csv'.format(file_name, i+1), 'w', newline='') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=header, delimiter=';')
            writer.writeheader()
            for row in page:
                writer.writerow(row)

    print('DONE splitting {} into {} files'.format(filename, len(pages)))
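
One caveat: the script loads every row into memory before slicing it into pages, which was fine for my 200,000-line files. If your files are much larger, a streaming variant that writes each chunk as it reads it avoids holding everything in memory. Here is a sketch of my own (same semicolon delimiter and defaults as above), not part of the original Gist:

import csv
import os
import sys

# Streaming variant: write each chunk while reading, so the whole file
# never has to fit in memory.
filename = sys.argv[1]
rows_per_csv = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
base_name = os.path.splitext(filename)[0]

with open(filename, newline='') as infile:
    reader = csv.DictReader(infile, delimiter=';')
    outfile = None
    writer = None
    part = 0

    for index, row in enumerate(reader):
        # Start a new output file every rows_per_csv rows
        if index % rows_per_csv == 0:
            if outfile:
                outfile.close()
            part += 1
            outfile = open('{}_{}.csv'.format(base_name, part), 'w', newline='')
            writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, delimiter=';')
            writer.writeheader()
        writer.writerow(row)

    if outfile:
        outfile.close()

print('DONE splitting {} into {} files'.format(filename, part))

It is invoked the same way as the script above.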

Thanks to Kelvin Tay for his script; it saved me a lot of time! 🥳

