Get the number of rows for a parquet file

 2 years ago
source link: http://www.donghao.org/2021/12/17/get-the-number-of-rows-for-a-parquet-file/

We had been using pandas to get the number of rows in a Parquet file:

import pandas as pd

df = pd.read_parquet("my.parquet")
print(df.shape[0])

This is easy, but it takes a lot of time and memory when the Parquet file is very large. For example, just reading a 10GB Parquet file this way may take more than 100GB of memory.

If we only need the number of rows, not the data itself, PyArrow is a better solution:

import pyarrow.parquet as pq

table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)

This method takes only a couple of seconds and about 2GB of memory for the same Parquet file.
