Get the number of rows for a parquet file
source link: http://www.donghao.org/2021/12/17/get-the-number-of-rows-for-a-parquet-file/
We had been using Pandas to get the number of rows in a Parquet file:
import pandas as pd
df = pd.read_parquet("my.parquet")
print(df.shape[0])
This is easy, but it takes a lot of time and memory when the Parquet file is very large, because the whole file is decoded into a DataFrame. For example, just reading a 10GB Parquet file may take more than 100GB of memory.
If we only need the number of rows, not the data itself, PyArrow is a better solution:
import pyarrow.parquet as pq
table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)
This method takes only a couple of seconds and uses about 2GB of memory for the same Parquet file.