

Get the schema of a parquet file
source link: http://www.donghao.org/2020/11/25/get-the-schema-of-a-parquet-file/

Previously, I just used this snippet to get all the column names of a parquet file:
import pandas as pd
df = pd.read_parquet("hello.parquet")
print(list(df.columns))
But if the parquet file is large (it does not even have to be huge; 1GB is enough), loading it causes an OOM (out of memory) error on my small VM (about 4GB RAM).
Actually, all I want is the column names, not the whole data. Since the parquet format is strictly defined, there must be some way to read only the schema instead of all the data.
And here it is:
import pyarrow.parquet as pq
schema = pq.read_schema("hello.parquet", memory_map=True)
print(list(schema.names))