

GitHub: awslabs/aws-data-wrangler - Pandas on AWS - Easy integration with...
Source: https://github.com/awslabs/aws-data-wrangler

Quick Start
Installation command: pip install awswrangler
For platforms without PyArrow 3 support (e.g. EMR, Glue PySpark Job, MWAA):
pip install pyarrow==2 awswrangler
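
After installing, a quick sanity check is to confirm which awswrangler and PyArrow versions were actually picked up. A minimal sketch (the versions in the comments are only illustrative):

```python
import awswrangler as wr
import pyarrow

# Both packages expose a __version__ attribute.
print(wr.__version__)       # the awswrangler release in use
print(pyarrow.__version__)  # should report 2.x if you pinned pyarrow==2
```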
```python
import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Store the data on the Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieve the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieve the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable"
ORDER BY time DESC
LIMIT 3
""")
```
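
Beyond the calls above, the same wr.s3.read_parquet entry point supports column projection and push-down filtering on partition columns when reading a dataset. A minimal sketch, assuming a hypothetical dataset partitioned by a year column (the bucket path and partition name are placeholders, not part of the Quick Start dataset):

```python
import awswrangler as wr

# Read only the "id" and "value" columns, and only the partitions whose
# "year" partition value is "2021". partition_filter receives a dict of
# partition names to string values and requires dataset=True.
df = wr.s3.read_parquet(
    path="s3://bucket/partitioned-dataset/",  # hypothetical path
    dataset=True,
    columns=["id", "value"],
    partition_filter=lambda partitions: partitions.get("year") == "2021",
)
```

Filtering at read time this way avoids downloading objects from partitions you do not need.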