

How to use AWS Glue crawlers with Amazon Athena
source link: https://www.pluralsight.com/resources/blog/data/how-to-use-aws-glue-crawlers

Amazon Athena provides a simplified, flexible way to analyze petabytes of data where it lives. For example, Athena can analyze data or build applications from an Amazon Simple Storage Service (S3) data lake and 30 data sources, including on-premises data sources and other cloud systems, using SQL or Python.
There are four main Amazon Athena use cases:
Run queries on S3, on-premises data centers, or on other clouds
Prepare data for machine learning models
Use machine learning models in SQL queries or Python to simplify complex tasks, such as anomaly detection, customer cohort analysis, and sales predictions
Perform multicloud analytics (like querying data in Azure Synapse Analytics and then visualizing the results with Amazon QuickSight)
Now that we’ve covered Amazon Athena, let's talk about AWS Glue. You can do a few different things with AWS Glue.
First, you can use the AWS Glue data integration engines (AWS Glue for Ray, AWS Glue for Python Shell, and AWS Glue for Apache Spark), which integrate with AWS Glue Studio. These engines let you pull data from several sources, including Amazon S3, Amazon DynamoDB, and Amazon RDS, as well as databases running on Amazon EC2.
Once the data has been extracted and filtered, it can be loaded into targets such as Amazon Redshift, data lakes, and data warehouses.
You can also use AWS Glue to run your ETL jobs. These jobs allow you to segregate customer data, protect customer data in transit and at rest, and access customer data only as needed in response to customer requests. When provisioning an ETL job, all you need to do is provide input data sources and output data targets in your virtual private cloud.
The final way you can use AWS Glue is through the AWS Glue Data Catalog, which lets you quickly discover and search multiple AWS datasets without moving the data. Once the data is cataloged, it’s immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
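As a rough illustration of browsing the catalog programmatically, the sketch below lists the tables in one catalog database together with their column names. It assumes the boto3 SDK is installed and AWS credentials are configured; the function names are hypothetical and chosen only for this example.

```python
def summarize_tables(get_tables_response):
    """Map each table name to its column names from one Glue GetTables response page."""
    summary = {}
    for table in get_tables_response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        summary[table["Name"]] = [column["Name"] for column in columns]
    return summary


def list_catalog_tables(database):
    """Collect table and column names for every table in a catalog database."""
    import boto3  # third-party AWS SDK, imported here so summarize_tables works without it

    glue = boto3.client("glue")
    summary = {}
    # GetTables is paginated, so walk every page of results.
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        summary.update(summarize_tables(page))
    return summary
```

Calling `list_catalog_tables` only returns something useful once the database contains tables, for example after a crawler has run against it.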
So, how can you get data from AWS Glue into Amazon Athena? Follow these steps:
Start by uploading data to a data source. The most popular option is an S3 bucket, but DynamoDB tables and Amazon Redshift are also options.
Select your data source and create a classifier if necessary. A classifier reads the data and generates a schema if it recognizes the format. You can create custom classifiers to recognize formats that the built-in classifiers don't handle.
Create a crawler.
Set up a name for the crawler, then choose your data sources and add any custom classifiers to make sure AWS Glue recognizes the data correctly.
Set up an Identity and Access Management (IAM) role to make sure the crawler can run the processes correctly.
Create a database that will hold the dataset. Schedule when and how often the crawler runs to keep your data fresh and up to date.
Run the crawler. This process can take a while depending on how big the dataset is. Once the crawler has successfully run, you’ll see changes to tables in the database.
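The steps above can also be scripted. The following is a minimal sketch using the boto3 SDK, assuming an S3 data source, an IAM role that already exists, and configured AWS credentials. The function names are hypothetical, and the polling loop assumes the crawler reports the `READY` state once a run has finished.

```python
import time


def build_crawler_request(name, role_arn, database, s3_path,
                          schedule=None, classifiers=None):
    """Assemble the keyword arguments for the Glue CreateCrawler call."""
    request = {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # cron expression, e.g. "cron(0 2 * * ? *)"
    if classifiers:
        request["Classifiers"] = list(classifiers)  # names of custom classifiers
    return request


def create_and_run_crawler(request, poll_seconds=30):
    """Create the database and crawler, start a run, and wait for it to finish."""
    import boto3  # third-party AWS SDK, imported here so the builder above works without it

    glue = boto3.client("glue")
    try:
        glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    except glue.exceptions.AlreadyExistsException:
        pass  # the database already exists, so reuse it
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    # Poll until the crawler is idle again.
    while glue.get_crawler(Name=request["Name"])["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)
    # LastCrawl carries the status of the most recent run, when available.
    return glue.get_crawler(Name=request["Name"])["Crawler"].get("LastCrawl", {})
```

Keeping the request builder separate from the AWS calls makes it possible to inspect the crawler definition before anything is created in the account.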
Now that you’ve completed this process, you can jump over to Amazon Athena and run the queries you need to filter the data and get the results you’re looking for.
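Queries against the crawled tables can be run from code as well as from the Athena console. This sketch, again using boto3 with hypothetical function names, submits a SQL statement, waits for it to finish, and reads the first page of results. It assumes an S3 location where Athena may write query output and that the first returned row holds the column headers, as is typical for a `SELECT`.

```python
import time


def rows_to_dicts(result_set):
    """Convert an Athena ResultSet (header row first) into a list of dicts."""
    rows = result_set.get("Rows", [])
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def run_athena_query(sql, database, output_location, poll_seconds=2):
    """Run a query in Athena and return the first page of rows as dicts."""
    import boto3  # third-party AWS SDK, imported here so rows_to_dicts works without it

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},  # database the crawler populated
        ResultConfiguration={"OutputLocation": output_location},  # e.g. an s3:// URI
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "")
        raise RuntimeError(f"Query {status['State']}: {reason}")
    # Only the first page is read here; larger results need pagination.
    results = athena.get_query_results(QueryExecutionId=query_id)
    return rows_to_dicts(results["ResultSet"])
```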