
ETL Tools for R Language – Comparison

The R Project is an open source programming environment that supports statistical computing and graphic design. For R work, or any data operations, you need an ETL (extract, transform and load) process to move your data from its source to your output or data warehouse. R on its own can act as an ETL tool, and it can also be used to build apps that perform specific ETL tasks, such as accessing internal data or uploading NYC bike trip data. In this blog, we list some of the top open source ETLs for R and what they do best.

R and most open source R ETLs might be a challenge for non-programmers and beginners. The user-friendly features of paid ETLs can make R work easier. Even pro-level developers can save time and effort with the powerful features that subscription BI tools deliver. That’s even more true if you have critical big data jobs that demand enterprise-level support. At the end of this blog, we list some paid ETLs that offer power, speed, ease of use and specialized tools for your business intelligence tasks.

Open source R ETLs

1. Apache Spark

Apache Spark is a distributed data processing engine that doubles as an ETL tool for R. Spark's own SparkR package provides a frontend that runs Spark from inside R and RStudio. SparkR's distributed data frame supports machine learning and big data operations. To access SparkR's full suite of functions, download the SparkR API.
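
To give a feel for the workflow, here's a minimal SparkR sketch of one extract-transform-load pass. The file names and columns are placeholders made up for illustration; the function calls follow SparkR's documented interface.

```r
library(SparkR)

# Start a local Spark session (point master at a cluster URL in production)
sparkR.session(master = "local[*]", appName = "r-etl-sketch")

# Extract: read a hypothetical CSV of raw trip records
trips <- read.df("trips.csv", source = "csv",
                 header = "true", inferSchema = "true")

# Transform: drop very short trips and add a duration-in-minutes column
trips <- filter(trips, trips$duration_sec > 60)
trips <- withColumn(trips, "duration_min", trips$duration_sec / 60)

# Load: write the cleaned records out as Parquet
write.df(trips, path = "trips_clean.parquet",
         source = "parquet", mode = "overwrite")

sparkR.session.stop()
```

Because Spark evaluates lazily, the transformations only run when the write is triggered, which is part of what lets it parallelize large jobs.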

Spark is an open source ETL with all sorts of data processing and transformation tools built in. It's designed to run computations in parallel, so even large data jobs run fast: up to 100 times faster than Hadoop, according to the Spark website. It also scales up for big data operations and can run algorithms on streaming data. Spark has tools for fast data streaming, machine learning and graph processing that output to storage or live dashboards.

Spark is supported by its community. If you need help, try its mailing lists, in-person groups and issue tracker.

2. Pentaho Kettle

Pentaho Data Integration (PDI), also known as Kettle (a recursive acronym: Kettle E.T.T.L. Environment), is an open source ETL that uses Pentaho's own metadata-based integration method. Kettle's documentation includes an R script executor, which adds R's complex statistical analysis and graphic modeling functions to Kettle's user-friendly, multi-source ETL tool. However, the extra work to install and run the R script executor might be a challenge for beginning programmers.

With Kettle, you can move and transform data, create and run jobs, load-balance data, pull data from multiple sources, and more. But you can't sequence transformations on their own; for that you'll need Spoon, the GUI for designing the jobs and transformations that Kettle's command-line tools execute: Pan runs data transformations, and Kitchen runs jobs. Also be aware that Spoon has some reported issues.

3. beanumber / etl

etl, from GitHub contributor Ben Baumer, is an R package that makes your ETL data operations easier. This open source ETL is designed specifically for work with medium data and SQL database output. All programming work for etl is done in R. You can then output to local or remote online storage and analyze your tabular data. A useful feature of etl is that you can use it to create your own ETL packages.
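
The package's documentation illustrates the pattern with a built-in mtcars demo; here's a sketch along those lines. When you don't pass a database connection, etl falls back to a local SQLite database, so this should run as-is, but treat it as an outline rather than the definitive usage.

```r
library(etl)
library(dplyr)  # supplies the %>% pipe and database verbs

# Instantiate an etl object; "mtcars" is the demo dataset bundled with etl.
# With no db argument, etl defaults to a local SQLite database.
cars <- etl("mtcars", dir = tempdir())

# Run the three ETL phases as separate, pipeable verbs
cars %>% etl_extract() %>% etl_transform() %>% etl_load()

# Then query the loaded table with ordinary dplyr code
cars %>% tbl("mtcars") %>% group_by(cyl) %>% summarize(n = n())
```

These same verbs are what you implement when you build your own ETL package on the framework.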

To get the stable version of etl, install the package from the Comprehensive R Archive Network (CRAN). etl is well documented and frequently downloaded. The next three ETL tools, each built for a specific task in R, are also from Ben Baumer and built on the etl framework.

4. beanumber / nyc311

nyc311 is an etl-based ETL tool that extracts and stores phone call data from the NYC Open Data portal. NYC's publicly available database has information on service request calls to New York City's non-emergency phone number, 311. Examples are complaints about noise and street conditions and requests for large item removal. etl works with medium data, and the Open Data portal's data is in that range, although it includes everything from 2010 to the present.

nyc311 brings data into the R environment, so you can transform and analyze it with R's statistical and graphic modeling functions. Details on how to use this open source ETL are on its GitHub vignette page.
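
Because nyc311 is built on the etl framework, it should expose the same pipeable verbs. A hedged sketch, with a made-up download directory; check the vignette for the exact arguments:

```r
library(nyc311)
library(dplyr)

# Stage raw downloads in a scratch directory (path is a placeholder);
# with no db argument, the etl framework defaults to local SQLite
calls <- etl("nyc311", dir = "~/dumps/nyc311")

# Download, clean and load the 311 service-request data
calls %>% etl_extract() %>% etl_transform() %>% etl_load()
```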

5. beanumber / citibike

citibike is another R package created by GitHub contributor Ben Baumer. It’s designed specifically for ETL on NYC CitiBike trip data from 2013 to 2016. Bike data you can analyze with citibike includes trip history, duration, start and stop time, rider gender and age, and more.

This open source ETL is also built on the etl framework and outputs to local SQL databases. citibike's vignette page explains how to install and use it.
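
As with nyc311, usage should follow the standard etl verbs. In this sketch, the years and months arguments to etl_extract() are our assumption about how you scope the download; the vignette has the authoritative details:

```r
library(citibike)
library(dplyr)

bikes <- etl("citibike", dir = "~/dumps/citibike")  # placeholder path

# Pull a few months of trip data, then transform and load it into SQL
bikes %>%
  etl_extract(years = 2015, months = 1:3) %>%  # argument names assumed
  etl_transform() %>%
  etl_load()
```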

6. beanumber / imdb

imdb is a lightweight open source ETL from Ben Baumer that loads Internet Movie Database (IMDb) data into SQL databases in the R environment. This open source ETL is built on the etl framework, and it uses the IMDbPY Python package from GitHub contributor Davide Alberani. With imdb, you can search IMDb for all the Star Wars movies, for example.

There's no transformation step when you use imdb as an ETL tool for IMDb, only extract and load. But be prepared for a long wait for the data to load: etl works best with medium data, and IMDb data is quite large. So it's best to store your imdb data in a SQL database meant for big data storage, like MySQL.
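
A hedged sketch of that workflow, with MySQL as the backend and placeholder connection details; note the absence of a transform step:

```r
library(imdb)
library(dplyr)

# A MySQL database suited to big data storage, per the advice above;
# dbname, user and password are placeholders
db <- src_mysql(dbname = "imdb", user = "r_user", password = "...")

movies <- etl("imdb", db = db)

# Extract and load only: imdb performs no transformation
movies %>% etl_extract() %>% etl_load()
```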

7. ropensci / elastic

The frequently downloaded elastic from rOpenSci is a multipurpose R interface for Elasticsearch, the search and analytics engine from the Elastic company. elastic is part of the rOpenSci project. This open source ETL tool lets you use R's complex data analytics capabilities on data sets gathered by Elasticsearch. Examples in the elastic GitHub documentation show how to get data from Shakespeare plays, the Public Library of Science and the Global Biodiversity Information Facility. You can find even more predefined datasets on rOpenSci's elastic dataset page. With elastic, you can also search, get multiple documents and parse your Elasticsearch data.
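
A short sketch of pulling Elasticsearch documents into R with elastic, using the connection-object interface from recent versions of the package. The host and port assume a local Elasticsearch instance, and the shakespeare index is the one used in the package's own examples:

```r
library(elastic)

# Connect to a local Elasticsearch instance
es <- connect(host = "127.0.0.1", port = 9200)

# Query the example "shakespeare" index for lines spoken by Hamlet
res <- Search(es, index = "shakespeare", q = "speaker:HAMLET", size = 10)

# Results come back as parsed R lists, ready for further analysis
length(res$hits$hits)
```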

You can download the stable version of elastic from CRAN. If you're curious, you can try the latest dev package on GitHub, which is currently passing its build. Note that elastic works best with older versions of Elasticsearch.

8. UptakeOpenSource / uptasticsearch

uptasticsearch is a basic-use R ETL from Uptake that moves data from Elasticsearch into R data tables. This open source ETL queries Elasticsearch and outputs the parsed results as data tables. uptasticsearch supports many of Elasticsearch's built-in aggregations that create summarized data views. A few examples are "date_histogram", "percentiles" and "significant_terms". Support for more aggregations is in the works.
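
The package's main entry point is es_search(), which takes the cluster URL, an index and a JSON query body, and returns the parsed hits. A minimal sketch with placeholder host, index and query:

```r
library(uptasticsearch)

# Query a hypothetical "trips" index and get a flat table back
hits <- es_search(
  es_host    = "http://localhost:9200",
  es_index   = "trips",
  query_body = '{"query": {"match": {"rider_gender": "female"}}}'
)

head(hits)
```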

The uptasticsearch ETL tool for Elasticsearch is less documented than elastic. You can get its stable version on CRAN. The dev version, which currently has a build error, is on GitHub.

9. jwijffels / ETLUtils

ETLUtils from GitHub contributor jwijffels is a package of ETL utilities that loads big data from databases into R's ff objects. The ff package stores large data on disk while offering speed and capacity similar to data held in RAM. ETLUtils pulls data from SQL databases: MySQL, Oracle, PostgreSQL and Apache Hive.
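
The core pattern is a single call that runs a SQL query in chunks and lands the result in an on-disk ffdf object, so the result can exceed available RAM. A sketch with placeholder credentials:

```r
library(ETLUtils)
library(RMySQL)  # MySQL is one of the supported sources

# read.dbi.ffdf fetches the query result in batches into an ffdf on disk
trips <- read.dbi.ffdf(
  query = "SELECT * FROM trips WHERE year = 2015",
  dbConnect.args = list(drv = dbDriver("MySQL"), dbname = "bikes",
                        user = "r_user", password = "...")
)

dim(trips)  # the ffdf behaves much like a regular data frame
```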

You can get the stable version of ETLUtils on CRAN, and there's also a link to this open source ETL in the R Project's package library. ETLUtils is only lightly documented on GitHub, but there's more guidance in its reference manual on CRAN.

Less documented R ETL

If you're curious to see what other open source ETLs can do, and you're comfortable figuring things out on your own, you might try this ETL for R, which comes with only light documentation. Even more ETLs are in progress on GitHub, so check back later to see what's new.

10. vh-d / RETL

RETL from GitHub contributor Václav Hausenblas is an open source R package with little documentation. It's licensed, and as of this writing, the creator is actively working on it, so it might be an up-and-coming ETL worth checking out.

Paid R ETLs

1. Panoply

Panoply's Elasticsearch integration makes it a time- and effort-saving choice for anyone who uses Elasticsearch to search and query online data. The user-friendly Panoply BI tool also has one-click connectors to SQL data source apps that support work in the R environment: MySQL and PostgreSQL.

Panoply is easy for non-programmers, but it also delivers the unbeatable speed and support that professional developers need for big and small data ops. This automated ETL data platform pulls data from any source, simplifies it, and stores it all in one place. Panoply continuously streams data in real time to your output. It's the only service that combines a fully integrated ETL with a cloud data warehouse that it builds and manages for you. You can try Panoply out for free or get a personalized demo.

2. Blendo

With Blendo's cloud-based ETL tool, users can get their data into warehouses as quickly as possible by using its suite of proprietary data connectors. This paid ETL-as-a-service platform makes it easy to pull data from many data sources, including CSV files and third-party sources like Amazon S3 buckets, Google Analytics, Mailchimp, Salesforce and others.

The Blendo ETL is a fast and safe way to load data from e-commerce platforms into your data warehouse. After you set up the incoming end of the data pipeline, you can load your data into several storage destinations, including PostgreSQL (coming soon), a data source app that supports R work. Blendo's blog explains how to connect it to Google BigQuery with R and then run queries in the R environment, and another post explains how to get your data from Amazon Redshift and PostgreSQL into R.

3. Stitch

Stitch is a self-service ETL data pipeline solution built for developers. The Stitch API replicates data from any source and handles both bulk and incremental data updates. Stitch also has a replication engine that uses multiple strategies to deliver data to users. Its REST API supports JSON or Transit, and it automatically detects and normalizes nested document structures into relational schemas.

Stitch connects to Amazon Redshift, Google BigQuery and PostgreSQL architectures, and it integrates with a massive suite of data analysis tools. Stitch collects, transforms and loads Google Analytics data into its own system, where it automatically produces business insights on your raw data. There's a built-in integration for R, one of many analysis tools that work with Stitch.

