5

How to evaluate ETL tools

 3 years ago
source link: https://www.stitchdata.com/blog/how-to-evaluate-etl-tools/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

How to evaluate ETL tools

tools.jpg

Every data-driven organization bases its data analytics stack on a data warehouse, which in turn hosts data replicated from multiple data sources. ETL (extract, transform, and load) software pipes data from those sources to the data warehouse. You can write your own ETL code — but you shouldn't — or you can adopt an ETL product or service to do the work for you. What should you look for in an ETL tool and how do you evaluate the candidates?

Each ETL tool or platform has pros and cons. We've zeroed in on 11 key factors that distinguish ETL services from one another, in rough order of importance:

Data sources and destinations supported

A historical perspective


A few years ago, data warehouses ran on expensive on-premises servers. Data engineers used ETL software to extract data from sources and load it to the data warehouse. Because any hardware server only has so many processors and so much memory, and because it was costly and time-consuming to upgrade the hardware in those boxes, ETL tools would transform the data on commodity hardware before loading it, leaving more CPU cycles for the analytical tasks these boxes were designed to perform.

The last few years have seen the emergence of cloud-native data warehouses like Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure Synapse. Because they run on cloud infrastructure, they can scale elastically to handle any necessary transformation processing using the same hardware on which the data warehouse runs. That low-latency scalability eliminates the need for a separate transformation step in the data pipeline. In other words, ETL has turned into ELT for organizations that implement cloud data warehouses.

With ELT, you extract data from sources and load it into a cloud-based data warehouse or data lake. Because the transformation process takes place after loading, users can model the data in different ways once it's in the target destination, then apply business intelligence and analytics tools to the data.

Whether the data integration process is ETL or ELT, many people refer to it by the older abbreviation.

If you want to use an ETL service, it must support the destinations in which you keep your analytics data. That may mean a data warehouse such as Amazon Redshift, Google BigQuery, Snowflake, Microsoft Azure Synapse; a data lake such as Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake; or, for smaller organizations, a database such as PostgreSQL. Some tools only allow you to sync to one data warehouse; some support multiple destinations but only one at a time, where others allow you to push data to multiple places simultaneously.

Support for data sources is trickier. While it's unlikely you'll find one ETL platform that can support all of the SaaS tools, databases, and other sources your organization uses, look for one that supports as many of your most important ones as possible.

What can you do if no platform supports all of the data sources your want to analyze with your business intelligence tools? In some cases, businesses choose a main ETL tool and either use a second tool to support a smaller number of sources or build the remaining integrations themselves. It's not ideal from many standpoints — administration, maintenance, and cost among them — but it's a tradeoff some organizations have to make.

Similarly, you may find, as your organization grows, that the ETL tool you've chosen doesn't support a new data source you've adopted. That leads us to the next factor...

Extensibility and future-proofing

You should be able to ask your ETL provider to add support for additional data sources. Even better would be if you had the ability to add data sources yourself. Stitch, for instance, is built on Singer, an open source specification for writing scripts that move data. If you need a new data source, you can write an integration to the Singer spec and run it on your own hardware, or submit a Singer tap you write to Stitch's product team for inclusion in the platform — or take a hybrid approach. Once your tap is accepted, you can use Stitch to run your integration.

Scalability

Inevitably, your data volumes will grow, so look for a tool that can meet your growing needs without experiencing a degradation of service. Find out how the data pipeline is architected to support a large volume of data.

Usability

How easy is it to understand the ETL tool's interface, to set up integrations, to schedule replication tasks, and to monitor them? Does the tool allow you to replicate your data sources on different schedules, and what level of granularity can you get with those schedules? How flexible and customizable is the software? How quickly can you start being productive using it?

If issues come up, how clear are the error messages, and how many options do you have for being notified? Can you fix problems yourself, or must you rely on the vendor's support team?

Support and documentation

Speaking of support, contact each vendor's support team and ask questions to assess their expertise. How are they at resolving issues? How quickly do they provide answers? Also find out what support channels are available — options include phone, email, online chat, or web form. Some vendors offer different levels of support depending on your contract terms. Finally, since your SaaS vendor might be based in a different time zone or even on a different continent, find out whether support resources are available when you're likely to need them.

Look over the vendor's documentation to see whether it's clear and complete, and written with a level of technical expertise appropriate to the people who'll use the tool.

Security and compliance

Security is a critical component of any IT system. For a cloud-based data pipeline, consider:

  • Whether the vendor encrypts data at motion and at rest. If a vendor doesn't do that automatically, be sure you can enable encryption if you need it.
  • How much of the security controls are user-configurable
  • Whether the vendor offers API key management
  • Whether the vendor uses HTTPS for web-based data sources
  • What options the vendor offers for connecting to data sources and destinations. For instance, can it operate within your virtual private cloud (VPC)?
  • Whether it uses Secure Shell (SSH) for strong authentication
  • On what schedule the vendor deletes your data after it's copied to your destination

Hand in hand with the issue of security is that of compliance with national and international data security standards. Find out what certifications each platform has. Even certifications you don't care about today are relevant because they show how serious the vendor is about security. HIPAA, SOC 2, and GDPR compliance are three of the most common measures.

Batch and stream processing

Just about all ETL tools do batch extraction from data sources — replicating new data at scheduled intervals. Some also support stream processing for cases where you need to access and report on events as soon as they occur. Not all analytics need streaming — a Stitch blog post talks more about the right latency for replication depending on your use case.

Stability and reliability

When we talk about stability, we're really talking about uptime. Determine how much downtime you can tolerate and look at the vendor's service level agreement (SLA), which outlines what percentage of time they guarantee to be up, and if they aren't, what you're owed. Find out how well the vendor has stuck to its SLAs over time. Stitch, for instance, has a great record. Evaluating a platform for stability and reliability also include ensuring that the data being captured is accurate and that it arrives in a reasonable timeframe — two issues that require testing.

Pricing

Unlike other factors, pricing isn't a technical point, but other things being equal, pricing can tip the balance. ETL tools may charge based on a number of factors — by the amount of data you replicate, or the number of different data sources, or the number of users authorized to use the software — and comparable services may differ in price by an order of magnitude. Some providers have clear pricing post on their websites; others won't tell you how much you have to pay until they know more about your use case. Look for options such as a platform that offers a free trial for new users, free historical data loads, and free loads of data from new data sources. And, keeping in mind the issue of scalability, understand how your costs will change as your data volume grows.

Compatibility with third-party tools

The third-party tools your users depend on should be compatible with your ETL tool, or has a way to create the compatibility you need through APIs, webhooks, or other software. For example, you might want your ETL tool to provide notifications via email or Slack or through a notification service like PagerDuty or Datadog.

Data transformations

When most data warehouses were expensive in-house appliances, the ability to perform preload transformations within the data pipeline was critical. Today, however, with most organizations rolling out new data warehouses on cloud platforms, data teams perform transformations after the data has been loaded, using a modeling tool like dbt or Talend Data Fabric or just SQL.

There you have it — 11 factors to consider when evaluating ETL tools. You're unlikely to find one tool that tops the list in all of these categories. Think about which factors are must-have and which are just nice to have, and weight your assessments accordingly.

Try Stitch for free

Email Address

Hands-on evaluation

Once you've considered all your options, draw up a short list of ETL solutions and test them in your own environment with your own data. Choose a few data sources and go through the process of setting up a connection and replicating data to your data warehouse or data lake. Try both your most critical and your most niche and difficult-to-support tools, and include at least one data set that you've historically had problems syncing. Make sure to test for:

  • Usability: Go through the flow of adding a destination, adding a few sources, and performing a few syncs. Add a few tables, then unsync them. Examine the resulting logs. In other words, go through that process of testing all kinds of functions, even ones you might not need immediately but that might be part of your workflow going forward.
  • Synchronization and integration: Test out a few integrations to learn how easily you can set up a data source and how reliably the ETL tool sends the data at the frequency you need. If you're adding new fields, columns, or tables, for example, does it automatically pick them up? If your Salesforce account adds a new table, would it automatically sync that table, or would you need to manually add it? How difficult is it to add a table? If there's an issue with a specific integration, how easy is it to figure out the problem? Does the tool help you solve it?
  • Timeliness: Can you get all the data you want to your destination on a schedule that meets your data analysts' needs?
  • Accuracy: To make sure the data the tool is sending is accurate, test and set up a few data sets from various data sources, then analyze the results and compare source and destination.

And you don't have to go it alone. Ask peers for tool recommendations. Another way to get good information is by visiting forums like Reddit's /r/dataengineering, or Locally Optimistic, or review sites like G2.

Once you've thought through all the factors, brought in some candidates for hands-on testing, and figured out what your costs are likely to be, you have all the information you need to sign up for an ETL service.

Stitch: ETL made simple

When it comes to ETL tool evaluation criteria, Stitch stacks up well in all categories. Stitch is cost-effective, easy to implement, and provides high-quality logs for maintenance and debugging. It's as close to "set it and forget it" as you'll find. Try Stitch for free today.

Special thanks to independent analytics consultant Dylan Baker for his help on this project.

Image credit: Huebi

Try Stitch for free

Email Address


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK