Know Your Data Storage Options on the Cloud

What differentiates them and what AWS resources are available?

11. Aug 2022


Big Data is just that: big. According to Statista, a leading provider of market and consumer data, data creation will exceed 180 zettabytes by 2025, approximately 118.8 zettabytes more than in 2020. That’s a lot of data that has the potential to be translated into business intelligence (BI). It’s also a lot of data that must be stored somewhere so the information can be processed, analyzed, viewed, and distributed. For many organizations, that somewhere will be in the AWS cloud. And for solution architects, that means choosing the most suitable AWS data storage repository.

There are various options, and the appropriate choice will depend on numerous factors. Among them: whether the data is structured or unstructured and if it’s currently in use or its usage is to be determined.

Before making any decisions, however, it’s essential to understand how the different repositories work, what differentiates them from one another, and which AWS resources are available.

The Data Warehouse

A data warehouse is a central repository of information that can be analyzed to make more informed decisions. It ingests structured data with predefined schema from transactional systems, relational databases, and other sources. It then connects that data to downstream analytical tools used by business analysts, data engineers, data scientists, and decision-makers.

Data warehouse architecture consists of tiers. The top tier is the front-end client. It presents results through reporting, analysis, and data mining tools. The middle tier consists of an analytics engine that accesses and analyzes data. The bottom tier is a database server, where data is loaded and stored.

Data that’s frequently accessed is stored in very fast storage, such as solid-state drives (SSD). If it’s accessed infrequently, it’s stored in a lower-cost object store, like Amazon S3. The data warehouse will automatically move frequently accessed data into faster storage to optimize query speed.

Data warehouses follow a schema-on-write data model. The source data must fit into a predefined structure (schema) before entering the warehouse. This is usually accomplished through an extract-transform-load (ETL) process. You must know how the data will be used so you can optimize the structure before it enters the warehouse.
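
To make schema-on-write concrete, here is a minimal sketch in Python using boto3's Redshift Data API: the table structure is declared first, and only then is already-transformed data loaded into it. The cluster, database, user, table, S3 path, and IAM role are hypothetical placeholders, not a recommended setup.

```python
import boto3

# Redshift Data API client (region and credentials assumed to be configured).
client = boto3.client("redshift-data", region_name="us-east-1")

# Schema-on-write: the table definition exists before any data arrives.
ddl = """
CREATE TABLE IF NOT EXISTS sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    order_date  DATE
);
"""

# The source files must already match this schema (the "transform" in ETL)
# before the COPY succeeds. Bucket and IAM role below are placeholders.
copy_cmd = """
COPY sales
FROM 's3://example-bucket/cleaned/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS CSV;
"""

# batch_execute_statement runs the statements in order as one transaction.
client.batch_execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="analytics",
    DbUser="admin",
    Sqls=[ddl, copy_cmd],
)
```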

Storage and compute resources are tightly coupled, so ingesting more data into the warehouse requires more ETL. That entails more computation, which increases time, cost, and complexity. Defining the schema also requires planning in advance.

Among AWS data warehouse resources, the most prominent is Amazon Redshift. It offers petabyte-scale data warehousing and exabyte-scale data lake analytics together in a single, pay-only-for-what-you-use service. AWS also offers a broad set of managed services that can be used to quickly deploy end-to-end analytics and data warehousing solutions.
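
A minimal sketch of querying such a warehouse through the Redshift Data API with boto3 follows; all resource names are placeholders. The API is asynchronous, so the statement is submitted first and its results fetched once it has finished.

```python
import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Submit the query; the Data API returns a statement Id immediately.
resp = client.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="analytics",
    DbUser="admin",
    Sql="SELECT order_date, SUM(amount) AS revenue FROM sales GROUP BY order_date;",
)
statement_id = resp["Id"]

# Poll until the statement reaches a terminal state.
while client.describe_statement(Id=statement_id)["Status"] not in (
    "FINISHED", "FAILED", "ABORTED"
):
    time.sleep(1)

# Fetch and print the result rows.
result = client.get_statement_result(Id=statement_id)
for row in result["Records"]:
    print(row)
```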

The Data Mart

A data mart is a data warehouse or a portion of a data warehouse but is intentionally limited in scope. It’s focused on a specific functional area or subject matter, and usually serves the needs of a single team or business unit, like finance, marketing, or sales.

Data marts can be created quickly because of their limited coverage. They’re simple to design, build, and administer, and can be built from a large data warehouse, operational stores, or a combination of the two.

Because it’s condensed and summarized, data mart information derived from the broader data warehouse gives each department access to data that is more focused on its own operations. There’s less data in the data mart, so processing overhead is reduced and queries run faster. Because data marts concentrate on specific functional areas, however, querying across areas can become complex.
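
One common way to carve a data mart out of a larger warehouse is a dedicated schema of narrow views over warehouse tables. The sketch below shows the idea with Redshift-style SQL submitted from Python; the schema, tables, and columns are purely illustrative.

```python
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# A "marketing" data mart exposed as a schema of focused views.
# All object names here are illustrative placeholders.
statements = [
    "CREATE SCHEMA IF NOT EXISTS marketing_mart;",
    """
    CREATE OR REPLACE VIEW marketing_mart.campaign_revenue AS
    SELECT c.campaign_id, c.campaign_name, SUM(s.amount) AS revenue
    FROM warehouse.sales s
    JOIN warehouse.campaigns c ON s.campaign_id = c.campaign_id
    GROUP BY c.campaign_id, c.campaign_name;
    """,
]

# Run the statements in order as one transaction.
client.batch_execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="admin",
    Sqls=statements,
)
```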

Data marts that are fed directly from source data can also generate inconsistent information. Those fed from an existing data warehouse avoid inconsistency issues.

The Data Lake

A data lake is a centralized data repository that allows for storing, governing, discovering, and sharing structured, semi-structured and unstructured data at any scale. It eliminates data silos by acting as a single landing zone for data from multiple sources.

Unlike data warehouses, data lakes ingest all data types in their source format. This encourages a schema-on-read process model.

One of the advantages of schema-on-read is that it results in loose coupling of the compute and storage resources for maintaining a data lake. Bypassing the ETL process means you can ingest large volumes of data into a data lake without the time, cost, and complexity that usually accompanies the ETL process. Instead, compute resources are consumed at query time where they’re more targeted and cost-effective.
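
Schema-on-read can be illustrated with Amazon Athena: raw files stay in S3 in their source format, and a table definition is applied only when the data is queried. The sketch below uses boto3; the bucket names, database, and columns are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Schema is applied at read time: the external table simply points at raw
# CSV files already sitting in the lake. No ETL was needed to land them.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
    event_id   STRING,
    user_id    STRING,
    event_type STRING,
    ts         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/raw/events/';
"""

# Submit the DDL; Athena writes query output to the result bucket (placeholder).
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```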

Data lakes also make it easy and cost-effective to store large volumes of organizational data, including data without a clearly defined use case. The downside is that, without organization, governance, or integration with known ETL or analytics tools, data lakes can easily become data swamps.

AWS data lake resources include Amazon S3 (object storage); AWS Lake Formation, a service that makes it easy to set up a secure data lake in days; Amazon S3 Glacier and Glacier Deep Archive, low-cost Amazon S3 cloud storage classes for data archiving and long-term backup; AWS Backup, a cost-effective, fully managed, policy-based service that simplifies data protection at scale; AWS Glue, a serverless data integration service; and AWS Data Exchange, which makes it easy to find, subscribe to, and use third-party data in the cloud.
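
As a small illustration of how S3 and the Glacier storage classes typically fit together in a lake, the sketch below uploads a raw file and adds a lifecycle rule that moves cold data to Glacier Deep Archive after 180 days. The bucket, key, and rule are assumptions for the example, not a prescribed configuration.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Land a raw file in the data lake bucket (bucket and key are placeholders).
s3.upload_file("events.json", "example-data-lake", "raw/events/2022/08/events.json")

# Lifecycle rule: after 180 days, transition objects under raw/ to Glacier Deep Archive.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```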

The Data Lakehouse

A data lakehouse combines the flexibility, scale, and cost-efficiency of data lakes with the atomicity, consistency, isolation, and durability (ACID) transactions of data warehouses. It enables querying data across a data warehouse, data lake, and operational databases to gain faster, deeper insights that aren’t possible otherwise. Data can be stored in open file formats in a data lake and queried in place while joining with data warehouse data.

A data lakehouse has a dual-layered architecture: the warehouse layer sits on top of the data lake, enforcing schema-on-write and providing the quality and control needed for BI and reporting. On AWS, Amazon Redshift powers the lake house architecture.
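
On AWS this pattern often takes the form of Redshift Spectrum: an external schema maps a Glue Data Catalog database over S3 into Redshift so warehouse tables and lake files can be joined in a single query. A sketch follows; the schema, tables, and IAM role are placeholders, and results would be fetched as in the earlier Data API example.

```python
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Map a Glue Data Catalog database into Redshift as an external schema.
# Database name and IAM role are placeholders.
create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role';
"""

# Join a warehouse table with an external table over raw S3 files.
# The lake column is a string in this example, hence the cast.
join_query = """
SELECT s.order_id, s.amount, e.event_type
FROM sales s
JOIN lake.raw_events e
  ON s.order_id = CAST(e.event_id AS BIGINT);
"""

for sql in (create_external_schema, join_query):
    client.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=sql,
    )
```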

Conclusion

A technology partner can help a company select and implement the data repository that is best suited for their application development projects. The right partner can help ensure that the applications incorporate the most effective, cost-efficient data management solutions.

