source link: https://www.gigaspaces.com/blog/data-lakes-or-digital-integration-hubs-which-one-is-better-suited-to-solve-the-it-gap

Data Lakes or Digital Integration Hubs: Which one is better suited to solve the “IT GAP”?

14 min. read

Enterprises, more than ever, require modernization of their backend and middleware architecture to improve performance for the digital age, lower the TCO of their infrastructure, and optimize the moving parts of the IT and digital services departments. Becoming a digital leader, with a strong emphasis on transaction affiliation, the ability to process high volumes of data, and a compelling cloud journey, is a shared goal for multiple stakeholders within organizations.

In my recent dialogues with IT and business executives, some of the key challenges they raise derive from a gap between the growing appetite for digital applications, and the pace at which data is being modeled and served to the business applications. Architects often refer to this phenomenon as the “IT Gap”. This gap can create delays in the delivery of new services, slowing down the ability to scale and ultimately leading to inefficiencies across the board.


The disconnect between business and IT negatively impacts the overall customer experience. Bridging this gap requires organizations to shift their focus from IT operations to delivering positive customer experiences. As part of this shift, organizations face numerous new questions and challenges such as:

  • How to create consistency across all channels, brands and devices?
  • How to contextualize digital services based on real-time circumstances, location and indirect referential data?
  • How to serve data to services in a proper fashion and a timely manner to meet an individual customer’s needs and expectations?
  • How to deliver optimum personalized digital experiences?

To understand the technical gap that organizations must overcome in tackling these challenges, we’ll break down the components that are part of this ecosystem – and then rebuild it, better.

How to choose a transactional platform that best fits your organization’s data needs?

Enterprises should opt for a holistic data architecture across the organization rather than having separate technology stacks for each of their digital products. Different digital services require different data objects and models, often increasing the complexity and impacting data integrity across channels. Many public sources, such as technical blogs, compare the pros and cons of relational databases, NoSQLs, data warehouses (DWH) and data lakes. This wide range of data stores and technologies inadvertently causes confusion in the industry about which should be used to do what.

As a general rule, before jumping into the actual technologies, I’d like to suggest separating the analytical platforms from the transactional ones, as the use cases they solve are of a completely different nature. Organizations also need to figure out what portion of their data is operational, to avoid turning the DWH into something it is not. By focusing on the purpose each technological solution is employed to serve, we can address each component in the proper context of the enterprise architecture and optimize utilization and costs.

When considering the leading solutions as part of modernizing your enterprise architecture, the following factors should be taken into account:

  • Continuous data integration
  • Data consumption and exposure
  • SQL interfaces
  • Data compression
  • Multiple native stacks vs. a fully integrated solution
  • Supported data formats
  • How each data store solution updates data

The following table is based on comparison tables that were previously publicly published, comparing industry leaders in data storage across six key factors:

CDC
  • Snowflake: Built-in (streams)
  • AWS data lake (S3): Achieved using various tools such as AWS Glue, Athena and Spark
  • Databricks (Delta Lake): Achieved using ETL tools
  • GigaSpaces Smart DIH: Embedded, yet can work with any other CDC vendor

Consuming / Exposing Data
  • Snowflake: Drivers: JDBC, ODBC, .NET, and Go. Connectors: Node.js, Python, Spark, and Kafka. APIs: Java & Python to simplify working with REST APIs.
  • AWS data lake (S3): REST API, JDBC & ODBC drivers. Connectors for JS, Python, PHP, .NET, Ruby, Java, C++ and Node.js.
  • Databricks (Delta Lake): Delta ACID API for consumption and a Delta JDBC connector for exposure.
  • GigaSpaces Smart DIH: Drivers: JDBC, ODBC. Connectors: Kafka. APIs: Java, .NET, REST. Access patterns: Document API, Key/Value API, Object API, SQL. Roadmap: GraphQL.

SQL Interface
  • Snowflake: Built-in via Worksheets
  • AWS data lake (S3): Needs Athena/Presto (additional cost)
  • Databricks (Delta Lake): Apache Spark SQL, Azure SQL Data Warehouse/DB
  • GigaSpaces Smart DIH: Native PG-Wire (Postgres compatible) for JDBC/ODBC, supporting ANSI-99

Compression (Data Storage)
  • Snowflake: Automatically compresses the file as it stores data in a columnar format (4:1 ratio)
  • AWS data lake (S3): Can be achieved manually using EC2 machines
  • Databricks (Delta Lake): Efficient compression using the Apache Parquet file format
  • GigaSpaces Smart DIH: Optimized low-latency in-memory compression and SSD compression while enabling multi-tier storage

Supported Formats
  • Snowflake: Structured & semi-structured data
  • AWS data lake (S3): Structured, semi-structured & unstructured data
  • Databricks (Delta Lake): Structured, semi-structured & unstructured data
  • GigaSpaces Smart DIH: Structured, semi-structured & unstructured data

Data Updates
  • Snowflake: Updates the specific rows in the table with new values where the condition matches
  • AWS data lake (S3): Can’t update data in S3, only read and rewrite the entire object back to S3
  • Databricks (Delta Lake): Can update specific values in the data where the condition matches
  • GigaSpaces Smart DIH: Can update specific values in the data where the condition matches, as well as using the CHANGE method to update specific properties within an object/table
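The “SQL Interface” row above notes a native PG-Wire (Postgres-compatible) endpoint, which suggests a standard PostgreSQL JDBC driver can query the DIH like any SQL database. A minimal sketch, assuming a hypothetical host, database, credentials and customer_accounts table (none of which come from this article):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DihSqlQueryExample {
    public static void main(String[] args) throws Exception {
        // PG-Wire compatibility means the stock PostgreSQL JDBC driver should work here;
        // host, port, credentials and table name are illustrative placeholders.
        String url = "jdbc:postgresql://dih.example.com:5432/demo";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT account_id, balance FROM customer_accounts WHERE customer_id = ?")) {
            stmt.setLong(1, 42L);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("account=%d balance=%.2f%n",
                            rs.getLong("account_id"), rs.getDouble("balance"));
                }
            }
        }
    }
}
```

The same query could equally be issued through the ODBC driver or the REST API listed in the table; the point is that the operational store is reachable with ANSI SQL tooling rather than a proprietary query layer.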

Rebuilding with a futuristic vision

Using a Digital Hub for real-time sync between digital apps, business services and backend systems

Here’s something that won’t come as a shock to you: building software architecture is complex. Architects need to sync multiple data sources, multiple data types and pipelines, and the transformations that run between these sources.

One well-established notion is that data lakes and data warehouses fall short with Event-Driven Architecture, as they are unable to serve APIs quickly, with high concurrency.

First, the ingress – moving data to data lakes and warehouses is an offline or batch process, which almost always results in a built-in delay and high latency if the data is served from them.

Second, the egress – most solutions utilize SQL and REST APIs above the data lake, which is simply not fast enough to meet the latency demand of business applications.

To cope with these shortcomings, application developers started building small databases adjacent to business applications, often referred to as “data marts” or “local caches”. This pattern causes excessive data duplication across the different marts, adds to overall latency, and creates further inefficiencies. Even worse, it often compromises data integrity between channels or applications. A common symptom: executing a basic “get my account information” query returns different results on the mobile app than on the website – a true story that happened to me with a local credit card company.

A Digital Integration Hub (DIH) eliminates this workaround and its related issues by decoupling business applications from the backend systems of record (SoRs) through event-based or batch replication patterns. The organization’s operational data is reflected in a consolidated fabric that powers real-time access through microservices exposing the relevant APIs – thereby accelerating API serving.

Data Integration

The advantage of the DIH is evident in GigaSpaces’ powerful tool, the Data Integration (DI) layer.

Any database can ingest data via ETL or CDC, and these tools can be integrated with common databases and message brokers, so you might ask: what’s the big deal here?

Here’s the thing: the initial integration is not all that complicated. The truly hard work begins after integration, when architects, DBAs and developers have to do all kinds of wrangling to solve common integration challenges in existing systems, with countless production workflows that often carry indirect dependencies due to modern event-driven and API-based patterns. Before diving into the different challenges, let’s examine the simple data extraction and ingestion pipeline and what we need to handle (a brief sketch of the first two items follows the list):

  • Data conflicts and reconciliation
  • Multiple CDC streams
  • Concurrent Initial Load and CDC without any downtime to data access or business services
  • Schema evolution or adding new/existing tables dynamically to an ongoing CDC without restarting the service
  • Scaling CDC streams to align with higher ingress/egress
  • Handling logical data misalignments
  • Metadata management and “tagging” data to map relationships between data and services
  • Data freshness validation
  • Data integrity between the DB and the “System of Engagement” (SOE)
  • Reflecting transactional data from multiple tables in the SOE when pushing to a restreaming service
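To make the first two items concrete, here is a minimal sketch of reconciling change events that arrive from multiple CDC streams, using a simple last-writer-wins policy keyed on the source commit timestamp. The ChangeEvent shape, field names and the policy itself are illustrative assumptions, not the GigaSpaces implementation:

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CdcReconciler {

    // A simplified change event; real CDC payloads (e.g. Debezium-style) carry far more metadata.
    public record ChangeEvent(String table, String key, Map<String, Object> row, Instant sourceCommitTime) {}

    // Latest accepted event per (table, key) - the reconciliation state for this sketch.
    private final Map<String, ChangeEvent> latest = new ConcurrentHashMap<>();

    /** Applies an event only if it is newer than what we already hold (last-writer-wins). */
    public boolean apply(ChangeEvent incoming) {
        String id = incoming.table() + "/" + incoming.key();
        ChangeEvent merged = latest.merge(id, incoming,
                (current, candidate) ->
                        candidate.sourceCommitTime().isAfter(current.sourceCommitTime()) ? candidate : current);
        boolean accepted = merged == incoming;
        if (accepted) {
            upsertIntoOperationalStore(incoming); // write-through to the operational store
        }
        return accepted;
    }

    private void upsertIntoOperationalStore(ChangeEvent event) {
        // Placeholder: a real pipeline would upsert into the DIH / operational data store here.
        System.out.println("UPSERT " + event.table() + " key=" + event.key());
    }
}
```

A real pipeline would also persist this state, handle deletes and schema evolution, and apply a smarter reconciliation policy than last-writer-wins; the sketch only shows the reconciliation decision itself.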

There might be other post-integration challenges, but all the products in the market fall into one of the following categories: CDC, ETL, databases/NoSQL or microservices, and thus lack the holistic capabilities to handle the entire data lifecycle between the SoRs and the business services. Smart DIH, with its unified architecture and monitoring capabilities, seamlessly manages that entire lifecycle.

Data Synchronization

Data Lakes and Data Warehouses are not and should not be used as Operational Data Stores (ODS). Rather, the writebacks and further operational workflows under the OLTP umbrella should be implemented against the SoRs, either directly or indirectly via the ODS.

Smart DIH projects are based on multiphase implementations, starting with a “read-only” layer. Instead of services writing directly back to the Smart DIH, the OLTP systems write to the backend SoRs (either directly or indirectly). This architecture pattern is called “Command and Query Responsibility Segregation” (CQRS). While the DIH does not serve as an SoR, it represents the “single source of truth” for the multiple applications or channels that access it to retrieve operational data.
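In application code, this read/write split can be as simple as keeping query and command interfaces separate: queries are answered from the DIH’s consolidated view, while commands are routed toward the backend SoR. The interfaces and names below are purely illustrative, not a GigaSpaces API:

```java
import java.util.Optional;

public final class AccountCqrs {

    // Query side: reads are served from the DIH's consolidated, read-optimized view.
    public interface AccountQueries {
        Optional<AccountView> findAccount(String accountId);
    }

    // Command side: writes go to the backend system of record (directly or indirectly),
    // and flow back into the DIH through replication/CDC.
    public interface AccountCommands {
        void updateContactDetails(String accountId, String email, String phone);
    }

    // A read model shaped for the consuming channel, not for the SoR's internal schema.
    public record AccountView(String accountId, String displayName, double balance) {}

    private AccountCqrs() {}
}
```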

When organizations evolve the DIH implementation further, an optional write-back pattern can also be added using an asynchronous design. This method enables writing “commands” directly to the GigaSpaces fabric; a scheduled task then opens a transaction that pushes these commands to the backend SoR or to a message broker (MQ, JMS, RabbitMQ, Kafka, ZeroMQ, etc.), and the upserted data is synced from the SoR back to the Smart DIH via the embedded CDC. This advanced form of write-back changes the data indirectly by applying asynchronous, reliable FIFO command execution, using common design patterns such as Outbox and Sagas.
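A rough sketch of that asynchronous, FIFO-ordered command execution is shown below: services drop commands into an outbox, and a scheduled task drains it toward the SoR or a message broker. The class and method names are hypothetical, and a production outbox would be persisted rather than held in memory:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OutboxWriteBack {

    public record Command(String type, String payload) {}

    // The "outbox": commands written by services instead of writing to the SoR directly.
    // A durable implementation would persist these so commands survive restarts.
    private final Queue<Command> outbox = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Services call this; the write is accepted immediately (asynchronous CQRS command). */
    public void submit(Command command) {
        outbox.add(command);
    }

    /** A single scheduled task drains the outbox in FIFO order toward the SoR or a broker. */
    public void start() {
        scheduler.scheduleAtFixedRate(this::drain, 0, 1, TimeUnit.SECONDS);
    }

    private void drain() {
        Command next;
        while ((next = outbox.poll()) != null) {
            pushToSystemOfRecord(next);
        }
    }

    private void pushToSystemOfRecord(Command command) {
        // Placeholder for a JDBC call against the SoR or a publish to Kafka/RabbitMQ/JMS.
        System.out.println("Executing " + command.type() + ": " + command.payload());
    }
}
```

Once the SoR applies the change, the embedded CDC described above carries the updated data back into the DIH, closing the loop.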

One consolidated operational data store for countless applications

Organizations face a growing need to scale up their digital services rapidly. This strong digital appetite comes with growing pains in performance, cost, and manageability as the number of applications outgrows a certain comfort threshold.

Leveraging a converged, distributed real-time data fabric with an embedded lightweight Java application server provides unprecedented performance and scale that can’t be achieved when using different solutions that are manually stitched together. The benefits include maintaining data integrity via a combination of collections and normalized relational data, together with the ability to perform certain operations, such as “joins” across data in different formats.

Effective data management is the foundation for delivering strategic business value from digital services. This requires domain-oriented, decentralized data ownership, combined with a microservices-driven architecture for accessing enterprise shared data. This consolidated architecture provides more flexible and easier scaling for parallel reuse of functionality and data.


Classic microservices architecture using collections per service: Data is duplicated between collections

Multichannel integrity is achieved by reusing the same “data access services” from a single source of truth, as depicted here:


The GigaSpaces Data Fabric: Unified multi-model data store pattern

Embedded Event-Driven Architecture

Many organizations have adopted Event-Driven Architecture (EDA) methodologies and design principles as part of their data management strategy (more on this in Kai Waehner’s excellent blog). Companies such as Uber and Netflix are textbook examples of using EDA effectively. But here’s one major caveat: these are technology shops that happen to be streaming movies or orchestrating rides, with their entire budget built around these specific operations – a luxury most organizations don’t have.

To achieve a simpler architecture that also provides lower-latency, real-time responses, embedded EDA (eEDA) embeds events, message queues and notifications directly into the extreme low-latency, in-memory workflow. In contrast to traditional SOA, which involves heavy multi-process communication and data transfer, this design is a real-time fabric based on the “Spaces” principles.

To enhance the utilization of events, GigaSpaces created an architecture with the following unique characteristics:

  • Embedded Event Triggers
  • Embedded Event Management Engine
  • Embedded Event Priority Based Queues
  • Embedded Event Priority Based Clusters (grouping)
  • Embedded Outbound Messaging System (pub/sub notification pattern)

Event processing is improved immensely with co-location by injecting business logic to run in the same memory space as the data on the data fabric. The technological benefits include:

  • Durable notifications via fully durable pub/sub messaging for data consistency and reliability
  • FIFO Groups ensure in-order and exclusive processing of events
  • No need to transfer events from the data tier to the service tier
  • Related data can be co-located to the same group while parallelizing across additional groups

With reduced latency for business applications, the IT team can easily add contextual information to queries while supporting a higher overall volume of customer interactions.
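As a generic, language-level illustration of the FIFO Groups idea listed above (not the GigaSpaces API), the sketch below routes every event with the same group key to a single worker, preserving order within a group while letting different groups run in parallel; the Event shape and hash-based routing are assumptions of this sketch:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class FifoGroupProcessor {

    public record Event(String groupKey, String payload) {}

    // One single-threaded worker per partition: events in the same group stay in order,
    // while different groups are processed in parallel on other workers.
    private final List<ExecutorService> workers;

    public FifoGroupProcessor(int parallelism) {
        this.workers = IntStream.range(0, parallelism)
                .mapToObj(i -> Executors.newSingleThreadExecutor())
                .toList();
    }

    public void submit(Event event) {
        int slot = Math.floorMod(event.groupKey().hashCode(), workers.size());
        workers.get(slot).execute(() -> handle(event));
    }

    private void handle(Event event) {
        // Placeholder business logic; in a co-located design this runs next to the data itself.
        System.out.println("group=" + event.groupKey() + " -> " + event.payload());
    }

    public void shutdown() {
        workers.forEach(ExecutorService::shutdown);
    }
}
```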


Event-Driven Architecture for inbound and outbound

Reduced total cost of ownership

All architects know a simple truth: a design isn’t viable if its costs are unacceptable. Let’s keep this notion in mind when examining the trend of shifting to cloud computing in order to reduce costs.

Cloud has endless advantages; however, when used irresponsibly it can backfire without compassion. The following quote from the Firebolt blog captures this irony: “If you look at the Fivetran benchmark, which managed 1TB of data, most of the clusters cost $16 per hour. That was the enterprise pricing for Snowflake ($2 per credit). Running business-critical or Virtual Private Snowflake (VPS) would be $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price.” (At $16 per hour that is 8 credits per hour; at the $4-per-credit tier, running 8,760 hours a year comes to roughly $280K, hence the $300,000 figure.)

Thinking about operational data, we often require tens or even hundreds of TB of data, resulting in an overpriced architecture just for the data tier – before accounting for other middleware components such as CDC, ETL, Cache and others.

With the GigaSpaces consolidated data store, a unified and performance-optimized technology creates efficiency at scale. The platform reduces the need to replicate and mobilize data while simplifying data management. It substitutes costly standalone elements, driving direct and indirect cost savings by optimizing data management, reducing overall footprint, reducing usage and dependency on existing costly elements, and reducing operational load and maintenance costs.

GigaSpaces customers report a reduction in operational costs of 40-75%. The reduction in software and maintenance costs varies depending on which elements are replaced or optimized when GigaSpaces is introduced into the solution architecture stack. Here’s one example: a fully digital bank operating in Sweden made an entire stack of commercial RDBMS licenses redundant after two years of using the GigaSpaces solution, eventually replacing it with a standalone GigaSpaces DIH as the bank’s Operational Data Store.

With the GigaSpaces solution in place, enterprises can also replace standalone data replication solutions that extract data to a single ODS, and they no longer need to layer additional caching solutions, such as Redis, on top of the ODS.

Additional benefits include allowing software engineers to focus on developing new business logic instead of spending time on data-related and integration challenges, resulting in shorter time-to-service, from months to days, and reduced costs associated with human error.

Lastly, ongoing maintenance and support costs are reduced, as is the expertise required per workflow. This is achieved by standardizing data pipelines and data microservices through the no-code and low-code options provided with the GigaSpaces solution.


The blue line indicates a lower operational cost over time when using GigaSpaces Smart DIH versus a “DIY Solution” leveraging multiple products

Putting it all together – the full DIH package

After careful examination of the different technologies required to build a robust and cost-effective solution, GigaSpaces built its solution architecture for the modern operational data store in the form of a Digital Integration Hub.

The DIH enables organizations to focus on converging business and technology, reducing the stack complexity, and providing fast response time for new and upgraded digital services while reducing overall costs.

By simply upgrading a database, or adding a newer middleware component, organizations tend to improve performance in the short term, but the additional costs and overall complexity don’t provide the required ROI.

We can keep diving deep into the IT Gap and closely examine the specs of different data stores, but ironically enough the biggest challenges organizations face in digital transformation are not technological in nature. Rather, they revolve around changing the thought paradigm of managers signing off on these changes. More on this – in my next blog posts.

Stay tuned.

