
E-Series as Tier One for multi-tiered Kafka clusters

28 Jun 2022 - 7 minute read

I’ve blogged about the unnecessary, decade-long abuse of JBOD and DAS storage in the context of Hadoop, Splunk, Elastic, Vertica and other platforms and applications.

Rather than belabor the points that most people already know by now, I’ll keep this one short.

Multi-tiered storage in Kafka clusters

Confluent introduced Tiered Storage in Confluent Platform 6.

Tiering is done similarly to how it’s done elsewhere:

  • Small Tier 1 for hot storage
  • Big Tier 2 for warm/cold storage

Figure: Kafka with Tier 1 on E-Series SAN and Tier 2 on Object Storage
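
For orientation, here is what enabling this looks like on the Confluent Platform side. The snippet below is just a minimal Python sketch that prints a server.properties fragment; the confluent.tier.* property names and all values (bucket, region, hot set retention) are examples based on my reading of Confluent's docs, so verify them against the documentation for your CP version.

```python
# Illustrative only: assemble a server.properties fragment for Confluent
# Tiered Storage. Property names and values are assumptions based on
# Confluent Platform docs -- verify them for your CP version before use.
tiering = {
    "confluent.tier.feature": "true",           # enable the tiering feature
    "confluent.tier.enable": "true",            # tier new topics by default
    "confluent.tier.backend": "S3",             # object storage backend
    "confluent.tier.s3.bucket": "kafka-tier2",  # hypothetical bucket name
    "confluent.tier.s3.region": "us-east-1",
    # Hot set: how long data stays on Tier 1 (E-Series volumes) per partition
    "confluent.tier.local.hotset.ms": str(7 * 24 * 3600 * 1000),  # ~7 days
}

if __name__ == "__main__":
    # Print lines that could be appended to each broker's server.properties
    for key, value in tiering.items():
        print(f"{key}={value}")
```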

With this approach we can:

  • Use smaller and cheaper servers
  • Save rack space and energy
  • Deploy, maintain and upgrade with ease
  • Lower software licensing and maintenance fees
  • Gain agility and simplicity
  • etc (you get the idea)

How to leverage E-Series

Technically we don’t “need” to use E-Series (or other SAN) here. It’s fine to use internal NVMe in RAID1, for example.

But we’ve already been through this with Splunk and other apps: once a deployment gets to 10 broker/indexer/whatever servers, the application owner realizes they have 20 internal NVMe disks in RAID1, and that one E-Series EF300 with 12 disks in a RAID10 group would have been the same or better value. And they still purchased some external storage for other applications, databases, VMs and more, effectively spending more to get less.

How to leverage E-Series as Tier 1 storage for Kafka? The same way we’d do it for Splunk, Vertica and other applications that use this tiering pattern:

  • Configure EF300 (up to 10 GB/s write) or EF600 (up to 20 GB/s write) with a RAID1 (2 disks) or RAID10 (4-24 disks) volume group for Kafka brokers
    • The larger model, EF600, can deliver full performance with just one (controller) shelf full of disks (24 NVMe), so you can roughly figure it delivers 900 MB/s per populated slot
    • The smaller model, EF300, delivers approximately half of that per slot
    • Size for performance first, pick the number of slots, and then size for capacity by choosing the right NVMe disk size (currently ~2-15 TB disks are available)
  • Create N or N*2 volumes for N broker servers
  • Connect N broker nodes to E-Series using direct attach (for 2-4 nodes) or SAN (more than 4 nodes)
    • Use iSCSI for 25 Gbps, FC or NVMe/FC for 32 Gbps, or InfiniBand for higher speeds

That would take care of the hot Kafka tier, which can be small (according to Confluent, between 0.1 and 1 TB). Some related best practices can be found here.
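
To make the steps above concrete, here is a back-of-the-envelope sizing sketch. All inputs (ingest rate, replication factor, hot set retention, available disk sizes) are made-up examples to be replaced with your own numbers; the per-slot throughput figure is the rough EF600 estimate from above (use about half for EF300).

```python
import math

# Assumed example inputs -- replace with your own measurements/targets
ingest_mb_per_s    = 800     # aggregate producer write rate (MB/s)
replication_factor = 2       # Kafka replication factor for Tier 1 data
hot_hours          = 4       # how long segments should stay on Tier 1

# Per-slot throughput estimate from the text: EF600 ~900 MB/s per populated
# slot, EF300 roughly half of that
mb_per_s_per_slot  = 900     # use ~450 for EF300
available_disk_tb  = [1.92, 3.84, 7.68, 15.36]   # typical NVMe sizes, ~2-15 TB

# Replication multiplies the write load the array actually sees
array_write_mb_per_s = ingest_mb_per_s * replication_factor

# 1) Size for performance: number of populated slots (RAID10 needs >= 4, even)
slots = max(4, math.ceil(array_write_mb_per_s / mb_per_s_per_slot))
slots += slots % 2

# 2) Size for capacity: hot set in TB, then the smallest disk size that fits
hot_set_tb = array_write_mb_per_s * 3600 * hot_hours / 1e6   # MB -> TB
needed_disk_tb = 2 * hot_set_tb / slots                      # RAID10 usable = raw/2
disk_tb = next((d for d in available_disk_tb if d >= needed_disk_tb), None)

print(f"array writes: {array_write_mb_per_s} MB/s -> {slots} populated slots")
print(f"hot set: {hot_set_tb:.1f} TB -> disk size: {disk_tb} TB "
      f"(usable {slots * (disk_tb or 0) / 2:.1f} TB)")
```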

If you need extra capacity, other disk slots can be populated with different-sized disks and used for other applications (VMware, Tanzu/Kubernetes, software-defined S3, databases, etc.). This disk group can be protected with DDP (a RAID6-like protection scheme).

I suspect all-DDP configurations (DDP requires a minimum of 11 slots/disks) should be good enough in most cases, but haven’t had a chance to verify this in practice. If we started with 11 small disks, we could grow DDP by adding as few as 1 disk and as many as 13 (11 + 13 = 24), gaining more performance and capacity while:

  • lowering disk reconstruction time (see this)
  • lowering the impact of data reconstruction on performance due to disk failure (with DDP it’s normally around 25%), and
  • decreasing DDP overhead (with DDP, a full EF300/EF600 controller shelf would have a raw-to-usable ratio of 24/22, i.e. only about 9% overhead, compared to about 22% with 11/9 in the minimal DDP deployment - both are significantly better than RAID10 or a bunch of RAID1 islands in servers)

Remember that all disks in a DDP should be the same size. Here are three example configurations comparing RAID10 with DDP that reserves 2 disks’ worth of spare capacity:

Protection   Disks   Raw/Usable   Raw/Usable %
RAID10       8       8/4          200
DDP(2)       11      11/9         122
DDP(2)       20      20/18        111
  • Usable capacity (TB) for high-DWPD environments (see pages 71 and 72 of this TR) will be lower than expected, i.e. when sizing for 5 DWPD we should leave approximately 28% of usable disk capacity unprovisioned if we want to minimize disk failures due to extra wear
    • For example, 8 * 1.92 TB * 5 DWPD equals roughly 77 TB/day, or close to 1 GB/s; although the array can deliver more than 1 GB/s, writing > 1 GB/s 24 x 7 would result in more DWPD than the disks are rated for, potentially increasing failure rates
    • DDP uses spare capacity (not dedicated spare disks), while RAID10 would need at least one hot spare (not included in the overhead figures above)
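
Here is a quick sanity check of the overhead and endurance arithmetic above. It mirrors the table (raw vs. usable after the 2-disk spare reserve, without counting D-stripe parity) and the 8 x 1.92 TB, 5 DWPD example:

```python
# Raw-to-usable ratios for the configurations in the table above
def raid10_overhead(disks):
    usable = disks / 2                  # mirrored pairs
    return disks / usable               # e.g. 8/4 = 2.00 -> 200%

def ddp_overhead(disks, spare_disks=2):
    # Counts only the ~2-disk spare reserve, like the table; D-stripe
    # parity inside the pool is not included here.
    usable = disks - spare_disks
    return disks / usable               # e.g. 11/9 ~ 1.22 -> 122%

print(f"RAID10 8 disks:  {raid10_overhead(8):.0%} raw-to-usable")
for disks in (11, 20, 24):
    print(f"DDP(2) {disks} disks: {ddp_overhead(disks):.0%} raw-to-usable")

# Endurance check: 8 x 1.92 TB disks rated at 5 DWPD
disks, disk_tb, dwpd = 8, 1.92, 5
tb_per_day = disks * disk_tb * dwpd     # ~77 TB/day
gb_per_s   = tb_per_day * 1e3 / 86400   # ~0.9 GB/s sustained
print(f"max sustained writes within rating: {tb_per_day:.0f} TB/day "
      f"(~{gb_per_s:.2f} GB/s)")
```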

Depending on the capacity of, and use case for Tier 1, Object Storage may turn out to be “hot” (or not):

  • if Kafka consumers need fast response for data that doesn’t go beyond 7 days and Tier 1 can hold 14 days of data, Object Storage could be slow and/or remote
  • if Kafka’s Tier 1 storage is tiny (0.1 TB per broker, for example), obviously it’s unlikely it will be able to hold weeks of data, which means Object Storage will be busy, and should be fast and/or located nearby (e.g. on-prem, and maybe even use only SSD/NVMe media)
  • some Object Storage, such as NetApp StorageGRID, can consist of heterogeneous nodes (e.g. all-flash and NL-SAS nodes) and use ILM policies to adjust not only where data is placed, but also how
    • 2 Copies on All Flash nodes for all objects within 30 days of creation (gives us faster read performance for recent data)
    • Erasure Coding 2+1 with NL-SAS HDD placement for all *.segment objects older than 30 days (lowers object storage software overhead from 200% (RF2) to 150% (EC 2+1); both are layered on top of R6-equivalent volumes, a StorageGRID requirement)
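
To illustrate the last two bullets, here is a small sketch of the end-to-end raw capacity multiplier for each ILM rule. The 8+2 RAID6 group width underneath each storage node is my assumption for illustration; StorageGRID only requires RAID6-like protection, not a particular width.

```python
# Raw capacity needed per 1 unit of object data, for the two ILM rules above.
# The 8+2 RAID6 width underneath the storage nodes is an assumption.
raid6_multiplier = 10 / 8          # 8 data + 2 parity -> 1.25x raw per usable

rf2   = 2.0                        # 2 copies -> 200% at the object layer
ec_21 = 3 / 2                      # EC 2+1   -> 150% at the object layer

for name, obj_multiplier in [("RF2 (all-flash, <30 days)", rf2),
                             ("EC 2+1 (NL-SAS, >30 days)", ec_21)]:
    total = obj_multiplier * raid6_multiplier
    print(f"{name}: {obj_multiplier:.2f}x object layer x "
          f"{raid6_multiplier:.2f}x RAID6 = {total:.2f}x raw per usable TB")
```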

Kafka Tier 1 on E-Series with R10 would result in less usable capacity but faster performance, and would normally use one hot spare (not displayed).

DDP delivers lower performance for the same number of disks, but DDP’s overheads are very limited and it can tolerate two concurrent disk failures. The DDP reserve is shown in yellow; normally it amounts to two disks’ worth of capacity for reconstruction in case one or two disks fail. Other than that, DDP requires no dedicated hot spares, so its overhead advantage over RAID6 or RAID10 is even better than the table above suggests.

For classic Kafka we’d use two E-Series arrays and RF2, but with tiered Kafka there’s probably no need for multiple arrays, so I’d consider using a single array (assuming sufficient performance and capacity) with either R10 (or multiple R1) or DDP, not both.

Figure: Kafka with Tier 1 on E-Series R10 or DDP and Tier 2 on Object Storage with RF2 and EC 2+1

Object Storage (at the bottom) could use multi-replica or Erasure Coding. Some S3 software, such as StorageGRID, requires that storage nodes have protected storage (here, R6); other software does not, but then recovery from disk failures takes longer and has more impact.

Kafka supports compression (GZip, Snappy, LZ4, zstd) which - after eating up some CPU and memory resources on producers or brokers - has two positive effects on storage sizing:

  • Lowers network bandwidth, storage I/O, and capacity requirements (or lets us get 100% more performance and capacity out of the same E-Series array, assuming 50% savings from compression)
  • Lowers wear on SSDs (e.g. from 3 DWPD to 1.5 DWPD), allowing for less hold-back of usable storage capacity
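
Here is a tiny sketch of that effect on the sizing inputs. The 50% savings figure is the same assumption as above; real compression ratios depend on the codec and message content.

```python
# How a compression ratio changes what the array (and its SSDs) actually see.
# 0.5 means compressed data is 50% of the original size (the assumption above).
compression_ratio = 0.5

logical_mb_per_s = 1600     # producer write rate before compression (MB/s)
workload_dwpd    = 3.0      # drive writes/day the uncompressed workload causes

physical_mb_per_s = logical_mb_per_s * compression_ratio
effective_dwpd    = workload_dwpd * compression_ratio    # 3 DWPD -> 1.5 DWPD

print(f"physical writes: {physical_mb_per_s:.0f} MB/s "
      f"(vs {logical_mb_per_s} MB/s logical)")
print(f"SSD wear: {effective_dwpd:.1f} DWPD instead of {workload_dwpd:.1f} DWPD "
      f"-> less usable capacity needs to be held back")
```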

Conclusion

Various if-else statements in this post make it impossible to use simple rules of thumb without knowing more details about the workload, but that’s unavoidable - if the inputs are unknown, it’s impossible to provide correct outputs. Fortunately, Kafka generates mostly sequential(ized) write I/O, so it can be sized relatively easily once the requirements are known.

The NetApp solution guide for Kafka sizing has a detailed list of inputs that need to be gathered, and can be viewed here.

Kafka prototyping can be done on any storage (even RAM disks) - we just need to find the capacity and performance requirements, which can then be fed into an E-Series sizing tool.

When sizing capacity and performance, and choosing a RAID/DDP configuration, I suggest doing so with our next likely step in mind; while we can’t know what we’ll need 5 years from now, we probably have a fairly decent idea of what we will need a year or two from now. If Kafka Tier 1 storage is busy and likely to grow a lot, then EF600 and RAID10 may be more appropriate than EF300 and DDP, for example.

