3

A Conceptual Architecture for a Filesystem to SQS Loader

 3 years ago
source link: https://fuzzyblog.io/blog/lambda/2020/06/12/a-conceptual-architecture-for-a-filesytem-to-sqs-loader.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

A Conceptual Architecture for a Filesystem to SQS Loader

Jun 12, 2020

IMG_3817.jpeg

I have an interesting technical situation facing me:

  • A docker container which runs persistently and writes JSON files to a directory
  • Each JSON file contains short form messages with JSON encoding along a variety of metadata fields
  • Each file is idempotent i.e. once it exists in the directory will never be written to again
  • Each file is named for a timestamp

Each file needs to:

  • Have ridiculous, non-JSON comments removed from the beginning
  • Be parsed into json
  • There is roughly 6000 of these JSON things coming in per second
  • Have only a selection of the JSON internally be selected
  • Be batched into a collection up to the maximum size the SQS supports and then committed to SQS
  • Be processed only once

Note: SQS is the AWS "Simple Queue Service". A queue is a specialized data structure which hands data off for processing by other tasks.

My initial thinking to handle this was to use Rust to write a high performance file processor. This appealed to me:

  • New technical challenge; I don't know Rust
  • Manly - writing fast code is a conveniently macho challenge
  • Simple architecture - one box, one process watching a directory and ripping through files

There are, naturally, problems here:

  • The Rust program represents a SPOF or "Single Point of Failure"; things break
  • This architecture is simple but making it not be a SPOF means the architecture gets complex
  • We don't actually even know if a single Rust executable is fast enough to handle all the data
  • Rust is complex and even simple things take dramatically more code than a scripting language like Ruby or Python. There isn't anything wrong with Rust (I really like the language but learning it is non-trivial).

Thinking through all these issues as well as learning, by chance, that the runtime execution period for AWS lambda serverless functions had increased from 5 minutes to 15 minutes made me think in terms of a different architecture focused on using lambdas.

Note: A lambda is a self contained bit of code that you give over to AWS to manage on your behalf. Another term for Lambda is "functions as a service". You don't have to focus at all on servers, DevOps administration or the like.

Here is what I'm thinking:

  1. Add a network API to the filesystem of JSON files. This could literally be as simple as an NGINX server that listed the files.
  2. A lambda that requests a JSON file for processing per the description above and relies on a Redis dictionary to track files that have already been processed. Two dictionaries would be needed – json_files_processed and v (and, yes, there would need to be a way to expire things from json_files_processing in case a lambda crashes or is terminated; this would be a separate lambda).
  3. A CloudWatch Scheduler rule that triggers the lambda every 5 minutes

Fleshing this out further gives three lambda functions:

  • sqs_loader - the main thing which processes files and sends their contents to SQS
  • json_files_processed_archiver - pulls files that have been processed into a different directory so that they don't have to be ever checked again; alternatively this might delete them after some period of time
  • json_files_processing_checker - checks to see if files have been in the json_files_processing queue for too long

There would likely need to be 3 CloudWatch Scheduler rules one for each lambda.

The sqs_loader would need the ability to self terminate / exit if all files are currently being processed.


Posted In: #lambda #aws #sqs


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK