Two new cloud-based data processing papers published

Two of my latest research papers about cloud-based data processing have just been published. The first paper is entitled “Capability-based Scheduling of Scientific Workflows in the Cloud” and deals with the scheduling algorithm I implemented in Steep. I presented this paper at the 9th International Conference on Data Science, Technology and Applications (DATA), which was held as a virtual conference due to COVID-19.

The other paper entitled “Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization” was a joint collaboration with Ralf Gutbell, Hendrik M. Würz, and Jannis Weil where we implemented an approach to distributed triangulation of digital terrain models with Apache Spark and GeoTrellis. This journal article has been published in the AGILE GIScience Series.

Please find details about the papers, the conference presentation of the first one, as well as the full references below.

Capability-based Scheduling of Scientific Workflows in the Cloud

In this paper, I presented a distributed task scheduling algorithm and a software architecture for a system executing scientific workflows in the Cloud. The main challenges I addressed are (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the Cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance because in the Cloud, faults can happen at any time.

My software architecture consists of loosely coupled components communicating with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently.

My scheduling algorithm collects distinct required capability sets for the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly.

I presented the results of four experiments I conducted to evaluate if my approach meets the aforementioned challenges. The paper finishes with a discussion, conclusions, and future research opportunities.

An implementation of my algorithm and software architecture is publicly available with the open-source workflow management system Steep.

Presentation

Here are the slides of the presentation I gave at the DATA conference:

Reference

Krämer, M. (2020). Capability-based Scheduling of Scientific Workflows in the Cloud. Proceedings of the 9th International Conference on Data Science, Technology, and Applications DATA, 43–54. https://doi.org/10.5220/0009805400430054

Download

The paper has been published under the CC BY-NC-ND 4.0 license. You may download the final manuscript here.

Scalable processing of massive geodata in the cloud

In this paper, we described a cloud-based approach to transform arbitrarily large terrain data to a hierarchical level-of-detail structure that is optimized for web visualization. Our approach is based on a divide-and-conquer strategy. The input data is split into tiles that are distributed to individual workers in the cloud. These workers apply a Delaunay triangulation with a maximum number of points and a maximum geometric error. They merge the results and triangulate them again to generate less detailed tiles. The process repeats until a hierarchical tree of different levels of detail has been created. This tree can be used to stream the data to the web browser.

We have implemented this approach in the frameworks Apache Spark and GeoTrellis. Our paper includes an evaluation of our approach and the implementation. We focus on scalability and runtime but also investigate bottlenecks, possible reasons for them, as well as options for mitigation. The results of our evaluation show that our approach and implementation are scalable and that we are able to process massive terrain data.

Reference

Krämer, M., Gutbell, R., Würz, H. M., & Weil, J. (2020). Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization. AGILE: GIScience Series, 1. https://doi.org/10.5194/agile-giss-1-10-2020

Download

The paper has been published under the CC-BY 4.0 license. You may download the final manuscript here.

Two new cloud-based data processing papers published

Two new cloud-based data processing papers published

Capability-based Scheduling of Scientific Workflows in the Cloud

Presentation

Reference

Download

Scalable processing of massive geodata in the cloud

Reference

Download

Recommend

日本三菱日联集团宣布放弃区块链支付网络项目

Bitcoin Angola Hosting a Crypto Event on Metaverse, DeFi, and NFTs

My God, It’s Full of Dots!

Bank Of Uganda Exploring a Central Bank Digital Currency (CBDC)

Script Shell: How to make a loop for among the float number strings?

浙江西溪湿地施展“驻颜术” 3000余根松木桩护岸固堤

When not to use a North Star - Maahir Shah - Mind the Product

Figma系统组件库搭建全攻略！

The Main Thread – Issue #1

How The Global SWIFT Payment System Works

About Joyk