
Two new cloud-based data processing papers published

source link: https://michelkraemer.com/Two-new-cloud-based-data-processing-papers-published/

Two of my latest research papers about cloud-based data processing have just been published. The first paper is entitled “Capability-based Scheduling of Scientific Workflows in the Cloud” and deals with the scheduling algorithm I implemented in Steep. I presented this paper at the 9th International Conference on Data Science, Technology and Applications (DATA), which was held as a virtual conference due to COVID-19.

The other paper, entitled “Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization”, is a collaboration with Ralf Gutbell, Hendrik M. Würz, and Jannis Weil in which we implemented an approach to distributed triangulation of digital terrain models with Apache Spark and GeoTrellis. The article has been published in the AGILE GIScience Series.

Please find details about the papers, the conference presentation of the first one, and the full references below.

Capability-based Scheduling of Scientific Workflows in the Cloud

In this paper, I present a distributed task scheduling algorithm and a software architecture for a system executing scientific workflows in the Cloud. The main challenges I address are (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the Cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance, because in the Cloud faults can happen at any time.

[Figure: architecture.svg]

My software architecture consists of loosely coupled components communicating with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently.

[Figure: generate-process-chains.svg]
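To illustrate the idea of turning a workflow graph into independently schedulable process chains, here is a minimal Python sketch. The grouping rule (maximal linear runs of tasks) and the task names are my own illustration, not Steep's actual implementation:

```python
def to_process_chains(successors):
    """Group a workflow DAG into process chains.

    successors: dict mapping each task to the list of its downstream tasks.
    Returns a list of chains (lists of tasks) that can run independently.
    """
    # count incoming edges for every task
    indegree = {t: 0 for t in successors}
    for outs in successors.values():
        for t in outs:
            indegree[t] += 1

    def starts_chain(t):
        # a task starts a new chain unless it is the sole successor of a
        # task that itself has exactly one successor (a pure linear link)
        if indegree[t] != 1:
            return True
        (pred,) = [p for p, outs in successors.items() if t in outs]
        return len(successors[pred]) != 1

    chains = []
    for t in successors:
        if not starts_chain(t):
            continue
        chain = [t]
        # extend the chain while the link stays purely linear
        while len(successors[chain[-1]]) == 1:
            nxt = successors[chain[-1]][0]
            if indegree[nxt] != 1:
                break
            chain.append(nxt)
        chains.append(chain)
    return chains
```

On a purely linear workflow this yields a single chain, while forks and joins start new chains, which is exactly what makes the resulting chains independently schedulable.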

My scheduling algorithm collects distinct required capability sets for the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly.
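The matching step described above can be sketched as follows. All names and the superset check are illustrative assumptions, not Steep's API; the point is that agents are queried once per distinct capability set rather than once per process chain:

```python
def assign(chains, agents):
    """Assign process chains to agents by capability matching.

    chains: dict chain_id -> frozenset of required capabilities.
    agents: dict agent_id -> set of capabilities the agent offers.
    Returns dict chain_id -> agent_id (unmatched chains are omitted).
    """
    # 1. collect the distinct required capability sets
    required_sets = set(chains.values())

    # 2. each agent reports which sets it can manage (superset check
    #    stands in for asking the agent over the event bus)
    supported = {a: {s for s in required_sets if s <= caps}
                 for a, caps in agents.items()}

    # 3. assign each chain to some agent that supports its set
    assignment = {}
    for cid, req in chains.items():
        for a, sets in supported.items():
            if req in sets:
                assignment[cid] = a
                break
    return assignment
```

A real scheduler would additionally balance load among the candidate agents; this sketch just takes the first match to keep the capability-set idea visible.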

I present the results of four experiments conducted to evaluate whether my approach meets the aforementioned challenges. The paper finishes with a discussion, conclusions, and future research opportunities.

[Figure: evaluation.png]

An implementation of my algorithm and software architecture is publicly available with the open-source workflow management system Steep.

Presentation

Here are the slides of the presentation I gave at the DATA conference:

Reference

Krämer, M. (2020). Capability-based Scheduling of Scientific Workflows in the Cloud. Proceedings of the 9th International Conference on Data Science, Technology and Applications (DATA), 43–54. https://doi.org/10.5220/0009805400430054

Download

The paper has been published under the CC BY-NC-ND 4.0 license. You may download the final manuscript here.

Scalable processing of massive geodata in the cloud

In this paper, we describe a cloud-based approach to transforming arbitrarily large terrain data into a hierarchical level-of-detail structure optimized for web visualization. Our approach is based on a divide-and-conquer strategy: the input data is split into tiles that are distributed to individual workers in the cloud. Each worker applies a Delaunay triangulation constrained by a maximum number of points and a maximum geometric error. The workers then merge their results and triangulate them again to generate less detailed tiles. The process repeats until a hierarchical tree of different levels of detail has been created, which can be used to stream the data to the web browser.

[Figure: conversion-process.png]
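The divide-and-conquer recursion can be sketched in a few lines. The Delaunay triangulation is replaced here by a trivial point-thinning stub so the merge-and-simplify structure stays visible; the tile keys, the 2×2 merge, and the stub are my own assumptions, not the paper's code:

```python
def simplify(points, max_points):
    # stand-in for "triangulate with a maximum number of points":
    # keep every k-th point so at most max_points survive
    step = max(1, -(-len(points) // max_points))  # ceiling division
    return points[::step]

def build_lod(tiles, max_points):
    """Build a level-of-detail pyramid from terrain tiles.

    tiles: dict (x, y) -> list of terrain points at the finest level.
    Returns a list of levels, finest first; each level maps tile
    coordinates to a simplified point list.
    """
    levels = [{k: simplify(p, max_points) for k, p in tiles.items()}]
    while len(levels[-1]) > 1:
        merged = {}
        # merge 2x2 blocks of tiles into one parent tile ...
        for (x, y), pts in levels[-1].items():
            merged.setdefault((x // 2, y // 2), []).extend(pts)
        # ... and re-simplify each parent to the same point budget
        levels.append({k: simplify(p, max_points)
                       for k, p in merged.items()})
    return levels
```

Because every level respects the same per-tile point budget, a viewer can stream coarse tiles first and refine only the region currently on screen.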

We have implemented this approach with the frameworks Apache Spark and GeoTrellis. Our paper includes an evaluation of the approach and the implementation. We focus on scalability and runtime but also investigate bottlenecks, possible reasons for them, and options for mitigation. The results show that our approach and implementation scale well and are able to process massive terrain data.

Reference

Krämer, M., Gutbell, R., Würz, H. M., & Weil, J. (2020). Scalable processing of massive geodata in the cloud: generating a level-of-detail structure optimized for web visualization. AGILE: GIScience Series, 1. https://doi.org/10.5194/agile-giss-1-10-2020

Download

The paper has been published under the CC BY 4.0 license. You may download the final manuscript here.
