HPC and Cloud Computing

Santangelo, Luigi

Until a few years ago, scientific applications were designed, developed and built to run only on high-performance infrastructures or on-premise systems, which were, and still are, able to guarantee high performance, reducing execution time and increasing application scalability. Since 2006, however, the traditional IT landscape has been changing. The birth of cloud computing brought several opportunities for business users, who started replacing their own on-premise systems with the emerging cloud services. Over time, cloud infrastructures became more powerful, reliable, affordable and secure, attracting the interest of the scientific community. Despite the many advantages of cloud computing, several barriers slow down the transition towards the new environment. One of the most relevant is the complexity of moving applications: moving an application into the cloud is not effortless and might take a huge amount of time. Therefore, before starting the migration, it is worth having an idea in advance of how the application will behave when run on a different infrastructure. It is crucial for researchers to understand the trade-offs, costs and benefits of moving an application into the cloud. Indeed, applications that run quickly on HPC systems might perform worse on a different infrastructure such as the cloud. Several factors limit the performance of an application running in the cloud, such as the virtualized environment, the architecture of the physical CPU and the amount of RAM but, as highlighted in many works, perhaps the most important factor limiting the performance of a large number of scientific applications is the interconnection network.
For communication-intensive applications, the network infrastructure may quickly become a bottleneck, reducing performance and scalability. A rich set of interconnection technologies (such as InfiniBand, Intel Omni-Path or Ethernet) has been developed to reduce the overhead introduced by the network layer and to increase application scalability and performance. All these technologies are currently used in HPC systems. Indeed, according to the Top500 list (Top500 Release June 2018), 49.4% of the HPC systems use Gigabit Ethernet technology to interconnect nodes, 27.8% use InfiniBand, 7.8% use Omni-Path and the remaining 15% make use of proprietary or custom interconnection networks. Cloud service providers try to keep pace with HPC systems by studying and introducing new components in their interconnection model, making the cloud environment more attractive and promising even for running scientific applications traditionally executed on HPC systems [A64A69]. This is also the case for many bioinformatics applications, and bio-scientists are considering moving their parallel code to cloud infrastructures. Because this migration is costly in time and effort, a deep analysis of the application and of the new infrastructure layer should be carried out beforehand, in order to gain insight into how the application will behave when run in the cloud and how it might be adapted to reduce the impact of communication. Knowing the impact of communication in advance helps researchers estimate how the application will perform on a different architecture and the economic cost of running it there.
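
As a rough illustration of how the impact of communication might be estimated in advance, the following minimal sketch uses the standard latency-bandwidth (alpha-beta) model, in which sending a message costs a fixed latency plus the message size divided by the bandwidth. This model is an assumption introduced here for illustration and is not necessarily the analysis used in this work. The function names, the two interconnect profiles and all numeric values are hypothetical placeholders, not measurements.

```python
# Illustrative sketch only: every number below is an assumed placeholder,
# not a measurement from this work.


def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message: fixed latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def communication_fraction(compute_s, n_messages, size_bytes,
                           latency_s, bandwidth_bytes_per_s):
    """Share of the total run time spent communicating (between 0 and 1)."""
    comm_s = n_messages * message_time(size_bytes, latency_s,
                                       bandwidth_bytes_per_s)
    return comm_s / (compute_s + comm_s)


# Hypothetical interconnect profiles (names and values are assumptions).
NETWORKS = {
    "low-latency fabric": {"latency_s": 2e-6, "bandwidth_bytes_per_s": 10e9},
    "commodity network": {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1e9},
}

if __name__ == "__main__":
    # Hypothetical workload: 10 s of computation, 100,000 messages of 8 KiB.
    for name, params in NETWORKS.items():
        share = communication_fraction(10.0, 100_000, 8192, **params)
        print(f"{name}: {share:.1%} of run time spent communicating")
```

Under these assumed values, the same workload spends a much larger share of its run time communicating on the slower network, which is the kind of hint a researcher might want before committing to a migration. A real estimate would need measured latency, bandwidth and message counts for the application and platform in question.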

The use of cloud infrastructures as an alternative to traditional on-premise systems has become, over the years, an attractive option for running commercial applications. Business organizations can benefit from the many advantages offered by cloud platforms, which translates into a significant reduction in costs and greater efficiency in managing information systems. However, this may not hold for scientific applications, historically designed to run on very high-performance HPC systems. A comparison between cloud systems and high-performance infrastructures does not always see the former prevail over the latter, especially when only performance and cost are taken into account. This claim is confirmed by the results of the experimental tests described in this thesis, obtained after migrating to the cloud two scientific applications (called Cross Motif Search and BloodFlow, respectively) based on two different communication models. Performance and scalability tests, obtained by running the two applications on two similar infrastructures (cloud and HPC), showed that both applications perform better on the HPC system, especially BloodFlow, whose execution is strongly affected by the network infrastructure because it is based on a very communication-intensive model. From an economic standpoint too, the cloud turns out not to be cost-effective when compared with the cost of an HPC system. This result was obtained by comparing the cost of running a generic application for one hour on three different cloud architectures (Google, Amazon, Microsoft) with the cost of running the same application for the same time on an HPC system (specifically, Marconi).
Although Google's cloud platform turns out to be the cheapest, its cost remains several times higher than that of Marconi, even when using a preemptible cloud cluster, that is, a cluster of instances whose execution can be interrupted automatically by the cloud infrastructure, without any notice, when other tasks require access to those resources. From an economic standpoint as well, therefore, cloud computing does not appear to be cost-effective for running scientific applications. However, a comparison between cloud and HPC systems that considers only cost and performance is unfair, because many other factors favor the cloud over HPC systems. For example, when delivery time and user preference are considered, the picture changes. A fair comparison between cloud and HPC systems should therefore take into account not only performance and cost, but also job waiting time, the number of jobs interrupted by the system, setup time, system availability time and user preference. For example, jobs submitted to an HPC system are not executed immediately but are placed in a queue until the resources they request become available. The waiting time of a job in the queue depends on several factors, such as the number of physical resources available, the number of jobs already submitted but still waiting to be executed and, sometimes, the number of jobs the user has already run in the last month or similar factors.
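
To make the difference between a cost-only comparison and a fairer one concrete, here is a minimal sketch that adds queue waiting time and setup time to execution time. The `Platform` class, its field names and every value are hypothetical placeholders chosen only for illustration; they are not the prices or timings measured in this thesis. The sketch also assumes that only execution time is billed and leaves out interrupted jobs, availability and user preference.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical description of one platform; all fields are assumptions."""
    name: str
    cost_per_hour: float  # price of one hour of execution (arbitrary units)
    queue_wait_h: float   # time a job waits in the queue before starting
    setup_h: float        # time needed to prepare the environment
    run_h: float          # execution time of the application

    def time_to_result_h(self) -> float:
        """Elapsed time from submission to result: wait + setup + run."""
        return self.queue_wait_h + self.setup_h + self.run_h

    def execution_cost(self) -> float:
        """Cost of the run, assuming only execution time is billed."""
        return self.cost_per_hour * self.run_h


# Placeholder values, not measurements from this work.
hpc = Platform("HPC", cost_per_hour=1.0, queue_wait_h=6.0, setup_h=0.0, run_h=1.0)
cloud = Platform("cloud", cost_per_hour=4.0, queue_wait_h=0.0, setup_h=0.5, run_h=1.5)

if __name__ == "__main__":
    for p in (hpc, cloud):
        print(f"{p.name}: cost {p.execution_cost():.2f}, "
              f"time to result {p.time_to_result_h():.1f} h")
```

With these assumed numbers the HPC system is cheaper and faster per run, yet the cloud returns the result sooner once queue waiting time is counted. Which platform is preferable then depends on how the user weighs cost against time to result.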