Scientific applications are often represented as workflows, which describe sequences of tasks (computations) and the data dependencies between these tasks. Several scientific workflow environments have already been proposed, developed primarily to simplify the design and execution of sets of tasks on parallel and distributed infrastructures. These environments adopt a “process-oriented” design approach, in which the information about data dependencies (the data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, such as clusters and grids, and offer little or no support for efficiently managing large data sets. Finally, the data analyzed and produced by a scientific workflow are often stored in loosely structured files and managed with simple, classical mechanisms: files are either stored on a centralized disk or transferred directly between tasks. This approach is not suitable for data-centric applications, because of the inherent bottlenecks and costly data transfers it entails.
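To make this limitation concrete, here is a minimal and entirely hypothetical Python sketch of the file-based, “process-oriented” style described above: the task names and files are invented for illustration, the only dependency information the workflow carries is the chain of file paths, and every dependency implies a transfer through a central disk.

```python
import tempfile
from pathlib import Path

# Stand-in for a centralized shared disk (hypothetical layout).
SHARED = Path(tempfile.mkdtemp())
(SHARED / "reads.dat").write_text("raw input")

def align(reads: Path) -> Path:
    """First task: consumes a raw input file, writes an intermediate file."""
    out = SHARED / "aligned.dat"
    out.write_text(reads.read_text().upper())  # placeholder computation
    return out

def analyze(aligned: Path) -> Path:
    """Second task: consumes the intermediate file produced by align()."""
    out = SHARED / "result.dat"
    out.write_text(f"stats({aligned.read_text()})")
    return out

# The data flow is purely syntactic (a chain of file paths): the engine
# knows nothing about the data itself, so every workflow edge implies a
# full file transfer through the central disk -- the bottleneck noted above.
result = analyze(align(SHARED / "reads.dat"))
print(result.read_text())
```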
Clouds have recently emerged as an attractive infrastructure option for deploying scientific workflows. Thanks to the elasticity of clouds, scientific workflows have in recent years become an archetype for modeling experiments on such infrastructures. In addition, the cloud allows users to simply outsource data storage and application execution. Still, substantial challenges must be addressed before the potential of clouds can be fully exploited for scientific workflows.
One missing link is data management: clouds mainly target web and business applications and lack specific support for data-intensive scientific workflows. Currently, workflow data in the cloud are managed either through application-specific overlays that map the output of one task to the input of another in a pipeline fashion or, more recently, by leveraging the MapReduce programming model (e.g., Hadoop on Azure, HDInsight). However, most scientific applications do not fit this model and require a more general data and task orchestration model, independent of any particular programming model.
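As an illustration, the following self-contained sketch (plain Python, not the actual Hadoop/HDInsight API) shows the two-stage shape that the MapReduce model imposes; a workflow whose task graph has arbitrary fan-in, fan-out, and multi-stage dependencies does not map naturally onto this shape.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map stage: emit (key, value) pairs from each input record."""
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce stage: aggregate all values sharing the same key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, sum(v for _, v in group)

# Every computation must fit this fixed map -> shuffle -> reduce pipeline.
lines = ["a b a", "b c"]
print(dict(reduce_phase(map_phase(lines))))  # {'a': 2, 'b': 2, 'c': 1}
```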
The goal of this project is to address these challenges by proposing a framework for the efficient processing of scientific workflows in clouds. Our approach will leverage the capabilities of cloud infrastructures for handling and processing large data volumes. To support data-intensive workflows, our cloud-based solution will:
The validation of this approach will be performed using synthetic benchmarks and real-life bioinformatics applications on the Microsoft Azure cloud platform.
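For concreteness, one preliminary step of such an experiment is staging input data into Azure Blob storage. The sketch below uses the public azure-storage-blob Python package; the connection string, container name, and file name are placeholders for illustration, not details of this project.

```python
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Hypothetical staging of a benchmark input before a workflow run.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("workflow-data")  # assumed container

# Upload a local input file so that cloud-hosted tasks can read it.
with open("genome.fa", "rb") as data:
    container.upload_blob(name="inputs/genome.fa", data=data)
```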