Nephos Platform Overview

Overview

Nephos is a full-stack ML-Ops platform that lets data engineers and data scientists manage and explore datasets, create reusable executables, and build and run visual workflows that train and publish models.

Nephos is designed as a low-code/no-code environment: most tasks can be accomplished through an easy-to-use UI backed by a scalable, cloud-native platform.

Problem Statement

Businesses of all sizes need to balance the agility and elasticity of cloud-based computing against the cost-effectiveness and security of their existing on-premises IT infrastructure. This has created a need for hybrid clouds: computing environments that combine on-premises infrastructure, private cloud, and third-party public cloud services, with orchestration between the platforms.

Application and scientific workflows let users express multi-step computational tasks, for example: retrieve data from an instrument or a database, reformat the data, train a deep learning model, and run an analysis. A workflow describes the dependencies between its tasks. In most cases it is expressed as a directed acyclic graph (DAG), where the nodes are tasks and the edges denote task dependencies. A defining property of a workflow is that it manages data flow. Tasks can range from short serial tasks to very large parallel tasks (MPI, for example) surrounded by many small serial tasks used for pre- and post-processing. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
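To make the DAG description concrete, here is a minimal Python sketch of a workflow expressed as code. It is not Nephos's own workflow format or API; the task names and the helper functions `execution_order` and `ready_batches` are hypothetical and chosen only for illustration. It uses the standard-library `graphlib` module (Python 3.9+) to derive a valid run order and to find tasks that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks it depends on (the edges of the DAG).
workflow = {
    "retrieve_data": set(),
    "reformat_data": {"retrieve_data"},
    "train_model": {"reformat_data"},
    "run_analysis": {"reformat_data"},
    "publish_model": {"train_model", "run_analysis"},
}


def execution_order(dag):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(dag).static_order())


def ready_batches(dag):
    """Group tasks into batches; tasks in the same batch do not depend
    on each other, so they could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        batches.append(batch)
        sorter.done(*batch)
    return batches


if __name__ == "__main__":
    print(execution_order(workflow))
    for number, batch in enumerate(ready_batches(workflow), start=1):
        print(number, batch)
```

In this sketch, `train_model` and `run_analysis` land in the same batch because both depend only on `reformat_data`. Keeping the definition in a plain data structure like this is also what makes a workflow easy to version and test.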

Nephos aims to combine the ease of use of workflows for defining and managing large-scale applications with the cost-effectiveness, security, and elasticity of hybrid clouds.

Technology Stack

Nephos encompasses a set of technologies that help workflow-based applications, such as deep learning training workflows, execute in a number of different environments including desktops, campus clusters, grids, and clouds. Nephos bridges the scientific domain and the execution environment by automatically mapping high-level workflow descriptions onto distributed resources. It automatically locates the input data and computational resources needed for workflow execution.

Nephos also tries to optimize the use of computational resources to minimize overall cost while balancing the priority of submitted workflow-based applications. To do this, it takes into account the underlying costs of cloud-based environments and local clusters and, for cloud environments such as Amazon EC2, distinguishes between on-demand and spot instances.
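The trade-off between cost and priority can be illustrated with a small Python sketch. This is not Nephos's scheduling logic, which is not described here. The `Environment` class, the `choose_environment` function, the rule that high-priority jobs avoid interruptible (spot-style) capacity, and all prices are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    cost_per_hour: float  # illustrative prices, not real quotes
    interruptible: bool   # True for spot-style capacity that can be reclaimed


def choose_environment(environments, high_priority):
    """Pick the cheapest environment.

    Assumed policy: high-priority jobs skip interruptible capacity,
    because losing the instance mid-run would delay them.
    """
    candidates = [
        env for env in environments
        if not (high_priority and env.interruptible)
    ]
    if not candidates:
        raise ValueError("no eligible environment for this job")
    return min(candidates, key=lambda env: env.cost_per_hour)


# Hypothetical environments in a hybrid setup.
ENVIRONMENTS = [
    Environment("local-cluster", 0.40, False),
    Environment("cloud-on-demand", 1.00, False),
    Environment("cloud-spot", 0.30, True),
]

if __name__ == "__main__":
    print(choose_environment(ENVIRONMENTS, high_priority=False).name)
    print(choose_environment(ENVIRONMENTS, high_priority=True).name)
```

With these made-up prices, a low-priority job goes to the spot capacity and a high-priority job goes to the local cluster. A real policy would also weigh queue times, data locality, and transfer costs.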

Nephos enables scientists and application developers to construct workflows in abstract terms without worrying about the details of the underlying execution environment or the low-level specifications required by the middleware (Condor, Globus, or Amazon EC2). Nephos also bridges existing cyberinfrastructure by coordinating multiple distributed resources.

Nephos can be used in a number of scientific and commercial domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others.

Platform Features

  • Rich User Interface – Makes it easy to create, manage, and troubleshoot tools, workflows, jobs, and compute environments.

  • Portability / Reuse – User-created workflows can easily be run in different environments without alteration.

  • Hybrid Cloud Support – Nephos can run workflows on top of Condor, Grid infrastructures such as Open Science Grid and XSEDE, Amazon EC2, Google Cloud, and many campus clusters. The same workflow can run on a single system or across a heterogeneous set of resources.

  • Cost Optimization – Nephos optimizes costs by submitting jobs to the most appropriate computational environment while trying to honor job priority, which helps reduce the overall cost of running and managing a hybrid cloud environment.

  • Performance – Nephos can reorder, group, and prioritize tasks in order to increase overall workflow performance.

  • Scalability – Nephos can scale both the size of the workflow and the resources that the workflow is distributed over. Nephos runs workflows ranging from just a few computational tasks up to 1 million. The number of resources involved in executing a workflow can scale as needed.

  • Security – Nephos will support role-based access control for the UI and integration with directory-based services.

  • Provenance – By default, all jobs in Nephos are launched using the Kickstart wrapper, which captures runtime provenance of the job and helps in debugging. Provenance data is collected in a database and can be queried with tools or directly using SQL.

  • Data Management – Nephos handles replica selection, data transfers and output registrations in data catalogs. These tasks are added to a workflow as auxiliary jobs by the Nephos planner.

  • Reliability – Jobs and data transfers are automatically retried in case of failures. Debugging tools help the user to debug the workflow in case of non-recoverable failures.

  • Error Recovery – When errors occur, Nephos tries to recover where possible by retrying tasks, retrying the entire workflow, providing workflow-level checkpointing, re-mapping portions of the workflow, and trying alternative data sources for staging data. When all else fails, it provides a rescue workflow that describes only the work that remains to be done. Nephos cleans up storage as the workflow executes so that data-intensive workflows have enough space to run on storage-constrained resources. It also keeps track of what has been done (provenance), including the locations of data used and produced, and which software was used with which parameters.
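The retry-then-rescue behavior described under Reliability and Error Recovery can be sketched in a few lines of Python. This is an illustration of the general idea only, not Nephos's implementation. The functions `run_with_retries` and `run_workflow`, the attempt limit, and the simplification to a linear chain of tasks are all assumptions.

```python
def run_with_retries(task, max_attempts=3):
    """Call task() up to max_attempts times.

    Returns True as soon as one attempt succeeds, False if all fail.
    """
    for _ in range(max_attempts):
        try:
            task()
            return True
        except Exception:
            continue  # treat every error as retryable in this sketch
    return False


def run_workflow(ordered_tasks, max_attempts=3):
    """Run tasks in order and return (completed, rescue) lists of names.

    ordered_tasks is a list of (name, callable) pairs that is already
    sorted so that every task comes after the tasks it depends on.
    If a task still fails after its retries, the run stops and the
    rescue list names only the work that remains to be done.
    """
    completed = []
    for index, (name, task) in enumerate(ordered_tasks):
        if not run_with_retries(task, max_attempts):
            rescue = [remaining for remaining, _ in ordered_tasks[index:]]
            return completed, rescue
        completed.append(name)
    return completed, []
```

Calling `run_workflow` again with only the tasks named in the rescue list resumes the run without repeating completed work. In a real DAG, branches that do not depend on the failed task could keep running; this sketch stops at the first unrecoverable failure to stay short.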

Beta Access

Nephos is currently in a private, invite-only beta. Contact info@wisecube.ai for a beta invite to the Nephos AI platform.
