docs/content/concepts/compute-to-data.md

---
title: Compute-to-Data
description: Providing access to data in a privacy-preserving fashion
slug: /concepts/compute-to-data/
section: concepts
---

## Motivation

The most basic scenario for a Publisher is to provide access to the datasets they own or manage. However, a Publisher may offer a service to execute some computation on top of their data. This has some benefits:

- The data **never** leaves the Publisher enclave.
- It's not necessary to move the data; the algorithm is sent to the data.
- Having only one copy of the data and not moving it makes it easier to be compliant with data protection regulations.

[This page](https://oceanprotocol.com/technology/compute-to-data) elaborates on the benefits.

## Datasets & Algorithms

With Compute-to-Data, datasets are not allowed to leave the premises of the data holder, only algorithms can be permitted to run on them under certain conditions within an isolated and secure environment. Algorithms are an asset type just like datasets. They they too can have a pool or a fixed price to determine their price whenever they are used.


Algorithms can be public or private by setting `"attributes.main.type"` value as follows:
-`"access"` - public. The algorithm can be downloaded, given appropriate datatoken.
-`"compute"` - private. The algorithm is only available to use as part of a compute job without any way to download it. The dataset must be published on the same Ocean Provider as the dataset it's targeted to run on.

For each dataset, publishers can choose to allow various permission levels for algorithms to run:

- allow selected algorithms, referenced by their DID
- allow all algorithms published within a network or marketplace
- allow raw algorithms, for advanced use cases circumventing algorithm as an asset type, but most prone to data escape

All implementations should set permissions to private by default: upon publishing a compute dataset, no algorithms should be allowed to run on it. This is to prevent data escape by a rogue algorithm being written in a way to extract all data from a dataset.

## Architecture Overview

Here's the sequence diagram for starting a new compute job. 

![Sequence Diagram for computing services](images/Starting New Compute Job.png)

The Consumer calls the Provider with `start(did, algorithm, additionalDIDs)`. It returns job id `XXXX`. The Provider oversees the rest of the work. At any point, the Consumer can query the Provider for the job status via `getJobDetails(XXXX)`.

Here's how Provider works. First, it ensures that the Consumer has sent the appropriate datatokens to get access. Then, it calls asks the Operator-Service (a microservice) to start the job, which passes on the request to Operator-Engine (the actual compute system). Operator-Engine runs Kubernetes compute jobs etc as needed. Operator-Engine reports when to Operator-Service when the job has finished.

Here's the actors/components:

- Consumers - The end users who need to use some computing services offered by the same Publisher as the data Publisher.
- Operator-Service - Micro-service that is handling the compute requests.
- Operator-Engine - The computing systems where the compute will be executed.
- Kubernetes - a K8 cluster

Before the flow can begin, these pre-conditions must be met:

- The Asset DDO has a `compute` service.
- The Asset DDO compute service must permit algorithms to run on it.
- The Asset DDO must specify an Ocean Provider endpoint exposed by the Publisher.

## Access Control using Ocean Provider

As [with the `access` service](/concepts/architecture/#datatokens--access-control-tools), the `compute` service requires the **Ocean Provider** as a component handled by Publishers. Ocean Provider is in charge of interacting with users and managing the basics of a Publisher's infrastructure to integrate this infrastructure into Ocean Protocol. The direct interaction with the infrastructure where the data resides happens through this component only.

Ocean Provider includes the credentials to interact with the infrastructure (initially in cloud providers, but it could be on-premise).

<repo name="provider"></repo>

## Compute-to-Data Environment

### Operator Service

The **Operator Service** is a micro-service in charge of managing the workflow executing requests.

The main responsibilities are:

- Expose an HTTP API allowing for the execution of data access and compute endpoints.
- Interact with the infrastructure (cloud/on-premise) using the Publisher's credentials.
- Start/stop/execute computing instances with the algorithms provided by users.
- Retrieve the logs generated during executions.

Typically the Operator Service is integrated from Ocean Provider, but can be called independently of it.

The Operator Service is in charge of establishing the communication with the K8s cluster, allowing it to:

- Register new compute jobs
- List the current compute jobs
- Get a detailed result for a given job
- Stop a running job

The Operator Service doesn't provide any storage capability, all the state is stored directly in the K8s cluster.

<repo name="operator-service"></repo>

### Operator Engine

The **Operator Engine** is in charge of orchestrating the compute infrastructure using Kubernetes as backend where each compute job runs in an isolated [Kubernetes Pod](https://kubernetes.io/docs/concepts/workloads/pods/). Typically the Operator Engine retrieves the workflows created by the Operator Service in Kubernetes, and manage the infrastructure necessary to complete the execution of the compute workflows.

The Operator Engine is in charge of retrieving all the workflows registered in a K8s cluster, allowing to:

- Orchestrate the flow of the execution
- Start the configuration pod in charge of download the workflow dependencies (datasets and algorithms)
- Start the pod including the algorithm to execute
- Start the publishing pod that publish the new assets created in the Ocean Protocol network.
- The Operator Engine doesn't provide any storage capability, all the state is stored directly in the K8s cluster.

<repo name="operator-engine"></repo>

### Pod: Configuration

<repo name="pod-configuration"></repo>

### Pod: Publishing

<repo name="pod-publishing"></repo>

## Further Reading

- [Tutorial: Writing Algorithms](/tutorials/compute-to-data-algorithms/)
- [Tutorial: Set Up a Compute-to-Data Environment](/tutorials/compute-to-data/)
- [Compute-to-Data in Ocean Market](https://blog.oceanprotocol.com)
- [(Old) Compute-to-Data specs](https://github.com/oceanprotocol-archive/OEPs/tree/master/12) (OEP12)
first cut 2020-04-29 12:27:29 +02:00			`---`
			`title: Compute-to-Data`
update 'Contributions' section 2020-10-17 09:09:20 +02:00			`description: Providing access to data in a privacy-preserving fashion`
first cut 2020-04-29 12:27:29 +02:00			`slug: /concepts/compute-to-data/`
			`section: concepts`
			`---`

			`## Motivation`

update 'Contributions' section 2020-10-17 09:09:20 +02:00			`The most basic scenario for a Publisher is to provide access to the datasets they own or manage. However, a Publisher may offer a service to execute some computation on top of their data. This has some benefits:`
first cut 2020-04-29 12:27:29 +02:00
formatting 2020-06-03 02:30:38 +02:00			`- The data never leaves the Publisher enclave.`
			`- It's not necessary to move the data; the algorithm is sent to the data.`
			`- Having only one copy of the data and not moving it makes it easier to be compliant with data protection regulations.`
first cut 2020-04-29 12:27:29 +02:00
update 'Contributions' section 2020-10-17 09:09:20 +02:00			`[This page](https://oceanprotocol.com/technology/compute-to-data) elaborates on the benefits.`

For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			`## Datasets & Algorithms`
first cut 2020-04-29 12:27:29 +02:00
For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			`With Compute-to-Data, datasets are not allowed to leave the premises of the data holder, only algorithms can be permitted to run on them under certain conditions within an isolated and secure environment. Algorithms are an asset type just like datasets. They they too can have a pool or a fixed price to determine their price whenever they are used.`
first cut 2020-04-29 12:27:29 +02:00

For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			Algorithms can be public or private by setting `"attributes.main.type"` value as follows:
			-`"access"` - public. The algorithm can be downloaded, given appropriate datatoken.
			-`"compute"` - private. The algorithm is only available to use as part of a compute job without any way to download it. The dataset must be published on the same Ocean Provider as the dataset it's targeted to run on.

			`For each dataset, publishers can choose to allow various permission levels for algorithms to run:`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`- allow selected algorithms, referenced by their DID`
			`- allow all algorithms published within a network or marketplace`
			`- allow raw algorithms, for advanced use cases circumventing algorithm as an asset type, but most prone to data escape`
first cut 2020-04-29 12:27:29 +02:00
For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			`All implementations should set permissions to private by default: upon publishing a compute dataset, no algorithms should be allowed to run on it. This is to prevent data escape by a rogue algorithm being written in a way to extract all data from a dataset.`
first cut 2020-04-29 12:27:29 +02:00
heading tweaks 2021-04-28 12:18:05 +02:00			`## Architecture Overview`
first cut 2020-04-29 12:27:29 +02:00
For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			`Here's the sequence diagram for starting a new compute job.`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`![Sequence Diagram for computing services](images/Starting New Compute Job.png)`
first cut 2020-04-29 12:27:29 +02:00
For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			The Consumer calls the Provider with `start(did, algorithm, additionalDIDs)`. It returns job id `XXXX`. The Provider oversees the rest of the work. At any point, the Consumer can query the Provider for the job status via `getJobDetails(XXXX)`.

			`Here's how Provider works. First, it ensures that the Consumer has sent the appropriate datatokens to get access. Then, it calls asks the Operator-Service (a microservice) to start the job, which passes on the request to Operator-Engine (the actual compute system). Operator-Engine runs Kubernetes compute jobs etc as needed. Operator-Engine reports when to Operator-Service when the job has finished.`

			`Here's the actors/components:`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`- Consumers - The end users who need to use some computing services offered by the same Publisher as the data Publisher.`
			`- Operator-Service - Micro-service that is handling the compute requests.`
			`- Operator-Engine - The computing systems where the compute will be executed.`
			`- Kubernetes - a K8 cluster`
first cut 2020-04-29 12:27:29 +02:00
For issue 740 Fix #740 2021-08-24 10:58:44 +02:00			`Before the flow can begin, these pre-conditions must be met:`
restructure all the things 2021-04-28 11:54:20 +02:00
rewrite provider paragraph 2021-04-28 12:15:56 +02:00			- The Asset DDO has a `compute` service.
			`- The Asset DDO compute service must permit algorithms to run on it.`
			`- The Asset DDO must specify an Ocean Provider endpoint exposed by the Publisher.`
update core C2D concepts 2021-04-28 09:20:22 +02:00
heading tweaks 2021-04-28 12:18:05 +02:00			`## Access Control using Ocean Provider`
first cut 2020-04-29 12:27:29 +02:00
link up more repos in general architecture 2021-04-28 12:29:34 +02:00			As [with the `access` service](/concepts/architecture/#datatokens--access-control-tools), the `compute` service requires the Ocean Provider as a component handled by Publishers. Ocean Provider is in charge of interacting with users and managing the basics of a Publisher's infrastructure to integrate this infrastructure into Ocean Protocol. The direct interaction with the infrastructure where the data resides happens through this component only.
first cut 2020-04-29 12:27:29 +02:00
rewrite provider paragraph 2021-04-28 12:15:56 +02:00			`Ocean Provider includes the credentials to interact with the infrastructure (initially in cloud providers, but it could be on-premise).`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`<repo name="provider"></repo>`

heading tweaks 2021-04-28 12:18:05 +02:00			`## Compute-to-Data Environment`
restructure all the things 2021-04-28 11:54:20 +02:00
heading tweaks 2021-04-28 12:18:05 +02:00			`### Operator Service`
restructure all the things 2021-04-28 11:54:20 +02:00
			`The Operator Service is a micro-service in charge of managing the workflow executing requests.`
first cut 2020-04-29 12:27:29 +02:00
			`The main responsibilities are:`

fixes 2020-05-04 10:20:05 +02:00			`- Expose an HTTP API allowing for the execution of data access and compute endpoints.`
			`- Interact with the infrastructure (cloud/on-premise) using the Publisher's credentials.`
			`- Start/stop/execute computing instances with the algorithms provided by users.`
			`- Retrieve the logs generated during executions.`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`Typically the Operator Service is integrated from Ocean Provider, but can be called independently of it.`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`The Operator Service is in charge of establishing the communication with the K8s cluster, allowing it to:`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`- Register new compute jobs`
			`- List the current compute jobs`
			`- Get a detailed result for a given job`
			`- Stop a running job`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`The Operator Service doesn't provide any storage capability, all the state is stored directly in the K8s cluster.`
first cut 2020-04-29 12:27:29 +02:00
restructure all the things 2021-04-28 11:54:20 +02:00			`<repo name="operator-service"></repo>`
first cut 2020-04-29 12:27:29 +02:00
heading tweaks 2021-04-28 12:18:05 +02:00			`### Operator Engine`
restructure all the things 2021-04-28 11:54:20 +02:00
mention Kubernetes Pod 2021-04-28 17:31:22 +02:00			`The Operator Engine is in charge of orchestrating the compute infrastructure using Kubernetes as backend where each compute job runs in an isolated [Kubernetes Pod](https://kubernetes.io/docs/concepts/workloads/pods/). Typically the Operator Engine retrieves the workflows created by the Operator Service in Kubernetes, and manage the infrastructure necessary to complete the execution of the compute workflows.`
restructure all the things 2021-04-28 11:54:20 +02:00
			`The Operator Engine is in charge of retrieving all the workflows registered in a K8s cluster, allowing to:`

			`- Orchestrate the flow of the execution`
			`- Start the configuration pod in charge of download the workflow dependencies (datasets and algorithms)`
			`- Start the pod including the algorithm to execute`
			`- Start the publishing pod that publish the new assets created in the Ocean Protocol network.`
			`- The Operator Engine doesn't provide any storage capability, all the state is stored directly in the K8s cluster.`

			`<repo name="operator-engine"></repo>`
rewrite provider paragraph 2021-04-28 12:15:56 +02:00
list pod-configuration & pod-publishing 2021-04-28 17:28:58 +02:00			`### Pod: Configuration`

			`<repo name="pod-configuration"></repo>`

			`### Pod: Publishing`

			`<repo name="pod-publishing"></repo>`

rewrite provider paragraph 2021-04-28 12:15:56 +02:00			`## Further Reading`

			`- [Tutorial: Writing Algorithms](/tutorials/compute-to-data-algorithms/)`
			`- [Tutorial: Set Up a Compute-to-Data Environment](/tutorials/compute-to-data/)`
			`- [Compute-to-Data in Ocean Market](https://blog.oceanprotocol.com)`
Fix #738 2021-08-24 10:34:11 +02:00			`- [(Old) Compute-to-Data specs](https://github.com/oceanprotocol-archive/OEPs/tree/master/12) (OEP12)`