mirror of https://github.com/oceanprotocol/docs.git synced 2024-11-26 19:49:26 +01:00

Merge pull request #820 from oceanprotocol/issue-808-c2d-docs

WIP: Issue 808 c2d docs
This commit is contained in:
Akshay 2022-01-07 16:40:36 +01:00 committed by GitHub
commit 91d3d235b8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
9 changed files with 278 additions and 258 deletions


@ -5,6 +5,11 @@ slug: /concepts/compute-to-data/
section: concepts
---
## Quick Start
- [Compute-to-Data example](https://github.com/oceanprotocol/ocean.py/blob/main/READMEs/c2d-flow.md)
## Motivation
The most basic scenario for a Publisher is to provide access to the datasets they own or manage. However, a Publisher may offer a service to execute some computation on top of their data. This has some benefits:
@ -15,105 +20,9 @@ The most basic scenario for a Publisher is to provide access to the datasets the
[This page](https://oceanprotocol.com/technology/compute-to-data) elaborates on the benefits.
## Datasets & Algorithms
With Compute-to-Data, datasets are not allowed to leave the premises of the data holder, only algorithms can be permitted to run on them under certain conditions within an isolated and secure environment. Algorithms are an asset type just like datasets. They they too can have a pool or a fixed price to determine their price whenever they are used.
Algorithms can be public or private by setting `"attributes.main.type"` value as follows:
- `"access"` - public. The algorithm can be downloaded, given appropriate datatoken.
- `"compute"` - private. The algorithm is only available to use as part of a compute job without any way to download it. The dataset must be published on the same Ocean Provider as the dataset it's targeted to run on.
For each dataset, publishers can choose to allow various permission levels for algorithms to run:
- allow selected algorithms, referenced by their DID
- allow all algorithms published within a network or marketplace
- allow raw algorithms, for advanced use cases circumventing algorithm as an asset type, but most prone to data escape
All implementations should set permissions to private by default: upon publishing a compute dataset, no algorithms should be allowed to run on it. This is to prevent data escape by a rogue algorithm being written in a way to extract all data from a dataset.
## Architecture Overview
Here's the sequence diagram for starting a new compute job.
![Sequence Diagram for computing services](images/Starting New Compute Job.png)
The Consumer calls the Provider with `start(did, algorithm, additionalDIDs)`. It returns job id `XXXX`. The Provider oversees the rest of the work. At any point, the Consumer can query the Provider for the job status via `getJobDetails(XXXX)`.
Here's how the Provider works. First, it ensures that the Consumer has sent the appropriate datatokens to get access. Then, it asks the Operator-Service (a microservice) to start the job, which passes the request on to the Operator-Engine (the actual compute system). The Operator-Engine runs Kubernetes compute jobs as needed and reports to the Operator-Service when the job has finished.
Here are the actors/components:
- Consumers - The end users who need to use computing services offered by the same Publisher as the data Publisher.
- Operator-Service - The microservice that handles compute requests.
- Operator-Engine - The computing systems where the compute jobs are executed.
- Kubernetes - A K8s cluster.
Before the flow can begin, these pre-conditions must be met:
- The Asset DDO has a `compute` service.
- The Asset DDO compute service must permit algorithms to run on it.
- The Asset DDO must specify an Ocean Provider endpoint exposed by the Publisher.
## Access Control using Ocean Provider
As [with the `access` service](/concepts/architecture/#datatokens--access-control-tools), the `compute` service requires the **Ocean Provider** as a component handled by Publishers. Ocean Provider is in charge of interacting with users and managing the basics of a Publisher's infrastructure to integrate this infrastructure into Ocean Protocol. The direct interaction with the infrastructure where the data resides happens through this component only.
Ocean Provider includes the credentials to interact with the infrastructure (initially in cloud providers, but it could be on-premise).
<repo name="provider"></repo>
## Compute-to-Data Environment
### Operator Service
The **Operator Service** is a microservice in charge of managing workflow execution requests.
The main responsibilities are:
- Expose an HTTP API allowing for the execution of data access and compute endpoints.
- Interact with the infrastructure (cloud/on-premise) using the Publisher's credentials.
- Start/stop/execute computing instances with the algorithms provided by users.
- Retrieve the logs generated during executions.
Typically the Operator Service is integrated from Ocean Provider, but can be called independently of it.
The Operator Service is in charge of establishing the communication with the K8s cluster, allowing it to:
- Register new compute jobs
- List the current compute jobs
- Get a detailed result for a given job
- Stop a running job
The Operator Service doesn't provide any storage capability; all state is stored directly in the K8s cluster.
<repo name="operator-service"></repo>
### Operator Engine
The **Operator Engine** is in charge of orchestrating the compute infrastructure using Kubernetes as the backend, where each compute job runs in an isolated [Kubernetes Pod](https://kubernetes.io/docs/concepts/workloads/pods/). Typically the Operator Engine retrieves the workflows created by the Operator Service in Kubernetes and manages the infrastructure necessary to complete the execution of the compute workflows.
The Operator Engine is in charge of retrieving all the workflows registered in a K8s cluster, allowing it to:
- Orchestrate the flow of the execution
- Start the configuration pod in charge of downloading the workflow dependencies (datasets and algorithms)
- Start the pod including the algorithm to execute
- Start the publishing pod that publishes the new assets created in the Ocean Protocol network.
The Operator Engine doesn't provide any storage capability; all state is stored directly in the K8s cluster.
<repo name="operator-engine"></repo>
### Pod: Configuration
<repo name="pod-configuration"></repo>
### Pod: Publishing
<repo name="pod-publishing"></repo>
## Further Reading
- [Compute-to-Data architecture](/tutorials/compute-to-data-architecture/)
- [Tutorial: Writing Algorithms](/tutorials/compute-to-data-algorithms/)
- [Tutorial: Set Up a Compute-to-Data Environment](/tutorials/compute-to-data/)
- [Use Compute-to-Data in Ocean Market](https://blog.oceanprotocol.com/compute-to-data-is-now-available-in-ocean-market-58868be52ef7)


@ -0,0 +1,83 @@
---
title: Compute-to-Data
description: Architecture overview
---
## Architecture Overview
Here's the sequence diagram for starting a new compute job.
![Sequence Diagram for computing services](images/Starting New Compute Job.png)
The Consumer calls the Provider with `start(did, algorithm, additionalDIDs)`. It returns job id `XXXX`. The Provider oversees the rest of the work. At any point, the Consumer can query the Provider for the job status via `getJobDetails(XXXX)`.
Here's how the Provider works. First, it ensures that the Consumer has sent the appropriate datatokens to get access. Then, it asks the Operator-Service (a microservice) to start the job, which passes the request on to the Operator-Engine (the actual compute system). The Operator-Engine runs Kubernetes compute jobs as needed and reports to the Operator-Service when the job has finished.
Here are the actors/components:
- Consumers - The end users who need to use computing services offered by the same Publisher as the data Publisher.
- Operator-Service - The microservice that handles compute requests.
- Operator-Engine - The computing systems where the compute jobs are executed.
- Kubernetes - A K8s cluster.
Before the flow can begin, these pre-conditions must be met:
- The Asset DDO has a `compute` service.
- The Asset DDO compute service must permit algorithms to run on it.
- The Asset DDO must specify an Ocean Provider endpoint exposed by the Publisher.
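The start-then-poll interaction described above can be sketched in shell. This is only an illustration under stated assumptions, not the Provider's actual client: `get_job_status` is a hypothetical stand-in for a call that wraps `getJobDetails`, it is stubbed here so the script runs on its own, and the status strings `running` and `finished` are invented for the example.

```bash
# Sketch only: get_job_status is a hypothetical placeholder for a call that
# wraps the Provider's getJobDetails(jobId). The stub below pretends the job
# finishes on the third poll; the status strings are assumptions.
POLL_COUNT=0
get_job_status() {
  POLL_COUNT=$((POLL_COUNT + 1))
  if [ "$POLL_COUNT" -ge 3 ]; then
    JOB_STATUS="finished"
  else
    JOB_STATUS="running"
  fi
}

# Poll until the job reports "finished" or the poll budget is used up.
# Arguments: job id, maximum number of polls, delay between polls in seconds.
wait_for_job() {
  local job_id="$1"
  local max_polls="${2:-10}"
  local delay="${3:-0}"
  local polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    get_job_status "$job_id"
    if [ "$JOB_STATUS" = "finished" ]; then
      return 0
    fi
    polls=$((polls + 1))
    sleep "$delay"
  done
  return 1
}

# Usage with the stub above:
if wait_for_job "XXXX" 10 0; then
  echo "job XXXX finished after $POLL_COUNT polls"
fi
```

Bounding the number of polls keeps a client from waiting forever if a job never reaches a terminal state.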
## Access Control using Ocean Provider
As [with the `access` service](/concepts/architecture/#datatokens--access-control-tools), the `compute` service requires the **Ocean Provider** as a component handled by Publishers. Ocean Provider is in charge of interacting with users and managing the basics of a Publisher's infrastructure to integrate this infrastructure into Ocean Protocol. The direct interaction with the infrastructure where the data resides happens through this component only.
Ocean Provider includes the credentials to interact with the infrastructure (initially in cloud providers, but it could be on-premise).
<repo name="provider"></repo>
## Compute-to-Data Environment
### Operator Service
The **Operator Service** is a microservice in charge of managing workflow execution requests.
The main responsibilities are:
- Expose an HTTP API allowing for the execution of data access and compute endpoints.
- Interact with the infrastructure (cloud/on-premise) using the Publisher's credentials.
- Start/stop/execute computing instances with the algorithms provided by users.
- Retrieve the logs generated during executions.
Typically the Operator Service is integrated from Ocean Provider, but can be called independently of it.
The Operator Service is in charge of establishing the communication with the K8s cluster, allowing it to:
- Register new compute jobs
- List the current compute jobs
- Get a detailed result for a given job
- Stop a running job
The Operator Service doesn't provide any storage capability; all state is stored directly in the K8s cluster.
<repo name="operator-service"></repo>
### Operator Engine
The **Operator Engine** is in charge of orchestrating the compute infrastructure using Kubernetes as the backend, where each compute job runs in an isolated [Kubernetes Pod](https://kubernetes.io/docs/concepts/workloads/pods/). Typically the Operator Engine retrieves the workflows created by the Operator Service in Kubernetes and manages the infrastructure necessary to complete the execution of the compute workflows.
The Operator Engine is in charge of retrieving all the workflows registered in a K8s cluster, allowing it to:
- Orchestrate the flow of the execution
- Start the configuration pod in charge of downloading the workflow dependencies (datasets and algorithms)
- Start the pod including the algorithm to execute
- Start the publishing pod that publishes the new assets created in the Ocean Protocol network.
The Operator Engine doesn't provide any storage capability; all state is stored directly in the K8s cluster.
<repo name="operator-engine"></repo>
### Pod: Configuration
<repo name="pod-configuration"></repo>
### Pod: Publishing
<repo name="pod-publishing"></repo>


@ -0,0 +1,27 @@
---
title: Compute-to-Data
description: Datasets and Algorithms
---
## Datasets & Algorithms
With Compute-to-Data, datasets are not allowed to leave the premises of the data holder; only algorithms can be permitted to run on them, under certain conditions, within an isolated and secure environment. Algorithms are an asset type just like datasets. They too can have a pool or a fixed price to determine their price whenever they are used.
Algorithms can be public or private by setting the `"attributes.main.type"` value in the DDO as follows:
- `"access"` - public. The algorithm can be downloaded, given the appropriate datatoken.
- `"compute"` - private. The algorithm is only available to use as part of a compute job without any way to download it. The algorithm must be published on the same Ocean Provider as the dataset it's targeted to run on.
For each dataset, publishers can choose to allow various permission levels for algorithms to run:
- allow selected algorithms, referenced by their DID
- allow all algorithms published within a network or marketplace
- allow raw algorithms, for advanced use cases circumventing algorithm as an asset type, but most prone to data escape
All implementations should set permissions to private by default: upon publishing a compute dataset, no algorithms should be allowed to run on it. This is to prevent data escape by a rogue algorithm being written in a way to extract all data from a dataset.
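The default-deny rule above can be illustrated with a small allowlist check. This is a minimal sketch, not Ocean Provider's actual implementation: the function name `is_algorithm_allowed` and the DIDs are made up for the example.

```bash
# Sketch only: default-deny allowlist check. The first argument is the DID of
# the algorithm asking to run; the remaining arguments are the DIDs the
# publisher has explicitly allowed. All DIDs here are made-up placeholders.
is_algorithm_allowed() {
  local candidate="$1"
  shift
  local allowed
  for allowed in "$@"; do
    if [ "$candidate" = "$allowed" ]; then
      return 0
    fi
  done
  # No match (including an empty allowlist) means the algorithm is rejected.
  return 1
}

# Usage: the candidate is not in the allowlist, so this prints "denied".
if is_algorithm_allowed "did:op:3333" "did:op:1111" "did:op:2222"; then
  echo "allowed"
else
  echo "denied"
fi
```

With an empty allowlist the loop never runs and the function rejects every algorithm, which mirrors the "private by default" recommendation.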
## DDO Links
- [Algorithm DDO](/concepts/ddo-metadata/#fields-when-attributesmaintype--algorithm)
- [Compute DDO](/concepts/ddo-metadata/#fields-when-attributesmaintype--compute)


@ -0,0 +1,83 @@
---
title: Setting up a private Docker registry for a Compute-to-Data environment
description: Learn how to set up your own Docker registry and push images for running algorithms in a C2D environment.
---
## Prerequisites
1. A running Docker environment on the server.
2. A domain name mapped to the server's IP address.
3. An SSL certificate.
## Generate password file
Replace the placeholders in `<>` with appropriate values.
```bash
docker run \
--entrypoint htpasswd \
httpd:2 -Bbn <username> <password> > <path>/auth/htpasswd
```
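The redirect in the command above only works if the `<path>/auth` folder already exists, and the registry setup also mounts `<path>/data` and `<path>/certs`. A minimal sketch for creating that layout up front follows; the function name `prepare_registry_dirs` and the example base path are assumptions, not part of the original instructions.

```bash
# Sketch only: create the data/auth/certs folders used by the registry setup.
# The argument is a hypothetical base folder standing in for <path>.
prepare_registry_dirs() {
  local base="$1"
  mkdir -p "$base/data" "$base/auth" "$base/certs"
}

# Usage (example path, pick your own):
# prepare_registry_dirs "$HOME/registry"
```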
## Docker compose template file for registry
Copy the YAML content below into a `docker-compose.yml` file and replace the placeholders in `<>`.
```yml
version: '3'
services:
  registry:
    restart: always
    container_name: my-docker-registry
    image: registry:2
    ports:
      - 5050:5000
    environment:
      REGISTRY_HTTP_TLS_CERTIFICATE: /certs/domain.crt
      REGISTRY_HTTP_TLS_KEY: /certs/domain.key
      REGISTRY_AUTH: htpasswd
      REGISTRY_AUTH_HTPASSWD_PATH: /auth/htpasswd
      REGISTRY_AUTH_HTPASSWD_REALM: Registry Realm
      REGISTRY_HTTP_SECRET: <secret>
    volumes:
      - <path>/data:/var/lib/registry
      - <path>/auth:/auth
      - <path>/certs:/certs
```
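The `<secret>` placeholder for `REGISTRY_HTTP_SECRET` is typically filled with a random string. One way to produce such a value is sketched below; the function name `generate_registry_secret` and the 32-byte length are assumptions chosen for illustration.

```bash
# Sketch only: print 32 random bytes as 64 lowercase hex characters.
generate_registry_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Usage: paste the printed value in place of <secret>.
# generate_registry_secret
```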
## Start the registry
```bash
docker-compose -f docker-compose.yml up
```
## List images in the registry
```bash
curl -X GET -u <username>:<password> https://example.com/v2/_catalog
```
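Besides the catalog, the Docker Registry HTTP API V2 exposes a per-repository tag list at `/v2/<name>/tags/list`. The sketch below only builds that URL; the helper name `registry_tags_url` is an assumption, and `example.com` and `my-algo` are the placeholder values already used on this page.

```bash
# Sketch only: build the tag-list URL for one repository in the registry.
# Arguments: registry host (optionally with port), repository name.
registry_tags_url() {
  local registry="$1"
  local repository="$2"
  printf 'https://%s/v2/%s/tags/list\n' "$registry" "$repository"
}

# Usage (needs a reachable registry and valid credentials):
# curl -X GET -u <username>:<password> "$(registry_tags_url example.com my-algo)"
```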
## Other useful commands
## Login to registry
```bash
docker login example.com -u <username> -p <password>
```
## Build and push image to registry
Use the commands below to build an image from a `Dockerfile` and push to your own private registry.
```bash
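# Build the image from the Dockerfile in the current directory and tag it for the registry
docker build . -t example.com/my-algo:latest
# Push the tagged image to the private registry
docker image push example.com/my-algo:latest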
```
## Next step
You can now publish an algorithm asset whose metadata contains the registry URL, image, and tag information, so that users can run C2D jobs with it.
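Since the metadata needs the image and the tag as separate values, it can help to split a full image reference such as `example.com/my-algo:latest`. This is a minimal sketch: the function and variable names are assumptions, it expects a reference that includes a tag, and the exact metadata field names are defined by the algorithm DDO rather than here.

```bash
# Sketch only: split "registry/name:tag" into IMAGE_NAME and IMAGE_TAG.
# Assumes the reference ends in ":<tag>"; a registry port is left untouched
# because only the last colon is used as the separator.
split_image_ref() {
  local ref="$1"
  IMAGE_NAME="${ref%:*}"   # everything before the last colon
  IMAGE_TAG="${ref##*:}"   # everything after the last colon
}

# Usage:
split_image_ref "example.com/my-algo:latest"
echo "image: $IMAGE_NAME"
echo "tag:   $IMAGE_TAG"
```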


@ -24,8 +24,78 @@ wget -q --show-progress https://github.com/kubernetes/minikube/releases/download
sudo dpkg -i minikube_1.22.0-0_amd64.deb
```
## Start Minikube
The first command is important: it solves a [PersistentVolumeClaims problem](https://github.com/kubernetes/minikube/issues/7828).
```bash
minikube config set kubernetes-version v1.16.0
minikube start --cni=calico --driver=docker --container-runtime=docker
```
## Install kubectl
```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(<kubectl.sha256) kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```
Wait until all the default pods are running (1/1).
```bash
watch kubectl get pods --all-namespaces
```
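For scripted setups, `kubectl wait` can replace the interactive `watch`. The wrapper below is a sketch under assumptions: the function name `wait_for_pods`, the `KUBECTL` override variable, and the 300-second default are all illustrative choices. Pods that have already run to completion never report `Ready`, so the wait may time out on clusters that contain such pods.

```bash
# Sketch only: block until every pod reports Ready, or the timeout expires.
# KUBECTL is a hypothetical override so the command can be swapped or stubbed.
wait_for_pods() {
  local timeout_seconds="${1:-300}"
  "${KUBECTL:-kubectl}" wait --for=condition=Ready pods --all --all-namespaces \
    --timeout="${timeout_seconds}s"
}

# Usage (needs a running cluster):
# wait_for_pods 300
```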
### Run IPFS host
```bash
export ipfs_staging=~/ipfs_staging
export ipfs_data=~/ipfs_data
docker run -d --name ipfs_host -v $ipfs_staging:/export -v $ipfs_data:/data/ipfs -p 4001:4001 -p 4001:4001/udp -p 127.0.0.1:8080:8080 -p 127.0.0.1:5001:5001 ipfs/go-ipfs:latest
sudo /bin/sh -c 'echo "127.0.0.1 youripfsserver" >> /etc/hosts'
```
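To confirm that the IPFS container is answering, you can query its HTTP API on the port mapped above (`127.0.0.1:5001`). This is a minimal sketch: the function name `ipfs_api_version` and the `CURL` override variable are assumptions. The IPFS HTTP API expects `POST` requests, which is why the request method is set explicitly.

```bash
# Sketch only: ask the local IPFS API for its version.
# CURL is a hypothetical override so the command can be swapped or stubbed.
ipfs_api_version() {
  local api="${1:-http://127.0.0.1:5001}"
  "${CURL:-curl}" --silent --request POST "$api/api/v0/version"
}

# Usage (needs the ipfs_host container to be running):
# ipfs_api_version
```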
## Storage class (Optional)
For minikube, you can use the default 'standard' class.
For AWS, please make sure that your class allocates volumes in the same region and zone in which you are running your pods.
We created our own 'standard' class in AWS:
```bash
kubectl get storageclass standard -o yaml
```
```yaml
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - us-east-1a
apiVersion: storage.k8s.io/v1
kind: StorageClass
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
For more information, please visit https://kubernetes.io/docs/concepts/storage/storage-classes/
## Download and Configure Operator Service
Open a new terminal and run the command below.
```bash
git clone https://github.com/oceanprotocol/operator-service.git
```
@ -68,30 +138,6 @@ Check the [README](https://github.com/oceanprotocol/operator-engine#customize-yo
At a minimum you should add your IPFS URLs or AWS settings, and add (or remove) notification URLs.
## Install kubectl
```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(<kubectl.sha256) kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
```
## Start Minikube
The first command is important: it solves a [PersistentVolumeClaims problem](https://github.com/kubernetes/minikube/issues/7828).
```bash
minikube config set kubernetes-version v1.16.0
minikube start --cni=calico --driver=docker --container-runtime=docker
```
Wait until all the default pods are running (1/1).
```bash
watch kubectl get pods --all-namespaces
```
## Create namespaces


@ -1,132 +0,0 @@
---
title: Set Up a Compute-to-Data Environment
description:
---
## Requirements
First, create a folder with the following structure:
```text
ocean/
  barge/
  operator-service/
  operator-engine/
```
Then you need the following parts:
- a working [Barge](https://github.com/oceanprotocol/barge). For this setup, we will assume that Barge is installed in /ocean/barge/
- a working Kubernetes (K8s) cluster ([Minikube](../compute-to-data-minikube/) is a good start)
- a working `kubectl` connected to the K8s cluster
- one folder (/ocean/operator-service/), in which we will download the following:
  - [postgres-configmap.yaml](https://raw.githubusercontent.com/oceanprotocol/operator-service/main/kubernetes/postgres-configmap.yaml)
  - [postgres-storage.yaml](https://raw.githubusercontent.com/oceanprotocol/operator-service/main/kubernetes/postgres-storage.yaml)
  - [postgres-deployment.yaml](https://raw.githubusercontent.com/oceanprotocol/operator-service/main/kubernetes/postgres-deployment.yaml)
  - [postgresql-service.yaml](https://raw.githubusercontent.com/oceanprotocol/operator-service/main/kubernetes/postgresql-service.yaml)
  - [deployment.yaml](https://raw.githubusercontent.com/oceanprotocol/operator-service/main/kubernetes/deployment.yaml)
- one folder (/ocean/operator-engine/), in which we will download the following:
  - [sa.yml](https://raw.githubusercontent.com/oceanprotocol/operator-engine/main/kubernetes/sa.yml)
  - [binding.yml](https://raw.githubusercontent.com/oceanprotocol/operator-engine/main/kubernetes/binding.yml)
  - [operator.yml](https://raw.githubusercontent.com/oceanprotocol/operator-engine/main/kubernetes/operator.yml)
## Customize your Operator Service deployment
The following resources need attention:
| Resource | Variable | Description |
| ------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------ |
| `postgres-configmap.yaml` | | Contains secrets for the PostgreSQL deployment. |
| `deployment.yaml`         | `ALGO_POD_TIMEOUT` | Allowed time for an algorithm to run. If a run exceeds this value (in minutes), it is killed.          |
## Customize your Operator Engine deployment
Check the [README](https://github.com/oceanprotocol/operator-engine#customize-your-operator-engine-deployment) section of operator engine to customize your deployment
## Storage class
For minikube, you can use the default 'standard' class.
For AWS, please make sure that your class allocates volumes in the same region and zone in which you are running your pods.
We created our own 'standard' class in AWS:
```bash
kubectl get storageclass standard -o yaml
```
```yaml
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - us-east-1a
apiVersion: storage.k8s.io/v1
kind: StorageClass
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
For more information, please visit https://kubernetes.io/docs/concepts/storage/storage-classes/
## Create namespaces
```bash
kubectl create ns ocean-operator
kubectl create ns ocean-compute
```
## Deploy Operator Service
```bash
kubectl config set-context --current --namespace ocean-operator
kubectl create -f /ocean/operator-service/postgres-configmap.yaml
kubectl create -f /ocean/operator-service/postgres-storage.yaml
kubectl create -f /ocean/operator-service/postgres-deployment.yaml
kubectl create -f /ocean/operator-service/postgresql-service.yaml
kubectl apply -f /ocean/operator-service/deployment.yaml
```
## Deploy Operator Engine
```bash
kubectl config set-context --current --namespace ocean-compute
kubectl apply -f /ocean/operator-engine/sa.yml
kubectl apply -f /ocean/operator-engine/binding.yml
kubectl apply -f /ocean/operator-engine/operator.yml
kubectl create -f /ocean/operator-service/postgres-configmap.yaml
```
## Expose Operator Service
```bash
kubectl expose deployment operator-api --namespace=ocean-operator --port=8050
```
Run a port forward or create your ingress service (not covered here):
```bash
kubectl -n ocean-operator port-forward svc/operator-api 8050
```
## Initialize database
If your cluster is running on example.com:
```bash
curl -X POST "http://example.com:8050/api/v1/operator/pgsqlinit" -H "accept: application/json"
```
## Update Barge for local testing
Update Barge's Provider by adding or updating the `OPERATOR_SERVICE_URL` env in `/ocean/barge/compose-files/provider.yaml`
```yaml
OPERATOR_SERVICE_URL: http://example.com:8050/
```
Restart Barge with updated provider configuration


Binary image file changed (117 KiB before, 117 KiB after).


@ -15,7 +15,7 @@
- group: Compute-to-Data - group: Compute-to-Data
items: items:
- title: Compute-to-Data Overview - title: Introduction
link: /concepts/compute-to-data/ link: /concepts/compute-to-data/
- group: Specifying Assets - group: Specifying Assets


@ -37,12 +37,16 @@
- group: Compute-to-Data - group: Compute-to-Data
items: items:
- title: Architecture Overview
link: /tutorials/compute-to-data-architecture/
- title: Run a Compute-to-Data Environment
link: /tutorials/compute-to-data-minikube/
- title: Datasets and algorithms
link: /tutorials/compute-to-data-datasets-algorithms/
- title: Writing Algorithms - title: Writing Algorithms
link: /tutorials/compute-to-data-algorithms/ link: /tutorials/compute-to-data-algorithms/
- title: Run a Compute-to-Data Environment - title: Setting up docker registry
link: /tutorials/compute-to-data/ link: /tutorials/compute-to-data-docker-registry/
- title: Minikube Compute-to-Data Environment
link: /tutorials/compute-to-data-minikube/
- group: Storage Setup - group: Storage Setup
items: items: