
GITBOOK-313: Data Science Description

Christian Casazza 2023-05-26 13:31:42 +00:00 committed by gitbook-bot
parent d24b9dc838
commit c249ab0a6f
16 changed files with 51 additions and 104 deletions

(Image diffs omitted: five image assets changed in this commit — four of 77 KiB and one of 1.4 MiB; dimensions unchanged.)

View File

@@ -1,6 +1,6 @@
---
description: Help for wherever you are on your Ocean Protocol journey.
cover: .gitbook/assets/cover/contribute (1).png
cover: .gitbook/assets/cover/contribute (1) (1).png
coverY: 0
layout: landing
---

View File

@@ -83,8 +83,7 @@
* [Hosting a data challenge](data-science/data-challenges/hosting-a-data-challenge.md)
* [How to contribute](data-science/how-to-contribute/README.md)
* [Data Engineers](data-science/how-to-contribute/data-engineers.md)
* [Data Scientists](data-science/how-to-contribute/data-scientists/README.md)
* [Creating a new docker image for C2D](data-science/how-to-contribute/data-scientists/creating-a-new-docker-image-for-c2d.md)
* [Data Scientists](data-science/how-to-contribute/data-scientists.md)
* [🔨 Infrastructure](infrastructure/README.md)
* [Setup a Server](infrastructure/setup-server.md)
* [Deploying Marketplace](infrastructure/deploying-marketplace.md)

View File

@@ -6,3 +6,36 @@ coverY: 0
# 📊 Data Science
Ocean Protocol was built with a world running on data and AI in mind. At its base, Ocean Protocol is a neutral registry that acts as a standard for ownership and access control across storage and compute providers for all types of digital assets and services. Providing a standard allows developers to focus on building the best product possible instead of on proprietary methods for access control. The tools and libraries built on top of Ocean act as an operating system for interacting with this registry of assets. By choosing to build the protocol on crypto rails while leveraging standards such as wallets, ERC20s, and NFTs, Ocean Protocol inherits the natural benefits of blockchains and the innovation of the DeFi world. Some of those benefits are explained in more detail below. Together, they ensure Ocean Protocol is the best option for data scientists and data engineers to build and distribute their work around the world, while also making it easy to turn a simple project into a full-fledged business.
* Privacy-preserving data sharing
* There is an enormous amount of data in the world that can be used to build powerful analytics workflows and models. While many companies and individuals could benefit from these insights, and gain an additional revenue stream through direct data monetization, privacy concerns mean that most of that data is stuck inside data silos. Ocean Protocol's Compute-to-Data engine resolves the tradeoff between data openness and data privacy. Data publishers can publish their assets on Ocean but allow only computation to be run on the dataset rather than downloads. Publishers can let third parties train models and generate insights from their data without the underlying data itself being exposed. This unlocks a new revenue stream for businesses, while also making it possible to utilize third-party data science talent for internal needs without having to onboard them to the company. In addition, the C2D engine reduces the MLOps workload for data scientists. One of the largest pain points for data scientists is deploying their models into production. Many data scientists don't possess the skill set to interact with cloud computing providers to run their models, or simply don't enjoy this part of the process. Ocean Protocol's Compute-to-Data engine provides a way for data scientists to deploy their models without worrying about this MLOps work. Users simply need to create a Docker image around their published models and algorithms so that they can be run with the C2D engine (a minimal algorithm sketch appears after this list). For more information, check out our [Compute-to-Data section](../developers/compute-to-data/).
* Fine-grained access control
* Access control is one of the most important parts of data sharing. Ocean Protocol is a standard for managing access control across various storage providers. Publishers can add the credentials for accessing their data assets directly into the assets they list on Ocean. Publishers can also utilize a fine-grained allow list so that only specific whitelisted wallets are able to purchase their assets. The fine-grained access control of Ocean Protocol makes it easier for data publishers to interact with each other's assets across different storage providers while ensuring that only those they want to share the data with can access it. They can also ensure that only whitelisted algorithms from trusted parties are allowed to run any computation on their data. To learn more, check out our [fine-grained access control section](../developers/Fine-Grained-Permissions.md).
* Crypto-native Payments
* Utilizing Ocean Protocol contracts for payment processing brings numerous benefits compared to the traditional financial system. One major advantage is the significantly lower transaction fees, ranging from 0.1% to 0.2% per transaction, a major reduction compared to the 2-6% typically associated with traditional financial systems. Another key benefit is the instant settlement nature of crypto. Payments are processed immediately, ensuring that funds are readily available for immediate use. This eliminates the usual wait of several days associated with traditional systems and avoids any additional charges that may arise from delayed settlements. Instant settlement also provides a zero counterparty risk environment: with the absence of chargebacks, businesses can enjoy greater stability in their revenue streams, eliminating the concerns associated with potential payment reversals. Moreover, users can transact with each other with far greater ease across borders, using any ERC20 token, such as OCEAN or USDC. This provides a standard for selling products around the world. To learn more, check out our [asset pricing](../developers/asset-pricing.md) and [contracts](../developers/contracts/) sections.
* Provenance of data
* One of the most important parts of building a good data science product is a strong understanding of the context in which a dataset was created. With greater provenance, data scientists can be more confident in the data they are using. In web2 systems, this typically requires either learning to pull data from proprietary systems or paying for specialty software that passively monitors data pipelines. Ocean Protocol leverages the natural abilities of blockchains to provide an enhanced audit trail. For any given compute or data asset, anyone can query the chain to understand who published it and when, any metadata changes, and what compute jobs were used to create the asset. As more of the data pipeline metadata gets pushed on-chain, data scientists and engineers will be able to leverage the best system for data provenance possible and build greater trust in their data. To learn more, check out our [Subgraph section](../developers/subgraph/).
* Verified usage statistics
* Receiving any product information from web2 platforms requires using their API and trusting whatever data the company chooses to provide. Their API may not contain all of the information a user might want, and schema changes and rate limits may make retrieving information difficult. With Ocean Protocol, anyone can leverage composable subgraphs to receive rich information about their products. They can know exactly who accessed their asset, when, total revenue, and more. These verified access statistics help both data publishers and data consumers. Data publishers can create powerful customer analytics by building data models that incorporate their customers' on-chain activity to understand them and their habits at a high level. Data consumers can have greater trust in the quality of the assets they are using for their projects. To learn more, check out our [Subgraph section](../developers/subgraph/) and Aquarius pages.
* Global discovery of assets
* Being able to easily find useful data and models is a vital part of sharing your work with the world and establishing a market. In the web2 world, only the company hosting the platform can impact how assets are discovered. For example, Amazon decides what assets are shown in their marketplace, and Huggingface decides how assets are found on their platform. When working with these platforms, there is rarely transparency on exactly how assets are displayed, so users are left to guess the best way to promote their assets. With Ocean Protocol, assets can be permissionlessly discovered since they utilize the on-chain standards of NFTs and ERC20s. Anyone can fork Ocean Market and develop their own method of promoting assets, while ensuring transparency on how it is done. To learn more, check out our [Aquarius](../developers/aquarius/) and [Build a Marketplace](../developers/build-a-marketplace/) sections.
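The Compute-to-Data bullet above mentions packaging models and algorithms so the C2D engine can run them. As a rough illustration only, here is a minimal Python algorithm written against the input/output conventions described in the Compute-to-Data docs; treat the `/data/inputs` and `/data/outputs` paths, the `DIDS` environment variable, and the CSV input format as assumptions to verify against that section.

```python
# Minimal sketch of an algorithm meant to run inside an Ocean Compute-to-Data job.
# Assumed conventions (verify against the Compute-to-Data docs): input files are
# mounted under /data/inputs/<did>/<index>, the DIDS env var lists the input DIDs,
# and anything written to /data/outputs is returned to the consumer.
import json
import os

import pandas as pd


def input_files():
    dids = json.loads(os.environ.get("DIDS", "[]"))
    for did in dids:
        did_dir = os.path.join("/data/inputs", did)
        for index in sorted(os.listdir(did_dir)):
            yield os.path.join(did_dir, index)


def main():
    # Assumes the published datasets are CSV files.
    data = pd.concat((pd.read_csv(path) for path in input_files()), ignore_index=True)

    # Toy "model": per-column summary statistics stand in for real training/inference.
    os.makedirs("/data/outputs", exist_ok=True)
    data.describe().to_json("/data/outputs/results.json")


if __name__ == "__main__":
    main()
```

Packaging a script like this, together with its dependencies, into a Docker image is what makes it publishable as a C2D algorithm asset.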

View File

@@ -1,6 +1,6 @@
# Data Challenges
Data challenges present an exciting opportunity for data scientists to hone their skills, solve actual business problems, and earn some income along the way. These operate as data science competitions where participants are tasked with solving a business problem. Data challenges can have several different formats, topics, and sponsors. One of the main advantages of these data challenges is that participants retain ownership of their work and the ability to further monetize it outside of the competition.
Data challenges present an exciting opportunity for data scientists to hone their skills, solve actual business problems, and earn some income along the way. These operate as data science competitions where participants are tasked with solving a business problem. Data challenges can have several different formats, topics, and sponsors. One of the main advantages of these data challenges is that participants retain ownership of their work and the ability to further monetize it outside of the competition. Active and past data challenges can be found on the Ocean Protocol site [here](https://oceanprotocol.com/challenges).

View File

@@ -1,10 +1,8 @@
# Hosting a data challenge
Creating a data challenge 
Hosting a data challenge can be an exciting way for data publishers to seed use cases and bring attention to their data assets. Hosting a challenge can also be a good way to tap a community of data scientists to build products on top of your data, gaining insights and useful models for your business without needing to bring in an in-house data science team. The steps to host a data challenge can be found below.
1.  Establish the business problem you want to solve. The first step in building a data solution is understanding what you want to solve. For example, you may want to be able to predict the drought risk in an area to help price parametric insurance, or predict the price of ETH to optimize Uniswap LPing. 
1. Establish the business problem you want to solve. The first step in building a data solution is understanding what you want to solve. For example, you may want to be able to predict the drought risk in an area to help price parametric insurance, or predict the price of ETH to optimize Uniswap LPing. 
2. Curate the dataset for the challenge. The key to hosting a good data challenge is to provide an exciting and thorough dataset that participants can use to build their solutions. Do your research to understand what data is available: whether it is freely available from an API, available for download, requires any transformations, etc. For the first challenge, it is alright if the created dataset is a static file. However, it is best to ensure there is a path to making the data available from a dynamic endpoint so that entries can eventually be applied to real-world use cases.
3. Decide how the judging process will occur. This includes how long to make the review period, how to score submissions, and how any prizes will be divided among participants (a simple prize-split sketch follows this list).
4. Work with OPF to gather participants for your data challenge. Creating blog posts and hosting Twitter Spaces is a good way to spread the word about your data challenge.
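The judging step above leaves scoring and prize allocation entirely up to the host. As a purely illustrative sketch (the team names, scores, and prize pool below are made up), one simple rule is to split the pool in proportion to reviewer scores:

```python
# Illustrative only: split a prize pool in proportion to reviewer scores.
# The rubric, weights, and prize amounts are entirely up to the challenge host.
def split_prizes(scores: dict, prize_pool: float) -> dict:
    total = sum(scores.values())
    return {team: prize_pool * score / total for team, score in scores.items()}


submissions = {"team_a": 8.5, "team_b": 7.0, "team_c": 4.5}  # hypothetical reviewer scores
print(split_prizes(submissions, prize_pool=10_000))
# {'team_a': 4250.0, 'team_b': 3500.0, 'team_c': 2250.0}
```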

View File

@@ -1,7 +1,6 @@
# Participating in a data challenge
Here is the typical flow for a data challenge\
\
Data challenges can take a few different formats. Some challenges are built around data exploration and reporting. In these challenges, participants are tasked with analyzing one or several provided datasets and conducting an exploratory analysis to uncover the hidden insights in the data. They then build a written report that explains their insights so that a business user can make informed decisions from the analysis. In another format, participants are tasked with building a model to perform a given task. They will typically (although not always) be provided with a dataset to train their model on. In some challenges, the participants must publish their model as a compute asset on Ocean so that the model can be run using Compute-to-Data. Here is the typical flow for a data challenge.\
1. On Ocean Market, a dataset (or several) is published along with its description and schema. The dataset will be provided either by the data challenge's sponsoring partner or by OPF.

View File

@@ -1,11 +1,11 @@
# Data Engineers
Data engineers are a key part of data value creation. Building any useful dashboards or machine-learning models requires access to curated data. Data engineers help provide this by creating robust data pipelines that ingest data from source systems, conduct transformations to clean and aggregate the data, and then make the data available for downstream use cases  
Data engineers play a pivotal role in driving data value creation. If you're a developer looking to build useful dashboards or cutting-edge machine-learning models, you understand the importance of having access to well-curated data. That's where our team of friendly and skilled data engineers comes in!
\
Some examples of useful sources of data can be found below
Data engineers specialize in creating robust data pipelines that enable seamless data ingestion from diverse source systems. Their expertise lies in conducting essential transformations to ensure data cleanliness and aggregation, ultimately making the data readily available for downstream use cases (a minimal pipeline sketch follows the list below). With data engineers' support, data scientists can focus on unleashing their creativity and innovation, knowing that the data they need is reliably curated and accessible.
* **Government Open Data:** Governments serve as one of the most reliable sources of data, which, although abundant in information, often suffer from inadequate documentation or pose challenges for data scientists to work with effectively. Establishing a robust Extract, Transform, Load (ETL) pipeline to enhance accessibility to such data is crucial.
* **Public APIs:** Similarly to government open data, a wide array of freely available public APIs covering various data verticals are at the disposal of data engineers. Leveraging these APIs, data engineers can construct pipelines that enable others to efficiently access and utilize the available data.
* **On-Chain Data:** Blockchain data presents an excellent opportunity for data engineers to curate high-quality data. Whether connecting directly to the blockchain or utilizing alternative data providers, simplifying data usability holds significant value. While there is a consistent demand for well-curated decentralized finance (DeFi) data, there is also an emerging need for curated data in other domains, including social data.
There are numerous types of data that data engineers can contribute to the Ocean Protocol ecosystem.
1. Government Open Data: Governments serve as a rich and reliable source of data. However, this data often lacks proper documentation or poses challenges for data scientists to work with effectively. Our team excels at establishing robust Extract, Transform, Load (ETL) pipelines that enhance accessibility to government open data. This way, you can tap into this wealth of information without unnecessary hurdles.
2. Public APIs: A wide range of freely available public APIs covers various data verticals. Leveraging these APIs, our data engineers construct pipelines that enable efficient access and utilization of the data. We'll ensure that you have the necessary tools to harness the potential of these APIs, saving you valuable time and effort.
3. On-Chain Data: Blockchain data presents a unique opportunity for data engineers to curate high-quality data. Whether it's connecting directly to the blockchain or utilizing alternative data providers, our team specializes in simplifying data usability. We understand the consistent demand for well-curated decentralized finance (DeFi) data and the emerging need for curated data in other domains, including social data. Count on us to provide you with curated on-chain data that empowers your projects.
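As a minimal sketch of what such a pipeline can look like in practice (the endpoint URL and the `region`/`value` field names are placeholders, not a real open-data API), an extract-transform-load script might be structured like this:

```python
# Minimal extract-transform-load sketch. The endpoint URL and field names are
# placeholders; swap in a real open-data API and its actual schema.
import pandas as pd
import requests


def extract(url: str) -> pd.DataFrame:
    """Pull raw records from a JSON API (assumed to return a list of records)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and aggregate the raw records into a tidy table."""
    clean = raw.dropna(subset=["region", "value"]).copy()
    clean["value"] = pd.to_numeric(clean["value"], errors="coerce")
    return clean.groupby("region", as_index=False)["value"].mean()


def load(curated: pd.DataFrame, path: str) -> None:
    """Write the curated table so it can be published as a static data asset."""
    curated.to_csv(path, index=False)


if __name__ == "__main__":
    raw = extract("https://example.com/open-data/api/records")  # placeholder URL
    load(transform(raw), "curated_dataset.csv")
```

From here, the curated file (or a dynamic endpoint serving it) is what gets published and access-controlled on Ocean.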

View File

@@ -2,7 +2,10 @@
Data scientists are integral to the process of extracting insights and generating value from data. Their expertise lies in applying statistical analysis, machine learning algorithms, and domain knowledge to uncover patterns, make predictions, and derive meaningful insights from complex datasets.
1. **Heatmaps and Visualizations of Correlations between Features**: Data scientists excel at exploring and visualizing data to uncover meaningful patterns and relationships. By creating heatmaps and visualizations of correlations between features, they provide insights into the interdependencies and associations within the dataset. These visualizations help stakeholders understand the relationships between variables, identify influential factors, and make informed decisions based on data-driven insights. By publishing the results on Ocean, you can allow others to build on your work.
There are a few different ways that data scientists can add value to the Ocean Protocol ecosystem by building on datasets published on Ocean (a compact example follows the list below).
1. **Visualizations of Correlations between Features**: Data scientists excel at exploring and visualizing data to uncover meaningful patterns and relationships. By creating heatmaps and visualizations of correlations between features, they provide insights into the interdependencies and associations within the dataset. These visualizations help stakeholders understand the relationships between variables, identify influential factors, and make informed decisions based on data-driven insights. By publishing the results on Ocean, you can allow others to build on your work.
2. **Conducting Feature Engineering**: Feature engineering is a critical step in the data science workflow. Data scientists leverage their domain knowledge and analytical skills to engineer new features or transform existing ones, creating more informative representations of the data. By identifying and creating relevant features, data scientists enhance the predictive power of models and improve their accuracy. This process often involves techniques such as scaling, normalization, one-hot encoding, and creating interaction or polynomial features.
3. **Conducting Experiments to Find the Best Model**: Data scientists perform rigorous experiments to identify the most suitable machine learning models for a given problem. They evaluate multiple algorithms, considering factors like accuracy, precision, recall, and F1-score, among others. By systematically comparing different models, data scientists select the one that performs best in terms of predictive performance and generalization. This process often involves techniques such as cross-validation, hyperparameter tuning, and model selection based on evaluation metrics. 
4. **Testing Out Different Models**: Data scientists explore various machine learning models to identify the optimal approach for a specific problem. They experiment with algorithms such as linear regression, decision trees, random forests, support vector machines, neural networks, and more. By testing out different models, data scientists gain insights into the strengths and weaknesses of each approach, allowing them to select the most appropriate model for the given dataset and problem domain.
5. **Deploy a model with Compute-to-Data:** After building a robust model, data scientists can utilize C2D to deploy their model for personal or third-party use. At this final stage of value creation, data scientists provide direct value to the ecosystem by driving data consume volume and overall usage of Ocean Protocol.
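As a compact sketch of items 1, 3, and 4 above (the `dataset.csv` file and the `target` column are placeholders, and the features are assumed to already be numeric), a typical exploration-and-comparison script might look like this:

```python
# Compact sketch: correlation heatmap plus a simple cross-validated model comparison.
# "dataset.csv" and the "target" column are placeholders; features are assumed numeric.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("dataset.csv")

# 1. Visualize correlations between numeric features.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")

# 3./4. Compare candidate models with cross-validation on a classification target.
X, y = df.drop(columns=["target"]), df["target"]
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Either output — the heatmap report or the winning model — can then be published on Ocean so that others can build on the work.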

View File

@@ -1,85 +0,0 @@
# Creating a new docker image for C2D
Docker is widely used to run containerized applications with Ocean Compute-to-Data. Ocean Compute-to-Data allows computations to be brought to the data, preserving data privacy and enabling the use of private data without exposing it. Docker is a crucial part of this infrastructure, allowing applications to run in a secure, isolated manner.
The best way to sell access to models using C2D is by creating a Docker image. Docker is an open-source platform designed to automate the deployment, scaling, and management of applications. It uses containerization technology to bundle an application along with all of its related configuration files, libraries, and dependencies into a single package. This means your applications can run uniformly and consistently on any infrastructure. Docker helps solve the problem of "it works on my machine" by providing a consistent environment from development to production.
Main Value:
* Consistency across multiple development and release cycles, ensuring that your application (and its full environment) can be replicated and reliably moved from one environment to another.
* Rapid deployment of applications. Docker containers are lightweight, featuring fast startup times as they don't include the unnecessary binaries and libraries of full-fledged virtual machines.
* Isolation of applications and resources, allowing for safe testing and effective use of system resources.
**Step by Step Guide to Creating a Docker Image**
1. **Install Docker**
First, you need to install Docker on your machine. Visit Docker's official website for installation instructions based on your operating system.
* [Docker for Windows](https://docs.docker.com/desktop/windows/install/)
* [Docker for Mac](https://docs.docker.com/desktop/mac/install/)
* [Docker for Linux](https://docs.docker.com/engine/install/)
2. **Write a Dockerfile**
Docker images are created using Dockerfiles. A Dockerfile is a text document that contains all the commands needed to assemble an image. Create a new file in your project directory named `Dockerfile` (no file extension).
3. **Configure Your Dockerfile**
Here is a basic example of what a Dockerfile might look like for a simple Python Flask application (a matching `app.py` sketch follows this step):
```Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.7-slim
# Set the working directory in the container to /app
WORKDIR /app
# Add the current directory contents into the container at /app
ADD . /app
# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variable
ENV NAME World
# Run app.py when the container launches
CMD ["python", "app.py"]
```
For a more detailed explanation of Dockerfile instructions, check the [Docker documentation](https://docs.docker.com/engine/reference/builder/).
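The Dockerfile above assumes the project directory also contains a `requirements.txt` (for this sketch, a single line: `flask`) and an `app.py`. A minimal placeholder for that application, listening on the port the Dockerfile exposes, could look like this:

```python
# app.py - minimal Flask application matching the example Dockerfile above.
import os

from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    # Reads the NAME environment variable set in the Dockerfile (ENV NAME World).
    return f"Hello, {os.environ.get('NAME', 'World')}!"


if __name__ == "__main__":
    # Listen on all interfaces on port 80 so EXPOSE 80 and -p 4000:80 line up.
    app.run(host="0.0.0.0", port=80)
```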
4. **Build the Docker Image**
Navigate to the directory that houses your Dockerfile in the terminal. Build your Docker image using the `docker build` command. The `-t` flag lets you tag your image so it's easier to find later.
```shell
docker build -t friendlyhello .
```
The `.` tells the Docker daemon to look for the Dockerfile in your current directory.
5. **Verify Your Docker Image**
Use the `docker images` command to verify that your image was created correctly.
```shell
docker images
```
6. **Run a Container from Your Image**
Now you can run a Docker container based on your new image:
```shell
docker run -p 4000:80 friendlyhello
```
The `-p` flag maps the port on your machine to the port on the Docker container.
You've just created and run your first Docker image! For more in-depth information about Docker and its various uses, refer to the [official Docker documentation](https://docs.docker.com/).

View File

@@ -6,7 +6,7 @@ description: Fundamental knowledge of using ERC-20 crypto wallets.
Ocean Protocol users require an ERC-20 compatible wallet to manage their OCEAN and ETH tokens. In this guide, we will provide some recommendations for different wallet options.
<figure><img src="../.gitbook/assets/show-wallet.gif" alt=""><figcaption></figcaption></figure>
<figure><img src="../.gitbook/assets/whats-a-wallet (1).gif" alt=""><figcaption></figcaption></figure>
### What is a wallet?

View File

@@ -14,7 +14,7 @@ Liquidity pools and dynamic pricing used to be supported in previous versions of
4\. Go to field `20. balanceOf` and insert your ETH address. This will retrieve your pool share token balance in wei.
<figure><img src="../.gitbook/assets/liquidity/remove-liquidity-2 (1).png" alt=""><figcaption><p>Balance Of</p></figcaption></figure>
<figure><img src="../.gitbook/assets/liquidity/remove-liquidity-2 (1) (1) (2).png" alt=""><figcaption><p>Balance Of</p></figcaption></figure>
5\. Copy this number, as you will later use it as the `poolAmountIn` parameter (a minimal web3.py sketch of the same lookup follows below).
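If you prefer to read the balance programmatically instead of through Etherscan, here is a minimal web3.py sketch of the same `balanceOf` lookup; the RPC endpoint, pool contract address, and wallet address are placeholders you must replace:

```python
# Minimal web3.py sketch of reading a pool share token balance in wei.
# RPC_URL, POOL_ADDRESS, and MY_ADDRESS are placeholders.
from web3 import Web3

RPC_URL = "https://example-rpc-endpoint"  # placeholder JSON-RPC endpoint
POOL_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder pool contract
MY_ADDRESS = "0x0000000000000000000000000000000000000000"    # placeholder wallet address

# Minimal ERC-20-style ABI fragment containing only balanceOf.
BALANCE_OF_ABI = [{
    "constant": True,
    "inputs": [{"name": "owner", "type": "address"}],
    "name": "balanceOf",
    "outputs": [{"name": "", "type": "uint256"}],
    "stateMutability": "view",
    "type": "function",
}]

w3 = Web3(Web3.HTTPProvider(RPC_URL))
pool = w3.eth.contract(address=Web3.to_checksum_address(POOL_ADDRESS), abi=BALANCE_OF_ABI)

# The balance is returned in wei; this is the value to pass as poolAmountIn.
pool_amount_in = pool.functions.balanceOf(Web3.to_checksum_address(MY_ADDRESS)).call()
print(pool_amount_in)
```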