Composable Data Flows
Data is the fuel that drives ML and AI. The popular expression "garbage in, garbage out" holds true: the most effective way to improve data science outcomes is better data. Data can exist in several different forms throughout the entire AI/ML value creation loop (a short code sketch after the list below walks through these stages):
- Raw Data: This is the unprocessed, untouched data, fresh from the source. Example: a sales spreadsheet from a coffee shop or a corpus of internet text.
- Cleaned Data and Feature Vectors: The raw data, now polished and transformed into numerical representations (feature vectors). Example: the coffee shop sales data, now cleaned and organized, or text data preprocessed into word embeddings.
- Trained Models: Machine learning models trained on feature vectors, learning the patterns and relationships in the data. Example: a random forest model predicting coffee sales or GPT-3 trained on a vast text corpus.
- Data to Tune Models: Additional data introduced to further refine and enhance model performance. Example: a new batch of sales data for the coffee shop model, or specific domain text data for GPT-3.
- Tuned Models: Models that have been optimized for high performance, robustness, and accuracy. Example: a tuned random forest model forecasting the coffee shop's busiest hours, or a fine-tuned GPT-3 capable of generating expert-level text.
- Model Prediction Inputs: Inputs provided to the models to generate insights. Example: inputting today's date and weather into the sales model, or a text prompt for GPT-3 to generate a blog post.
- Model Prediction Outputs: The models' predictions or insights based on the inputs. Example: the sales model's forecast of a spike in iced coffee sales due to an incoming heatwave, or GPT-3's generated blog post on sustainability in business.
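
To make these stages concrete, here is a minimal sketch in Python that walks the coffee-shop example through the loop with pandas and scikit-learn. The column names, the extra tuning batch, and the hyperparameters are illustrative assumptions, not a prescribed pipeline.

```python
# Minimal sketch of the value creation loop on a toy coffee-shop sales table.
# Column names, the tuning batch, and hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Raw data: unprocessed sales records, fresh from the source.
raw = pd.DataFrame({
    "date": ["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-04"],
    "temp_c": [24.0, 31.0, 29.5, None],
    "iced_coffees_sold": [120, 210, 190, 160],
})

# Cleaned data and feature vectors: drop incomplete rows, derive numeric features.
clean = raw.dropna().copy()
clean["day_of_week"] = pd.to_datetime(clean["date"]).dt.dayofweek
X = clean[["temp_c", "day_of_week"]]
y = clean["iced_coffees_sold"]

# Trained model: learns how weather and weekday relate to sales.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Data to tune the model, and the tuned model: refit with a new batch of sales data.
new_batch = pd.DataFrame({"temp_c": [33.0], "day_of_week": [5], "iced_coffees_sold": [240]})
X_tuned = pd.concat([X, new_batch[["temp_c", "day_of_week"]]])
y_tuned = pd.concat([y, new_batch["iced_coffees_sold"]])
tuned_model = RandomForestRegressor(n_estimators=100, random_state=0)
tuned_model.fit(X_tuned, y_tuned)

# Prediction input: today's forecast; prediction output: expected iced-coffee sales.
today = pd.DataFrame({"temp_c": [35.0], "day_of_week": [4]})
print(tuned_model.predict(today))  # e.g. a forecast spike ahead of a heatwave
```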
With Ocean Protocol, data can be tokenized at every stage of the value creation loop. By building on Ocean's standards, individuals can work together to mix and match valuable assets and compose powerful data flows. For example, instead of starting from scratch when building a model, a data scientist can pick up a dataset that has already been cleaned and prepared for model building. Likewise, to fine-tune a foundation model, they can use a curated tuning dataset already shared on Ocean rather than assembling one themselves, as in the sketch below.
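
The composition step might look something like the following hypothetical sketch. Here `fetch_prepared_dataset`, the DID string, and the column names are placeholders standing in for whatever retrieval mechanism an Ocean-based marketplace client provides; this is not the ocean.py API.

```python
# Hypothetical composition sketch: reuse someone else's already-prepared,
# tokenized dataset instead of repeating the cleaning step above.
# fetch_prepared_dataset() and the DID string are placeholders, not the real ocean.py API.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fetch_prepared_dataset(did: str) -> pd.DataFrame:
    """Placeholder for a marketplace client call that resolves a tokenized
    asset by its decentralized identifier and returns it as a table."""
    raise NotImplementedError("swap in your marketplace client's download call")


# Compose: a shared, pre-cleaned dataset plus your own model training step.
prepared = fetch_prepared_dataset("did:op:<placeholder>")
X = prepared[["temp_c", "day_of_week"]]   # assumed feature columns
y = prepared["iced_coffees_sold"]         # assumed target column
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```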