Log Dataset

You can log a dataset for a run as an artifact using the log_dataset function. Logging also captures statistics about the dataset, which you can visualize on the dashboard.

import mlfoundry

client = mlfoundry.get_client()
run = client.create_run(project_name="my-first-project")

features = [
    {"feature_1": 1, "feature_2": 1.2, "feature_3": "high"},
    {"feature_1": 2, "feature_2": 3.5, "feature_3": "medium"},
    {"feature_1": None, "feature_2": -1, "feature_3": "low"},
]
run.log_dataset(
    dataset_name="train",
    features=features,
    actuals=[1.2, 1.3, 2],
    predictions=[3.1, 4.5, 2],
)

run.end()
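
Because features, actuals, and predictions describe the same rows, it can help to confirm that they line up before logging. The following is a minimal sketch of such a pre-check. `check_aligned` is a hypothetical helper, not part of mlfoundry, and whether log_dataset performs a similar validation itself is not stated here.

```python
def check_aligned(features, actuals=None, predictions=None):
    """Hypothetical helper (not part of mlfoundry): return the row count
    if every provided column has exactly one entry per feature row."""
    n_rows = len(features)
    for name, values in (("actuals", actuals), ("predictions", predictions)):
        # Columns that were not passed (None) are skipped.
        if values is not None and len(values) != n_rows:
            raise ValueError(
                f"{name} has {len(values)} entries, expected {n_rows}"
            )
    return n_rows


features = [
    {"feature_1": 1, "feature_2": 1.2, "feature_3": "high"},
    {"feature_1": 2, "feature_2": 3.5, "feature_3": "medium"},
    {"feature_1": None, "feature_2": -1, "feature_3": "low"},
]
n_rows = check_aligned(features, actuals=[1.2, 1.3, 2], predictions=[3.1, 4.5, 2])
```

Running the check just before run.log_dataset(...) surfaces a length mismatch as a plain ValueError that names the offending argument.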

Note

Not all data formats can be logged with the log_dataset function. The supported formats can be found here.

This is what the dataset statistics look like on the dashboard.

Dataset Stats

Example of storing the Iris dataset

import mlfoundry

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

client = mlfoundry.get_client()
run = client.create_run(project_name="my-first-project")

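# Load Iris as pandas objects (a DataFrame of features and a Series of targets).
iris_dataset = load_iris(as_frame=True)

features = iris_dataset.data
# Map each integer class index to its class name.
actuals = iris_dataset.target.apply(lambda class_index: iris_dataset.target_names[class_index])

# Hold out 20% of the rows for testing, stratified so both splits keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    features, actuals, test_size=0.2, stratify=actuals, random_state=42
)

# Log each split under its own dataset name.
run.log_dataset(features=X_train, actuals=y_train, dataset_name="train")
run.log_dataset(features=X_test, actuals=y_test, dataset_name="test")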

run.end()

Can I overwrite an already logged dataset?

No. Once logged, a dataset is immutable. To log an updated version, use a different dataset name.
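
One way to work with this constraint is to derive a fresh name for each updated dataset. The sketch below assumes a hypothetical naming scheme (a `-v<N>` suffix) and a hypothetical helper, `next_dataset_name`. Neither comes from mlfoundry, and the set of names already used is assumed to be tracked by the caller.

```python
def next_dataset_name(base_name, used_names):
    """Hypothetical helper: return base_name if it is unused, otherwise the
    first free name of the form '<base_name>-v<N>', with N starting at 2."""
    if base_name not in used_names:
        return base_name
    version = 2
    while f"{base_name}-v{version}" in used_names:
        version += 1
    return f"{base_name}-v{version}"


# Names already logged under the run, as tracked by the caller.
used = {"train", "train-v2", "test"}
new_name = next_dataset_name("train", used)
```

The resulting name can then be passed as dataset_name when logging the updated data.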

How can I load a logged dataset?

You can load a dataset logged under a run by using the get_dataset function.

import mlfoundry

client = mlfoundry.get_client()
run = client.get_run("run-fqn-of-the-run")
dataset = run.get_dataset(dataset_name="my-dataset")

print(dataset.features)  # pandas DataFrame
print(dataset.predictions)  # pandas Series
print(dataset.actuals)  # pandas Series
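
Once loaded, the predictions and actuals can be compared directly. As a minimal sketch, the hypothetical helper below computes the mean absolute error over two equal-length sequences of numbers. It assumes numeric targets and, to stay self-contained, uses plain lists holding the values from the first log_dataset example rather than a loaded dataset.

```python
def mean_absolute_error(actuals, predictions):
    """Hypothetical helper: average of |prediction - actual| over all rows."""
    actuals = list(actuals)
    predictions = list(predictions)
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    if not actuals:
        raise ValueError("need at least one row")
    return sum(abs(p - a) for a, p in zip(actuals, predictions)) / len(actuals)


# Values taken from the first log_dataset example.
mae = mean_absolute_error([1.2, 1.3, 2], [3.1, 4.5, 2])
print(round(mae, 2))
```

With a loaded dataset, the same call could take dataset.actuals and dataset.predictions, since a pandas Series can be iterated like a list.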