Log and Get Data
A DataDirectory
is a collection of files and folders that are on remote storage (cloud buckets like S3, GCS, Azure Blob). DataDirectories are useful for storing data that is associated with a particular ML Repository.
These differ from Artifacts as these aren't versioned and aren't associated with any Runs. This makes them ideal for storing data that is not going to change or is constant among runs. For example, you can store your custom data with which you want to finetune an LLM here, without having to go through the process of creating a Run and storing the data in an Artifact.
Creating a Data Directory
Before we can start logging our data in a Data Directory, we need to create it:
To create a Data Directory you need:
- ml_repo: Name of the ML Repo in which you want to create data_directory
- name: Name of the DataDirectory to be created.
from truefoundry.ml import get_client
client = get_client()
data_directory = client.create_data_directory(name="<data_directory-name>", ml_repo="<repo-name>")
print(data_directory.fqn)
Logging data in a Data Directory
To log data in Data Directory you need to get the data_directory instance, and then log the data in it using the .add_files
method.
For this, you will require
- fqn: Fully qualified name of the data directory.
- file_paths: A list of pairs of (
\<source path>
,\<destination path
) to add files and folders to the DataDirectory contents. The first member of the pair should be a file or directory path and the second member should be the path inside the artifact contents to upload to.
import os
from truefoundry.ml import get_client, DataDirectoryPath
with open("artifact.txt", "w") as f:
f.write("hello-world")
client = get_client()
data_directory = client.get_data_directory_by_fqn(fqn="<data_directory-fqn>")
data_directory.add_files(
file_paths=[DataDirectoryPath('artifact.txt', 'a/b/')]
)
this would result in
.
└── a/
└── b/
└── artifact.txt
Downloading data from the Data Directory
To download data in Data Directory you need to get the data_directory instance, and then download the data in it using the .download
method.
For this you will need the:
- download_path: Relative source path to the desired files.
from truefoundry.ml import get_client
client = get_client()
data_directory = client.get_data_directory_by_fqn(fqn="<your-artifact-fqn>")
data_directory.download(download_path="<your-desired-download-path>")
Updated 8 months ago