Tutorial: Features & labels¶
In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:
Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.
Hint
This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.
If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.
import lamindb as ln
import pandas as pd
ln.settings.verbosity = "hint"
💡 connected lamindb: anonymous/lamin-tutorial
Re-cap¶
Let’s briefly re-cap what we learned in Introduction. We started with simple labeling:
# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(updated_at=2024-05-19 23:23:24 UTC, uid='UI4dKwLK1iDoIAWMvJ4j', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False)
Provenance:
📎 created_by: User(uid='00000000', handle='anonymous')
📎 storage: uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Labels:
📎 ulabels (1, ULabel): 'Study 0: initial plant gathering'
In general, it’s good practice to associate labels with features so that we can later feed them into learning algorithms with a defined dimension:
feature = ln.Feature(name="study_name", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()
✅ linked feature 'study_name' to registry 'ULabel'
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='AVSJ4LiRpONRelwyWJl5', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
Artifact(updated_at=2024-05-19 23:23:24 UTC, uid='UI4dKwLK1iDoIAWMvJ4j', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False)
Provenance:
📎 created_by: User(uid='00000000', handle='anonymous')
📎 storage: uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Features:
external: FeatureSet(uid='AVSJ4LiRpONRelwyWJl5', n=1, registry='Feature')
🔗 study_name (1, cat[ULabel]): 'Study 0: initial plant gathering'
Labels:
📎 ulabels (1, ULabel): 'Study 0: initial plant gathering'
Register metadata¶
Features and labels are the primary ways of registering domain-knowledge-related metadata in LaminDB.
Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").
In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
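To make the distinction concrete, here is a minimal plain-Python sketch. It is not LaminDB code; the names `feature_categories` and `is_valid_value` are hypothetical. It models a categorical feature as a name plus the set of labels it may take:

```python
# Hypothetical stand-in for a registry: each categorical feature (a measurement
# dimension) maps to the set of labels (measured values) it can take.
feature_categories = {
    "species": {"iris setosa", "iris versicolor", "iris virginica"},
}


def is_valid_value(feature: str, value: str) -> bool:
    """Return True if `value` is one of the categories registered for `feature`."""
    return value in feature_categories.get(feature, set())


print(is_valid_value("species", "iris setosa"))  # True
print(is_valid_value("species", "iris setossa"))  # False (typo)
```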
Register labels¶
We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.
speciess = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(speciess)
ULabel lets you manage an in-house ontology for all kinds of generic labels.
What are alternatives to ULabel?
In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.
ULabel, however, will get you quite far and scale to ~1M labels.
Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:
is_species = ln.ULabel(name="is_species").save()
is_species.children.set(speciess)
is_species.view_parents(with_children=True)
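Conceptually, the parent-child relation above is a small directed graph. The following plain-Python sketch uses hypothetical names and is not the LaminDB data model. It stores children per parent and answers "which labels are species?":

```python
# Hypothetical in-memory hierarchy: parent label name -> set of child label names.
children = {"is_species": {"setosa", "versicolor", "virginica"}}


def labels_under(parent: str) -> list[str]:
    """Return the sorted child labels registered under `parent`."""
    return sorted(children.get(parent, set()))


def parents_of(label: str) -> list[str]:
    """Return the sorted parents that list `label` as a child."""
    return sorted(p for p, kids in children.items() if label in kids)


print(labels_under("is_species"))  # ['setosa', 'versicolor', 'virginica']
print(parents_of("setosa"))  # ['is_species']
```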
Register features¶
For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.
When we integrate datasets, feature names will label columns that store data.
Let’s create and save a Feature record to identify measurements of the iris species label; the study_name feature was already saved in the re-cap:
ln.Feature(name="iris_species_name", dtype="cat").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()
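The idea behind a lookup object can be sketched with the standard library: names become attributes, so an editor can auto-complete them and a misspelled name is detected rather than silently matching nothing. This is only an illustration; `lookup` and `feature_names` are hypothetical and say nothing about how `ln.Feature.lookup()` is implemented.

```python
from types import SimpleNamespace

# Assumed feature names; in the tutorial they come from the Feature registry.
feature_names = ["study_name", "iris_species_name"]

# Expose each name as an attribute so that editors can auto-complete it.
lookup = SimpleNamespace(**{name: name for name in feature_names})

print(lookup.iris_species_name)  # iris_species_name
print(hasattr(lookup, "iris_speices_name"))  # False (misspelled name)
```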
Validate & link labels¶
We already looked at the metadata for study0 before:
meta_artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_artifact.load(index_col=0) # load a dataframe
meta.head()
💡 you can auto-track these data as a run input by calling `ln.track()`
0 | 1 | |
---|---|---|
0 | iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce... | setosa |
1 | iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710... | versicolor |
2 | iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf... | versicolor |
3 | iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109... | setosa |
4 | iris-bdae8314e4385d8e2322abd8e63a82758a9063c77... | virginica |
Validate metadata¶
Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.
Let’s validate the labels by mapping the values stored in the artifact onto the ULabel registry:
ln.ULabel.validate(meta["1"], field="name")
✅ 3 terms (100.00%) are validated for name
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True])
Everything passed and no fixes are needed!
If validation doesn’t pass, standardize() and inspect() will help standardize data.
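As a rough illustration of what validation against a registry means, the following plain-Python sketch checks values against a set of registered names and maps case and whitespace variants back to the registered spelling. It is a simplified stand-in under stated assumptions: the helpers `validate` and `standardize` below are hypothetical and far simpler than the LaminDB methods of the same name.

```python
registered = {"setosa", "versicolor", "virginica"}  # assumed registry content


def validate(values: list[str]) -> list[bool]:
    """Flag each value that exactly matches a registered name."""
    return [value in registered for value in values]


def standardize(values: list[str]) -> list[str]:
    """Map case and whitespace variants to the registered spelling.

    Unknown values are returned unchanged.
    """
    lookup = {name.lower(): name for name in registered}
    return [lookup.get(value.strip().lower(), value) for value in values]


raw = ["setosa", "Versicolor ", "virginica", "virginca"]
print(validate(raw))  # [True, False, True, False]
print(standardize(raw))  # ['setosa', 'versicolor', 'virginica', 'virginca']
print(validate(standardize(raw)))  # [True, True, True, False]
```

The remaining False marks a genuine typo that no automatic mapping in this sketch can resolve; it needs a manual fix.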
Label artifacts¶
You can label an artifact by calling artifact.labels.add() and passing one or more labels and, optionally, the corresponding feature.
Let’s do this based on the labels in meta.csv:
ln.Artifact.df()
version | created_at | created_by_id | updated_at | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
4 | None | 2024-05-19 23:23:24.836147+00:00 | 1 | 2024-05-19 23:23:24.836198+00:00 | 62uS4kSWwh553bVPhdkJ | 2 | iris_studies/study2_raw_images | None | None | 665518 | PX8Vt9T28y-uCEJO1tKm7A | md5-d | 51.0 | None | 1 | 1 | 1 | False | |
3 | None | 2024-05-19 23:23:24.646038+00:00 | 1 | 2024-05-19 23:23:24.646092+00:00 | nE7BMpA4cluio47JdU66 | 2 | iris_studies/study1_raw_images | None | None | 640617 | j61W__GgImA18CKrIf7FVg | md5-d | 49.0 | None | 1 | 1 | 1 | False | |
2 | None | 2024-05-19 23:23:24.280523+00:00 | 1 | 2024-05-19 23:23:24.280576+00:00 | UI4dKwLK1iDoIAWMvJ4j | 2 | iris_studies/study0_raw_images | None | None | 656692 | wVYKPpEsmmrqSpAZIRXCFg | md5-d | 51.0 | None | 1 | 1 | 1 | False | |
1 | None | 2024-05-19 23:23:23.817202+00:00 | 1 | 2024-05-19 23:23:23.817253+00:00 | 5ruHcLPvlViem0z49X9f | 2 | iris_studies/study0_raw_images/meta.csv | .csv | None | None | 4355 | ZpAEpN0iFYH6vjZNigic7g | md5 | NaN | None | 1 | 1 | 1 | False |
study_artifacts = ln.Artifact.filter(key__startswith="iris_studies/", suffix="").all()
study_labels = ln.ULabel.filter(name="is_study").one().children.all()
for artifact, study in zip(study_artifacts, study_labels):
    # annotate the image folder with its study label
    artifact.labels.add(study, feature=features.study_name)
    # read the per-image species annotations stored next to the images
    df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
    species_labels = ln.ULabel.from_values(df["1"].unique())
    artifact.labels.add(species_labels, feature=features.iris_species_name)
✅ linked feature 'iris_species_name' to registry 'ULabel'
💡 nothing links to it anymore, deleting feature set FeatureSet(uid='AVSJ4LiRpONRelwyWJl5', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ loaded: FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
Query artifacts by labels¶
Using the new annotations, you can now query image artifacts by species & study labels:
ulabels = ln.ULabel.lookup()
artifact = ln.Artifact.filter(ulabels=ulabels.study0).first()
We also see them when calling describe():
artifact.describe()
Artifact(updated_at=2024-05-19 23:23:24 UTC, uid='nE7BMpA4cluio47JdU66', key='iris_studies/study1_raw_images', suffix='', size=640617, hash='j61W__GgImA18CKrIf7FVg', hash_type='md5-d', n_objects=49, visibility=1, key_is_virtual=False)
Provenance:
📎 created_by: User(uid='00000000', handle='anonymous')
📎 storage: uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Features:
external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
🔗 study_name (2, cat[ULabel]): 'study0'
🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
Labels:
📎 ulabels (4, ULabel): 'setosa', 'versicolor', 'virginica', 'study0'
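Conceptually, such a query selects every artifact whose label set contains the requested label. A minimal plain-Python sketch with made-up annotations follows; the `annotations` mapping and `filter_by_label` are hypothetical and not the LaminDB query API.

```python
# Assumed annotations for illustration: artifact key -> set of label names.
annotations = {
    "iris_studies/study0_raw_images": {"study0", "setosa", "versicolor", "virginica"},
    "iris_studies/study1_raw_images": {"study1", "setosa", "versicolor", "virginica"},
    "iris_studies/study2_raw_images": {"study2", "setosa", "versicolor", "virginica"},
}


def filter_by_label(label: str) -> list[str]:
    """Return the sorted keys of all artifacts annotated with `label`."""
    return sorted(key for key, labels in annotations.items() if label in labels)


print(filter_by_label("study0"))  # ['iris_studies/study0_raw_images']
print(len(filter_by_label("setosa")))  # 3
```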
Label collections¶
Labeling collections works in the same way as labeling artifacts:
collection = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
collection.labels.add(ulabels.study0, feature=features.study_name)
all_species_labels = ln.ULabel.filter(parents__name="is_species").all()
collection.labels.add(all_species_labels, feature=features.iris_species_name)
✅ loaded: FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
💡 nothing links to it anymore, deleting feature set FeatureSet(uid='i93fYfn99Gfl2nxIvNQi', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
collection.describe()
Collection(version='1', updated_at=2024-05-19 23:23:24 UTC, uid='w7p8GWvoDRJViSLvunTx', name='Iris collection', description='Iris study 0', hash='WwFLpNFmK8GMC2dSGj1W', visibility=1)
Provenance:
📎 created_by: User(uid='00000000', handle='anonymous')
📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Features:
external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
🔗 study_name (2, cat[ULabel]): 'study0'
🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
Labels:
📎 ulabels (4, ULabel): 'setosa', 'versicolor', 'virginica', 'study0'
Run an ML model¶
Let’s now run a mock ML model that transforms the images into 4 high-level features.
def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data
df = run_ml_model()
💡 saved: Transform(uid='gzL8KJt6OelI', name='Petal & sepal regressor', type='pipeline', updated_at=2024-05-19 23:23:29 UTC, created_by_id=1)
💡 saved: Run(uid='0LiZ43vXlo5liMUV0RSD', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_0LiZ43vXlo5liMUV0RSD.txt
The output is a dataframe:
df.head()
sepal_length | sepal_width | petal_length | petal_width | iris_organism_name | |
---|---|---|---|---|---|
0 | 0.051 | 0.035 | 0.014 | 0.002 | setosa |
1 | 0.049 | 0.030 | 0.014 | 0.002 | setosa |
2 | 0.047 | 0.032 | 0.013 | 0.002 | setosa |
3 | 0.046 | 0.031 | 0.015 | 0.002 | setosa |
4 | 0.050 | 0.036 | 0.014 | 0.002 | setosa |
And this is the pipeline that produced the dataframe:
Register the output data¶
Let’s first register the features of the transformed data:
new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?
Use the unit field of Feature. In the above example, you’d do:
for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # meter, the SI base unit of length
        feature.save()
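Recording the unit matters when datasets arrive in different units. As a hypothetical illustration, values can be converted to the registered unit before they are combined. The `to_meters` helper and the centimeter input are assumptions and not part of the tutorial data.

```python
# Conversion factors to the registered unit "m" (meters).
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meters(value: float, unit: str) -> float:
    """Convert a length to meters; raise on units we do not know."""
    if unit not in TO_METERS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_METERS[unit]


print(round(to_meters(5.1, "cm"), 6))  # 0.051
print(to_meters(0.051, "m"))  # 0.051
```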
We can now validate & register the dataframe in one line:
artifact = ln.Artifact.from_df(
df,
description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/SlGWQhSRAsBBpNFCi64d.parquet')
✅ storing artifact 'SlGWQhSRAsBBpNFCi64d' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/SlGWQhSRAsBBpNFCi64d.parquet'
Artifact(updated_at=2024-05-19 23:23:32 UTC, uid='SlGWQhSRAsBBpNFCi64d', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2)
artifact.features.add_feature_set(ln.FeatureSet(new_features), slot="columns")
Feature sets¶
Get an overview of linked features:
artifact.features
Features:
columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
sepal_length (float)
sepal_width (float)
petal_length (float)
petal_width (float)
iris_organism_name (cat)
You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.
Why does LaminDB model feature sets, not just features?
Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you’ll only need to store 1M instead of 1M x 20k = 20B links.
Interpretation: Model protein panels, gene panels, etc.
Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.
These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
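The performance argument is simple arithmetic. A short sketch with the numbers from the text (variable names are illustrative):

```python
n_samples = 1_000_000  # 1M samples
n_features = 20_000  # panel of 20k transcripts

# Linking every sample to every feature individually
links_per_feature = n_samples * n_features
# Linking every sample to one shared feature set
links_per_feature_set = n_samples

print(f"{links_per_feature:,}")  # 20,000,000,000
print(f"{links_per_feature_set:,}")  # 1,000,000
print(links_per_feature // links_per_feature_set)  # 20000
```

The feature set itself still has to record its 20k members, but only once rather than once per sample.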
A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.
Let’s use it to access all linked features:
artifact.features["columns"].df()
created_at | created_by_id | run_id | updated_at | uid | name | dtype | unit | description | synonyms | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
3 | 2024-05-19 23:23:32.402119+00:00 | 1 | None | 2024-05-19 23:23:32.402192+00:00 | a8s7SXkNIsRI | sepal_length | float | None | None | None |
4 | 2024-05-19 23:23:32.402285+00:00 | 1 | None | 2024-05-19 23:23:32.402327+00:00 | Xe5Tq6Dbvrb1 | sepal_width | float | None | None | None |
5 | 2024-05-19 23:23:32.402411+00:00 | 1 | None | 2024-05-19 23:23:32.402450+00:00 | CByVI6TFPVMy | petal_length | float | None | None | None |
6 | 2024-05-19 23:23:32.402534+00:00 | 1 | None | 2024-05-19 23:23:32.402574+00:00 | drQTElEw1rf5 | petal_width | float | None | None | None |
7 | 2024-05-19 23:23:32.402656+00:00 | 1 | None | 2024-05-19 23:23:32.402695+00:00 | lBmzw6jmRUgN | iris_organism_name | cat | None | None | None |
There is one categorical feature; let’s add the species labels:
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.iris_species_name)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='oarAFpPXtq3yKlqmjTkS', n=1, registry='Feature', hash='pr7SYbKy1OLWX2q1FMAe', created_by_id=1)
Let’s now add study labels:
artifact.labels.add(ulabels.study0, feature=features.study_name)
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
In addition to the columns feature set, we now have an external feature set:
artifact.features
Features:
external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
🔗 study_name (2, cat[ULabel]): 'study0'
🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
sepal_length (float)
sepal_width (float)
petal_length (float)
petal_width (float)
iris_organism_name (cat)
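The log messages above show that a feature set with the same members is loaded rather than created again. One way to get that behavior is an order-independent content hash over the member names. The sketch below assumes such a scheme purely for illustration; `feature_set_hash` is hypothetical and the actual LaminDB hashing may differ. Slots are modeled as plain string keys.

```python
import hashlib


def feature_set_hash(feature_names: list[str]) -> str:
    """Order-independent content hash of a set of feature names (assumed scheme)."""
    joined = ",".join(sorted(set(feature_names)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Slots: string keys pointing to feature sets, here represented by their members.
slots = {
    "columns": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
        "iris_organism_name",
    ],
    "external": ["study_name", "iris_species_name"],
}

for slot, names in slots.items():
    print(slot, len(names), feature_set_hash(names)[:8])

# The same members in a different order give the same hash, so the set can be reused.
same = feature_set_hash(["iris_species_name", "study_name"]) == feature_set_hash(
    slots["external"]
)
print(same)  # True
```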
This is the context for our artifact:
artifact.describe()
artifact.view_lineage()
Artifact(updated_at=2024-05-19 23:23:32 UTC, uid='SlGWQhSRAsBBpNFCi64d', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True)
Provenance:
📎 created_by: User(uid='00000000', handle='anonymous')
📎 storage: uid='fgwMZj1JIHOy', root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', type='local', instance_uid='5WuFt3cW4zRx')
📎 transform: Transform(uid='gzL8KJt6OelI', name='Petal & sepal regressor', type='pipeline')
📎 run: Run(uid='0LiZ43vXlo5liMUV0RSD', started_at=2024-05-19 23:23:29 UTC, is_consecutive=True)
Features:
external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
🔗 study_name (2, cat[ULabel]): 'study0'
🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
sepal_length (float)
sepal_width (float)
petal_length (float)
petal_width (float)
iris_organism_name (cat)
Labels:
📎 ulabels (4, ULabel): 'setosa', 'versicolor', 'virginica', 'study0'
See the database content:
ln.view(registries=["Feature", "FeatureSet", "ULabel"])
Feature
created_at | created_by_id | run_id | updated_at | uid | name | dtype | unit | description | synonyms | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
7 | 2024-05-19 23:23:32.402656+00:00 | 1 | None | 2024-05-19 23:23:32.402695+00:00 | lBmzw6jmRUgN | iris_organism_name | cat | None | None | None |
6 | 2024-05-19 23:23:32.402534+00:00 | 1 | None | 2024-05-19 23:23:32.402574+00:00 | drQTElEw1rf5 | petal_width | float | None | None | None |
5 | 2024-05-19 23:23:32.402411+00:00 | 1 | None | 2024-05-19 23:23:32.402450+00:00 | CByVI6TFPVMy | petal_length | float | None | None | None |
4 | 2024-05-19 23:23:32.402285+00:00 | 1 | None | 2024-05-19 23:23:32.402327+00:00 | Xe5Tq6Dbvrb1 | sepal_width | float | None | None | None |
3 | 2024-05-19 23:23:32.402119+00:00 | 1 | None | 2024-05-19 23:23:32.402192+00:00 | a8s7SXkNIsRI | sepal_length | float | None | None | None |
2 | 2024-05-19 23:23:27.967153+00:00 | 1 | None | 2024-05-19 23:23:28.815905+00:00 | 0AUm4iYF8CpE | iris_species_name | cat[ULabel] | None | None | None |
1 | 2024-05-19 23:23:27.748427+00:00 | 1 | None | 2024-05-19 23:23:27.757144+00:00 | wCwcODTISR4C | study_name | cat[ULabel] | None | None | None |
FeatureSet
created_at | created_by_id | run_id | uid | name | n | dtype | registry | hash | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
2 | 2024-05-19 23:23:28.821520+00:00 | 1 | None | QZRqz9cFg6SmJxzBU9wO | None | 2 | None | Feature | 58lX_dBcok06ZlN12ryt |
4 | 2024-05-19 23:23:32.433112+00:00 | 1 | None | 89vRDkS1DsNXvKWh80Hv | None | 5 | None | Feature | bHnAxI79Pu6350MpTFQN |
5 | 2024-05-19 23:23:32.489564+00:00 | 1 | None | oarAFpPXtq3yKlqmjTkS | None | 1 | None | Feature | pr7SYbKy1OLWX2q1FMAe |
ULabel
created_at | created_by_id | run_id | updated_at | uid | name | description | reference | reference_type | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
9 | 2024-05-19 23:23:27.906120+00:00 | 1 | None | 2024-05-19 23:23:27.906188+00:00 | ELICS2S0 | is_study | None | None | None |
8 | 2024-05-19 23:23:27.902991+00:00 | 1 | None | 2024-05-19 23:23:27.903032+00:00 | 7nSTk5j6 | study2 | None | None | None |
7 | 2024-05-19 23:23:27.902874+00:00 | 1 | None | 2024-05-19 23:23:27.902916+00:00 | MSMDe9ja | study1 | None | None | None |
6 | 2024-05-19 23:23:27.902723+00:00 | 1 | None | 2024-05-19 23:23:27.902794+00:00 | QqzvhQoW | study0 | None | None | None |
5 | 2024-05-19 23:23:27.816868+00:00 | 1 | None | 2024-05-19 23:23:27.816938+00:00 | PIHQsKzj | is_species | None | None | None |
4 | 2024-05-19 23:23:27.806211+00:00 | 1 | None | 2024-05-19 23:23:27.806254+00:00 | 8fXKs7fs | virginica | None | None | None |
3 | 2024-05-19 23:23:27.806091+00:00 | 1 | None | 2024-05-19 23:23:27.806134+00:00 | JBfBT1AI | versicolor | None | None | None |
Manage follow-up data¶
Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.
Let’s track a new analysis:
ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"
ln.track()
💡 notebook imports: lamindb==0.72.0 pandas==1.5.3
💡 saved: Transform(version='1', uid='dMtrt8YMSdl65zKv', name='Tutorial: Features & labels', key='tutorial2', type='notebook', updated_at=2024-05-19 23:23:32 UTC, created_by_id=1)
💡 saved: Run(uid='nadRBLpmOBWAh8BdZs61', transform_id=3, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_nadRBLpmOBWAh8BdZs61.txt
Register a joint collection¶
Assume we already ran all preprocessing including the ML model.
We get a DataFrame and store it as an artifact:
df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/G76wEGtq4G8Hku8MjZrz.parquet')
✅ storing artifact 'G76wEGtq4G8Hku8MjZrz' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/G76wEGtq4G8Hku8MjZrz.parquet'
Artifact(updated_at=2024-05-19 23:23:33 UTC, uid='G76wEGtq4G8Hku8MjZrz', suffix='.parquet', accessor='DataFrame', description='Iris study 2 - transformed', size=5397, hash='1OWu4rEeeob4-ZdGnLhTLw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=3, run_id=3)
Let’s load it:
artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()
We can now store the joint collection:
collection = ln.Collection(
[artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ loaded: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature', hash='bHnAxI79Pu6350MpTFQN', created_by_id=1)
💡 adding artifact [5] as input for run 3, adding parent transform 2
Auto-concatenate datasets¶
Because both datasets measured the same validated feature set, we can auto-concatenate the collection:
collection.load().tail()
sepal_length | sepal_width | petal_length | petal_width | iris_organism_name | |
---|---|---|---|---|---|
145 | 0.067 | 0.030 | 0.052 | 0.023 | virginica |
146 | 0.063 | 0.025 | 0.050 | 0.019 | virginica |
147 | 0.065 | 0.030 | 0.052 | 0.020 | virginica |
148 | 0.062 | 0.034 | 0.054 | 0.023 | virginica |
149 | 0.059 | 0.030 | 0.051 | 0.018 | virginica |
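Why a shared feature set makes concatenation safe can be sketched in plain Python. The helper `concat_if_same_features` is hypothetical and only mimics the idea; it is not how collection.load() is implemented, and the rows are illustrative.

```python
Row = dict[str, object]


def concat_if_same_features(tables: list[list[Row]]) -> list[Row]:
    """Concatenate row lists only if every row has the same set of columns."""
    column_sets = {frozenset(row) for table in tables for row in table}
    if len(column_sets) > 1:
        raise ValueError("tables measure different feature sets")
    return [row for table in tables for row in table]


study1 = [{"sepal_length": 0.051, "iris_organism_name": "setosa"}]
study2 = [{"sepal_length": 0.059, "iris_organism_name": "virginica"}]
print(len(concat_if_same_features([study1, study2])))  # 2

# A table with a differently named column cannot be concatenated blindly.
mismatched = [{"sepal_len_cm": 5.9, "iris_organism_name": "virginica"}]
try:
    concat_if_same_features([study1, mismatched])
except ValueError as error:
    print(error)  # tables measure different feature sets
```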
We can also access & query the underlying two artifact objects:
collection.artifacts.df()
version | created_at | created_by_id | updated_at | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
5 | None | 2024-05-19 23:23:32.418732+00:00 | 1 | 2024-05-19 23:23:32.418781+00:00 | SlGWQhSRAsBBpNFCi64d | 1 | None | .parquet | DataFrame | Iris study 1 - after measuring sepal & petal m... | 5347 | zMBDnOFHeA8CwpaI_7KF9g | md5 | None | None | 2 | 2 | 1 | True |
6 | None | 2024-05-19 23:23:33.898346+00:00 | 1 | 2024-05-19 23:23:33.898402+00:00 | G76wEGtq4G8Hku8MjZrz | 1 | None | .parquet | DataFrame | Iris study 2 - transformed | 5397 | 1OWu4rEeeob4-ZdGnLhTLw | md5 | None | None | 3 | 3 | 1 | True |
Or look at their data lineage:
Or look at the database:
ln.view()
Artifact
version | created_at | created_by_id | updated_at | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
6 | None | 2024-05-19 23:23:33.898346+00:00 | 1 | 2024-05-19 23:23:33.898402+00:00 | G76wEGtq4G8Hku8MjZrz | 1 | None | .parquet | DataFrame | Iris study 2 - transformed | 5397 | 1OWu4rEeeob4-ZdGnLhTLw | md5 | NaN | None | 3 | 3 | 1 | True |
5 | None | 2024-05-19 23:23:32.418732+00:00 | 1 | 2024-05-19 23:23:32.418781+00:00 | SlGWQhSRAsBBpNFCi64d | 1 | None | .parquet | DataFrame | Iris study 1 - after measuring sepal & petal m... | 5347 | zMBDnOFHeA8CwpaI_7KF9g | md5 | NaN | None | 2 | 2 | 1 | True |
4 | None | 2024-05-19 23:23:24.836147+00:00 | 1 | 2024-05-19 23:23:24.836198+00:00 | 62uS4kSWwh553bVPhdkJ | 2 | iris_studies/study2_raw_images | None | None | 665518 | PX8Vt9T28y-uCEJO1tKm7A | md5-d | 51.0 | None | 1 | 1 | 1 | False | |
3 | None | 2024-05-19 23:23:24.646038+00:00 | 1 | 2024-05-19 23:23:24.646092+00:00 | nE7BMpA4cluio47JdU66 | 2 | iris_studies/study1_raw_images | None | None | 640617 | j61W__GgImA18CKrIf7FVg | md5-d | 49.0 | None | 1 | 1 | 1 | False | |
2 | None | 2024-05-19 23:23:24.280523+00:00 | 1 | 2024-05-19 23:23:24.280576+00:00 | UI4dKwLK1iDoIAWMvJ4j | 2 | iris_studies/study0_raw_images | None | None | 656692 | wVYKPpEsmmrqSpAZIRXCFg | md5-d | 51.0 | None | 1 | 1 | 1 | False | |
1 | None | 2024-05-19 23:23:23.817202+00:00 | 1 | 2024-05-19 23:23:23.817253+00:00 | 5ruHcLPvlViem0z49X9f | 2 | iris_studies/study0_raw_images/meta.csv | .csv | None | None | 4355 | ZpAEpN0iFYH6vjZNigic7g | md5 | NaN | None | 1 | 1 | 1 | False |
Collection
version | created_at | created_by_id | updated_at | uid | name | description | hash | reference | reference_type | transform_id | run_id | artifact_id | visibility | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
4 | None | 2024-05-19 23:23:33.936147+00:00 | 1 | 2024-05-19 23:23:33.936192+00:00 | zP8h0uQuKg8VAe9kxWvX | Iris flower study 1 & 2 - transformed | None | I5HABiYOqx1fjRZ2be5E | None | None | 3 | 3 | None | 1 |
3 | 3 | 2024-05-19 23:23:24.843549+00:00 | 1 | 2024-05-19 23:23:24.843593+00:00 | w7p8GWvoDRJViSLvgHz0 | Iris collection | Now includes study2_raw_images | T-U8z2Zi5rFYdAD9pzmS | None | None | 1 | 1 | None | 1 |
2 | 2 | 2024-05-19 23:23:24.653688+00:00 | 1 | 2024-05-19 23:23:24.653733+00:00 | w7p8GWvoDRJViSLv1rKm | Iris collection | Now includes study1_raw_images | 5cCK6ZLOPB0cV3tyeZup | None | None | 1 | 1 | None | 1 |
1 | 1 | 2024-05-19 23:23:24.467012+00:00 | 1 | 2024-05-19 23:23:24.467066+00:00 | w7p8GWvoDRJViSLvunTx | Iris collection | Iris study 0 | WwFLpNFmK8GMC2dSGj1W | None | None | 1 | 1 | None | 1 |
Feature
created_at | created_by_id | run_id | updated_at | uid | name | dtype | unit | description | synonyms | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
7 | 2024-05-19 23:23:32.402656+00:00 | 1 | None | 2024-05-19 23:23:32.402695+00:00 | lBmzw6jmRUgN | iris_organism_name | cat | None | None | None |
6 | 2024-05-19 23:23:32.402534+00:00 | 1 | None | 2024-05-19 23:23:32.402574+00:00 | drQTElEw1rf5 | petal_width | float | None | None | None |
5 | 2024-05-19 23:23:32.402411+00:00 | 1 | None | 2024-05-19 23:23:32.402450+00:00 | CByVI6TFPVMy | petal_length | float | None | None | None |
4 | 2024-05-19 23:23:32.402285+00:00 | 1 | None | 2024-05-19 23:23:32.402327+00:00 | Xe5Tq6Dbvrb1 | sepal_width | float | None | None | None |
3 | 2024-05-19 23:23:32.402119+00:00 | 1 | None | 2024-05-19 23:23:32.402192+00:00 | a8s7SXkNIsRI | sepal_length | float | None | None | None |
2 | 2024-05-19 23:23:27.967153+00:00 | 1 | None | 2024-05-19 23:23:28.815905+00:00 | 0AUm4iYF8CpE | iris_species_name | cat[ULabel] | None | None | None |
1 | 2024-05-19 23:23:27.748427+00:00 | 1 | None | 2024-05-19 23:23:27.757144+00:00 | wCwcODTISR4C | study_name | cat[ULabel] | None | None | None |
FeatureSet
created_at | created_by_id | run_id | uid | name | n | dtype | registry | hash | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
2 | 2024-05-19 23:23:28.821520+00:00 | 1 | None | QZRqz9cFg6SmJxzBU9wO | None | 2 | None | Feature | 58lX_dBcok06ZlN12ryt |
4 | 2024-05-19 23:23:32.433112+00:00 | 1 | None | 89vRDkS1DsNXvKWh80Hv | None | 5 | None | Feature | bHnAxI79Pu6350MpTFQN |
5 | 2024-05-19 23:23:32.489564+00:00 | 1 | None | oarAFpPXtq3yKlqmjTkS | None | 1 | None | Feature | pr7SYbKy1OLWX2q1FMAe |
Run
uid | transform_id | started_at | finished_at | created_by_id | report_id | environment_id | is_consecutive | reference | reference_type | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
1 | VBT7VLzVRSYnVrx1ESJm | 1 | 2024-05-19 23:23:21.424037+00:00 | None | 1 | None | None | True | None | None | 2024-05-19 23:23:21.424162+00:00 |
2 | 0LiZ43vXlo5liMUV0RSD | 2 | 2024-05-19 23:23:29.190316+00:00 | None | 1 | None | None | True | None | None | 2024-05-19 23:23:29.190437+00:00 |
3 | nadRBLpmOBWAh8BdZs61 | 3 | 2024-05-19 23:23:32.914125+00:00 | None | 1 | None | None | True | None | None | 2024-05-19 23:23:32.914248+00:00 |
Storage
id | created_at | created_by_id | run_id | updated_at | uid | root | description | type | region | instance_uid |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2024-05-19 23:23:23.780440+00:00 | 1 | None | 2024-05-19 23:23:23.780523+00:00 | EzvEnPnH | s3://lamindb-dev-datasets | None | s3 | us-east-1 | pZ1VQkyD3haH |
1 | 2024-05-19 23:23:19.714345+00:00 | 1 | None | 2024-05-19 23:23:19.714408+00:00 | fgwMZj1JIHOy | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx |
Transform
id | version | uid | name | key | description | type | latest_report_id | source_code_id | reference | reference_type | created_at | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | dMtrt8YMSdl65zKv | Tutorial: Features & labels | tutorial2 | None | notebook | None | None | None | None | 2024-05-19 23:23:32.907670+00:00 | 2024-05-19 23:23:32.907714+00:00 | 1 |
2 | None | gzL8KJt6OelI | Petal & sepal regressor | None | None | pipeline | None | None | None | None | 2024-05-19 23:23:29.186701+00:00 | 2024-05-19 23:23:29.186726+00:00 | 1 |
1 | 0 | NJvdsWWbJlZS6K79 | Tutorial: Artifacts | tutorial | None | notebook | None | None | None | None | 2024-05-19 23:23:21.416956+00:00 | 2024-05-19 23:23:21.416999+00:00 | 1 |
ULabel
id | created_at | created_by_id | run_id | updated_at | uid | name | description | reference | reference_type |
---|---|---|---|---|---|---|---|---|---|
9 | 2024-05-19 23:23:27.906120+00:00 | 1 | None | 2024-05-19 23:23:27.906188+00:00 | ELICS2S0 | is_study | None | None | None |
8 | 2024-05-19 23:23:27.902991+00:00 | 1 | None | 2024-05-19 23:23:27.903032+00:00 | 7nSTk5j6 | study2 | None | None | None |
7 | 2024-05-19 23:23:27.902874+00:00 | 1 | None | 2024-05-19 23:23:27.902916+00:00 | MSMDe9ja | study1 | None | None | None |
6 | 2024-05-19 23:23:27.902723+00:00 | 1 | None | 2024-05-19 23:23:27.902794+00:00 | QqzvhQoW | study0 | None | None | None |
5 | 2024-05-19 23:23:27.816868+00:00 | 1 | None | 2024-05-19 23:23:27.816938+00:00 | PIHQsKzj | is_species | None | None | None |
4 | 2024-05-19 23:23:27.806211+00:00 | 1 | None | 2024-05-19 23:23:27.806254+00:00 | 8fXKs7fs | virginica | None | None | None |
3 | 2024-05-19 23:23:27.806091+00:00 | 1 | None | 2024-05-19 23:23:27.806134+00:00 | JBfBT1AI | versicolor | None | None | None |
User
id | uid | handle | name | created_at | updated_at |
---|---|---|---|---|---|
1 | 00000000 | anonymous | None | 2024-05-19 23:23:19.709467+00:00 | 2024-05-19 23:23:19.709492+00:00 |
This is it! 😅
If you’re interested, check out the guides & use cases or open an issue on GitHub to discuss.
Appendix¶
Manage metadata¶
Avoid duplicates¶
Let’s create a label "project1":
ln.ULabel(name="project1").save()
Show code cell output
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)
We already created a project1 label. Let’s see what happens if we try to create it again:
label = ln.ULabel(name="project1")
label.save()
Show code cell output
❗ loaded ULabel record with same name: 'project1' (disable via `ln.settings.upon_create_search_names`)
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)
Instead of creating a new record, LaminDB loads and returns the existing record from the database.
If there is no exact match, LaminDB warns you about potential duplicates when you create a record.
Say we spell “project 1” with a space:
ln.ULabel(name="project 1")
Show code cell output
ULabel(uid='pbW1p7fo', name='project 1', created_by_id=1)
To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.
You can switch this search off for performance gains via ln.settings.upon_create_search_names.
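To build intuition for what such a similarity search does, here is a minimal sketch using only Python's standard library. It is not LaminDB's implementation: the helper `find_similar_names` and the threshold of 0.85 are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_similar_names(new_name, existing_names, threshold=0.85):
    """Return (name, ratio) pairs for existing names similar to new_name.

    Hypothetical helper for illustration; LaminDB's own search may differ.
    """
    matches = []
    for name in existing_names:
        # compare case-insensitively; ratio is 1.0 for identical strings
        ratio = SequenceMatcher(None, new_name.lower(), name.lower()).ratio()
        if ratio >= threshold:
            matches.append((name, ratio))
    # most similar candidates first
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# names assumed to be registered already
existing = ["project1", "study0", "study1"]

# "project 1" differs from "project1" only by a space
similar = find_similar_names("project 1", existing)
for name, ratio in similar:
    print(f"similar name already registered: {name!r} (ratio={ratio:.2f})")
```

Scanning every registered name on each insert is what makes a check like this costly on large registries, which is the trade-off behind being able to switch the search off.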
Update & delete records¶
label = ln.ULabel.filter(name="project1").first()
label
Show code cell output
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)
label.name = "project1a"
label.save()
label
Show code cell output
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1a', created_by_id=1)
label.delete()
Show code cell output
(1, {'lnschema_core.ULabel': 1})
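The returned tuple can be read as the total number of deleted records followed by a per-registry breakdown. That reading assumes LaminDB passes through the return value of the underlying Django ORM `delete()` call unchanged. A small sketch that unpacks the value copied from the output above:

```python
# value copied from the cell output above
result = (1, {"lnschema_core.ULabel": 1})

# first element: total count; second element: counts per registry
total_deleted, deleted_per_registry = result

for registry, count in deleted_per_registry.items():
    print(f"{registry}: {count} record(s) deleted")

# the total should equal the sum of the per-registry counts
assert total_deleted == sum(deleted_per_registry.values())
```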
Manage storage¶
Change default storage¶
The default storage location is:
ln.settings.storage
Show code cell output
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')
You can change it by setting ln.settings.storage = "s3://my-bucket".
See all storage locations¶
ln.Storage.df()
Show code cell output
id | created_at | created_by_id | run_id | updated_at | uid | root | description | type | region | instance_uid |
---|---|---|---|---|---|---|---|---|---|---|
2 | 2024-05-19 23:23:23.780440+00:00 | 1 | None | 2024-05-19 23:23:23.780523+00:00 | EzvEnPnH | s3://lamindb-dev-datasets | None | s3 | us-east-1 | pZ1VQkyD3haH |
1 | 2024-05-19 23:23:19.714345+00:00 | 1 | None | 2024-05-19 23:23:19.714408+00:00 | fgwMZj1JIHOy | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx |
Show code cell content
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
❗ calling anonymously, will miss private instances
💡 deleting instance anonymous/lamin-tutorial