Tutorial: Features & labels

In Tutorial: Artifacts, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

  1. Findability: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

Hint

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you’re just looking to readily validate and annotate a dataset with features and labels, see this guide: Validate, standardize & annotate.

import lamindb as ln
import pandas as pd

ln.settings.verbosity = "hint"
💡 connected lamindb: anonymous/lamin-tutorial

Re-cap

Let’s briefly re-cap what we learned in Introduction. We started with simple labeling:

# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()
Artifact(updated_at=2024-05-19 23:23:24 UTC, uid='UI4dKwLK1iDoIAWMvJ4j', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False)

Provenance:
  📎 created_by: User(uid='00000000', handle='anonymous')
  📎 storage: uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
  📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
  📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Labels:
  📎 ulabels (1, ULabel): 'Study 0: initial plant gathering'

In general, it’s good practice to associate labels with features so that we can later feed them into learning algorithms with a defined dimension:

feature = ln.Feature(name="study_name", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()
✅ linked feature 'study_name' to registry 'ULabel'
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='AVSJ4LiRpONRelwyWJl5', n=1, registry='Feature', hash='mWcNd6CmMPMB0RzbKYFS', created_by_id=1)
Artifact(updated_at=2024-05-19 23:23:24 UTC, uid='UI4dKwLK1iDoIAWMvJ4j', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False)

Provenance:
  📎 created_by: User(uid='00000000', handle='anonymous')
  📎 storage: uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
  📎 transform: Transform(version='0', uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', type='notebook')
  📎 run: Run(uid='VBT7VLzVRSYnVrx1ESJm', started_at=2024-05-19 23:23:21 UTC, is_consecutive=True)
Features:
  external: FeatureSet(uid='AVSJ4LiRpONRelwyWJl5', n=1, registry='Feature')
    🔗 study_name (1, cat[ULabel]): 'Study 0: initial plant gathering'
Labels:
  📎 ulabels (1, ULabel): 'Study 0: initial plant gathering'

Register metadata

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. "species") and labels represent measured values (e.g. "iris setosa", "iris versicolor", "iris virginica").

In statistics, you’d say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
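
To make the distinction concrete, here is a minimal plain-Python sketch. It does not use LaminDB, and the names `SPECIES_CATEGORIES` and `find_invalid_labels` are hypothetical. It treats "species" as a categorical variable and checks measured values against the set of registered categories, which is how a typo in a label would surface:

```python
# Illustrative only: plain Python, not the LaminDB API.
# The feature ("species") is the variable; the labels are its allowed categories.
SPECIES_CATEGORIES = {"setosa", "versicolor", "virginica"}


def find_invalid_labels(values, categories):
    """Return values that are not registered categories, in order of first appearance."""
    invalid = []
    for value in values:
        if value not in categories and value not in invalid:
            invalid.append(value)
    return invalid


measured = ["setosa", "versicolor", "virginca", "setosa"]  # note the typo
invalid = find_invalid_labels(measured, SPECIES_CATEGORIES)
print(invalid)
```

An empty result means every measured value is a known category.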

Register labels

We study 3 species of the Iris plant: setosa, versicolor & virginica. Let’s create 3 labels with ULabel.

speciess = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(speciess)

ULabel lets you manage an in-house ontology of all kinds of generic labels.

What are alternatives to ULabel?

In a complex project, you’ll likely want dedicated typed registries for selected label types, e.g., Gene, Tissue, etc. See: Manage biological registries.

ULabel, however, will get you quite far and scale to ~1M labels.

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

is_species = ln.ULabel(name="is_species").save()
is_species.children.set(speciess)
is_species.view_parents(with_children=True)
[Figure: label hierarchy of is_species and its children]

We do the same for the study labels:

studies = [ln.ULabel(name=name) for name in ["study0", "study1", "study2"]]
ln.save(studies)
is_study = ln.ULabel(name="is_study").save()
is_study.children.set(studies)
is_study.view_parents(with_children=True)
[Figure: label hierarchy of is_study and its children]
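
The parent/child grouping above can be pictured with a small plain-Python sketch. This is only a mental model, not how LaminDB stores the links; `children_of`, `set_children` and `labels_with_parent` are hypothetical names:

```python
# Illustrative only: a dict-based stand-in for parent/child label links.
from collections import defaultdict

children_of = defaultdict(set)  # parent label name -> names of child labels


def set_children(parent, children):
    """Record that every label in `children` belongs under `parent`."""
    children_of[parent].update(children)


def labels_with_parent(parent):
    """Return the child labels of `parent`, sorted for stable output."""
    return sorted(children_of[parent])


set_children("is_species", ["setosa", "versicolor", "virginica"])
set_children("is_study", ["study0", "study1", "study2"])

species = labels_with_parent("is_species")
print(species)
```

Querying by parent is what later lets us fetch all species labels at once instead of listing them by name.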

Register features

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate datasets, feature names will label columns that store data.

Let’s create and save a Feature record to identify measurements of the iris species label (the study_name feature for the study was already created in the re-cap):

ln.Feature(name="iris_species_name", dtype="cat").save()

# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

Run an ML model

Let’s now run a mock ML model that transforms the images into 4 high-level features.

def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data


df = run_ml_model()
💡 saved: Transform(uid='gzL8KJt6OelI', name='Petal & sepal regressor', type='pipeline', updated_at=2024-05-19 23:23:29 UTC, created_by_id=1)
💡 saved: Run(uid='0LiZ43vXlo5liMUV0RSD', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_0LiZ43vXlo5liMUV0RSD.txt

The output is a dataframe:

df.head()
sepal_length sepal_width petal_length petal_width iris_organism_name
0 0.051 0.035 0.014 0.002 setosa
1 0.049 0.030 0.014 0.002 setosa
2 0.047 0.032 0.013 0.002 setosa
3 0.046 0.031 0.015 0.002 setosa
4 0.050 0.036 0.014 0.002 setosa

And this is the pipeline that produced the dataframe:

ln.core.run_context.transform.view_parents()
[Figure: lineage graph of the current transform]

Register the output data

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
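
As a rough mental model of deriving features from a dataframe, the sketch below infers a dtype per column from its values. It is plain Python with a hypothetical `infer_dtype` helper; the real inference logic lives in LaminDB and may differ:

```python
# Illustrative only: a stand-in for inferring feature dtypes from column values.
def infer_dtype(values):
    """Return 'float', 'int', or 'cat' for a list of column values."""
    if values and all(isinstance(v, float) for v in values):
        return "float"
    if values and all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "int"
    return "cat"  # fall back to categorical, e.g. for strings


# a few values taken from the dataframe shown above
columns = {
    "sepal_length": [0.051, 0.049, 0.047],
    "petal_width": [0.002, 0.002, 0.002],
    "iris_organism_name": ["setosa", "setosa", "setosa"],
}
inferred = {name: infer_dtype(values) for name, values in columns.items()}
print(inferred)
```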

How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.dtype == "float":
        feature.unit = "m"  # all measurements are in meters
        feature.save()

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/SlGWQhSRAsBBpNFCi64d.parquet')
✅ storing artifact 'SlGWQhSRAsBBpNFCi64d' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/SlGWQhSRAsBBpNFCi64d.parquet'
Artifact(updated_at=2024-05-19 23:23:32 UTC, uid='SlGWQhSRAsBBpNFCi64d', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2)

We then link the features to the artifact as a feature set in the columns slot:

artifact.features.add_feature_set(ln.FeatureSet(new_features), slot="columns")

Feature sets

Get an overview of linked features:

artifact.features
Features:
  columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
    sepal_length (float)
    sepal_width (float)
    petal_length (float)
    petal_width (float)
    iris_organism_name (cat)

You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.

Why does LaminDB model feature sets, not just features?
  1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you’ll only need to store 1M instead of 1M x 20k = 20B links.

  2. Interpretation: Model protein panels, gene panels, etc.

  3. Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
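
The performance and data-integration arguments can be checked with a few lines of plain Python. The sketch below is an illustration under stated assumptions: `feature_set_id` is a hypothetical helper, and hashing the sorted feature names is only one possible way to give equal sets equal identifiers, not necessarily how LaminDB computes the hash of a FeatureSet.

```python
# Illustrative only: link counts and a toy identity for feature sets.
import hashlib

n_samples = 1_000_000
n_features = 20_000

links_per_feature = n_samples * n_features  # one link per (sample, feature) pair
links_per_feature_set = n_samples           # one link per sample to the shared set


def feature_set_id(feature_names):
    """Hash the sorted, de-duplicated names so that equal sets get equal ids."""
    canonical = "\n".join(sorted(set(feature_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


panel_a = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
panel_b = ["petal_width", "petal_length", "sepal_width", "sepal_length"]

print(f"{links_per_feature:,} vs {links_per_feature_set:,} links")
print(feature_set_id(panel_a) == feature_set_id(panel_b))
```

Two datasets whose feature sets share an identifier measured the same features, regardless of column order.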

A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.

Let’s use it to access all linked features:

artifact.features["columns"].df()
created_at created_by_id run_id updated_at uid name dtype unit description synonyms
id
3 2024-05-19 23:23:32.402119+00:00 1 None 2024-05-19 23:23:32.402192+00:00 a8s7SXkNIsRI sepal_length float None None None
4 2024-05-19 23:23:32.402285+00:00 1 None 2024-05-19 23:23:32.402327+00:00 Xe5Tq6Dbvrb1 sepal_width float None None None
5 2024-05-19 23:23:32.402411+00:00 1 None 2024-05-19 23:23:32.402450+00:00 CByVI6TFPVMy petal_length float None None None
6 2024-05-19 23:23:32.402534+00:00 1 None 2024-05-19 23:23:32.402574+00:00 drQTElEw1rf5 petal_width float None None None
7 2024-05-19 23:23:32.402656+00:00 1 None 2024-05-19 23:23:32.402695+00:00 lBmzw6jmRUgN iris_organism_name cat None None None

There is one categorical feature. Let’s add the species labels:

species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.iris_species_name)
✅ linked new feature 'iris_species_name' together with new feature set FeatureSet(uid='oarAFpPXtq3yKlqmjTkS', n=1, registry='Feature', hash='pr7SYbKy1OLWX2q1FMAe', created_by_id=1)

Let’s now add study labels:

ulabels = ln.ULabel.lookup()  # lookup object for labels, analogous to `features`
artifact.labels.add(ulabels.study0, feature=features.study_name)
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)

In addition to the columns feature set, we now have an external feature set:

artifact.features
Features:
  external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
    🔗 study_name (2, cat[ULabel]): 'study0'
    🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
  columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
    sepal_length (float)
    sepal_width (float)
    petal_length (float)
    petal_width (float)
    iris_organism_name (cat)

This is the context for our artifact:

artifact.describe()
artifact.view_lineage()
Artifact(updated_at=2024-05-19 23:23:32 UTC, uid='SlGWQhSRAsBBpNFCi64d', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='zMBDnOFHeA8CwpaI_7KF9g', hash_type='md5', visibility=1, key_is_virtual=True)

Provenance:
  📎 created_by: User(uid='00000000', handle='anonymous')
  📎 storage: uid='fgwMZj1JIHOy', root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', type='local', instance_uid='5WuFt3cW4zRx')
  📎 transform: Transform(uid='gzL8KJt6OelI', name='Petal & sepal regressor', type='pipeline')
  📎 run: Run(uid='0LiZ43vXlo5liMUV0RSD', started_at=2024-05-19 23:23:29 UTC, is_consecutive=True)
Features:
  external: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature')
    🔗 study_name (2, cat[ULabel]): 'study0'
    🔗 iris_species_name (2, cat[ULabel]): 'setosa', 'versicolor', 'virginica'
  columns: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature')
    sepal_length (float)
    sepal_width (float)
    petal_length (float)
    petal_width (float)
    iris_organism_name (cat)
Labels:
  📎 ulabels (4, ULabel): 'setosa', 'versicolor', 'virginica', 'study0'
[Figure: data lineage of the artifact]

See the database content:

ln.view(registries=["Feature", "FeatureSet", "ULabel"])
Feature
created_at created_by_id run_id updated_at uid name dtype unit description synonyms
id
7 2024-05-19 23:23:32.402656+00:00 1 None 2024-05-19 23:23:32.402695+00:00 lBmzw6jmRUgN iris_organism_name cat None None None
6 2024-05-19 23:23:32.402534+00:00 1 None 2024-05-19 23:23:32.402574+00:00 drQTElEw1rf5 petal_width float None None None
5 2024-05-19 23:23:32.402411+00:00 1 None 2024-05-19 23:23:32.402450+00:00 CByVI6TFPVMy petal_length float None None None
4 2024-05-19 23:23:32.402285+00:00 1 None 2024-05-19 23:23:32.402327+00:00 Xe5Tq6Dbvrb1 sepal_width float None None None
3 2024-05-19 23:23:32.402119+00:00 1 None 2024-05-19 23:23:32.402192+00:00 a8s7SXkNIsRI sepal_length float None None None
2 2024-05-19 23:23:27.967153+00:00 1 None 2024-05-19 23:23:28.815905+00:00 0AUm4iYF8CpE iris_species_name cat[ULabel] None None None
1 2024-05-19 23:23:27.748427+00:00 1 None 2024-05-19 23:23:27.757144+00:00 wCwcODTISR4C study_name cat[ULabel] None None None
FeatureSet
created_at created_by_id run_id uid name n dtype registry hash
id
2 2024-05-19 23:23:28.821520+00:00 1 None QZRqz9cFg6SmJxzBU9wO None 2 None Feature 58lX_dBcok06ZlN12ryt
4 2024-05-19 23:23:32.433112+00:00 1 None 89vRDkS1DsNXvKWh80Hv None 5 None Feature bHnAxI79Pu6350MpTFQN
5 2024-05-19 23:23:32.489564+00:00 1 None oarAFpPXtq3yKlqmjTkS None 1 None Feature pr7SYbKy1OLWX2q1FMAe
ULabel
created_at created_by_id run_id updated_at uid name description reference reference_type
id
9 2024-05-19 23:23:27.906120+00:00 1 None 2024-05-19 23:23:27.906188+00:00 ELICS2S0 is_study None None None
8 2024-05-19 23:23:27.902991+00:00 1 None 2024-05-19 23:23:27.903032+00:00 7nSTk5j6 study2 None None None
7 2024-05-19 23:23:27.902874+00:00 1 None 2024-05-19 23:23:27.902916+00:00 MSMDe9ja study1 None None None
6 2024-05-19 23:23:27.902723+00:00 1 None 2024-05-19 23:23:27.902794+00:00 QqzvhQoW study0 None None None
5 2024-05-19 23:23:27.816868+00:00 1 None 2024-05-19 23:23:27.816938+00:00 PIHQsKzj is_species None None None
4 2024-05-19 23:23:27.806211+00:00 1 None 2024-05-19 23:23:27.806254+00:00 8fXKs7fs virginica None None None
3 2024-05-19 23:23:27.806091+00:00 1 None 2024-05-19 23:23:27.806134+00:00 JBfBT1AI versicolor None None None

Manage follow-up data

Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.

Let’s track a new analysis:

ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"
ln.track()
💡 notebook imports: lamindb==0.72.0 pandas==1.5.3
💡 saved: Transform(version='1', uid='dMtrt8YMSdl65zKv', name='Tutorial: Features & labels', key='tutorial2', type='notebook', updated_at=2024-05-19 23:23:32 UTC, created_by_id=1)
💡 saved: Run(uid='nadRBLpmOBWAh8BdZs61', transform_id=3, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_nadRBLpmOBWAh8BdZs61.txt

Register a joint collection

Assume we already ran all preprocessing including the ML model.

We get a DataFrame and store it as an artifact:

df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/G76wEGtq4G8Hku8MjZrz.parquet')
✅ storing artifact 'G76wEGtq4G8Hku8MjZrz' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/G76wEGtq4G8Hku8MjZrz.parquet'
Artifact(updated_at=2024-05-19 23:23:33 UTC, uid='G76wEGtq4G8Hku8MjZrz', suffix='.parquet', accessor='DataFrame', description='Iris study 2 - transformed', size=5397, hash='1OWu4rEeeob4-ZdGnLhTLw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=3, run_id=3)

Let’s load it:

artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()

We can now store the joint collection:

collection = ln.Collection(
    [artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()
✅ loaded: FeatureSet(uid='QZRqz9cFg6SmJxzBU9wO', n=2, registry='Feature', hash='58lX_dBcok06ZlN12ryt', created_by_id=1)
✅ loaded: FeatureSet(uid='89vRDkS1DsNXvKWh80Hv', n=5, registry='Feature', hash='bHnAxI79Pu6350MpTFQN', created_by_id=1)
💡 adding artifact [5] as input for run 3, adding parent transform 2

Auto-concatenate datasets

Because both datasets measured the same validated feature set, we can auto-concatenate the collection:

collection.load().tail()
sepal_length sepal_width petal_length petal_width iris_organism_name
145 0.067 0.030 0.052 0.023 virginica
146 0.063 0.025 0.050 0.019 virginica
147 0.065 0.030 0.052 0.020 virginica
148 0.062 0.034 0.054 0.023 virginica
149 0.059 0.030 0.051 0.018 virginica
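
Conceptually, concatenation is only safe when all tables share one feature set. The plain-Python sketch below shows that check. It is not LaminDB's implementation; `concat_tables` and the toy rows are hypothetical:

```python
# Illustrative only: concatenate tables after checking that their columns agree.
def concat_tables(tables):
    """Concatenate tables given as (columns, rows) pairs.

    Raises ValueError if any table's column set differs from the first one.
    """
    if not tables:
        raise ValueError("need at least one table")
    reference_columns = list(tables[0][0])
    combined_rows = []
    for columns, rows in tables:
        if set(columns) != set(reference_columns):
            mismatch = sorted(set(columns) ^ set(reference_columns))
            raise ValueError(f"feature sets differ in: {mismatch}")
        # map each reference column to its position in this table
        positions = [columns.index(name) for name in reference_columns]
        combined_rows.extend(tuple(row[i] for i in positions) for row in rows)
    return reference_columns, combined_rows


# toy rows for illustration only
study1 = (["sepal_length", "iris_organism_name"], [(0.051, "setosa")])
study2 = (["iris_organism_name", "sepal_length"], [("virginica", 0.059)])

columns, rows = concat_tables([study1, study2])
print(columns)
print(rows)
```

Column order may differ between tables, but a missing or extra column raises an error instead of silently producing a ragged result.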

We can also access & query the underlying two artifact objects:

collection.artifacts.df()
version created_at created_by_id updated_at uid storage_id key suffix accessor description size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual
id
5 None 2024-05-19 23:23:32.418732+00:00 1 2024-05-19 23:23:32.418781+00:00 SlGWQhSRAsBBpNFCi64d 1 None .parquet DataFrame Iris study 1 - after measuring sepal & petal m... 5347 zMBDnOFHeA8CwpaI_7KF9g md5 None None 2 2 1 True
6 None 2024-05-19 23:23:33.898346+00:00 1 2024-05-19 23:23:33.898402+00:00 G76wEGtq4G8Hku8MjZrz 1 None .parquet DataFrame Iris study 2 - transformed 5397 1OWu4rEeeob4-ZdGnLhTLw md5 None None 3 3 1 True

Or look at their data lineage:

collection.view_lineage()
[Figure: data lineage of the collection]

Or look at the database:

ln.view()
Artifact
version created_at created_by_id updated_at uid storage_id key suffix accessor description size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual
id
6 None 2024-05-19 23:23:33.898346+00:00 1 2024-05-19 23:23:33.898402+00:00 G76wEGtq4G8Hku8MjZrz 1 None .parquet DataFrame Iris study 2 - transformed 5397 1OWu4rEeeob4-ZdGnLhTLw md5 NaN None 3 3 1 True
5 None 2024-05-19 23:23:32.418732+00:00 1 2024-05-19 23:23:32.418781+00:00 SlGWQhSRAsBBpNFCi64d 1 None .parquet DataFrame Iris study 1 - after measuring sepal & petal m... 5347 zMBDnOFHeA8CwpaI_7KF9g md5 NaN None 2 2 1 True
4 None 2024-05-19 23:23:24.836147+00:00 1 2024-05-19 23:23:24.836198+00:00 62uS4kSWwh553bVPhdkJ 2 iris_studies/study2_raw_images None None 665518 PX8Vt9T28y-uCEJO1tKm7A md5-d 51.0 None 1 1 1 False
3 None 2024-05-19 23:23:24.646038+00:00 1 2024-05-19 23:23:24.646092+00:00 nE7BMpA4cluio47JdU66 2 iris_studies/study1_raw_images None None 640617 j61W__GgImA18CKrIf7FVg md5-d 49.0 None 1 1 1 False
2 None 2024-05-19 23:23:24.280523+00:00 1 2024-05-19 23:23:24.280576+00:00 UI4dKwLK1iDoIAWMvJ4j 2 iris_studies/study0_raw_images None None 656692 wVYKPpEsmmrqSpAZIRXCFg md5-d 51.0 None 1 1 1 False
1 None 2024-05-19 23:23:23.817202+00:00 1 2024-05-19 23:23:23.817253+00:00 5ruHcLPvlViem0z49X9f 2 iris_studies/study0_raw_images/meta.csv .csv None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 NaN None 1 1 1 False
Collection
version created_at created_by_id updated_at uid name description hash reference reference_type transform_id run_id artifact_id visibility
id
4 None 2024-05-19 23:23:33.936147+00:00 1 2024-05-19 23:23:33.936192+00:00 zP8h0uQuKg8VAe9kxWvX Iris flower study 1 & 2 - transformed None I5HABiYOqx1fjRZ2be5E None None 3 3 None 1
3 3 2024-05-19 23:23:24.843549+00:00 1 2024-05-19 23:23:24.843593+00:00 w7p8GWvoDRJViSLvgHz0 Iris collection Now includes study2_raw_images T-U8z2Zi5rFYdAD9pzmS None None 1 1 None 1
2 2 2024-05-19 23:23:24.653688+00:00 1 2024-05-19 23:23:24.653733+00:00 w7p8GWvoDRJViSLv1rKm Iris collection Now includes study1_raw_images 5cCK6ZLOPB0cV3tyeZup None None 1 1 None 1
1 1 2024-05-19 23:23:24.467012+00:00 1 2024-05-19 23:23:24.467066+00:00 w7p8GWvoDRJViSLvunTx Iris collection Iris study 0 WwFLpNFmK8GMC2dSGj1W None None 1 1 None 1
Feature
created_at created_by_id run_id updated_at uid name dtype unit description synonyms
id
7 2024-05-19 23:23:32.402656+00:00 1 None 2024-05-19 23:23:32.402695+00:00 lBmzw6jmRUgN iris_organism_name cat None None None
6 2024-05-19 23:23:32.402534+00:00 1 None 2024-05-19 23:23:32.402574+00:00 drQTElEw1rf5 petal_width float None None None
5 2024-05-19 23:23:32.402411+00:00 1 None 2024-05-19 23:23:32.402450+00:00 CByVI6TFPVMy petal_length float None None None
4 2024-05-19 23:23:32.402285+00:00 1 None 2024-05-19 23:23:32.402327+00:00 Xe5Tq6Dbvrb1 sepal_width float None None None
3 2024-05-19 23:23:32.402119+00:00 1 None 2024-05-19 23:23:32.402192+00:00 a8s7SXkNIsRI sepal_length float None None None
2 2024-05-19 23:23:27.967153+00:00 1 None 2024-05-19 23:23:28.815905+00:00 0AUm4iYF8CpE iris_species_name cat[ULabel] None None None
1 2024-05-19 23:23:27.748427+00:00 1 None 2024-05-19 23:23:27.757144+00:00 wCwcODTISR4C study_name cat[ULabel] None None None
FeatureSet
created_at created_by_id run_id uid name n dtype registry hash
id
2 2024-05-19 23:23:28.821520+00:00 1 None QZRqz9cFg6SmJxzBU9wO None 2 None Feature 58lX_dBcok06ZlN12ryt
4 2024-05-19 23:23:32.433112+00:00 1 None 89vRDkS1DsNXvKWh80Hv None 5 None Feature bHnAxI79Pu6350MpTFQN
5 2024-05-19 23:23:32.489564+00:00 1 None oarAFpPXtq3yKlqmjTkS None 1 None Feature pr7SYbKy1OLWX2q1FMAe
Run
uid transform_id started_at finished_at created_by_id report_id environment_id is_consecutive reference reference_type created_at
id
1 VBT7VLzVRSYnVrx1ESJm 1 2024-05-19 23:23:21.424037+00:00 None 1 None None True None None 2024-05-19 23:23:21.424162+00:00
2 0LiZ43vXlo5liMUV0RSD 2 2024-05-19 23:23:29.190316+00:00 None 1 None None True None None 2024-05-19 23:23:29.190437+00:00
3 nadRBLpmOBWAh8BdZs61 3 2024-05-19 23:23:32.914125+00:00 None 1 None None True None None 2024-05-19 23:23:32.914248+00:00
Storage
created_at created_by_id run_id updated_at uid root description type region instance_uid
id
2 2024-05-19 23:23:23.780440+00:00 1 None 2024-05-19 23:23:23.780523+00:00 EzvEnPnH s3://lamindb-dev-datasets None s3 us-east-1 pZ1VQkyD3haH
1 2024-05-19 23:23:19.714345+00:00 1 None 2024-05-19 23:23:19.714408+00:00 fgwMZj1JIHOy /home/runner/work/lamindb/lamindb/docs/lamin-t... None local None 5WuFt3cW4zRx
Transform
version uid name key description type latest_report_id source_code_id reference reference_type created_at updated_at created_by_id
id
3 1 dMtrt8YMSdl65zKv Tutorial: Features & labels tutorial2 None notebook None None None None 2024-05-19 23:23:32.907670+00:00 2024-05-19 23:23:32.907714+00:00 1
2 None gzL8KJt6OelI Petal & sepal regressor None None pipeline None None None None 2024-05-19 23:23:29.186701+00:00 2024-05-19 23:23:29.186726+00:00 1
1 0 NJvdsWWbJlZS6K79 Tutorial: Artifacts tutorial None notebook None None None None 2024-05-19 23:23:21.416956+00:00 2024-05-19 23:23:21.416999+00:00 1
ULabel
created_at created_by_id run_id updated_at uid name description reference reference_type
id
9 2024-05-19 23:23:27.906120+00:00 1 None 2024-05-19 23:23:27.906188+00:00 ELICS2S0 is_study None None None
8 2024-05-19 23:23:27.902991+00:00 1 None 2024-05-19 23:23:27.903032+00:00 7nSTk5j6 study2 None None None
7 2024-05-19 23:23:27.902874+00:00 1 None 2024-05-19 23:23:27.902916+00:00 MSMDe9ja study1 None None None
6 2024-05-19 23:23:27.902723+00:00 1 None 2024-05-19 23:23:27.902794+00:00 QqzvhQoW study0 None None None
5 2024-05-19 23:23:27.816868+00:00 1 None 2024-05-19 23:23:27.816938+00:00 PIHQsKzj is_species None None None
4 2024-05-19 23:23:27.806211+00:00 1 None 2024-05-19 23:23:27.806254+00:00 8fXKs7fs virginica None None None
3 2024-05-19 23:23:27.806091+00:00 1 None 2024-05-19 23:23:27.806134+00:00 JBfBT1AI versicolor None None None
User
uid handle name created_at updated_at
id
1 00000000 anonymous None 2024-05-19 23:23:19.709467+00:00 2024-05-19 23:23:19.709492+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix

Manage metadata

Avoid duplicates

Let’s create a label "project1":

ln.ULabel(name="project1").save()
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)

We already created a project1 label above. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
❗ loaded ULabel record with same name: 'project1' (disable via `ln.settings.upon_create_search_names`)
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")
ULabel(uid='pbW1p7fo', name='project 1', created_by_id=1)

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch this search off for performance gains via ln.settings.upon_create_search_names.
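
The idea of a duplicate check can be sketched with the standard library. The example below uses `difflib.get_close_matches` as a stand-in similarity search. This is an assumption made for illustration: the search LaminDB actually runs may work differently, and `check_new_name` is a hypothetical helper.

```python
# Illustrative only: classify a new name against existing names.
import difflib

existing_names = ["project1", "setosa", "versicolor", "virginica"]


def check_new_name(name, existing, cutoff=0.8):
    """Classify a new name as an exact match, a near match, or new."""
    if name in existing:
        return "exact", [name]  # reuse the existing record
    similar = difflib.get_close_matches(name, existing, n=3, cutoff=cutoff)
    if similar:
        return "similar", similar  # warn about potential duplicates
    return "new", []


print(check_new_name("project1", existing_names))
print(check_new_name("project 1", existing_names))
print(check_new_name("study0", existing_names))
```

The search costs time on every record creation, which is why a setting to disable it is useful for bulk inserts.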

Update & delete records

label = ln.ULabel.filter(name="project1").first()
label
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1', created_by_id=1)
label.name = "project1a"
label.save()
label
ULabel(updated_at=2024-05-19 23:23:34 UTC, uid='zH0xGp3I', name='project1a', created_by_id=1)
label.delete()
(1, {'lnschema_core.ULabel': 1})

Manage storage

Change default storage

The default storage location is:

ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations

ln.Storage.df()
created_at created_by_id run_id updated_at uid root description type region instance_uid
id
2 2024-05-19 23:23:23.780440+00:00 1 None 2024-05-19 23:23:23.780523+00:00 EzvEnPnH s3://lamindb-dev-datasets None s3 us-east-1 pZ1VQkyD3haH
1 2024-05-19 23:23:19.714345+00:00 1 None 2024-05-19 23:23:19.714408+00:00 fgwMZj1JIHOy /home/runner/work/lamindb/lamindb/docs/lamin-t... None local None 5WuFt3cW4zRx
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial
❗ calling anonymously, will miss private instances
💡 deleting instance anonymous/lamin-tutorial