Validate & standardize for developers¶
LaminDB makes it easy to validate categorical variables based on registries (CanValidate).
How do I validate based on a public ontology?
CanValidate methods validate against the registries in your LaminDB instance.
In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Registry.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
What to do for non-validated values?
Be aware that in a freshly initialized instance nothing is validated, because no records have been registered yet.
Run inspect() to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.
Setup¶
!lamin init --storage ./test-validate --schema bionty
💡 connected lamindb: testuser1/test-validate
import lamindb as ln
import bionty as bt
import pandas as pd
💡 connected lamindb: testuser1/test-validate
ln.settings.verbosity = "info"
Pre-populate registries:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.Artifact.from_df(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
bt.Disease.from_public(ontology_id="MONDO:0004975").save()
❗ no run & transform get linked, consider calling ln.track()
✅ storing artifact 'qaeVQf9QvmERgDQKLEJw' at '/home/runner/work/lamindb/lamindb/docs/test-validate/.lamindb/qaeVQf9QvmERgDQKLEJw.parquet'
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0004975'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:08 UTC, uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer dementia|Alzheimer disease|Alzheimer's disease|presenile and senile dementia|Alzheimers disease|Alzheimer's dementia|Alzheimers dementia|AD', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, public_source_id=29)
✅ created 2 Disease records from Bionty matching ontology_id: 'MONDO:0001627', 'MONDO:0005574'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:08 UTC, uid='6AMrlbw8', name='dementia', ontology_id='MONDO:0001627', synonyms='dementia (disease)|dementia', description='Loss Of Intellectual Abilities Interfering With An Individual'S Social And Occupational Functions. Causes Include Alzheimer'S Disease, Brain Injuries, Brain Tumors, And Vascular Disorders.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002039'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:09 UTC, uid='6yfRDD23', name='cognitive disorder', ontology_id='MONDO:0002039', synonyms='cognitive disease|cognitive disorder', description='A Disease Affects Cognitive Processes.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002025'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:09 UTC, uid='6HNgrMK9', name='psychiatric disorder', ontology_id='MONDO:0002025', synonyms='Psychiatric disorder|Psychiatric disease', description='A Disorder Characterized By Behavioral And/Or Psychological Abnormalities, Often Accompanied By Physical Symptoms. The Symptoms May Cause Clinically Significant Distress Or Impairment In Social And Occupational Areas Of Functioning. Representative Examples Include Anxiety Disorders, Cognitive Disorders, Mood Disorders And Schizophrenia.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0700096'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:10 UTC, uid='3Pcu72hb', name='human disease', ontology_id='MONDO:0700096', synonyms='human disease or disorder', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0000001'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:08 UTC, uid='6PeduEmE', name='tauopathy', ontology_id='MONDO:0005574', description='Neurodegenerative Disorders Involving Deposition Of Abnormal Tau Protein Isoforms (Tau Proteins) In Neurons And Glial Cells In The Brain. Pathological Aggregations Of Tau Proteins Are Associated With Mutation Of The Tau Gene On Chromosome 17 In Patients With Alzheimer Disease; Dementia; Parkinsonian Disorders; Progressive Supranuclear Palsy (Supranuclear Palsy, Progressive); And Corticobasal Degeneration.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005559'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:11 UTC, uid='6sgNFaDE', name='neurodegenerative disease', ontology_id='MONDO:0005559', synonyms='central nervous system neurodegenerative disorder|brain degeneration|neurodegenerative disease|central nervous system degenerative disorder|degenerative disorder of central nervous system', description='A Disorder Of The Central Nervous System Characterized By Gradual And Progressive Loss Of Neural Tissue And Neurologic Function.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002602'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:12 UTC, uid='5dTDVEfc', name='central nervous system disorder', ontology_id='MONDO:0002602', synonyms='central nervous system disease|central nervous system disorder|disease of the central nervous system|CNS disorder|central nervous system disease or disorder|disease of central nervous system|central nervous disease|disease or disorder of central nervous system|disorder of central nervous system', description='A Disease Involving The Central Nervous System.', created_by_id=1, public_source_id=29)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005071'
💡 also saving parents of Disease(updated_at=2024-05-20 08:59:13 UTC, uid='3NKHns2m', name='nervous system disorder', ontology_id='MONDO:0005071', synonyms='neurological disease|disorder of nervous system|nervous system disease|neurologic disease|nervous system disorder|nervous system disease or disorder|neurologic disorder|disease or disorder of nervous system|disease of nervous system|neurological disorder', description='A Non-Neoplastic Or Neoplastic Disorder That Affects The Brain, Spinal Cord, Or Peripheral Nerves.', created_by_id=1, public_source_id=29)
Standard validation¶
Name duplication¶
Creating a record with the same name field automatically returns the existing record:
ln.ULabel(name="Project A")
❗ loaded ULabel record with same name: 'Project A' (disable via `ln.settings.upon_create_search_names`)
ULabel(updated_at=2024-05-20 08:59:07 UTC, uid='Jiox3U5d', name='Project A', created_by_id=1)
Bulk creating records using from_values() only returns validated records:
Note: Terms validated against a public reference are also created by from_values(); see Manage biological registries for details.
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)
✅ loaded 2 ULabel records matching name: 'Project A', 'Project B'
❗ did not create ULabel records for 2 non-validated names: 'Project D', 'Project E'
[ULabel(updated_at=2024-05-20 08:59:07 UTC, uid='Jiox3U5d', name='Project A', created_by_id=1),
ULabel(updated_at=2024-05-20 08:59:07 UTC, uid='SVAPxUGl', name='Project B', created_by_id=1)]
(Versioned records also account for version in addition to name. Also see: idempotency.)
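The name-based behavior described above can be pictured with a toy in-memory registry. This is a minimal sketch and not LaminDB's implementation: `ToyRegistry` and its `create` and `from_values` methods are hypothetical stand-ins that only mimic lookup by name, and unlike the real records they register a name immediately instead of waiting for `.save()`.

```python
class ToyRegistry:
    """Hypothetical in-memory stand-in for a registry keyed by name."""

    def __init__(self):
        self._records = {}  # maps name -> record (a plain dict here)

    def create(self, name):
        # Same name: return the existing record instead of a duplicate.
        if name not in self._records:
            self._records[name] = {"name": name}
        return self._records[name]

    def from_values(self, names):
        # Bulk lookup: only names that are already registered are returned.
        return [self._records[n] for n in names if n in self._records]


registry = ToyRegistry()
first = registry.create("Project A")
registry.create("Project B")
again = registry.create("Project A")  # the same object as `first`
found = registry.from_values(["Project A", "Project B", "Project D", "Project E"])
```

In this sketch, `found` holds only the two known projects, mirroring the log output above where 'Project D' and 'Project E' are reported as non-validated.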
Data duplication¶
Creating an artifact or collection with the same content automatically returns the existing record:
ln.Artifact.from_df(df, description="same data")
❗ no run & transform get linked, consider calling ln.track()
❗ returning existing artifact with same hash: Artifact(updated_at=2024-05-20 08:59:07 UTC, uid='qaeVQf9QvmERgDQKLEJw', suffix='.parquet', accessor='DataFrame', description='test data', size=2722, hash='-xXHpj8x-liAvd51DtHVnA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1)
❗ updated description from test data to same data
Artifact(updated_at=2024-05-20 08:59:07 UTC, uid='qaeVQf9QvmERgDQKLEJw', suffix='.parquet', accessor='DataFrame', description='same data', size=2722, hash='-xXHpj8x-liAvd51DtHVnA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1)
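Content-based deduplication of this kind can be illustrated with a small hash-keyed store. The sketch below is an assumption-laden illustration, not LaminDB's code: `ToyArtifactStore` and `add` are hypothetical names, and the only details taken from the output above are that an MD5 hash identifies the content and that the description of the existing record is updated.

```python
import hashlib


class ToyArtifactStore:
    """Hypothetical store that deduplicates content by its MD5 digest."""

    def __init__(self):
        self._by_hash = {}  # maps hex digest -> record (a plain dict here)

    def add(self, content: bytes, description: str):
        digest = hashlib.md5(content).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Same content: reuse the record, but adopt the new description.
            existing["description"] = description
            return existing
        record = {"hash": digest, "description": description}
        self._by_hash[digest] = record
        return record


store = ToyArtifactStore()
a = store.add(b"A,B\n1,2\n", "test data")
b = store.add(b"A,B\n1,2\n", "same data")  # identical bytes, new description
```

Here `a` and `b` refer to the same record, and its description ends up as the most recent one.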
Schema-based validation¶
Type checks, constraint checks, and Django validators can be configured in the schema.
Registry-based validation¶
validate() validates passed values against reference values in a registry.
It returns a boolean vector indicating, for each value, whether it has an exact match in the reference values.
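As a rough mental model, exact-match validation amounts to a membership test per value. The following is a minimal sketch in plain Python; the standalone `validate` function and the `registered_names` list are hypothetical and stand in for a registry's reference values.

```python
def validate(values, reference):
    """Return one boolean per value: True only for an exact match in `reference`."""
    reference_set = set(reference)  # set membership keeps each lookup cheap
    return [value in reference_set for value in values]


# Hypothetical reference values, standing in for the names in a registry.
registered_names = ["Alzheimer disease", "dementia", "tauopathy"]
flags = validate(
    ["Alzheimer disease", "Alzheimer's disease", "AD"], registered_names
)
```

Because the match is exact, a synonym or abbreviation of a registered name does not count as validated in this sketch.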
Using dedicated registries¶
For instance, bionty types basic biological entities: every entity has its own registry, a Python class.
By default, the first string field is used for validation. For Disease, it’s name:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = bt.Disease.validate(diseases)
validated
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
array([ True, False, False])
Validate against a non-default field:
bt.Disease.validate(
["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], bt.Disease.ontology_id
)
✅ 1 term (33.30%) is validated for ontology_id
❗ 2 terms (66.70%) are not validated for ontology_id: MONDO:0004976, MONDO:0004977
array([ True, False, False])
Using the ULabel registry¶
Any entity that doesn’t have its dedicated registry (“is not typed”) can be validated & registered using ULabel:
ln.ULabel.validate(["Project A", "Project B", "Project C"])
✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: Project C
array([ True, True, False])
Inspect & standardize¶
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer result, an InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
result = bt.Disease.inspect(diseases)
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
detected 2 terms with synonyms: Alzheimer's disease, AD
→ standardize terms via .standardize()
In this case, it suggests calling standardize() to standardize synonyms:
bt.Disease.standardize(result.non_validated)
💡 standardized 2/2 terms
['Alzheimer disease', 'Alzheimer disease']
For more, see Manage biological registries.
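Synonym-based standardization can be sketched as a lookup from each synonym to its record's name. The example below is a simplified illustration rather than the library's algorithm: the `records` list is hypothetical, its synonym strings are abbreviated from the pipe-separated `synonyms` field visible in the log output above, and `build_synonym_map` and `standardize` are names introduced here.

```python
# Hypothetical records; synonyms are pipe-separated as in the log output above.
records = [
    {"name": "Alzheimer disease", "synonyms": "Alzheimer dementia|Alzheimer's disease|AD"},
    {"name": "dementia", "synonyms": "dementia (disease)"},
    {"name": "tauopathy", "synonyms": None},  # some records have no synonyms
]


def build_synonym_map(records):
    """Map every synonym to the name of the record it belongs to."""
    mapping = {}
    for record in records:
        for synonym in (record.get("synonyms") or "").split("|"):
            if synonym:
                mapping[synonym] = record["name"]
    return mapping


def standardize(values, records):
    """Replace known synonyms by names; leave names and unknown values as-is."""
    names = {record["name"] for record in records}
    synonym_map = build_synonym_map(records)
    return [v if v in names else synonym_map.get(v, v) for v in values]


standardized = standardize(["Alzheimer's disease", "AD", "unknown term"], records)
```

Values that are neither a name nor a synonym pass through unchanged, so they can still be reported as non-validated afterwards.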
Extend registries¶
Sometimes, we simply want to register new records to extend the content of registries:
result = ln.ULabel.inspect(projects)
✅ 2 terms (50.00%) are validated for name
❗ 2 terms (50.00%) are not validated for name: Project D, Project E
couldn't validate 2 terms: 'Project D', 'Project E'
→ if you are sure, create new records via ln.ULabel() and save to your registry
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
[ULabel(updated_at=2024-05-20 08:59:13 UTC, uid='UarQtRMG', name='Project D', created_by_id=1),
ULabel(updated_at=2024-05-20 08:59:13 UTC, uid='hc8fYsae', name='Project E', created_by_id=1)]
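The inspect-then-register pattern above can be wrapped in a small helper. This is a sketch under stated assumptions: `register_non_validated` is a hypothetical function that relies only on the calls shown in this section (`inspect()`, `.non_validated`, the `name=` constructor, and a save function), while `ToyLabel` and `toy_save` are in-memory stand-ins so the example runs without a LaminDB instance.

```python
from types import SimpleNamespace


class ToyLabel:
    """Hypothetical stand-in for a registry class such as ln.ULabel."""

    _registered = {"Project A", "Project B"}

    def __init__(self, name):
        self.name = name

    @classmethod
    def inspect(cls, values):
        # Mimics only the part of the result used below: `.non_validated`.
        missing = [v for v in values if v not in cls._registered]
        return SimpleNamespace(non_validated=missing)


def toy_save(records):
    """Hypothetical stand-in for ln.save(): registers each record's name."""
    for record in records:
        ToyLabel._registered.add(record.name)


def register_non_validated(registry, values, save):
    """Create and save one record per value that inspect() cannot validate."""
    result = registry.inspect(values)
    new_records = [registry(name=name) for name in result.non_validated]
    if new_records:
        save(new_records)
    return new_records


projects = ["Project A", "Project B", "Project D", "Project E"]
created = register_non_validated(ToyLabel, projects, toy_save)
remaining = ToyLabel.inspect(projects).non_validated
```

Against a real instance, the analogous call would presumably be `register_non_validated(ln.ULabel, projects, ln.save)`. Since this registers every non-validated value, including typos, it is only appropriate once you are sure the values are correct.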
Validate features¶
When calling Artifact.from_... and Collection.from_..., features are automatically validated.
Validated features are grouped in “feature sets” indexed by “slots”.
For a basic example, see Tutorial: Features & labels.
For an overview of data formats used to model different data types, see Data types.
Bulk validation¶
# clean up test instance
!lamin delete --force test-validate
!rm -r test-validate