What happens if I save the same artifacts & records twice?¶
LaminDB’s operations are idempotent in the sense defined in this document.
This allows you to re-run a notebook or script without erroring or duplicating data. Similar behavior holds for human data entry.
Summary¶
Metadata records¶
If you try to create any metadata record (Registry) and upon_create_search_names is True (the default):
LaminDB will warn you if a record with a similar name exists and display a table of similar existing records. You can then decide whether you’d like to save a record to the database or rather query an existing one from the table.
If a name already has an exact match in a registry, LaminDB will return it instead of creating a new record. For versioned entities, the version must also be passed.
If you set upon_create_search_names to False, you’ll directly populate the DB.
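The rules above can be sketched with a toy in-memory registry. This is only an illustration under stated assumptions, not LaminDB's implementation: ToyRegistry, its search_names flag and the use of difflib for similarity are all hypothetical stand-ins.

```python
import difflib


class ToyRegistry:
    """In-memory stand-in for a metadata registry (hypothetical, not LaminDB code)."""

    def __init__(self, search_names=True):
        self.search_names = search_names  # plays the role of upon_create_search_names
        self.records = []  # saved records as dicts: {"id": int, "name": str}

    def create(self, name):
        """Return (record, similar_names); an exact name match returns the saved record."""
        similar = []
        if self.search_names:
            for record in self.records:
                if record["name"] == name:
                    return record, []
            names = [r["name"] for r in self.records]
            similar = difflib.get_close_matches(name, names, n=5, cutoff=0.6)
        return {"id": None, "name": name}, similar

    def save(self, record):
        """Persist a record once; saving an already saved record is a no-op."""
        if record["id"] is None:
            record["id"] = len(self.records) + 1
            self.records.append(record)
        return record


registry = ToyRegistry()
first, _ = registry.create("My project 1")
registry.save(first)
variant, similar = registry.create("My project 1a")  # similar names are reported
registry.save(variant)
again, _ = registry.create("My project 1")  # exact match: existing record comes back
registry.save(again)
registry.save(again)  # repeated saves do not add rows

# With the search switched off, the same name is stored twice.
unchecked = ToyRegistry(search_names=False)
for _ in range(2):
    record, _ = unchecked.create("My project 1")
    unchecked.save(record)
```

In this sketch the checked registry ends up with two records and the unchecked one with two copies of the same name, which is the duplication the default setting is meant to prevent.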
Data: artifacts & collections¶
If you try to create an Artifact object from the same content, the outcome depends on upon_artifact_create_if_hash_exists:
you’ll get an existing object, if upon_artifact_create_if_hash_exists = "warn_return_existing" (the default)
you’ll get an error, if upon_artifact_create_if_hash_exists = "error"
you’ll get a warning and a new object, if upon_artifact_create_if_hash_exists = "warn_create_new"
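The three outcomes can be sketched as a small dispatch on the setting's value. The function create_artifact, its if_hash_exists parameter and the dict-based store are hypothetical and only mimic the behavior described above; the toy also merges construction and saving into one step for brevity.

```python
import hashlib
import warnings

POLICIES = ("warn_return_existing", "error", "warn_create_new")


def create_artifact(content, store, if_hash_exists="warn_return_existing"):
    """Toy constructor; `store` maps a content hash to the artifacts sharing it."""
    if if_hash_exists not in POLICIES:
        raise ValueError(f"unknown policy: {if_hash_exists}")
    digest = hashlib.md5(content).hexdigest()
    existing = store.get(digest, [])
    if existing:
        if if_hash_exists == "error":
            raise FileExistsError(f"artifact with hash {digest} already exists")
        if if_hash_exists == "warn_return_existing":
            warnings.warn("returning existing artifact with same hash")
            return existing[0]
        warnings.warn("creating new artifact despite existing hash")
    # Either the content is new or the policy allows a second record.
    artifact = {"id": sum(len(v) for v in store.values()) + 1, "hash": digest}
    store.setdefault(digest, []).append(artifact)
    return artifact


store = {}
a1 = create_artifact(b"same bytes", store)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    a2 = create_artifact(b"same bytes", store)  # existing object
    a3 = create_artifact(b"same bytes", store, if_hash_exists="warn_create_new")
```

Here a2 is the very same object as a1, while a3 is a second record that shares the hash; passing if_hash_exists="error" for the same bytes raises FileExistsError.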
Examples¶
!lamin init --storage ./test-idempotency
💡 connected lamindb: testuser1/test-idempotency
import lamindb as ln
import pytest
ln.settings.verbosity = "hint"
ln.settings.transform.stem_uid = "ANW20Fr4eZgM"
ln.settings.transform.version = "1"
ln.track()
💡 connected lamindb: testuser1/test-idempotency
💡 notebook imports: lamindb==0.72.0 pytest==8.2.1
💡 saved: Transform(version='1', uid='ANW20Fr4eZgM5zKv', name='What happens if I save the same artifacts & records twice?', key='idempotency', type='notebook', updated_at=2024-05-20 08:58:18 UTC, created_by_id=1)
💡 saved: Run(uid='R40i0MvUdj5wL3llcdCL', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_R40i0MvUdj5wL3llcdCL.txt
Metadata records¶
assert ln.settings.upon_create_search_names
Let us add a first record to the ULabel registry:
label = ln.ULabel(name="My project 1")
label.save()
ULabel(updated_at=2024-05-20 08:58:20 UTC, uid='e3YaAmJN', name='My project 1', created_by_id=1)
If we create a new record, we’ll automatically get search results that hint at whether we are about to duplicate an existing entry:
label = ln.ULabel(name="My project 1a")
label.save()
ULabel(updated_at=2024-05-20 08:58:20 UTC, uid='afo2kuMG', name='My project 1a', created_by_id=1)
In case we match an existing name directly, we’ll get the existing object:
label = ln.ULabel(name="My project 1")
❗ loaded ULabel record with same name: 'My project 1' (disable via `ln.settings.upon_create_search_names`)
If we save it again, it will not create a new entry in the registry:
label.save()
ULabel(updated_at=2024-05-20 08:58:20 UTC, uid='e3YaAmJN', name='My project 1', created_by_id=1)
Now, if we create a third record, the search will surface two similar existing records as alternatives:
label = ln.ULabel(name="My project 1b")
If we prefer not to perform a search, e.g., for performance reasons or because the logging is too noisy, we can switch it off:
ln.settings.upon_create_search_names = False
label = ln.ULabel(name="My project 1c")
In this walkthrough, switch it back on:
ln.settings.upon_create_search_names = True
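Switching a setting off and back on by hand is easy to forget, so one option is to wrap the change in a context manager that restores the previous value. The helper temporary_setting below is hypothetical and not part of LaminDB; the sketch uses a SimpleNamespace as a stand-in for ln.settings, and we assume it would work with any object that exposes the setting as a readable and writable attribute.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set `settings.<name>` to `value` inside the block, then restore the old value."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Runs even if the block raises, so the setting never stays switched off.
        setattr(settings, name, previous)


# Stand-in for ln.settings so the sketch runs without a LaminDB instance.
settings = SimpleNamespace(upon_create_search_names=True)

with temporary_setting(settings, "upon_create_search_names", False):
    inside = settings.upon_create_search_names  # value while the block runs
after = settings.upon_create_search_names  # value once the block has exited
```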
Data: artifacts and collections¶
Warn upon trying to re-ingest an existing artifact¶
assert ln.settings.upon_artifact_create_if_hash_exists == "warn_return_existing"
filepath = ln.core.datasets.file_fcs()
Create an Artifact:
artifact = ln.Artifact(filepath, description="My fcs artifact")
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/geRYWso4sA2pi9jHSsch.fcs')
✅ storing artifact 'geRYWso4sA2pi9jHSsch' at '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/geRYWso4sA2pi9jHSsch.fcs'
Artifact(updated_at=2024-05-20 08:58:21 UTC, uid='geRYWso4sA2pi9jHSsch', suffix='.fcs', description='My fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1)
assert artifact.hash == "KCEXRahJ-Ui9Y6nksQ8z1A"
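The hash above is 22 characters long and uses the URL-safe base64 alphabet, which is what you get when a 16-byte MD5 digest is base64-encoded and the padding is stripped. The sketch below shows that encoding; that LaminDB computes its hashes exactly this way is an assumption inferred from the format, and md5_b64url is a hypothetical helper.

```python
import base64
import hashlib


def md5_b64url(data: bytes) -> str:
    """Return the MD5 digest of `data` as unpadded URL-safe base64 (22 characters)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


same_a = md5_b64url(b"identical content")
same_b = md5_b64url(b"identical content")
other = md5_b64url(b"different content")
```

Identical bytes always yield the same string, which is what makes content-based deduplication possible regardless of the file's name or description.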
Create an Artifact from the same path:
artifact2 = ln.Artifact(filepath, description="My fcs artifact")
❗ returning existing artifact with same hash: Artifact(updated_at=2024-05-20 08:58:21 UTC, uid='geRYWso4sA2pi9jHSsch', suffix='.fcs', description='My fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1)
It gives us the existing object:
assert artifact.id == artifact2.id
assert artifact.run == artifact2.run
If you save it again, nothing will happen (the operation is idempotent):
artifact2.save()
Artifact(updated_at=2024-05-20 08:58:21 UTC, uid='geRYWso4sA2pi9jHSsch', suffix='.fcs', description='My fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1)
The cell below shows how this interplays with data lineage.
ln.track(new_run=True)
artifact3 = ln.Artifact(filepath, description="My fcs artifact")
assert artifact3.id == artifact2.id
assert artifact3.run != artifact2.run
assert artifact3.previous_runs.first() == artifact2.run
💡 notebook imports: lamindb==0.72.0 pytest==8.2.1
💡 loaded: Transform(version='1', uid='ANW20Fr4eZgM5zKv', name='What happens if I save the same artifacts & records twice?', key='idempotency', type='notebook', updated_at=2024-05-20 08:58:18 UTC, created_by_id=1)
💡 saved: Run(uid='jgStTJ7lXhN1TXdpPFpR', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_jgStTJ7lXhN1TXdpPFpR.txt
❗ returning existing artifact with same hash: Artifact(updated_at=2024-05-20 08:58:21 UTC, uid='geRYWso4sA2pi9jHSsch', suffix='.fcs', description='My fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1)
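The three assertions above describe an observable contract: the artifact keeps its id, its run points to the current run, and the earlier run moves into previous_runs. A toy model of that contract could look as follows; ToyArtifact and reencounter are hypothetical names, and the sketch makes no claim about how LaminDB stores or updates these fields internally.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyArtifact:
    id: int
    run: int  # id of the run that last produced or returned the artifact
    previous_runs: tuple = ()  # ids of earlier runs, oldest first


def reencounter(artifact, current_run):
    """Return the view of an existing artifact as seen from `current_run`."""
    if artifact.run == current_run:
        return artifact  # same run: nothing changes
    return replace(
        artifact,
        run=current_run,
        previous_runs=artifact.previous_runs + (artifact.run,),
    )


artifact2 = ToyArtifact(id=1, run=1)
artifact3 = reencounter(artifact2, current_run=2)
repeated = reencounter(artifact3, current_run=2)  # a second call in the same run
```

In this model, re-encountering the artifact a second time within the same run leaves it unchanged, so the lineage update is itself idempotent.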
Error upon trying to re-ingest an existing artifact¶
ln.settings.upon_artifact_create_if_hash_exists = "error"
In this case, you’ll not be able to create an object from the same content:
with pytest.raises(FileExistsError):
    artifact3 = ln.Artifact(filepath, description="My new fcs artifact")
Warn and create a new artifact¶
Lastly, let us discuss the following setting:
ln.settings.upon_artifact_create_if_hash_exists = "warn_create_new"
In this case, you’ll create a new object:
artifact4 = ln.Artifact(filepath, description="My new fcs artifact")
artifact4.save()
❗ creating new Artifact object despite existing artifact with same hash: Artifact(updated_at=2024-05-20 08:58:21 UTC, uid='geRYWso4sA2pi9jHSsch', suffix='.fcs', description='My fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1)
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/eoZtbsAnPY0I8lvC3AOh.fcs')
✅ storing artifact 'eoZtbsAnPY0I8lvC3AOh' at '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/eoZtbsAnPY0I8lvC3AOh.fcs'
Artifact(updated_at=2024-05-20 08:58:22 UTC, uid='eoZtbsAnPY0I8lvC3AOh', suffix='.fcs', description='My new fcs artifact', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=2)
You can verify that it’s a new entry by comparing the ids:
assert artifact4.id != artifact.id
ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").df()
id | version | created_at | created_by_id | updated_at | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | None | 2024-05-20 08:58:21.734588+00:00 | 1 | 2024-05-20 08:58:21.794564+00:00 | geRYWso4sA2pi9jHSsch | 1 | None | .fcs | None | My fcs artifact | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | md5 | None | None | 1 | 1 | 1 | True
2 | None | 2024-05-20 08:58:22.806007+00:00 | 1 | 2024-05-20 08:58:22.806060+00:00 | eoZtbsAnPY0I8lvC3AOh | 1 | None | .fcs | None | My new fcs artifact | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | md5 | None | None | 1 | 2 | 1 | True
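Once "warn_create_new" has been used, several records can share one hash. A simple way to spot such duplicates in an exported table is to group record ids by hash. The sketch below works on plain dictionaries holding the id, uid and hash values from the two rows above; group_ids_by_hash is a hypothetical helper, not a LaminDB function.

```python
from collections import defaultdict

# Values copied from the two rows of the table (only the columns needed here).
rows = [
    {"id": 1, "uid": "geRYWso4sA2pi9jHSsch", "hash": "KCEXRahJ-Ui9Y6nksQ8z1A"},
    {"id": 2, "uid": "eoZtbsAnPY0I8lvC3AOh", "hash": "KCEXRahJ-Ui9Y6nksQ8z1A"},
]


def group_ids_by_hash(rows):
    """Map each content hash to the ids of all records that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["hash"]].append(row["id"])
    return dict(groups)


groups = group_ids_by_hash(rows)
# Keep only the hashes that occur more than once.
duplicated = {h: ids for h, ids in groups.items() if len(ids) > 1}
```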
assert len(ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").list()) == 2
!lamin delete --force test-idempotency
!rm -r test-idempotency
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.9/x64/bin/lamin", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 367, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamin_cli/__main__.py", line 103, in delete
return delete(instance, force=force)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/_delete.py", line 98, in delete
n_objects = check_storage_is_empty(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/core/upath.py", line 760, in check_storage_is_empty
raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb contains 2 objects ('_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/_is_initialized', '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/eoZtbsAnPY0I8lvC3AOh.fcs', '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/geRYWso4sA2pi9jHSsch.fcs']