RxRx: cell imaging
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro based on morphology and protein expression (5-10 stains) across a range of conditions.
In this guide, you’ll see how to query some of these data using LaminDB: laminlabs/rxrx.
If you’d like to transfer data into your own LaminDB instance, see the transfer guide.
If you’d like to understand how the laminlabs/rxrx instance was curated, see this repository.
Setup
import lamindb as ln
import bionty as bt
import wetlab as wl
ln.connect("laminlabs/lamindata")
Search & look up metadata
We’ll find all treatments in the Treatment registry:
df = wl.Treatment.df()
df.shape
(1139, 13)
Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:
sirnas = wl.Treatment.filter(system="siRNA").lookup(return_field="name")
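Auto-completion works because a lookup object exposes each record under a Python-safe attribute name. The sketch below is not LaminDB's implementation: `make_lookup` and its normalization rule (lowercase, runs of non-alphanumeric characters replaced by underscores) are assumptions for illustration, and the example names are made up to mirror the attribute names used later in this guide.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name as an attribute with a Python-safe key.

    Hypothetical helper: the normalization rule is an assumption, not LaminDB's.
    """
    entries = {}
    for name in names:
        # Lowercase, then collapse every run of non-alphanumeric characters into "_"
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        if not key:
            continue  # nothing usable left after normalization
        if key[0].isdigit():
            key = f"n_{key}"  # attribute names cannot start with a digit
        entries[key] = name
    return SimpleNamespace(**entries)


lookup = make_lookup(["s15652", "M15", "Hep G2 cell"])
print(lookup.s15652, lookup.m15, lookup.hep_g2_cell)
```

In an interactive session, typing `lookup.` and pressing tab lists the available keys, which is what makes expressions such as `sirnas.s15652` convenient to write.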
We’re also interested in features, cell lines & wells:
ln.Feature.df()
id | created_at | created_by_id | run_id | updated_at | uid | name | dtype | unit | description | synonyms |
---|---|---|---|---|---|---|---|---|---|---|
135 | 2023-07-12 12:54:25.605932+00:00 | 2 | None | 2024-03-26 13:23:37.050138+00:00 | UPnuN18Vro7T | sirna | float | None | None | None |
134 | 2023-07-12 12:54:25.605879+00:00 | 2 | None | 2024-03-26 13:20:59.284093+00:00 | RFz9tVF39RXJ | well_type | float | None | None | None |
132 | 2023-07-12 12:54:25.605769+00:00 | 2 | None | 2024-03-26 13:20:57.352241+00:00 | ghhC57uNYQhD | well | float | None | None | None |
131 | 2023-07-12 12:54:25.605717+00:00 | 2 | None | 2024-03-26 13:20:13.255028+00:00 | gUecWT2bNsch | plate | float | None | None | None |
303 | 2023-07-12 12:54:25.605663+00:00 | 2 | None | 2024-03-26 13:20:11.349207+00:00 | 4ycwa8er0EB2 | experiment | cat[ULabel|wetlab.Experiment] | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5 | 2023-07-12 12:54:24.401456+00:00 | 2 | None | 2023-10-14 15:42:03.557973+00:00 | b1oB0I2Nxx7w | feature_4 | float | None | None | None |
4 | 2023-07-12 12:54:24.401441+00:00 | 2 | None | 2023-10-14 15:42:03.431243+00:00 | qehni2DU75bT | feature_3 | float | None | None | None |
3 | 2023-07-12 12:54:24.401425+00:00 | 2 | None | 2023-10-14 15:42:03.306655+00:00 | cANjhBnEosz7 | feature_2 | float | None | None | None |
2 | 2023-07-12 12:54:24.401408+00:00 | 2 | None | 2023-10-14 15:42:03.181750+00:00 | RhHNXlP1jpqi | feature_1 | float | None | None | None |
1 | 2023-07-12 12:54:24.401373+00:00 | 2 | None | 2023-10-14 15:42:03.055457+00:00 | UwWDQLrCTdks | feature_0 | float | None | None | None |
311 rows × 10 columns
cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
Load the collection
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimension is 512x512x6.
Let us get the corresponding object and some information about it:
collection = ln.Collection.filter(uid="KMEQhAvRQDXLvNTNWlsT").one()
collection.view_lineage()
collection.describe()
Collection(version='1', updated_at=2024-03-26 13:24:48 UTC, uid='KMEQhAvRQDXLvNTNWlsT', name='Annotated RxRx1 images', hash='jKVAYzd5in11dWtr-C0M7g', visibility=1)
Provenance:
📎 created_by: User(uid='FBa7SHjn', handle='falexwolf', name='Alex Wolf')
📎 transform: Transform(version='1', uid='Zo0qJt4IQPsb5zKv', name='Ingest the RxRx1 dataset', key='02-rxrx1-ingest', type='notebook')
📎 run: Run(uid='o7nwbuGqaY65aZ6jzmrt', started_at=2024-03-26 13:13:16 UTC, is_consecutive=True)
📎 artifact: uid='KMEQhAvRQDXLvNTNWlsT', key='rxrx1/metadata.parquet', suffix='.parquet', accessor='DataFrame', description='Metadata with file paths for each RxRx1 image.', size=5722206, hash='jKVAYzd5in11dWtr-C0M7g', hash_type='md5', visibility=1, key_is_virtual=True)
Features:
columns: FeatureSet(uid='E58U5AxvUTGmMnE5P4iT', n=11, registry='Feature')
path (cat)
well_id (float)
plate (float)
well (float)
site (float)
well_type (float)
sirna (float)
sirna_id (float)
🔗 experiment (cat[ULabel|wetlab.Experiment])
🔗 experiment (11, ULabel):
🔗 experiment (11, wetlab.Experiment):
🔗 cell_line (cat[bionty.CellLine])
🔗 split (11, cat[ULabel]): 'train', 'test'
external: FeatureSet(uid='jyMP6a3aI8kKzVZquyFK', n=1, registry='Feature')
🔗 readout (cat[bionty.ExperimentalFactor])
Labels:
📎 ulabels (2, ULabel): 'train', 'test'
The dataset consists of a metadata file and a folder path pointing to the image files:
collection.artifact.load().head()
 | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
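The `path` column repeats several of the other metadata columns in its folder and file names. As a minimal sketch, the regular expression below splits such a path back into fields. The layout is read off the rows shown above; `parse_image_path` is a hypothetical helper, and reading the trailing `w` number as the imaging channel is an assumption.

```python
import re

# Assumed layout, inferred from the `path` column:
# images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]+\d+)_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_path(path):
    """Split an RxRx1-style image path into its metadata fields (hypothetical helper)."""
    match = PATH_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected path layout: {path}")
    fields = match.groupdict()
    # Convert the numeric parts so they compare cleanly against integer columns
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


fields = parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
print(fields)
```

A check like this can help confirm that a row's `plate`, `well`, and `site` values agree with the file it points to.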
Query image files
Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:
# df = collection.artifact.load()
We can query a subset of images using metadata registries & pandas query syntax:
# query = df[
# (df.cell_line == cell_lines.hep_g2_cell)
# & (df.sirna == sirnas.s15652)
# & (df.well == wells.m15)
# & (df.plate == 1)
# & (df.site == 2)
# ]
# query
To access the individual images based on this query result:
# images = [collection.artifact.path.parent / key for key in query.path]
# images
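The list comprehension above resolves each relative key in the `path` column against the folder that holds the metadata file. A standard-library sketch of the same join follows; the `example-bucket` root is hypothetical because the real storage location is not shown in this guide, while `rxrx1/metadata.parquet` is the artifact key reported by `collection.describe()`.

```python
from pathlib import PurePosixPath

# Hypothetical storage root; only the "rxrx1/metadata.parquet" key comes from the guide
metadata_path = PurePosixPath("example-bucket/rxrx1/metadata.parquet")

# Two relative keys copied from the `path` column of the metadata table
keys = [
    "images/test/HEPG2-08/Plate1/B02_s1_w1.png",
    "images/test/HEPG2-08/Plate1/B02_s1_w2.png",
]

# The image folder sits next to the metadata file, so join against its parent
images = [metadata_path.parent / key for key in keys]
print(images[0])
```

With a cloud path object in place of `PurePosixPath`, the same `/` operator yields paths that can be opened or downloaded.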
Download an image to disk:
# path = ln.UPath(images[1])  # UPath is available as ln.UPath
# path.download_to(".")
# from IPython.display import Image
# Image(f"./{path.name}")
Use DuckDB to query metadata
As an alternative to pandas, we could use DuckDB to query image metadata.
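import duckdb

# Assumption: `features` and `artifact` are not defined earlier in this guide
features = ln.Feature.lookup(return_field="name")
artifact = collection.artifact  # the metadata parquet file

# The metadata column is `cell_line`; avoid shadowing the built-in `filter`
condition = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)
parquet_data = duckdb.from_parquet(artifact.path.as_posix())
parquet_data.filter(condition)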
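The filter expression passed to DuckDB is an ordinary SQL-style predicate string, so it can also be assembled from a dictionary of column/value pairs. The sketch below shows one way to do that; `build_condition` is a hypothetical helper, and whether `plate` and `site` are stored as numbers or strings in the parquet file is an assumption to check against the actual schema. The example values come from the first rows of the metadata table.

```python
def build_condition(criteria):
    """Join column/value pairs into one SQL-style predicate (hypothetical helper).

    Column names are inserted as-is, so they must come from trusted code.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, str):
            # Double any single quote so the value stays inside its string literal
            escaped = value.replace("'", "''")
            clauses.append(f"{column} = '{escaped}'")
        else:
            clauses.append(f"{column} = {value}")
    return " and ".join(clauses)


condition = build_condition({"cell_line": "HEPG2", "well": "B02", "plate": 1, "site": 1})
print(condition)
```

The resulting string can be passed to a relation's `filter` method in place of a hand-written f-string.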