RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

  • In this guide, you’ll see how to query some of these data using LaminDB: laminlabs/rxrx.

  • If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

  • If you’d like to understand how the laminlabs/rxrx instance was curated, see this repository.

Setup

import lamindb as ln
import bionty as bt
import wetlab as wl

ln.connect("laminlabs/lamindata")

Search & look up metadata

We’ll find all treatments in the Treatment registry:

df = wl.Treatment.df()
df.shape
(1139, 13)

Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:

sirnas = wl.Treatment.filter(system="siRNA").lookup(return_field="name")
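
To make the idea of a lookup object concrete, here is a minimal sketch of the concept in plain Python. It is not LaminDB's implementation: the helpers `to_attribute` and `make_lookup` are hypothetical, and the name list is a two-item stand-in for the registry. The sketch only shows how names can be exposed as attributes so that an editor can auto-complete them.

```python
import re
from types import SimpleNamespace


def to_attribute(name: str) -> str:
    """Turn an arbitrary name into a lowercase, identifier-like key."""
    key = re.sub(r"\W+", "_", name).strip("_").lower()
    # Identifiers cannot start with a digit, so prefix those (assumed convention).
    if not key or key[0].isdigit():
        key = f"n_{key}"
    return key


def make_lookup(names):
    """Expose each name as an attribute that returns the original name."""
    return SimpleNamespace(**{to_attribute(name): name for name in names})


# Stand-in values; the real registry holds many more treatments.
sirna_names = ["s15652", "EMPTY"]
lookup = make_lookup(sirna_names)

print(lookup.s15652)
print(lookup.empty)
```

Typing `lookup.` in an interactive session then offers the available names, which is the convenience the lookup object provides for queries.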

We’re also interested in features, cell lines & wells:

ln.Feature.df()
created_at created_by_id run_id updated_at uid name dtype unit description synonyms
id
135 2023-07-12 12:54:25.605932+00:00 2 None 2024-03-26 13:23:37.050138+00:00 UPnuN18Vro7T sirna float None None None
134 2023-07-12 12:54:25.605879+00:00 2 None 2024-03-26 13:20:59.284093+00:00 RFz9tVF39RXJ well_type float None None None
132 2023-07-12 12:54:25.605769+00:00 2 None 2024-03-26 13:20:57.352241+00:00 ghhC57uNYQhD well float None None None
131 2023-07-12 12:54:25.605717+00:00 2 None 2024-03-26 13:20:13.255028+00:00 gUecWT2bNsch plate float None None None
303 2023-07-12 12:54:25.605663+00:00 2 None 2024-03-26 13:20:11.349207+00:00 4ycwa8er0EB2 experiment cat[ULabel|wetlab.Experiment] None None None
... ... ... ... ... ... ... ... ... ... ...
5 2023-07-12 12:54:24.401456+00:00 2 None 2023-10-14 15:42:03.557973+00:00 b1oB0I2Nxx7w feature_4 float None None None
4 2023-07-12 12:54:24.401441+00:00 2 None 2023-10-14 15:42:03.431243+00:00 qehni2DU75bT feature_3 float None None None
3 2023-07-12 12:54:24.401425+00:00 2 None 2023-10-14 15:42:03.306655+00:00 cANjhBnEosz7 feature_2 float None None None
2 2023-07-12 12:54:24.401408+00:00 2 None 2023-10-14 15:42:03.181750+00:00 RhHNXlP1jpqi feature_1 float None None None
1 2023-07-12 12:54:24.401373+00:00 2 None 2023-10-14 15:42:03.055457+00:00 UwWDQLrCTdks feature_0 float None None None

311 rows × 10 columns

cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")

Load the collection

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimension is 512x512x6.

Let us get the corresponding object and some information about it:

collection = ln.Collection.filter(uid="KMEQhAvRQDXLvNTNWlsT").one()
collection.view_lineage()
collection.describe()
[Figure: lineage graph of the collection, rendered by collection.view_lineage()]
Collection(version='1', updated_at=2024-03-26 13:24:48 UTC, uid='KMEQhAvRQDXLvNTNWlsT', name='Annotated RxRx1 images', hash='jKVAYzd5in11dWtr-C0M7g', visibility=1)

Provenance:
  📎 created_by: User(uid='FBa7SHjn', handle='falexwolf', name='Alex Wolf')
  📎 transform: Transform(version='1', uid='Zo0qJt4IQPsb5zKv', name='Ingest the RxRx1 dataset', key='02-rxrx1-ingest', type='notebook')
  📎 run: Run(uid='o7nwbuGqaY65aZ6jzmrt', started_at=2024-03-26 13:13:16 UTC, is_consecutive=True)
  📎 artifact: Artifact(uid='KMEQhAvRQDXLvNTNWlsT', key='rxrx1/metadata.parquet', suffix='.parquet', accessor='DataFrame', description='Metadata with file paths for each RxRx1 image.', size=5722206, hash='jKVAYzd5in11dWtr-C0M7g', hash_type='md5', visibility=1, key_is_virtual=True)
Features:
  columns: FeatureSet(uid='E58U5AxvUTGmMnE5P4iT', n=11, registry='Feature')
    path (cat)
    well_id (float)
    plate (float)
    well (float)
    site (float)
    well_type (float)
    sirna (float)
    sirna_id (float)
    🔗 experiment (cat[ULabel|wetlab.Experiment])
        🔗 experiment (11, ULabel): 
        🔗 experiment (11, wetlab.Experiment): 
    🔗 cell_line (cat[bionty.CellLine])
    🔗 split (11, cat[ULabel]): 'train', 'test'
  external: FeatureSet(uid='jyMP6a3aI8kKzVZquyFK', n=1, registry='Feature')
    🔗 readout (cat[bionty.ExperimentalFactor])
Labels:
  📎 ulabels (2, ULabel): 'train', 'test'

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.artifact.load().head()
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id path
0 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1.png
1 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w2.png
2 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w3.png
3 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w4.png
4 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w5.png
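
The `path` values in the rows above appear to encode the same metadata as the other columns. The sketch below parses one such path into its components. The pattern is inferred from the five rows shown and may not cover every file; reading the trailing `w` number as the imaging channel is an assumption. `ImageKey` and `parse_image_key` are hypothetical names.

```python
import re
from typing import NamedTuple


class ImageKey(NamedTuple):
    split: str
    experiment: str
    plate: int
    well: str
    site: int
    channel: int  # assumption: the "w" suffix is the imaging channel


# Pattern inferred from paths such as
# images/test/HEPG2-08/Plate1/B02_s1_w1.png
_PATTERN = re.compile(
    r"^images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_key(key: str) -> ImageKey:
    """Split a relative image path into its metadata components."""
    match = _PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected image path: {key!r}")
    groups = match.groupdict()
    return ImageKey(
        split=groups["split"],
        experiment=groups["experiment"],
        plate=int(groups["plate"]),
        well=groups["well"],
        site=int(groups["site"]),
        channel=int(groups["channel"]),
    )


parsed = parse_image_key("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
print(parsed)
```

A parser like this can serve as a consistency check between the `path` column and the `split`, `experiment`, `plate`, `well`, and `site` columns.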

Query image files

Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:

# df = collection.artifact.load()

We can query a subset of images using metadata registries & pandas query syntax:

# query = df[
#     (df.cell_line == cell_lines.hep_g2_cell)
#     & (df.sirna == sirnas.s15652)
#     & (df.well == wells.m15)
#     & (df.plate == 1)
#     & (df.site == 2)
# ]
# query
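
The same boolean-mask pattern can be tried without a database connection. The sketch below runs it on a small toy table: the rows and paths are invented for illustration, only the column names follow the RxRx1 metadata shown earlier, and `select_images` is a hypothetical helper.

```python
import pandas as pd

# Toy rows, invented for illustration; not taken from the real metadata file.
df = pd.DataFrame(
    {
        "cell_line": ["HEPG2", "HEPG2", "HEPG2", "HUVEC"],
        "sirna": ["EMPTY", "s15652", "s15652", "s15652"],
        "well": ["B02", "M15", "M15", "M15"],
        "plate": [1, 1, 1, 1],
        "site": [1, 1, 2, 2],
        "path": [
            "images/toy/A/Plate1/B02_s1_w1.png",
            "images/toy/A/Plate1/M15_s1_w1.png",
            "images/toy/A/Plate1/M15_s2_w1.png",
            "images/toy/B/Plate1/M15_s2_w1.png",
        ],
    }
)


def select_images(df, cell_line, sirna, well, plate, site):
    """Return the rows that match all five metadata values."""
    mask = (
        (df["cell_line"] == cell_line)
        & (df["sirna"] == sirna)
        & (df["well"] == well)
        & (df["plate"] == plate)
        & (df["site"] == site)
    )
    return df[mask]


query = select_images(df, "HEPG2", "s15652", "M15", plate=1, site=2)
print(query["path"].tolist())
```

Each condition is wrapped in parentheses because `&` binds more tightly than `==` in Python; omitting them is a common source of errors in pandas filters.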

To access the individual images based on this query result:

# images = [collection.artifact.path.parent / key for key in query.path]
# images
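
The path arithmetic above resolves each relative image key against the folder that contains the metadata file. A minimal sketch with the standard library, assuming a hypothetical local location for the metadata file (the real artifact lives in cloud storage and uses a cloud-aware path class):

```python
from pathlib import PurePosixPath

# Hypothetical location of the metadata file, for illustration only.
metadata_path = PurePosixPath("/data/rxrx1/metadata.parquet")

# Relative keys as they appear in the `path` column of the metadata.
keys = [
    "images/test/HEPG2-08/Plate1/B02_s1_w1.png",
    "images/test/HEPG2-08/Plate1/B02_s1_w2.png",
]

# The parent of the metadata file is the root folder of the image keys.
images = [metadata_path.parent / key for key in keys]
print(images[0])
```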

Download an image to disk:

# path = ln.UPath(images[1])
# path.download_to(".")
# from IPython.display import Image
# Image(f"./{path.name}")

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb

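# Look up feature names and get the metadata artifact of the collection
features = ln.Feature.lookup(return_field="name")
artifact = collection.artifact

# SQL filter expression; named to avoid shadowing the built-in `filter`
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)

parquet_data = duckdb.from_parquet(artifact.path.as_posix())

parquet_data.filter(filter_expr)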