CELLxGENE: scRNA-seq¶

CZ CELLxGENE hosts the world's largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, pdfs, notebooks, ML models, …).

You can use the CELLxGENE data in three ways:

In the current guide, you’ll see how to query metadata and data based on AnnData objects.
If you want to use these in your own LaminDB instance, see the transfer guide.
If you’d like to leverage the TileDB-SOMA API for the data subset of CELLxGENE Census, see the Census guide.

If you are interested in building similar data assets in-house:

See the scRNA guide for how to create a growing versioned queryable scRNA-seq dataset.
See the Annotate guide for validating, curating, and registering your own AnnData objects.
Reach out if you are interested in a full zero-copy clone of laminlabs/cellxgene to accelerate building your in-house LaminDB instances.

Setup¶

On the CLI, load the public LaminDB instance that mirrors CELLxGENE:

!lamin load laminlabs/cellxgene

💡 connected lamindb: laminlabs/cellxgene

import lamindb as ln
import bionty as bt

💡 connected lamindb: laminlabs/cellxgene

❗ Full backed capabilities are not available for this version of anndata, please install anndata>=0.9.1.

Query & understand metadata¶

Auto-complete metadata¶

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell

CellType(updated_at=2023-11-28 22:30:57 UTC, uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, public_source_id=48)

You can also arbitrarily chain filters and create lookups from them:

organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(name="is_suspension_type").one().children.lookup()  # suspension types
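Attribute names on a lookup object are derived from record names, which is why the assay named `10x 3' v2` is accessed as `experimental_factors.ln_10x_3_v2` later in this guide. The following is a minimal sketch of how such a normalization could work. The helpers `to_attribute_name` and `make_lookup` are hypothetical, and the `ln_` prefix rule is inferred from that single attribute name, not taken from LaminDB's implementation.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name: str, prefix: str = "ln_") -> str:
    """Turn a free-text label into a valid Python identifier (illustrative rule only)."""
    # Lowercase, then collapse every run of non-alphanumeric characters into "_"
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower()).strip("_")
    # Identifiers cannot start with a digit, so add a prefix in that case
    if key and key[0].isdigit():
        key = prefix + key
    return key


def make_lookup(names):
    """Build a simple attribute-style lookup from a list of names (hypothetical helper)."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


lookup = make_lookup(["effector T cell", "10x 3' v2", "kidney"])
print(lookup.effector_t_cell)
print(lookup.ln_10x_3_v2)
```

A real lookup object returns full records instead of plain strings, so fields such as `.name` or `.ontology_id` are available on each attribute.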

Understand ontologies¶

View the related ontology terms:

cell_types.effector_t_cell.view_parents(distance=2, with_children=True)

Or access them programmatically:

cell_types.effector_t_cell.children.df()

	created_at	created_by_id	run_id	updated_at	uid	name	ontology_id	abbr	synonyms	description	public_source_id
id
931	2023-11-28 22:27:55.565976+00:00	1	None	2023-11-28 22:27:55.565981+00:00	2VQirdSp	effector CD8-positive, alpha-beta T cell	CL:0001050	None	effector CD8-positive, alpha-beta T lymphocyte...	A Cd8-Positive, Alpha-Beta T Cell With The Phe...	48
1088	2023-11-28 22:27:55.569828+00:00	1	None	2023-11-28 22:27:55.569832+00:00	490Xhb24	effector CD4-positive, alpha-beta T cell	CL:0001044	None	effector CD4-positive, alpha-beta T lymphocyte...	A Cd4-Positive, Alpha-Beta T Cell With The Phe...	48
1229	2023-11-28 22:27:55.572880+00:00	1	None	2023-11-28 22:27:55.572884+00:00	69TEBGqb	exhausted T cell	CL:0011025	None	Tex cell|An effector T cell that displays impa...	None	48
1309	2023-11-28 22:27:55.575440+00:00	1	None	2023-11-28 22:27:55.575444+00:00	5s4gCMdn	cytotoxic T cell	CL:0000910	None	cytotoxic T lymphocyte|cytotoxic T-lymphocyte|...	A Mature T Cell That Differentiated And Acquir...	48
1331	2023-11-28 22:27:55.575949+00:00	1	None	2023-11-28 22:27:55.575955+00:00	43cBCa7s	helper T cell	CL:0000912	None	helper T-lymphocyte|T-helper cell|helper T lym...	A Effector T Cell That Provides Help In The Fo...	48

Query artifacts¶

Unlike in the SOMA guide, here we'll query sets of .h5ad files, each of which corresponds to an AnnData object.

To access them, we query the Collection record that links the latest LTS set of .h5ad artifacts:

collection = ln.Collection.filter(name="cellxgene-census", version="2023-12-15").one()
collection

Collection(version='2023-12-15', updated_at=2024-01-30 09:09:49 UTC, uid='dMyEX3NTfKOEYXyMu591', name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24)

You can get all linked artifacts as a DataFrame. There are more than 1,000 .h5ad files in cellxgene-census version 2023-12-15.

collection.artifacts.count()

collection.artifacts.df().head()  # not tracking run & transform because read-only instance

	version	created_at	created_by_id	updated_at	uid	storage_id	key	suffix	accessor	description	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual
id
2825	2023-12-15	2024-01-11 09:13:25.387366+00:00	1	2024-01-24 07:18:54.197599+00:00	OoktqBIu8jCoGOJlaQPo	2	cell-census/2023-12-15/h5ads/fc0ceb80-d2d9-47c...	.h5ad	AnnData	Sst Chodl - DLPFC: Seattle Alzheimer's Disease...	73375840	DqV7FraZIIP_l2DJuvHk_g-9	md5-n	None	1877	16	22	1	False
2031	2023-12-15	2024-01-11 09:13:23.820851+00:00	1	2024-01-24 07:19:02.027481+00:00	n33nFE2kXSNzNhIAtS3S	2	cell-census/2023-12-15/h5ads/44c83972-e5d2-485...	.h5ad	AnnData	L5 IT - DLPFC: Seattle Alzheimer's Disease Atl...	4605202922	ztuPyGXWH_OyCq1OyPlNkw-549	md5-n	None	104106	16	22	1	False
1813	2023-12-15	2024-01-11 09:13:23.307694+00:00	1	2024-01-24 07:19:04.190720+00:00	mtoOxeGG0Rg3NPH1AlwD	2	cell-census/2023-12-15/h5ads/100c6145-7b0e-4ba...	.h5ad	AnnData	Microglia-PVM - DLPFC: Seattle Alzheimer's Dis...	634716733	-B96CrmiOANuzE3xU78WsQ-76	md5-n	None	42486	16	22	1	False
1804	2023-12-15	2024-01-11 09:13:23.282158+00:00	1	2024-01-24 07:19:04.646675+00:00	V0tqrgE1z1NY2eUUKKQE	2	cell-census/2023-12-15/h5ads/0ed60482-a34f-426...	.h5ad	AnnData	Lamp5 - DLPFC: Seattle Alzheimer's Disease Atl...	1580667477	xRTDQGA4iOC4r8sSgz53vQ-189	md5-n	None	55968	16	22	1	False
2532	2023-12-15	2024-01-11 09:13:24.792407+00:00	1	2024-01-29 07:49:54.125887+00:00	dEP0dZ8UxLgwnkLjHssX	2	cell-census/2023-12-15/h5ads/bd65a70f-b274-413...	.h5ad	AnnData	Single-cell sequencing links multiregional imm...	1204103287	5hUwdflh_erDK-U2bEzfvw-144	md5-n	None	167283	16	22	1	False

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame

	version	created_at	created_by_id	updated_at	uid	storage_id	key	suffix	accessor	description	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual
id
1880	2023-12-15	2024-01-11 09:13:23.448150+00:00	1	2024-01-29 07:46:33.152678+00:00	WwmBIhBNLTlRcSoBpatT	2	cell-census/2023-12-15/h5ads/20d87640-4be8-487...	.h5ad	AnnData	Mature kidney dataset: immune	44647761	hSLF-GPhLXaC2tVIOJEdXA-6	md5-n	None	7803	16	22	1	False
1880	2023-12-15	2024-01-11 09:13:23.448150+00:00	1	2024-01-29 07:46:33.152678+00:00	WwmBIhBNLTlRcSoBpatT	2	cell-census/2023-12-15/h5ads/20d87640-4be8-487...	.h5ad	AnnData	Mature kidney dataset: immune	44647761	hSLF-GPhLXaC2tVIOJEdXA-6	md5-n	None	7803	16	22	1	False
1930	2023-12-15	2024-01-11 09:13:23.544310+00:00	1	2024-01-29 07:46:37.205210+00:00	gHlQ5Muwu3G9pvFC7egT	2	cell-census/2023-12-15/h5ads/2d31c0ca-0233-41c...	.h5ad	AnnData	Fetal kidney dataset: immune	64056560	jENeQIq0JdoHl5PyfY-sjA-8	md5-n	None	6847	16	22	1	False
2405	2023-12-15	2024-01-11 09:13:24.526987+00:00	1	2024-01-29 07:49:11.905786+00:00	P4Oai3OLGAzRwoicaxCB	2	cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b...	.h5ad	AnnData	Mature kidney dataset: full	192484358	yghldeu2bOC5jtvnqZH8Og-23	md5-n	None	40268	16	22	1	False
2405	2023-12-15	2024-01-11 09:13:24.526987+00:00	1	2024-01-29 07:49:11.905786+00:00	P4Oai3OLGAzRwoicaxCB	2	cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b...	.h5ad	AnnData	Mature kidney dataset: full	192484358	yghldeu2bOC5jtvnqZH8Og-23	md5-n	None	40268	16	22	1	False
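Note that artifacts 1880 and 2405 appear twice in the output above. The most likely reason is that `cell_types__in` joins across a many-to-many relation, so an artifact linked to both requested cell types comes back once per matching link. The sketch below simulates this in plain Python. The link table is hypothetical, except that artifact 1880 is annotated with both cell types according to its metadata.

```python
# Hypothetical link table of (artifact_id, cell_type) pairs, as a join would see them
links = [
    (1880, "dendritic cell"),
    (1880, "neutrophil"),
    (1930, "dendritic cell"),
    (2405, "dendritic cell"),
    (2405, "neutrophil"),
]
wanted = {"dendritic cell", "neutrophil"}

# A join-based "__in" filter yields one row per matching link
joined = [artifact_id for artifact_id, cell_type in links if cell_type in wanted]
print(joined)

# De-duplicate while preserving the original order
unique = list(dict.fromkeys(joined))
print(unique)
```

On the real query, dropping duplicate rows from the resulting DataFrame with pandas `drop_duplicates()` should achieve the same. Calling `.distinct()` on the query may also work, assuming the QuerySet exposes the Django-style method.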

Query arrays¶

Each artifact stores an array in the form of an annotated data matrix, an AnnData object.

Let’s look at the first array in the artifact query and show metadata using .describe():

artifact = query.first()
artifact.describe()

Artifact(version='2023-12-15', updated_at=2024-01-29 07:46:33 UTC, uid='WwmBIhBNLTlRcSoBpatT', key='cell-census/2023-12-15/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad', suffix='.h5ad', accessor='AnnData', description='Mature kidney dataset: immune', size=44647761, hash='hSLF-GPhLXaC2tVIOJEdXA-6', hash_type='md5-n', n_observations=7803, visibility=1, key_is_virtual=False)

Provenance:
  📎 created_by: User(uid='kmvZDIX9', handle='sunnyosun', name='Sunny Sun')
  📎 storage: uid='oIYGbD74', root='s3://cellxgene-data-public', type='s3', region='us-west-2')
  📎 transform: Transform(version='0', uid='V4AGIdOJcOgj6K79', name='Census release 2023-12-15 (LTS)', key='cencus-release-2023-12-15-LTS', type='notebook')
  📎 run: Run(uid='UAAiLAi0BrLvlKnsuvP3', started_at=2024-01-29 07:27:05 UTC, is_consecutive=False)
  📎 input_of (core.Run): ['2024-01-30 09:07:36 UTC']
Features:
  var: FeatureSet(uid='MLFo2ZBXvibkOyBR9UOR', n=32922, dtype='number', registry='bionty.Gene')
    'None', 'EBF1', 'LINC02202', 'RNF145', 'LINC01932', 'UBLCP1', 'IL12B', 'LINC01845', 'LINC01847', 'ADRA1B', 'TTC1', 'PWWP2A', 'FABP6', 'FABP6-AS1', 'CCNJL', 'C1QTNF2', 'FAM200C', 'SLU7', 'PTTG1', 'MIR3142HG'
  obs: FeatureSet(uid='zAQ6WnmIMDLslhfgdIOt', name='obs metadata', n=11, dtype='category', registry='Feature')
    🔗 assay (11, cat[bionty.ExperimentalFactor]): '10x 3' v2'
    🔗 cell_type (11, cat[bionty.CellType]): 'classical monocyte', 'plasmacytoid dendritic cell', 'natural killer cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'mast cell', 'neutrophil', 'non-classical monocyte', 'CD8-positive, alpha-beta T cell', 'B cell'
    🔗 development_stage (11, cat[bionty.DevelopmentalStage]): '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage'
    🔗 disease (11, cat[bionty.Disease]): 'normal'
    🔗 donor_id (11, cat[ULabel]): 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3'
    🔗 self_reported_ethnicity (11, cat[bionty.Ethnicity]): 'unknown'
    🔗 sex (11, cat[bionty.Phenotype]): 'male', 'female'
    🔗 tissue (11, cat[bionty.Tissue]): 'renal medulla', 'kidney blood vessel', 'renal pelvis', 'cortex of kidney', 'kidney'
    🔗 organism (11, cat[bionty.Organism]): 'human'
    🔗 tissue_type (11, cat[ULabel]): 
    🔗 suspension_type (11, cat[ULabel]): 'cell'
Labels:
  📎 organisms (1, bionty.Organism): 'human'
  📎 tissues (5, bionty.Tissue): 'renal medulla', 'kidney blood vessel', 'renal pelvis', 'cortex of kidney', 'kidney'
  📎 cell_types (12, bionty.CellType): 'classical monocyte', 'plasmacytoid dendritic cell', 'natural killer cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'mast cell', 'neutrophil', 'non-classical monocyte', 'CD8-positive, alpha-beta T cell', 'B cell'
  📎 diseases (1, bionty.Disease): 'normal'
  📎 phenotypes (2, bionty.Phenotype): 'male', 'female'
  📎 experimental_factors (1, bionty.ExperimentalFactor): '10x 3' v2'
  📎 developmental_stages (12, bionty.DevelopmentalStage): '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage'
  📎 ethnicities (1, bionty.Ethnicity): 'unknown'
  📎 ulabels (14, ULabel): 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3'

If you want to query a slice of the array data, you have two options:

Cache & load the entire array into memory via artifact.load() -> AnnData (caches the h5ad on disk, so that you only download once)
Stream the array from the cloud using a cloud-backed accessor artifact.backed() -> AnnDataAccessor

Both options will run much faster if you run them close to the data (AWS S3 on the US West Coast, consider logging into hosted compute there).

Cache & load:

adata = artifact.load()
adata

Now we have an AnnData object. Its .obs slot stores the observation annotations that match our artifact-level query, so we can re-use almost the same query at the array level.
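An array-level version of the artifact query can be sketched with a boolean mask over the observation annotations. The example below uses a small toy DataFrame as a stand-in for `adata.obs`. The column names and values follow the features listed by `.describe()` above, while the rows themselves are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `adata.obs`; the rows are illustrative, not real data
obs = pd.DataFrame(
    {
        "cell_type": ["dendritic cell", "neutrophil", "B cell", "mast cell"],
        "tissue": ["kidney", "kidney", "kidney", "renal pelvis"],
        "suspension_type": ["cell", "cell", "cell", "cell"],
        "assay": ["10x 3' v2"] * 4,
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Same conditions as the artifact-level query, expressed per observation
mask = (
    obs["cell_type"].isin(["dendritic cell", "neutrophil"])
    & (obs["tissue"] == "kidney")
    & (obs["suspension_type"] == "cell")
    & (obs["assay"] == "10x 3' v2")
)
obs_slice = obs[mask]
print(obs_slice.index.tolist())
```

With a real AnnData object, the mask would be built from `adata.obs` and applied as `adata[mask]`. Instead of hard-coded strings, the names can come from the lookup objects, for example `cell_types.dendritic_cell.name`.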

Stream:

adata_backed = artifact.backed()
adata_backed

We now have an AnnDataAccessor object, which behaves much like an AnnData, and the query looks the same.

Train an ML model¶

You can directly train an ML model on the entire collection.

See Train a machine learning model on a collection.

Exploring data by collection¶

Alternatively, you can:

Search for a file on the LaminHub UI and fetch it with ln.Artifact.get(uid).
Query for a collection you found on CZ CELLxGENE Discover.

Let’s search the collections from CELLxGENE within the 2023-12-15 release:

ln.Collection.filter(version="2023-12-15").search("immune human kidney", limit=10)

<QuerySet []>

Let’s get the record of the top hit collection:

collection = ln.Collection.filter(uid="kqiPjpzpK9H9rdtnV67f").one()

collection

Collection(version='2023-12-15', updated_at=2024-01-29 07:54:33 UTC, uid='kqiPjpzpK9H9rdtnV67f', name='Spatiotemporal immune zonation of the human kidney', description='10.1126/science.aat5031', hash='4wGcXeeqsjVdbRdU7ZuJ', reference='120e86b4-1195-48c5-845b-b98054105eec', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22)

We see it’s a Science paper and we could find more information using the DOI or CELLxGENE collection id.
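As a small sketch, the DOI and the collection ID from the record above can be turned into links. The values are copied from the `description` and `reference` fields shown in the output. The `doi.org` resolver form is standard, whereas the CELLxGENE Discover URL pattern is an assumption and may differ.

```python
# Values copied from the collection record shown above
doi = "10.1126/science.aat5031"  # stored in collection.description
collection_id = "120e86b4-1195-48c5-845b-b98054105eec"  # stored in collection.reference

# Standard DOI resolver link
doi_url = f"https://doi.org/{doi}"

# Assumed URL pattern for a collection page on CZ CELLxGENE Discover
cellxgene_url = f"https://cellxgene.cziscience.com/collections/{collection_id}"

print(doi_url)
print(cellxgene_url)
```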

Check different versions of this collection:

collection.versions.df()

	version	created_at	created_by_id	updated_at	uid	name	description	hash	reference	reference_type	transform_id	run_id	artifact_id	visibility
id
17	2023-07-25	2024-01-08 12:01:20.121086+00:00	1	2024-01-08 12:01:20.121095+00:00	kqiPjpzpK9H9rdtnHWas	Spatiotemporal immune zonation of the human ki...	10.1126/science.aat5031	w_VZE7n841ktaA9FjdLh	120e86b4-1195-48c5-845b-b98054105eec	CELLxGENE Collection ID	NaN	NaN	None	1
365	2023-12-15	2024-01-11 13:41:06.531224+00:00	1	2024-01-29 07:54:33.854515+00:00	kqiPjpzpK9H9rdtnV67f	Spatiotemporal immune zonation of the human ki...	10.1126/science.aat5031	4wGcXeeqsjVdbRdU7ZuJ	120e86b4-1195-48c5-845b-b98054105eec	CELLxGENE Collection ID	17.0	22.0	None	1
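Because the versions are ISO-formatted release dates, the most recent one can be picked by comparing dates. The sketch below uses the two version strings and uids from the table above; the dictionary is only an illustrative stand-in for the DataFrame.

```python
from datetime import date

# version -> uid, copied from the versions table above
versions = {
    "2023-07-25": "kqiPjpzpK9H9rdtnHWas",
    "2023-12-15": "kqiPjpzpK9H9rdtnV67f",
}

# Parse each version string as a date and take the most recent one
latest = max(versions, key=date.fromisoformat)
print(latest, versions[latest])
```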

Each collection has at least one artifact associated with it. Let's get the associated artifacts:

collection.artifacts.df()

	version	created_at	created_by_id	updated_at	uid	storage_id	key	suffix	accessor	description	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual
id
1778	2023-12-15	2024-01-11 09:13:23.214114+00:00	1	2024-01-29 07:46:06.497662+00:00	b2x19Eg28GGSNnXW1hAD	2	cell-census/2023-12-15/h5ads/08073b32-d389-41f...	.h5ad	AnnData	Fetal kidney dataset: nephron	159545411	_JE59jFHDrOn0hj4i1yXSQ-20	md5-n	None	10790	16	22	1	False
1880	2023-12-15	2024-01-11 09:13:23.448150+00:00	1	2024-01-29 07:46:33.152678+00:00	WwmBIhBNLTlRcSoBpatT	2	cell-census/2023-12-15/h5ads/20d87640-4be8-487...	.h5ad	AnnData	Mature kidney dataset: immune	44647761	hSLF-GPhLXaC2tVIOJEdXA-6	md5-n	None	7803	16	22	1	False
1930	2023-12-15	2024-01-11 09:13:23.544310+00:00	1	2024-01-29 07:46:37.205210+00:00	gHlQ5Muwu3G9pvFC7egT	2	cell-census/2023-12-15/h5ads/2d31c0ca-0233-41c...	.h5ad	AnnData	Fetal kidney dataset: immune	64056560	jENeQIq0JdoHl5PyfY-sjA-8	md5-n	None	6847	16	22	1	False
1944	2023-12-15	2024-01-11 09:13:23.568572+00:00	1	2024-01-29 07:46:52.173865+00:00	USUgRVwrCMquHiImhk5D	2	cell-census/2023-12-15/h5ads/2fc9c59f-3cfd-48d...	.h5ad	AnnData	Mature kidney dataset: non PT parenchyma	39294782	3l5iNnBmPFbYfR3-THYWNQ-5	md5-n	None	4620	16	22	1	False
2405	2023-12-15	2024-01-11 09:13:24.526987+00:00	1	2024-01-29 07:49:11.905786+00:00	P4Oai3OLGAzRwoicaxCB	2	cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b...	.h5ad	AnnData	Mature kidney dataset: full	192484358	yghldeu2bOC5jtvnqZH8Og-23	md5-n	None	40268	16	22	1	False
2570	2023-12-15	2024-01-11 09:13:24.870820+00:00	1	2024-01-29 07:50:01.866851+00:00	6mnZ3SeQFhffr3wTdZZb	2	cell-census/2023-12-15/h5ads/c52de62a-058d-4d7...	.h5ad	AnnData	Fetal kidney dataset: stroma	109942751	s24Q5-FNUNQPLZw9BuwOVg-14	md5-n	None	8345	16	22	1	False
2652	2023-12-15	2024-01-11 09:13:25.042157+00:00	1	2024-01-29 07:50:28.610568+00:00	11HQaMeIUaOwyHoOWVvA	2	cell-census/2023-12-15/h5ads/d7dcfd8f-2ee7-438...	.h5ad	AnnData	Fetal kidney dataset: full	341214674	2mnG5TiEpj0Wr5L19TTFRw-41	md5-n	None	27197	16	22	1	False
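Before deciding whether to load or stream these artifacts, it can help to total their sizes and observation counts. The sketch below does this in plain Python with the `description`, `size`, and `n_observations` values copied from the table above. In practice the same numbers would come from the DataFrame columns.

```python
# (description, size in bytes, n_observations), copied from the table above
artifacts = [
    ("Fetal kidney dataset: nephron", 159545411, 10790),
    ("Mature kidney dataset: immune", 44647761, 7803),
    ("Fetal kidney dataset: immune", 64056560, 6847),
    ("Mature kidney dataset: non PT parenchyma", 39294782, 4620),
    ("Mature kidney dataset: full", 192484358, 40268),
    ("Fetal kidney dataset: stroma", 109942751, 8345),
    ("Fetal kidney dataset: full", 341214674, 27197),
]

total_bytes = sum(size for _, size, _ in artifacts)
total_obs = sum(n_obs for _, _, n_obs in artifacts)
largest = max(artifacts, key=lambda row: row[1])

print(f"{len(artifacts)} artifacts, {total_bytes / 1e9:.2f} GB, {total_obs} observations")
print(f"largest: {largest[0]}")
```

Judging by their names, the "full" datasets may overlap with the subset datasets, so the observation total should not be read as a count of distinct cells.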
