Select Top Features
Technically, any observation-by-feature matrix is acceptable for the method we developed, and users are encouraged to explore its usability with other types of data, even outside a biological context. However, single-cell transcriptomics data, as provided here, is usually high-dimensional and contains technical and biological noise. After testing different approaches to reducing dimensionality and noise, we recommend that users select a number of top differentially expressed genes (DEGs) for each cluster (or group of clusters) that a vertex represents.
We implemented a fast Wilcoxon rank-sum test, which can be invoked with the function select_top_features.
The test is done in a one-group-versus-all-other-groups manner. Here, we choose the top DEGs for Osteoblast cells
(shortened as "OS"), Reticular cells ("RE") and Chondrocytes ("CH"), as also shown in the previously mentioned
publication. The number of top DEGs for each cluster is set to 30 (n_top=30), so at most 90 unique genes are expected to be returned.
import CytoSimplex as csx
import scanpy as sc
adata = sc.read(filename='test.h5ad',
                backup_url="https://figshare.com/ndownloader/files/41034857")
vertices = {"OS": "Osteoblast_1",
            "RE": "Reticular_1",
            "CH": "Chondrocyte_1"}
selected_genes = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, n_top=30)
selected_genes[:10]
['Steap1',
'Smim5',
'H2-DMa',
'Zcchc5',
'Lims2',
'Fam89a',
'Ninj2',
'Scin',
'Pygl',
'Slc2a5']
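For intuition about the one-versus-rest selection described above, here is a minimal sketch on synthetic data using SciPy's `mannwhitneyu`. It is not the implementation inside select_top_features: the helper name `top_features_one_vs_rest`, the synthetic matrix, and the ranking criterion (smallest one-sided p-value) are all assumptions made for illustration. It does show why the number of returned genes is at most n_top times the number of vertices: the per-group lists are merged and duplicates are dropped.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic data (assumption): 60 cells x 5 genes, three groups of 20 cells.
groups = np.repeat(["OS", "RE", "CH"], 20)
genes = np.array(["g0", "g1", "g2", "g3", "g4"])
expr = rng.poisson(1.0, size=(groups.size, genes.size)).astype(float)
# Plant one marker gene per group so the test has something to find.
expr[groups == "OS", 0] += 5
expr[groups == "RE", 1] += 5
expr[groups == "CH", 2] += 5


def top_features_one_vs_rest(expr, groups, genes, n_top):
    """Hypothetical helper: union of the n_top up-regulated genes per group."""
    selected = []
    for g in np.unique(groups):
        in_group = groups == g
        # One-sided rank-sum test for every gene: group g versus all other cells.
        res = mannwhitneyu(expr[in_group], expr[~in_group],
                           alternative="greater", axis=0)
        # Rank genes by p-value (an illustrative criterion, not the package's).
        order = np.argsort(res.pvalue)[:n_top]
        selected.extend(str(name) for name in genes[order])
    # A gene may rank highly for several groups, so drop duplicates
    # while preserving order.
    return list(dict.fromkeys(selected))


selected = top_features_one_vs_rest(expr, groups, genes, n_top=1)
print(selected)
```

With `n_top=1` and three groups, at most three genes can come back; fewer would be returned if the same gene topped more than one group.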
Alternatively, users can set return_stats=True to obtain a table of
full Wilcoxon rank-sum test statistics, which covers all clusters rather than only the selected vertices.
stats = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, return_stats=True)
stats
| | group | avgExpr | logFC | ustat | auc | pval | padj | pct_in | pct_out | feature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CH | 0.000000 | 0.000000 | 3010.5 | 0.500000 | NaN | NaN | 0.000000 | 0.000000 | Rp1 |
| 1 | CH | 0.000000 | 0.000000 | 3010.5 | 0.500000 | NaN | NaN | 0.000000 | 0.000000 | Sox17 |
| 2 | CH | 10.610097 | 7.637897 | 4547.5 | 0.755273 | 4.666025e-08 | 4.881361e-07 | 81.481481 | 21.524664 | Mrpl15 |
| 3 | CH | 6.077630 | 3.874176 | 3826.0 | 0.635443 | 9.406324e-04 | 3.637987e-03 | 48.148148 | 16.143498 | Lypla1 |
| 4 | CH | 0.000000 | -0.057590 | 2997.0 | 0.497758 | 7.669166e-01 | 1.000000e+00 | 0.000000 | 0.448430 | Gm37988 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242911 | Reticular_2 | 1.582342 | -0.496039 | 1048.0 | 0.483172 | 7.931468e-01 | 1.000000e+00 | 11.111111 | 14.937759 | PISD |
| 242912 | Reticular_2 | 0.000000 | -2.912214 | 846.0 | 0.390041 | 1.202520e-01 | 1.000000e+00 | 0.000000 | 21.991701 | DHRSX |
| 242913 | Reticular_2 | 0.000000 | -0.154048 | 1071.0 | 0.493776 | 7.746696e-01 | 1.000000e+00 | 0.000000 | 1.244813 | CAAA01147332.1 |
| 242914 | Reticular_2 | 16.142743 | 2.744758 | 1350.0 | 0.622407 | 2.151089e-01 | 1.000000e+00 | 100.000000 | 83.402490 | tdT-WPRE_trans |
| 242915 | Reticular_2 | 0.000000 | 0.000000 | 1084.5 | 0.500000 | NaN | NaN | 0.000000 | 0.000000 | CreER-WPRE_trans |
242916 rows × 10 columns
The returned table is the concatenation of the tables of all tests, with the column group indicating
which cluster is tested against all others. For example, the third row (index 2) is the result of the test for gene “Mrpl15” in group “CH”,
which represents the original cluster “Chondrocyte_1”, against all other clusters.
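Because the statistics come back as a table, they can be filtered with ordinary pandas operations. The sketch below rebuilds a few rows from the table above (only the group, logFC, padj and feature columns) and keeps the significant, up-regulated genes for group "CH". The thresholds `padj_max` and `lfc_min` are illustrative assumptions, not defaults of select_top_features.

```python
import pandas as pd

# A few rows copied from the table above, keeping only four of its columns.
stats = pd.DataFrame({
    "group": ["CH", "CH", "CH", "Reticular_2"],
    "logFC": [7.637897, 3.874176, -0.057590, 2.744758],
    "padj": [4.881361e-07, 3.637987e-03, 1.0, 1.0],
    "feature": ["Mrpl15", "Lypla1", "Gm37988", "tdT-WPRE_trans"],
})

# Illustrative thresholds (assumptions, not package defaults).
padj_max = 0.05
lfc_min = 0.0

# Keep rows for group "CH" that pass both thresholds, strongest effect first.
keep = (
    (stats["group"] == "CH")
    & (stats["padj"] < padj_max)
    & (stats["logFC"] > lfc_min)
)
ch_hits = stats[keep].sort_values("logFC", ascending=False)
ch_genes = ch_hits["feature"].tolist()
print(ch_genes)
```

Rows with NaN in padj (genes with no expression in either group, such as “Rp1” above) fail the `<` comparison and are dropped automatically.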