CytoSimplex.select_top_features

CytoSimplex.select_top_features(x, cluster_var, vertices, n_top=30, processed=False, lfc_thresh=0.1, return_stats=False, feature_names=None)

Select the top features for each vertex based on the Wilcoxon rank-sum test.

Parameters
  • x (Union[AnnData, ndarray, csr_matrix]) – The gene expression matrix, where each row is a cell and each column is a gene. Full-size raw counts are recommended (i.e. not log-transformed or normalized, and not restricted to highly variable genes). When given an anndata.AnnData, x.X will be used.

  • cluster_var (Union[str, list, Series]) – The cluster assignment of each single cell. If x is an anndata.AnnData, cluster_var can be a str that specifies the name of the cluster variable in x.obs. A list or pandas.Series is accepted in all cases, and its length must equal x.shape[0].

  • vertices (Union[list, dict]) –

    The terminal specifications. The number of vertices (n) is chosen by the user according to the type of simplex to be visualized downstream, e.g. 3 elements for a 2-simplex (ternary simplex / triangle). Acceptable inputs include:

    • A list of n str that exist in the categories of cluster_var.

    • A dict of n keys. The keys serve as customizable vertex names. The value for each key can be either a str for a single cluster, or a list of str for a vertex that groups multiple clusters.

  • n_top (int) – The number of top features to select for each vertex.

  • processed (bool) – Whether the input matrix is already processed. If False, the input matrix will be log-transformed and row-normalized. If True, the input matrix will be used directly to calculate the rank-sum statistics, and logFC will be calculated assuming that the input matrix is log-transformed.

  • lfc_thresh (float) – The log fold change threshold to select up-regulated genes.

  • return_stats (bool) – Whether to return the full statistics for all clusters and all features instead of only the selected top features (the default).

  • feature_names (Optional[str]) – The names of the features in the matrix. If None, the feature names will be the index of the matrix.
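
The two accepted forms of vertices can be read as shorthand for a single mapping from vertex name to the cluster labels it covers. The sketch below illustrates that reading only; `normalize_vertices` is a hypothetical helper written for this page, not a CytoSimplex function, and the library's internal handling may differ.

```python
def normalize_vertices(vertices):
    """Hypothetical helper: map each vertex name to a list of cluster labels.

    - list of str: each cluster is its own vertex, named after the cluster.
    - dict: each key is a vertex name; a str value is a single cluster,
      a list value groups several clusters into one vertex.
    """
    if isinstance(vertices, dict):
        return {
            name: [value] if isinstance(value, str) else list(value)
            for name, value in vertices.items()
        }
    return {name: [name] for name in vertices}


# A plain list: three vertices, one cluster each.
as_list = normalize_vertices(["Osteoblast_1", "Reticular_1", "Chondrocyte_1"])

# A dict: custom vertex names, mixing a single cluster and grouped clusters.
as_dict = normalize_vertices({
    "OS": ["Osteoblast_1", "Osteoblast_2"],
    "RE": "Reticular_1",
    "CH": ["Chondrocyte_1", "Chondrocyte_2"],
})
print(as_list)
print(as_dict)
```

Either form yields n entries, so the number of vertices is simply the number of list elements or dict keys.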

Return type

Union[list, DataFrame]

Returns

  • selected (list, when return_stats=False) – The list of selected features. Its maximum length is n_top * len(vertices), reached when enough features pass the threshold.

  • stats (pandas.DataFrame, when return_stats=True) – The statistics of the Wilcoxon test, with n_groups * n_features rows. Columns are ‘group’, ‘avgExpr’, ‘logFC’, ‘ustat’, ‘auc’, ‘pval’, ‘padj’, ‘pct_in’, ‘pct_out’ and ‘feature’.
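
To make the relationship between the two return forms concrete, the following is a minimal sketch of how a feature list could be derived from the statistics table. The ranking rule (ascending padj), the skipping of NaN rows, and the helper name `pick_top` are assumptions made for illustration; the source only states that up-regulated genes are selected with a logFC threshold and that at most n_top features are kept per vertex. Rows are plain dicts here to keep the sketch dependency-free; with a real DataFrame, `stats.to_dict("records")` gives the same shape.

```python
import math


def pick_top(records, groups, n_top=30, lfc_thresh=0.1):
    """Hypothetical selection: per group, keep up to n_top up-regulated features."""
    selected = []
    for group in groups:
        candidates = [
            r for r in records
            if r["group"] == group
            and r["logFC"] > lfc_thresh       # up-regulated only
            and not math.isnan(r["padj"])      # untestable features are skipped
        ]
        candidates.sort(key=lambda r: r["padj"])  # assumed ranking: most significant first
        for r in candidates[:n_top]:
            if r["feature"] not in selected:   # avoid listing a feature twice
                selected.append(r["feature"])
    return selected


# A few rows with the same columns as the stats table (only the ones used here).
records = [
    {"group": "CH", "feature": "Rp1", "logFC": 0.0, "padj": float("nan")},
    {"group": "CH", "feature": "Mrpl15", "logFC": 5.771032, "padj": 9.017050e-07},
    {"group": "CH", "feature": "Lypla1", "logFC": 2.507528, "padj": 1.529972e-02},
    {"group": "RE", "feature": "PISD", "logFC": 0.260934, "padj": 1.0},
    {"group": "RE", "feature": "DHRSX", "logFC": -0.267436, "padj": 1.0},
]
print(pick_top(records, ["CH", "RE"], n_top=1))  # ['Mrpl15', 'PISD']
```

Because each group contributes at most n_top features, the result can never exceed n_top * len(vertices) entries, and it is shorter whenever too few features pass the threshold.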

Examples

>>> import CytoSimplex as csx
>>> import scanpy as sc
>>> adata = sc.read(
...     filename="test.h5ad",
...     backup_url="https://figshare.com/ndownloader/files/41034857"
... )
>>> vertices = {'OS': ["Osteoblast_1", "Osteoblast_2", "Osteoblast_3"],
...             'RE': ['Reticular_1', 'Reticular_2'],
...             'CH': ['Chondrocyte_1', 'Chondrocyte_2', 'Chondrocyte_3']}
>>> gene = csx.select_top_features(adata, "cluster", vertices)
>>> gene[:8]
['Nrk', 'Eps8l2', 'Mfi2', 'Fam101a', 'Scin', 'Sox5', 'Fbln7', 'Edil3']
>>> stats = csx.select_top_features(adata, "cluster", vertices, return_stats=True)
>>> stats
    group    avgExpr     logFC   ustat     auc          pval          padj  pct_in  pct_out           feature
0         CH   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0               Rp1
1         CH   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0             Sox17
2         CH   8.413918  5.771032  6948.0  0.6948  7.674938e-08  9.017050e-07    64.0     19.0            Mrpl15
3         CH   4.627888  2.507528  5894.0  0.5894  4.888534e-03  1.529972e-02    36.0     15.5            Lypla1
4         CH   0.256851  0.256851  5100.0  0.5100  4.999579e-02  1.110755e-01     2.0      0.0           Gm37988
...      ...        ...       ...     ...     ...           ...           ...     ...      ...               ...
141696    RE   2.269271  0.260934  5105.0  0.5105  7.154050e-01  1.000000e+00    16.0     14.5              PISD
141697    RE   2.593425 -0.267436  4968.0  0.4968  9.268655e-01  1.000000e+00    18.0     22.0             DHRSX
141698    RE   0.000000 -0.185628  4925.0  0.4925  3.973749e-01  9.619204e-01     0.0      1.5    CAAA01147332.1
141699    RE  16.347951  3.563943  7373.0  0.7373  2.048452e-07  1.117704e-05   100.0     80.0    tdT-WPRE_trans
141700    RE   0.000000  0.000000  5000.0  0.5000           NaN           NaN     0.0      0.0  CreER-WPRE_trans
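
In every row shown above, auc equals ustat / 10000, which matches the standard relation between the Mann-Whitney U statistic and the area under the ROC curve, AUC = U / (n_in * n_out). The sketch below computes both from scratch on toy values; it is a plain-Python illustration of the textbook definition, not the routine CytoSimplex uses, and the function names are made up for this example.

```python
def mann_whitney_u(in_group, out_group):
    """U statistic: number of (in, out) pairs where the in-group value is larger.

    Ties contribute 0.5 each, following the usual convention.
    """
    u = 0.0
    for a in in_group:
        for b in out_group:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def auc_from_u(u, n_in, n_out):
    """AUC is U divided by the total number of (in, out) pairs."""
    return u / (n_in * n_out)


# A gene that is zero everywhere: every pair is a tie, so AUC is exactly 0.5.
silent_in, silent_out = [0, 0], [0, 0, 0]
u_silent = mann_whitney_u(silent_in, silent_out)
print(u_silent, auc_from_u(u_silent, len(silent_in), len(silent_out)))  # 3.0 0.5

# A gene that is mostly higher in the group of interest.
u_up = mann_whitney_u([3, 5], [1, 4])
print(u_up, auc_from_u(u_up, 2, 2))  # 3.0 0.75
```

This is why unexpressed genes such as Rp1 and Sox17 appear with auc 0.5 and NaN p-values: with all values tied, the test carries no information about direction.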