| Title: | Fast, Robust Clustering Algorithms for Gene Enrichment Data |
|---|---|
| Description: | Fast 'C++' agglomerative hierarchical clustering algorithm packaged into easily callable R functions, designed to help cluster biological terms based on the similarity of the genes involved in their activation. |
| Authors: | Junguk Hur [aut, cre] (ORCID: <https://orcid.org/0000-0002-0736-2149>), Sarah Hong [aut], Jane Kim [aut] |
| Maintainer: | Junguk Hur <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.2 |
| Built: | 2026-05-18 07:09:03 UTC |
| Source: | https://github.com/hurlab/richcluster |
This function performs clustering on enrichment results by integrating gene similarity scores and various clustering strategies.
cluster(enrichment_results, df_names = NULL, min_terms = 5, min_value = 0.1, distance_metric = "kappa", distance_cutoff = 0.5, linkage_method = "average", linkage_cutoff = 0.5)
enrichment_results |
A list of dataframes, each containing enrichment results. Each dataframe should include at least the columns 'Term', 'GeneID', and 'Padj'. |
df_names |
Optional, a character vector of names for the enrichment result dataframes. Must match the length of 'enrichment_results'. Default is 'NULL'. |
min_terms |
Minimum number of terms each final cluster must include |
min_value |
Minimum 'Pvalue' a term must have in order to be counted in final clustering |
distance_metric |
A string specifying the distance metric to use (e.g., "kappa"). |
distance_cutoff |
A numeric value for the distance cutoff (0 < cutoff <= 1). |
linkage_method |
A string specifying the linkage method to use (e.g., "average"). Supported options are "single", "complete", "average", and "ward". |
linkage_cutoff |
A numeric value between 0 and 1 for the membership cutoff. |
A named list containing:
- 'distance_matrix': The distance matrix used in clustering.
- 'clusters': The final clusters.
- 'df_list': The original list of enrichment result dataframes.
- 'merged_df': The merged dataframe containing combined results.
- 'cluster_options': A list of clustering parameters used in the analysis.
- 'df_names' (optional): The names of the input dataframes if provided.
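To make the default "kappa" metric more concrete, the following is a minimal base R sketch of Cohen's kappa computed between the gene lists of two terms. The helper name `kappa_score`, the comma-separated gene strings, and the choice of background gene set are assumptions made for illustration; the package's 'C++' implementation may differ in detail.

```r
# Hypothetical helper (not part of the package): Cohen's kappa agreement
# between two gene sets, measured over an assumed shared background.
kappa_score <- function(genes_a, genes_b, background) {
  in_a <- background %in% genes_a
  in_b <- background %in% genes_b
  n <- length(background)
  # Observed agreement: genes present in both sets or absent from both
  observed <- sum(in_a == in_b) / n
  # Agreement expected by chance from the marginal membership rates
  expected <- (sum(in_a) * sum(in_b) + sum(!in_a) * sum(!in_b)) / n^2
  if (expected == 1) return(0)
  (observed - expected) / (1 - expected)
}

# Toy gene lists, split from comma-separated strings (assumed format)
term_a <- strsplit("TP53,BRCA1,ATM,CHEK2", ",", fixed = TRUE)[[1]]
term_b <- strsplit("TP53,BRCA1,ATM,MDM2", ",", fixed = TRUE)[[1]]

# Assumed background: the genes of both terms plus 15 unrelated genes
background <- union(c(term_a, term_b), paste0("GENE", 1:15))

k <- kappa_score(term_a, term_b, background)
k
```

Terms that share most of their genes score close to 1, while terms whose overlap is no better than chance score close to 0.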
Generates a horizontal bar plot showing average enrichment significance for each cluster, across one or more enrichment datasets.
cluster_bar(cluster_result, clusters = NULL, value_type = "Padj", title = NULL)
cluster_result |
A result list returned by |
clusters |
Optional numeric vector of cluster IDs to include. Defaults to all clusters. |
value_type |
The column name to use for enrichment significance ("Padj" or "Pvalue"). |
title |
Optional plot title. If NULL, a default will be generated. |
A plotly object representing the bar plot.
    # Load example data
    cluster_result <- readRDS(system.file("extdata", "cluster_result.rds", package = "richCluster"))
    cbar <- cluster_bar(cluster_result)
    cbar
This function generates a correlation heatmap for a specific cluster based on the provided distance matrix.
cluster_correlation_hmap(final_clusters, distance_matrix, cluster_number, merged_df)
final_clusters |
A dataframe containing the final cluster data. |
distance_matrix |
A matrix representing the distances between terms. |
cluster_number |
An integer specifying the cluster number to visualize. |
merged_df |
A dataframe with all terms used to map term indices to names. |
An interactive heatmaply heatmap.
Creates a dot plot summarizing cluster-level enrichment across datasets. Each point represents a cluster, with its size proportional to the number of terms and its x-position reflecting average significance (e.g., Padj or Pvalue).
cluster_dot(cluster_result, clusters = NULL, value_type = "Padj", title = NULL)
cluster_result |
A result list returned from |
clusters |
Optional numeric vector of cluster IDs to include. Defaults to all clusters. |
value_type |
The name of the value column to visualize (e.g., "Padj" or "Pvalue"). |
title |
Optional title for the plot. If NULL, a default title is generated. |
A plotly object representing the dot plot.
    # Load example data
    cluster_result <- readRDS(system.file("extdata", "cluster_result.rds", package = "richCluster"))
    cdot <- cluster_dot(cluster_result)
    cdot
Generates an interactive heatmap from the given clustering results, visualizing -log10(Padj) values for each cluster. The function aggregates values per cluster and assigns representative terms as row names.
cluster_hmap(cluster_result, clusters = NULL, value_type = "Padj", aggr_type = mean)
cluster_result |
A list containing a data frame ('cluster_df') with clustering results. The data frame must contain at least the columns 'Cluster', 'Term', and 'value_type_*' values. |
clusters |
Optional. A numeric or character vector specifying the clusters to include. If NULL (default), all clusters are included. |
value_type |
A character string specifying the column name prefix for values to display in hmap cells. Defaults to '"Padj"'. |
aggr_type |
A function used to aggregate values across clusters (e.g., 'mean' or 'median'). Defaults to 'mean'. |
The function processes the given cluster data frame ('cluster_df'), aggregating the 'value_type_*' values per cluster using the specified 'aggr_type' function. The -log10 transformation is applied, and infinite values are replaced with 0.
Representative terms are selected by choosing the term with the lowest 'value_type' in each cluster.
The final heatmap is generated using 'heatmaply::heatmaply()', with an interactive 'plotly' visualization.
An interactive heatmap object ('plotly'), displaying the -log10(Padj) values across clusters, with representative terms as row labels.
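The aggregation described above can be sketched in base R on a toy data frame. The data, the column names `Padj_1` and `Padj_2`, and the choice of the first value column for picking representative terms are assumptions for illustration only; this is not the package's internal code.

```r
# Toy stand-in for 'cluster_df' with two hypothetical 'Padj_*' columns
cluster_df <- data.frame(
  Cluster = c(1, 1, 2, 2, 3),
  Term    = c("term A", "term B", "term C", "term D", "term E"),
  Padj_1  = c(0.01, 0.0001, 0.05, 0.2, 0),
  Padj_2  = c(0.1, 0.001, 1, 0.5, 0.02),
  stringsAsFactors = FALSE
)
value_cols <- grep("^Padj_", names(cluster_df), value = TRUE)

# Aggregate each value column per cluster (mean plays the role of 'aggr_type')
agg <- aggregate(cluster_df[value_cols],
                 by = list(Cluster = cluster_df$Cluster),
                 FUN = mean)

# Apply the -log10 transformation and replace infinite values with 0
heat <- -log10(as.matrix(agg[value_cols]))
heat[is.infinite(heat)] <- 0

# Representative term per cluster: the term with the lowest value
# (assumption: the first value column is used for the comparison)
rep_terms <- sapply(split(cluster_df, cluster_df$Cluster),
                    function(d) d$Term[which.min(d[[value_cols[1]]])])
rownames(heat) <- unname(rep_terms)
heat
```

A matrix of this shape, with one row per cluster and one column per dataset, is the kind of input that 'heatmaply::heatmaply()' accepts.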
This function generates a network graph for a specific cluster based on the provided distance matrix. The opacity and length of each edge reflect the similarity score between the two terms it connects under the chosen distance_metric (e.g., kappa), which is based on shared gene content.
cluster_network(final_clusters, distance_matrix, cluster_number, merged_df)
final_clusters |
A dataframe containing the final cluster data. |
distance_matrix |
A matrix representing the distances between terms. |
cluster_number |
An integer specifying the cluster number to visualize. |
merged_df |
A dataframe with all terms used to map term indices to names. |
An interactive networkD3 network graph.
This function creates a side-by-side comparison of network graphs for a single cluster using different p-value types.
compare_network_graphs_plotly(cluster_result, cluster_num, pval_names)
cluster_result |
The result from the clustering function. |
cluster_num |
The cluster number to plot. |
pval_names |
A list of p-value names to compare. |
A plotly object.
This function performs clustering on enrichment results using an algorithm inspired by DAVID's functional clustering method.
david_cluster(enrichment_results, df_names = NULL, similarity_threshold = 0.5, initial_group_membership = 3, final_group_membership = 3, multiple_linkage_threshold = 0.5)
enrichment_results |
A list of dataframes, each containing enrichment results. Each dataframe should include at least the columns 'Term', 'GeneID', and 'Padj'. |
df_names |
Optional, a character vector of names for the enrichment result dataframes. Must match the length of 'enrichment_results'. Default is 'NULL'. |
similarity_threshold |
A numeric value for the kappa score cutoff (0 < cutoff <= 1). |
initial_group_membership |
Minimum number of terms to form an initial seed group. |
final_group_membership |
Minimum number of terms for a final cluster. |
multiple_linkage_threshold |
A numeric value for the merging threshold. |
A named list containing the clustering results.
Returns a comprehensive dataframe containing all the different terms in all clusters.
export_df(cluster_result)
cluster_result |
The cluster_result object from cluster() |
A data.frame view of the clustering
Filters the full list of clusters, keeping only those that contain at least 'min_terms' terms.
filter_clusters(all_clusters, min_terms)
all_clusters |
A dataframe containing the merged seeds with column named 'ClusterIndices'. |
min_terms |
An integer specifying the minimum number of terms required in a cluster. |
A data frame containing only the clusters with at least 'min_terms' terms.
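The filtering rule can be illustrated with a short base R sketch. It assumes that 'ClusterIndices' stores comma-separated term indices, which the description above does not state explicitly, and it uses toy data rather than real package output.

```r
# Toy 'all_clusters' data frame; the comma-separated format of
# 'ClusterIndices' is an assumption made for this sketch
all_clusters <- data.frame(
  Cluster = 1:3,
  ClusterIndices = c("1,2,3,4,5", "6,7", "8,9,10,11,12,13"),
  stringsAsFactors = FALSE
)
min_terms <- 5

# Count the terms in each cluster, then keep clusters that meet the minimum
n_terms <- lengths(strsplit(all_clusters$ClusterIndices, ",", fixed = TRUE))
filtered <- all_clusters[n_terms >= min_terms, , drop = FALSE]
filtered
```

Here the second cluster holds only two terms, so it is dropped when `min_terms` is 5.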
This function maps a vector of column names to standardized names for "GeneID", "Pvalue", and "Padj" based on known variations.
format_colnames(colnames)
colnames |
A character vector of column names to be standardized. |
A character vector of standardized column names.
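As an illustration of this kind of mapping, the sketch below standardizes column names with a case-insensitive lookup. The function name `standardize_colnames` and the variant spellings in `known_variants` are hypothetical; the variations actually recognized by format_colnames() are not listed here and may differ.

```r
# Hypothetical lookup of accepted spellings (lower-case) per standard name;
# these variants are illustrative, not the package's actual list
known_variants <- list(
  GeneID = c("geneid", "gene_id", "genes"),
  Pvalue = c("pvalue", "p.value", "pval"),
  Padj   = c("padj", "p.adjust", "fdr")
)

standardize_colnames <- function(colnames) {
  out <- colnames
  lowered <- tolower(colnames)
  for (std in names(known_variants)) {
    # Replace any recognized variant with its standard name
    out[lowered %in% known_variants[[std]]] <- std
  }
  out
}

standardized <- standardize_colnames(c("Term", "gene_id", "p.value", "FDR"))
standardized
```

Names that match no known variant, such as "Term", pass through unchanged.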
This function generates a network graph for the entire distance matrix.
full_network(cluster_result)
cluster_result |
Cluster result named list from richCluster::cluster() |
An interactive networkD3 network graph.
This function merges multiple enrichment results ('enrichment_results') into a single dataframe by combining unique GeneID elements across each geneset, and averaging Pvalue / Padj values for each term across all enrichment_results.
merge_enrichment_results(enrichment_results)
enrichment_results |
A list of geneset dataframes containing columns c('Term', 'GeneID', 'Pvalue', 'Padj') |
A single merged geneset dataframe with all original columns suffixed with the index of the geneset, with new columns 'GeneID', 'Pvalue', 'Padj' containing the merged values.
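The merge described above can be approximated in base R as follows. This is a sketch under stated assumptions: the toy data, the comma-separated 'GeneID' strings, and the `_1`, `_2` suffix style are illustrative, and the package's own implementation may handle missing terms or separators differently.

```r
# Two toy enrichment results; 'GeneID' is assumed to be comma-separated
df1 <- data.frame(Term = c("apoptosis", "cell cycle"),
                  GeneID = c("TP53,BAX", "CDK1,CCNB1"),
                  Pvalue = c(0.001, 0.01), Padj = c(0.01, 0.04),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Term = "apoptosis", GeneID = "TP53,CASP3",
                  Pvalue = 0.003, Padj = 0.03,
                  stringsAsFactors = FALSE)
enrichment_results <- list(df1, df2)

# Suffix every non-key column with the dataset index, then outer-join on Term
suffixed <- lapply(seq_along(enrichment_results), function(i) {
  d <- enrichment_results[[i]]
  keep <- names(d) != "Term"
  names(d)[keep] <- paste0(names(d)[keep], "_", i)
  d
})
merged <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), suffixed)

# Union of the unique genes across datasets for each term
gene_cols <- grep("^GeneID_", names(merged), value = TRUE)
merged$GeneID <- unname(apply(merged[gene_cols], 1, function(g) {
  genes <- unlist(strsplit(g[!is.na(g)], ",", fixed = TRUE))
  paste(unique(genes), collapse = ",")
}))

# Average the p-values across datasets, ignoring datasets that lack the term
merged$Pvalue <- rowMeans(merged[grep("^Pvalue_", names(merged))], na.rm = TRUE)
merged$Padj   <- rowMeans(merged[grep("^Padj_", names(merged))], na.rm = TRUE)
merged
```

In this toy case "apoptosis" appears in both datasets, so its genes are pooled and its p-values averaged, while "cell cycle" keeps the values from the single dataset that reports it.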
This function visualizes a single cluster as a network graph.
plot_network_graph(cluster_result, cluster_num, distance_matrix, valuetype_list)
cluster_result |
The result from the clustering function. |
cluster_num |
The cluster number to plot. |
distance_matrix |
The distance matrix used for clustering. |
valuetype_list |
A list of value types (e.g., "Pvalue_1", "Padj_1") to use for node coloring. |
A plot object.
Runs the clustering in the C++ backend.
runRichCluster(terms, geneIDs, distanceMetric, distanceCutoff, linkageMethod, linkageCutoff)
terms |
Character vector of term names |
geneIDs |
Character vector of geneIDs |
distanceMetric |
e.g. "kappa" |
distanceCutoff |
numeric between 0 and 1 |
linkageMethod |
e.g. "average" |
linkageCutoff |
numeric between 0 and 1 |
A list containing the clustering results with the following components:
- A numeric matrix containing pairwise distances between terms based on gene similarity.
- A data frame with columns 'Cluster' (cluster ID) and 'TermIndices' (comma-separated indices of terms in each cluster).
- The hierarchical clustering dendrogram structure from the agglomerative clustering process.
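Because 'TermIndices' stores indices rather than names, a caller typically needs to map them back to the `terms` vector. The sketch below shows one way to do that on toy data. Whether the backend emits 0-based or 1-based indices is not stated here, so the `index_base` variable is an assumption that should be checked against real output.

```r
# Toy inputs mimicking the documented 'Cluster' / 'TermIndices' columns
terms <- c("apoptosis", "cell cycle", "DNA repair", "p53 signaling")
clusters <- data.frame(Cluster = c(1, 2),
                       TermIndices = c("1,4", "2,3"),
                       stringsAsFactors = FALSE)

# Assumption: indices are 1-based; set to 0 if the backend is zero-based
index_base <- 1

# Split each comma-separated string and look up the matching term names
term_names <- lapply(strsplit(clusters$TermIndices, ",", fixed = TRUE),
                     function(idx) terms[as.integer(idx) - index_base + 1])
names(term_names) <- clusters$Cluster
term_names
```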
Creates a horizontal bar plot showing enrichment values for individual terms in a selected cluster.
term_bar(cluster_result, cluster = 1, value_type = "Padj", title = NULL)
cluster_result |
A result list returned by |
cluster |
Cluster ID (numeric) or term name (character) to visualize. |
value_type |
The column name to use for enrichment significance ("Padj" or "Pvalue"). |
title |
Optional plot title. If NULL, a default will be generated. |
A plotly object representing the bar plot.
    # Load example data
    cluster_result <- readRDS(system.file("extdata", "cluster_result.rds", package = "richCluster"))
    tbar <- term_bar(cluster_result, cluster = 1)
    tbar
Creates a dot plot of individual terms within a specified cluster, showing their significance and number of genes.
term_dot(cluster_result, cluster = 1, value_type = "Padj", title = NULL)
cluster_result |
A result list returned from |
cluster |
Cluster ID (numeric) or term name (character) to plot. |
value_type |
The name of the value column to visualize (e.g., "Padj" or "Pvalue"). |
title |
Optional title for the plot. If NULL, a default title is generated using the representative term. |
A plotly object representing the dot plot of terms.
    # Load example data
    cluster_result <- readRDS(system.file("extdata", "cluster_result.rds", package = "richCluster"))
    tdot <- term_dot(cluster_result, cluster = 1)
    tdot
Creates an interactive heatmap displaying -log10(Padj) values for selected clusters and terms. Users can specify clusters numerically or select them by providing term names. The function ensures that the final heatmap includes all terms from the selected clusters as well as any explicitly provided terms.
term_hmap(cluster_result, clusters, terms, value_type, aggr_type, title = NULL)
cluster_result |
A list containing a data frame ('cluster_df') with clustering results. The data frame must include at least the columns 'Cluster', 'Term', and 'Padj_*' values. |
clusters |
Optional. A numeric vector specifying the cluster numbers to display, or a character vector specifying terms whose clusters should be included. Defaults to 'NULL', which includes all clusters. |
terms |
Optional. A character vector specifying additional terms to include in the heatmap. Defaults to 'NULL'. |
value_type |
A character string specifying the column name prefix for adjusted p-values. Defaults to '"Padj"'. |
aggr_type |
A function used to aggregate values across clusters (e.g., 'mean' or 'median'). Defaults to 'mean'. |
title |
Optional custom title for the plot. |
The function processes the given 'cluster_df', identifying the clusters and terms to be visualized. If 'clusters' is specified as a numeric vector, the function directly filters based on cluster numbers. If 'clusters' is given as a character vector, it identifies the clusters associated with those terms and retrieves all terms from the selected clusters.
The 'Padj_*' values are transformed using '-log10()', and infinite values are replaced with '0'. The resulting heatmap is generated using 'heatmaply::heatmaply()' with fixed row ordering (no hierarchical clustering).
An interactive heatmap object ('plotly'), displaying the -log10(Padj) values across clusters, with representative terms as row labels and color-coded cluster annotations.
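The selection logic described for 'clusters' and 'terms' can be sketched in base R. The helper name `resolve_terms` and the toy 'cluster_df' are hypothetical, and the sketch covers only the choice of rows, not the heatmap itself.

```r
# Toy stand-in for 'cluster_df'
cluster_df <- data.frame(
  Cluster = c(1, 1, 2, 3),
  Term    = c("term A", "term B", "term C", "term D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: work out which terms end up in the heatmap
resolve_terms <- function(cluster_df, clusters = NULL, terms = NULL) {
  if (is.null(clusters)) {
    ids <- unique(cluster_df$Cluster)            # default: all clusters
  } else if (is.character(clusters)) {
    # Term names given: find the clusters those terms belong to
    ids <- unique(cluster_df$Cluster[cluster_df$Term %in% clusters])
  } else {
    ids <- clusters                              # cluster numbers given
  }
  selected <- cluster_df$Term[cluster_df$Cluster %in% ids]
  # Add any explicitly requested terms that exist in the data
  union(selected, intersect(terms, cluster_df$Term))
}

by_number <- resolve_terms(cluster_df, clusters = 2)
by_term   <- resolve_terms(cluster_df, clusters = "term A")
with_extra <- resolve_terms(cluster_df, clusters = 3, terms = "term A")
```

Selecting by the name "term A" pulls in every term of its cluster, while the `terms` argument adds individual terms without bringing along the rest of their clusters.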