8000 Initialise agglomerateByModule by RiboRings · Pull Request #744 · microbiome/mia · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Initialise agglomerateByModule #744

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: devel
Choose a base branch
from
Open

Initialise agglomerateByModule #744

wants to merge 11 commits into from

Conversation

RiboRings
Copy link
Member
@RiboRings RiboRings commented Jun 27, 2025

Basic agglomerateByModule function leveraging cross-product was initialised for SummarizedExperiment class.

Consider whether we should:

  • Add separate method for TreeSE
  • Add support for sample modules in colData
  • Integrate more with existing agglomerate and merge functions

@RiboRings RiboRings requested a review from TuomasBorman June 27, 2025 08:40
@RiboRings
Copy link
Member Author

Example usage:

devtools::load_all()

data("GlobalPatterns")
tse <- GlobalPatterns

# Add some random modules
N_module <- 3L
modules <- sample(c(TRUE, FALSE),
                  size = nrow(tse) * N_module,
                  prob = c(0.2, 0.8),
                  replace = TRUE)

modules <- modules |> matrix(nrow = nrow(tse))
colnames(modules) <- paste0("module_", seq_len(ncol(modules)))
rowData(tse) <- cbind(rowData(tse), modules)

# Extract module columns
module_columns <- grep("module_", colnames(rowData(tse)), value = TRUE)

# Agglomerate based on modules
tse_module <- agglomerateByModule(tse, by = 1, group = module_columns)

@antagomir
Copy link
Member

Seems good. Some notes:

**Add separate method for TreeSE: ** this is feasible for agglomerateByRanks where the grouping can follow phylogenetic tree. For modules the tree aggomeration might be ill-defined in many cases? Is it meaningful to do that for modules in the general case (or are there relevant special cases to support?). This can be added later if this requires more thinking.

**Add support for sample modules in colData: ** ideally this method could be symmetric and could be applied for rows and cols alike. We also have other such functions and that should be the default.

Integrate more with existing agglomerate and merge functions: Is there a fundamental difference compared to agglomerateByVariable? If not, we might like to combine these two. It might currently operate only with one column at the time but that is minor difference and just a special case of multi-column thing.

@RiboRings
Copy link
Member Author

Add separate method for TreeSE I agree with you, so it seems preferable to downgrade TreeSE to SE and remove trees when agglomerating by module.

Add support for sample modules in colData: Now it works for both features and samples.

Integrate more with existing agglomerate and merge functions: I came to notice a subtle difference between the two. While agglomerateByVariable takes factor variables as groups, agglomerateByModule requires one or more binary (0/1, F/T) variables. Therefore, the two would be analogous only if factor variables were first one-hot encoded and used to agglomerate, but this does not sound like a convenient solution for factors with many levels. So in general, I would keep them as they are (partially integrated) and recommend agglomerateByVariable for single factor groups and agglomerateByModule for one or multiple logical groups.

@RiboRings
Copy link
Member Author

This PR is ready for review

Copy link
Contributor
@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I guess this is more efficient than implementation in https://github.com/microbiome/mia/tree/agglomerateByModule (i.e., the branch can be removed)?

DESCRIPTION Outdated
@@ -71,6 +71,7 @@ Imports:
dplyr,
IRanges,
MASS,
Matrix,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this dependency? You could use DelayedArray and crossprod()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I looked a bit into this. Basically Matrix provides an optimised method for crossprod and sparse array for large adjacency matrices. This functionality is absent in DelayedArray, which focuses on delaying (instead of optimising) matrix operations.

In a nutshell, we have two options:

  1. downgrade to base matrix and crossprod if we prioritise reducing package dependencies
  2. use Matrix crossprod if we prioritise operations with optimised efficiency

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tough choice. Could depend on how much that performance gain would be in (2) in practice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a sparse Matrix of 1000 features by 1000 modules with 80% zeros, the time is about 0.1 s, or 1/10 compared to using base matrix and crossprod. The memory usage is 4 Mb instead of 12 Mb. Efficiency further improves for higher percentages of zeros, whereas it stays the same for base matrix.

Reference: https://alfurka.github.io/2022-08-12-faster-matrix-computation-with-r-using-sparsematrix/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the modules in a separate place (e.g., metadata) might allow us to get rid of Matrix dependency, since the user would be able to convert modules to sparse array by themselves, and the agglomerateByModule function could be adjusted to support both Matrix and base matrix without importing it. But I would need to test if this can actually work.

Comment on lines -153 to -163
.sum_counts_accross_features_na <- function(x, average, ids, ...){
# Which cell is not NA?
is_not_na <- !is.na(x)
type(is_not_na) <- "integer"
# Aggregate data to certain groups
x <- rowsum(x, ids, na.rm = TRUE)
# Calculate average if specified
if( average ){
x <- x/rowsum(is_not_na, ids)
#' @importFrom scuttle sumCountsAcrossFeatures
.agglomerate_assay <- function(
assay.type, assay, by, ids, na.rm, average, BPPARAM
){
# Check assay
.check_assay_for_merge(assay.type, assay)
# Transpose if we are merging columns
if( by == 2L ){
assay <- t(assay)
}
return(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the functionality from this function is moved? It is better to modify only necessary lines to improve clarity and maintainability while minimizing the risk of introducing unintended changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the extra change. I was trying to make .merge_rows_or_columns more efficient, and give it a similar structure of .agglomerate_module_assay (which in my opinion is also clearer).

The code for .merge_rows_or_columns was using lapply for each of the 4 implemented stages of assay-wise operation (checking, transposing, merging and transposing back). For a clearer and more efficient syntax, I reduced the 4 lapply calls to a single streamlined (m)apply call, which in addition also checks and removes NAs for each assay (not only the first one).

@antagomir
Copy link
Member

Still on this point:

Integrate more with existing agglomerate and merge functions:

  • agglomerateByVariable takes factor variables as groups
  • agglomerateByModule requires one or more binary (0/1, F/T) variables

Your argument was that factors with multiple levels should be one-hot-encoded to use the same.

But on the other hand, isn't it possible to use agglomerateByVariable for binary variables, and then it will essentially reduce to the same than agglomerateByModule? If so, agglomerateByModule would not have added value because same could be obtained with agglomerateByVariable applied on binary variables/factors/logicals/numerics.

But am I missing sometihing?

@RiboRings
Copy link
Member Author

Several binary modules can be reduced to a single factor column as long as each feature belongs to only one module. Otherwise, the module matrix is necessary, in which case using agglomerateByVariable for each module would return one agglomerated SE with two rows (module = TRUE and module = FALSE) for each and every module, whereas agglomerateByModule would return one agglomerated SE with all modules as rows.

@antagomir
Copy link
Member

Yes indeed, unless we consider the case where the variable in agglomerateByVariable contains NAs, which will not be agglomerated. Binary module encoded with NA + TRUE could result in a single row per module. This is more unclear way to implement something like that.

Another point is that agglomerateByVariable has been written to collapse rows (or cols) based on a single variable (e.g. factor with levels). It could be extended to cover case with multiple variables but this would more intuitively be a set of collapsed matrices with varying dimensionality (which would be determined based on the number of factor levels per each variable). Matrices with varying dimensionality could go into altExps. This could make sense in principle but it would indeed be a different method than agglomerateByModule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0