Initialise agglomerateByModule #744
base: devel
Conversation
Example usage:
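The original usage example did not survive extraction. As a hedged sketch only, calling the new function might look roughly like the following, based on the interface discussed below (binary module columns stored in rowData); the dataset, column names, and the grouping argument name are illustrative assumptions, not the actual signature:

```r
library(mia)
data("GlobalPatterns", package = "mia")
tse <- GlobalPatterns

# Hypothetical binary module annotations in rowData (names are made up)
set.seed(1)
rowData(tse)$module1 <- rownames(tse) %in% sample(rownames(tse), 100)
rowData(tse)$module2 <- rownames(tse) %in% sample(rownames(tse), 200)

# Agglomerate features into modules; the argument name is a guess
se <- agglomerateByModule(tse, f = c("module1", "module2"))
```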
Seems good. Some notes:

**Add separate method for TreeSE:** this is feasible for agglomerateByRanks, where the grouping can follow the phylogenetic tree. For modules, tree agglomeration might be ill-defined in many cases. Is it meaningful to do that for modules in the general case (or are there relevant special cases to support)? This can be added later if it requires more thinking.

**Add support for sample modules in colData:** ideally this method would be symmetric and applicable to rows and columns alike. We also have other such functions, and that should be the default.

**Integrate more with existing agglomerate and merge functions:** is there a fundamental difference compared to agglomerateByVariable? If not, we might like to combine the two. It might currently operate on only one column at a time, but that is a minor difference and just a special case of the multi-column functionality.
**Add separate method for TreeSE:** I agree with you, so it seems preferable to downgrade TreeSE to SE and remove the trees when agglomerating by module.

**Add support for sample modules in colData:** it now works for both features and samples.

**Integrate more with existing agglomerate and merge functions:** I came to notice a subtle difference between the two. While agglomerateByVariable takes factor variables as groups, agglomerateByModule requires one or more binary (0/1, FALSE/TRUE) variables. The two would therefore be analogous only if factor variables were first one-hot encoded and then used to agglomerate, but this does not sound like a convenient solution for factors with many levels. So in general I would keep them as they are (partially integrated) and recommend agglomerateByVariable for single factor groups and agglomerateByModule for one or multiple logical groups.
This PR is ready for review
Looks good!
I guess this is more efficient than implementation in https://github.com/microbiome/mia/tree/agglomerateByModule (i.e., the branch can be removed)?
DESCRIPTION (Outdated)

```diff
@@ -71,6 +71,7 @@ Imports:
     dplyr,
     IRanges,
     MASS,
+    Matrix,
```
Can we avoid this dependency? You could use DelayedArray and crossprod().
So I looked a bit into this. Basically, Matrix provides an optimised method for crossprod and sparse arrays for large adjacency matrices. This functionality is absent in DelayedArray, which focuses on delaying (rather than optimising) matrix operations.
In a nutshell, we have two options:
- downgrade to base matrix and crossprod if we prioritise reducing package dependencies
- use Matrix crossprod if we prioritise operations with optimised efficiency
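The two options reduce to essentially the same one-liner; a minimal sketch, assuming `mod` is a binary features-by-modules matrix and `assay` a features-by-samples matrix (both names illustrative):

```r
## Option 1: base matrix, no extra dependency
## crossprod(x, y) computes t(x) %*% y, so the result is modules x samples
agg_base <- crossprod(as.matrix(mod), as.matrix(assay))

## Option 2: sparse Matrix with its optimised crossprod method
library(Matrix)
mod_sp <- Matrix(mod, sparse = TRUE)
agg_sparse <- crossprod(mod_sp, assay)  # returns a sparse-aware result
```

The choice only affects which crossprod method dispatches; the numerical result is the same.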
It's a tough choice. It could depend on how large the performance gain from option (2) would be in practice.
For a sparse Matrix of 1000 features by 1000 modules with 80% zeros, the computation takes about 0.1 s, roughly 1/10 of the time needed with a base matrix and crossprod. Memory usage is 4 MB instead of 12 MB. Efficiency improves further at higher percentages of zeros, whereas it stays the same for a base matrix.
Reference: https://alfurka.github.io/2022-08-12-faster-matrix-computation-with-r-using-sparsematrix/
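A comparison along these lines can be sketched as follows (a rough benchmark, not the exact setup above; timings and sizes will vary by machine and sparsity):

```r
library(Matrix)
library(microbenchmark)

set.seed(1)
n <- 1000
## Binary module matrix with ~80% zeros, and a dense assay matrix
dense  <- matrix(rbinom(n * n, 1, 0.2), n, n)
sparse <- Matrix(dense, sparse = TRUE)
assay  <- matrix(rnorm(n * n), n, n)

## Compare base crossprod against the sparse Matrix method
microbenchmark(
    base   = crossprod(dense, assay),
    Matrix = crossprod(sparse, assay),
    times  = 10
)

## Memory footprint of the two module-matrix representations
object.size(dense)
object.size(sparse)
```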
Storing the modules in a separate place (e.g. metadata) might allow us to drop the Matrix dependency, since users could convert modules to a sparse array themselves, and agglomerateByModule could be adjusted to support both Matrix and base matrix without importing the former. But I would need to test whether this actually works.
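One possible shape of that adjustment, as an untested sketch (the helper name is made up, and it assumes that a user storing modules as a sparse array has the Matrix package loaded, so that its crossprod method dispatches):

```r
## Hypothetical internal helper: handles both base and Matrix inputs
## without mia importing Matrix itself
.module_crossprod <- function(mod, assay){
    if( methods::is(mod, "Matrix") ){
        # Matrix is on the user's search path; its optimised method dispatches
        crossprod(mod, assay)
    } else {
        # Fall back to base crossprod on ordinary matrices
        crossprod(as.matrix(mod), as.matrix(assay))
    }
}
```

Whether dispatch works cleanly without importing Matrix would need the test mentioned above.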
```r
.sum_counts_accross_features_na <- function(x, average, ids, ...){
    # Which cell is not NA?
    is_not_na <- !is.na(x)
    type(is_not_na) <- "integer"
    # Aggregate data to certain groups
    x <- rowsum(x, ids, na.rm = TRUE)
    # Calculate average if specified
    if( average ){
        x <- x/rowsum(is_not_na, ids)
    }
    return(x)
}

#' @importFrom scuttle sumCountsAcrossFeatures
.agglomerate_assay <- function(
        assay.type, assay, by, ids, na.rm, average, BPPARAM
){
    # Check assay
    .check_assay_for_merge(assay.type, assay)
    # Transpose if we are merging columns
    if( by == 2L ){
        assay <- t(assay)
    }
```
Is there a reason why the functionality from this function was moved? It is better to modify only the necessary lines, to improve clarity and maintainability while minimising the risk of introducing unintended changes.
Sorry for the extra change. I was trying to make .merge_rows_or_columns more efficient and give it a structure similar to .agglomerate_module_assay (which in my opinion is also clearer). The code for .merge_rows_or_columns used a separate lapply call for each of the four stages of the assay-wise operation (checking, transposing, merging, and transposing back). For clearer and more efficient syntax, I reduced the four lapply calls to a single streamlined mapply call, which in addition checks and removes NAs for every assay (not only the first one).
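The consolidation described follows a generic pattern; a sketch with illustrative names (these are not the actual mia internals):

```r
## Before: one lapply pass per stage, touching every assay four times
assays <- lapply(assays, .check)
assays <- lapply(assays, t)
assays <- lapply(assays, .merge, ids = ids)
assays <- lapply(assays, t)

## After: a single mapply over assays and their names,
## performing all stages per assay in one pass
assays <- mapply(function(a, nm){
    .check(nm, a)                  # validate this assay
    if( by == 2L ) a <- t(a)       # transpose when merging columns
    a <- .merge(a, ids = ids, na.rm = na.rm)  # NA handling for every assay
    if( by == 2L ) a <- t(a)       # transpose back
    a
}, assays, names(assays), SIMPLIFY = FALSE)
```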
Still on this point (Integrate more with existing agglomerate and merge functions): your argument was that factors with multiple levels would have to be one-hot encoded to use the same machinery. But on the other hand, isn't it possible to use agglomerateByVariable on binary variables, in which case it essentially reduces to the same as agglomerateByModule? If so, agglomerateByModule would add no value, because the same result could be obtained with agglomerateByVariable applied to binary variables/factors/logicals/numerics. Or am I missing something?
Several binary modules can be reduced to a single factor column as long as each feature belongs to only one module. Otherwise the module matrix is necessary, in which case running agglomerateByVariable for each module would return, for every single module, an agglomerated SE with two rows (module = TRUE and module = FALSE), whereas agglomerateByModule returns one agglomerated SE with all modules as rows.
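The overlapping-module case is easy to see with a toy example: a feature belonging to two modules cannot be encoded as a single factor level, but the cross-product handles it directly (names are illustrative):

```r
## Binary module matrix: feature f2 belongs to both modules
mod <- cbind(moduleA = c(1, 1, 0, 1),
             moduleB = c(0, 1, 1, 0))
assay <- matrix(1:8, nrow = 4,
                dimnames = list(paste0("f", 1:4), c("s1", "s2")))

## t(mod) %*% assay: one row per module, features summed per module
crossprod(mod, assay)
#         s1 s2
# moduleA  7 19
# moduleB  5 13
```

Feature f2 contributes to both module rows, which a single factor column cannot express.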
Yes indeed, unless we consider the case where the variable in agglomerateByVariable contains NAs, which are not agglomerated. A binary module encoded as NA + TRUE could then result in a single row per module, but this is a more obscure way to implement something like that. Another point is that agglomerateByVariable was written to collapse rows (or columns) based on a single variable (e.g. a factor with levels). It could be extended to cover the case with multiple variables, but that would more intuitively yield a set of collapsed matrices with varying dimensionality (determined by the number of factor levels per variable). Matrices with varying dimensionality could go into altExps. This could make sense in principle, but it would indeed be a different method from agglomerateByModule.
A basic agglomerateByModule function leveraging the cross-product was initialised for the SummarizedExperiment class.
Consider whether we should: