PRELOADKRAKENUNIQ: Fix preloading mechanism to prevent (partial) double loading of DB by muniheart · Pull Request #8397 · nf-core/modules · GitHub

PRELOADKRAKENUNIQ: Fix preloading mechanism to prevent (partial) double loading of DB #8397


Merged · 13 commits into nf-core:master · May 15, 2025

Conversation

@muniheart (Contributor) commented May 6, 2025

The default value for --preload-size is params.krakenuniq_ram_chunk_size. Allow the default value to be overridden by placing the options passed in task.ext.args on the command line after the default option.

Option --preload-size must be passed to the database preload call and to the classify call for each input dataset.
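In sketch form (hypothetical snippet, not the module's exact code; ram_chunk_size is the module input that defaults to params.krakenuniq_ram_chunk_size), the ordering looks like this:

    def args = task.ext.args ?: ''
    // The module's default --preload-size comes first and $args comes last,
    // so a --preload-size supplied via task.ext.args takes precedence.
    """
    krakenuniq \\
        --db $db \\
        --preload \\
        --preload-size ${ram_chunk_size} \\
        --threads $task.cpus \\
        $args
    """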

PR checklist

Closes #XXX

  • [x] This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

Pass task.ext.args last to give them precedence over params.krakenuniq_ram_chunk_size.

Pass ram_chunk_size to database preload call and each classify call.
@muniheart changed the title from "Pass task.ext.args last to give them precedence." to "Pass task.ext.args last to give it precedence." on May 6, 2025
@jfy133 changed the title from "Pass task.ext.args last to give it precedence." to "PRELOADKRAKENUNIQ: Pass task.ext.args last to give it precedence." on May 6, 2025
@jfy133 (Member) left a comment

Actually I've realised `--preload-size` and `ram_chunk_size` should be removed entirely, as it's an optional thing (`--preload` makes sense in the context of the module, but `--preload-size` is an optional override).

See: nf-core/taxprofiler#603 (comment)

Krakenuniq command-line option `--preload-size` is non-mandatory.  When it is
required, it should be passed in `ext.args`.  Also, remove the associated
process input, `ram_chunk_size`.
@muniheart (Contributor, Author) commented

Based on test runs, the initial krakenuniq call to load the full database into memory does not seem to be required when option --preload-size is used. This makes sense intuitively: the initial krakenuniq call loads the full database, while subsequent calls load the database in chunks, ignoring the memory-mapped copy. Preloading in this situation could cause problems: an OOM kill when memory is insufficient to hold the full database, or competition for memory between the preloaded database and a chunk of it. A possible solution would be to search ext.args for --preload-size and skip the preload when it is present. Thoughts, @jfy133?
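In sketch form, the proposal would be something like this (hypothetical, not the final module code):

    // Skip the separate full-database preload whenever chunked loading
    // has already been requested via --preload-size in ext.args.
    def args        = task.ext.args ?: ''
    def chunked     = args.contains('--preload-size')
    def preload_cmd = chunked ? '' : "krakenuniq ${args} --db ${db} --preload --threads ${task.cpus}"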

@jfy133 (Member) commented May 8, 2025

@muniheart thanks for the testing! Yes, I agree your proposal makes sense - something like `def preload_mode = ext.args.contains('--preload-size') ? '' : '--preload'`?

@jfy133 (Member) commented May 8, 2025

Btw @muniheart I strongly suggest you join the nf-core github organisation (https://nf-co.re/join#github) so you don't have to wait for approval for tests to run.

If you are unable to join the slack, just let me know here and I can add you.

When `--preload-size` option is passed in `ext.args`, preloading the database
is not necessary and may unnecessarily consume memory.
@muniheart force-pushed the update-my-module branch from 0d0037b to b94764c on May 8, 2025 19:22
@jfy133 (Member) commented May 9, 2025

The error in the tests is: `Process KRAKENUNIQ_PRELOADEDKRAKENUNIQ declares 6 input channels but 7 were specified` @muniheart, but otherwise I think this is OK now!

@jfy133 changed the title from "PRELOADKRAKENUNIQ: Pass task.ext.args last to give it precedence." to "PRELOADKRAKENUNIQ: Fix preloading mechanism to prevent (partial) double loading of DB" on May 9, 2025
Remove input `ram_chunk_size`.  Pass `--preload-size $ram_chunk_size` in
`ext.args` to load database in chunks.  Database will be loaded into memory
prior to classification if and only if option `--preload-size` is not
present in `ext.args`.
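On the pipeline side this means chunked loading is now requested through the process configuration rather than a module input, for example (hypothetical chunk size value):

    // e.g. in the calling pipeline's modules.config
    process {
        withName: 'KRAKENUNIQ_PRELOADEDKRAKENUNIQ' {
            ext.args = '--preload-size 8G'
        }
    }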
@muniheart (Contributor, Author) commented

Sorry, @jfy133, I made changes since your last review. After looking at the generated bash scripts, I realized the script code would be clearer if I followed your method of using conditional logic in Nextflow rather than passing it to bash.

I also made a patch to the module for testing which I will remove when the module PR is approved.

@jfy133 (Member) commented May 10, 2025

@muniheart no worries!

Firstly, please let me know if you're happy to be added to the nf-core GitHub organisation, so that I don't have to approve the tests each time.
We should be testing the module here first, not on taxprofiler.

Otherwise conceptually yes, the new method will be the best way; it's just that the way you've written it, by concatenating (+) multiple script blocks, is not what we do. I would rather stick with the normal way of writing such a pattern so other community members can follow, i.e.:

<...>
script:
    assert sequence_type in ['fasta', 'fastq']
    sequences = sequences instanceof List ? sequences : [sequences]

    def args = task.ext.args ?: ''
    def args2 = task.ext.args ?: ''
    
    // Decide whether to preload the full database up front or let krakenuniq load it in chunks
    def preload_mode = !task.ext.args.contains('--preload-size')
    def preload_db = "krakenuniq $args --db $db --preload --threads $task.cpus"
    def preload_cmd = preload_mode ? preload_db : ''

    classified   = meta.single_end ? "\${PREFIX}.classified.${sequence_type}"   : "\${PREFIX}.merged.classified.${sequence_type}"
    unclassified = meta.single_end ? "\${PREFIX}.unclassified.${sequence_type}" : "\${PREFIX}.merged.unclassified.${sequence_type}"
    classified_option = save_output_reads ? "--classified-out \"${classified}\"" : ''
    unclassified_option = save_output_reads ? "--unclassified-out \"${unclassified}\"" : ''
    def output_option = save_output ? '--output "\${PREFIX}.krakenuniq.classified.txt"' : ''
    def report = report_file ? '--report-file "\${PREFIX}.krakenuniq.report.txt"' : ''
    compress_reads_command = save_output_reads ? "find . -maxdepth 0 -name '*.${sequence_type}' -print0 | xargs -0 -t -P ${task.cpus} -I % gzip --no-name %" : ''
    def command_inputs_file = '.inputs.txt'

    if (meta.single_end) {
        assert sequences.size() == prefixes.size()
        command_inputs = [sequences, prefixes].transpose().collect { seq, prefix -> "${seq}\t${prefix}" }

        """
        # Store the batch of samples for later command input.
        cat <<-END_INPUTS > ${command_inputs_file}
        ${command_inputs.join('\n        ')}
        END_INPUTS

       // If pre-loading full database do this before running profiling
        ${preload_cmd}

        # Run the KrakenUniq classification on each sample in the batch.
        while IFS='\t' read -r SEQ PREFIX; do
            krakenuniq \\
                --db $db \\
                --threads $task.cpus \\
                $report \\
                $output_option \\
                $unclassified_option \\
                $classified_option \\
                $args2 \\
                "\${SEQ}"
        done < ${command_inputs_file}

        $compress_reads_command

        cat <<-END_VERSIONS > versions.yml
        "${task.process}":
            krakenuniq: \$(echo \$(krakenuniq --version 2>&1) | sed 's/^.*KrakenUniq version //; s/ .*\$//')
        END_VERSIONS
        """
<...>

@muniheart (Contributor, Author) commented

I'm on slack and happy to be added to nf-core github organization.

@muniheart (Contributor, Author) commented

I can't insert a Nextflow comment as suggested, @jfy133, without closing and reopening the script string. Changing it to a bash comment may lead to confusion in case preload_mode is false.
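To illustrate (hypothetical snippet): a // line is only a Groovy comment outside the triple-quoted script string; inside the string it is written verbatim into the generated .command.sh.

    // outside the string: a real Groovy comment
    """
    ${preload_cmd}
    // inside the string: plain text that ends up in .command.sh
    """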

@jfy133 (Member) commented May 12, 2025

I'm on slack and happy to be added to nf-core github organization.

Great! I've added you to the organisation; you should have an email, and if you go to https://github.com/nf-core you should have an invitation there too.

I can't insert a Nextflow comment as suggested, @jfy133, without closing and reopening the script string. Changing it to a bash comment may lead to confusion in case preload_mode is false.

I'm not sure I follow, sorry. If it's OK with you, I can push what I mean directly to your branch. But I'll wait for your OK first before I do so (I could do it via multiple suggestions, but that might be ugly).

@muniheart (Contributor, Author) commented

You are free to push directly to my branch, @jfy133.

@jfy133 (Member) commented May 13, 2025

Have a look now @muniheart and feel free to test it, but I checked the .command.sh of the test and I think it works as expected :)

@muniheart (Contributor, Author) commented

Official test results are in. The .command.sh LGTM, too. Thanks for seeing this through, @jfy133!

Apptainer> nf-core modules -g https://github.com/muniheart/modules.git --branch update-my-module test --profile singularity krakenuniq/preloadedkrakenuniq


                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\ 
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 3.2.1 - https://nf-co.re


INFO     Generating nf-test snapshot
╭─────────────────────────────────────────────────── nf-test output ───────────────────────────────────────────────────╮
│                                                                                                                      │
│ 🚀 nf-test 0.9.2                                                                                                     │
│ https://www.nf-test.com                                                                                              │
│ (c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr                                                                │
│                                                                                                                      │
│ Load .nf-test/plugins/nft-bam/0.5.0/nft-bam-0.5.0.jar                                                                │
│ Load .nf-test/plugins/nft-compress/0.1.0/nft-compress-0.1.0.jar                                                      │
│ Load .nf-test/plugins/nft-vcf/1.0.7/nft-vcf-1.0.7.jar                                                                │
│ Load .nf-test/plugins/nft-csv/0.1.0/nft-csv-0.1.0.jar                                                                │
│ Load .nf-test/plugins/nft-utils/0.0.3/nft-utils-0.0.3.jar                                                            │
│                                                                                                                      │
│ Test Process KRAKENUNIQ_PRELOADEDKRAKENUNIQ                                                                          │
│                                                                                                                      │
│                                                                                                                      │
│   Test [de01b0fc] 'sarscov2 - FASTA' PASSED (47.621s)                                                                │
│   Test [daed92ca] 'sarscov2 - FASTA - ramchunksize' PASSED (17.027s)                                                 │
│   Test [57e5bc3b] 'sarscov2 - Illumina FASTQ single' PASSED (15.898s)                                                │
│   Test [b5727000] 'sarscov2 - Illumina FASTQ paired-end' PASSED (16.598s)                                            │
│   Test [a87eef07] 'sarscov2 - FASTA - stub' PASSED (14.532s)                                                         │
│   Test [f426379c] 'sarscov2 - Illumina FASTQ single - stub' PASSED (15.077s)                                         │
│   Test [318d4dfb] 'sarscov2 - Illumina FASTQ paired-end - stub' PASSED (14.943s)                                     │
│                                                                                                                      │
│                                                                                                                      │
│ SUCCESS: Executed 7 tests in 141.824s                                                                                │
│                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@jfy133 jfy133 added this pull request to the merge queue May 15, 2025
Merged via the queue into nf-core:master with commit cdab130 May 15, 2025
36 checks passed
@muniheart (Contributor, Author) commented

I don't know how I missed it, but the logic is reversed in the stub:

    script:
    def preload_mode = !task.ext.args.toString().contains('--preload-size')
    def preload_cmd = preload_mode ? "krakenuniq ${args} --db ${db} --preload --threads ${task.cpus}" : ''
    stub:
    def preload_mode = !task.ext.args.toString().contains('--preload-size')
    def preload_cmd = preload_mode ? '' : "krakenuniq ${args} --db ${db} --preload --threads ${task.cpus}"
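A sketch of the corrected stub logic, mirroring the script block (hypothetical; the actual fix is in #8483):

    stub:
    def preload_mode = !task.ext.args.toString().contains('--preload-size')
    // Same branch order as the script block: only emit the preload command
    // when --preload-size is absent from ext.args.
    def preload_cmd = preload_mode ? "krakenuniq ${args} --db ${db} --preload --threads ${task.cpus}" : ''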

@jfy133 (Member) commented May 15, 2025

Oh poop, copy paste error... Could you make a PR?

@muniheart (Contributor, Author) commented

No problem. #8483

github-merge-queue bot pushed a commit that referenced this pull request May 16, 2025
* Pass task.ext.args last to give them precedence over
params.krakenuniq_ram_chunk_size.

Pass ram_chunk_size to database preload call and each classify call.

* Remove krakenuniq command-line option `--preload-size`.

Krakenuniq command-line option `--preload-size` is non-mandatory.  When it is
required, it should be passed in `ext.args`.  Also, remove the associated
process input, `ram_chunk_size`.

* Preload database if and only if option `--preload-size` is not present.

When `--preload-size` option is passed in `ext.args`, preloading the database
is not necessary and may unnecessarily consume memory.

* Fix bug in database preloading.  Remove input `ram_chunk_size`.

Remove input `ram_chunk_size`.  Pass `--preload-size $ram_chunk_size` in
`ext.args` to load database in chunks.  Database will be loaded into memory
prior to classification if and only if option `--preload-size` is not
present in `ext.args`.

* Reference `preload_cmd` in BASH script instead of concatenating scripts.

* Use environment variable to control preloading of database.

* Don't export variable.

* Restructure preload insertion

* Add test that preload works

* Fix logic

* Fix logic in stub block.

---------

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>