I requested a GBIF occurrence download using a .json filter file, fetched the archive, and uncompressed it, but when I try to use the data with PhyloNext, the pipeline can't read it properly.
# Request a download based on the filters in gbif_filters.json (an example of the filter file is sketched below)
curl -Ssi \
--user "$USER":"$PASSWORD" \
-H "Content-Type: application/json" \
-X POST -d @gbif_filters.json \
https://api.gbif.org/v1/occurrence/download/request
# Check status
IDDOWNLOAD="myDownload_ID" # I have my download ID
curl -Ss https://api.gbif.org/v1/occurrence/download/$IDDOWNLOAD | jq .
# Download the data locally
mkdir -p ~/Projects/phylonext/GBIF_dumps
cd ~/Projects/phylonext/GBIF_dumps
aria2c \
https://api.gbif.org/v1/occurrence/download/request/$IDDOWNLOAD.zip \
-o gbif_dump.zip
# unzip GBIF data
unzip -d ~/Projects/phylonext/GBIF_dumps ~/Projects/phylonext/GBIF_dumps/gbif_dump.zip
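For reference, gbif_filters.json follows the GBIF occurrence download request format. The predicate below is only a hypothetical placeholder (my real filters differ); the relevant part is "format": "SIMPLE_PARQUET", which is what produces the partitioned occurrence.parquet/ directory used below.
# Hypothetical sketch of the filter file; the predicate values are placeholders
cat > gbif_filters.json <<'EOF'
{
  "creator": "my_gbif_username",
  "notificationAddresses": ["me@example.org"],
  "sendNotification": true,
  "format": "SIMPLE_PARQUET",
  "predicate": {
    "type": "and",
    "predicates": [
      { "type": "equals", "key": "TAXON_KEY",      "value": "212" },
      { "type": "equals", "key": "HAS_COORDINATE", "value": "true" }
    ]
  }
}
EOF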
Then I run the pipeline:
OUTPUTPHYLONEXT="$HOME/Projects/phylonext/analysis_1"
# Run the pipeline
nextflow run vmikk/PhyloNext -r main \
--input "$HOME/Projects/phylonext/GBIF_dumps/occurrence.parquet/" \
--dbscan true \
--iterations 100 \
--outdir "$OUTPUTPHYLONEXT" \
-profile docker
N E X T F L O W ~ version 25.04.2
Launching `https://github.com/vmikk/PhyloNext` [maniac_swirles] DSL2 - revision: 3d8eae09fd [main]
====================================================================
PhyloNext: GBIF phylogenetic diversity pipeline, Version 1.4.2
====================================================================
GBIF occurrence dump: ~/Projects/phylonext/GBIF_dumps/occurrence.parquet/
Output path: ~/Projects/phylonext/analysis_1
Launch directory: ~/Projects/phylonext/GBIF_dumps
Working directory: ~/Projects/phylonext/GBIF_dumps/work
====================================================================
executor > local (2)
[29/b70af7] occ_filter | 0 of 1 ✘
[1a/e9df44] record_count | 0 of 1
[- ] outl_low -
[- ] outl_high -
[- ] prep_ott_ids -
[- ] get_ott_tree -
[- ] merge_occ -
[- ] prep_biodiv -
[- ] phylodiv -
[- ] rand_filelist -
[- ] aggregate_rnds_biodiv -
[- ] div_to_csv -
[- ] plot_pd -
[- ] plot_leaflet -
[- ] derived_datasets -
ERROR ~ Error executing process > 'record_count'
Caused by:
Process `record_count` terminated with an error exit status (1)
Command executed:
10_Record_counts.R --input occurrence.parquet --phylum NA --class NA --order NA --family NA --genus NA --country null --latmin null --latmax null --lonmin null --lonmax null --minyear 1945 --maxyear null --coordprecision 0.1 --coorduncertainty 10000 --coorduncertaintyexclude 301,3036,999,9999 --basisofrecordinclude null --basisofrecordexclude FOSSIL_SPECIMEN,LIVING_SPECIMEN --excludehuman true --terrestrial Land_Buffered_025_dgr.RData --roundcoords 2 --resolution 4 --threads 4 --rcode $(which "Shapefile_filters.R") --output "Record_counts"
Command exit status:
1
Command output:
Counting the number of records per grid cell
Script name: 10_Record_counts.R
Parsing input options and arguments...
Input occurrences: occurrence.parquet
Selected phyla: NA
Selected classes: NA
Selected orders: NA
Selected families: NA
Selected genera: NA
File with GBIF specieskeys: NA
Coordinate precision threshold: 0.1
Maximum allowed coordinate uncertainty: 10000
Black-listed values of coordinate uncertainty: 301,3036,999,9999
Country codes: NA
Minimum latitude: NA
Maximum latitude: NA
Minimum longitude: NA
Maximum longitude: NA
Basis of record to include: NA
Basis of record to exclude: FOSSIL_SPECIMEN,LIVING_SPECIMEN
Minimum year of occurrence: 1945
Maximum year of occurrence: NA
List of extict species: NA
Exclusion of human records: TRUE
Round coordinates: 2
Custom polygons: NA
WGSRPD data: NA
WGSRPD regions: NA
Terrestrial data: Land_Buffered_025_dgr.RData
Country and province centroids: NA
Capitals: NA
Institutions: NA
Uraban areas: NA
Spatial resolution: 4
Coordinate rounding: 2
Number of CPU threads to use: 4
Output prefix: Record_counts
Loading R packages...
arrow 14.0.0.2
data.table 1.14.10
dplyr 1.1.4
h3 3.7.2
sf 1.0.15
Number of available CPU threads: 22
Setting number of CPU threads to: 4
Loading Parquet data
Command error:
Coordinate precision threshold: 0.1
Maximum allowed coordinate uncertainty: 10000
Black-listed values of coordinate uncertainty: 301,3036,999,9999
Country codes: NA
Minimum latitude: NA
Maximum latitude: NA
Minimum longitude: NA
Maximum longitude: NA
Basis of record to include: NA
Basis of record to exclude: FOSSIL_SPECIMEN,LIVING_SPECIMEN
Minimum year of occurrence: 1945
Maximum year of occurrence: NA
List of extict species: NA
Exclusion of human records: TRUE
Round coordinates: 2
Custom polygons: NA
WGSRPD data: NA
WGSRPD regions: NA
Terrestrial data: Land_Buffered_025_dgr.RData
Country and province centroids: NA
Capitals: NA
Institutions: NA
Uraban areas: NA
Spatial resolution: 4
Coordinate rounding: 2
Number of CPU threads to use: 4
Output prefix: Record_counts
Loading R packages...
arrow 14.0.0.2
data.table 1.14.10
dplyr 1.1.4
h3 3.7.2
sf 1.0.15
Number of available CPU threads: 22
Setting number of CPU threads to: 4
Loading Parquet data
Error in `arrow::open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from '~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000000'. Is this a 'parquet' file?: Could not open Parquet input source '~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000000': Parquet file size is 0 bytes
Backtrace:
▆
1. └─arrow::open_dataset(INPUT)
2. └─base::tryCatch(...)
3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
5. └─value[[3L]](cond)
6. └─arrow:::augment_io_error_msg(e, call, format = format)
7. └─rlang::abort(msg, call = call)
Execution halted
Work dir:
~/Projects/phylonext/GBIF_dumps/work/1a/e9df442294732310f0cf2c75ff39be
Container:
vmikk/rarrow:1.4.0
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
But reading a single (non-empty) partition works:
arrow::read_parquet("~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000066")
If I delete the empty 000000 file, the pipeline seems to run fine. Is there a way to make the pipeline skip empty partition files automatically?
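In case it helps others, the manual workaround I'm using boils down to removing the zero-byte partitions before launching the pipeline (a rough sketch, assuming the unpacked layout above):
# List zero-byte partition files in the unpacked download...
find ~/Projects/phylonext/GBIF_dumps/occurrence.parquet -maxdepth 1 -type f -size 0 -print
# ...and delete them so arrow::open_dataset() does not fail on an empty file
find ~/Projects/phylonext/GBIF_dumps/occurrence.parquet -maxdepth 1 -type f -size 0 -delete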