Error reading Parquet dataset: empty (0-byte) partition file

I requested a GBIF occurrence download using a JSON filter file (gbif_filters.json) and unzipped the result, but when I try to use the data with PhyloNext, the pipeline cannot read it.

# Request a download based on the filters in gbif_filters.json
curl -Ssi \
  --user "$USER":"$PASSWORD" \
  -H "Content-Type: application/json" \
  -X POST -d @gbif_filters.json \
  https://api.gbif.org/v1/occurrence/download/request

# Check status
IDDOWNLOAD="myDownload_ID"  # placeholder for my actual download ID
curl -Ss https://api.gbif.org/v1/occurrence/download/$IDDOWNLOAD | jq .

# Download the data locally
mkdir -p ~/Projects/phylonext/GBIF_dumps
cd ~/Projects/phylonext/GBIF_dumps

aria2c \
  https://api.gbif.org/v1/occurrence/download/request/$IDDOWNLOAD.zip \
  -o gbif_dump.zip

# unzip GBIF data 
unzip -d ~/Projects/phylonext/GBIF_dumps ~/Projects/phylonext/GBIF_dumps/gbif_dump.zip
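
Before launching the pipeline, it may be worth checking the unzipped directory for zero-byte files, since one empty partition is enough to make `arrow::open_dataset()` fail with the error shown below. A minimal sketch, assuming a POSIX `find`; the function name `list_empty_partitions` is hypothetical and not part of PhyloNext or GBIF tooling.

```shell
# List zero-byte files inside a Parquet dataset directory.
# Hypothetical helper for illustration; prints one path per line.
list_empty_partitions() {
  # -size 0 matches only files that occupy zero blocks, i.e. empty files
  find "$1" -type f -size 0 | sort
}
```

For example, `list_empty_partitions ~/Projects/phylonext/GBIF_dumps/occurrence.parquet` prints nothing if every partition contains data.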

Then I ran the pipeline:

OUTPUTPHYLONEXT="$HOME/Projects/phylonext/analysis_1"
# Run the pipeline 
nextflow run vmikk/PhyloNext -r main \
  --input "$HOME/Projects/phylonext/GBIF_dumps/occurrence.parquet/" \
  --dbscan true  \
  --iterations 100  \
  --outdir "$OUTPUTPHYLONEXT" \
  -profile docker

N E X T F L O W   ~  version 25.04.2

Launching `https://github.com/vmikk/PhyloNext` [maniac_swirles] DSL2 - revision: 3d8eae09fd [main]


====================================================================
  PhyloNext: GBIF phylogenetic diversity pipeline, Version 1.4.2
====================================================================
  GBIF occurrence dump:     ~/Projects/phylonext/GBIF_dumps/occurrence.parquet/
  Output path:              ~/Projects/phylonext/analysis_1
Launch directory:         ~/Projects/phylonext/GBIF_dumps
Working directory:        ~/Projects/phylonext/GBIF_dumps/work

====================================================================
  
  
  executor >  local (2)
[29/b70af7] occ_filter            | 0 of 1 ✘
[1a/e9df44] record_count          | 0 of 1
[-        ] outl_low              -
  [-        ] outl_high             -
  [-        ] prep_ott_ids          -
  [-        ] get_ott_tree          -
  [-        ] merge_occ             -
  [-        ] prep_biodiv           -
  [-        ] phylodiv              -
  [-        ] rand_filelist         -
  [-        ] aggregate_rnds_biodiv -
  [-        ] div_to_csv            -
  [-        ] plot_pd               -
  [-        ] plot_leaflet          -
  [-        ] derived_datasets      -
  ERROR ~ Error executing process > 'record_count'

Caused by:
  Process `record_count` terminated with an error exit status (1)


Command executed:
  
  10_Record_counts.R       --input   occurrence.parquet       --phylum  NA       --class   NA       --order   NA       --family  NA       --genus   NA       --country null       --latmin  null       --latmax  null       --lonmin  null       --lonmax  null       --minyear 1945       --maxyear null       --coordprecision          0.1       --coorduncertainty        10000       --coorduncertaintyexclude 301,3036,999,9999       --basisofrecordinclude null       --basisofrecordexclude FOSSIL_SPECIMEN,LIVING_SPECIMEN                     --excludehuman true                     --terrestrial Land_Buffered_025_dgr.RData                                   --roundcoords 2       --resolution  4       --threads     4       --rcode       $(which "Shapefile_filters.R")       --output      "Record_counts"

Command exit status:
  1

Command output:
  Counting the number of records per grid cell
Script name: 10_Record_counts.R
Parsing input options and arguments...
Input occurrences: occurrence.parquet
Selected phyla: NA
Selected classes: NA
Selected orders: NA
Selected families: NA
Selected genera: NA
File with GBIF specieskeys: NA
Coordinate precision threshold: 0.1
Maximum allowed coordinate uncertainty: 10000
Black-listed values of coordinate uncertainty: 301,3036,999,9999
Country codes: NA
Minimum latitude: NA
Maximum latitude: NA
Minimum longitude: NA
Maximum longitude: NA
Basis of record to include: NA
Basis of record to exclude: FOSSIL_SPECIMEN,LIVING_SPECIMEN
Minimum year of occurrence: 1945
Maximum year of occurrence: NA
List of extict species: NA
Exclusion of human records: TRUE
Round coordinates: 2
Custom polygons: NA
WGSRPD data: NA
WGSRPD regions: NA
Terrestrial data: Land_Buffered_025_dgr.RData
Country and province centroids: NA
Capitals: NA
Institutions: NA
Uraban areas: NA
Spatial resolution: 4
Coordinate rounding: 2
Number of CPU threads to use: 4
Output prefix: Record_counts

Loading R packages...
arrow 14.0.0.2
data.table 1.14.10
dplyr 1.1.4
h3 3.7.2
sf 1.0.15

Number of available CPU threads:  22
Setting number of CPU threads to:  4
Loading Parquet data

Command error:
  Coordinate precision threshold: 0.1
Maximum allowed coordinate uncertainty: 10000
Black-listed values of coordinate uncertainty: 301,3036,999,9999
Country codes: NA
Minimum latitude: NA
Maximum latitude: NA
Minimum longitude: NA
Maximum longitude: NA
Basis of record to include: NA
Basis of record to exclude: FOSSIL_SPECIMEN,LIVING_SPECIMEN
Minimum year of occurrence: 1945
Maximum year of occurrence: NA
List of extict species: NA
Exclusion of human records: TRUE
Round coordinates: 2
Custom polygons: NA
WGSRPD data: NA
WGSRPD regions: NA
executor >  local (2)
[29/b70af7] occ_filter            | 0 of 1 ✘
[1a/e9df44] record_count          | 0 of 1 ✘
[-        ] outl_low              -
  [-        ] outl_high             -
  [-        ] prep_ott_ids          -
  [-        ] get_ott_tree          -
  [-        ] merge_occ             -
  [-        ] prep_biodiv           -
  [-        ] phylodiv              -
  [-        ] rand_filelist         -
  [-        ] aggregate_rnds_biodiv -
  [-        ] div_to_csv            -
  [-        ] plot_pd               -
  [-        ] plot_leaflet          -
  [-        ] derived_datasets      -
  Pipeline execution stopped with the following message: Coordinate precision threshold: 0.1
Maximum allowed coordinate uncertainty: 10000
Black-listed values of coordinate uncertainty: 301,3036,999,9999
Country codes: NA
Minimum latitude: NA
Maximum latitude: NA
Minimum longitude: NA
Maximum longitude: NA
Basis of record to include: NA
Basis of record to exclude: FOSSIL_SPECIMEN,LIVING_SPECIMEN
Minimum year of occurrence: 1945
Maximum year of occurrence: NA
List of extict species: NA
Exclusion of human records: TRUE
Round coordinates: 2
Custom polygons: NA
WGSRPD data: NA
WGSRPD regions: NA
Terrestrial data: Land_Buffered_025_dgr.RData
Country and province centroids: NA
Capitals: NA
Institutions: NA
Uraban areas: NA
Spatial resolution: 4
Coordinate rounding: 2
Number of CPU threads to use: 4
Output prefix: Record_counts

Loading R packages...
arrow 14.0.0.2
data.table 1.14.10
dplyr 1.1.4
h3 3.7.2
sf 1.0.15

Number of available CPU threads:  22
Setting number of CPU threads to:  4
Loading Parquet data
Error in `arrow::open_dataset()`:
  ! Invalid: Error creating dataset. Could not read schema from '~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000000'. Is this a 'parquet' file?: Could not open Parquet input source '~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000000': Parquet file size is 0 bytes
Backtrace:
  ▆
1. └─arrow::open_dataset(INPUT)
2.   └─base::tryCatch(...)
3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
5.         └─value[[3L]](cond)
6.           └─arrow:::augment_io_error_msg(e, call, format = format)
7.             └─rlang::abort(msg, call = call)
Execution halted

Work dir:
  ~/Projects/phylonext/GBIF_dumps/work/1a/e9df442294732310f0cf2c75ff39be

Container:
  vmikk/rarrow:1.4.0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

-- Check '.nextflow.log' file for details

However, reading a single partition directly works:

arrow::read_parquet("~/Projects/phylonext/GBIF_dumps/occurrence.parquet/000066")

After I deleted the empty 000000 file, the pipeline seems to work. Is there a way for the pipeline to skip empty files automatically?
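
Until the pipeline handles this itself, one manual workaround is to move the empty files out of the dataset directory instead of deleting them, so they can be restored or inspected later. This is only a sketch under stated assumptions: the function name `set_aside_empty_partitions` and the `empty_partitions` directory are hypothetical, and it assumes the dataset directory is flat (no two empty files share a name).

```shell
# Move zero-byte files out of a Parquet dataset directory so that
# readers scanning the directory no longer encounter them.
# Hypothetical helper for illustration, not a PhyloNext feature.
set_aside_empty_partitions() {
  dataset_dir="$1"
  aside_dir="$2"   # must be located outside dataset_dir
  mkdir -p "$aside_dir"
  # Each empty file is moved individually; non-empty files are untouched
  find "$dataset_dir" -type f -size 0 -exec mv {} "$aside_dir"/ \;
}
```

It could be called as `set_aside_empty_partitions ~/Projects/phylonext/GBIF_dumps/occurrence.parquet ~/Projects/phylonext/GBIF_dumps/empty_partitions` before running `nextflow run`.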
