10000 [LANDGRIF-1262] Refactor (removes) `tiff_to_h3_table.py` to something more modular by BielStela · Pull Request #899 · Vizzuality/landgriffon · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[LANDGRIF-1262] Refactor (removes) tiff_to_h3_table.py to something more modular #899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 23, 2023

Conversation

BielStela
Copy link
Contributor
@BielStela BielStela commented Feb 28, 2023

General description

Rewrite of the raster to h3 ingestion. Highlights:

  • Updates to use of psycopg 3 api
  • Removes some archeological remains (on the way to the British museum)
  • overall rewrite the file with love, trying to make it easy to follow and modular to separate DB things from raster->h3.

Testing instructions

⚠️ TODO THIS MUST BE REVIEWED AND CHECKED CAREFULLY

  • Make sure it can handle the bigger imports like Hansen ✔️
  • Is FAST enought 🟡
  • Yields exacly (or maybe better) db state as the old script
  • Everyone likes it

Known issues & a bit of Discussion

The process has 3 bottlenecks currently [seconds running SPAM ingestion]:

  1. Convert a list of raster files (as many as there are in the folder parameter) to h3 index -> value series. This has been solved by using multiprocess pool since is quite embarrassingly parallel. [24s]
  2. Join the resulting Series/Dataframes into a big fat DataFrame to use as source for the DB. This is slow and I'm already (I think) using a fast way of joining dataframes using the h3 index as index and DataFrame.join(list[DataFrame])[57s]
  3. The DB ingestion is also single threaded and writing the dataframe to the h3_grid_table takes a while for SPAM [34s]

So for the SPAM case, which has 51 rasters to ingest (the worst case), the old script takes ~90s and the new approach takes 116s.

For datasets with a single raster it is faster.

All this has to be put into perspective that the longest time in the ingestion is taken by the deforestation (hansen + ghg) part which is 99.9% downloading and preprocessing with gdal and the resulting raster ingested is as small as a single SPAM one.
Thus, the next big step should be move the postprocessed results of deforestation to the cloud.

And why you may ask? I find it a little bit less cluttered and easy to follow to apply modifications safely.

PD.: I almost forget, this PR also removes unused files in the data folder which are quite annoying to have around

Checklist before merging

  • Branch name / PR includes the related Jira ticket Id.
  • Tests to check core implementation / bug fix added.
  • All checks in Continuous Integration workflow pass.
  • Feature functionally tested by reviewer(s).
  • Code reviewed by reviewer(s).
  • Documentation updated (README, CHANGELOG...) (if required)

@vercel
Copy link
vercel bot commented Feb 28, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

2 Ignored Deployments
Name Status Preview Comments Updated
landgriffon-client ⬜️ Ignored (Inspect) Mar 23, 2023 at 7:30PM (UTC)
landgriffon-cookie-traceability ⬜️ Ignored (Inspect) Mar 23, 2023 at 7:30PM (UTC)

@BielStela BielStela requested a review from elpamart February 28, 2023 15:32
@BielStela BielStela marked this pull request as draft February 28, 2023 15:47
@alexeh alexeh force-pushed the refactor-tiff-to-h3-table branch from 2ee2ae1 to b76e531 Compare March 1, 2023 08:37
@alexeh alexeh force-pushed the dev branch 2 times, most recently from 2fff34a to 8a3eb61 Compare March 3, 2023 04:44
@BielStela BielStela force-pushed the refactor-tiff-to-h3-table branch from 5fe2fce to 29388e4 Compare March 3, 2023 09:18
@BielStela BielStela requested review from aagm and fgassert March 3, 2023 11:43
@BielStela BielStela removed the WIP Work In Progress label Mar 3, 2023
@BielStela BielStela marked this pull request as ready for review March 3, 2023 15:04
@BielStela BielStela requested a review from alexeh March 3, 2023 15:27
@BielStela BielStela changed the title Refactor (removes) tiff_to_h3_table.py to something more modular [LANDGRIF-1262] Refactor (removes) tiff_to_h3_table.py to something more modular Mar 6, 2023
Copy link
Collaborator
@alexeh alexeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lemme know when we can fire this up to test :)

@alexeh alexeh changed the base branch from dev to data/feature/update-deforestation-pipe-mask-leviv March 10, 2023 06:13
@alexeh alexeh changed the base branch from data/feature/update-deforestation-pipe-mask-leviv to dev March 10, 2023 06:14
@BielStela BielStela force-pushed the refactor-tiff-to-h3-table branch 3 times, most recently from e21e078 to 10e68ed Compare March 23, 2023 19:27
…porter a bit.

Renames file
Remove unused water footprint ingestion. Also it is not found in the current URL
Clean unintended splited string
Fix docker compose call in data.sh to new version and add host network to compose
remove commented code
Removes unnecessary branch and always raise when check fails
Fix typos
Remove coment
Fix query string injection and documentation of the script
Add some defensive raises in case the ingestion is not going at expected
Replace all call to tif_folder to new script
Also adds some necessary shushes to all the commands, the output was getting crazy.
Add some logging, remove unused connection pool (not needed) and adds h3 dependency
Makes raster to h3 parallel
Uses multiprocessing for the conversion to h3 dataframes. Removes the usages of opened readers with ExitStack because are not serializable into pickle thus not allowing multiprocessing on a list of DatasetReaders.
Stills slow in the CONCAT or JOIN and the DB ingestion
EMBARASSING commit: I was deleting the table in another function 😭😭😭
WIP More refactor and try find out why table is not created
Change conenction manager to ConnectinPool, minor fixes here and there to make it WORK 🥳
Finish raster_to_h3 script 🙌💃. (It will need fixes everywhere)
Updates rasterio and h3ronpy deps
WIP: Add new raster to h3 file
Delete unneded files
@BielStela BielStela force-pushed the refactor-tiff-to-h3-table branch from 10e68ed to cc09e6c Compare March 23, 2023 19:29
@BielStela BielStela merged commit 693b0f4 into dev Mar 23, 2023
@BielStela BielStela deleted the refactor-tiff-to-h3-table branch March 23, 2023 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0