KUNPENG

Large Ocean Environmental Model (internal code name: LOEM)

Code Catalog

|-data_process Data processing: downloads ocean and atmospheric data from the Copernicus climate service website
|-|-download.py Download script
|-|-mean.py Computes the daily mean over all data, used to calculate the ACC metric
|-|-rename.py Renames the variables stored in nc files
|-|-statistic.py Computes statistics of the data (mean, standard deviation, minimum, maximum), used for standardization and normalization

|-models 
|-|-core Model core
|-|-|-graph Processing core for graph neural networks
|-|-|-|-graph.py Data structure definition and processing methods for graphs
|-|-|-|-Icosahedron.py Data structure definition and processing methods for closed polyhedra

|-|-|-AdaptiveFourierNeuralBlock.py Adaptive Fourier neural operator block, operating on (Batch, Height, Width, Channel) data
|-|-|-AdaptiveFourierNeuralBlock1D.py Adaptive Fourier neural operator block, operating on (Batch, Layer, Channel) data
|-|-|-AdaptiveFourierNeuralOperator.py Adaptive Fourier neural operator
|-|-|-AdaptiveFourierNeuralOperator1D.py Adaptive Fourier neural operator (1D)
|-|-|-AFNONet.py Adaptive Fourier neural operator network, https://doi.org/10.48550/arXiv.2208.05419
|-|-|-AIGOMS.py AI global ocean simulation system, https://doi.org/10.48550/arXiv.2308.03152
|-|-|-ChangeChannel.py Channel dimension reduction block
|-|-|-CubePatchEmbedding.py 3D patch embedding block
|-|-|-CubePatchMerging.py 3D patch merging block
|-|-|-CubePatchUnembedding.py 3D patch unembedding block
|-|-|-CubePatchUnmerging.py 3D patch unmerging block
|-|-|-DCNBlock.py Deformable convolution block
|-|-|-DCNv4.py Deformable convolution operator
|-|-|-DownBlock.py Spatial downsampling block
|-|-|-DownSampling.py Channel downsampling block
|-|-|-DropPath.py DropPath regularization
|-|-|-EarthAttention.py Earth-specific multi-head attention operator
|-|-|-EarthSpecificBlock.py Earth-specific multi-head attention block
|-|-|-FUXI.py FuXi model, https://doi.org/10.1038/s41612-023-00512-1
|-|-|-GraphCast.py GraphCast graph-based prediction, https://doi.org/10.1126/science.adi2336
|-|-|-GraphEmbedding.py Embeds grid data into a graph
|-|-|-GraphNeuralBlock.py Graph neural network block
|-|-|-GraphUnembedding.py Graph unembedding: maps the graph back to a data grid
|-|-|-LOEM.py The core prediction model being developed in this project
|-|-|-MaskedLatitudeACCLoss.py ACC metric with sea-land mask and latitude weighting
|-|-|-MaskedLatitudeL1Loss.py L1 loss with sea-land mask and latitude weighting
|-|-|-MaskedLatitudeL2Loss.py L2 loss with sea-land mask and latitude weighting
|-|-|-MLP.py Multilayer perceptron
|-|-|-PANGU.py Pangu model, https://doi.org/10.1038/s41586-023-06185-3
|-|-|-PatchEmbedding.py Image patch embedding
|-|-|-PatchMerge.py Image patch merging
|-|-|-PatchUnembedding.py Image patch unembedding
|-|-|-PatchUnmerge.py Image patch unmerging
|-|-|-SWINRNNPlus.py SwinRNN+, http://dx.doi.org/10.48448/zn7f-fc64
|-|-|-SwinTransformerBlock.py Swin Transformer block
|-|-|-SwinTransformerBlockv2.py Swin Transformer v2 block
|-|-|-UpBlock.py Spatial upsampling block
|-|-|-UpSampling.py Channel upsampling block
|-|-|-WindowAttention.py Sliding-window multi-head attention operator
|-|-|-WindowAttentionv2.py Sliding-window multi-head attention operator, v2

|-|-loader Data loaders
|-|-|-distributedSharedMemoryCache.py Shared-memory cache for ocean environment data; only works on Linux
|-|-|-ncLoader.py Data loader for ocean environment data files in nc format; should be used together with DistributedSampler (see the sketch after this catalog)

|-|-test Unit tests

|-|-train 
|-|-|-export.py Exports weight files in pt or pth format as onnx structured weight files
|-|-|-loem_optimize_deepspeed.py Searches for the optimal LOEM structure parameters and hyperparameters using the DeepSpeed framework and Bayesian optimization
|-|-|-loem_optimize.py Searches for the optimal LOEM structure parameters and hyperparameters using Bayesian optimization
|-|-|-npu_convert.py Converts pt weight files saved in the NPU-internal format to general weight files
|-|-|-predict.py Predicts future features for a given period using a weight file
|-|-|-train_deepspeed.py Trains models with the DeepSpeed framework. This file can train all the models listed above, so the other training scripts are deprecated
|-|-|-train_local.py Single-machine multi-card training, suitable for restricted environments without open ports
|-|-|-train.py Distributed training
|-|-|-validate.py Model validation

|-|-visualization Visualization
|-|-|-loss.py Plots a model's training-loss curve
|-|-|-map.py Plots comparisons of model prediction results
|-|-|-plot.py Plots comparisons of model evaluation metrics

|-|-config.py Configuration file
|-|-utils.py Generic utility methods

|-scripts Script directory
|-|-update_code.bat Packaging script: builds the code into a Docker image and publishes it to Alibaba Cloud
|-web
|-|-templates 
|-|-|-index.html Homepage template
|-|-serve.py Flask service program; run python serve.py to start the web service
|-|-setup.bat Old service script
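
A minimal sketch of how an nc-file dataset is typically paired with DistributedSampler, as referenced from ncLoader.py above. The dataset class here is a dummy stand-in, not the actual ncLoader API.

import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class DummyOceanDataset(Dataset):
    # Stand-in for ncLoader.py: one (channel, lat, lon) tensor per day
    def __len__(self):
        return 365
    def __getitem__(self, idx):
        return torch.zeros(4, 361, 720)

dataset = DummyOceanDataset()
# DistributedSampler gives each process a disjoint shard of the samples;
# it requires torch.distributed to be initialized (DeepSpeed does this)
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=1, sampler=sampler, num_workers=0)

for epoch in range(3):
    sampler.set_epoch(epoch)  # keeps shuffling consistent across ranks
    for batch in loader:
        pass  # forward/backward step goes here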

Notes

  1. A model generally has a three-layer structure: the outermost layer is the model itself; the model consists of one or more blocks; and each block is composed of one or more operators. Many blocks and operators are reused across multiple models, such as MLP or the upsampling blocks, while some are specific to a single model, such as the Earth-specific embedding. A minimal sketch of this layering follows this list.
  2. Unit tests do not cover all of the code; they only check results in critical or logically complex areas, so undiscovered bugs or logic errors in the code cannot be ruled out.
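
A minimal sketch of the operator/block/model layering described in note 1 (the class names are illustrative and do not mirror the real modules):

import torch
import torch.nn as nn

class Operator(nn.Module):
    # Innermost layer: a single operator
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return self.proj(x)

class Block(nn.Module):
    # Middle layer: composes operators (here with a residual connection)
    def __init__(self, dim):
        super().__init__()
        self.op = Operator(dim)
        self.mlp = Operator(dim)
    def forward(self, x):
        return x + self.mlp(self.op(x))

class Model(nn.Module):
    # Outermost layer: composes blocks
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

model = Model(dim=768, depth=4)
out = model(torch.zeros(1, 16, 768))  # (batch, tokens, channels)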

Data Catalog

Because the project involves a large number of directories, directories are not specified separately for each function. Instead, a single unified root directory is used as the data directory and stores all data files in an orderly layout. The data root directory '/' is arranged as follows:

|- orig Original data directory
|-|- 0000 Static file directory
|-|-|- tf_0.25x0.25_0000_00_00.nc Sea-land mask and sea-level height
|-|- 2010 Data file directory
|-|-|- gaef_<resolution>x<resolution>_<year>_<month>_<day>.nc Atmospheric data file
|-|-|- gmef_<resolution>x<resolution>_<year>_<month>_<day>.nc Ocean data file

|- pred Inference results directory; the inference results of each model are stored here
|-|- exp1 Subdirectories are divided by tag
|-|-|- 11425 Number of training steps the model used for inference had run (steps)
|-|-|-|- gaef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Atmospheric prediction results; the file name includes the initial date, the predicted date, and the prediction model
|-|-|-|- gmef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Ocean prediction results; the file name includes the initial date, the predicted date, and the prediction model

|- pth Weights directory
|-|- exp1 Subdirectories are divided by tag
|-|-|- global_step11425 Number of training steps the model had run (steps)
|-|-|-|- mp_rank_00_model_states.pt Model weight file
|-|-|-|- <model><steps>.onnx Exported model file

|- statistic Statistics directory, containing sample statistics and validation metrics
|-|- exp1 Subdirectories are divided by tag
|-|-|- <model><steps> Statistical results of the model
|-|- mean Daily mean data, used to calculate a model's ACC metric
|-|-|- gaef_0.25x0.25_mean_01_01.nc Atmospheric daily mean data
|-|-|- gmef_0.25x0.25_mean_01_01.nc Ocean daily mean data
|-|- gmef_min_features.nc Minimum of the ocean statistics
|-|- gmef_max_features.nc Maximum of the ocean statistics
|-|- gmef_mean_features.nc Mean of the ocean statistics
|-|- gmef_std_features.nc Standard deviation of the ocean statistics
|-|- gaef_min_features.nc Minimum of the atmospheric statistics
|-|- gaef_max_features.nc Maximum of the atmospheric statistics
|-|- gaef_mean_features.nc Mean of the atmospheric statistics
|-|- gaef_std_features.nc Standard deviation of the atmospheric statistics
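
As an illustration, the *_features.nc statistics files above can be produced with xarray roughly as follows. The glob pattern and the "time" dimension name are assumptions; the real statistic.py may differ.

import xarray as xr

ds = xr.open_mfdataset("orig/*/gmef_0.25x0.25_*.nc", combine="by_coords")

for name, stat in {
    "mean": ds.mean(dim="time"),
    "std": ds.std(dim="time"),
    "min": ds.min(dim="time"),
    "max": ds.max(dim="time"),
}.items():
    stat.to_netcdf(f"statistic/gmef_{name}_features.nc")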

Environment

Training generally uses a single-machine or multi-machine, multi-card environment under Linux; none of the following covers Windows.

CUDA environment preparation

  1. (Recommended) Install Anaconda

Download the installation script from https://www.anaconda.com/download/success

Install it:

> bash <conda-installer-name>-latest-Linux-x86_64.sh

If it will not run, remember to make the file executable:

> chmod +x <conda-installer-name>-latest-Linux-x86_64.sh

Create a new Python environment:

> conda init
> conda create -n <env-name> python=3.10
> conda activate <env-name>
  2. Install the CUDA toolkit

At https://developer.nvidia.com/cuda-11-4-0-download-archive, choose an installation method suitable for your environment.

CUDA 11.4 is a commonly used version compatible with Python 3.10. Choose the appropriate OS, architecture, and version, install according to the given commands, and afterwards verify the installation with:

> nvidia-smi
  3. Install the CUDA build of PyTorch

At https://pytorch.org/get-started/previous-versions/, choose the appropriate version.

This project currently uses torch 1.12.1+cu113, installed with:

> pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

Installing with conda is not recommended: if conda cannot reach the CUDA build of PyTorch (for example, because of network restrictions), it will silently install the CPU build instead.

After a successful installation, enter the Python environment:

> python

Use the following code to test whether the installation succeeded (enter it line by line; each line should run without errors):

import torch
torch.cuda.is_available() # Should output True
  4. Install other dependencies

If you need to download and process data from the Copernicus website, install:

  • numpy Math library
  • xarray nc file handling
  • setuptools
  • copernicusmarine Copernicus Marine Service download interface
  • netCDF4 nc file processing engine

Command:

> pip install numpy xarray setuptools copernicusmarine netCDF4
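
A hypothetical sketch of what a download.py-style request looks like with the copernicusmarine Python API; the dataset ID, variable names, and output file name below are placeholders that should be checked against the Copernicus Marine Toolbox documentation and the actual download.py.

import copernicusmarine

copernicusmarine.subset(
    dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",  # placeholder ID
    variables=["thetao", "so", "uo", "vo"],            # placeholder variables
    start_datetime="2010-01-01T00:00:00",
    end_datetime="2010-01-01T23:59:59",
    output_filename="gmef_0.25x0.25_2010_01_01.nc",
    output_directory="orig/2010",
)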

If you only need to train, install:

  • numpy Math library
  • xarray nc file handling
  • setuptools
  • pandas Tabular data processing
  • scipy Scientific computing library
  • scikit-learn Machine learning library
  • mpi4py Multi-process communication library
  • torchsummary Model summary library
  • deepspeed Large-model training framework
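
Command (assembled from the list above; the package names are the standard PyPI names):

> pip install numpy xarray setuptools pandas scipy scikit-learn mpi4py torchsummary deepspeed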

NPU environment preparation

The procedure is the same as for CUDA, except that the CUDA toolkit is replaced by the CANN toolkit.

  1. Same as CUDA step 1 (install Anaconda).

  2. From https://www.hiascend.com/developer/download/community/result, download the CANN toolkit, CANN kernels, and CANN NNAL

  • cann-toolkit The CANN development kit package
  • cann-kernels The CANN operator package
  • cann-nnal The CANN acceleration library

Choose the development packages according to the PyTorch version; PyTorch 1.12 supports CANN up to 8.0.RC2.beta1.

After downloading, install:

> chmod +x Ascend-<...>.run
> ./Ascend-<...>.run

After installation, verify with:

> npu-smi info
  3. Install torch-npu

The CPU build of PyTorch must be installed first:

> pip install torch==1.12.1+cpu  --index-url https://download.pytorch.org/whl/cpu

Then install torch-npu; the installed version must match the CANN and PyTorch versions, as shown in the compatibility table at https://gitee.com/ascend/pytorch

> pip install pyyaml
> pip install setuptools
> pip install torch-npu==1.11.0.post14
  4. Same as CUDA step 4 (install the other dependencies).

  5. Initialize environment variables

The CANN installer places an environment-variable script in the installation directory:

> source /usr/local/Ascend/ascend-toolkit/set_env.sh

Use the following code to test whether the installation succeeded (enter it line by line; each line should run without errors):

import torch
import torch_npu
torch.npu.is_available() # Should output True

Training

Training uses the DeepSpeed framework.

For example:

> NCCL_SOCKET_IFNAME=enp6s0 nohup ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 3 --num_gpus 2 /home/LOEM/code/models/train/train_deepspeed.py -n 4 -m AFNONET -lf MLL1 -mix False -b 1 -e 100 -lr 5e-4 -decay 0.01 -d 0 -hdim 768 -sttr 2010-01-01 -edtr 2019-12-31 -stte 2020-01-01 -edte 2020-12-31 -lc 150 -nw 0 -tag exp2 > afnonet.out 2>&1 &

Explanation

The training command nests three layers: the outermost is 'nohup', the middle is 'ds', and the innermost is 'train_deepspeed.py'.

nohup runs the training program in the background so that it keeps running after the shell is closed. Its general form is shown below; the output is redirected to outfile.out:

nohup <command> > outfile.out 2>&1 &

ds is the DeepSpeed launcher: it creates one process per device on each host; every process executes the same code, and the processes communicate with each other during training.

For single-machine multi-card training, use:

ds --num_gpus=8 <command>

For multi-machine multi-card training, use:

ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 0 --num_gpus 2 <command>
  • hostfile : Device description file
  • no_ssh : Do not use SSH auto-login
  • master_addr : Master server IP
  • master_port : Master server port
  • node_rank : Number of the current node, starting from 0; the master server is node 0
  • num_gpus : Number of devices per node
  • <command> : The program command run on each device

The hostfile format is:

worker-1 slots=4
worker-2 slots=4

It records the nodes and the number of devices (slots) on each node.

For multi-machine training, the network interface used for NCCL communication must be specified on each host. Here, 'NCCL_SOCKET_IFNAME=enp6s0' tells NCCL to use the 'enp6s0' interface.

The actual training program is train_deepspeed.py; for details, see models/train/train_deepspeed.py. Its parameters include:

  • -n/--nodes Number of nodes (servers); 1 for a single machine with multiple cards
  • -m/--model Model to use
  • -lf/--loss_func Loss function
    • MSE Mean squared error
    • LMSE Mean squared error with latitude weighting
    • LL1 Mean absolute error with latitude weighting
    • MLL1 Mean absolute error with sea-land mask and latitude weighting (a sketch of this loss follows this list)
  • -dev/--device Device
    • cuda NVIDIA CUDA devices
    • npu Huawei Ascend devices
    • cpu No acceleration device
  • -mix/--mixing_precision Whether to use mixed-precision training
  • -b/--batch_size Batch size
  • -e/--epochs Number of epochs
  • -lr/--learning_rate Learning rate
  • -decay/--decay Weight-decay penalty
  • -d/--debugger Whether to debug with limited resources
    • 0 No debugging
    • 1 Remove all learning blocks and run in minimal mode, for debugging the framework
    • 2 Use only two blocks, for debugging operators
  • -il/--input_length Sample input length
  • -ol/--output_length Sample output length
  • -lat/--latitude Latitude width
  • -lon/--longitude Longitude width
  • -hdim/--hidden_dim Hidden layer size
  • -sttr/--start_date_train Training set start date
  • -edtr/--end_date_triain Training set end date
  • -stte/--start_date_test Test set start date
  • -edte/--end_date_test Test set end date
  • -lc/--load_cache Cache size; the cache works in units of one day of data
  • -shm/--shm_cache Whether to use shared memory for the cache; ignored if lc == False
  • -nw/--num_workers Number of data-loading processes; must be set to 0 when using shared memory, otherwise an error is raised
  • -db/--database Root directory for data storage
  • -tag/--tag Tag for this training run
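
As a minimal sketch of the MLL1-style loss (an L1 loss with sea-land mask and latitude weighting), assuming cos-latitude weights and a (batch, channel, lat, lon) layout; the real MaskedLatitudeL1Loss.py may use different conventions.

import torch

def masked_latitude_l1(pred, target, mask, lat_deg):
    # pred/target: (B, C, H, W); mask: (H, W) with 1 = ocean, 0 = land;
    # lat_deg: (H,) latitude of each grid row in degrees
    w = torch.cos(torch.deg2rad(lat_deg))
    w = (w / w.mean()).view(1, 1, -1, 1)    # normalized latitude weights
    err = (pred - target).abs() * w * mask  # mask zeroes out land points
    denom = (w * mask).expand_as(err).sum().clamp_min(1e-8)
    return err.sum() / denom                # weighted mean over ocean points

# usage
pred = torch.rand(1, 3, 181, 360)
target = torch.rand(1, 3, 181, 360)
mask = torch.ones(181, 360)
lat = torch.linspace(-90, 90, 181)
loss = masked_latitude_l1(pred, target, mask, lat)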

Training results

After training completes, the trained pt weights are stored at data/pth/<tag>/global_step<num>/mp_rank_00_model_states.pt

Validation

Validation uses validate.py to evaluate the model by computing MAE and ACC.

Because no backpropagation is involved, validation requires far fewer computational resources and can run on a personal PC, although the run may take a long time.
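
For reference, ACC here is the standard anomaly correlation coefficient against the daily climatological mean (the files under statistic/mean); the exact masking and weighting in MaskedLatitudeACCLoss.py may differ. With prediction X, observation Y, climatology C, and latitude weight w_i at grid point i:

\mathrm{ACC} = \frac{\sum_i w_i (X_i - C_i)(Y_i - C_i)}{\sqrt{\sum_i w_i (X_i - C_i)^2}\,\sqrt{\sum_i w_i (Y_i - C_i)^2}}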

> python validate.py -pt /home/LOEM/data/pth/exp1/global_step11425/mp_rank_00_model_states.pt -tag exp1
  • -m/--model Model to use
  • -c/--cpu Whether to use the CPU
  • -b/--batch_size Batch size
  • -d/--debugger Whether to debug with limited resources
    • 0 No debugging
    • 1 Remove all learning blocks and run in minimal mode, for debugging the framework
    • 2 Use only two blocks, for debugging operators
  • -il/--input_length Sample input length
  • -ol/--output_length Sample output length
  • -lat/--latitude Latitude width
  • -lon/--longitude Longitude width
  • -hdim/--hidden_dim Hidden layer size
  • -l/--lead Prediction lead length to validate, generally measured in days
  • -stvl/--start_date_val Validation set start date
  • -edvl/--end_date_val Validation set end date
  • -lc/--load_cache Cache size; the cache works in units of one day of data
  • -nw/--num_workers Number of data-loading processes; must be set to 0 when using shared memory, otherwise an error is raised
  • -db/--database Root directory for data storage
  • -tag/--tag Tag of the training run
  • -pt/--pt_file Weight file path

Because the weight file does not contain the model structure, the relevant parameters must be used to specify and initialize the model before loading the weights.
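
A minimal sketch of this specify-then-initialize-then-load pattern; the LOEM import path and constructor arguments are assumptions about this codebase, and DeepSpeed checkpoints such as mp_rank_00_model_states.pt usually nest the weights under a 'module' key.

import torch
from models.core.LOEM import LOEM  # hypothetical import path

model = LOEM(hidden_dim=768, input_length=1, output_length=1)  # hypothetical args
state = torch.load(
    "data/pth/exp1/global_step11425/mp_rank_00_model_states.pt",
    map_location="cpu",
)
model.load_state_dict(state.get("module", state))  # unwrap DeepSpeed nesting
model.eval()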

Validation results

The validation results are stored in data/statistic/<tag>/<model><steps>.json and include the following fields:

{
    "val_l1" : [],          // MAE
    "val_acc" : [],         // ACC
    "mean_time" : [],       // average inference time per sample
    "train_loss_list" : [], // training loss curve
    "test_loss_list" : []   // test loss curve; empty if no testing was done during training
}

Inference

Inference uses predict.py: based on one or more (depending on the model) ocean environment samples from different dates, it generates predictions for the next N days and stores them in nc-format files.

Because no backpropagation is involved, inference requires fewer computational resources and can run on a personal PC.

> python predict.py -m PANGU -b 1 -d 0 -il 1 -ol 1 -hdim 192 -pt "E:/LOEM/data/pth/exp3/global_step18280-1/mp_rank_00_model_states.pt" -tag exp3
  • -m/--model Model to use
  • -c/--cpu Whether to use the CPU
  • -b/--batch_size Batch size
  • -d/--debugger Whether to debug with limited resources
    • 0 No debugging
    • 1 Remove all learning blocks and run in minimal mode, for debugging the framework
    • 2 Use only two blocks, for debugging operators
  • -il/--input_length Sample input length
  • -ol/--output_length Sample output length
  • -lat/--latitude Latitude width
  • -lon/--longitude Longitude width
  • -hdim/--hidden_dim Hidden layer size
  • -s/--save_lead Lead frames to save, generally measured in days; for example, [3, 5, 8, 10] predicts the next 10 days and saves the results for days 3, 5, 8, and 10
  • -st/--start_date Prediction start date
  • -ed/--end_date Prediction end date, i.e. the last day for which data is loaded
  • -lc/--load_cache Cache size; the cache works in units of one day of data
  • -nw/--num_workers Number of data-loading processes; must be set to 0 when using shared memory, otherwise an error is raised
  • -db/--database Root directory for data storage
  • -tag/--tag Tag of the training run
  • -pt/--pt_file Weight file path

Inference results

The inference results are saved under data/pred/<tag>/

Export

The pt weight file contains only weights, not the model structure. export.py exports an onnx-format file that contains both the weights and the model structure; onnx files can run across platforms and environments and can be deployed in environments without Python or PyTorch.

Not every model can be exported to onnx; it depends on whether the operators the model uses are supported. For example, older onnx opsets do not support the fast Fourier transform, so AFNONet and AIGOMS cannot be exported to onnx.
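
A minimal sketch of this kind of export with torch.onnx.export; the stand-in model, input shape, and opset version are assumptions, and the real export.py may differ.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))  # stand-in model
model.eval()

dummy = torch.zeros(1, 4, 181, 360)  # (batch, channel, lat, lon)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)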
