Large Ocean Environmental Model (LOEM is the internal codename)
|-data_process Data processing: downloads ocean and atmospheric data from the Copernicus services website
|-|-download.py Download program
|-|-mean.py Calculates the daily mean of all data, used to compute the ACC metric
|-|-rename.py Renames the variables stored in nc files
|-|-statistic.py Calculates statistical measures of the data, including mean, standard deviation, and maximum and minimum values, used for standardization and normalization
|-models
|-|-core Model Core
|-|-|-graph The processing core of graph neural networks
|-|-|-|-graph.py Definition and processing methods of data structure for graphs
|-|-|-|-Icosahedron.py Definition and processing methods for the icosahedron data structure
|-|-|-AdaptiveFourierNeuralBlock.py Adaptive Fourier neural operator block, processing (Batch, Height, Width, Channel) data
|-|-|-AdaptiveFourierNeuralBlock1D.py Adaptive Fourier neural operator block, processing (Batch, Layer, Channel) data
|-|-|-AdaptiveFourierNeuralOperator.py Adaptive Fourier Neural Operator
|-|-|-AdaptiveFourierNeuralOperator1D.py Adaptive Fourier Neural Operator
|-|-|-AFNONet.py Adaptive Fourier Neural Operator network, https://doi.org/10.48550/arXiv.2208.05419
|-|-|-AIGOMS.py AI Global Ocean Simulation System, https://doi.org/10.48550/arXiv.2308.03152
|-|-|-ChangeChannel.py Channel dimension reduction block
|-|-|-CubePatchEmbedding.py 3D patch embedding block
|-|-|-CubePatchMerging.py 3D patch merging block
|-|-|-CubePatchUnembedding.py 3D patch unembedding block
|-|-|-CubePatchUnmerging.py 3D patch unmerging block
|-|-|-DCNBlock.py Deformable convolution block
|-|-|-DCNv4.py Deformable convolution operator
|-|-|-DownBlock.py Spatial downsampling block
|-|-|-DownSampling.py Channel downsampling block
|-|-|-DropPath.py DropPath regularization
|-|-|-EarthAttention.py Earth-specific multi-head attention operator
|-|-|-EarthSpecificBlock.py Earth-specific multi-head attention block
|-|-|-FUXI.py FuXi model, https://doi.org/10.1038/s41612-023-00512-1
|-|-|-GraphCast.py GraphCast graph-based prediction model, https://doi.org/10.1126/science.adi2336
|-|-|-GraphEmbedding.py Embedding the grid data into a graph
|-|-|-GraphNeuralBlock.py Graph neural network block
|-|-|-GraphUnembedding.py Graph unembedding, mapping the graph back onto the data grid
|-|-|-LOEM.py The core prediction model being developed in this project
|-|-|-MaskedLatitudeACCLoss.py ACC operator with mask and latitude weight
|-|-|-MaskedLatitudeL1Loss.py L1 loss function with mask and latitude weight
|-|-|-MaskedLatitudeL2Loss.py L2 loss function with mask and latitude weight
|-|-|-MLP.py Multilayer perceptron
|-|-|-PANGU.py Pangu model, https://doi.org/10.1038/s41586-023-06185-3
|-|-|-PatchEmbedding.py 2D patch embedding
|-|-|-PatchMerge.py 2D patch merging
|-|-|-PatchUnembedding.py 2D patch unembedding
|-|-|-PatchUnmerge.py 2D patch unmerging
|-|-|-SWINRNNPlus.py SwinRNN+, http://dx.doi.org/10.48448/zn7f-fc64
|-|-|-SwinTransformerBlock.py SwinTransformer block
|-|-|-SwinTransformerBlockv2.py SwinTransformer v2 block
|-|-|-UpBlock.py Spatial upsampling block
|-|-|-UpSampling.py Channel upsampling block
|-|-|-WindowAttention.py Window multi-head attention operator
|-|-|-WindowAttentionv2.py Window multi-head attention operator, v2
|-|-loader Data Loader
|-|-|-distributedSharedMemoryCache.py Shared-memory caching for ocean environment data; only effective on Linux
|-|-|-ncLoader.py Data loader for ocean environment data files in nc format; should be used together with DistributedSampler
|-|-test Unit Testing
|-|-train
|-|-|-export.py Exports pt/pth weight files as onnx files containing both weights and model structure
|-|-|-loem_optimize_deepspeed.py Searches for optimal LOEM structural parameters and hyperparameters via Bayesian optimization, using the DeepSpeed framework
|-|-|-loem_optimize.py Searches for optimal LOEM structural parameters and hyperparameters via Bayesian optimization
|-|-|-npu_convert.py Converts pt weight files saved in NPU internal format to generic weight files
|-|-|-predict.py Predicts future fields for a given period using a weight file
|-|-|-train_deepspeed.py Trains models with the DeepSpeed framework; it can train all the models listed above, so the other training scripts are effectively deprecated
|-|-|-train_local.py Single-machine multi-card training, suitable for restricted environments without open ports
|-|-|-train.py Distributed training
|-|-|-validate.py Model validation
|-|-visualization Visualization
|-|-|-loss.py Plots the training-loss curves of models
|-|-|-map.py Plots comparisons of model prediction results
|-|-|-plot.py Plots comparisons of model evaluation metrics
|-|-config.py Configuration file
|-|-utils.py Generic utility methods
|-scripts Script directory
|-|-update_code.bat Packaging script: builds the code into a Docker image and publishes it to Alibaba Cloud
|-web
|-|-templates
|-|-|-index.html Homepage Template
|-|-serve.py Flask service program; run 'python serve.py' to start the web service
|-|-setup.bat Legacy service script
- A model generally has a three-layer structure: the outermost layer is the model itself; the model consists of one or more blocks, and each block is composed of one or more operators. Many blocks and operators are reused across models, such as MLP or upsampling, while others are specific to a single model, such as the Earth-specific attention blocks (see the sketch below).
- Unit testing does not cover all code; results are tested only in critical or logically complex areas, so undiscovered bugs or logic errors cannot be ruled out.
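To illustrate the three-layer convention, here is a minimal PyTorch sketch (toy classes for illustration, not the project's actual code): an MLP operator, a block that wraps it with normalization and a residual connection, and a model that stacks blocks.
import torch.nn as nn

# Operator: the smallest unit, e.g. a two-layer MLP (cf. MLP.py)
class MLP(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
    def forward(self, x):
        return self.net(x)

# Block: composes one or more operators, here with a norm and a residual connection
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = MLP(dim, dim * 4)
    def forward(self, x):
        return x + self.mlp(self.norm(x))

# Model: the outermost layer, a stack of blocks
class Model(nn.Module):
    def __init__(self, dim=768, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x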
Because the project involves many directories, functions do not each get a separately specified directory; instead a unified root directory serves as the data directory and stores all data files in an orderly manner. The data root directory '/' is arranged as follows
|- orig Original data directory
|-|- 0000 Static file directory
|-|-|- tf_0.25x0.25_0000_00_00.nc Sea-land mask and sea level height
|-|- 2010 Data file directory, named by year
|-|-|- gaef_<resolution>x<resolution>_<year>_<month>_<day>.nc Atmospheric data file
|-|-|- gmef_<resolution>x<resolution>_<year>_<month>_<day>.nc Ocean data file
|- pred Inference result directory; each model's inference results are stored here
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- 11425 Named by the number of training steps the inference model had run
|-|-|-|- gaef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Atmospheric prediction result; the file name encodes the initialization date, forecast date, and model
|-|-|-|- gmef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Ocean prediction result; the file name encodes the initialization date, forecast date, and model
|- pth Weight directory
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- global_step11425 Named by the number of training steps the model had run
|-|-|-|- mp_rank_00_model_states.pt Model weight file
|-|-|-|- <model><steps>.onnx Model export file
|- statistic Statistics directory, containing sample statistics and validation metrics
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- <model><steps> Statistical results of the model
|-|-|- <model><steps> Statistical results of the model
|-|- mean Daily mean data, used to compute the models' ACC metric
|-|-|- gaef_0.25x0.25_mean_01_01.nc Atmospheric daily mean data
|-|-|- gmef_0.25x0.25_mean_01_01.nc Ocean daily mean data
|-|- gmef_min_features.nc Minimum values of the ocean features
|-|- gmef_max_features.nc Maximum values of the ocean features
|-|- gmef_mean_features.nc Mean of the ocean features
|-|- gmef_std_features.nc Standard deviation of the ocean features
|-|- gaef_min_features.nc Minimum values of the atmospheric features
|-|- gaef_max_features.nc Maximum values of the atmospheric features
|-|- gaef_mean_features.nc Mean of the atmospheric features
|-|- gaef_std_features.nc Standard deviation of the atmospheric features
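As an illustration of the file-naming convention and of how the statistics files support standardization, here is a minimal xarray sketch (the helper function is hypothetical, and the arithmetic assumes the statistics files share variable names with the data files):
import xarray as xr

# Hypothetical helper: builds the path of a daily data file under the data root
def orig_path(root, kind, res, year, month, day):
    # kind is 'gaef' (atmosphere) or 'gmef' (ocean)
    return f"{root}/orig/{year:04d}/{kind}_{res}x{res}_{year:04d}_{month:02d}_{day:02d}.nc"

root = "/data"  # assumed data root
sample = xr.open_dataset(orig_path(root, "gmef", 0.25, 2010, 1, 1))
mean = xr.open_dataset(f"{root}/statistic/gmef_mean_features.nc")
std = xr.open_dataset(f"{root}/statistic/gmef_std_features.nc")
normalized = (sample - mean) / std  # z-score standardization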
Training generally uses a single-machine or multi-machine, multi-card environment under Linux; none of the following covers Windows.
- (Suggestion) Install Anaconda
Download the installation script from https://www.anaconda.com/download/success
Install
> bash <conda-installer-name>-latest-Linux-x86_64.sh
If the script cannot be run, remember to grant it execute permission first
> chmod +x <conda-installer-name>-latest-Linux-x86_64.sh
Create a new Python environment
> conda init
> conda create -n <env-name> python=3.10
> conda activate <env-name>
- Install CUDA toolkit
Choose an installation method suitable for your environment at https://developer.nvidia.com/cuda-11-4-0-download-archive
CUDA 11.4 is a commonly used version compatible with the Python 3.10 environment above. Choose the appropriate OS, architecture, and version, install according to the commands, and verify after installation with
> nvidia-smi
- Install PyTorch+CUDA version
Choose the appropriate version to install at https://pytorch.org/get-started/previous-versions/
The project currently uses torch 1.12.1+cu113, installed with
> pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
Installing with conda is not recommended: if conda cannot reach the CUDA build of PyTorch (for example, due to network restrictions), it silently installs the CPU version instead
After successful installation, use the following command to enter the Python environment
> python
Use the following code to test whether the installation succeeded (enter line by line; no errors should occur)
import torch
torch.cuda.is_available() # Output True
- Install other dependency libraries
If you need to download data from the Copernicus website and process it, install
- numpy math library
- xarray nc file processing
- setuptools
- copernicusmarine Copernicus Marine Service download interface
- netCDF4 nc file processing engine
Command
> pip install numpy xarray setuptools copernicusmarine netCDF4
If you only need to train, install
- numpy math library
- xarray nc file processing
- setuptools
- pandas data processing
- scipy scientific computing library
- scikit-learn scientific computing library
- mpi4py multi-process communication library
- torchsummary model analysis library
- deepspeed large-model training framework
Command
> pip install numpy xarray setuptools pandas scipy scikit-learn mpi4py torchsummary deepspeed
For Huawei Ascend NPUs the steps are the same as for CUDA, except that the CUDA toolkit is replaced by the CANN toolkit.
- (Suggestion) Install Anaconda
Same as CUDA
- Install CANN toolkit
Download the CANN toolkit, CANN kernels, and CANN NNAL at https://www.hiascend.com/developer/download/community/result
- cann toolkit CANN development kit package
- cann kernels CANN operator package
- cann nnal CANN accelerator library
The development kit must be chosen to match the PyTorch version; PyTorch 1.12 supports at most CANN 8.0.RC2.beta1
After the download completes, install
> chmod +x Ascend-<...>.run
> ./Ascend-<...>.run
After installation, verify with
> npu-smi info
- Install pytorch-npu
First install the CPU version of PyTorch
> pip install torch==1.12.1+cpu --index-url https://download.pytorch.org/whl/cpu
Then install torch-npu; the installed version must match the CANN and PyTorch versions, per the compatibility table at https://gitee.com/ascend/pytorch
> pip install pyyaml
> pip install setuptools
> pip install torch-npu==1.11.0.post14
- Install other dependency libraries
Same as CUDA
- Initialize environment variables
The environment variable script is placed in the installation directory when CANN is installed
> source /usr/local/Ascend/ascend-toolkit/set_env.sh
Use the following code to test whether the installation succeeded (enter line by line; no errors should occur)
import torch
import torch_npu
torch.npu.is_available() # Output True
A training launch command looks like the following, for example
> NCCL_SOCKET_IFNAME=enp6s0 nohup ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 3 --num_gpus 2 /home/LOEM/code/models/train/train_deepspeed.py -n 4 -m AFNONET -lf MLL1 -mix False -b 1 -e 100 -lr 5e-4 -decay 0.01 -d 0 -hdim 768 -sttr 2010-01-01 -edtr 2019-12-31 -stte 2020-01-01 -edte 2020-12-31 -lc 150 -nw 0 -tag exp2 > afnonet.out 2>&1 &
The training command is three nested commands: the outermost layer is 'nohup', the middle layer is 'ds', and the innermost layer is 'train_deepspeed.py'
nohup runs the training program in the background so that it keeps running after the bash session closes. The command is as follows; the output is redirected to outfile.out
nohup <command> > outfile.out 2>&1 &
ds is the DeepSpeed launcher; it creates multiple processes according to the number of hosts and devices. Every process executes the same code, and the processes communicate with each other during training
For a single-machine multi-card program, use the following command
ds --num_gpus=8 <command>
For a multi-machine, multi-card program, use the following command
ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 0 --num_gpus 2 <command>
- hostfile : Node description file
- no_ssh : Do not use SSH for automatic login
- master_addr : Main server IP
- master_port : Main server port
- node_rank : Number of the current node, starting from 0; the master server is node 0
- num_gpus : Number of devices per node
- <command> : Program commands running on each device
The hostfile format is
worker-1 slots=4
worker-2 slots=4
It records the nodes and the number of devices on each node
For multi-machine training, the network interface that NCCL uses for communication must be specified on each host; for example, 'NCCL_SOCKET_IFNAME=enp6s0' here selects the 'enp6s0' interface
The actual training program is train_deepspeed.py; see models/train/train_deepspeed.py for details. Its parameters include
- -n/--nodes Number of nodes (servers); 1 for single-machine multi-card
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -lf/--loss_func Loss function (a loss sketch follows this parameter list)
- MSE Mean squared error
- LMSE Mean squared error with latitude weighting
- LL1 Mean absolute error with latitude weighting
- MLL1 Mean absolute error with sea-land mask and latitude weighting
- -dev/--device device
- cuda NVIDIA CUDA device
- npu Huawei Ascend device
- cpu Do not use acceleration devices
- -mix/--mixing_precision Whether to use mixed-precision training
- -b/--batch_size Training batch size
- -e/--epochs Number of training epochs
- -lr/--learning_rate Learning rate
- -decay/--decay Weight decay penalty
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude grid size
- -lon/--longitude Longitude grid size
- -hdim/--hidden_dim Hidden layer size
- -sttr/--start_date_train Training set start date
- -edtr/--end_date_triain Training set end date
- -stte/--start_date_test Test set start date
- -edte/--end_date_test Test set end date
- -lc/--load_cache Cache size; the cache unit is one day of data
- -shm/--shm_cache Whether to use shared memory for caching; has no effect if load_cache is disabled
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
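For reference, the idea behind the masked latitude-weighted losses is to zero out land points with the sea-land mask and weight each grid row by the cosine of its latitude. A minimal L1 sketch of that idea (the project's MaskedLatitudeL1Loss.py may differ in detail):
import torch

def masked_latitude_l1(pred, target, sea_mask, lat_deg):
    # pred/target: (Batch, Channel, H, W); sea_mask: (H, W), 1 over ocean, 0 over land
    # lat_deg: (H,) latitude in degrees of each grid row
    w = torch.cos(torch.deg2rad(lat_deg))
    w = w / w.mean()                        # normalize weights to mean 1
    err = (pred - target).abs() * sea_mask  # mask out land points
    err = err * w.view(1, 1, -1, 1)         # weight rows by latitude
    return err.sum() / (sea_mask.sum() * pred.shape[0] * pred.shape[1])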
After training completes, the trained weights are stored in data/pth/<tag>/global_step<num>/mp_rank_00_model_states.pt
Use validate.py to evaluate the model by computing MAE and ACC.
Since no backpropagation is involved, validation requires far less computing resources and can run on a personal PC, although it may take a very long time.
> python validate.py -pt /home/LOEM/data/pth/exp1/global_step11425/mp_rank_00_model_states.pt -tag exp1
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ Model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast Model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -b/--batch_size Batch training size
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -l/--lead Forecast lead length to validate, generally measured in days
- -stvl/--start_date_val Verification set start date
- -edvl/--end_date_val Verification set end date
- -lc/--load_cache Cache size; the cache unit is one day of data
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
Since the weight file does not include the model structure, the relevant parameters must be used to specify the model and initialize it before the weights are loaded
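A sketch of loading such a weight file (DeepSpeed model-state checkpoints usually keep the state dict under a 'module' key; the stand-in model is a placeholder for the real architecture constructed with the training-time parameters):
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; build the real model with the same parameters used in training
ckpt = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
state = ckpt.get("module", ckpt)  # DeepSpeed usually stores the weights under 'module'
model.load_state_dict(state)
model.eval()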
The validation results are stored in data/statistic/<tag>/<model><steps>.json and include the following fields
{
"val_l1" : [], // MAE
"val_acc" : [], // ACC
"mean_time" : [], // The average time for inferring a single sample
"train_loss_list" : [], // Training loss reduction process
"test_loss_list" : [], // Test the loss reduction process. If the training process is not tested, this item is empty
}
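For reference, ACC is the correlation between the predicted and observed anomalies relative to the daily-mean climatology stored under statistic/mean. A minimal unweighted sketch; the project's MaskedLatitudeACCLoss.py additionally applies the sea-land mask and latitude weights:
import numpy as np

def acc(pred, obs, clim):
    # pred, obs, clim: arrays on the same grid; clim is the daily-mean climatology
    p = pred - clim  # predicted anomaly
    o = obs - clim   # observed anomaly
    return (p * o).sum() / np.sqrt((p ** 2).sum() * (o ** 2).sum())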
Use predict.py for inference: based on one or more (model-dependent) ocean environment samples from different dates, it generates predictions for the next N days and stores them in nc-format files.
Since no backpropagation is involved, inference requires little computing resources and can run on a personal PC.
> python predict.py -m PANGU -b 1 -d 0 -il 1 -ol 1 -hdim 192 -pt "E:/LOEM/data/pth/exp3/global_step18280-1/mp_rank_00_model_states.pt" -tag exp3
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -b/--batch_size Batch training size
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -s/--save_lead Which forecast frames to save, usually in days; for example, [3, 5, 8, 10] predicts the next 10 days and saves the results for days 3, 5, 8, and 10
- -st/--start_date Predict starting date
- -ed/--end_date Prediction end date, i.e. the last day for which data is loaded
- -lc/--load_cache Cache size; the cache unit is one day of data
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
The inference results are saved under data/pred/<tag>/
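The result files are ordinary nc files and can be inspected with xarray, for example (the path follows the directory layout above, with /data as the assumed root):
import xarray as xr

ds = xr.open_dataset("/data/pred/exp1/11425/gmef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc")
print(ds)  # lists variables, coordinates, and attributes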
A pt-format file contains only weights, not the model structure. export.py exports an onnx-format file containing both the weights and the model structure; onnx files can run across platforms and environments and can be deployed without Python or PyTorch (a minimal export sketch follows the parameter list below)
Not every model can be exported to onnx; it depends on whether the operators the model uses are supported. For example, older versions do not support the fast Fourier transform, so AFNONet and AIGOMS cannot be exported to onnx
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
- -ox/--onnx_file Export file path
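For reference, the core of such an export is torch.onnx.export. A minimal sketch with a stand-in model and an assumed input shape (export.py builds the real model and loads its weights first):
import torch
import torch.nn as nn

# Stand-in model; replace with the real, weight-loaded model
model = nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 4, 361, 720)  # assumed (Batch, Channel, Lat, Lon) input shape
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,  # FFT operators used by AFNONet/AIGOMS are not supported in older opsets
)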