Large Ocean Environmental Model (LOEM is the internal codename)
|-data_process Data processing: downloads ocean and atmospheric data from the Copernicus services website
|-|-download.py Download program
|-|-mean.py Calculates the daily mean of all data, used to compute the ACC metric
|-|-rename.py Renames the variables stored in nc files
|-|-statistic.py Calculates statistical measures of the data, including mean, standard deviation, and maximum and minimum values, used for standardization and normalization
|-models
|-|-core Model Core
|-|-|-graph The processing core of graph neural networks
|-|-|-|-graph.py Definition and processing methods of data structure for graphs
|-|-|-|-Icosahedron.py Definition and processing methods for the icosahedron data structure
|-|-|-AdaptiveFourierNeuralBlock.py Adaptive Fourier neural operator block, processing (Batch, Height, Width, Channel) data
|-|-|-AdaptiveFourierNeuralBlock1D.py Adaptive Fourier neural operator block, processing (Batch, Layer, Channel) data
|-|-|-AdaptiveFourierNeuralOperator.py Adaptive Fourier Neural Operator
|-|-|-AdaptiveFourierNeuralOperator1D.py Adaptive Fourier Neural Operator
|-|-|-AFNONet.py Adaptive Fourier Neural Operator network, https://doi.org/10.48550/arXiv.2208.05419
|-|-|-AIGOMS.py AI Global Ocean Simulation System, https://doi.org/10.48550/arXiv.2308.03152
|-|-|-ChangeChannel.py Channel dimension reduction block
|-|-|-CubePatchEmbedding.py 3D patch embedding block
|-|-|-CubePatchMerging.py 3D patch merging block
|-|-|-CubePatchUnembedding.py 3D patch unembedding block
|-|-|-CubePatchUnmerging.py 3D patch unmerging block
|-|-|-DCNBlock.py Deformable convolution block
|-|-|-DCNv4.py Deformable convolution operator
|-|-|-DownBlock.py Spatial downsampling block
|-|-|-DownSampling.py Channel downsampling block
|-|-|-DropPath.py DropPath regularization
|-|-|-EarthAttention.py Earth-specific multi-head attention operator
|-|-|-EarthSpecificBlock.py Earth-specific multi-head attention block
|-|-|-FUXI.py FuXi model, https://doi.org/10.1038/s41612-023-00512-1
|-|-|-GraphCast.py GraphCast graph-based prediction model, https://doi.org/10.1126/science.adi2336
|-|-|-GraphEmbedding.py Embedding the grid data into a graph
|-|-|-GraphNeuralBlock.py Graph neural network block
|-|-|-GraphUnembedding.py Graph unembedding, mapping the graph back onto the data grid
|-|-|-LOEM.py The core prediction model being developed in this project
|-|-|-MaskedLatitudeACCLoss.py ACC operator with mask and latitude weight
|-|-|-MaskedLatitudeL1Loss.py L1 loss function with mask and latitude weight
|-|-|-MaskedLatitudeL2Loss.py L2 loss function with mask and latitude weight
|-|-|-MLP.py Multilayer perceptron
|-|-|-PANGU.py Pangu model, https://doi.org/10.1038/s41586-023-06185-3
|-|-|-PatchEmbedding.py 2D patch embedding
|-|-|-PatchMerge.py 2D patch merging
|-|-|-PatchUnembedding.py 2D patch unembedding
|-|-|-PatchUnmerge.py 2D patch unmerging
|-|-|-SWINRNNPlus.py SwinRNN+, http://dx.doi.org/10.48448/zn7f-fc64
|-|-|-SwinTransformerBlock.py SwinTransformer block
|-|-|-SwinTransformerBlockv2.py SwinTransformer v2 block
|-|-|-UpBlock.py Spatial upsampling block
|-|-|-UpSampling.py Channel upsampling block
|-|-|-WindowAttention.py Window multi-head attention operator
|-|-|-WindowAttentionv2.py Window multi-head attention operator, v2
|-|-loader Data Loader
|-|-|-distributedSharedMemoryCache.py Shared-memory caching for ocean environment data; only effective on Linux
|-|-|-ncLoader.py Data loader for ocean environment data files in nc format; should be used together with DistributedSampler
|-|-test Unit Testing
|-|-train
|-|-|-export.py Exports pt/pth weight files as onnx files containing both weights and model structure
|-|-|-loem_optimize_deepspeed.py Searches for optimal LOEM structural parameters and hyperparameters via Bayesian optimization, using the DeepSpeed framework
|-|-|-loem_optimize.py Searches for optimal LOEM structural parameters and hyperparameters via Bayesian optimization
|-|-|-npu_convert.py Converts pt weight files saved in NPU internal format to generic weight files
|-|-|-predict.py Predicts future fields for a given period using a weight file
|-|-|-train_deepspeed.py Trains models with the DeepSpeed framework; it can train all the models listed above, so the other training scripts are effectively deprecated
|-|-|-train_local.py Single-machine multi-card training, suitable for restricted environments without open ports
|-|-|-train.py Distributed training
|-|-|-validate.py Model validation
|-|-visualization Visualization
|-|-|-loss.py Plots the training-loss curves of models
|-|-|-map.py Plots comparisons of model prediction results
|-|-|-plot.py Plots comparisons of model evaluation metrics
|-|-config.py Configuration file
|-|-utils.py Generic utility methods
|-scripts Script directory
|-|-update_code.bat Packaging script: builds the code into a Docker image and publishes it to Alibaba Cloud
|-web
|-|-templates
|-|-|-index.html Homepage Template
|-|-serve.py Flask service program; run 'python serve.py' to start the web service
|-|-setup.bat Legacy service script
- A model generally has a three-layer structure: the outermost layer is the model itself; the model consists of one or more blocks, and each block is composed of one or more operators. Many blocks and operators are reused across models, such as MLP or upsampling, while others are specific to a single model, such as the Earth-specific attention blocks (see the sketch below).
- Unit testing does not cover all code; results are tested only in critical or logically complex areas, so undiscovered bugs or logic errors cannot be ruled out.
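To illustrate the three-layer convention, here is a minimal PyTorch sketch (toy classes for illustration, not the project's actual code): an MLP operator, a block that wraps it with normalization and a residual connection, and a model that stacks blocks.
import torch.nn as nn

# Operator: the smallest unit, e.g. a two-layer MLP (cf. MLP.py)
class MLP(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
    def forward(self, x):
        return self.net(x)

# Block: composes one or more operators, here with a norm and a residual connection
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = MLP(dim, dim * 4)
    def forward(self, x):
        return x + self.mlp(self.norm(x))

# Model: the outermost layer, a stack of blocks
class Model(nn.Module):
    def __init__(self, dim=768, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x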
Because the project involves many directories, functions do not each get a separately specified directory; instead a unified root directory serves as the data directory and stores all data files in an orderly manner. The data root directory '/' is arranged as follows
|- orig Original data directory
|-|- 0000 Static file directory
|-|-|- tf_0.25x0.25_0000_00_00.nc Sea-land mask and sea level height
|-|- 2010 Data file directory, named by year
|-|-|- gaef_<resolution>x<resolution>_<year>_<month>_<day>.nc Atmospheric data file
|-|-|- gmef_<resolution>x<resolution>_<year>_<month>_<day>.nc Ocean data file
|- pred Inference result directory; each model's inference results are stored here
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- 11425 Named by the number of training steps the inference model had run
|-|-|-|- gaef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Atmospheric prediction result; the file name encodes the initialization date, forecast date, and model
|-|-|-|- gmef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc Ocean prediction result; the file name encodes the initialization date, forecast date, and model
|- pth Weight directory
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- global_step11425 Named by the number of training steps the model had run
|-|-|-|- mp_rank_00_model_states.pt Model weight file
|-|-|-|- <model><steps>.onnx Model export file
|- statistic Statistics directory, containing sample statistics and validation metrics
|-|- exp1 Subdirectories are divided by experiment tag
|-|-|- <model><steps> Statistical results of the model
|-|-|- <model><steps> Statistical results of the model
|-|- mean Daily mean data, used to compute the models' ACC metric
|-|-|- gaef_0.25x0.25_mean_01_01.nc Atmospheric daily mean data
|-|-|- gmef_0.25x0.25_mean_01_01.nc Ocean daily mean data
|-|- gmef_min_features.nc Minimum values of the ocean features
|-|- gmef_max_features.nc Maximum values of the ocean features
|-|- gmef_mean_features.nc Mean of the ocean features
|-|- gmef_std_features.nc Standard deviation of the ocean features
|-|- gaef_min_features.nc Minimum values of the atmospheric features
|-|- gaef_max_features.nc Maximum values of the atmospheric features
|-|- gaef_mean_features.nc Mean of the atmospheric features
|-|- gaef_std_features.nc Standard deviation of the atmospheric features
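As an illustration of the file-naming convention and of how the statistics files support standardization, here is a minimal xarray sketch (the helper function is hypothetical, and the arithmetic assumes the statistics files share variable names with the data files):
import xarray as xr

# Hypothetical helper: builds the path of a daily data file under the data root
def orig_path(root, kind, res, year, month, day):
    # kind is 'gaef' (atmosphere) or 'gmef' (ocean)
    return f"{root}/orig/{year:04d}/{kind}_{res}x{res}_{year:04d}_{month:02d}_{day:02d}.nc"

root = "/data"  # assumed data root
sample = xr.open_dataset(orig_path(root, "gmef", 0.25, 2010, 1, 1))
mean = xr.open_dataset(f"{root}/statistic/gmef_mean_features.nc")
std = xr.open_dataset(f"{root}/statistic/gmef_std_features.nc")
normalized = (sample - mean) / std  # z-score standardization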
Training generally uses a single-machine or multi-machine, multi-card environment under Linux; none of the following covers Windows.
- (Suggestion) Install Anaconda
Download the installation script from https://www.anaconda.com/download/success
Install
> bash <conda-installer-name>-latest-Linux-x86_64.sh
If the script cannot be run, remember to grant it execute permission first
> chmod +x <conda-installer-name>-latest-Linux-x86_64.sh
Create a new Python environment
> conda init
> conda create -n <env-name> python=3.10
> conda activate <env-name>
- Install CUDA toolkit
Choose an installation method suitable for your environment at https://developer.nvidia.com/cuda-11-4-0-download-archive
CUDA 11.4 is a commonly used version compatible with the Python 3.10 environment above. Choose the appropriate OS, architecture, and version, install according to the commands, and verify after installation with
> nvidia-smi
- Install PyTorch+CUDA version
Choose the appropriate version to install at https://pytorch.org/get-started/previous-versions/
The project currently uses torch 1.12.1+cu113, installed with
> pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
Installing with conda is not recommended: if conda cannot reach the CUDA build of PyTorch (for example, due to network restrictions), it silently installs the CPU version instead
After successful installation, use the following command to enter the Python environment
> python
Use the following code to test whether the installation succeeded (enter line by line; no errors should occur)
import torch
torch.cuda.is_available() # Output True
- Install other dependency libraries
If you need to download data from the Copernicus website and process it, install
- numpy math library
- xarray nc file processing
- setuptools
- copernicusmarine Copernicus Marine Service download interface
- netCDF4 nc file processing engine
Command
> pip install numpy xarray setuptools copernicusmarine netCDF4
If you only need to train, install
- numpy math library
- xarray nc file processing
- setuptools
- pandas data processing
- scipy scientific computing library
- scikit-learn scientific computing library
- mpi4py multi-process communication library
- torchsummary model analysis library
- deepspeed large-model training framework
Command
> pip install numpy xarray setuptools pandas scipy scikit-learn mpi4py torchsummary deepspeed
For Huawei Ascend NPUs the steps are the same as for CUDA, except that the CUDA toolkit is replaced by the CANN toolkit.
- (Suggestion) Install Anaconda
Same as CUDA
- Install CANN toolkit
Download the CANN toolkit, CANN kernels, and CANN NNAL at https://www.hiascend.com/developer/download/community/result
- cann toolkit CANN development kit package
- cann kernels CANN operator package
- cann nnal CANN accelerator library
The development kit must be chosen to match the PyTorch version; PyTorch 1.12 supports at most CANN 8.0.RC2.beta1
After the download completes, install
> chmod +x Ascend-<...>.run
> ./Ascend-<...>.run
After installation, verify with
> npu-smi info
- Install pytorch-npu
First install the CPU version of PyTorch
> pip install torch==1.12.1+cpu --index-url https://download.pytorch.org/whl/cpu
Then install torch-npu; the installed version must match the CANN and PyTorch versions, per the compatibility table at https://gitee.com/ascend/pytorch
> pip install pyyaml
> pip install setuptools
> pip install torch-npu==1.11.0.post14
- Install other dependency libraries
Same as CUDA
- Initialize environment variables
The environment variable script is placed in the installation directory when CANN is installed
> source /usr/local/Ascend/ascend-toolkit/set_env.sh
Use the following code to test whether the installation succeeded (enter line by line; no errors should occur)
import torch
import torch_npu
torch.npu.is_available() # Output True
A training launch command looks like the following, for example
> NCCL_SOCKET_IFNAME=enp6s0 nohup ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 3 --num_gpus 2 /home/LOEM/code/models/train/train_deepspeed.py -n 4 -m AFNONET -lf MLL1 -mix False -b 1 -e 100 -lr 5e-4 -decay 0.01 -d 0 -hdim 768 -sttr 2010-01-01 -edtr 2019-12-31 -stte 2020-01-01 -edte 2020-12-31 -lc 150 -nw 0 -tag exp2 > afnonet.out 2>&1 &
The training command is three nested commands: the outermost layer is 'nohup', the middle layer is 'ds', and the innermost layer is 'train_deepspeed.py'
nohup runs the training program in the background so that it keeps running after the bash session closes. The command is as follows; the output is redirected to outfile.out
nohup <command> > outfile.out 2>&1 &
ds is the DeepSpeed launcher; it creates multiple processes according to the number of hosts and devices. Every process executes the same code, and the processes communicate with each other during training
For a single-machine multi-card program, use the following command
ds --num_gpus=8 <command>
For a multi-machine, multi-card program, use the following command
ds --hostfile ./hostfile --no_ssh --master_addr 10.32.1.71 --master_port 25354 --node_rank 0 --num_gpus 2 <command>
- hostfile : Node description file
- no_ssh : Do not use SSH for automatic login
- master_addr : Main server IP
- master_port : Main server port
- node_rank : Number of the current node, starting from 0; the master server is node 0
- num_gpus : Number of devices per node
- <command> : Program commands running on each device
The hostfile format is
worker-1 slots=4
worker-2 slots=4
It records the nodes and the number of devices on each node
For multi-machine training, the network interface that NCCL uses for communication must be specified on each host; for example, 'NCCL_SOCKET_IFNAME=enp6s0' here selects the 'enp6s0' interface
The actual training program is train_deepspeed.py; see models/train/train_deepspeed.py for details. Its parameters include
- -n/--nodes Number of nodes (servers); 1 for single-machine multi-card
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -lf/--loss_func Loss function (a loss sketch follows this parameter list)
- MSE Mean squared error
- LMSE Mean squared error with latitude weighting
- LL1 Mean absolute error with latitude weighting
- MLL1 Mean absolute error with sea-land mask and latitude weighting
- -dev/--device device
- cuda NVIDIA CUDA device
- npu Huawei Ascend device
- cpu Do not use acceleration devices
- -mix/--mixing_precision Whether to use mixed-precision training
- -b/--batch_size Training batch size
- -e/--epochs Number of training epochs
- -lr/--learning_rate Learning rate
- -decay/--decay Weight decay penalty
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude grid size
- -lon/--longitude Longitude grid size
- -hdim/--hidden_dim Hidden layer size
- -sttr/--start_date_train Training set start date
- -edtr/--end_date_triain Training set end date
- -stte/--start_date_test Test set start date
- -edte/--end_date_test Test set end date
- -lc/--load_cache Cache size; the cache unit is one day of data
- -shm/--shm_cache Whether to use shared memory for caching; has no effect if load_cache is disabled
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
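For reference, the idea behind the masked latitude-weighted losses is to zero out land points with the sea-land mask and weight each grid row by the cosine of its latitude. A minimal L1 sketch of that idea (the project's MaskedLatitudeL1Loss.py may differ in detail):
import torch

def masked_latitude_l1(pred, target, sea_mask, lat_deg):
    # pred/target: (Batch, Channel, H, W); sea_mask: (H, W), 1 over ocean, 0 over land
    # lat_deg: (H,) latitude in degrees of each grid row
    w = torch.cos(torch.deg2rad(lat_deg))
    w = w / w.mean()                        # normalize weights to mean 1
    err = (pred - target).abs() * sea_mask  # mask out land points
    err = err * w.view(1, 1, -1, 1)         # weight rows by latitude
    return err.sum() / (sea_mask.sum() * pred.shape[0] * pred.shape[1])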
After training completes, the trained weights are stored in data/pth/<tag>/global_step<num>/mp_rank_00_model_states.pt
Use validate.py to evaluate the model by computing MAE and ACC.
Since no backpropagation is involved, validation requires far less computing resources and can run on a personal PC, although it may take a very long time.
> python validate.py -pt /home/LOEM/data/pth/exp1/global_step11425/mp_rank_00_model_states.pt -tag exp1
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ Model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast Model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -b/--batch_size Batch training size
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -l/--lead Forecast lead length to validate, generally measured in days
- -stvl/--start_date_val Verification set start date
- -edvl/--end_date_val Verification set end date
- -lc/--load_cache Cache size; the cache unit is one day of data
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
Since the weight file does not include the model structure, the relevant parameters must be used to specify the model and initialize it before the weights are loaded
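A sketch of loading such a weight file (DeepSpeed model-state checkpoints usually keep the state dict under a 'module' key; the stand-in model is a placeholder for the real architecture constructed with the training-time parameters):
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; build the real model with the same parameters used in training
ckpt = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
state = ckpt.get("module", ckpt)  # DeepSpeed usually stores the weights under 'module'
model.load_state_dict(state)
model.eval()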
The validation results are stored in data/statistic/<tag>/<model><steps>.json and include the following fields
{
"val_l1" : [], // MAE
"val_acc" : [], // ACC
"mean_time" : [], // The average time for inferring a single sample
"train_loss_list" : [], // Training loss reduction process
"test_loss_list" : [], // Test the loss reduction process. If the training process is not tested, this item is empty
}
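For reference, ACC is the correlation between the predicted and observed anomalies relative to the daily-mean climatology stored under statistic/mean. A minimal unweighted sketch; the project's MaskedLatitudeACCLoss.py additionally applies the sea-land mask and latitude weights:
import numpy as np

def acc(pred, obs, clim):
    # pred, obs, clim: arrays on the same grid; clim is the daily-mean climatology
    p = pred - clim  # predicted anomaly
    o = obs - clim   # observed anomaly
    return (p * o).sum() / np.sqrt((p ** 2).sum() * (o ** 2).sum())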
Use predict.py for inference: based on one or more (model-dependent) ocean environment samples from different dates, it generates predictions for the next N days and stores them in nc-format files.
Since no backpropagation is involved, inference requires little computing resources and can run on a personal PC.
> python predict.py -m PANGU -b 1 -d 0 -il 1 -ol 1 -hdim 192 -pt "E:/LOEM/data/pth/exp3/global_step18280-1/mp_rank_00_model_states.pt" -tag exp3
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -b/--batch_size Batch training size
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -s/--save_lead Which forecast frames to save, usually in days; for example, [3, 5, 8, 10] predicts the next 10 days and saves the results for days 3, 5, 8, and 10
- -st/--start_date Predict starting date
- -ed/--end_date Prediction end date, i.e. the last day for which data is loaded
- -lc/--load_cache Cache size; the cache unit is one day of data
- -nw/--num_workers Number of data-loading processes; must be 0 when using shared memory, otherwise an error is raised
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
The inference results are saved under data/pred/<tag>/
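The result files are ordinary nc files and can be inspected with xarray, for example (the path follows the directory layout above, with /data as the assumed root):
import xarray as xr

ds = xr.open_dataset("/data/pred/exp1/11425/gmef_0.5x0.5_2021_07_02_by_PANGU_at_2021_07_01.nc")
print(ds)  # lists variables, coordinates, and attributes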
A pt-format file contains only weights, not the model structure. export.py exports an onnx-format file containing both the weights and the model structure; onnx files can run across platforms and environments and can be deployed without Python or PyTorch (a minimal export sketch follows the parameter list below)
Not every model can be exported to onnx; it depends on whether the operators the model uses are supported. For example, older versions do not support the fast Fourier transform, so AFNONet and AIGOMS cannot be exported to onnx
- -m/--model Optional models
- PANGU Pangu model https://doi.org/10.1038/s41586-023-06185-3
- FUXI FuXi model https://doi.org/10.1038/s41612-023-00512-1
- AIGOMS AI-GOMS model https://doi.org/10.48550/arXiv.2308.03152
- AFNONET Adaptive Fourier Model https://doi.org/10.48550/arXiv.2208.05419
- SWINRNNP SwinRNN+ model http://dx.doi.org/10.48448/zn7f-fc64
- GRAPHCAST GraphCast model https://doi.org/10.1126/science.adi2336
- OURS/LOEM LOEM model, this project's model under development
- -c/--cpu Whether to use CPU or not
- -d/--debugger Whether to debug with limited resources
- 0 No debugging
- 1 Remove all learning blocks and run in minimal mode, for debugging the framework
- 2 Use only two blocks, for debugging operators
- -il/--input_length Sample input length
- -ol/--output_length Sample output length
- -lat/--latitude Latitude width
- -lon/--longitude Longitude width
- -hdim/--hidden_dim Hidden layer size
- -db/--database Root directory for data storage
- -tag/--tag The label for this training
- -pt/--pt_file Weight file path
- -ox/--onnx_file Export file path
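For reference, the core of such an export is torch.onnx.export. A minimal sketch with a stand-in model and an assumed input shape (export.py builds the real model and loads its weights first):
import torch
import torch.nn as nn

# Stand-in model; replace with the real, weight-loaded model
model = nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 4, 361, 720)  # assumed (Batch, Channel, Lat, Lon) input shape
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,  # FFT operators used by AFNONet/AIGOMS are not supported in older opsets
)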