nnUNetV2 Modification Notes - D11

Network Training

Today I retrained CAUNet. Fold 4 was the main problem: I reran it several times and in the end it came out fine again. No idea why.
[Figure: CAUNet training progress curves (progress.webp)]
The curves look decent. What I don't get is why the modified networks are all this unstable: the unmodified network's Dice climbs quickly right from the start, while the modified ones start out very low, though after 100+ epochs they look fine too. CAUNet has finished training, but there's no time left on Friday, so I'll run validation when I'm back on Monday. My guess is it's the learning rate or the optimizer; I'll test that when I have time. A possible setup is sketched below.
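
A minimal sketch of such an experiment, assuming nnU-Net v2's trainer and module layout. The trainer name and the choice of Adam at 1e-3 are my own assumptions, not anything nnU-Net ships:

```python
# Hypothetical custom trainer: lower the initial LR and swap SGD for Adam,
# to test whether the modified networks are just sensitive to the defaults.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer
from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler


class nnUNetTrainer_AdamLowLR(nnUNetTrainer):  # hypothetical name
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.initial_lr = 1e-3  # nnU-Net's default is 1e-2

    def configure_optimizers(self):
        # replaces the default SGD(momentum=0.99, nesterov=True)
        optimizer = torch.optim.Adam(self.network.parameters(),
                                     lr=self.initial_lr,
                                     weight_decay=self.weight_decay)
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr,
                                       self.num_epochs)
        return optimizer, lr_scheduler
```

If the file sits alongside the stock trainers under nnunetv2/training/nnUNetTrainer/, it should be selectable with `-tr nnUNetTrainer_AdamLowLR` when launching nnUNetv2_train.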

Next I had planned to train SAUNet, but every run of this network ends with a NaN loss, and it's the same across several folds.
[Figure: SAUNet training progress curves (progress.webp)]

Training log excerpt (fold 0):


```
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################


This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [128, 128, 128], 'median_image_size_in_voxels': [139.0, 169.0, 138.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True, True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.saunet.SAPlainConvUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'n_conv_per_stage': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': False}

These are the global plan.json settings:
{'dataset_name': 'Dataset043_BraTS2019_sa', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [139, 169, 138], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 21113.0, 'mean': 542.999755859375, 'median': 343.0, 'min': 0.0, 'percentile_00_5': 36.0, 'percentile_99_5': 8810.0, 'std': 929.7076416015625}, '1': {'max': 18011.0, 'mean': 637.3016967773438, 'median': 398.0, 'min': 0.0, 'percentile_00_5': 35.0, 'percentile_99_5': 8077.0, 'std': 934.3526611328125}, '2': {'max': 31404.0, 'mean': 929.93505859375, 'median': 629.0, 'min': 0.0, 'percentile_00_5': 94.0, 'percentile_99_5': 16792.0, 'std': 1901.22119140625}, '3': {'max': 29422.0, 'mean': 655.0728149414062, 'median': 409.0, 'min': 0.0, 'percentile_00_5': 65.0, 'percentile_99_5': 16095.0, 'std': 1809.3948974609375}}}

2024-06-28 20:55:38.787769: unpacking dataset...
2024-06-28 20:55:41.889807: unpacking done...
2024-06-28 20:55:41.891027: do_dummy_2d_data_aug: False
2024-06-28 20:55:41.892254: Using splits from existing split file: /home/wzh-dl/nnunet/nnUNet_preprocessed/Dataset043_BraTS2019_sa/splits_final.json
2024-06-28 20:55:41.892995: The split file contains 5 splits.
2024-06-28 20:55:41.893032: Desired fold for training: 0
2024-06-28 20:55:41.893061: This split has 268 training and 67 validation cases.
2024-06-28 20:55:41.902031: Unable to plot network architecture:
2024-06-28 20:55:41.902077: No module named 'hiddenlayer'
2024-06-28 20:55:41.912451:
2024-06-28 20:55:41.912497: Epoch 0
2024-06-28 20:55:41.912628: Current learning rate: 0.01
2024-06-28 21:00:15.003617: train_loss 0.3861
2024-06-28 21:00:15.006468: val_loss 0.319
2024-06-28 21:00:15.006593: Pseudo dice [np.float32(0.0), np.float32(0.0), np.float32(0.0)]
2024-06-28 21:00:15.007578: Epoch time: 273.09 s
2024-06-28 21:00:15.007675: Yayy! New best EMA pseudo Dice: 0.0
2024-06-28 21:00:19.410886:
2024-06-28 21:00:19.411277: Epoch 1
2024-06-28 21:00:19.411479: Current learning rate: 0.00991
2024-06-28 21:04:29.235484: train_loss 0.2494
2024-06-28 21:04:29.241981: val_loss 0.2322
2024-06-28 21:04:29.242288: Pseudo dice [np.float32(0.0), np.float32(0.0), np.float32(0.0)]
2024-06-28 21:04:29.243144: Epoch time: 249.83 s
2024-06-28 21:04:32.809436:
2024-06-28 21:04:32.809648: Epoch 2
2024-06-28 21:04:32.809788: Current learning rate: 0.00982
```

Very strange. Note that in the excerpt above the pseudo Dice stays pinned at 0 for the first epochs even while the loss decreases. For now I'm skipping SAUNet and training CBAM instead; it's through more than two folds and looks okay so far. When I revisit SAUNet, a first debugging pass might look like the sketch below.
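
A hypothetical first pass at the NaN hunt (the class and hook here are my own, not part of nnU-Net): enable autograd anomaly detection and fail fast when any module emits a non-finite activation, so the offending layer shows up in the traceback:

```python
# Hypothetical debug trainer for localizing NaN/Inf during training.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_DebugNaN(nnUNetTrainer):  # hypothetical name
    def on_train_start(self):
        super().on_train_start()  # builds self.network, dataloaders, etc.
        # raises at the backward op that first produces NaN/Inf (slow, debug only)
        torch.autograd.set_detect_anomaly(True)

        def check_finite(module, inputs, output):
            # fail fast on the first layer whose forward output is non-finite
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(
                    f"non-finite activations after {module.__class__.__name__}")

        for m in self.network.modules():
            m.register_forward_hook(check_finite)
```

If the hook fires inside the new attention blocks, the modification itself is the suspect; if anomaly detection trips first during backward, the loss or an exploding gradient is more likely (the stock trainer already clips the gradient norm, as far as I remember, so a NaN that survives clipping usually originates in the forward pass or the loss).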

Also, I suddenly noticed something today: the 2080 Ti instances on Auto-dl are likewise backed by a 12-core CPU, so why are they so much faster than my machine? It's maddening. One thing worth checking is the number of data-augmentation workers; see the sketch below. I'll let CBAM run over the weekend and validate everything together on Monday. That's it for this week.
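
One guess, still to be verified: nnU-Net v2 sizes its background data-augmentation worker pool from the nnUNet_n_proc_DA environment variable, so the remote instances may simply be running more workers (or reading from faster disks). A quick check and override before launching training:

```python
# Inspect/set the data-augmentation worker count before starting nnU-Net.
# '12' here is just a guess matching the 12 available CPU cores.
import os

os.environ.setdefault('nnUNet_n_proc_DA', '12')
print('DA workers:', os.environ['nnUNet_n_proc_DA'])
```

If the epoch time barely changes after raising this, the bottleneck is more likely disk or preprocessing throughput than the worker count.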