SCTNet Project Troubleshooting
- Task
- Process Log
- Running in the Old Environment
- Rebuilding the Environment
- Training
- Testing
- Speed Test
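Task
I want to try out the latest real-time segmentation code. Since this project is also built on mmsegmentation, I first want to see whether my previous environment works as-is. If it doesn't, I'll create a new one.
Process Log
Running in the Old Environment
Prerequisites: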
git clone https://github.com/xzz777/SCTNet.git
cd SCTNet/
conda activate OPENMMLAB
pip3 install timm
pip3 install einops
Create data and pretrained folders in the project root.
Create a symlink to the dataset:
ln -s /media/lcy-magic/Dataset/Segment_Dataset/ade/ ./data/
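The folder setup and symlink above can also be scripted. This is a minimal sketch, not part of the SCTNet repo; `prepare_project_dirs` is a hypothetical helper name, and it assumes the dataset folder already exists:

```python
import os
from pathlib import Path


def prepare_project_dirs(project_root, dataset_dir):
    """Create data/ and pretrained/ under project_root and link the dataset into data/."""
    root = Path(project_root)
    data_dir = root / "data"
    pretrained_dir = root / "pretrained"
    data_dir.mkdir(parents=True, exist_ok=True)
    pretrained_dir.mkdir(parents=True, exist_ok=True)

    # Mirrors `ln -s <dataset_dir>/ ./data/`: the link takes the dataset folder's name.
    source = Path(dataset_dir)
    link = data_dir / source.name
    if not link.exists() and not link.is_symlink():
        os.symlink(source, link, target_is_directory=True)
    return link
```

Calling it a second time is harmless, because the link is only created when it is not already there.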
Download the pretrained weights (I only downloaded these):
pretrained/
├── SCT-B_Pretrain.pth
├── SCT-S_Pretrain.pth
├── Teacher_SegFormer_B2_ADE.pth
└── Teacher_SegFormer_B3_ADE.pth
Test:
python tools/test.py configs\sctnet\ADE20K\sctnet-b_8x4_160k_ade.py pretrained/SCTNet-B-ADE20K.pth --eval mIoU
Error:
Traceback (most recent call last):
File "tools/test.py", line 14, in <module>
from mmcv.cnn.utils import revert_sync_batchnorm
ImportError: cannot import name 'revert_sync_batchnorm' from 'mmcv.cnn.utils' (/home/lcy-magic/anaconda3/envs/OPENMMLAB/lib/python3.8/site-packages/mmcv/cnn/utils/__init__.py)
No luck. It looks like a version problem, so I'll just follow the project's requirements.
Rebuilding the Environment
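To probe an environment for this kind of mismatch before running anything heavy, it is enough to try the same import that tools/test.py makes. A minimal sketch; `has_symbol` is a hypothetical helper, not an mmcv API:

```python
import importlib


def has_symbol(module_name, symbol):
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


if __name__ == "__main__":
    # The import that tools/test.py attempts on its line 14.
    print(has_symbol("mmcv.cnn.utils", "revert_sync_batchnorm"))
```

A `False` here means the installed mmcv does not match what the repo expects, which is what the traceback above shows for the old environment.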
conda create -n SCTNET python=3.8 -y
conda activate SCTNET
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install mmcv-full==1.6.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.11.0/index.html
pip3 install platformdirs==3.9.0
pip install timm matplotlib prettytable einops
pip install pykitti
Run:
python tools/test.py configs\sctnet\ADE20K\sctnet-b_8x4_160k_ade.py pretrained/SCTNet-B-ADE20K.pth --eval mIoU
Error:
Traceback (most recent call last):
File "tools/test.py", line 323, in <module>
main()
File "tools/test.py", line 135, in main
cfg = mmcv.Config.fromfile(args.config)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/config.py", line 340, in fromfile
cfg_dict, cfg_text = Config._file2dict(filename,
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/config.py", line 183, in _file2dict
check_file_exist(filename)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/path.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/lcy-magic/Segment_TEST/SCTNet/configssctnetADE20Ksctnet-b_8x4_160k_ade.py" does not exist
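The segment /SCTNet/configssctnetADE20Ksctnet-b_8x4_160k_ade.py is obviously wrong, and fixing it should be enough. The slashes in the command are written the wrong way round: the author presumably ran it on Windows, where \ is the path separator, whereas on Ubuntu the shell treats an unquoted \ as an escape character and drops it. After switching to /, run: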
python tools/test.py configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py pretrained/SCTNet-B-ADE20K.pth --eval mIoU
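Why the separators disappeared can be reproduced without a shell. The sketch below uses Python's `shlex`, which follows POSIX shell quoting rules, to show what the backslash version of the argument turns into, and `PureWindowsPath` to convert it to forward slashes. It is only an illustration of the cause, not something the repo ships:

```python
import shlex
from pathlib import PureWindowsPath

windows_style = r"configs\sctnet\ADE20K\sctnet-b_8x4_160k_ade.py"

# shlex follows POSIX shell rules: an unquoted backslash escapes the next
# character and is itself dropped, so the separators vanish.
as_seen_by_shell = shlex.split(windows_style)[0]

# PureWindowsPath understands "\" as a separator on any OS; as_posix()
# re-joins the parts with "/".
posix_style = PureWindowsPath(windows_style).as_posix()

print(as_seen_by_shell)  # configssctnetADE20Ksctnet-b_8x4_160k_ade.py
print(posix_style)       # configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py
```

The first printed value matches the file name in the FileNotFoundError above. With forward slashes the config path resolves, but the run still does not finish.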
Error:
Traceback (most recent call last):
File "tools/test.py", line 323, in <module>
main()
File "tools/test.py", line 221, in main
model = build_segmentor(cfg.model, test_cfg=cfg.get('test_cfg'))
File "/home/lcy-magic/Segment_TEST/SCTNet/mmseg/models/builder.py", line 48, in build_segmentor
return SEGMENTORS.build(
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 237, in build
return self.build_func(*args, **kwargs, registry=self)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
FileNotFoundError: EncoderDecoder_Distill: SCTNet: pretrain/SCT-B_Pretrain.pth can not be found.
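It says a pretrained weight file cannot be found. The author apparently keeps the pretrained weights in a pretrain folder, while mine are in pretrained. I replaced every occurrence in VS Code:
Running again produces another error: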
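The same substitution can also be scripted. This is a minimal sketch under two assumptions: the path appears in the config `.py` files as the literal text `pretrain/`, and rewriting the configs in place is acceptable. `rewrite_pretrain_dir` is a hypothetical helper name:

```python
import re
from pathlib import Path


def rewrite_pretrain_dir(config_root, old="pretrain/", new="pretrained/"):
    """Replace the folder prefix `old` with `new` in every .py file under config_root."""
    # The lookbehind keeps names such as "my_pretrain/" untouched.
    pattern = re.compile(r"(?<!\w)" + re.escape(old))
    changed = []
    for path in sorted(Path(config_root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        updated = pattern.sub(new, text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Running it twice is safe: `pretrained/` does not contain `pretrain/`, so a second pass changes nothing. Renaming the local folder to pretrain would avoid touching the configs at all.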
Traceback (most recent call last):
File "tools/test.py", line 323, in <module>
main()
File "tools/test.py", line 225, in main
checkpoint = load_checkpoint(model, args.checkpoint, map_location='cpu')
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 627, in load_checkpoint
checkpoint = _load_checkpoint(filename, map_location, logger)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 561, in _load_checkpoint
return CheckpointLoader.load_checkpoint(filename, map_location, logger)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 303, in load_checkpoint
return checkpoint_loader(filename, map_location) # type: ignore
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 322, in load_from_local
raise FileNotFoundError(f'{filename} can not be found.')
FileNotFoundError: pretrained/SCTNet-B-ADE20K.pth can not be found.
I checked, and this file is indeed not provided on Google Drive:
So I tried SCT-B_Pretrain.pth instead:
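A pre-flight check along these lines would have flagged both this missing checkpoint and the earlier bad config path before tools/test.py even started. A minimal sketch; `missing_files` is a hypothetical helper, not part of the repo:

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # The two arguments passed to tools/test.py in this post.
    required = [
        "configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py",
        "pretrained/SCTNet-B-ADE20K.pth",
    ]
    for path in missing_files(required):
        print(f"missing: {path}")
```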
python tools/test.py configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py pretrained/SCT-B_Pretrain.pth --eval mIoU
Something seems off:
The result:
Switching to:
python tools/test.py configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py pretrained/Teacher_SegFormer_B2_ADE.pth --eval mIoU
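That is even further off:
So the trained weights really are not provided, and I have to train them myself. Let's start training:
Training
The script the author provides is for distributed training, but my laptop has only one GPU, so I planned to use train.py directly: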
python tools/train.py configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py
Error:
Traceback (most recent call last):
File "tools/train.py", line 16, in <module>
from mmseg import __version__
ModuleNotFoundError: No module named 'mmseg'
I wanted to install it directly:
pip install -v -e .
But it turns out the repo has no setup.py. The README says:
So I cloned that codebase and installed from inside it:
git clone -b v0.26.0 https://github.com/open-mmlab/mmsegmentation.git
cd mmsegmentation/
pip install -v -e .
Run again:
python tools/train.py configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py
Error:
Traceback (most recent call last):
File "tools/train.py", line 245, in <module>
main()
File "tools/train.py", line 163, in main
cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/config.py", line 596, in dump
f.write(self.pretty_text)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/config.py", line 508, in pretty_text
text, _ = FormatCode(text, style_config=yapf_style, verify=True)
TypeError: FormatCode() got an unexpected keyword argument 'verify'
I searched online, and a blog post I found says the yapf version is the problem. I checked, and sure enough:
Downgrade:
pip install yapf==0.40.1
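Whether an environment needs this downgrade can be checked from Python. The sketch below assumes, based only on the fix above, that 0.40.1 is the newest yapf release that still works with this mmcv version; `parse_version` and `needs_downgrade` are hypothetical helpers:

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption taken from the fix in this post: 0.40.1 is treated as the
# newest yapf release that still works with mmcv-full 1.6.0.
LAST_KNOWN_GOOD = (0, 40, 1)


def parse_version(text):
    """Turn '0.40.2' into (0, 40, 2); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_downgrade(installed):
    """Return True if the installed version is newer than LAST_KNOWN_GOOD."""
    return parse_version(installed) > LAST_KNOWN_GOOD


if __name__ == "__main__":
    try:
        print(needs_downgrade(version("yapf")))
    except PackageNotFoundError:
        print("yapf is not installed")
```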
Running the training script again still fails:
Traceback (most recent call last):
File "tools/train.py", line 245, in <module>
main()
File "tools/train.py", line 201, in main
model = build_segmentor(
File "/home/lcy-magic/Segment_TEST/SCTNet/mmsegmentation/mmseg/models/builder.py", line 48, in build_segmentor
return SEGMENTORS.build(
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 237, in build
return self.build_func(*args, **kwargs, registry=self)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 61, in build_from_cfg
raise KeyError(
KeyError: 'EncoderDecoder_Distill is not in the models registry'
It seems the author's bash script does some special setup after all. So I had GPT help me modify that bash script as follows:
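# The config path is the first argument; a single GPU is hard-coded.
CONFIG=$1
GPUS=1
# Prepend the repo root (the parent of tools/) to PYTHONPATH so that the
# repo's own mmseg package is importable when train.py starts.
PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch \
--nproc_per_node=$GPUS \
"$(dirname "$0")"/train.py \
"$CONFIG" \
--launcher pytorch "${@:2}"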
Run it:
bash tools/dist_train.sh configs/sctnet/ADE20K/sctnet-b_8x4_160k_ade.py
Error:
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 7.79 GiB total capacity; 5.37 GiB already allocated; 270.38 MiB free; 5.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28348) of binary: /home/lcy-magic/anaconda3/envs/SCTNET/bin/python
It says CUDA memory has run out. Check live usage:
watch -n 1 nvidia-smi
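At idle it's fine, not much memory is in use:
Running the training script again, memory is indeed nearly full by the time the error appears:
I wondered whether the S variant would fit. I tried it, and it doesn't: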
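The figures in the out-of-memory message already quantify the gap. The sketch below parses the requested and free amounts out of the message text quoted earlier; the regular expressions are an assumption about this particular message format, which may differ between PyTorch versions:

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Excerpt of the error message from the failed training run.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 7.79 GiB total capacity; 5.37 GiB already allocated; "
    "270.38 MiB free; 5.75 GiB reserved in total by PyTorch)"
)


def _to_mib(value, unit):
    return float(value) * UNIT_TO_MIB[unit]


def oom_shortfall_mib(message):
    """Return (requested, free, shortfall) in MiB parsed from a CUDA OOM message."""
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    free = re.search(r"([\d.]+) (MiB|GiB) free", message)
    if requested is None or free is None:
        raise ValueError("not a recognised CUDA out-of-memory message")
    requested_mib = _to_mib(*requested.groups())
    free_mib = _to_mib(*free.groups())
    return requested_mib, free_mib, requested_mib - free_mib


if __name__ == "__main__":
    print(oom_shortfall_mib(MESSAGE))
```

For this run the request was 512 MiB against 270.38 MiB free, so the allocation was short by roughly 242 MiB.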
bash tools/dist_train.sh configs/sctnet/ADE20K/sctnet-s_8x4_160k_ade.py
The only option left is to reduce the batch size. The official docs say to change it in the config:
So I modified the config file of the S variant:
data = dict(
# samples_per_gpu=8,
samples_per_gpu=4,
workers_per_gpu=4,
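And now it runs:
Live GPU memory is also nearly exhausted, so the batch size basically cannot go any higher:
Heading back to the dorm for now; I'll check the results tomorrow.
Training ran until 13:18 and finally finished:
The results are okay, though honestly a bit weak, haha. For comparison, the SegFormer-B1 I trained earlier reached mIoU 35.25 and mAcc 49.83.
The training checkpoints are saved under the work_dirs directory and can be used to run the test.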
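In mmsegmentation 0.x configs, `samples_per_gpu` is the per-GPU batch size, so the number of images per iteration is that value times the number of GPUs. The arithmetic is sketched below; the helper names are hypothetical, and the 160,000 iterations in the usage lines are an assumption read from the `160k` in the config name, not a value confirmed from the run:

```python
def total_batch_size(samples_per_gpu, num_gpus):
    """Images processed per training iteration across all GPUs."""
    return samples_per_gpu * num_gpus


def images_seen(iterations, samples_per_gpu, num_gpus):
    """Total images processed over a whole iteration-based schedule."""
    return iterations * total_batch_size(samples_per_gpu, num_gpus)


if __name__ == "__main__":
    # Single laptop GPU with the reduced setting from the config above.
    print(total_batch_size(samples_per_gpu=4, num_gpus=1))
    # Assumed schedule length: 160k iterations, as the config name suggests.
    print(images_seen(iterations=160_000, samples_per_gpu=4, num_gpus=1))
```

Halving `samples_per_gpu` on one GPU therefore also halves the number of images seen over the same schedule, which may be part of why the final numbers trail the comparison model.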
Testing
Test with the last checkpoint:
python tools/test.py configs/sctnet/ADE20K/sctnet-s_8x4_160k_ade.py work_dirs/sctnet-s_8x4_160k_ade/latest.pth --eval mIoU
Error:
Traceback (most recent call last):
File "tools/test.py", line 323, in <module>
main()
File "tools/test.py", line 221, in main
model = build_segmentor(cfg.model, test_cfg=cfg.get('test_cfg'))
File "/home/lcy-magic/Segment_TEST/SCTNet/mmsegmentation/mmseg/models/builder.py", line 48, in build_segmentor
return SEGMENTORS.build(
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 237, in build
return self.build_func(*args, **kwargs, registry=self)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/lcy-magic/anaconda3/envs/SCTNET/lib/python3.8/site-packages/mmcv/utils/registry.py", line 61, in build_from_cfg
raise KeyError(
KeyError: 'EncoderDecoder_Distill is not in the models registry'
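Strange, it worked before. I went through the GitHub issues and the official docs and checked for a long time; my __init__.py is fine. Finally I wondered whether pointing the PYTHONPATH environment variable at the project root would let Python find the module. So: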
export PYTHONPATH=$PYTHONPATH:~/Segment_TEST/SCTNet
Sure enough! Running the test script again succeeded:
Speed Test
Following the README:
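The failing traceback shows builder.py being loaded from SCTNet/mmsegmentation/mmseg, that is, from the cloned mmsegmentation checkout, which presumably does not register EncoderDecoder_Distill. One way to confirm which copy of a package Python will pick up is to ask the import system directly. A minimal sketch; `module_origin` and `resolves_under` are hypothetical helpers:

```python
import importlib.util
from pathlib import Path


def module_origin(name):
    """Return the resolved file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def resolves_under(name, root):
    """Return True if module `name` would be imported from somewhere below `root`."""
    origin = module_origin(name)
    if origin is None:
        return False
    return Path(root).expanduser().resolve() in origin.parents


if __name__ == "__main__":
    # Where would `import mmseg` come from in the current environment?
    print(module_origin("mmseg"))
```

If the printed path points into the mmsegmentation clone, the repo's own mmseg is not the one being imported, and PYTHONPATH needs adjusting.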
cd speed/
python torch_speed.py --type sctnet-b-seg100
python torch_speed.py --type sctnet-s-ade
Got results: