Code for training on multiple GPUs

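import os
import torch
import torch.nn as nn

# Set this before any CUDA call so that only these GPUs are visible to the process
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
model = getModel(args)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # device_ids index into the visible GPUs; device_ids[0] is the primary GPU
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model.to(device)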
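1. To change the primary GPU, only the following needs to be modified:
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"
device_ids=[0, 1, 2]  # numbering now starts at physical GPU 1; three GPUs are used
2. When using multiple GPUs, batch_size must be greater than 1 for more than one GPU to work, because DataParallel splits each batch across the GPUs. For every GPU to receive data, batch_size should be at least the number of GPUs.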
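The two notes above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not PyTorch code. The helper names logical_to_physical and chunk_sizes are hypothetical. The first assumes CUDA_VISIBLE_DEVICES holds comma-separated integer ids, and the second assumes the batch is split by the same rule as torch.chunk, that is, into chunks of ceil(batch_size / num_gpus) samples.

```python
import math


def logical_to_physical(visible_devices):
    """Map the indices a process sees (0, 1, ...) to physical GPU ids.

    Assumes CUDA_VISIBLE_DEVICES is a comma-separated list of integer ids.
    """
    physical_ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return dict(enumerate(physical_ids))


def chunk_sizes(batch_size, num_gpus):
    """Sizes of the per-GPU chunks, assuming a torch.chunk-style split."""
    if batch_size <= 0 or num_gpus <= 0:
        return []
    per_gpu = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_gpu, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    print(logical_to_physical("1,2,3"))
    for batch_size in (1, 5, 8):
        print(batch_size, chunk_sizes(batch_size, num_gpus=4))
```

With "1,2,3", index 0 maps to physical GPU 1, which is why device_ids=[0, 1, 2] selects physical GPUs 1 to 3. Under the stated assumption, a batch of 1 produces a single chunk, so only one GPU receives data.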
Loading the model for testing
 
    G_model, _ = getModel(args)
    if torch.cuda.device_count() > 1:
        # Wrapping first gives the parameter names the "module." prefix,
        # which matches a checkpoint saved from a DataParallel model
        G_model = nn.DataParallel(G_model)
        checkpoint = torch.load(r'./Gmodel.pth', map_location='cuda')
        G_model.load_state_dict(checkpoint)
        # Unwrap to get the plain model back for testing
        if isinstance(G_model, torch.nn.DataParallel):
            G_model = G_model.module
    else:
        checkpoint = torch.load(r'./Gmodel.pth', map_location='cuda')
        G_model.load_state_dict(checkpoint)
    G_model = G_model.to(device)
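An alternative to wrapping the model and then unwrapping it is to rename the checkpoint keys before loading. The following is a minimal sketch, assuming the checkpoint is a state dict saved from a DataParallel-wrapped model, so that each key starts with "module.". The helper strip_module_prefix is hypothetical, and plain integers stand in for tensors so that the snippet runs without PyTorch.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


if __name__ == "__main__":
    # Stand-in for the result of torch.load(...); real values would be tensors.
    checkpoint = OrderedDict([("module.conv.weight", 1), ("module.fc.bias", 2)])
    print(list(strip_module_prefix(checkpoint)))
```

The cleaned dictionary could then be passed to load_state_dict on the unwrapped model, which would work the same way whether one GPU or several are visible.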
cifar10_97.23 starts training with the run.sh file
cifar10_97.50 starts training with run.4GPU.sh
To change the number of GPUs used on the cluster, modify the run.sh file
nohup srun --job-name=cf23 $pt --gres=gpu:2 -n1 bash cluster_run.sh $cmd 2>&1 1>>log.cf50_2GPU &
Changing --gres=gpu:2 is all that is needed
Changes to the Python file
parser.add_argument('--batch_size', type=
net = nn.DataParallel(net).cuda()
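The line above is all that is needed. The arguments of nn.DataParallel() can be left out, in which case all available GPUs are used by default.
Without multi-GPU parallel computation, change that line to the following: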
net = net.cuda()
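When a single machine with multiple GPUs is enough, multi-machine multi-GPU training is strongly discouraged. Data transfer between machines is slow: if the machines only have gigabit network cards, then together with other overhead the network cannot keep up, and training ends up slow in practice.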
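1 Initialization
   Initialization is usually done right at the start of the program.
   For multi-machine multi-GPU training, torch.distributed.init_process_group() must be called first to initialize. torch.distributed.init_process_group...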
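PyTorch supports multi-GPU training in two ways: DataParallel and ModelParallel. This mainly covers DataParallel.
Mechanism: DataParallel splits each minibatch into as many parts as there are GPUs, copies the original model onto each GPU, and runs the forward pass on each GPU. In the backward pass, the gradients are summed (not averaged) and applied to the original model.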
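Why summing is the appropriate reduction can be checked with plain arithmetic. The following is a minimal sketch, not PyTorch internals. It assumes a one-parameter model w*x, a mean-squared-error loss computed over the whole minibatch, and two simulated GPUs; all names are illustrative.

```python
def grad_contribution(w, xs, ys, total):
    """Contribution of one chunk to d/dw of mean((w*x - y)^2) over `total` samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / total


def split(items, parts):
    """Split a list into `parts` equal consecutive chunks (length must divide evenly)."""
    size = len(items) // parts
    return [items[i * size:(i + 1) * size] for i in range(parts)]


xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 1.0
n = len(xs)

# Gradient computed on the whole minibatch at once
full_grad = grad_contribution(w, xs, ys, n)

# Gradient contributions computed separately on each simulated GPU
per_gpu = [grad_contribution(w, cx, cy, n) for cx, cy in zip(split(xs, 2), split(ys, 2))]
summed = sum(per_gpu)
averaged = summed / len(per_gpu)

print(full_grad, per_gpu, summed, averaged)
```

Because the loss is normalized over the whole minibatch, each chunk contributes only part of the gradient. The sum of the parts equals the full-batch gradient, while their average would be too small by a factor equal to the number of GPUs.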
Two ways to specify GPU ids:
Via an environment variable: os.environ["CUDA_VISIBLE_DEVICES"]="1,2,3,4". The advantage is that it applies only to the GPUs with the specified ids.