In some cases when training with torch, training aborts suddenly after a few epochs because of a tensor dimension mismatch, and I do not know what causes the problem. It happens on both CPU and GPU.
While using the package, however, I found that avoiding repeated assignments to the same named object when writing the module reduces how often this happens.
For example:
library(torch)

convnext_block <- nn_module(
  initialize = function(dim, dropout_p = 0.1, layer_scale_init_value = 1e-6) {
    self$conv <- nn_conv2d(dim, dim, kernel_size = 7, padding = 3, groups = dim)  # depthwise conv
    self$ln <- nn_layer_norm(dim)
    self$linear1 <- nn_linear(dim, dim * 4)
    self$gelu <- nn_gelu()
    self$linear2 <- nn_linear(dim * 4, dim)
    # self$gamma <- nn_parameter(layer_scale_init_value * torch_ones(1, 1, 1, dim))
    self$dropout <- nn_dropout(dropout_p)
  },
  forward = function(xcxb1) {
    xcxbresid <- xcxb1
    xcxb2 <- self$conv(xcxb1)
    xcxb3 <- xcxb2$permute(c(1, 3, 4, 2))  # NCHW -> NHWC so layer norm acts on channels
    xcxb4 <- self$ln(xcxb3)
    xcxb5 <- self$linear1(xcxb4)
    xcxb6 <- self$gelu(xcxb5)
    xcxb7 <- self$linear2(xcxb6)
    # xcxb <- xcxb * self$gamma
    xcxb8 <- xcxb7$permute(c(1, 4, 2, 3))  # NHWC -> NCHW
    torch_add(self$dropout(xcxb8), xcxbresid)
  }
)
n <- convnext_block(64)
n
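As a quick sanity check (this snippet is my addition for illustration; the 1 x 64 x 32 x 32 input size is arbitrary), the block should return a tensor with the same shape as its input:
x <- torch_randn(1, 64, 32, 32)  # batch of 1, 64 channels, 32 x 32 spatial
y <- n(x)
y$shape                          # expected: 1 64 32 32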
In ordinary programming it would be enough to reuse a single name such as 'xcxb' for every intermediate result, but once the layers of the network are stacked, reusing the same name leads to a tensor dimension error during training; see the sketch below for the pattern I mean. Using distinct names such as xcxb1, xcxb2, xcxb3, ... reduces the number of errors to a large extent, but it still does not avoid them completely.
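For contrast, this is what the forward method looks like when a single name is reused for every intermediate result; it is only an illustration of the naming pattern described above, not part of the module I now use:
  # forward written with one reused name (the pattern that seems to cause trouble)
  forward = function(xcxb) {
    resid <- xcxb
    xcxb <- self$conv(xcxb)
    xcxb <- xcxb$permute(c(1, 3, 4, 2))
    xcxb <- self$ln(xcxb)
    xcxb <- self$linear1(xcxb)
    xcxb <- self$gelu(xcxb)
    xcxb <- self$linear2(xcxb)
    xcxb <- xcxb$permute(c(1, 4, 2, 3))
    torch_add(self$dropout(xcxb), resid)
  }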
Finally, I found that the remaining errors can be avoided by calling rm(), gc() and cuda_empty_cache() at the end of each epoch during training, as follows:
model <- model$to(device = device)
for (epoch in 1:100) {
  optimizer <- optim_adam(model$parameters, lr = 0.001)  # Adam optimizer
  model$train()  # set to training mode
  coro::loop(for (b in ministdlta) {
    optimizer$zero_grad()
    output <- model(b[[1]]$to(device = device))
    loss <- nnf_multilabel_soft_margin_loss(output, b[[2]]$to(device = device))
    loss$backward()
    optimizer$step()
  })
  # release the objects created in this epoch before the next one starts
  rm(list = c("b", "output", "loss"))
  gc()
  cuda_empty_cache()
}
This will largely avoid tensor dimension errors during training.
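One way to double-check that the per-epoch cleanup actually releases GPU memory is to look at the CUDA allocator statistics after each epoch; the snippet below is only a suggestion and assumes a torch version that exports cuda_memory_stats():
  # optional: inspect allocator state after rm()/gc()/cuda_empty_cache()
  if (cuda_is_available()) {
    print(cuda_memory_stats())
  }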
I suspect this problem arises because the torch package's underlying data-handling code has a flaw related to the physical memory address of the data.