In some cases when training with torch, training aborts suddenly after a few epochs because of a tensor dimension mismatch, and I do not know what causes the problem. It happens on both CPU and GPU.
While using the package, however, I found that avoiding repeated assignments to the same named object when writing the module reduces how often this happens.
For example:
library(torch)

convnext_block <- nn_module(
  initialize = function(dim, dropout_p = 0.1, layer_scale_init_value = 1e-6) {
    self$conv <- nn_conv2d(dim, dim, kernel_size = 7, padding = 3, groups = dim)  # depthwise conv
    self$ln <- nn_layer_norm(dim)
    self$linear1 <- nn_linear(dim, dim * 4)
    self$gelu <- nn_gelu()
    self$linear2 <- nn_linear(dim * 4, dim)
    # self$gamma <- nn_parameter(layer_scale_init_value * torch_ones(1, 1, 1, dim))
    self$dropout <- nn_dropout(dropout_p)
  },
  forward = function(xcxb1) {
    xcxbresid <- xcxb1
    xcxb2 <- self$conv(xcxb1)
    xcxb3 <- xcxb2$permute(c(1, 3, 4, 2))  # NCHW -> NHWC so layer norm acts on channels
    xcxb4 <- self$ln(xcxb3)
    xcxb5 <- self$linear1(xcxb4)
    xcxb6 <- self$gelu(xcxb5)
    xcxb7 <- self$linear2(xcxb6)
    # xcxb <- xcxb * self$gamma
    xcxb8 <- xcxb7$permute(c(1, 4, 2, 3))  # NHWC -> NCHW
    torch_add(self$dropout(xcxb8), xcxbresid)
  }
)
n <- convnext_block(64)
n
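As a quick sanity check (this snippet is my addition for illustration; the 1 x 64 x 32 x 32 input size is arbitrary), the block should return a tensor with the same shape as its input:
x <- torch_randn(1, 64, 32, 32)  # batch of 1, 64 channels, 32 x 32 spatial
y <- n(x)
y$shape                          # expected: 1 64 32 32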
In ordinary programming it would be enough to reuse a single name such as 'xcxb' for every intermediate result, but once the layers of the network are stacked, reusing the same name leads to a tensor dimension error during training; see the sketch below for the pattern I mean. Using distinct names such as xcxb1, xcxb2, xcxb3, ... reduces the number of errors to a large extent, but it still does not avoid them completely.
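For contrast, this is what the forward method looks like when a single name is reused for every intermediate result; it is only an illustration of the naming pattern described above, not part of the module I now use:
  # forward written with one reused name (the pattern that seems to cause trouble)
  forward = function(xcxb) {
    resid <- xcxb
    xcxb <- self$conv(xcxb)
    xcxb <- xcxb$permute(c(1, 3, 4, 2))
    xcxb <- self$ln(xcxb)
    xcxb <- self$linear1(xcxb)
    xcxb <- self$gelu(xcxb)
    xcxb <- self$linear2(xcxb)
    xcxb <- xcxb$permute(c(1, 4, 2, 3))
    torch_add(self$dropout(xcxb), resid)
  }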
Finally, I found that the remaining errors can be avoided by calling rm(), gc() and cuda_empty_cache() at the end of each epoch during training, as follows:
model <- model$to(device = device)
for (epoch in 1:100) {
  optimizer <- optim_adam(model$parameters, lr = 0.001)  # Adam optimizer
  model$train()  # set to training mode
  coro::loop(for (b in ministdlta) {
    optimizer$zero_grad()
    output <- model(b[[1]]$to(device = device))
    loss <- nnf_multilabel_soft_margin_loss(output, b[[2]]$to(device = device))
    loss$backward()
    optimizer$step()
  })
  # release the objects created in this epoch before the next one starts
  rm(list = c("b", "output", "loss"))
  gc()
  cuda_empty_cache()
}
This will largely avoid tensor dimension errors during training.
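One way to double-check that the per-epoch cleanup actually releases GPU memory is to look at the CUDA allocator statistics after each epoch; the snippet below is only a suggestion and assumes a torch version that exports cuda_memory_stats():
  # optional: inspect allocator state after rm()/gc()/cuda_empty_cache()
  if (cuda_is_available()) {
    print(cuda_memory_stats())
  }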
I suspect this problem arises because the torch package's underlying data-handling code has a flaw related to the physical memory address of the data.