Someone asked this question on PTT (in Chinese). He had trained a rectal cancer detection model on MRI images with 5-fold cross-validation, but the out-of-fold AUC was below 0.5 in every fold. After searching the Internet, he found someone saying: if you reverse the labels (swap class 0 and 1), you get an AUC above 0.5, so your model still learned something. In my humble opinion, reversing the labels of a worse-than-random model is very dangerous. So, how do we solve this?
First, make sure your code is bug-free. Maybe you simply swapped 0 and 1 by accident, so the AUC is reversed. If the code is bug-free, then consider the following two cases:
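A quick way to see why a label mix-up shows up as AUC < 0.5: AUC is symmetric under label flipping, so swapping 0 and 1 turns an AUC of a into 1 − a. A minimal sketch with made-up labels and scores (not the original author's data or pipeline), using scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, standing in for one fold's predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = y_true + rng.normal(scale=1.0, size=200)  # scores correlated with labels

auc = roc_auc_score(y_true, y_score)
auc_flipped = roc_auc_score(1 - y_true, y_score)  # what a 0/1 mix-up would report

print(f"AUC with correct labels: {auc:.3f}")
print(f"AUC with flipped labels: {auc_flipped:.3f}")  # equals 1 - auc
```

So if your AUC sits suspiciously close to 1 minus a "good" value, check how the labels were encoded before touching the model.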
Case 1: train AUC ≤ 0.5 and test AUC < 0.5
Something is wrong in the model or the data, so the model is underfitting. It could be the model structure or hyperparameters, the numerical range of the data, etc.
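To tell which case you are in, log the training AUC next to the out-of-fold AUC for every fold. A minimal sketch, assuming a generic feature matrix `X` and labels `y`; logistic regression and the synthetic data are placeholders for whatever model and data are actually used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data; substitute your own feature matrix X and labels y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(cv.split(X, y), start=1):
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    train_auc = roc_auc_score(y[tr], model.predict_proba(X[tr])[:, 1])
    oof_auc = roc_auc_score(y[te], model.predict_proba(X[te])[:, 1])
    # Train AUC also near or below 0.5 -> the model/data problem described here.
    # Train AUC clearly above 0.5 but out-of-fold AUC below 0.5 -> see the next case.
    print(f"fold {fold}: train AUC = {train_auc:.3f}, out-of-fold AUC = {oof_auc:.3f}")
```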
Case 2: train AUC > 0.5 and test AUC < 0.5
The model training looks reasonable, but the test AUC is below 0.5. This means that, in the current feature space, the distributions of the training data and the testing data are different. Reversing the predicted labels here is very dangerous.
Consider the following case: we are training a cat-dog classifier, that is, given an image, determine whether it contains a cat or a dog. During training, the model finds that every image with a tongue out is a dog, and everything else is a cat. The model learns this rule well, so the train AUC is above 0.5. But in the test set, all the cats have their tongues out, so the predictions are wrong, which drives the test AUC below 0.5.
In this case, simply flipping the predicted values is not a reasonable solution, as it would only work for this particular test set and not for cats and dogs in general. Instead, we should investigate why the model relied solely on the presence of the tongue as a decisive feature and make corrections accordingly; one way to check is sketched below.
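A simple way to surface such a shortcut is to compare how a suspect feature lines up with the label in the training split versus the test split. This is only a sketch: `has_tongue_out` and `is_dog` are hypothetical columns for the cat-dog story, standing in for whatever feature and label your data actually has.

```python
import pandas as pd

# Hypothetical metadata; in practice this comes from annotations or a feature
# you suspect the model is keying on.
train = pd.DataFrame({
    "is_dog":         [1, 1, 1, 0, 0, 0],
    "has_tongue_out": [1, 1, 1, 0, 0, 0],   # perfectly aligned with the label in training
})
test = pd.DataFrame({
    "is_dog":         [1, 1, 0, 0, 0, 0],
    "has_tongue_out": [0, 0, 1, 1, 1, 1],   # alignment reversed at test time
})

# If the two tables look very different, the feature is a spurious shortcut,
# not a real cat-vs-dog signal, and flipping predictions would not fix it.
print(pd.crosstab(train["has_tongue_out"], train["is_dog"]))
print(pd.crosstab(test["has_tongue_out"], test["is_dog"]))
```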
Solution: check the code, check the training performance, check the failure cases
When you find test AUC < 0.5, do not reverse the labels! Make sure the code is bug-free. Check whether the training AUC is greater than 0.5. Then look at the wrong predictions on the testing data, and find which “features” make your model predict correctly on the training data but incorrectly on the testing data.
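For that last step, it helps to pull out the test cases the model gets most confidently wrong and inspect them by hand. A minimal sketch with placeholder arrays; `y_test` and `proba` stand in for the real test labels and predicted probabilities from your model:

```python
import numpy as np

# Placeholder arrays; substitute the real test labels and predicted probabilities.
y_test = np.array([0, 1, 1, 0, 1, 0])
proba = np.array([0.9, 0.1, 0.2, 0.8, 0.4, 0.3])  # predicted P(class 1)
pred = (proba >= 0.5).astype(int)

wrong = np.where(pred != y_test)[0]
# Sort the misclassified cases by how confidently wrong the model was.
confidence_of_error = np.abs(proba[wrong] - 0.5)
worst_first = wrong[np.argsort(-confidence_of_error)]

for i in worst_first:
    print(f"index {i}: true={y_test[i]}, predicted P(1)={proba[i]:.2f}")
    # Inspect the corresponding image or metadata here to see what the model latched onto.
```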