Record of my interview experience with QuantumBlack (McKinsey) as a data scientist in Singapore in 2019.
Someone asked this question on PTT (in Chinese): he trained a rectal cancer detection model on MRI images with 5-fold cross-validation, but the out-of-fold AUC was below 0.5 in every fold. After some searching on the Internet, he found advice saying: if you reverse the labels (swap class 0 and 1), you can get an AUC above 0.5, so your model still learned something. In my humble opinion, reversing the labels of a worse-than-random model is very dangerous. So, how should this be solved?
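The identity behind that advice can be sketched directly (a minimal pure-Python example, not from the original post; the data and function names are made up for illustration). AUC is the probability that a random positive scores higher than a random negative, so swapping the two classes turns an AUC of a into 1 − a:

```python
def auc(labels, scores):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as 0.5. Labels are 0/1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model that scores negatives higher than positives: AUC < 0.5.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.9, 0.4, 0.1, 0.5, 0.6, 0.2]

a = auc(labels, scores)
flipped = auc([1 - y for y in labels], scores)
print(a, flipped)  # flipped is exactly 1 - a
```

The flip always "works" arithmetically, which is exactly why it is dangerous: it rewards any model that is consistently wrong on the validation data, instead of prompting a check for a label bug, leakage, or an unstable model.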
This article will introduce how to download a Wikipedia corpus and train word embeddings on it. All the code is on GitHub. Downloading and training both take an extremely long time, so I have also uploaded my pretrained embeddings. You can download them here: Chinese Word2Vec, Chinese FastText, English Word2Vec, English FastText.
I recently bought a new computer, ran into some strange problems, and was stuck for a long time. This post documents the steps for setting up the environment.