Abstract (EN):
There are several indices that provide an indication of different types on the performance of QSAR classification models, being the area under a Receiver Operating Characteristic (ROC) curve still the most powerful test to overall assess such performance. All ROC related parameters can be calculated for both the training and test sets, but, nevertheless, neither of them constitutes an absolute indicator of the classification performance by themselves. Moreover, one of the biggest drawbacks is the computing time needed to obtain the area under the ROC curve, which naturally slows down any calculation algorithm. The present study proposes two new parameters based on distances in a ROC curve for the selection of classification models with an appropriate balance in both training and test sets, namely the following: the ROC graph Euclidean distance (ROCED) and the ROC graph Euclidean distance corrected with Fitness Function (FIT(lambda)) (ROCFIT). The behavior of these indices was observed through the study on the mutagenicity for four genotoxicity end points of a number of nonaromatic halogenated derivatives. It was found that the ROCED parameter gets a better balance between sensitivity and specificity for both the training and prediction sets than other indices such as the Matthews correlation coefficient, the Wilk's lambda, or parameters like the area under the ROC curve. However, when the ROCED parameter was used, the follow-on linear discriminant models showed the lower statistical significance. But the other parameter, ROCFIT, maintains the ROCED capabilities while improving the significance of the models due to the inclusion of FIT(lambda)).
Language:
English
Type (Professor's evaluation):
Scientific