Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task are computationally expensive and slow, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as those obtained with existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework and optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model, so predicting from features extracted from the sequence data substantially reduces running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.

Keywords: model selection, phylogenetic reconstruction, simulations, nucleotide substitution models, machine learning, Random Forest for regression

Introduction

The abundance of substitution models (Jukes and Cantor 1969; Kimura 1980; Felsenstein 1981; Cowan 1984; Hasegawa et al. 1985; Tamura 1992; Tamura and Nei 1993; Schöniger and Von Haeseler 1994; Zharkikh 1994; Huelsenbeck and Crandall 1997) and the need to choose one (or a few) have established model selection as a prerequisite for phylogeny reconstruction (Goldman 1993a; Huelsenbeck and Rannala 1997; Sullivan and Swofford 1997, 2001; Posada and Crandall 2001; Pupko et al.). This is evident from the wide use of model selection as an inherent component of phylogenetic analysis. For example, the paper describing the most widely used method, MODELTEST (Posada and Crandall 1998), was included in the 100 all-time top-cited papers according to Web of Science (Van Noorden et al.).

However, although criteria for phylogenetic model selection have been adapted from the general statistical literature, they rely on assumptions that do not hold for phylogenetic data analysis (Posada and Buckley 2004). For example, the likelihood ratio test (LRT) for comparing a pair of nested models, as approximated using the chi-square distribution, has been extended to the comparison of multiple models via the hierarchical likelihood ratio test (hLRT) criterion, which performs a sequence of LRTs between pairs of nested substitution models until a model that cannot be rejected is reached. However, LRTs assume that at least one of the compared models is adequate and might be incorrect when the models are misspecified (Foutz and Srivastava 1977; Kent 1982; Golden 1995). Obviously, no substitution model can fully capture the genuine complexity of the evolutionary process, and even the most adequate one merely provides an approximation of reality (Box 1976), so model misspecification is always present and may bias the results of LRTs in phylogenetics (Zhang 1999). Furthermore, it has been shown that the choices determined by the hLRT are influenced by the order of pairwise tests and the significance threshold used to reject the simpler model in each paired comparison (Yang et al.). Information criteria, by contrast, compute the maximum-likelihood scores for all the candidate models simultaneously. As the maximum-likelihood score generally increases with the inclusion of more parameters, the Akaike and Bayesian Information Criteria, AIC (Akaike 1973, 1974) and BIC (Schwarz 1978), assign different penalties according to the number of parameters included in the model, with BIC also considering the data size. AIC assumes that the sample size is large enough to maintain the asymptotic property of the likelihood function and thus penalizes the likelihood score only for the number of model parameters. This assumption, however, rarely holds for phylogenetic data.
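To make the conventional criteria discussed in this text concrete, here is a minimal Python sketch of the standard AIC and BIC formulas and of a chi-square likelihood ratio test between nested models. It is not part of ModelTeller and not any tool's actual implementation. The log-likelihood values, the alignment length of 1,000 sites, and the function names are hypothetical, and the parameter counts cover only free substitution-model parameters (branch lengths are ignored for simplicity).

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: penalty depends only on k."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: penalty also grows with data size n."""
    return k * math.log(n) - 2 * log_lik

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1.

    Starts from the closed forms for df = 1 or df = 2 and applies the
    recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while a < df / 2.0 - 1e-9:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lrt(log_lik_simple, log_lik_complex, df):
    """Likelihood ratio test between nested models; returns (statistic, p-value)."""
    stat = 2.0 * (log_lik_complex - log_lik_simple)
    return stat, chi2_sf(stat, df)

# Hypothetical maximum log-likelihoods and free-parameter counts.
models = {"JC": (-5200.0, 0), "HKY": (-5150.0, 4), "GTR": (-5143.0, 8)}
n_sites = 1000  # assumed alignment length, used as the data size for BIC

aic_scores = {m: aic(ll, k) for m, (ll, k) in models.items()}
bic_scores = {m: bic(ll, k, n_sites) for m, (ll, k) in models.items()}
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

# One step of an hLRT-style comparison: HKY (simpler) against GTR (more complex).
stat, p_value = lrt(models["HKY"][0], models["GTR"][0], df=8 - 4)

print("AIC picks", best_aic, "| BIC picks", best_bic)
print("LRT HKY vs GTR: statistic", stat, "p-value", p_value)
```

With these made-up numbers the two information criteria disagree, because the BIC penalty per parameter, ln(1000), is larger than the AIC penalty of 2. This illustrates how the choice of criterion alone can change the selected model.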