Abstract
This article addresses two problems in outlier detection and variable selection in linear regression models. First, outlier detection suffers from problems known as smearing and masking. Smearing means that one outlier makes another, non-outlying observation appear to be an outlier; masking means that one outlier prevents another from being detected. Detecting outliers one at a time may therefore give misleading results. This article presents a genetic algorithm that considers different possible groupings of the data into outlier and non-outlier observations, so that all outliers are detected at the same time. Second, outlier detection and variable selection are known to influence each other, and different results may be obtained depending on the order in which the two tasks are performed. It may therefore be useful to carry out these tasks simultaneously, and a genetic algorithm for simultaneous outlier detection and variable selection is suggested. Two real data sets are used to illustrate the algorithms, which are shown to work well. In addition, the scalability of the algorithms is examined in an experiment using generated data.
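To make the idea of searching over groupings concrete, the following is a minimal Python sketch of a genetic algorithm for simultaneous outlier detection and variable selection. It is an illustration under stated assumptions, not the article's implementation: the abstract does not specify the fitness criterion or the genetic operators, so the BIC-style score, the extra outlier penalty `kappa`, the cap of n/2 flagged observations, tournament selection, uniform crossover and bit-flip mutation are all choices made here. The function names `fitness` and `run_ga` and the generated data are likewise hypothetical. Each chromosome is a bit string: the first k bits select regressors and the remaining n bits mark observations as outliers.

```python
import numpy as np

def fitness(chrom, X, y, kappa=3.0):
    """BIC-style score (lower is better); the criterion is an assumption.

    chrom[:k] selects regressors, chrom[k:] flags observations as outliers.
    Flagged observations are dropped from the fit and penalised as if each
    had its own dummy variable, with the penalty scaled by kappa.
    """
    n, k = X.shape
    use_var = chrom[:k].astype(bool)
    is_out = chrom[k:].astype(bool)
    m = int(is_out.sum())
    if m > n // 2:                       # cap on flagged observations (assumption)
        return np.inf
    # Design matrix: intercept plus selected regressors, unflagged rows only.
    Z = np.column_stack([np.ones(n), X[:, use_var]])[~is_out]
    beta, *_ = np.linalg.lstsq(Z, y[~is_out], rcond=None)
    rss = max(float(np.sum((y[~is_out] - Z @ beta) ** 2)), 1e-12)
    n_par = 1 + int(use_var.sum())
    return n * np.log(rss / n) + np.log(n) * (n_par + kappa * m)

def run_ga(X, y, pop_size=80, generations=150, seed=0):
    """Evolve bit strings; return (selected regressors, flagged rows, score)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    length = k + n
    # Sparse random start: about half the regressors, few flagged observations.
    p_one = np.concatenate([np.full(k, 0.5), np.full(n, 0.1)])
    pop = (rng.random((pop_size, length)) < p_one).astype(int)
    pop[0] = 0                           # always-feasible seed: nothing selected or flagged
    for _ in range(generations):
        scores = np.array([fitness(c, X, y) for c in pop])
        order = np.argsort(scores)
        children = [pop[order[0]].copy(), pop[order[1]].copy()]  # elitism
        while len(children) < pop_size:
            # Tournament selection (size 3) for each parent.
            a = rng.choice(pop_size, 3, replace=False)
            b = rng.choice(pop_size, 3, replace=False)
            pa = pop[a[np.argmin(scores[a])]]
            pb = pop[b[np.argmin(scores[b])]]
            child = np.where(rng.random(length) < 0.5, pa, pb)   # uniform crossover
            flip = rng.random(length) < 1.0 / length             # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    scores = np.array([fitness(c, X, y) for c in pop])
    best = pop[np.argmin(scores)]
    return np.flatnonzero(best[:k]), np.flatnonzero(best[k:]), float(scores.min())

# Generated data (hypothetical): two relevant regressors, two irrelevant ones,
# and three planted outliers shifted in the same direction.
rng = np.random.default_rng(42)
n = 40
X = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
planted = [5, 17, 30]
y[planted] += 15.0

selected_vars, flagged, score = run_ga(X, y)
print("selected regressors:", selected_vars)
print("flagged observations:", flagged)
```

Because a chromosome encodes a complete grouping of the data, candidate solutions that flag several observations at once are scored directly, which is what allows the search to get past masking. Fixing the first k bits to one would reduce the sketch to outlier detection alone.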
Additional information
I would like to thank Dr Tero Aittokallio and an anonymous referee for useful comments.
Cite this article
Tolvi, J. Genetic algorithms for outlier detection and variable selection in linear regression models. Soft Computing 8, 527–533 (2004). https://doi.org/10.1007/s00500-003-0310-2
DOI: https://doi.org/10.1007/s00500-003-0310-2