Abstract
In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and the observed value is large, we declare the observation an outlier. As a rule, such procedures rely on single comparison testing. The problem becomes much harder when each observation is associated with a degree of membership in an underlying population, so that outlier detection has to be generalized to operate over fuzzy data. We present a new approach for outlier detection over fuzzy data that uses two inter-related algorithms. Depending on how outliers enter the observation sample, they may be of various orders of magnitude. To account for this, we divide the outlier detection procedure into cycles, each consisting of two phases. In Phase 1, we apply a leave-one-out procedure to each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini–Hochberg step-up multiple testing procedure, which controls the false-discovery rate, and the non-confirmed outliers can return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In this way, Phase 1 yields a reliable, high-quality regression model, because the leave-one-out procedure purges dubious observations comparatively easily owing to the single comparison testing. At the same time, confirming outlier status against the newly obtained high-quality regression model is much harder because of the multiple testing procedure, so only the true outliers remain outside the data sample. The two phases in each cycle strike a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample).
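To make the Phase 2 test concrete, the following is a minimal sketch of the generic Benjamini–Hochberg step-up rule, not the authors' implementation. The function name `benjamini_hochberg`, the parameter `q` (target false-discovery rate), and the convention that a rejected null hypothesis means "outlier confirmed" are assumptions introduced here for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Hypothetical helper: in the setting of the abstract, a rejection would
    correspond to a confirmed outlier; a non-rejection to an observation
    that may return to the dataset.
    """
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# The step-up character: the largest p-value passes its threshold (0.05),
# so all four hypotheses are rejected, even though the second- and
# third-ranked p-values miss their own thresholds.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.04], q=0.05))
# [True, True, True, True]
```

Because the rule scans for the largest passing rank rather than stopping at the first failure, it is less conservative than a Bonferroni-type correction while still controlling the false-discovery rate at level `q` for independent tests.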
The number of cycles is user-defined, but the procedure can terminate early if a cycle detects no new outliers. We offer one illustrative example and two practical case studies (from real-life thrombosis studies) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
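The cycle structure described in the abstract can be sketched roughly as follows. This is a simplified illustration under strong assumptions, not the published algorithms: the data are crisp pairs `(x, y)` with a membership degree `w` used as a weight in a weighted least-squares straight-line fit, p-values come from a normal approximation to the scaled prediction residual, and all names (`wls_line`, `p_value`, `detect_outliers`, `alpha`, `q`, `max_cycles`) are hypothetical.

```python
import math

def wls_line(x, y, w, idx):
    """Weighted least-squares line y = a + b*x over the points in idx."""
    sw = sum(w[i] for i in idx)
    mx = sum(w[i] * x[i] for i in idx) / sw
    my = sum(w[i] * y[i] for i in idx) / sw
    sxx = sum(w[i] * (x[i] - mx) ** 2 for i in idx)
    sxy = sum(w[i] * (x[i] - mx) * (y[i] - my) for i in idx)
    b = sxy / sxx
    a = my - b * mx
    sse = sum(w[i] * (y[i] - a - b * x[i]) ** 2 for i in idx)
    n = len(idx)
    s = math.sqrt(sse / sw * n / (n - 2))  # crude residual scale (assumption)
    return a, b, s

def p_value(i, idx, x, y, w):
    """Two-sided p-value of point i against the line fitted on idx."""
    a, b, s = wls_line(x, y, w, idx)
    resid = y[i] - (a + b * x[i])
    if s == 0.0:
        return 1.0 if resid == 0.0 else 0.0
    z = math.sqrt(w[i]) * resid / s
    return math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation (assumption)

def detect_outliers(x, y, w, alpha=0.05, q=0.05, max_cycles=5):
    n = len(x)
    outliers = set()
    for _ in range(max_cycles):
        inliers = [i for i in range(n) if i not in outliers]
        if len(inliers) < 4:
            raise ValueError("too few non-outliers for leave-one-out fits")
        # Phase 1: leave-one-out, single comparison test at level alpha.
        new = {i for i in inliers
               if p_value(i, [j for j in inliers if j != i], x, y, w) < alpha}
        if not new:
            break  # a cycle with no new outliers ends the analysis
        outliers |= new
        inliers = [i for i in range(n) if i not in outliers]
        if len(inliers) < 3:
            raise ValueError("too few non-outliers to fit a line")
        # Phase 2: Benjamini-Hochberg step-up test over all declared outliers,
        # against the model fitted on the current non-outliers.
        cand = sorted(outliers)
        pv = [p_value(i, inliers, x, y, w) for i in cand]
        m = len(cand)
        order = sorted(range(m), key=lambda j: pv[j])
        k_max = max((r for r, j in enumerate(order, 1) if pv[j] <= r * q / m),
                    default=0)
        # Non-confirmed candidates drop out of the set, i.e. return to the data.
        outliers = {cand[j] for j in order[:k_max]}
    return sorted(outliers)

# Toy data: y = 2 + 3x with small noise; observation 5 is shifted by 50.
noise = [0.1, -0.1, 0.1, -0.1, 0.1, 0.0, 0.1, -0.1, 0.1, -0.1, 0.1]
x = [float(i) for i in range(11)]
y = [2.0 + 3.0 * xi + e for xi, e in zip(x, noise)]
y[5] += 50.0
w = [1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0]  # membership degrees

found = detect_outliers(x, y, w)
print(found)  # [5]
a, b, _ = wls_line(x, y, w, [i for i in range(len(x)) if i not in found])
```

The last line mirrors the final step of the abstract: the regression model is refitted on the non-outliers only. A fuller treatment would replace the normal approximation with an appropriate reference distribution and account for leverage in the prediction residual.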
Acknowledgements
This work was supported in part by the Hungarian National Research, Development and Innovation Office (NKFIH) (129528, KK) and the Higher Education Institutional Excellence Programme of the Ministry of Human Capacities in Hungary for the Molecular Biology thematic programme of Semmelweis University (KK). The research is also supported by the University of Tasmania’s internal research and development fund RT.112222.
Cite this article
Nikolova, N., Rodríguez, R.M., Symes, M. et al. Outlier Detection Algorithms Over Fuzzy Data with Weighted Least Squares. Int. J. Fuzzy Syst. 23, 1234–1256 (2021). https://doi.org/10.1007/s40815-020-01049-8
DOI: https://doi.org/10.1007/s40815-020-01049-8