Abstract
In genetic association studies, the traits of interest are sometimes collected as self-reported data. Because subjects give a mix of exact and rounded responses, the histogram of the data frequently exhibits spikes at particular values. This phenomenon, known as heaping, makes it difficult to perform association tests with standard modeling approaches. Several models have recently been proposed to recover the true, unobservable underlying distribution from heaped data. However, all of these methods depend on probabilistic assumptions about the heaping mechanism. Probabilistic models cannot represent heaped data effectively, because heaping can be caused by imprecisely reported values, and this type of imprecision differs from the probabilistic uncertainty that such models describe well. In this paper, we propose a fuzzy heaping model to identify genetic variants associated with heaped count data. Our fuzzy model uses a mixture of likelihood functions for precisely and imprecisely reported data, treating heaped data as imprecise data represented by fuzzy sets. Moreover, since reported count data may include excess zeros as well as heaped values, we extend our fuzzy heaping model to handle excess zeros. Through simulation studies, we show that the proposed fuzzy heaping model controls type I error effectively and has high power to identify causal variants. We illustrate the proposed model with a study identifying genetic variants associated with the number of cigarettes smoked per day.
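To make the idea of a mixture likelihood over precise and imprecise reports concrete, the following is a minimal sketch and not the model specified in the paper. It assumes a Poisson distribution for the true count, a triangular membership function for a heaped report, and Zadeh's probability of a fuzzy event (the membership-weighted sum of the probability mass) as the likelihood contribution of an imprecise report. The heap points, the spread, the mixing weight `p_heap`, and all function names are hypothetical choices made for illustration, and excess zeros are not handled here.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass P(X = k) for rate lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def triangular_membership(x, center, spread):
    """Membership of count x in the fuzzy number 'about center'.

    Assumed shape: 1 at the center, falling linearly to 0 at center +/- spread.
    """
    return max(0.0, 1.0 - abs(x - center) / spread)


def fuzzy_event_prob(center, spread, lam):
    """Probability of the fuzzy event 'about center' under Poisson(lam):
    the sum over counts of membership times probability mass."""
    lo = max(0, center - spread)
    hi = center + spread
    return sum(
        triangular_membership(x, center, spread) * poisson_pmf(x, lam)
        for x in range(lo, hi + 1)
    )


def log_likelihood(counts, lam, heap_points=(10, 20, 30), spread=5, p_heap=0.5):
    """Log-likelihood of reported counts under the illustrative mixture.

    A report at a heap point is treated as precise with weight 1 - p_heap
    and as a fuzzy (imprecise) report with weight p_heap. All other reports
    are treated as precise. These defaults are assumptions, not estimates.
    """
    total = 0.0
    for y in counts:
        if y in heap_points:
            lik = (1.0 - p_heap) * poisson_pmf(y, lam) \
                + p_heap * fuzzy_event_prob(y, spread, lam)
        else:
            lik = poisson_pmf(y, lam)
        total += math.log(lik)
    return total


# Hypothetical reported cigarettes per day, with a spike at 20.
reported = [12, 20, 20, 18, 20, 7, 20, 15]
fit = log_likelihood(reported, lam=16.0)
```

In an association test, the rate `lam` would typically depend on genotype and covariates through a regression model, and the parameters would be estimated rather than fixed as they are in this sketch.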
Acknowledgements
This work was supported by the Bio-Synergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI16C2037).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by V. Loia.
About this article
Cite this article
Jung, HY., Choi, H. & Park, T. Fuzzy heaping mechanism for heaped count data with imprecision. Soft Comput 22, 4585–4594 (2018). https://doi.org/10.1007/s00500-017-2641-4
DOI: https://doi.org/10.1007/s00500-017-2641-4