Abstract
Ratings are frequently used to evaluate and compare subjects in various applications, from education to healthcare, because they provide succinct yet credible measures for comparing subjects. However, when multiple rating lists are combined or considered together, subjects often have missing ratings, because most rating lists do not rate every subject in the combined list. In this study, we analyze missing value patterns using six real-world data sets from various applications, as well as the conditions under which imputation algorithms are applicable. Based on the special structures and properties derived from these analyses, we propose optimization models and algorithms that impute missing ratings in the combined rating lists by minimizing the total rating discordance across rating providers, using only the known rating information. The total rating discordance is defined as the sum of the pairwise discordance metrics, which can be written as a quadratic function. Computational experiments based on real-world and synthetic rating data sets show that the proposed methods outperform state-of-the-art general imputation methods from the literature in terms of imputation accuracy.
Data and Code availability
The real and synthetic data files and codes are attached to the paper as supplements.
Notes
Some RPs use continuous scores, and we convert them into ordinal ratings.
The Mann–Whitney U test is a nonparametric test that, for two groups G1 and G2, checks whether the probability of G1 being greater than G2 equals the probability of G2 being greater than G1. It requires only that the individual scores or values can be rank ordered; there is no need to compute means or variances.
References
Adelman, D. (2020). An efficient frontier approach to scoring and ranking hospital performance. Operations Research, 68(3), 762–792.
Austin, J. M., Jha, A. K., Romano, P. S., Singer, S. J., Vogus, T. J., Wachter, R. M., & Pronovost, P. J. (2015). National hospital ratings systems share few common scores and may generate confusion instead of clarity. Health Affairs, 34(3), 423–430.
Bloomberg. (2014). Bloomberg impact report. Accessed March 2021, available at https://www.bloomberg.com/impact/
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. Routledge.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
CSRHub. (2014). About CSRHub. Accessed March 2021, available at https://esg.csrhub.com/about-csrhub
Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Cantor, R., & Packer, F. (1996). Determinants and impact of sovereign credit ratings. Economic Policy Review, 2(2), 37–53.
Casalo, L. V., Flavian, C., Guinaliu, M., & Ekinci, Y. (2015). Do online hotel rating schemes influence booking behaviors? International Journal of Hospitality Management, 49, 28–36.
Chang, A., Rudin, C., Cavaretta, M., Thomas, R., & Chou, G. (2012). How to reverse-engineer quality rankings. Machine Learning, 88(3), 369–398.
Children At Risk. (2020). Texasschoolguide.org school rankings. Accessed May 2020, available at https://texasschoolguide.org
Dalton, B. (2017). The landscape of school rating systems. RTI Press Publication.
Dantzig, G. B., & Fulkerson, D. R. (2003). On the max flow min cut theorem of networks. Linear Inequalities and Related Systems, 38, 225–231.
Dwork, C., Kumar, R., Naor, M., & Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th international conference on world wide web (pp. 613–622).
Even, S., & Tarjan, R. E. (1975). Network flow and testing graph connectivity. SIAM Journal on Computing, 4(4), 507–518.
Fagin, R., Kumar, R., & Sivakumar, D. (2003). Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (pp. 301–312).
García-Laencina, P. J., Sancho-Gómez, J.-L., & Figueiras-Vidal, A. R. (2010). Pattern classification with missing data: A review. Neural Computing and Applications, 19(2), 263–282.
Ghose, A., Ipeirotis, P. G., & Li, B. (2014). Examining the impact of ranking on consumer behavior and search engine revenue. Management Science, 60(7), 1632–1654.
GreatSchools. (2020). Greatschools rating. Accessed May 2020, available at https://www.greatschools.org/
GroupLens. (2022). Movielens 100k dataset. Accessed June 15, available at https://grouplens.org/datasets/movielens/100k/
Halbritter, G., & Dorfleitner, G. (2015). The wages of social responsibility-where are they? A critical review of ESG investing. Review of Financial Economics, 26, 25–35.
Harper, F. M., & Konstan, J. A. (2015). The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS), 5(4), 1–19.
Harzing, A.-W. (2020). Journal quality list. Accessed January 21, 2021, available at https://www.harzing.com
Hastie, T., Mazumder, R., Lee, J. D., & Zadeh, R. (2015). Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1), 3367–3402.
Hastie, T., & Mazumder, R. (2015). softImpute: Matrix completion via iterative soft-thresholded SVD (R package version 1.4).
Hazelkorn, E. (2007). The impact of league tables and ranking systems on higher education decision making. Higher Education Management and Policy, 19(2), 1–24.
Hibbard, J. H., Stockard, J., & Tusler, M. (2005). Hospital performance reports: Impact on quality, market share, and reputation. Health Affairs, 24(4), 1150–1160.
Huang, D. Z. X. (2021). Environmental, social and governance (ESG) activity and firm performance: A review and consolidation. Accounting & Finance, 61(1), 335–360.
Kim, J., Park, Y. W., & Williams, A. J. (2021). A mathematical programming approach for imputation of unknown journal ratings in a combined journal quality list. Decision Sciences, 52(2), 455–482.
Kowarik, A., & Templ, M. (2016). Imputation with the R package vim. Journal of Statistical Software, 74(7), 1–16.
LeapFrog Group. (2019). Spring 2019 LeapFrog hospital safety grade. Accessed May 2019, available at https://www.hospitalsafetygrade.org
Lerner, M. (2015). School quality has a mighty influence on neighborhood choice, home values. The Washington Post.
Lin, S. (2010). Rank aggregation methods. Wiley Interdisciplinary Reviews: Computational Statistics, 2(5), 555–570.
Lin, W.-C., & Tsai, C.-F. (2020). Missing value imputation: A review and analysis of the literature (2006–2017). Artificial Intelligence Review, 53(2), 1487–1509.
Luca, M., & Smith, J. (2013). Salience in quality disclosure: Evidence from the us news college rankings. Journal of Economics & Management Strategy, 22(1), 58–77.
MSCI. (2014). MSCI KLD 400 social index. Accessed March 2021, available at https://www.msci.com/msci-kld-400-social-index
Marlin, B. M., Zemel, R. S. (2009). Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM conference on recommender systems (pp. 5–12).
Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.
McBrayer, G. A. (2018). Does persistence explain ESG disclosure decisions? Corporate Social Responsibility and Environmental Management, 25(6), 1074–1086.
Niche. (2020). Niche rating. Accessed May 2020, available at https://www.niche.com/
Ozcan, Y. A., et al. (2008). Health care benchmarking and performance evaluation. Springer.
Patil, B. M., Joshi, R. C., & Toshniwal, D. (2010). Missing value imputation based on k-mean clustering with weighted distance. In International conference on contemporary computing (pp. 600–609). Springer.
Ramlatchan, A., Yang, M., Liu, Q., Li, M., Wang, J., & Li, Y. (2018). A survey of matrix completion methods for recommendation systems. Big Data Mining and Analytics, 1(4), 308–323.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
SchoolDigger. (2020). Schooldigger rating. Accessed May 2020, available at https://www.schooldigger.com/
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest: Non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.
Surroca, J., Tribó, J. A., & Waddock, S. (2010). Corporate responsibility and financial performance: The role of intangible resources. Strategic Management Journal, 31(5), 463–490.
Sustainalytics. (2014). About us–sustainalytics. Accessed March 2021, available at https://www.sustainalytics.com/about-us/
The Centers for Medicare & Medicaid Services. (2019). Hospital compare datasets. Accessed May 2019, available at https://data.medicare.gov/data/hospital-compare
Txschools.gov. (2020). Texas school rating. Accessed May 2020, available at https://txschools.gov/schools
U.S. News & World Report. (2019). 2018 - 2019 U.S. News best hospitals. Accessed June 2019, available at https://health.usnews.com/best-hospitals
Venkatesh, A. K., Bernheim, S. M., Qin, L., Bao, H., Simoes, J., Norton, E., Wing, M., Glennon, G., Shah, R., Herrin, J., Lin, H., Lin, Z., & Krumholz, H. M. (2017). Overall hospital quality star rating on hospital compare december 2017 updates and specifications report. available at https://s3.amazonaws.com/assets.fiercemarkets.net/public/004-Healthcare/StrRtg_Dec_QtrUpdtSpecRpt.pdf
Waddock, S. A., & Graves, S. B. (1997). The corporate social performance-financial performance link. Strategic Management Journal, 18(4), 303–319.
Acknowledgements
We thank Dr. Y. Zhuang for his technical assistance in collecting a data set. We also thank the editors and anonymous reviewers for their valuable suggestions and comments that improved the paper.
Funding
No funding information is available to report.
Author information
Authors and Affiliations
Contributions
YWP is the lead author of the paper; he initiated the project, collected most of the data sets, developed the initial models and algorithms, and performed the experiments. JK developed the estimatability concept and the relevant theorems. YWP and JK improved the algorithms and derived the optimality results. DZ prepared a data set and provided feedback throughout the whole project. All authors were involved in the writing of the paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Ulf Brefeld.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendices
Appendix A: Data preparation and description
Missing rating data frequently appear in practice. In the current research, we are particularly interested in data for which decision-makers or stakeholders can benefit from imputing the missing entries. For example, customers can use the imputed values to find a service provider that would likely offer better customer service. Institutions can use them to evaluate unrated subjects. Some companies use them to build intelligent recommendation systems. We compiled multiple real-world data sets from various applications: hospital rating, journal rating, environmental, social, and governance (ESG) rating, elementary and high school rating, and movie rating. In Sections A1 - A5, we describe the detailed background and our data collection and compiling procedures for each data set. Section A6 presents the conversion procedure for continuous rating scores. Finally, in Section A7, we present the detailed statistics and distribution of the missing values in the collected data sets.
1.1 A1: Journal data
The quality of journal publications has been adopted as one of the most significant criteria for faculty scholarship evaluation. The baseline indices are frequently defined based on several important factors, such as publication volume, citation metrics, acceptance rates, the Journal Impact Factor (JIF), H-index, SCImago Journal Rank (SJR) indicator, CiteScore, Source Normalized Impact per Paper (SNIP), and the Eigenfactor, among others. Despite the existence of such indices, these quality measures are too fragmentary to be adopted as sole quality quantifiers for a journal. Alternatively, some organizations have developed journal rating systems that determine journal ratings by considering various quality indices along with input from expert panels. Such journal rating lists in the business research domain include the Australian Business Deans Council (ABDC), the Association of Business Schools (ABS), the Erasmus Research Institute of Management (EJL), the European Journal of Information Systems (EJIS), and the High Council for Evaluation of Research and Higher Education (HCERES).
Many academic institutions adopt a journal quality list and use its ratings as quality scores for journal publications. Nevertheless, it is commonly observed that some quality journals are not included in the adopted journal list. This necessitates estimating the ratings of such journals consistently with the quality principles of the designated journal list.
In this study, we use the recent journal rating data from Harzing.com (Harzing, 2020), which provides ratings for a collection of RPs. All raw ratings, such as \(\{A^*, A, B, C\}\), are converted to numerical values \(\{4,3,2,1\}\) such that higher rating values indicate better quality. Note that Kim et al. (2021) also use an earlier version of the data from Harzing.com, although the data set differs slightly in terms of the rating values and the number of RPs and subjects. Hence, the results are not directly comparable to the results of the current paper.
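As a concrete illustration, the letter-to-number conversion described above can be sketched as follows. The mapping reflects the \(\{A^*, A, B, C\} \rightarrow \{4,3,2,1\}\) example from the text; using `None` for a missing rating is our own convention.

```python
# Sketch of the letter-to-number conversion described above. The mapping
# reflects the {A*, A, B, C} -> {4, 3, 2, 1} example; using None for a
# missing rating is our own convention.
RATING_MAP = {"A*": 4, "A": 3, "B": 2, "C": 1}

def convert_journal_ratings(raw_ratings):
    """Convert raw letter ratings to numeric values (higher = better)."""
    return [RATING_MAP[r] if r is not None else None for r in raw_ratings]

converted = convert_journal_ratings(["A*", "B", None, "C"])  # [4, 2, None, 1]
```

Each RP would use its own mapping table; the point is simply that all scales are oriented so that larger values indicate better quality.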
1.2 A2: Hospital rating data
Hospital administrators, healthcare consumers, and journalists all pay close attention to the release of general hospital ratings and their implications for healthcare quality improvement and data analytics. In the United States, these ratings may come from a variety of sources, such as U.S. News and Consumer Reports (Ozcan, 2008). Recently, hospital ratings have been publicly released by U.S. government agencies such as the Centers for Medicare & Medicaid Services (CMS) and other associated entities under the U.S. Department of Health and Human Services (HHS), sourced from the Agency for Healthcare Research and Quality (AHRQ). Close to 4,000 hospitals in the U.S. are assigned ratings of 1–5 stars (The Centers for Medicare & Medicaid Services, 2019). The Leapfrog Group, a U.S. nonprofit watchdog organization, also compiles data for informed purchasing in the healthcare industry to foster hospital transparency and value-based purchasing in the U.S. healthcare system. The evaluations are typically based on hundreds of different measures assessing the hospital facility, services, and equipment quality; a summarized rating is then composed according to the weights assigned to different categories (Venkatesh et al., 2017).
To create a combined hospital quality list, we collect ratings from four publicly available rating systems: HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems) Star Ratings (The Centers for Medicare & Medicaid Services, 2019), LeapFrog's Hospital Safety Grade (LeapFrog Group, 2019), the Centers for Medicare & Medicaid Services (The Centers for Medicare & Medicaid Services, 2019), and U.S. News's Best Hospitals (U.S. News & World Report, 2019). For notational convenience, we denote them as HCAHPS, LeapFrog, Medicare, and U.S. News, respectively. U.S. News provides ratings of hospitals for 16 specialties and nine procedures and conditions. Among these, we focus on the 12 specialties that are not exclusively based on reputation surveys and that have numerical scores, following Austin et al. (2015). The specialty scores are then standardized and averaged. The procedures for the other three sources of rating data (LeapFrog, Medicare, and HCAHPS) are relatively straightforward: A, B, C, D, and F are converted into 5, 4, 3, 2, and 1 for LeapFrog, while the 5-to-1 star ratings of Medicare (Hospital overall rating) and HCAHPS (Overall hospital rating - star rating) are used directly without modification. With these ratings from the four rating providers, we create a combined rating list by merging based on the available hospital name, address, city, state, and zip code. The combined rating data include 4,217 hospitals rated by four RPs.
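A minimal, hypothetical sketch of the merging step described above follows. Each rating list holds records with identifying fields plus a numeric rating, and records are joined on a normalized (name, address, city, state, zip) key; the field names and the normalization are illustrative, not the authors' exact procedure.

```python
# Hypothetical sketch of the merging step described above. Field names and
# the normalization are illustrative, not the authors' exact procedure.
LEAPFROG_MAP = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}  # letter-grade conversion

def hospital_key(rec):
    # Normalize identifying fields so the same hospital matches across lists.
    return tuple(rec[f].strip().lower() for f in ("name", "address", "city", "state", "zip"))

def merge_ratings(rating_lists):
    """rating_lists: dict of RP name -> list of records with a 'rating' field."""
    combined = {}
    for rp, records in rating_lists.items():
        for rec in records:
            combined.setdefault(hospital_key(rec), {})[rp] = rec["rating"]
    return combined  # key -> {RP: rating}; absent RPs are the missing entries
```

In the combined dictionary, a hospital missing from some RP simply has no entry for that RP, which is exactly the missing-rating pattern studied in the paper.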
1.3 A3: Environmental, social, and governance data
Environmental, social, and governance (ESG) assessments have become standard tools for measuring a firm's degree of involvement in socially responsible activities (Huang, 2021). ESG ratings have been widely used in various domains, including socially responsible investing, risk identification and management, and supplier selection and evaluation. Given the popularity of ESG measures in both academia and industry, several proprietary databases on the market specialize in ESG ratings for firms. For each category, firms are evaluated by a score derived from their performance on multiple indicators, which vary across ESG databases. The overall performance can also be evaluated with composite scores.
Among the multiple databases available, we include four databases in our data set: (1) Kinder, Lydenberg, Domini, and Company (KLD), which is now known as MSCI ESG, (2) Bloomberg ESG, (3) Sustainalytics, and (4) CSRHub. KLD (MSCI, 2014) measures ESG scores based on binary indicators (Halbritter & Dorfleitner, 2015) across three categories (environmental, social, and governance). However, only the binary indicators are reported, while category or composite scores are not reported. To obtain a single score for each category, we follow the common approach that computes the difference between the strengths and concerns (Waddock & Graves, 1997). The overall ESG score is then obtained by aggregating the scores of the three categories. The ratings in the Bloomberg ESG data (Bloomberg, 2014) primarily indicate the extent to which a firm discloses information regarding ESG. The Bloomberg ESG includes a firm’s overall ESG score and ratings in each of the three categories. The score is updated annually and is tailored based on the industry, with the scale for the scores ranging from 0.1 to 100 (McBrayer, 2018). Sustainalytics (Sustainalytics, 2014) includes a firm’s ratings in the environmental, social, and governance categories based on multiple indicators. Sustainalytics reports a weighted overall ESG score, where the weights are determined by the firm’s industry. The scale of this overall ESG score is from 0 to 100 (Surroca et al., 2010). A firm’s overall ESG score in CSRHub (CSRHub, 2014) is based on the weighted combinations of the four categories (i.e., community, employees, environment, and governance), with a scale from 0 to 100. CSRHub also reports the overall score every month. We use the simple average of the 12 months as the annual score.
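Two of the aggregation steps described above, the KLD strengths-minus-concerns score and the CSRHub 12-month average, can be sketched as follows (illustrative numbers only, not real KLD or CSRHub data):

```python
# Minimal sketch of two aggregation steps described above, with illustrative
# numbers (not real KLD or CSRHub data).
def kld_category_score(strengths, concerns):
    # Common approach: net category score = #strengths - #concerns.
    return strengths - concerns

def kld_overall(categories):
    # Overall KLD ESG score: aggregate the three category net scores.
    return sum(kld_category_score(s, c) for s, c in categories)

def csrhub_annual(monthly_scores):
    # CSRHub reports a monthly overall score; we use the 12-month simple average.
    return sum(monthly_scores) / len(monthly_scores)

score = kld_overall([(3, 1), (2, 2), (1, 0)])  # (3-1) + (2-2) + (1-0) = 3
```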
In this study, we focus on manufacturing firms (NAICS 31–33) in North America in 2014 from COMPUSTAT. The combined rating data use the year 2014 because KLD stopped reporting, after 2014, the total numbers of strengths and concerns for each category, which are used to calculate the overall KLD score. These ESG databases were merged based on the ticker symbol. It should be noted that, because information on the construction of the ESG scores is lacking, the missing values appear to occur randomly.
1.4 A4: Elementary and high school rating data
School rating systems have become widespread and serve various purposes. School administrators have adopted school ratings in school accountability reports as a baseline school performance measure to inspect their administrative strategies. This, in turn, provides guidelines for teachers to take action to improve their instruction and testing strategies. Highly rated schools also use school ratings in advertisements to attract good students. School quality, typically and simply represented by school ratings, is known to be a significant factor in house prices (Lerner, 2015). Students and parents often use school ratings to choose schools or school districts, which in turn provides incentives for school administrators to improve their pedagogical tactics.
Multiple organizations publish school ratings. For example, the Department of Education and its equivalents in state governments in the United States issue school ratings to compare school quality in the state. Some private organizations also issue school ratings for schools across the country. See Dalton (2017) for a taxonomy of school rating issuers.
Missingness in school rating systems occurs frequently. In a state school rating system, the missingness results from the system's geographical scope. For consumer-oriented rating systems, despite their wide span of subjects, there are missing subjects in the list because each rating issuer maintains its own exclusion rules. For example, Niche.com measures the completeness of data for individual schools, and if a school does not provide a sufficient amount of data, the school is disqualified from the Niche rating.
We prepare two data sets of elementary and high school ratings for Dallas, Texas. To create a combined school rating list, we collect ratings from five publicly available rating systems: GreatSchools (GreatSchools, 2020), Niche (Niche, 2020), SchoolDigger (SchoolDigger, 2020), TexasSchoolGuide (Children At Risk, 2020), and TexasSchoolRating (Txschools.gov, 2020). The original ratings of GreatSchools and SchoolDigger, on 10-to-1 and 5-to-0 scales, respectively, are converted into 5-to-1 and 6-to-1 scale ratings. The other RPs provide letter grades ({A+,...,D-} and {A+,...,D-,F}), which are converted into 4-to-1 and 5-to-1 scale ratings, respectively, based on the letter grade and ignoring the plus or minus. After creating a combined rating matrix by merging based on name and address, we keep the schools that have at least one rating available.
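The scale conversions above can be sketched as follows. Note that pairing adjacent levels of the 10-to-1 GreatSchools scale (1-2 to 1, ..., 9-10 to 5) is an assumption on our part; the text does not specify the exact binning.

```python
# Sketch of the scale conversions described above. Pairing adjacent levels of
# the 10-to-1 GreatSchools scale (1-2 -> 1, ..., 9-10 -> 5) is an assumption;
# the text does not specify the exact binning.
def greatschools_to_5(x):
    """10-to-1 scale -> 5-to-1 scale."""
    return (x + 1) // 2

def schooldigger_to_6(x):
    """5-to-0 scale -> 6-to-1 scale (shift by one)."""
    return x + 1

def letter_to_numeric(grade, scale):
    """Map the base letter, ignoring +/-; scale 'AD' gives 4-to-1, 'AF' gives 5-to-1."""
    order = "DCBA" if scale == "AD" else "FDCBA"
    return order.index(grade[0]) + 1
```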
1.5 A5: Movie rating data
The movie rating data sets have been studied extensively in the collaborative filtering and matrix completion literature and share a similar structure with other data sets presented in this study. In other words, the users and movies are rating providers and subjects, respectively, in the context of our research. A movie recommendation system could use the imputed values to build recommended movie lists customized for individual users.
We remark that there are a few differences between the movie rating data and the other test data. Perhaps the most striking distinction of the movie rating data from the other experimental data in this paper is its high missing rate. For example, the MovieLens data sets (Harper & Konstan, 2015; GroupLens, 2022) have missing rates ranging from 94.1% to 99.7%, while the other data sets in Table 1 have missing rates from 30.2% to 55.6%. Another critical difference is the size of the data. Even the smallest MovieLens data set includes thousands of movies and hundreds of users, which most imputation algorithms cannot handle. Lastly, the movie rating data have weaker user correlations than the other data sets we consider. Hence, the proposed imputation methods are unlikely to yield satisfactory imputations on the original form of the movie rating data. For these reasons, we customize the movie rating data to have sizes, missing rates, and inter-rater correlations similar to those of the other experimental data.
We use the MovieLens 100k data set (Harper & Konstan, 2015; GroupLens, 2022) to create a reduced data set with highly correlated rating providers. The original data include 100,000 ratings from 943 users on 1,682 movies, on a 1-to-5 rating scale. To create coherent ratings without too many missing values, we first select the users who rate at least 20% of the movies. Next, we compute the Kendall rank correlations and further reduce the user set by selecting the users whose average correlation is greater than 0.25. Finally, we keep the movies rated at least once by the final user set. The reduced data set with highly correlated users includes 12 users and 1,102 movies.
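The three-step reduction above can be sketched on a toy user-by-movie matrix (`None` marks a missing rating). The naive Kendall correlation below is computed over commonly rated movies; its exact handling of ties is our own choice, not necessarily the authors' implementation.

```python
# Sketch of the three-step user reduction described above, on a toy matrix.
def kendall_tau(x, y):
    # Kendall correlation over commonly rated items (tie pairs excluded).
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    num = den = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            s = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            if s != 0:
                num += 1 if s > 0 else -1
                den += 1
    return num / den if den else 0.0

def reduce_users(matrix, min_rated=0.2, min_avg_corr=0.25):
    m = len(matrix[0])
    # Step 1: keep users who rate at least min_rated of the movies.
    users = [u for u, row in enumerate(matrix)
             if sum(r is not None for r in row) >= min_rated * m]
    # Step 2: keep users whose average correlation exceeds min_avg_corr.
    kept = []
    for u in users:
        corrs = [kendall_tau(matrix[u], matrix[v]) for v in users if v != u]
        if corrs and sum(corrs) / len(corrs) > min_avg_corr:
            kept.append(u)
    return kept

toy = [
    [1, 2, 3, 4, 5],              # user 0
    [2, 3, 4, 5, None],           # user 1, concordant with user 0
    [None, None, None, None, 1],  # user 2, too few common ratings
]
kept = reduce_users(toy)          # users 0 and 1 survive
```

Step 3 of the reduction (keeping the movies rated at least once by the surviving users) is a straightforward column filter and is omitted here.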
1.6 A6: Conversion of continuous scores
Our study considers ordinal rating data. When an RP provides continuous scores instead of explicit rating categories, the scores can be converted. Given an RP and scores for all subjects rated by the RP, we partition the subjects into five groups based on their scores and assign 1–5 scale ratings, where 5 is the best and 1 is the worst. Let \(d_1,d_2,d_3,\) and \(d_4\) be the cutoffs defining the five ranges, which are defined below. Given rating provider j for subject i with a continuous score of \(s_{ij}\), the converted rating \(r_{ij}\) is determined by the following rule: \(r_{ij} = 1\) if \(s_{ij} \le d_1\); \(r_{ij} = k\) if \(d_{k-1} < s_{ij} \le d_k\) for \(k \in \{2,3,4\}\); and \(r_{ij} = 5\) if \(s_{ij} > d_4\).
Observe that the values of \(d_1\) - \(d_4\) significantly affect the distribution of the converted ratings. To resemble the distribution of the original scores, we use \((d_1,d_2,d_3,d_4) = (0.1,0.35, 0.65, 0.9)\) for U.S. News in the hospital data, which yields a rating distribution similar to a Normal distribution. For the ESG data, we divide the full range of the scores into equal-length ranges in order to preserve the distribution of the original scores. For rating provider j, we define \(s_j^{max}\) and \(s_j^{min}\) to be the 99th and 1st percentiles of \(s_{ij}\). With \(s_j^{range} = s_j^{max} - s_j^{min}\), we define \(d_k = s_j^{min} + \frac{k\cdot s_j^{range}}{5}\) for \(k \in \{1,2,3,4\}\).
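The equal-length-range conversion described for the ESG data might be implemented as follows; approximating the 1st/99th percentiles by simple order statistics is our own simplification, not necessarily the authors' exact method.

```python
# Sketch of the equal-length-range conversion described for the ESG data.
# Approximating the 1st/99th percentiles by order statistics is our own
# simplification, not necessarily the authors' exact method.
def convert_scores(scores):
    xs = sorted(scores)
    n = len(xs)
    s_min = xs[int(0.01 * (n - 1))]  # ~1st percentile
    s_max = xs[int(0.99 * (n - 1))]  # ~99th percentile
    score_range = s_max - s_min
    cutoffs = [s_min + k * score_range / 5 for k in (1, 2, 3, 4)]
    # Rating 1-5: one plus the number of cutoffs the score exceeds.
    return [1 + sum(s > d for d in cutoffs) for s in scores]

ratings = convert_scores(list(range(101)))  # ratings[0] == 1, ratings[100] == 5
```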
1.7 A7: Missing rates by row and column
In this section, we present detailed statistics on the missing rates of the data sets. Figures 9 and 10 present the missing rates by column and row, respectively, while Table 1 in Sect. 2.1 focuses only on summary statistics at the data set level. Hence, this section helps in understanding the distribution of missing rates over columns and rows.
Figure 9 visualizes missing rates for individual rating providers for all data sets, where the horizontal and vertical axes of each plot represent rating providers and missing rates in percentages, respectively. Each bar indicates the missing rate of the corresponding RP. For example, 16.1% for HCAHPS in Fig. 9a indicates that 16.1% of the 4,217 hospitals have missing ratings from HCAHPS. The missing rates vary across RPs and data sets, from 3.8% to 81.8%. Even within a single data set the rates can vary widely: the ESG data set has RPs with missing rates from 3.8% to 78.4%.
Figure 10 visualizes the proportions of subjects in percentages by the number of observed ratings. Each bar represents the percentage of subjects with the corresponding number of available ratings. For example, in Fig. 10a, 21.0% and 41.3% of the 4,217 hospitals are rated by only one RP and by all four RPs, respectively. The percentages of fully rated subjects also vary across data sets, from 11.4% to 41.3%. Among the data sets, the proportion of subjects increases with the number of available ratings for the Hospital and Journal data sets, whereas it generally decreases with the number of available ratings for the other data sets.
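The two summaries underlying Figs. 9 and 10 (the missing rate per rating provider, and the distribution of subjects by number of observed ratings) can be computed as in the following sketch, with `None` marking a missing rating:

```python
# Sketch of the two summaries underlying Figs. 9 and 10: missing rate per
# rating provider (column) and the number of subjects (rows) by how many
# observed ratings they have. None marks a missing rating.
def missing_rate_by_rp(matrix):
    n = len(matrix)
    m = len(matrix[0])
    return [sum(matrix[i][j] is None for i in range(n)) / n for j in range(m)]

def subjects_by_num_ratings(matrix):
    m = len(matrix[0])
    counts = [0] * (m + 1)  # counts[k] = #subjects with exactly k observed ratings
    for row in matrix:
        counts[sum(r is not None for r in row)] += 1
    return counts
```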
Appendix B: Proofs of lemmas and theorems
Proof of Lemma 1
Suppose \(E_v=\emptyset\). This means that there is no \(2\times 2\) submatrix such that one entry is not level-u estimatable for any \(u\le v-1\) while each of the three remaining entries is either observed or level-u estimatable for some \(u\le v-1\). Because \(E_v=\emptyset\), the construction of level-\((v+1)\) estimatable entries must be based only on level-u estimatable entries with \(u\le v-1\). Because no such \(2\times 2\) submatrix exists, \(E_{v+1}=\emptyset\). Therefore, if \(E_v=\emptyset\), then \(E_u=\emptyset\) for all \(u\ge v+1\). \(\square\)
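For intuition, the level-by-level construction referenced in the proof can be sketched as a fixed-point computation: a missing entry becomes estimatable once it completes a \(2\times 2\) submatrix whose other three entries are observed or already estimatable, and the iteration stops at the first level that adds nothing, exactly as Lemma 1 asserts. This is an illustrative sketch, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation) of the level-by-level
# construction of estimatable entries: iterate until no level adds anything.
def estimatable_closure(observed, n, m):
    """observed: set of (row, col) known entries. Returns the estimatable entries."""
    known = set(observed)
    while True:
        new = set()
        for i in range(n):
            for j in range(m):
                if (i, j) in known:
                    continue
                # (i, j) completes a 2x2 submatrix with three known corners.
                if any((i, jj) in known and (ii, j) in known and (ii, jj) in known
                       for ii in range(n) for jj in range(m)
                       if ii != i and jj != j):
                    new.add((i, j))
        if not new:                 # E_v empty => all later levels empty (Lemma 1)
            return known - set(observed)
        known |= new
```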
Proof of Theorem 1
Denote the distinct rating providers in C by \(j_1,\dots ,j_s\). Then, after relabeling, we can assume that \(j_1 - j_2 - \dots - j_s\) forms a connected path. For any \(j=1,\dots ,s-1\), because RPs j and \(j+1\) are connected, there exists a subject S rated by both j and \(j+1\). Therefore, for an arbitrary \(S'\in \mathcal {E}(j)\), by the existence of a commonly observed subject \(S\in \mathcal {O}(j)\cap \mathcal {O}(j+1)\) and the definition of estimatability, we have \(S'\in \mathcal {E}(j+1)\), showing that \(\mathcal {E}(j)\subseteq \mathcal {E}(j+1)\). Similarly, the opposite inclusion also holds, showing that \(\mathcal {E}(1)=\dots =\mathcal {E}(s)\). Then, for each \(l \in \{1,\dots ,s\}\),
where the first equality holds by the definition of \(\mathcal {O}(C)\), the inclusion holds because \(\mathcal {O}(j)\subseteq \mathcal {E}(j)\) for all \(j=1,\dots ,s\), and the second equality holds because \(\mathcal {E}(1)=\dots =\mathcal {E}(s)\). Finally, consider an arbitrary entry \((i'', j'')\in \mathcal {O}(C)\times C\). Then, \(j''=l\) for some \(l\in \{1,\dots ,s\}\). Therefore, \(i''\in \mathcal {O}(C)\subseteq \mathcal {E}(l)=\mathcal {E}(j'')\), showing that \((i'', j'')\) is estimatable. \(\square\)
Proof of Theorem 2
Because the “if” part directly follows from Theorem 1, it suffices to show the “only if” part. Suppose the graph is disconnected. Let \(C_1,\dots ,C_r\) be all the connected components of the graph, where \(r\ge 2\). Consider the r submatrices of the input data, where each submatrix \(X_i\) corresponds to the rating providers in \(C_i\) and the subjects in \(\mathcal {O}(C_i)\) for \(i=1,\dots ,r\). By Theorem 1, all the entries in \(D:=\bigcup _{i=1}^r X_i\) are either observed or estimatable. Moreover, because \(\{C_i\}_{i=1}^r\) are mutually exclusive, so are \(\{X_i\}_{i=1}^r\) and \(\{\mathcal {O}(C_i)\}_{i=1}^r\), respectively. Moreover, by Assumption 1, \(\mathcal {O}(C_1),\dots ,\mathcal {O}(C_r)\) partition the set of all subjects. This, in turn, shows that \(\mathcal {O}(C_i)\subseteq \mathcal {E}(C_i)\) for all \(i=1,\dots ,r\). Consider data entries outside D. We claim that none of these entries are estimatable. To prove this by contradiction, consider an entry \(e\notin D\) that is included in the first estimatable batch outside D in the sequential construction of estimatable entries. In other words, the estimatability of e is derived from three corner entries in D. Suppose this entry e is associated with subject i and rating provider j. Let C be the connected component that includes j. Then, there exists a subject \(i'\ne i\) and a rating provider \(j'\ne j\) such that the \((i,j'), (i',j)\), and \((i',j')\) entries are estimatable and included in D. We first show that j and \(j'\) are not in the same connected component. Assume \(j'\in C\). Because \((i, j')\in D\) and \(j'\in C\), it holds that \(i \in \mathcal {O}(C)\), implying that \((i,j)\in D\), a contradiction. Therefore, we assume that \(j'\in C'\) for some connected component \(C'\) other than C. Let \(X'\) be the submatrix associated with \(C'\) and \(\mathcal {O}(C')\). Because \((i', j')\in D\) and \(j'\in C'\), it holds that \(i'\in \mathcal {O}(C')\).
Similarly, we can show that \(i'\in \mathcal {O}(C)\) because \((i',j)\in D\) and \(j\in C\). This contradicts the fact that \(\mathcal {O}(C)\) and \(\mathcal {O}(C')\) are disjoint. Therefore, no such entry e exists, and the sequential construction of estimatable entries terminates after the construction of D. Therefore, the input data set is unestimatable. \(\square\)
Proof of Theorem 3
To prove the “if" part, assume \(I = \bigcup _{j\in N_{j'}} I_j\) for all \(j'\in J\) and consider an arbitrary missing entry \((i', j')\). Then, there exists \(j\in N_{j'}\) such that \(i'\in I_j\) and \(j \ne j'\). Because \(j\in N_{j'}\), there exists \(i\in I\) such that \(i \ne i'\) and \(i\in I_j \cap I_{j'}\). Hence, for the missing entry \((i', j')\), we have ratings available at \((i', j)\), (i, j), and \((i, j')\). This implies that \((i', j')\) is level-1 estimatable. Applying this procedure for all missing entries in X, we can show that X is level-1 estimatable.
We next prove the “only if” part. Assume that X is level-1 estimatable and consider an arbitrary rating provider \(j'\). By the definition of level-1 estimatability, for each missing entry \((i', j')\) there exist \(i''\) and \(j''\) such that ratings are available at \((i'',j')\), \((i',j'')\), and \((i'',j'')\). This implies that \(j'' \in N_{j'}\) and \(i' \in I_{j''}\). By combining this result over all missing entries of \(j'\), we can show that \(I = \bigcup _{j\in N_{j'}} I_j\) holds for \(j'\). Applying this procedure to all RPs in J completes the proof. \(\square\)
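The covering condition of Theorem 3 is easy to test directly. Below is a sketch on a boolean observation mask, where \(I_j\) is the set of subjects rated by RP j and \(N_{j'}\) collects the RPs sharing at least one rated subject with \(j'\) (we include \(j'\) itself, which is harmless for the check); the function name and mask layout are our own.

```python
def is_level1_estimatable(mask):
    """Theorem 3's condition: X is level-1 estimatable iff, for every
    RP j', the union of I_j over j in N_{j'} covers all subjects.
    mask[i][j] is True iff subject i is rated by RP j."""
    m, n = len(mask), len(mask[0])
    I = [{i for i in range(m) if mask[i][j]} for j in range(n)]  # I_j
    for jp in range(n):
        # N_{j'}: RPs sharing at least one rated subject with j'.
        neighbors = [j for j in range(n) if I[j] & I[jp]]
        covered = set().union(*(I[j] for j in neighbors)) if neighbors else set()
        if covered != set(range(m)):
            return False
    return True
```

For instance, with three subjects and two RPs rating {0,1} and {1,2}, every missing entry has the required corner structure, so the check passes; two RPs with disjoint subject sets fail it.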
Proof of Theorem 4
Consider each term (8) of the objective function. Its squared part is convex because it is a composition of a convex function (\(f(x)=x^2\)) with an affine function. Then, (8) is also convex because it is a nonnegative scalar multiple of a convex function. Next, assume that the data set is estimatable. Because the input data X then includes at least one level-1 estimatable missing entry, there exists a term of the form (8) in which three ratings are known and one rating is unknown. This term is a univariate quadratic function with a positive leading coefficient and is therefore strongly convex. Finally, because the sum of a strongly convex function and a convex function is strongly convex, the objective function of (4) is strongly convex. \(\square\)
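As a one-variable illustration of the strong convexity step (abstracting the exact form of (8): here \(t\) denotes the affine combination of the three known ratings and \(w>0\) the coefficient of the term, both generic placeholders rather than the paper's notation), the term with a single unknown rating \(x\) reduces to

\[
g(x) = w\,(x - t)^2, \qquad g''(x) = 2w > 0,
\]

so \(g\) is strongly convex with modulus \(2w\), and adding the remaining (merely convex) terms of the objective preserves strong convexity.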
Appendix C: Analysis on probability of estimatability
In this section, we present an analysis of the estimatability of the data sets generated by Algorithm 1. Based on the graph representation in Sect. 3.1, we calculate the probability that a generated data set is estimatable. To this end, we must check whether the corresponding graph is connected, which in turn depends on the edge connection probabilities.
Edge connection probability depends on the missing rate r and the number of subjects (rows) m. Following Algorithm 1, let us assume that all RPs (columns) have the same missing rate and that the number of missing values in each RP is \(\lfloor rm \rfloor\), so that each RP observes \(k := m - \lfloor rm \rfloor\) subjects. Two RPs are joined by an edge if they rate at least one common subject, and the probability that two RPs share no observed subject is \(\binom{m-k}{k}/\binom{m}{k}\). Hence, the edge connection probability, denoted as \(P_{edge}\), can be derived as follows:
\[
P_{edge} = 1 - \frac{\binom{m-k}{k}}{\binom{m}{k}}.
\]
The edge connection probabilities \(P_{edge}\) are calculated over various r and m in Table 10. Note that the edge connection probabilities are reasonably high even if the missing rate r is as high as 0.8.
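Assuming an edge joins two RPs exactly when they rate at least one common subject, and that each RP's \(\lfloor rm \rfloor\) missing rows are drawn uniformly at random (our reading of Algorithm 1), \(P_{edge}\) can be evaluated with a short sketch; the function name is ours.

```python
from math import comb, floor

def p_edge(r: float, m: int) -> float:
    """Probability that two RPs share at least one commonly rated
    subject, when each RP observes k = m - floor(r*m) rows chosen
    uniformly at random (assumption of this sketch)."""
    k = m - floor(r * m)          # observed subjects per RP
    if k <= 0:
        return 0.0                # nothing observed: no edge possible
    # Hypergeometric argument: no overlap means the second RP's k
    # observed rows all fall in the m - k rows the first RP misses.
    return 1.0 - comb(m - k, k) / comb(m, k)
```

For \(r=0.8\) and \(m=50\) this gives roughly 0.92, consistent with the remark above that \(P_{edge}\) stays reasonably high even at \(r=0.8\).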
The graph connectivity depends on the edge connection probability (\(P_{edge}\)) and the number of RPs (n). However, it is difficult to derive the connection probability analytically from \(P_{edge}\) and n. On the other hand, it is easy to check whether a generated graph is connected. Therefore, to approximate the graph connection probability, denoted as \(P_{connect}\), we generate 10,000 random graphs for each \((P_{edge},n)\) pair and estimate the probability by the fraction of connected graphs. For this task, we use the networkx.is_connected function in Python. The simulation results are presented in Table 11.
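The simulation can be reproduced with a short Monte Carlo sketch. The paper uses networkx.is_connected; to keep the sketch dependency-free we substitute a small union-find connectivity check, which is equivalent for this purpose (the function names are ours).

```python
import random

def _connected(n, edges):
    """Union-find connectivity test for an undirected graph on n nodes."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(i) for i in range(n)}) == 1

def estimate_p_connect(p_edge, n, trials=10_000, seed=0):
    """Fraction of Erdos-Renyi G(n, p_edge) graphs that are connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if rng.random() < p_edge]
        hits += _connected(n, edges)
    return hits / trials
```

At the extremes the estimate behaves as expected: every complete graph is connected and every empty graph (with \(n\ge 2\)) is not.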
As expected, \(P_{connect}\) increases rapidly as \(P_{edge}\) or n increases. Even for a small edge probability of 0.2, the graph connectivity is over 0.9, meaning that Algorithm 1 generates an estimatable data set nine times out of ten. Combining the results in Tables 10 and 11, we conclude that Algorithm 1 generates an estimatable data set with high probability for most realistic values of r, m, and n. For example, if \(r = 0.9\), \(m = 50\), and \(n = 4\), then \(P_{connect} = 0.8619\); by monotonicity, Algorithm 1 generates an estimatable data set with probability at least 0.8619 whenever \(r \le 0.9\), \(m \ge 50\), and \(n \ge 4\). Given this high success rate, one can simply rerun Algorithm 1 until an estimatable data set is generated.
Appendix D: Tests for robustness of algorithm
The proposed algorithms, QP-AS and dQP-SVAS, are deterministic: given a fixed data set, they always return the same set of imputed values and thus do not address the uncertainty inherent in missing value imputation. To check the robustness of the proposed algorithms over varying subsets of the samples, in this section we present multiple imputation versions of QP-AS and dQP-SVAS. For each imputation trial, we randomly sample 80% of the rows (subjects) in the data and run QP-AS (or dQP-SVAS). The iterations end when every row has been sampled at least 10 times. We refer to these algorithms as QP-ASMI and dQP-SVASMI.
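The resampling loop can be sketched as follows; impute_fn is a hypothetical stand-in for one run of QP-AS or dQP-SVAS on the sampled rows, and the per-sample estimatability check discussed next is omitted here.

```python
import random

def multiple_imputation(n_rows, impute_fn, frac=0.8, min_count=10, seed=0):
    """Repeatedly sample frac of the rows (subjects) and impute, stopping
    once every row has been sampled at least min_count times."""
    rng = random.Random(seed)
    counts = [0] * n_rows
    per_sample_results = []
    k = max(1, int(frac * n_rows))
    while min(counts) < min_count:
        rows = sorted(rng.sample(range(n_rows), k))
        for i in rows:
            counts[i] += 1
        per_sample_results.append(impute_fn(rows))  # one imputation run
    return per_sample_results, counts
```

The stopping rule guarantees that every row contributes to at least ten imputation runs, so per-entry statistics over the runs are well defined.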
We remark that each sampled data set must itself be estimatable: even when the full data set is estimatable, a sampled data set may not be. Hence, we ensure that every sampled data set is estimatable before calling QP-AS or dQP-SVAS. For each algorithm and data set, we use the following two performance metrics.
1. %ZeroSD: the percentage of missing entries that receive the same imputed value over all samples.
2. AvgSD: the average standard deviation of the imputed values across samples.
By checking these two metrics, we can measure how consistent the imputed values are (%ZeroSD) and, when they are not consistent, how much they vary (AvgSD).
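Given the imputed values collected per missing entry across samples, the two metrics can be computed as in this sketch (the dict layout is an assumption of ours):

```python
from statistics import pstdev

def consistency_metrics(imputed):
    """Compute (%ZeroSD, AvgSD) from `imputed`, a dict mapping each
    missing entry (i, j) to the list of values imputed for it across
    all samples that contained row i."""
    sds = [pstdev(vals) for vals in imputed.values()]
    pct_zero_sd = 100.0 * sum(sd == 0.0 for sd in sds) / len(sds)
    avg_sd = sum(sds) / len(sds)
    return pct_zero_sd, avg_sd
```

For example, an entry always imputed as 3 contributes to %ZeroSD, while an entry imputed as 2 in one sample and 4 in another contributes a standard deviation of 1 to AvgSD.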
In Table 12, we examine the variation of the imputed values over multiple samples using the real experimental data sets in Table 3. For each algorithm, #Samples is the number of samples drawn until the termination criterion is met. The two performance metrics and #Samples are reported for each data set for the two algorithms QP-ASMI and dQP-SVASMI. Observe that at least 77% and 73% of the imputed values are identical across samples for QP-ASMI and dQP-SVASMI, respectively. That is, the proposed algorithms return the same imputed values for over 70% of the missing entries. Further, the average standard deviations of the imputed values are mostly below 0.1. The results show that both algorithms return consistent imputed values even when the samples differ.
The two algorithms, QP-ASMI and dQP-SVASMI, are also compared against their deterministic versions, QP-AS and dQP-SVAS, in Table 13. As in the main section, gray cells and boldface indicate, respectively, the best result and within-5% gaps among all algorithms in Table 4. For example, no algorithm in Table 13 has boldface or gray cells for the Journal10F data set, because none of the four algorithms comes close to the best result (missForest) in Table 4. The results show that QP-AS and dQP-SVAS are slightly better than QP-ASMI and dQP-SVASMI, while the multiple imputation versions return solutions of similar quality in terms of the relative gap from the best. There is only one case (MAD on the Movielens10F data set) where either QP-ASMI or dQP-SVASMI beats the best solution among all benchmarks, including QP-AS and dQP-SVAS. On the other hand, the computation times of QP-ASMI and dQP-SVASMI are much longer. Hence, considering both prediction performance and computation time, we conclude that the deterministic versions are preferable.
Park, Y.W., Kim, J. & Zhu, D. Discordance minimization-based imputation algorithms for missing values in rating data. Mach Learn 113, 241–279 (2024). https://doi.org/10.1007/s10994-023-06452-4