Identifying exceptional (dis)agreement between groups

Adnene Belfodil¹,
Sylvie Cazalens¹,
Philippe Lamarre¹ &
…
Marc Plantevit ORCID: orcid.org/0000-0003-4636-5753²

442 Accesses
1 Citation
Explore all metrics

Abstract

Under the term behavioral data, we consider any type of data featuring individuals performing observable actions on entities. For instance, voting data depict parliamentarians who express their votes w.r.t. legislative procedures. In this work, we address the problem of discovering exceptional (dis)agreement patterns in such data, i.e., groups of individuals that exhibit an unexpected (dis)agreement under specific contexts compared to what is observed in overall terms. To tackle this problem, we design a generic approach, rooted in the Subgroup Discovery/Exceptional Model Mining framework, which enables the discovery of such patterns in two different ways. A branch-and-bound algorithm ensures an efficient exhaustive search of the underlying search space by leveraging closure operators and optimistic estimates on the interestingness measures. A second algorithm abandons the completeness by using a sampling paradigm which provides an alternative when an exhaustive search approach becomes unfeasible. To illustrate the usefulness of discovering exceptional (dis)agreement patterns, we report a comprehensive experimental study on four real-world datasets relevant to three different application domains: political analysis, rating data analysis and healthcare surveillance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 3

Fig. 8

Fig. 9

DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups

Discovering Local Subgroups, with an Application to Fraud Detection

Exceptional Model Mining

Article 04 February 2015

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Notes

http://parltrack.euwiki.org/, last access on October 15, 2019.
https://voteview.com/data, last access on October 25, 2019.
Since the majorities of $\langle u_1,u_2 \rangle $ voted respectively on $\{e_1,e_2,\underline{e_3},e_4,\underline{e_5},\underline{e_6}\}$ as follows: ${ \langle \mathrm {For}, \mathrm {For}\rangle }, { \langle \mathrm {Against}, \mathrm {Against}\rangle }, \underline{ \langle {\textit{Against}}, {\textit{For}}\rangle }, { \langle \mathrm {For}, \mathrm {For}\rangle }, \underline{ \langle {\textit{For}}, {\textit{Against}}\rangle }, \underline{ \langle \mathrm {Against}, \mathrm {Against}\rangle } $.
DEBuNk stands for Discovering Exceptional inter-group Behavior patterNs.
http://contentcheck.liris.cnrs.fr, last access on October 25, 2019.
Different quality measures are proposed in Sect. 3.
Roll-Call vote is a voting system where the vote of each member is recorded, such as http://www.europarl.europa.eu (EU parliament, last access on October 25, 2019) or https://voteview.com (US Congresses, last access on October 25, 2019).
o(i, e) returns the outcome (e.g., vote, rating) expressed by an individual i (e.g., parliamentarian, user) to an entity e (e.g., legislative procedure, movie) if given.
$(a_{j}\in [v_{1}\ldots w_{1}]) \wedge _j (a_{j}\in [v_{2}\ldots w_{2}])=a_{j}\in [min(v_{1}, v_{2}) \ldots max(w_{1}, w_{2})]$ for numerical attributes . For categorical attributes $(a_{j}=v_1) \wedge _j (a_{j}=v_{2}) = v_1$ if $v_1=v_2$ else $\mathrm {true}_{j}$.
Note that this does not require S to have a corresponding description in ${\mathcal {D}}_{E}$.
Cartesian product of the m lattices related to the attributes’ conditions spaces forms a lattice (Roman 2008).
For the sake of brevity and clarity, the reader is referred to Belfodil et al. (2019) for technical details about the computation of $|\downarrow r^g_j|$ and the uniform sampling of conditions $r_j$ from $\downarrow r^g_j$ for each attribute $a_{j}$.
https://github.com/Adnene93/DEBuNk, last access on October 25, 2019.
https://parltrack.org/, last access on October 15, 2019.
https://grouplens.org/datasets/movielens/100k/, last access on April 24, 2017.
https://www.yelp.com/dataset/challenge, last access on April 25, 2017.
http://open-data-assurance-maladie.ameli.fr/, last access on November 16, 2017.
Ameli —France National Health Insurance and Social Security Organization.
The Anatomical Therapeutic Chemical classification system classifies therapeutic drugs according to the organ or system on which they act and their chemical, pharmaco-logical and therapeutic properties—https://www.whocc.no/atc/structure_and_principles/, last access on October 18, 2019.
http://contentcheck.liris.cnrs.fr, last access October 25, 2019.
https://www.idf.org/component/attachments/attachments.html?id=811&task=download [Page=97], last access on October 18, 2019.
https://bit.ly/2zuorQK, last access on October 18, 2019.
Since common SD techniques require flat representations of the underlying dataset augmented with a target attribute, we have proposed two adaptations: SD-Majority for discovering (dis)agreement with the majority and SD-Cartesian for discovering (dis)agreement between two groups on the cartesian product $G_{E}^{} \times G_{I}^{} \times G_{I}^{}$. In both of the aformentioned adaptations, the target is equal to 1 if there is an agreement, 0 else. Experiments are performed using PySubgroup (Lemmerich and Becker 2018) while utilizing the precision gain (Fürnkranz et al. 2012) as a quality measure. Moreover, to take into account the usual agreement between groups, we adapt Exceptional Subgraph Mining (Kaytoue et al. 2017) to discover contextual (dis)agreement in subgraphs representing individuals group pairs.
27 runs for each method by varying $(|{\mathcal {A}}_E|,|{\mathcal {A}}_I|,\sigma _\varphi ) \in \left[ \left[ 1,2,3\right] ,\left[ 1,2,3\right] ,\left[ 0.2,0.4,0.6\right] \right] $.
81 runs by varying $(k,|{\mathcal {A}}_E|,|{\mathcal {A}}_I|,\sigma _\varphi ) \in \left[ \left[ 10,50,100\right] ,\left[ 1,2,3\right] ,\left[ 1,2,3\right] ,\left[ 0.2,0.4,0.6\right] \right] $.
http://www.oecd.org/gov/digital-government/open-government-data.htm, last access on October 25, 2019.

References

Amelio A, Pizzuti C (2012) Analyzing voting behavior in italian parliament: group cohesion and evolution. In: International conference on advances in social networks analysis and mining, ASONAM 2012, Istanbul, Turkey, 26–29 August 2012, pp 140–146. https://doi.org/10.1109/ASONAM.2012.33
Amer-Yahia S, Kleisarchaki S, Kolloju NK, Lakshmanan LVS, Zamar RH (2017) Exploring rated datasets with rating maps. In: Proceedings of the 26th international conference on World Wide Web, WWW 2017, Perth, Australia, April 3–7, 2017, pp 1411–1419. https://doi.org/10.1145/3038912.3052623
Atzmueller M (2015) Subgroup discovery. Wiley Interdiscip Rev Data Min Knowl Discov 5(1):35–49. https://doi.org/10.1002/widm.1144
Article Google Scholar
Atzmüller M, Puppe F (2006) Sd-map—a fast algorithm for exhaustive subgroup discovery. In: Knowledge discovery in databases: PKDD 2006, 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, September 18–22, 2006, Proceedings, pp 6–17. https://doi.org/10.1007/11871637_6
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246. https://doi.org/10.1023/A:1011429418057
Article MATH Google Scholar
Belfodil A, Cazalens S, Lamarre P, Plantevit M (2017) Flash points: discovering exceptional pairwise behaviors in vote or rating data. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II, pp 442–458. https://doi.org/10.1007/978-3-319-71246-8_27
Belfodil A, Cazalens S, Lamarre P, Plantevit M (2019) Identifying exceptional (dis)agreement between groups. Technical report, LIRIS UMR CNRS 5205. https://contentcheck.liris.cnrs.fr/public/technical_report_2019_02.pdf
Bendimerad AA, Cazabet R, Plantevit M, Robardet C (2017) Contextual subgraph discovery with mobility models. In: Complex networks and their applications VI—proceedings of complex networks 2017 (The sixth international conference on complex networks and their applications), Complex networks 2017, Lyon, France, November 29–December 1, 2017, pp 477–489. https://doi.org/10.1007/978-3-319-72150-7_39
Bendimerad AA, Plantevit M, Robardet C (2016) Unsupervised exceptional attributed sub-graph mining in urban data. In: IEEE 16th international conference on data mining, ICDM 2016, December 12–15, 2016, Barcelona, Spain, pp 21–30. https://doi.org/10.1109/ICDM.2016.0013
Boley M, Horváth T, Poigné A, Wrobel S (2010b) Listing closed sets of strongly accessible set systems with applications to data mining. Theor Comput Sci 411(3):691–700. https://doi.org/10.1016/j.tcs.2009.10.024
Article MathSciNet MATH Google Scholar
Boley M, Gärtner T, Grosskreutz H (2010a) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of the SIAM international conference on data mining, SDM 2010, April 29–May 1, 2010, Columbus, Ohio, USA, pp 177–188. https://doi.org/10.1137/1.9781611972801.16
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, USA, August 21–24, 2011, pp 582–590. https://doi.org/10.1145/2020408.2020500
Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: The 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12, Beijing, China, August 12–16, 2012, pp 69–77. https://doi.org/10.1145/2339530.2339545
Bosc G, Boulicaut J, Raïssi C, Kaytoue M (2018) Anytime discovery of a diverse set of patterns with monte carlo tree search. Data Min Knowl Discov 32(3):604–650. https://doi.org/10.1007/s10618-017-0547-5
Article MathSciNet MATH Google Scholar
Bosc G, Golebiowski J, Bensafi M, Robardet C, Plantevit M, Boulicaut J, Kaytoue M (2016) Local subgroup discovery for eliciting and understanding new structure-odor relationships. In: Discovery science—19th international conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings, pp 19–34. https://doi.org/10.1007/978-3-319-46307-0_2
Charalabidis Y, Alexopoulos C, Loukis E (2016) A taxonomy of open government data research areas and topics. J Organ Comput Electron Commer 26(1–2):41–63
Article Google Scholar
Csisz I et al (1967) Information-type measures of difference of probability distributions and indirect observations. Stud Sci Math Hungar 2:299–318
MathSciNet Google Scholar
Das M, Amer-Yahia S, Das G, Yu C (2011) MRI: meaningful interpretations of collaborative ratings. PVLDB 4(11):1063–1074
Google Scholar
de Sá CR, Duivesteijn W, Azevedo PJ, Jorge AM, Soares C, Knobbe AJ (2018) Discovering a taste for the unusual: exceptional models for preference mining. Mach Learn 107(11):1775–1807. https://doi.org/10.1007/s10994-018-5743-z
Article MathSciNet MATH Google Scholar
de Sá CR, Duivesteijn W, Soares C, Knobbe AJ (2016) Exceptional preferences mining. In: Discovery science—19th international conference, DS 2016, Bari, Italy, October 19–21, 2016, Proceedings, pp 3–18. https://doi.org/10.1007/978-3-319-46307-0_1
Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, USA, August 15–18, 1999, pp 43–52. https://doi.org/10.1145/312129.312191
Downar L, Duivesteijn W (2017) Exceptionally monotone models–the rank correlation model class for exceptional model mining. Knowl Inf Syst 51(2):369–394. https://doi.org/10.1007/s10115-016-0979-z
Article Google Scholar
Duivesteijn W, Feelders A, Knobbe AJ (2016) Exceptional model mining—supervised descriptive local pattern mining with complex target concepts. Data Min Knowl Discov 30(1):47–98. https://doi.org/10.1007/s10618-015-0403-4
Article MathSciNet MATH Google Scholar
Duivesteijn W, Knobbe AJ, Feelders A, van Leeuwen M (2010) Subgroup discovery meets bayesian networks—an exceptional model mining approach. In: ICDM 2010, the 10th IEEE international conference on data mining, Sydney, Australia, 14–17 December 2010, pp 158–167. https://doi.org/10.1109/ICDM.2010.53
Dzyuba V, van Leeuwen M, Raedt LD (2017) Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Discov 31(5):1266–1293. https://doi.org/10.1007/s10618-017-0501-6
Article MathSciNet MATH Google Scholar
Etter V, Herzen J, Grossglauser M, Thiran P (2014) Mining democracy. In: Proceedings of the second ACM conference on Online social networks, COSN 2014, Dublin, Ireland, October 1–2, 2014, pp 1–12. https://doi.org/10.1145/2660460.2660476
Fürnkranz J, Gamberger D, Lavrac N (2012) Foundations of rule learning. Cognitive technologies. Springer, Berlin. https://doi.org/10.1007/978-3-540-75197-7
Book MATH Google Scholar
Ganter B, Wille R (1999) Formal concept analysis—mathematical foundations. Springer, Berlin. https://doi.org/10.1007/978-3-642-59830-2
Book MATH Google Scholar
Ganter B, Kuznetsov SO (2001) Pattern structures and their projections. In: Delugach HS, Stumme G (eds) Conceptual structures: broadening the base, 9th international conference on conceptual structures, ICCS 2001, Stanford, CA, USA, July 30–August 3, 2001, Proceedings, Springer, Lecture notes in computer science, vol 2120, pp 129–142. https://doi.org/10.1007/3-540-44583-8_10
Giacometti A, Soulet A (2016) Frequent pattern outlier detection without exhaustive mining. In: Advances in knowledge discovery and data mining—20th Pacific-Asia conference, PAKDD 2016, Auckland, New Zealand, April 19–22, 2016, Proceedings, Part II, pp 196–207. https://doi.org/10.1007/978-3-319-31750-2_16
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226. https://doi.org/10.1007/s10618-009-0136-3
Article MathSciNet Google Scholar
Grosskreutz H, Boley M, Krause-Traudes M (2010) Subgroup discovery for election analysis: a case study in descriptive data mining. In: Discovery science—13th international conference, DS 2010, Canberra, Australia, October 6–8, 2010. Proceedings, pp 57–71. https://doi.org/10.1007/978-3-642-16184-1_5
Grosskreutz H, Lang B, Trabold D (2013) A relevance criterion for sequential patterns. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2013, Prague, Czech Republic, September 23–27, 2013, Proceedings, Part I, pp 369–384. https://doi.org/10.1007/978-3-642-40988-2_24
Grosskreutz H, Rüping S, Wrobel S (2008) Tight optimistic estimates for fast subgroup discovery. In: Machine learning and knowledge discovery in databases, European conference, ECML/PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part I, pp 440–456. https://doi.org/10.1007/978-3-540-87479-9_47
Harper FM, Konstan JA (2016) The movielens datasets: history and context. TiiS 5(4):19:1–19:19. https://doi.org/10.1145/2827872
Article Google Scholar
Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. PVLDB 2(1):730–741. https://doi.org/10.14778/1687627.1687710
Article Google Scholar
Hayes AF, Krippendorff K (2007) Answering the call for a standard reliability measure for coding data. Commun Methods Meas 1(1):77–89
Article Google Scholar
Herrera F, Carmona CJ, González P, del Jesús MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525. https://doi.org/10.1007/s10115-010-0356-2
Article Google Scholar
Hix S, Noury A, Roland G (2005) Power to the parties: cohesion and competition in the european parliament, 1979–2001. Br J Polit Sci 35(2):209–234
Article Google Scholar
Jakulin A (2004) Analyzing the us senate in 2003: similarities, networks, clusters and blocs. http://kt.ijs.si/aleks/Politics/us_senate.pdf. Accessed 18 Oct 2019
Johnson D, Sinanovic S (2001) Symmetrizing the Kullback-Leibler distance. IEEE Trans Inf Theory 47:1–8
Google Scholar
Kaytoue M, Kuznetsov SO, Napoli A, Duplessis S (2011) Mining gene expression data with pattern structures in formal concept analysis. Inf Sci 181(10):1989–2001. https://doi.org/10.1016/j.ins.2010.07.007
Article MathSciNet Google Scholar
Kaytoue M, Plantevit M, Zimmermann A, Bendimerad AA, Robardet C (2017) Exceptional contextual subgraph mining. Mach Learn 106(8):1171–1211. https://doi.org/10.1007/s10994-016-5598-0
Article MathSciNet MATH Google Scholar
Klösgen W (1996) Explora: a multipattern and multistrategy discovery assistant. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. MIT Press, Cambridge, pp 249–271
Google Scholar
Kuznetsov SO, Obiedkov SA (2002) Comparing performance of algorithms for generating concept lattices. J Exp Theor Artif Intell 14(2–3):189–216. https://doi.org/10.1080/09528130210164170
Article MATH Google Scholar
Lavrac N, Kavsek B, Flach PA, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5:153–188
MathSciNet Google Scholar
Leman D, Feelders A, Knobbe AJ (2008) Exceptional model mining. In: Machine learning and knowledge discovery in databases, European conference, ECML/PKDD 2008, Antwerp, Belgium, September 15–19, 2008, Proceedings, Part II, pp 1–16. https://doi.org/10.1007/978-3-540-87481-2_1
Lemmerich F, Atzmueller M, Puppe F (2016) Fast exhaustive subgroup discovery with numerical target concepts. Data Min Knowl Discov 30(3):711–762. https://doi.org/10.1007/s10618-015-0436-8
Article MathSciNet MATH Google Scholar
Lemmerich F, Becker M (2018) pysubgroup: Easy-to-use subgroup discovery in python. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part III, pp 658–662. https://doi.org/10.1007/978-3-030-10997-4_46
Li G, Zaki MJ (2016) Sampling frequent and minimal boolean patterns: theory and application in classification. Data Min Knowl Discov 30(1):181–225. https://doi.org/10.1007/s10618-015-0409-y
Article MathSciNet MATH Google Scholar
Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98), New York City, NY, USA, August 27–31, 1998, pp 80–86. http://www.aaai.org/Library/KDD/1998/kdd98-012.php
Moens S, Boley M (2014) Instant exceptional model mining using weighted controlled pattern sampling. In: Advances in intelligent data analysis XIII—13th international symposium, IDA 2014, Leuven, Belgium, October 30–November 1, 2014. Proceedings, pp 203–214. https://doi.org/10.1007/978-3-319-12571-8_18
Moens S, Goethals B (2013) Randomly sampling maximal itemsets. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, IDEA@KDD 2013, Chicago, IL, USA, August 11, 2013, pp 79–86. https://doi.org/10.1145/2501511.2501523
Novak PK, Lavrac N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
MATH Google Scholar
Omidvar-Tehrani B, Amer-Yahia S, Dutot P, Trystram D (2016) Multi-objective group discovery on the social web. In: Machine learning and knowledge discovery in databases—European conference, ECML PKDD 2016, Riva del Garda, Italy, September 19–23, 2016, Proceedings, Part I, pp 296–312. https://doi.org/10.1007/978-3-319-46128-1_19
Orueta JF, Nuño-Solinis R, Mateos M, Vergara I, Grandes G, Esnaola S (2012) Monitoring the prevalence of chronic conditions: which data should we use? BMC Health Serv Res 12(1):365
Article Google Scholar
Pajala A, Jakulin A, Buntine W (2004) Parliamentary group and individual voting behavior in finnish parliament in year 2003: a group cohesion and voting similarity analysis. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.2295&rep=rep1&type=pdf. Accessed 18 Oct 2019
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Database theory—ICDT ’99, 7th international conference, Jerusalem, Israel, January 10–12, 1999, Proceedings, pp 398–416. https://doi.org/10.1007/3-540-49257-7_25
Roddy E, Doherty M (2010) Epidemiology of gout. Arthritis Res Ther 12(6):223
Article Google Scholar
Roman S (2008) Lattices and ordered sets. Springer, Berlin
MATH Google Scholar
Terada A, Okada-Hatakeyama M, Tsuda K, Sese J (2013) Statistical significance of combinatorial regulations. Proc Natl Acad Sci 110(32):12996–13001
Article MathSciNet Google Scholar
Tukey JW (1977) Exploratory data analysis. Addison-Wesley series in behavioral science: quantitative methods. Addison-Wesley. http://www.worldcat.org/oclc/03058187. Accessed 18 Oct 2019
van Leeuwen M, Knobbe AJ (2012) Diverse subgroup set discovery. Data Min Knowl Discov 25(2):208–242. https://doi.org/10.1007/s10618-012-0273-y
Article MathSciNet Google Scholar
Wang C, Crapo LM (1997) The epidemiology of thyroid disease and implications for screening. Endocrinol Metab Clin 26(1):189–218
Article Google Scholar
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Principles of data mining and knowledge discovery, first European symposium, PKDD ’97, Trondheim, Norway, June 24–27, 1997, Proceedings, pp 78–87. https://doi.org/10.1007/3-540-63223-9_108

Download references

Acknowledgements

This work has been partially supported by the project ContentCheck ANR-15-CE23-0025 funded by the French National Research Agency. The authors would like to thank the reviewers for their valuable remarks. Their thoughtful and deep comments allowed us to considerably improve this paper. They also warmly thank Wouter Duivesteijn, Albrecht Zimmermann and Aimene Belfodil for interesting discussions.

Author information

Authors and Affiliations

LIRIS UMR5205, CNRS, INSA Lyon, 69621, Villeurbanne, France
Adnene Belfodil, Sylvie Cazalens & Philippe Lamarre
LIRIS UMR5205, CNRS, Université Lyon 1, 69622, Villeurbanne, France
Marc Plantevit

Authors

Adnene Belfodil
View author publications
You can also search for this author in PubMed Google Scholar
Sylvie Cazalens
View author publications
You can also search for this author in PubMed Google Scholar
Philippe Lamarre
View author publications
You can also search for this author in PubMed Google Scholar
Marc Plantevit
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Marc Plantevit.

Additional information

Responsible editor: Pauli Miettinen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs of theorems and propositions

Before giving the proof of the Proposition 1 we present the following lemma.

Lemma 1

Let $n \in {\mathbb {N}}^*$, $A = \{a_i\}_{1 \le i \le n}$ and $B = \{b_i\}_{1 \le i \le n}$ such that:

$$\begin{aligned} \forall i \in 1\ldots n-1,&\quad 0 \le a_{i} \le a_{i+1}, \\ \forall i \in 1\ldots n-1,&\quad 0 < b_{i+1} \le b_{i}. \end{aligned}$$

We have:

$$\begin{aligned} \forall k \in 1\ldots n,&\quad \frac{\sum _{i=1}^k a_i}{\sum _{i=1}^k b_i} \le \frac{\sum _{i=1}^n a_i}{\sum _{i=1}^n b_i} \le \frac{\sum _{i=n-k+1}^n a_i}{\sum _{i=n-k+1}^n b_i}. \end{aligned}$$

Proof (Lemma 1)

Using the same notations of the lemma, we know that:

$$\begin{aligned} \frac{\sum _{i=1}^n a_i}{\sum _{i=1}^n b_i} - \frac{\sum _{i=1}^k a_i}{\sum _{i=1}^k b_i} \end{aligned}$$

is of the same sign of:

$$\begin{aligned} \left( \sum _{i=1}^n a_i\right) \times \left( \sum _{i=1}^k b_i\right) - \left( \sum _{i=1}^k a_i\right) \times \left( \sum _{i=1}^n b_i\right) . \end{aligned}$$

This above quantity is equal to:

$$\begin{aligned} \left( \sum _{i=1}^k a_i + \sum _{i=k+1}^n a_i\right) \times \left( \sum _{i=1}^k b_i\right) - \left( \sum _{i=1}^k a_i\right) \times \left( \sum _{i=1}^k b_i + \sum _{i=k+1}^n b_i\right) . \end{aligned}$$

Which is equal to

$$\begin{aligned} \left( \sum _{i=k+1}^n a_i\right) \times \left( \sum _{i=1}^k b_i\right) - \left( \sum _{i=1}^k a_i\right) \times \left( \sum _{i=k+1}^n b_i\right) . \end{aligned}$$

Using the lemma hypotheses (orders between $a_i$’s and $b_i$’s), we have:

$$\begin{aligned} \sum _{i=k+1}^n a_i\ge & {} (n-k) \times a_k, \\ \sum _{i=1}^k b_i\ge & {} k \times b_k, \\ \sum _{i=1}^k a_i\le & {} k \times a_k, \\ \sum _{i=k+1}^n b_i\le & {} (n-k) \times b_k. \end{aligned}$$

Thus:

$$\begin{aligned} \left( \sum _{i=k+1}^n a_i\right) \times \left( \sum _{i=1}^k b_i\right)\ge & {} (n-k) \times k \times a_k \times b_k,\\ \left( \sum _{i=1}^k a_i\right) \times \left( \sum _{i=k+1}^n b_i\right)\le & {} (n-k) \times k \times a_k \times b_k. \end{aligned}$$

We conclude that

$$\begin{aligned} \left( \sum _{i=k+1}^n a_i\right) \times \left( \sum _{i=1}^k b_i\right) - \left( \sum _{i=1}^k a_i\right) \times \left( \sum _{i=k+1}^n b_i\right) \ge 0. \end{aligned}$$

Hence, we have:

$$\begin{aligned} \forall k \in 1\ldots n,&\quad \frac{\sum _{i=1}^k a_i}{\sum _{i=1}^k b_i} \le \frac{\sum _{i=1}^n a_i}{\sum _{i=1}^n b_i}. \end{aligned}$$

Similarly the inequality $\frac{\sum _{i=1}^n a_i}{\sum _{i=1}^n b_i} \le \frac{\sum _{i=n-k+1}^n a_i}{\sum _{i=n-k+1}^n b_i}$ can be easily proved following the same line of reasoning of the proof of the first part of the inequality. $\square $

Proof (Proposition 1)

By a straightforward application of Lemma 1 we obtain for any d s.t. $|G_{E}^{d}|\ge \sigma _E$ the following inequality.

$$\begin{aligned} LB(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned}$$

(21)

This stems from the fact that $LB(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$ takes the sum of the lowest $\sigma _E$ quantities constituting the numerator of $\mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$ and divides them by the sum of the greatest $\sigma _E$ quantities forming the denominator of $\mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$.

Moreover, we have that LB is monotonic w.r.t. $\sqsubseteq $ of ${\mathcal {D}}_E$. i.e.

$$\begin{aligned} c \sqsubseteq d \Rightarrow LB(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) \le LB(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned}$$

(22)

This results from $c \sqsubseteq d \Rightarrow G_{E}^{d} \subseteq G_{E}^{c}$. Hence, if we reorder values of $G_{E}^{c}$ and $G_{E}^{d}$ where $G_{E}^{c}=\{e^c_1,\ldots ,e^c_{|G_{E}^{c}|}\}$ and $G_{E}^{d}=\{e^d_1,\ldots ,e^d_{|G_{E}^{d}|}\}$ as such:

$$\begin{aligned} {\left\{ \begin{array}{ll} w_{e^c_1}\cdot \alpha (e^c_1) \le w_{e^c_2}\cdot \alpha (e^c_2) \le \cdots \le w_{e^c_{\sigma _E}}\cdot \alpha (e^c_{\sigma _E}) \le \ldots \le w_{e^c_{|E_c|}}\cdot \alpha (e^c_{|G_{E}^{c}|}) \\ w_{e^d_1}\cdot \alpha (e^d_1) \le w_{e^d_2}\cdot \alpha (e^d_2) \le \cdots \le w_{e^d_{\sigma _E}}\cdot \alpha (e^d_{\sigma _E}) \le \cdots \le w_{e^d_{|G_{E}^{d}|}}\cdot \alpha (e^d_{|G_{E}^{d}|}) \end{array}\right. }. \end{aligned}$$

Given that $G_{E}^{d} \subseteq G_{E}^{c}$, it is clear that: $\forall i\le \sigma _E,\; w_{e^c_i}\cdot \alpha (e^c_i) \le w_{e^d_i}\cdot \alpha (e^d_i)$. Having that $m(G_{E}^{c},\sigma _E) = \{e^c_1,\ldots ,e^c_{\sigma _E}\}$ and $m(G_{E}^{d},\sigma _E) = \{e^d_1,\ldots ,e^d_{\sigma _E}\}$, it follows that:

$$\begin{aligned} \sum _{e \in m(G_{E}^{c},\sigma _E)} w_e \times \alpha (e) \le \sum _{e \in m(G_{E}^{d},\sigma _E)} w_e \times \alpha (e). \end{aligned}$$

(23)

Similarly, if we reorder entities e in descending order w.r.t the weights $w_e$ we have $\forall j\le \sigma _E\;|\; w_{e^d_j} \le w_{e^c_j}$. Resulting in:

$$\begin{aligned} \sum _{e \in Mw(G_{E}^{c},\sigma _E)} w_e \ge \sum _{e \in Mw(G_{E}^{d},\sigma _E)} w_e. \end{aligned}$$

(24)

Hence, from (23) and (24) we have $\mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {LB}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$ and provided that $LB(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$ from (21), we have: $\forall c,d \in {\mathcal {D}}_E,\;c \sqsubseteq d \Rightarrow \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})$. $\square $

Proof (Proposition 2)

This proof is similar to the proof of Proposition 1. For the sake of brevity, we give a proof sketch. By a direct application of Lemma 1, it is clear that for any d s.t. $|G_{E}^{d}|\ge \sigma _E$,

$$\begin{aligned} \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {UB}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned}$$

(25)

We have that UB is anti-monotonic w.r.t. $\sqsubseteq $ of ${\mathcal {D}}_E$. i.e.

$$\begin{aligned} c \sqsubseteq d \Rightarrow \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) \ge \mathrm {UB}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned}$$

(26)

This results from $c \sqsubseteq d \Rightarrow G_{E}^{d} \subseteq G_{E}^{c}$. Thus,

$$\begin{aligned} \sum _{e \in M(G_{E}^{c},\sigma _E)} w_e \times \alpha (e) \ge \sum _{e \in M(G_{E}^{d},\sigma _E)} w_e \times \alpha (e)\; \mathrm {and } \sum _{e \in mw(G_{E}^{c},\sigma _E)} w_e \le \sum _{e \in mw(G_{E}^{d},\sigma _E)} w_e. \end{aligned}$$

Hence, given (25) and (26) it follows that:

$$\begin{aligned} \forall c,d \in {\mathcal {D}}_E.\;\;c \sqsubseteq d \Rightarrow \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned}$$

$\square $

Proof (Proposition 3)

given $c,d \in {\mathcal {D}}_E$ such that $c \sqsubseteq d$, using Proposition 1 we have:

$$\begin{aligned} \begin{aligned} \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})&\le \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}),\\ \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2})&\le \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned} \end{aligned}$$

Since $\varphi _{\mathrm {consent}}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) = \mathrm {max}(\mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}),0)$ thus

$\varphi _{\mathrm {consent}}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le \mathrm{oe}_{\mathrm{consent}}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})$ .

Similarly we have:

$$\begin{aligned} \begin{aligned} \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})&\ge \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}),\\ \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2})&\le \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}). \end{aligned} \end{aligned}$$

Since $\varphi _{\mathrm {dissent}}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) = \mathrm {max}(\mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}),0)$ we get $\varphi _{\mathrm {dissent}}(G_{E}^{d},G_{I}^{u_1},G_{I}^{u_2}) \le oe_{dissent}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}).$$\square $

Proof (Proposition 4)

Given that $\forall (e, G_{I}^{u_1}, G_{I}^{u_2}) \in E \times 2^I \times 2^I\;:\;w(e, G_{I}^{u_1}, G_{I}^{u_2})=1$, we have for any $c \in {\mathcal {D}}_E$ s.t. $ |G_{E}^{c}|\ge \sigma _E$,

$$\begin{aligned} \mathrm {IAS}{}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})= \frac{\sum \nolimits _{e \in G_{E}^{c}} \alpha (e)}{|G_{E}^{c}|} \;\;\mathrm {and }\;\; \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})=\frac{\sum \nolimits _{e \in M(G_{E}^{c},\sigma _E)} \alpha (e)}{\sigma _E}. \end{aligned}$$

It follows from the fact that $M(G_{E}^{c},\sigma _E) \subseteq G_{E}^{c}$:

$$\begin{aligned} \begin{aligned} \exists S \subseteq G_{E}^{c},\; \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})&= \mathrm {IAS}{}(S,G_{I}^{u_1},G_{I}^{u_2}), \\ \mathrm {UB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2})&= \mathrm {IAS}{}(S,G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}),\\ \mathrm{oe}_{\mathrm{consent}}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})&= \varphi _{\mathrm {consent}}(S,G_{I}^{u_1},G_{I}^{u_2}). \end{aligned} \end{aligned}$$

The subset S being for example the set $M(G_{E}^{c},\sigma _E)$ itself. The same reasoning applies when considering $\mathrm{oe}_{\mathrm{dissent}}$. In this case we consider the lower bound LB. We have:

$$\begin{aligned} \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})=\frac{\sum _{e \in m(G_{E}^{c},\sigma _E)} \alpha (e)}{\sigma _E}. \end{aligned}$$

Given that $m(G_{E}^{c},\sigma _E) \subseteq E$, we have:

$$\begin{aligned} \begin{aligned} \exists S \subseteq G_{E}^{c},\; \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})&= \mathrm {IAS}{}(S,G_{I}^{u_1},G_{I}^{u_2}), \\ \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}) - \mathrm {LB}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})&= \mathrm {IAS}{}(G_{E}^{},G_{I}^{u_1},G_{I}^{u_2}) -\mathrm {IAS}{}(S,G_{I}^{u_1},G_{I}^{u_2}),\\ \mathrm{oe}_{\mathrm{dissent}}(G_{E}^{c},G_{I}^{u_1},G_{I}^{u_2})&= \varphi _{\mathrm {dissent}}(S,G_{I}^{u_1},G_{I}^{u_2}). \end{aligned} \end{aligned}$$

This proves that, if $\mathrm {IAS}{}$ is a simple mean, for any $c \in {\mathcal {D}}_E$ s.t. $ |G_{E}^{c}|\ge \sigma _E$:

Hence $\mathrm{oe}_{\mathrm{consent}}$ and $oe_{dissent}$ are tight optimistic estimates for respectively $\varphi _{\mathrm {consent}}$ and $\varphi _{\mathrm {dissent}}$ if the underlying $\mathrm {IAS}$ is a simple average. $\square $

Before giving the proof of the Proposition 5 we present the following lemma.

Lemma 2

The sums of the number of all descriptions covering each record in G is equal to the sum of the supports of all descriptions in ${\mathcal {D}}_{}$. That is:

$$\begin{aligned} \sum _{g \in G}|\downarrow {\delta (g)}| = \sum _{d \in {\mathcal {D}}} |G_{}^{d}|. \end{aligned}$$

Proof (Lemma 2)

For $g \in G$, we have $\downarrow {\delta (g)}=\{d \in {\mathcal {D}}\;$s. t.$\;d \sqsubseteq \delta (g)\}$ and for $d \in {\mathcal {D}}$, we have $G_{}^{d} = \{g \in G\;$s. t.$\;d \sqsubseteq \delta (g)\}$. Let us define the indicator function on ${\mathcal {D}} \times G$:

$$\begin{aligned} \mathbb {1}_{\sqsubseteq }(d,g) = {\left\{ \begin{array}{ll} 1 &{}\mathrm {if}\;d \sqsubseteq \delta (g)\\ 0 &{} \mathrm {else} \end{array}\right. }. \end{aligned}$$

Hence, we have $|\downarrow {\delta (g)}|=\sum _{d \in {\mathcal {D}}} \mathbb {1}_{\sqsubseteq }(d,g) $ and $|G_{}^{d}|=\sum _{g \in G} \mathbb {1}_{\sqsubseteq }(d,g)$ thus:

$$\begin{aligned} \begin{aligned} \sum _{g \in G} |\downarrow {\delta (g)}|&= \sum _{g \in G} \sum _{d \in {\mathcal {D}}} \mathbb {1}_{\sqsubseteq }(d,g) = \sum _{d \in {\mathcal {D}}} \sum _{g \in G} \mathbb {1}_{\sqsubseteq }(d,g) = \sum _{d \in {\mathcal {D}}} |G_{}^{d}|. \end{aligned} \end{aligned}$$

$\square $

Proof (Proposition 5)

We denote by $\mathbf gs $ the random record drawn in line 1 and by $\mathbf ds $ the random description drawn in line 2 of FBS.

$$\begin{aligned} \begin{aligned} {\mathbb {P}}(\mathbf ds = d)&= \sum _{g \in G} {\mathbb {P}} ((\mathbf gs = g ) (\mathbf ds = d | g)) \\&= \sum _{g \in G_{}^{d}} \frac{1}{|\downarrow {\delta (g)}|} \times \underbrace{\frac{|\downarrow {\delta (g)}|}{\sum _{i \in G}|\downarrow {\delta (i)}|}}_{\text {weight } w_g \text { normalized}} = \frac{|G_{}^{d}|}{\sum _{g \in G}|\downarrow {\delta (g)}|} \end{aligned}. \end{aligned}$$

It follows that from Lemma 2 that ${\mathbb {P}}(\mathbf ds = d) = \dfrac{|G_{}^{d}|}{\sum _{d' \in {\mathcal {D}}} |G_{}^{d'}|}. \qquad \qquad \qquad \;\;$$\square $

Proof of (Proposition 6)

Given Proposition 5, it is clear that $\forall p \in {\mathcal {P}} : p=(c,u_1,u_2) $satisfies${\mathcal {C}} \Rightarrow {\mathbb {P}}(p)=\frac{|\mathrm {ext}(p)|}{Z}>0$. with $|\mathrm {ext}(p)| = |G_{E}^{c}|\times |G_{I}^{u_1}|\times |G_{I}^{u_2}|$ and $ Z=\sum _{p' \in {\mathcal {P}}} |\mathrm {ext}(p')|$ a normalizing factor. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Belfodil, A., Cazalens, S., Lamarre, P. et al. Identifying exceptional (dis)agreement between groups. Data Min Knowl Disc 34, 394–442 (2020). https://doi.org/10.1007/s10618-019-00665-9

Download citation

Received: 03 May 2018
Accepted: 15 November 2019
Published: 26 November 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10618-019-00665-9

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups

Discovering Local Subgroups, with an Application to Fraud Detection

Exceptional Model Mining

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proofs of theorems and propositions

Lemma 1

Proof (Lemma 1)

Proof (Proposition 1)

Proof (Proposition 2)

Proof (Proposition 3)

Proof (Proposition 4)

Lemma 2

Proof (Lemma 2)

Proof (Proposition 5)

Proof of (Proposition 6)

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Identifying exceptional (dis)agreement between groups

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups

Discovering Local Subgroups, with an Application to Fraud Detection

Exceptional Model Mining

Explore related subjects

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proofs of theorems and propositions

Appendix: Proofs of theorems and propositions

Lemma 1

Proof (Lemma 1)

Proof (Proposition 1)

Proof (Proposition 2)

Proof (Proposition 3)

Proof (Proposition 4)

Lemma 2

Proof (Lemma 2)

Proof (Proposition 5)

Proof of (Proposition 6)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation