Nothing Special   »   [go: up one dir, main page]

Skip to main content

Advertisement

Log in

Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

The development of novel platforms and techniques for emerging “Big Data” applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high quality experimental findings is to synthesize datasets that reflect patterns of real ones using a two-step approach: first a real dataset X is analyzed to derive relevant patterns Z (latent variables) and, then, such patterns are used to reconstruct a new dataset \(X'\) that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (\(\texttt {IFM}\)), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, that are typically the frequent ones. This paper introduces various extensions of \(\texttt {IFM}\) within a uniform framework with the aim to generate artificial datasets that reflect more elaborated patterns (in particular infrequency and duplicate constraints) of real ones. Furthermore, in order to further enlarge the application domain of \(\texttt {IFM}\), an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  • Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms. In: Aggarwal CC, Yu PS (eds) Privacy-preserving data mining—models and algorithms, volume 34 of advances in database systems. Springer, Berlin, pp 11–52

    Chapter  Google Scholar 

  • Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, SIGMOD ’93, New York, NY, USA. ACM, pp 207–216

  • Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00, New York, NY, USA. ACM, pp 439–450

  • Beheshti AK, Hejazi SR (2015) A novel hybrid column generation-metaheuristic approach for the vehicle routing problem with general soft time window. Inf Sci 316:598–615

    Article  Google Scholar 

  • Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127

    Article  MATH  Google Scholar 

  • Bertsimas D, Tsitsiklis JN (1997) Introduction to linear optimization. Athena Scientific, Belmont

    Google Scholar 

  • Bykowski A, Rigotti C (2001) A condensed representation to find frequent patterns. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’01, New York, NY, USA. ACM, pp 267–273

  • Cagliero L, Garza P (2013) Itemset generalization with cardinality-based constraints. Inf Sci 244:161–174

    Article  MathSciNet  MATH  Google Scholar 

  • Calders T (2004) Computational complexity of itemset frequency satisfiability. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’04, New York, NY, USA. ACM, pp 143–154

  • Calders T (2007) The complexity of satisfying constraints on databases of transactions. Acta Inf 44(7–8):591–624

    Article  MathSciNet  MATH  Google Scholar 

  • Chen CP, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347

    Article  Google Scholar 

  • Evfimievski A, Gehrke J, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining. In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’03, New York, NY, USA. ACM, pp 211–222

  • Gilmore PC, Gomory RE (1961) A linear programming approach to the cutting-stock problem. Oper Res 9(6):849–859

    Article  MathSciNet  MATH  Google Scholar 

  • Gunopulos D, Khardon R, Mannila H, Toivonen H (1997) Data mining, hypergraph transversals, and machine learning. In: Mendelzon AO, Özsoyoglu ZM (eds) Proceedings of the 16-th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’97, ACM Press, pp 209–216

  • Guns T, Nijssen S, Raedt LD (2011) Itemset mining: a constraint programming perspective. Artif Intell 175(12):1951–1983

    Article  MathSciNet  MATH  Google Scholar 

  • Guzzo A, Moccia L, Saccà D, Serra E (2013) Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs. ACM Trans Knowl Discov Data 7(4):18:1–18:39

    Article  Google Scholar 

  • Guzzo A, Saccà D, Serra E (2009) An effective approach to inverse frequent set mining. In: Proceedings of the 2009 ninth IEEE international conference on data mining, ICDM ’09, Washington, DC, USA. IEEE Computer Society, pp 806–811

  • Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86

    Article  MathSciNet  Google Scholar 

  • Han J, Kamber M (2005) Data mining: concepts and techniques. Kaufmann, San Francisco

    MATH  Google Scholar 

  • Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

    Article  MathSciNet  MATH  Google Scholar 

  • Hu T, Sung SY, Xiong H, Fu Q (2008) Discovery of maximum length frequent itemsets. Inf Sci 178(1):69–87

    Article  MathSciNet  Google Scholar 

  • Jindal R, Malaya DB (2016) A novel approach for mining frequent patterns from incremental data. IJDMMM 8(3):244–264

    Article  Google Scholar 

  • KDDCUP2000 (2000). https://www.kdd.org/kdd-cup/view/kdd-cup-2000. Accessed 4 May 2018

  • Liu L, Kantarcioglu M, Thuraisingham B (2008) The applicability of the perturbation based privacy preserving data mining for real-world data. Data Knowl Eng 65(1):5–21

    Article  Google Scholar 

  • Luenberger DG (2003) Linear and nonlinear programming, 2nd edn. Springer, Berlin

    MATH  Google Scholar 

  • Mendes R, Vilela JP (2017) Privacy-preserving data mining: methods, metrics, and applications. IEEE Access 5:10562–10582

    Article  Google Scholar 

  • Michael K, Miller KW (2013) Big data: new opportunities and new challenges [guest editors’ introduction]. Computer 46(6):22–24

    Article  Google Scholar 

  • Mielikainen T (2003) On inverse frequent set mining. In: Proceedings of 2nd workshop on privacy preserving data mining, PPDM ’03, Washington, DC, USA. IEEE Computer Society, pp 18–23

  • ms-IFM code (2018). Datasets and codes used by paper’s experiments for ms-IFM ans stored in GitHub repository. https://github.com/ninorullo/NoSQL-IFM. Accessed 18 Dec 2018

  • ms-IFM dataset (2017). Yelp challenge. https://www.yelp.com/dataset. Accessed 18 Dec 2018

  • Narayanan A, Shmatikov V(2009) De-anonymizing social networks. In: Proceedings—-IEEE symposium on security and privacy 2009 30th IEEE symposium on security and privacy, pp 173–187

  • Oliveira S RM, Zaïane OR (2003) Protecting sensitive knowledge by data sanitization. In: Proceedings of the third IEEE international conference on data mining, ICDM ’03, Washington, DC, USA. IEEE Computer Society, pp 613–616

  • Papadimitriou CH (1994) Computational complexity. Addison-Wesley, Boston

    MATH  Google Scholar 

  • Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory, ICDT ’99, London, UK. Springer-Verlag, pp 398–416

  • Patki N, Wedge R, Veeramachaneni K (2016) The synthetic data vault. In: 2016 IEEE international conference on data science and advanced analytics, DSAA 2016, Montreal, QC, Canada, October 17–19, 2016, IEEE, pp 399–410

  • Ramesh G, Maniatty W, Zaki MJ (2003) Feasible itemset distributions in data mining: theory and application. In Neven F, Beeri C, Milo T (eds) PODS, ACM, pp 284–295

  • Ramesh G, Zaki MJ, Maniatty W (2005) Distribution-based synthetic database generation techniques for itemset mining. In: IDEAS, IEEE Computer Society, pp 307–316

  • Saccà D, Serra E (2013) Number of minimal hypergraph transversals and complexity of IFM with infrequency: high in theory, but often not so much in practice!. Online Preliminary Paper from http://sacca.deis.unical.it/#view=object&format=object&id=1490/gid=160. Accessed 4 May 2018

  • Shah A, Gulati R (2016) Article: Privacy preserving data mining: techniques, classification and implications—a survey. International Journal of Computer Applications, 137(12):40–46. Published by Foundation of Computer Science (FCS), NY, USA

  • Stavropoulos EC, Verykios VS, Kagklis V (2016) A transversal hypergraph approach for the frequent itemset hiding problem. Knowl Inf Syst 47(3):625–645

    Article  Google Scholar 

  • Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl-Based Syst 10(5):557–570

    Article  MathSciNet  MATH  Google Scholar 

  • Weikum G (2013) Where’s the data in the big data wave? ACM Sigmod Blog http://wp.sigmod.org/?p=786. Accessed 4 May 2018

  • Wu H, Ning Y, Chakraborty P, Vreeken J, Tatti N, Ramakrishnan N (2018) Generating realistic synthetic population datasets. ACM Trans Knowl Discov Data 12(4):45:1–45:22

    Article  Google Scholar 

  • Wu X, Wu Y, Wang Y, Li Y (2005) Privacy aware market basket data set generation: A feasible approach for inverse frequent set mining. In: Proceedings of SIAM international conference on data mining, SDM’ 05, Philadelphia, PA, USA. SIAM, pp 103–114

  • Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’01, New York, NY, USA. ACM, pp 401–406

  • Zhong S (2007) Privacy-preserving algorithms for distributed mining of frequent itemsets. Inf Sci 177(2):490–503

    Article  MATH  Google Scholar 

  • Zhou B, Pei J, Luk W (2008) A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor Newsl 10(2):12–22

    Article  Google Scholar 

Download references

Funding

The funding was supported by MISE, Italian Ministry for Industry (Grant No. PON ID Service and Protect ID).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Domenico Saccá.

Additional information

Responsible editor: Toon Calders.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs

Appendix: Proofs

Theorem 1

(1) Decision \({\texttt {IFM}} _{\texttt {G} }\) and \({\texttt {IFM}} _{\texttt {I} }\) are \({\texttt {NEXP}} \)-complete; (2) decision \({\texttt {IFM}} _{\texttt {D} }\) is in \({\texttt {PSPACE}} \) and \({\texttt {PP}} \)-hard; (3) decision \({\texttt {IFM}} \) is in \({\texttt {PSPACE}} \) and \({\texttt {NP}} \)-hard; (4) decision \({\texttt {IFM}} _{\texttt {S} }\) is \({\texttt {NP}} \)-complete.

Proof

The complexity results of \(\texttt {IFM}_\texttt {I}\) and \(\texttt {IFM}\) have been proved respectively in Guzzo et al. (2013); Saccà and Serra (2013) and in Mielikainen (2003).

Let us now prove the \({\texttt {PSPACE}}\)-membership of \(\texttt {IFM}_\texttt {D}\). A polynomial space nondeterministic algorithm can set at the beginning \(size=0\) and \(\forall I \in S\), \(\sigma _I=0\). Then, for each possible transaction \(I \in {\mathcal {U}}_{\mathcal {I}}\), it guesses a duplicate value \(\delta _I\) such that \(\delta _I \le {\texttt {size}}\) if there exists \(J\in S\) such that \(J \subseteq I\) or \(\delta _J \le \delta '\) otherwise (i.e., \(J \in S'\)). After each guessing, the algorithm updates \({\texttt {size}}\) and \(\sigma _J\) for each \(J \in S: \, J \subseteq I\). Once all guesses are completed, it remains to verify whether \(size = {\texttt {size}}\) and for each \(I \in S: \, I \subseteq J\), \(\sigma _{\min }^I \le \sigma _I \le \sigma _{\max }^I\). It follows that the problem is in \({\texttt {PSPACE}}\).

The \({\texttt {PP}}\)-hardness of \(\texttt {IFM}_\texttt {D}\) can be easily proved by reduction from \({\texttt {FREQSAT}}\)\(\small {\{\texttt {NDUP}, \texttt {NTRANS}\}}\) that has been shown to be \({\texttt {PP}}\)-hard in Calders (2007) (Theorem 13). The \({\texttt {NEXP}}\)-hardness of \(\texttt {IFM}_\texttt {G}\) immediately derives from the \({\texttt {NEXP}}\)-hardness of \(\texttt {IFM}_\texttt {I}\), which is a sub-problem of \(\texttt {IFM}_\texttt {G}\).

Membership of \(\texttt {IFM}_\texttt {G}\) to \({\texttt {NEXP}}\) immediately follows from the fact that, given any instance x of \(\texttt {IFM}_\texttt {G}\), x is a yes-instance if and only if both \(x_I\) (obtained from x by removing the constraint 3) is a yes-instance of \(\texttt {IFM}_\texttt {I}\) and \(x_D\) (obtained from x by removing the constraint 2) is a yes-instance of \(\texttt {IFM}_\texttt {D}\).

We conclude by proving the \({\texttt {NP}}\)-completeness of \(\texttt {IFM}_S\) – this result was first presented without a proof in Guzzo et al. (2009). As for the membership to \({\texttt {NP}}\), we observe that the size of the output database \({\mathcal {D}}\) is certainly bounded by |S| and by the largest support required in the specification. Hence, in polynomial time a non-deterministic Turing machine may first guess such database (basically, the support for each itemset I in S) and then verify whether it satisfies the constraints \(\sigma _{min}^i\) and \(\sigma _{max}^i\), by simply computing the support \(\sigma ^{{\mathcal {D}}}(I_i)\) on \({\mathcal {D}}\).

To prove that \(\texttt {IFM}_{S}\) is \({\texttt {NP}}\)-hard, we exhibit a reduction from the graph 3-colorability problem of deciding whether, given a graph \(G=(V,E)\), there is a 3-coloring \(c:V\rightarrow \{r,g,b\}\) such that \(c(i)\ne c(j)\) for each pair of edges \((i,j)\in E\).

Based on the input graph \(G=(V,E)\), we construct an instance of the \(\texttt {IFM}_{S}({\mathcal {I}},\varSigma _S)\) problem such that: the set \({\mathcal {I}}\) of items is \(\{r,g,b,l_1,l_2,l_3\}\cup \{v_{x} | x \in V \}\cup \{e_{z,y}| (z,y) \in E\}\), where conceptually the item rgb are the colors in G, \(l_1, l_2, l_3\) are labels implementing an encoding of the three colors, \(v_{x}\) is an item for each node in G and \(e_{z,y}\) is an item for each edge in G. The encoding of colors by label is such that any two colors share exactly one label; in the proof we shall use the encoding \(r=\{l_1, l_2\}\), \(g=\{l_1, l_3\}\) and \(b=\{l_2, l_3\}\).

The set \(\varSigma _S\) contains two groups of constraints:

Group (I) these constraints are repeated for each node \(x \in V\) and enforces that x must be colored with exactly one color. There are 7 itemsets associated to x organized on 3 levels:

  • 3 itemsets at the highest level 2: there is an itemset for each possible color c for x, containing the items corresponding to the node x, to all the arcs leaving x, to the color c and to the encoding of the color — the support for such itemsets can be either 0 or 1;

  • 3 itemsets at level 1: there is an itemset for each of the 3 encoding labels, containing the item corresponding to the label and the item corresponding to the node x – the support must be exactly 1;

  • 1 itemset at level 0 containg the item corresponding to the node x – its support must be exactly 2;

We explicit the constraints below:

  • \((I_{x,r},0,1)\), \((I_{x,g},0,1)\), \((b_{x,r},0,1)\), where \(I_{x,r}=\)\(\{v_x, r,l_1, l_2\} \cup I\),

    \(I_{x,g}=\)\(\{v_x, g,l_1, l_3\}\)\(\cup I\), \(I_{x,b}=\)\(\{v_x, b,l_2, l_3\} \cup I\), and \(I = \{e_{x,y}|(x,y)\in E\}\);

  • \((\{l_1,v_{x}\},1,1)\), \((\{l_2,v_{x}\},1,1)\) and \((\{l_3,v_{x}\},1,1)\);

  • \((\{v_{x}\},2,2)\).

Because of the support constraints for the itemsets at level 1, \(\{l_1,v_{x}\}\) cannot occur as transaction in D as it inherits support from those itemsets; in addition, as the support of \(\{l_1,v_{x}\}\) is 2 and there are 3 itemsets at level 1 with obligatory support 1, exactly one of the itemsets at level 1 must occur as transaction, whereas the other two inherit support from a same itemset at level 2. It turns out that exactly one itemset at level 2 can occur as transaction whereas all others must have support 0 - the itemset selected as transaction will then fix the unique color for the node x.

Group (II) these constraints are repeated for each edge \((x,y) \in E\) and enforce that two end nodes of the edge have different color. There are 3 itemsets, one for each possible color; the constraints are: \((\{r,e_{z,y},0,1)\}\), \((\{g,e_{z,y},0,1)\}\), \((\{b,e_{z,y},0,1)\}\). These itemsets inherit support from two itemsets at level 2: one for x and the other for y. The constraints of Group (II) enforces that the two itemsets at level 2 cannot be of the same color, thus any two adjacent nodes cannot share the same color. It follows that the existence of a solution to \(\texttt {IFM}_{S}\) witnesses the fact that the graph G admits a 3-coloring; on the other hand, if \(\texttt {IFM}_{S}\) has no solution then the graph cannot be 3-colored. Hence, \(\texttt {IFM}_{S}\) is \({\texttt {NP}}\)-hard in general. \(\square \)

Proposition 1

Decision k-\({\texttt {IFM}} _{\texttt {G} }\) is in \({\texttt {PSPACE}} \) and \({\texttt {PP}} \)-hard and decision k-\({\texttt {IFM}} _{\texttt {I} }\) is in \({\texttt {PSPACE}} \) and \({\texttt {NP}} \)-hard.

Proof

The complexity results for k-\(\texttt {IFM}_\texttt {I}\) have been proved in (Guzzo et al. 2013; Saccà and Serra 2013). The \({\texttt {PP}}\)-hardness of k-\(\texttt {IFM}_\texttt {G}\) immediately derives from the \({\texttt {PP}}\)-hardness of \(\texttt {IFM}_\texttt {D}\), which is a sub-problem of k-\(\texttt {IFM}_\texttt {G}\). Finally \({\texttt {PSPACE}}\) membership of k-\(\texttt {IFM}_\texttt {G}\) follows from the fact that, given any instance x of k-\(\texttt {IFM}_\texttt {G}\), x is a yes-instance if and only if both \(x_I\) is a yes-instance of k-\(\texttt {IFM}_\texttt {I}\) and \(x_D\) is a yes-instance of \(\texttt {IFM}_\texttt {D}\).

Proposition 2

(1) Relaxed k-\({\texttt {IFM}} _{\texttt {I} }\) and relaxed \({\texttt {IFM}} \) are \({\texttt {NP}} \)-complete; (2) relaxed k-\({\texttt {IFM}} _{\texttt {G} }\) and relaxed \({\texttt {IFM}} _{\texttt {D} }\) are in \({\texttt {PSPACE}} \) and \({\texttt {PP}} \)-hard; (3) relaxed \({\texttt {IFM}} _{{S} }\) is in \({\texttt {P}}\).

Proof

The \({\texttt {NP}}\)-completeness of relaxed k-\(\texttt {IFM}_\texttt {I}\) has been proved in Guzzo et al. (2013); Saccà and Serra (2013). The \({\texttt {NP}}\)-membership of \(\texttt {IFM}\) derives from the fact that \(\texttt {IFM}\) is a sub-problem of k-\(\texttt {IFM}_\texttt {I}\); the \({\texttt {NP}}\)-hardness can be easily proved by reduction from \({\texttt {FREQSAT}}\) that has been shown to be \({\texttt {NP}}\)-complete in Calders (2004, 2007). The \({\texttt {PSPACE}}\)-memberships of relaxed k-\(\texttt {IFM}_\texttt {G}\) and relaxed \(\texttt {IFM}_\texttt {D}\) are obvious. The \({\texttt {PP}}\)-hardness can be easily proved by reduction from FREQSAT\( \small {\{\texttt {NDUP}\}}\) that has been shown to be \({\texttt {PP}}\)-hard in Calders (2007). It follows that k-\(\texttt {IFM}_\texttt {G}\) is \({\texttt {PP}}\)-hard as well. Finally, to prove that relaxed \(\texttt {IFM}_S\) is in \({\texttt {P}}\), we encode the problem as a system of linear equations over |S| rational variables \(\delta ^I\), \(\forall I \in S\), and \(2\, |S|\) inequalities implementing the minimal and maximum support constraint for each itemset in S. The result then follows, since it is well-known that deciding whether a system of linear equations admits a solution is feasible in polynomial time.

Proposition 3

Each column \(j \in U\) is an infeasible solution for the ILP price formulation.

Proof

By constraints (14), a generic column \(j\in U\) represented by the itemset \(I_j\) can be uniquely identified by the itemsets in \(S\cup B_{S'}\) that are included in \(I_j\). Then, in order to prove that the itemset \(I^*\) returned by PLI formulation is different from each other itemset representing a column j in U, the following condition must hold: either the itemset in \(S\cup B_{S'}\) contained in \(I^*\) is a subset of some itemset in \(S\cup B_{S'}\) contained in \(I_j\) or there exists an itemset in \(S\cup B_{S'}\) that is contained in \(I^*\) but not in \(I_j\). The previous condition can be formulate by the following disjunction of integer linear inequalities:

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\subset I_j\end{array} \end{array}}\ y_i\le k_j-1 \end{aligned}$$
(32)

or

$$\begin{aligned} \sum _{{\begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array}}}\ y_i \ge 1 \end{aligned}$$
(33)

Now we prove that the above disjunction is equivalent to the inequality (15).

Suppose that the integer linear inequality (32) is not satisfied. Then

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\subset I_j\end{array} \end{array}} \ y_i =k_j \end{aligned}$$

and the integer linear inequality (15) becomes

$$\begin{aligned} -k_j\sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array} \end{array}} \ y_i \le -1 \end{aligned}$$

that is equivalent to integer linear inequality (33). Instead if the integer linear inequality (33) is unsatisfied then

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array} \end{array}}\ y_i=0 \end{aligned}$$

and the integer linear inequality (15) is equivalent to (32). It follows that each column \(j \in U\) is an infeasible solution for the PLI price formulation.

Proposition 4

The decision version of ms-\({\texttt {IFM}} \) is in \({\texttt {PSPACE}} \) and \({\texttt {NP}} \)-hard and the decision version of relaxed ms-\({\texttt {IFM}} \) is \({\texttt {NP}} \)-complete.

Proof

The NP-hardness of decision \({ms\texttt {-}{} \texttt {IFM}}\) derives from the fact that an instance of it without SV attributes and infrequency constraints and with exactly one MV attribute is an instance of the classical \(\texttt {IFM}\) decision problem, that is known to be NP-hard (see Mielikainen 2003). To prove membership to \({\texttt {PSPACE}}\), we use the following non-deterministic algorithm to solve any instance of the problem. We use \(m+{\widehat{m}}+1\) variables \(\ddot{s}\), \(\ddot{b}^f_1, \dots , \ddot{b}^f_m\), \(\ddot{b}^i_1, \dots , \ddot{b}^i_m\) and set all of them to 0. Then we perform the following steps. For each possible transaction t: (1) guess a value \(n_t\) between 0 and \({\texttt {size}}\), (2) if \(n_t>0\) then: (3) set \(\ddot{s} = \ddot{s}+n_t\), (4) for all \(I_i \in S\) s.t. \(I_i\) subsumes t then set \(\ddot{b}^f_i\) := \(\ddot{b}^f_i+n_t\) and (5) for all \(I_i \in S'\) s.t. \(I_i\) subsumes t then set \(\ddot{b}^i_i\) := \(\ddot{b}^i_i+n_t\). At the end of all iterations we check that: (1) \(\ddot{s} = {\texttt {size}}\), (2) for each \(I_i \in S\), \(\ddot{b}^f_i\) is included into the support constraint range for the frequent itemset \(I_i\) and (3) for each \(I_i \in S'\), \(\ddot{b}^i_i\) is less than or equal to the support constraint upper bound for the infrequent itemset \(I_i\). It is easy to see that the algorithm is correct and runs in a nondeterministic polynomial space. So the problem is in \(\mathrm {NPSPACE}\); hence, by Savitch’s theorem, it is in \({\texttt {PSPACE}}\) as well.

The NP-hardness of relaxed \({ms\texttt {-}{} \texttt {IFM}}\) derives from the fact that an instance of it without SV attributes and infrequency constraints and with exactly one MV attribute is an instance of decision \({\texttt {FREQSAT}}\), that is known to be NP-hard (see Calders 2004, 2007). To prove membership to \({\texttt {NP}}\), observe that the problem can be formulated as linear program LP with \(m+{\widehat{m}}+1\) constraints. It is well known that LP has a solution if and only if there exists a basic feasible solution x (i.e., with at most \(m+{\widehat{m}}+1\) values in x different from 0). Therefore we can guess a \((m+{\widehat{m}}+1)\)-vector S of variable indices and non-deterministically assign values \(v \le {\texttt {size}}\) to them. At this point we are able to verify that this basic solution is admissible (i.e. satisfy the linear constraints).

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Saccá, D., Serra, E. & Rullo, A. Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications. Data Min Knowl Disc 33, 1736–1774 (2019). https://doi.org/10.1007/s10618-019-00643-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-019-00643-1

Keywords

Navigation