Abstract
With more data than ever being stored in cloud services and public demand for secure, privacy-enhanced, and easy-to-use systems, Searchable Encryption schemes were introduced. These schemes enable privacy-enhanced search over encrypted documents, yet disclose (encrypted) queries and responses. The first query recovery attack, the IKK attack, uses this disclosed information to (partly) recover which plaintext words the client searched for, which can in turn leak information about the plaintext contents of the encrypted documents. Under specific assumptions, the IKK attack has been shown to potentially cause serious harm to the security of Searchable Encryption schemes.
We empirically review the IKK query recovery attack to improve the understanding of its feasibility and potential security damage. To do so, we vary the assumed query distribution, which is shown to have a severe negative impact on the accuracy of the attack, as well as the input parameters of the IKK attack, to find a correlation between these parameters and the attack's accuracy. Furthermore, we show that the recovery rate of the attack can be increased by up to 10 percentage points, while the variance of the recovery rate decreases by up to 78 percentage points, by combining the results of multiple attack runs. We also show that including deterministic components in the probabilistic IKK attack can increase the recovery rate by up to 21 percentage points and decrease its variance by up to 57 percentage points.
References
Apache Lucene java-user email dataset, September 2001-July 2011. http://mail-archives.apache.org/mod_mbox/lucene-java-user/. Accessed 19 May 2020
ENRON email dataset, version 7th May 2015. https://www.cs.cmu.edu/~./enron/. Accessed 19 May 2020
IKK query recovery attack implementation (Python). https://github.com/rubengrootroessink/IKK-query-recovery-attack. Accessed 27 June 2020
Blackstone, L., Kamara, S., Moataz, T.: Revisiting leakage abuse attacks. IACR Cryptol. ePrint Arch. 2019, 1175 (2019)
Boneh, D., Kushilevitz, E., Ostrovsky, R., Skeith, W.E.: Public key encryption that allows PIR queries. In: Menezes, A. (ed.) CRYPTO 2007. LNCS, vol. 4622, pp. 50–67. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74143-5_4
Cash, D., Grubbs, P., Perry, J., Ristenpart, T.: Leakage-abuse attacks against searchable encryption. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 668–679. ACM (2015)
Chen, G., Lai, T.H., Reiter, M.K., Zhang, Y.: Differentially private access patterns for searchable symmetric encryption. In: IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pp. 810–818. IEEE (2018)
Curtmola, R., Garay, J., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: improved definitions and efficient constructions. J. Comput. Secur. 19(5), 895–934 (2011)
Goh, E.J.: Secure indexes. IACR Cryptol. ePrint Arch. 2003, 216 (2003)
Goldreich, O., Ostrovsky, R.: Software protection and simulation on oblivious rams. J. ACM (JACM) 43(3), 431–473 (1996)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. In: The Collected Works of Wassily Hoeffding, pp. 409–426. Springer (1994)
Islam, M.S., Kuzu, M., Kantarcioglu, M.: Access pattern disclosure on searchable encryption: ramification, attack and mitigation. In: NDSS. vol. 20, p. 12. Citeseer (2012)
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
Porter, M.F.: Snowball: A language for stemming algorithms (2001)
Song, D.X., Wagner, D., Perrig, A.: Practical techniques for searches on encrypted data. In: Proceeding 2000 IEEE Symposium on Security and Privacy. S&P 2000, pp. 44–55. IEEE (2000)
Zhang, Y., Katz, J., Papamanthou, C.: All your queries are belong to us: the power of file-injection attacks on searchable encryption. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 707–720 (2016)
Zipf, G.K.: Selected studies of the principle of relative frequency in language (1932)
Appendices
1.1 A Co-occurrence Matrix Correlation (Partial Background Knowledge)
Both Islam et al. [12] and Cash et al. [6] briefly elaborate on the recovery rate of the IKK attack in the case where the adversary only has partial background knowledge. Islam et al. add various degrees of Gaussian noise to individual cells of the co-occurrence matrix representing perfect background knowledge to simulate this setting, whereas Cash et al. simulate a non-perfect background-knowledge co-occurrence matrix by taking a fraction of all documents in the dataset. Both papers show that the accuracy of the attack is greatly dependent on the level of background knowledge the adversary has. We therefore argue that it is important to get a better understanding of the correlation between the level of background knowledge the adversary has and the recovery rate of the IKK attack. We also argue that the level of background knowledge can be expressed as a similarity between the query and background-knowledge co-occurrence matrices, i.e. the co-occurrence matrix similarity.
In order to express co-occurrence matrix similarity, we propose a metric that returns a similarity score between 0 (no similarity) and 1 (equivalent matrices) for two co-occurrence matrices of the same dimensions. For two matrices \(M_1\) and \(M_2\) and arbitrary words \(a\) and \(b\) (corresponding to a row and a column), the following formulas are used:
Equations 5 and 6 are used to calculate the total squared Euclidean distance of cells that occur both in \(M_1\) and \(M_2\).
Equations 7 and 8 are used to calculate the number of cells that occur in both \(M_1\) and \(M_2\).
In Eq. 9 we calculate the average squared Euclidean distance of cells that occur in both matrices and multiply this by the ratio of the number of row identifiers that occur in both matrices (\(K_\text{overlap}\)) to the total number of rows per matrix (\(K_\text{total}\)).
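The metric described in Eqs. 5-9 can be sketched in Python. This is a minimal sketch under our own assumptions: co-occurrence matrices are stored as dicts mapping (row word, column word) pairs to counts normalised to [0, 1], and \(K_\text{total}\) is taken as the number of rows per matrix (both matrices have the same dimensions); the function name is illustrative and not taken from the reference implementation [3].

```python
def cooccurrence_similarity(m1, m2):
    """Similarity in [0, 1] between two co-occurrence matrices of equal
    dimensions, each a dict mapping (row_word, col_word) -> normalised count."""
    rows1 = {a for (a, _) in m1}
    rows2 = {a for (a, _) in m2}
    overlap_rows = rows1 & rows2
    # Eqs. 7/8: cells (row/column pairs) that occur in both matrices.
    shared_cells = m1.keys() & m2.keys()
    if not shared_cells:
        return 0.0
    # Eqs. 5/6: total squared Euclidean distance over the shared cells.
    total_sq_dist = sum((m1[c] - m2[c]) ** 2 for c in shared_cells)
    # Eq. 9: average distance, scaled by K_overlap / K_total.
    avg_sq_dist = total_sq_dist / len(shared_cells)
    k_ratio = len(overlap_rows) / len(rows1)
    return (1.0 - avg_sq_dist) * k_ratio
```

On two identical matrices the function returns 1.0; with one of three keywords replaced (and cell values otherwise equal) it returns 2/3, matching the worked examples.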
In Table 6 matrices \(M_1\) and \(M_2\) are exactly the same. \(\Delta_\text{total}^2\) is 0, as the squared Euclidean distance between each pair of cells is \((1-1)^2=0\); the average is therefore also 0. As all keywords in both matrices also occur in the other matrix, \(\frac{K_\text{overlap}}{K_\text{total}}=3/3=1\). The similarity between the matrices is calculated as \(\text{co-oc sim.}=(1-0)\cdot 1=1\), meaning that the matrices are exactly the same. The calculations for the similarities between matrices \(M_1\) and \(M_3\), and between \(M_1\) and \(M_4\), give the values 0 and 2/3 respectively. In our simulations we calculate the co-occurrence matrix similarity between the perfect background-knowledge co-occurrence matrix \(M_F\) (which corresponds to an unmodified query co-occurrence matrix) and a non-perfect background-knowledge co-occurrence matrix \(M_P\). Both matrices have the same dimensions. In order to simulate the partial background-knowledge co-occurrence matrix \(M_P\) we use the following methods:
Gaussian noise addition - We use the method by Islam et al. to add Gaussian noise in various degrees to the cells in \(M_F\) to obtain \(M_P\).
Document percentage - In this setting we use 10% to 100% of the user folders in a dataset to generate \(M_P\). This method differs slightly from the method by Cash et al., as we argue that an adversary is more likely to obtain a percentage of the mailboxes of users (and all documents in these folders) than a percentage of all documents in a dataset, selected uniformly at random. We believe that this choice might influence the results, as different users are likely to use specific language in (all of) their emails.
Keyword percentage - In this setting we select, uniformly at random, 10% to 100% of the keywords in \(M_F\) to obtain \(M_P\). To keep the dimensions of \(M_P\) consistent throughout all our simulations, we supplement the selected keywords with words that have a lower occurrence count in the dataset used, i.e. words that were not in the keyword set.
Different input folder - In this setting we use a different, but similar, dataset to generate \(M_P\). In our simulations we use the inbox folder (containing 44,859 emails) of the ENRON dataset.
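Two of the methods above can be sketched in a few lines of Python. The noise parametrisation (zero-mean Gaussian noise with standard deviation \(c\)) and the helper names are our own illustrative assumptions, not the exact interface of Islam et al.'s method or of the reference implementation [3].

```python
import random

def add_gaussian_noise(m_f, c, rng=None):
    """Gaussian noise addition (sketch): perturb every cell of the
    perfect-knowledge matrix M_F with zero-mean Gaussian noise of
    standard deviation c to obtain M_P."""
    rng = rng or random.Random(0)
    return {cell: v + rng.gauss(0.0, c) for cell, v in m_f.items()}

def sample_user_folders(folders, fraction, rng=None):
    """Document percentage (sketch): select a fraction of the user
    folders uniformly at random; M_P is then rebuilt from all
    documents inside the selected folders."""
    rng = rng or random.Random(0)
    k = max(1, round(fraction * len(folders)))
    return rng.sample(sorted(folders), k)
```

With \(c=0\) the noise method returns \(M_F\) unchanged, which matches the observation below that small noise levels leave the matrices nearly identical.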
The results of the different methods are shown in Figs. 8, 9 and 10. In these figures we group similarity scores into buckets: a co-occurrence similarity score between 0 and 0.1 is put in the 10% similarity bucket, a score between 0.1 and 0.2 in the 20% bucket, and so on.
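The bucketing rule can be expressed as a small helper; this is a sketch, and the handling of exact boundary values such as 0.1 (assigned here to the lower bucket) is our own assumption.

```python
import math

def similarity_bucket(score):
    """Map a similarity score in (0, 1] to its 10%-wide bucket label:
    (0, 0.1] -> "10%", (0.1, 0.2] -> "20%", ..., (0.9, 1.0] -> "100%"."""
    return f"{max(10, math.ceil(score * 10) * 10)}%"
```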
Figure 8 shows the correlation between the co-occurrence similarity and the recovery rate in the setting where we use a certain percentage of the keywords in the keyword set to simulate partial background knowledge. Each bucket represents 20 simulations. Due to the nature of our similarity score, using 90% of the keywords from the keyword set results in a similarity score of exactly 90%. The figure shows a clear linear correlation between the co-occurrence similarity score and the recovery rate. This makes sense, as in this setting entire rows (and thus also columns) that are present in \(M_F\) are changed while simulating \(M_P\); the highest possible percentage of recoverable queries is therefore linearly dependent on the percentage of keywords retained. As the individual cell values are not changed while simulating \(M_P\) (as opposed to the other methods), the algorithm is likely to recover (most of) the queries whose corresponding keywords were selected in \(M_P\), as these have the optimal Euclidean distance of 0.
Figure 9 shows the correlation between the co-occurrence matrix similarity and the recovery rate in simulations where non-perfect background knowledge is simulated by taking a percentage of the user folders in a dataset to generate \(M_P\). We ran 20 simulations for each percentage from 10% to 100%, in steps of 10%. The first thing we notice is that the buckets do not contain the same number of simulations, shown in the figure as the number between brackets. This is because we select a percentage of all user folders uniformly at random, and these folders do not contain the same number of documents; different writing styles can also influence the overall co-occurrence similarity. The results in Fig. 9 show a different correlation than the results in Fig. 8. This makes sense, as in these simulations both the row/column identifiers and the individual cell values are changed. The algorithm is less likely to correctly map queries to keywords that do occur in \(M_P\), as the changed individual cell values make it less likely to find the optimal mapping. We note that the results in the 40%-60% buckets do not give much information, as each of these buckets consists of a single simulation. From the rest of the results we conclude that the co-occurrence matrix similarity and the recovery rate show an exponential correlation in Fig. 9.
The results in Fig. 10 show the correlation between the co-occurrence matrix similarity and the recovery rate of simulations where \(M_P\) was generated using a similar, but different dataset (ENRON/inbox) and when we add Gaussian noise to various degrees. First of all, if we generate \(M_P\) using the ENRON/inbox data folder we obtain a similarity score of approximately 70% to the ENRON/_sent_mail dataset. The recovery rate of almost 0 is consistent with our results in Fig. 9.
With the addition of various degrees of Gaussian noise (with C values 0.0, 0.2, ..., 1.0), the similarity of the co-occurrence matrices is always between 0.999 and 1.0. This can be explained by the fact that this method does not change the row/column identifiers, but only the individual cell values (co-occurrence counts). As only a little noise is added, most of these cell values remain roughly the same. The recovery rate over 120 simulations is relatively high compared to the other methods of generating non-perfect background knowledge.
We conclude that it is not possible to use our matrix similarity metric to find a single correlation between the similarity of co-occurrence matrices and the recovery rate. The different methods change the non-perfect background knowledge \(M_P\) in different manners, and this strongly influences the results of the IKK attack. The IKK attack correctly recovers queries if the co-occurrence counts in \(M_P\) exactly match (or are close to) those in \(M_F\), as shown in Figs. 8 and 10 (Gaussian noise addition). If the co-occurrence counts in \(M_P\) are further away from those in \(M_F\), as is the case in Figs. 9 and 10 (inbox folder), the accuracy of the IKK attack decreases drastically.
We argue that the scenario where the background knowledge, represented as \(M_P\), is generated by taking a percentage of the user folders in a dataset is the most realistic in a real-world setting. It is not unlikely that an adversary somehow gets access to the plaintext contents of the email boxes of specific users. The IKK attack proves to be a powerful attack that can break the privacy of queries as well as the confidentiality of the documents stored encrypted on the server, yet it is only exploitable by a powerful adversary with access to a dataset that yields background knowledge at least 90% similar to the actual dataset encrypted on the server, as can be seen in Fig. 9.
1.2 B IKK Algorithms
In this section, we cite (part of) the Simulated Annealing (SA) algorithms as proposed by Islam et al. [12] and formalize our proposed algorithms for the Deterministic version of the IKK attack. In short:
- Algorithm Optimizer (Algorithms 1 and 2) is used to select the initial state of the ANNEAL algorithm.
- Algorithm ANNEAL (Algorithms 3 and 4) is the heart of the IKK attack and is the actual algorithm that maps queries to keywords (apart from setting the initial state). The algorithms displayed are a simplified version of the ANNEAL algorithm as presented by Islam et al. and are mainly included to illustrate the changes we made to select the nextState more deterministically.
- Algorithm findNextStateDet (Algorithm 5) is our proposed sub-algorithm of the ANNEAL algorithm, which selects a new state more deterministically.
In order to easily annotate differences between the Original IKK attack (as proposed by Islam et al.) and our proposed Deterministic IKK attack we use the following colors:
- Black annotates lines that are present in both the Original and the Deterministic IKK attack.
- Red annotates lines that are present in the Original IKK attack, but not in the Deterministic IKK attack. Red lines are replaced by blue lines.
- Blue annotates lines that are not present in the Original IKK attack, but are in the Deterministic IKK attack. Blue lines replace red lines.
The Original IKK attack and the Deterministic IKK attack are elaborated upon in Sects. 2.6 and 6 respectively.
We note that the pseudocode in the algorithms shown below does not fully match our implementation of the Original/Deterministic IKK attack [3]. First, we used Python 3-specific methods to implement both attacks concisely and, in order to reduce the number of lines, we reused as much of the code of the Original IKK attack as possible when implementing our Deterministic IKK attack.
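To make the structure of the ANNEAL loop concrete, the following is a heavily simplified Python sketch of the probabilistic variant: the state is a query-to-keyword assignment, the next state reassigns one query (swapping assignments if the target keyword is already taken), and worse states are accepted with the usual Metropolis probability under geometric cooling. All parameter names, defaults, and the cost definition (squared differences between query and background co-occurrences under the candidate mapping) are illustrative assumptions and do not reproduce the reference implementation [3]; the Deterministic variant would replace the random next-state choice inside this loop.

```python
import math
import random

def cost(mapping, m_q, m_b):
    """Sum of squared differences between the query co-occurrence matrix
    M_Q and the background matrix M_B under the candidate mapping."""
    return sum((m_q[(q1, q2)] - m_b.get((mapping[q1], mapping[q2]), 0.0)) ** 2
               for (q1, q2) in m_q)

def anneal(queries, keywords, m_q, m_b,
           t0=1.0, cooling=0.999, steps=20000, rng=None):
    """Minimal simulated-annealing sketch mapping queries to keywords."""
    rng = rng or random.Random(0)
    mapping = dict(zip(queries, rng.sample(keywords, len(queries))))
    cur = cost(mapping, m_q, m_b)
    t = t0
    for _ in range(steps):
        q = rng.choice(queries)
        new_kw = rng.choice(keywords)
        old_kw = mapping[q]
        if new_kw == old_kw:
            t *= cooling
            continue
        # If new_kw is already assigned, swap the two queries' keywords
        # so the mapping stays injective.
        other = next((p for p, k in mapping.items() if k == new_kw), None)
        mapping[q] = new_kw
        if other is not None:
            mapping[other] = old_kw
        new = cost(mapping, m_q, m_b)
        if new <= cur or rng.random() < math.exp(-(new - cur) / t):
            cur = new          # accept (Metropolis criterion)
        else:                  # reject: revert the move
            mapping[q] = old_kw
            if other is not None:
                mapping[other] = new_kw
        t *= cooling           # geometric cooling schedule
    return mapping
```

On a small instance where the query matrix is an exact relabelling of the background matrix, the loop converges to the true mapping, since the correct assignment has the unique cost of 0.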
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Groot Roessink, R., Peter, A., Hahn, F. (2021). Experimental Review of the IKK Query Recovery Attack: Assumptions, Recovery Rate and Improvements. In: Sako, K., Tippenhauer, N.O. (eds) Applied Cryptography and Network Security. ACNS 2021. Lecture Notes in Computer Science(), vol 12727. Springer, Cham. https://doi.org/10.1007/978-3-030-78375-4_7
Print ISBN: 978-3-030-78374-7
Online ISBN: 978-3-030-78375-4