Abstract
Audio content recognition is an emerging technology that forms the basis for mobile services, such as automatic song recognition, second-screen synchronization, and broadcast monitoring. The technology utilizes audio fingerprints, short patterns that are extracted from audio recordings of a smartphone and enable the identification of specific content. These fingerprints are generally considered privacy-friendly, as they contain minimal information of the original signal. As a result, mobile applications have emerged in the past few years that silently monitor user habits by collecting such audio fingerprints in the background. In this paper, we systematically examine whether audio fingerprints leak sensitive information from the recording environment and potentially violate the privacy of smartphone users. To this end, we analyze three popular audio recognition solutions and develop attacks to infer sensitive information from their fingerprints. To the best of our knowledge, we are the first to show that the identification of speakers and words in the fingerprints is possible. Based on our analysis, we conclude that current audio fingerprints do not sufficiently protect privacy and should be used with great caution.
M. Pfister and R. Michael—Authors contributed equally to this work.
Notes
- 1.
- 2.
sha1: 3c0770204a5d769c1a22a4acb7f9d6a4dd12e55c.
- 3.
sha1: 5de8eb4098d2e35a2c3951a169bf9e19a680e2d4.
- 4.
Acknowledgements
This work was funded by the German Federal Ministry of Education and Research (BMBF) under the grants BIFOLD24B and 16KIS1142K.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
A Details on ACR Solutions
In this section, we provide information that we obtained through reverse engineering of the two commercial ACR solutions Zapr and ACRCloud.
Analysis Setup
We first describe the experimental setup we use to reverse engineer the apps and then discuss our findings for both solutions.
Mobile Apps. For ACRCloud, we base our analysis on the Deezer app (version 6.1.14.99) and verify that our insights also remain valid for more recent versions of the SDK (version 6.2.13.151). Similarly, we use the Android smartphone application ABP Live TV News (version 9.9.7) for Zapr.
Dynamic Analysis. Both solutions encapsulate the implementations of the ACR algorithms in a shared library, which is provided as a native binary object and accessed by the Android apps through the Java Native Interface (JNI). We treat the fingerprinting algorithms inside the shared object as black boxes and observe their return values. To this end, we use the dynamic instrumentation toolkit Frida, which allows us to run the fingerprinting algorithms on controlled input signals and extract the resulting audio fingerprints. A static analysis shows that all algorithms expect the input signal to be sampled at a frequency of 8,000 Hz with an audio bit depth of 16 bits. Providing the ACR implementations with properly preprocessed audio samples yields the required audio fingerprints, which can then be utilized for further analysis.
To learn more about the underlying structure of the fingerprints, we perform controlled experiments using specifically crafted audio signals from which we derive audio fingerprints. For instance, we use audio signals that contain only one particular frequency or even pure silence.
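Crafted inputs of this kind can be sketched as follows. This is a minimal Python illustration, not the tooling used in the analysis: the helper names (`sine_tone`, `silence`, `to_pcm_bytes`), the 1 kHz test frequency, the amplitude, and the little-endian byte order are assumptions. Only the 8,000 Hz sampling rate, the 16-bit depth, and the three-second snippet length are taken from the text.

```python
import math
import struct

SAMPLE_RATE_HZ = 8000  # input rate reported for all analyzed algorithms
MAX_INT16 = 32767      # full scale of signed 16-bit PCM


def sine_tone(freq_hz, duration_s, amplitude=0.5):
    """Return a pure tone as a list of signed 16-bit sample values."""
    n_samples = int(round(duration_s * SAMPLE_RATE_HZ))
    scale = amplitude * MAX_INT16
    return [
        int(round(scale * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ)))
        for i in range(n_samples)
    ]


def silence(duration_s):
    """Return an all-zero signal of the requested duration."""
    return [0] * int(round(duration_s * SAMPLE_RATE_HZ))


def to_pcm_bytes(samples):
    """Pack samples as signed 16-bit PCM (little-endian is an assumption)."""
    return struct.pack("<%dh" % len(samples), *samples)


# Three-second probes: a single-frequency tone and pure silence.
tone = sine_tone(1000.0, 3.0)  # 1 kHz is an arbitrary, hypothetical choice
quiet = silence(3.0)
tone_pcm = to_pcm_bytes(tone)
quiet_pcm = to_pcm_bytes(quiet)
```

Feeding a single known frequency makes it easier to attribute changes in individual fingerprint bytes to that frequency, while silence shows how an algorithm behaves when no peak is present.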
Fingerprint Structures
We find that the fingerprint structures differ widely, not only between ACRCloud and Zapr but even between the two Zapr algorithms we selected for our analysis. In the following, we provide more details on our findings.
ACRCloud. For ACRCloud, we find that the generated audio fingerprints vary in length, although all audio snippets are three seconds long. In particular, the length of the generated fingerprints for our dataset varies between 344 and 752 bytes, with a median of 544 bytes. Each fingerprint consists of multiple subfingerprints (see Sect. 3.2) with a length of 8 bytes. The first two bytes of each subfingerprint encode the frequency of an identified peak. Here, ACRCloud seemingly segments the frequency band, which has a maximum frequency of 4,000 Hz, into 1024 distinct bins of equal size, leading to a frequency resolution of \(f_{res} = \frac{4000~\textrm{Hz}}{1024} \approx 3.906~\textrm{Hz}\). The third and fourth bytes of each subfingerprint encode the time offset \(\varDelta t\) with a granularity of roughly 20 ms. For the last four bytes, we are unable to derive a clear explanation. We notice, however, that the information stored in these bytes depends on the frequency bytes but not on the time offset.
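The layout described above can be illustrated with a short parsing sketch. It is a hedged reconstruction from the stated field sizes only, not ACRCloud's actual format definition: the little-endian byte order, the exact 20 ms step (the text says "roughly"), and all names such as `parse_acrcloud_fingerprint` are assumptions. The last four bytes are kept as opaque data because their meaning is unknown.

```python
import struct
from typing import List, NamedTuple

SUBFP_SIZE = 8                # bytes per subfingerprint
BIN_WIDTH_HZ = 4000.0 / 1024  # about 3.906 Hz per frequency bin
TIME_STEP_S = 0.020           # "roughly 20 ms"; exact value is an assumption


class AcrSubFingerprint(NamedTuple):
    freq_hz: float        # lower edge of the peak's frequency bin
    time_offset_s: float  # approximate time offset of the peak
    opaque: bytes         # last four bytes, meaning not determined


def parse_acrcloud_fingerprint(blob: bytes) -> List[AcrSubFingerprint]:
    """Split a fingerprint into 8-byte subfingerprints and decode known fields."""
    if len(blob) % SUBFP_SIZE != 0:
        raise ValueError("fingerprint length must be a multiple of 8 bytes")
    result = []
    for offset in range(0, len(blob), SUBFP_SIZE):
        # Byte order is an assumption; "<HH" reads two unsigned 16-bit values.
        freq_bin, time_units = struct.unpack_from("<HH", blob, offset)
        opaque = blob[offset + 4:offset + SUBFP_SIZE]
        result.append(
            AcrSubFingerprint(freq_bin * BIN_WIDTH_HZ,
                              time_units * TIME_STEP_S,
                              opaque)
        )
    return result


# Hypothetical subfingerprint: bin 256 (1000 Hz), 50 time steps (about 1 s).
example = struct.pack("<HH4s", 256, 50, b"\x00\x00\x00\x00")
parsed = parse_acrcloud_fingerprint(example)
```

Under this layout, the observed fingerprint lengths of 344, 544, and 752 bytes correspond to 43, 68, and 94 subfingerprints, respectively.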
Zapr. For Zapr Alg1, none of the observed fingerprints exceeds 340 bytes in length, which suggests a maximum length for the fingerprints. Additionally, each fingerprint's length is divisible by 4 bytes, indicating that the fingerprints are composed of multiple subfingerprints, each 4 bytes long. The only exception we find is for silent signals, for which the algorithm does not output any fingerprints. The first two bytes of a subfingerprint encode the time offset with a precision of 2 s. The last byte encodes frequency information, systematically partitioning the 4 kHz band into 256 distinct frequency bins. The purpose of the third byte remains unclear. Unfortunately, for Zapr Alg2, we have not been able to derive information about its structure.
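A comparable sketch for Zapr Alg1 is given below. It again relies only on the field sizes stated above and is not Zapr's actual format definition: the little-endian byte order and all names (`parse_zapr_alg1`, `ZaprSubFingerprint`) are assumptions. The time field is left as a raw integer rather than converted to seconds, and the third byte is passed through unchanged because its purpose is unknown.

```python
import struct
from typing import List, NamedTuple

ZAPR_SUBFP_SIZE = 4               # bytes per subfingerprint
ZAPR_MAX_BYTES = 340              # largest fingerprint observed
ZAPR_BIN_WIDTH_HZ = 4000.0 / 256  # 15.625 Hz per frequency bin


class ZaprSubFingerprint(NamedTuple):
    time_raw: int    # first two bytes, raw time-offset value
    unknown: int     # third byte, purpose not determined
    freq_hz: float   # lower edge of the frequency bin from the last byte


def parse_zapr_alg1(blob: bytes) -> List[ZaprSubFingerprint]:
    """Split a Zapr Alg1 fingerprint into 4-byte subfingerprints."""
    if len(blob) % ZAPR_SUBFP_SIZE != 0:
        raise ValueError("fingerprint length must be a multiple of 4 bytes")
    if len(blob) > ZAPR_MAX_BYTES:
        raise ValueError("fingerprint exceeds the observed maximum of 340 bytes")
    result = []
    for offset in range(0, len(blob), ZAPR_SUBFP_SIZE):
        # Byte order of the 16-bit time field is an assumption.
        time_raw, unknown, freq_bin = struct.unpack_from("<HBB", blob, offset)
        result.append(
            ZaprSubFingerprint(time_raw, unknown, freq_bin * ZAPR_BIN_WIDTH_HZ)
        )
    return result


# Hypothetical subfingerprint: time value 1, unknown byte 0, bin 64 (1000 Hz).
zapr_example = struct.pack("<HBB", 1, 0, 64)
zapr_parsed = parse_zapr_alg1(zapr_example)
# A silent input yields no fingerprint, which maps to an empty list here.
zapr_silent = parse_zapr_alg1(b"")
```

With 4-byte subfingerprints, the 340-byte maximum corresponds to at most 85 subfingerprints, and the 256 bins give a frequency resolution of 15.625 Hz, four times coarser than the roughly 3.906 Hz found for ACRCloud.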
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pfister, M., Michael, R., Boll, M., Körfer, C., Rieck, K., Arp, D. (2024). Listening Between the Bits: Privacy Leaks in Audio Fingerprints. In: Maggi, F., Egele, M., Payer, M., Carminati, M. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2024. Lecture Notes in Computer Science, vol 14828. Springer, Cham. https://doi.org/10.1007/978-3-031-64171-8_10
DOI: https://doi.org/10.1007/978-3-031-64171-8_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64170-1
Online ISBN: 978-3-031-64171-8
eBook Packages: Computer Science, Computer Science (R0)