research-article

Are Your Requests Your True Needs? Checking Excessive Data Collection in VPA App

Authors:

Mark Huasong Meng,

Guangdong Bai

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

Article No.: 205, Pages 1 - 12

https://doi.org/10.1145/3597503.3639107

Published: 12 April 2024

Abstract

Virtual personal assistant (VPA) services host a large number of third-party applications (or apps) that enrich their functionality. Prior work has scrutinized these apps' data collection behaviors against their declared privacy policies. Nonetheless, it is often overlooked that most users tend to ignore privacy policies at installation time. Dishonest developers can thus exploit this situation by embedding excessive declarations to cover their data collection behaviors during compliance auditing.

In this work, we present Pico, a privacy inconsistency detector that checks a VPA app's privacy compliance by analyzing the (in)consistency between the data it requests and the data essential for its functionality. Pico infers the app's functionality topics from its publicly available textual data and leverages advanced GPT-based language models to address domain-specific challenges. By comparing an app against counterparts with similar functionality, Pico detects suspicious data collection through the lens of anomaly detection. We apply Pico to understand the status quo of data-functionality compliance among all 65,195 skills in the Alexa app store. Our study reveals that 21.7% of the analyzed skills exhibit suspicious data collection, including top-10 popular Alexa skills that pose threats to 54,116 users. These findings should alert both developers and users to the importance of complying with the purpose limitation principle in data regulations.
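
The idea of comparing an app against counterparts with similar functionality can be illustrated with a minimal sketch. This is not Pico's implementation: the abstract does not specify the anomaly detector, so the sketch substitutes a simple peer-frequency rule, and the skill names, topics, data types, and the `min_peer_share` threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (skill name, functionality topic, requested data types).
SKILLS = [
    ("daily-weather", "weather", {"postal_code"}),
    ("storm-alerts", "weather", {"postal_code"}),
    ("rain-radar", "weather", {"postal_code"}),
    ("sky-today", "weather", {"postal_code", "email", "phone_number"}),
    ("pizza-order", "food", {"address", "phone_number"}),
    ("taco-run", "food", {"address", "phone_number"}),
    ("noodle-now", "food", {"address", "phone_number", "email"}),
]


def flag_suspicious(skills, min_peer_share=0.5):
    """Flag data types that few skills in the same topic request.

    A simplified stand-in for anomaly detection: a data type is treated as
    suspicious for a skill when the share of same-topic skills requesting it
    (the skill itself included) falls below min_peer_share.
    """
    # Group skills by their functionality topic.
    by_topic = defaultdict(list)
    for name, topic, requested in skills:
        by_topic[topic].append((name, requested))

    flagged = {}
    for members in by_topic.values():
        # Count how many skills in this topic request each data type.
        counts = Counter(d for _, requested in members for d in requested)
        for name, requested in members:
            rare = {d for d in requested
                    if counts[d] / len(members) < min_peer_share}
            if rare:
                flagged[name] = rare
    return flagged


if __name__ == "__main__":
    for name, rare in flag_suspicious(SKILLS).items():
        print(name, sorted(rare))
```

In this toy data, a weather skill that also asks for an email address and phone number stands out because its peers request only a postal code. A real detector would need to handle noisy topic assignments and small groups, which this rule ignores.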

Index Terms

1. Security and privacy
  1. Software and application security
    1. Web application security

Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

May 2024

2942 pages

ISBN:9798400702174

DOI:10.1145/3597503

Co-chairs: Ana Paiva, Rui Abreu

Program Co-chairs: Abhik Roychoudhury, Margaret Storey

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE '24

Sponsor:

SIGSOFT

ICSE '24: IEEE/ACM 46th International Conference on Software Engineering

April 14 - 20, 2024

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
