CN118332300A - Compliance detection method and system for privacy policy labels of mobile terminal application programs - Google Patents
- Publication number: CN118332300A
- Application number: CN202410463213.1A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F21/6245: Protecting personal data, e.g. for financial or medical purposes
- G06N3/045: Combinations of networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/048: Activation functions
- G06N3/0499: Feedforward networks
- G06N3/09: Supervised learning
Description
Technical Field
The present invention belongs to the technical field of label compliance detection, and in particular relates to a compliance detection method and system for privacy policy labels of mobile applications.
Background
Analyzing the correctness of privacy policy labels for mobile applications is of considerable research significance. Most existing analysis methods for mobile application privacy policies, however, are limited to segmenting text that is already known.
The inventors found that there is still a large gap in methods that automatically obtain and analyze a privacy policy and its labels when little information about the application is available. At present, the vast majority of analysis methods for application privacy policies focus on segmenting or labeling text. They do not comprehensively compare the privacy labels published by the application provider with the privacy labels generated when the application runs in a real scenario. As a result, inconsistencies between the privacy policy and the information the application actually collects in real-world use go undetected, which creates security problems.
Summary of the Invention
To solve the above problems, the present invention proposes a compliance detection method and system for privacy policy labels of mobile applications. On the basis of automated and accurate label prediction, the invention judges the consistency of the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime, and obtains a detection result. This addresses the problem of inconsistency between the privacy policy and the information collected by the application in real-world use.
To achieve the above object, the present invention is implemented through the following technical solutions:
In a first aspect, the present invention provides a compliance detection method for privacy policy labels of mobile applications, comprising:
obtaining privacy policy links;
screening the obtained privacy policy links to exclude invalid links, performing tree traversal on the privacy policy links that remain after the invalid links are excluded, and deleting invalid characters;
extracting the nouns in the text of the screened and processed privacy policy links and building a database;
performing label prediction according to the database and a pre-trained large language model;
judging the consistency of the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime, to obtain a detection result.
Further, the privacy policy document on the page behind each privacy policy link is saved as a text file; the text is read according to its hierarchical structure and divided tag by tag, layer by layer; the text information in each layer of tags is extracted, and invalid characters in the text information are deleted.
Further, the invalid characters include carriage returns, line feeds, indentation characters, and non-conventionally encoded characters.
Further, a natural language processing toolkit is used to perform text preprocessing and to verify whether each extracted noun is a valid noun.
Further, the text of the privacy policy links is cleaned: HTML tags and special characters are removed, the text is segmented into words, and stop words are removed.
Further, the large language model includes a multi-head self-attention mechanism, a feedforward neural network, layer normalization, and residual connections. The multi-head self-attention mechanism attends to information at different positions when processing an input sequence, and the feedforward neural network improves nonlinear modeling capability through an activation function. Layer normalization and a residual connection are introduced after each sublayer of the large language model.
In a second aspect, the present invention further provides a compliance detection system for privacy policy labels of mobile applications, comprising:
a data collection module, configured to obtain privacy policy links;
a preprocessing module, configured to screen the obtained privacy policy links to exclude invalid links, perform tree traversal on the privacy policy links that remain after the invalid links are excluded, and delete invalid characters;
a database establishment module, configured to extract the nouns in the text of the screened and processed privacy policy links and build a database;
a label prediction module, configured to perform label prediction according to the database and a pre-trained large language model;
a detection module, configured to judge the consistency of the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime, to obtain a detection result.
In a third aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the compliance detection method for privacy policy labels of mobile applications described in the first aspect.
In a fourth aspect, the present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the compliance detection method for privacy policy labels of mobile applications described in the first aspect are implemented.
In a fifth aspect, the present invention further provides a computer program product comprising a computer program; when executed by a processor, the computer program implements the steps of the compliance detection method for privacy policy labels of mobile applications described in the first aspect.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention first screens the obtained privacy policy links to exclude invalid links, performs tree traversal on the remaining privacy policy links, and deletes invalid characters. It then extracts the nouns in the text of the screened and processed privacy policy links and builds a database, and performs label prediction according to the database and a pre-trained large language model. Finally, it judges the consistency of the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime, to obtain a detection result. On the basis of automated and accurate label prediction, the invention thus takes into account both the privacy labels published by the application provider and the privacy labels generated in real scenarios, which addresses the problem of inconsistency between the privacy policy and the information collected by the application in real-world use.
Brief Description of the Drawings
The accompanying drawings, which form a part of this embodiment, are provided for a further understanding of this embodiment. The illustrative embodiments and their descriptions are used to explain this embodiment and do not unduly limit it.
FIG. 1 is a flow chart of the method of Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the system flow of Embodiment 1 of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed descriptions are exemplary and are intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the art to which the present application belongs.
Embodiment 1:
At present, most analysis methods for mobile application privacy policies are limited to segmenting text that is already known, and there is still a large gap in methods that automatically obtain and analyze a privacy policy and its labels when little information about the application is available.
Analyzing the correctness of application privacy policy labels is of considerable research significance. Mainstream application providers already offer pre-generated schemes: stores such as Google Play and the App Store allow developers to attach privacy policy labels themselves. These labels tell users clearly which of their information the application needs to collect. Studies in recent years have shown, however, that the privacy policy and the information the application actually collects in real-world use can be inconsistent, which raises many security problems.
In view of the above problems, this embodiment provides a compliance detection method for privacy policy labels of mobile applications, comprising:
obtaining privacy policy links;
screening the obtained privacy policy links to exclude invalid links, performing tree traversal on the privacy policy links that remain after the invalid links are excluded, and deleting invalid characters;
extracting the nouns in the text of the screened and processed privacy policy links and building a database;
performing label prediction according to the database and a pre-trained large language model;
judging the consistency of the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime, to obtain a detection result.
Specifically, the obtained privacy policy links are screened and traversed as trees, and labels are predicted automatically and accurately with the aid of a large language model. On that basis, both the privacy labels published by the application provider and the privacy labels generated in real scenarios are taken into account: the predicted labels, the privacy labels predefined with the service provider, and the privacy labels observed at actual runtime are judged for consistency to obtain the detection result. This addresses the problem of inconsistency between the privacy policy and the information collected by the application in real-world use.
In this embodiment, page structures behind the obtained privacy policy links vary widely. Accessing a page directly and extracting its text would also collect many unnecessary separator fields and JavaScript scripts (JavaScript being a lightweight, interpreted or just-in-time compiled programming language with first-class functions). Therefore, when processing privacy policy links, this embodiment first performs link screening and preprocessing to exclude invalid links and ensure that only content related to the privacy policy is processed. It then performs HTML parsing and DOM tree traversal to understand the page structure, and uses NLP techniques to extract and clean the text data, including removing HTML tags and redundant whitespace characters. Finally, the verified results are stored in the database for further analysis.
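The embodiment does not state which links count as invalid. As a minimal sketch, assume a link is usable only if it has an http or https scheme and a host, and that the same page reached with different fragments should be kept once. The function name `screen_links` and these validity rules are illustrative assumptions, not part of the described method.

```python
from urllib.parse import urlsplit, urlunsplit


def screen_links(links):
    """Keep well-formed, de-duplicated http(s) links (assumed validity rule)."""
    seen = set()
    kept = []
    for raw in links:
        try:
            parts = urlsplit(raw.strip())
        except ValueError:
            # Malformed URLs (for example a broken IPv6 host) are treated as invalid.
            continue
        # Assumption: a privacy policy link needs an http(s) scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Drop the fragment and lowercase the host so variants of one page count once.
        normalized = urlunsplit(
            (parts.scheme, parts.netloc.lower(), parts.path, parts.query, "")
        )
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

A real deployment would probably also check that each remaining link responds successfully, which is left out here to keep the sketch free of network access.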
In this embodiment, DOM tree traversal proceeds as follows. First, the privacy policy document of the page is saved as a text file. Then, the local privacy policy text is read according to its hierarchical structure, for example by dividing the page layer by layer along its html/div/ul/li/p tags. The text information in each layer of tags is extracted, and invalid characters in it, such as carriage returns, line feeds, indentation characters, and non-conventionally encoded characters, are deleted. Optionally, invalid characters are removed with Python's replace method. This is a built-in method of Python string objects (str.replace) that replaces a specified substring with a new substring; replacing each invalid character with an empty string removes it from the text.
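The traversal and cleaning described above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions rather than the embodiment's implementation: the class name `PolicyTextExtractor`, the list of void tags, and the reading of "non-conventionally encoded characters" as non-printable characters are all assumptions. Line breaks and tabs are replaced with a space instead of an empty string so that adjacent words do not fuse.

```python
import re
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}                     # content never kept
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


def clean_text(text):
    """Remove invalid characters from one text node."""
    # str.replace is used as in the embodiment; a space keeps neighbouring words apart.
    for ch in ("\r", "\n", "\t"):
        text = text.replace(ch, " ")
    # Assumption: non-printable characters stand in for "non-conventional" ones.
    text = "".join(c for c in text if c.isprintable())
    return re.sub(r" {2,}", " ", text).strip()


class PolicyTextExtractor(HTMLParser):
    """Walk the tag hierarchy and record (tag path, cleaned text) pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.path = []        # stack of currently open tags
        self.blocks = []      # collected (path, text) pairs
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.path.append(tag)
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in self.path:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = clean_text(data)
        if text:
            self.blocks.append(("/".join(self.path), text))


def extract_policy_blocks(html_text):
    parser = PolicyTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.blocks
```

Recording the tag path with each block keeps the layer-by-layer structure available to later steps, for example to keep only text found under `p` or `li` tags.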
In this embodiment, the data is then cleaned with natural language processing methods. Optionally, the Natural Language Toolkit (NLTK) is used for text preprocessing. Specifically, the NLTK package and its WordNet module are loaded first; the WordNet module is used to verify whether each extracted word is a valid noun. WordNet is an English lexical database that organizes the relationships between English words, including synonyms, antonyms, and hypernym/hyponym relations, and records the part of speech of each entry. An is_noun check built on it extracts the nouns contained in the text. These nouns are then saved to the database for further analysis.
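A minimal sketch of the noun extraction step follows. NLTK's WordNet interface may not expose a function named `is_noun`, so the sketch treats it as a user-supplied predicate; the function name `extract_nouns` and the tokenizing regular expression are assumptions. The comment shows one way the predicate could be built on WordNet when NLTK and its WordNet corpus are installed.

```python
import re

# With NLTK installed and the WordNet corpus downloaded, the predicate could be:
#   from nltk.corpus import wordnet
#   def is_noun(word):
#       return bool(wordnet.synsets(word, pos=wordnet.NOUN))


def extract_nouns(text, is_noun):
    """Return the distinct lowercase words of `text` accepted by `is_noun`,
    in order of first appearance."""
    seen = set()
    nouns = []
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        if word not in seen and is_noun(word):
            seen.add(word)
            nouns.append(word)
    return nouns
```

Passing the predicate as an argument keeps the extraction logic independent of the lexical resource, so a different dictionary or part-of-speech tagger can be substituted without changing the function.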
In this embodiment, during label prediction, the nouns extracted from each privacy policy and stored in the database are input into a large language model for label prediction. The large language model includes, but is not limited to, models such as ChatGPT, Wenxin Yiyan (ERNIE Bot), and Tongyi Qianwen. For these general-purpose large language models, this embodiment designs a fixed prompt for querying.
Taking ChatGPT as an example, its full name is Chat Generative Pre-trained Transformer and it adopts the Transformer architecture, whose key components include a multi-head self-attention mechanism, a feedforward neural network, layer normalization, and residual connections. The multi-head self-attention mechanism allows the model to attend to information at different positions when processing the input sequence, and the feedforward neural network improves nonlinear modeling capability through an activation function. To maintain training stability and gradient flow, layer normalization and a residual connection are introduced after each sublayer, and positional encoding preserves the order information of the input sequence. In large-scale pre-training, self-supervised learning lets the model learn the statistical structure and patterns of language from a broad corpus. With billions of parameters and pre-training on such a corpus, the model can generate natural, fluent text and performs well on a variety of natural language processing tasks. For privacy label prediction, a large language model can interpret large amounts of text, recognize privacy-related semantics and context, and thereby predict the privacy content of the text effectively.
For this application scenario, the query in this embodiment is set as: As a privacy policy analyst, here is what you know about the labels (<Defined Label A>, <Defined Label B>, ...). Next, I will give you a list of privacy policy words (<DB Label A>, <DB Label B>, ...). Can you please tell me which tags these lists involve? Reply format [<label>, <label>, ...]. Here, <Defined Label> corresponds to all the privacy label information collected in advance from Google Play and the App Store, and <DB Label> corresponds to the noun information extracted from the privacy policy and stored in the database. The large language model returns, as a list, the privacy labels covered by the nouns of each privacy policy.
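As an illustration of the fixed prompt described above, the sketch below fills the template and parses a reply in the stated `[<label>, <label>, ...]` format. The helper names `build_prompt` and `parse_reply` are hypothetical, and the call to the language model itself is deliberately left out because the embodiment does not tie the method to one model or API. Discarding returned items that are not among the defined labels is an added safeguard, not something the source specifies.

```python
import re

PROMPT_TEMPLATE = (
    "As a privacy policy analyst, here is what you know about the labels "
    "({defined}). Next, I will give you a list of privacy policy words "
    "({words}). Can you please tell me which tags these lists involve? "
    "Reply format [<label>, <label>, ...]"
)


def build_prompt(defined_labels, policy_nouns):
    """Fill the fixed prompt with store-defined labels and database nouns."""
    return PROMPT_TEMPLATE.format(
        defined=", ".join(defined_labels),
        words=", ".join(policy_nouns),
    )


def parse_reply(reply, defined_labels):
    """Extract the first bracketed list from the reply and keep known labels."""
    match = re.search(r"\[(.*?)\]", reply, flags=re.DOTALL)
    if not match:
        return []
    allowed = {label.lower(): label for label in defined_labels}
    labels = []
    for item in match.group(1).split(","):
        key = item.strip().strip("<>'\" ").lower()
        # Assumption: anything outside the defined label set is ignored.
        if key in allowed and allowed[key] not in labels:
            labels.append(allowed[key])
    return labels
```

Restricting the parsed output to the defined label set keeps the later consistency check working on a closed vocabulary even if the model replies with extra or reworded labels.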
The method of this embodiment has the following characteristics. First, it automates the workflow: predefined scripts and automatic configuration reduce manual operations and improve efficiency. Second, it collects detailed app data from the major application markets, including privacy policy links, labels, and application installation packages, so that the data is comprehensive. It also uses a deep learning model for label prediction, which improves accuracy. In addition, through automated testing and traffic capture, it installs the application in an emulator, captures its data traffic, and analyzes the application's behavior comprehensively, which helps identify potential privacy problems. Rule matching and consistency checks verify the consistency of the privacy policy and improve the accuracy and security of the privacy labels. Finally, a user-friendly interface and report output let users operate the system easily and see privacy policy compliance and potential risks clearly, providing a comprehensive solution for data privacy protection. The specific steps of this embodiment are:
S1. Run the predefined script file to automatically configure the required dependency environment.
S1.1. Optionally, the user runs the PreChecker.exe file, which detects the user's operating system information and hardware configuration and provides the installation address of a virtual machine that meets the running requirements.
S1.2. Optionally, the user installs the Android virtual machine. The system then checks whether the virtual machine is configured correctly, automatically configures the Python environment and installs the required dependency libraries, and downloads the deep learning model files required by the privacy policy label analysis system. After configuration is complete, the user closes the application as prompted and restarts the machine.
S2. Read the APK (Android application package) information entered by the user, and collect detailed app data from the major application markets.
Optionally, the PolicyLabel.exe application is started, the user enters the package name of the application to be analyzed and then selects the application service provider entry as prompted, after which the application information collection process begins.
S3. From the collected data, extract the privacy policy labels predefined with the application service provider, the privacy policy link, and the application installation package.
Optionally, step S2 yields the privacy policy link, the labels, and the application's APK; the privacy labels are then stored in the database for the final comparative analysis.
S4. Access the privacy policy link, extract the privacy policy text, preprocess the text content, and input it into the system's pre-trained deep learning model for label prediction to generate the predicted labels. The deep learning model is optionally a large language model.
S4.1. Because the privacy policy page associated with each APK has a different structure, this embodiment predefines crawling rules. For example, for a plain-text page, all text inside the page's <p> tags is collected. With such rules, the privacy policy text behind the vast majority of current privacy policy links can be collected.
S4.2. Preprocess the collected privacy policy text. The preprocessing pipeline includes cleaning the text, removing HTML tags and special characters, word segmentation, removing stop words, and stemming or lemmatization; it then builds a vocabulary, encodes the text as numeric representations, and pads or truncates the text to a common length, providing valid input for the subsequent privacy label prediction by the deep learning model.
S4.3. Input the preprocessed text into the system's built-in large language model for label prediction.
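The preprocessing pipeline of step S4.2 can be sketched as follows. This is a simplified illustration: the stop-word list is a tiny stand-in, stemming and lemmatization are omitted, and the function names and the reserved ids for padding and unknown tokens are assumptions rather than details given in the embodiment.

```python
import html
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in",
              "we", "your", "our", "is", "are"}
PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown tokens


def tokenize(text):
    """Clean the text and split it into lowercase tokens without stop words."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # remove special characters and digits
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Assign ids by descending frequency, ties broken alphabetically."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(t, UNK) for t in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

Padding and truncating to `max_len` gives every policy the same dimension, which is the property step S4.2 asks for before the text is passed to the model.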
S5、在模拟器中安装该应用程序,对应用程序进行自动化测试,捕获其中产生的数据流量,对流量的明文数据进行分割,生成隐私标签。S5. Install the application in the simulator, perform automated testing on the application, capture the data traffic generated therein, segment the plaintext data of the traffic, and generate a privacy label.
S5.1、针对APK文件,通过调用adb install path/to/your-app.apk命令将其安装至虚拟机中,然后调用adb forward tcp:8080tcp:8080将Android设备上的mitmproxy的8080端口与计算机的8080端口相连,其目的是将预定义好的mitmproxy端口8080与设备进行关联。最后调用adb shell monkey-pyour.package.name-v{seconds}对应用程序进行测试,捕获其中产生的网络流量。S5.1. Install the APK file into the virtual machine by calling the command adb install path/to/your-app.apk, and then call adb forward tcp:8080tcp:8080 to connect the mitmproxy port 8080 on the Android device to the computer's port 8080. The purpose is to associate the predefined mitmproxy port 8080 with the device. Finally, call adb shell monkey-pyour.package.name-v{seconds} to test the application and capture the network traffic generated.
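The three adb calls in S5.1 could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, adb is assumed to be on the PATH with an emulator attached, and mitmproxy is assumed to be started separately. Note that the final positional argument of the monkey tool is a count of injected events, so the sketch names that parameter event_count.

```python
import subprocess


def build_adb_commands(apk_path, package_name, event_count, port=8080):
    """Return the adb command lines for install, port forwarding, and monkey."""
    return [
        # Install the APK on the attached emulator or device.
        ["adb", "install", apk_path],
        # Forward the local TCP port to the same port on the device.
        ["adb", "forward", f"tcp:{port}", f"tcp:{port}"],
        # Drive the app with pseudo-random UI events, with verbose output.
        ["adb", "shell", "monkey", "-p", package_name, "-v", str(event_count)],
    ]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them keeps the commands easy to inspect or log before anything touches the device, for example run_commands(build_adb_commands("path/to/your-app.apk", "your.package.name", 500)).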
S5.2. In this embodiment, data is extracted from the pcap file exported by mitmproxy and then segmented, and the privacy label corresponding to each data field is output.
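The segmentation step in S5.2 might look like the sketch below, which splits a captured request into named fields and assigns a privacy label to each recognized field. It assumes the plaintext request has already been extracted from the capture and that parameters are URL-encoded; the FIELD_LABELS mapping, the field names, and the function name are hypothetical illustrations, not the embodiment's rules.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical mapping from field-name words to privacy labels.
FIELD_LABELS = {
    "phone": "phone number",
    "tel": "phone number",
    "contact": "contacts",
    "lat": "location",
    "lon": "location",
    "imei": "device ID",
}


def label_fields(url, body=""):
    """Split a request into fields and return {field name: privacy label}."""
    # Query-string parameters plus form-encoded body parameters.
    pairs = parse_qsl(urlsplit(url).query) + parse_qsl(body)
    labels = {}
    for name, _value in pairs:
        # Break names such as userPhone or user_phone into lowercase words.
        words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]
        for word in words:
            if word in FIELD_LABELS:
                labels[name] = FIELD_LABELS[word]
                break
    return labels
```

Matching on whole words rather than substrings avoids false hits such as "lat" inside "platform"; JSON or binary payloads would need their own parsers.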
S6. Perform rule matching among the predefined privacy labels from step S3, the prediction results from step S4, and the runtime privacy labels from step S5 to determine the consistency of the privacy policy, and generate a privacy label security report.
Specifically, the privacy labels stored in step S3, the labels predicted in step S4.3, and the labels extracted in step S5.2 are compared, and the comparison results are written into a predefined template to obtain the privacy label security report.
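Writing the three-way comparison into a predefined template could be sketched as follows. The template wording, the source names (declared, predicted, runtime), and the function name build_report are assumptions made for illustration; the source does not specify the report layout.

```python
from string import Template

# Hypothetical report template; the real layout is not given in the source.
REPORT_TEMPLATE = Template(
    "Privacy label report for $package\n"
    "Consistent labels: $consistent\n"
    "Inconsistent labels:\n$inconsistent"
)


def build_report(package, declared, predicted, runtime):
    """Compare three label sets and render the result into the template."""
    sources = {
        "declared": set(declared),    # labels stored in step S3
        "predicted": set(predicted),  # labels predicted in step S4.3
        "runtime": set(runtime),      # labels extracted in step S5.2
    }
    consistent, lines = [], []
    for label in sorted(set().union(*sources.values())):
        present = [name for name, labels in sources.items() if label in labels]
        if len(present) == len(sources):
            consistent.append(label)  # found in all three sources
        else:
            lines.append(f"  - {label}: only in {', '.join(present)}")
    return REPORT_TEMPLATE.substitute(
        package=package,
        consistent=", ".join(consistent) or "none",
        inconsistent="\n".join(lines) or "  none",
    )
```

Listing which sources contain each inconsistent label lets an auditor tell a label that is declared but never observed apart from data that is collected at runtime but never declared.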
Optionally, the labels in step S3 are predefined, for example [contacts, calendar, photos]; the labels predicted in step S4.3 take a similar form, for example [calendar, photos, phone number, contacts, voice]. The difference is obtained by taking the complement of the predefined labels within the predicted labels. For example, if the labels in step S3 are [contacts, calendar, photos] and the labels in step S4.3 are [calendar, photos, phone number, contacts, voice], the intersection is [contacts, calendar, photos] and the complement is [phone number, voice].
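The intersection and complement in this example map directly onto set operations. The sketch below reproduces the worked example with Python sets; the function name and the result keys are illustrative, and the additional "unobserved" direction (declared but not predicted) is an assumption added for completeness.

```python
def compare_labels(declared, predicted):
    """Compare declared labels (step S3) with predicted labels (step S4.3)."""
    declared, predicted = set(declared), set(predicted)
    return {
        # Labels present in both lists (the intersection).
        "consistent": sorted(declared & predicted),
        # Predicted but not declared (the complement in the example).
        "undeclared": sorted(predicted - declared),
        # Declared but not predicted (extra direction, not in the source example).
        "unobserved": sorted(declared - predicted),
    }


result = compare_labels(
    ["contacts", "calendar", "photos"],
    ["calendar", "photos", "phone number", "contacts", "voice"],
)
```

Here result["consistent"] holds calendar, contacts, and photos, and result["undeclared"] holds phone number and voice, matching the example in the text.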
With the method provided in this embodiment, the user only needs to provide the package name of an application to extract its privacy policy and privacy labels and to analyze whether the privacy labels are correctly declared. This helps mobile security auditors judge privacy policy compliance and helps the application's customers understand more clearly what information the application needs to collect.
Embodiment 2:
This embodiment provides a compliance detection system for privacy policy labels of mobile applications, including:
a data collection module, configured to: obtain privacy policy links;
a preprocessing module, configured to: screen the obtained privacy policy links and exclude invalid links; and perform tree traversal on the privacy policy links that remain after invalid links are excluded, deleting invalid characters;
a database establishment module, configured to: extract the nouns from the text of the screened privacy policy links and build a database;
a label prediction module, configured to: predict labels based on the database and a pre-trained large language model;
a detection module, configured to: judge the consistency of the predicted labels, the privacy labels predefined at the service provider, and the privacy labels observed at actual runtime, to obtain a detection result.
The working method of the system is the same as the compliance detection method for privacy policy labels of mobile applications in Embodiment 1 and is not repeated here.
Embodiment 3:
This embodiment provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the steps of the compliance detection method for privacy policy labels of mobile applications described in Embodiment 1.
Embodiment 4:
This embodiment provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of the compliance detection method for privacy policy labels of mobile applications described in Embodiment 1.
Embodiment 5:
This embodiment provides a computer program product that includes a computer program. When the computer program is executed by a processor, it implements the steps of the compliance detection method for privacy policy labels of mobile applications described in Embodiment 1.
The above describes only preferred embodiments of this embodiment and is not intended to limit it. For those skilled in the art, various modifications and variations of this embodiment are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of this embodiment shall fall within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410463213.1A CN118332300A (en) | 2024-04-17 | 2024-04-17 | Compliance detection method and system for privacy policy labels of mobile terminal application programs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410463213.1A CN118332300A (en) | 2024-04-17 | 2024-04-17 | Compliance detection method and system for privacy policy labels of mobile terminal application programs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118332300A (en) | 2024-07-12 |
Family
ID=91773938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
Application Number | Priority Date | Filing Date | Title |
CN202410463213.1A Pending CN118332300A (en) | 2024-04-17 | 2024-04-17 | Compliance detection method and system for privacy policy labels of mobile terminal application programs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118332300A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118606937A (en) * | 2024-08-08 | 2024-09-06 | 天津商业大学 | APP sensitive feature detection method and system based on large-scale language model |
2024-04-17: CN202410463213.1A (patent/CN118332300A/en), status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Niu et al. | A deep learning based static taint analysis approach for IoT software vulnerability location | |
CN113141360B (en) | Method and device for detecting network malicious attack | |
CN114138244A (en) | Method and device for automatically generating model files, storage medium and electronic equipment | |
CN118332300A (en) | Compliance detection method and system for privacy policy labels of mobile terminal application programs | |
WO2024131496A1 (en) | Vulnerability data analysis method and apparatus, electronic device and storage medium | |
CN113704420A (en) | Method and device for identifying role in text, electronic equipment and storage medium | |
CN114528457A (en) | Web fingerprint detection method and related equipment | |
CN112464237A (en) | Static code safety diagnosis method and device | |
CN114386048A (en) | Ranking-based method for locating open source software security vulnerability patches | |
CN117056966A (en) | System for analyzing consistency of applet privacy policy and authority call | |
CN118606937A (en) | APP sensitive feature detection method and system based on large-scale language model | |
CN118170685B (en) | An automated testing platform and method for an adaptive operating system environment | |
CN118013963B (en) | Method and device for identifying and replacing sensitive words | |
CN117874760A (en) | Android malicious software detection method and system based on interpretable graph learning | |
CN111859862A (en) | Text data labeling method and device, storage medium and electronic device | |
CN117728995A (en) | XSS attack detection method and device, computer equipment and storage medium | |
CN116881971A (en) | Sensitive information leakage detection method, device and storage medium | |
CN113434404B (en) | Automatic service verification method and device for verifying reliability of disaster recovery system | |
CN114780403A (en) | Software defect prediction method and prediction device based on enhanced code attribute map | |
CN111581533B (en) | Method and device for identifying state of target object, electronic equipment and storage medium | |
CN115292571A (en) | App data acquisition method and system | |
CN114528218A (en) | Test program generation method, test program generation device, storage medium, and electronic device | |
US11907110B2 (en) | Methods and systems for automated software testing | |
CN112364649A (en) | Named entity identification method and device, computer equipment and storage medium | |
CN114595454B (en) | Malicious JS script detection method based on hybrid analysis and feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||