CN113919300A - Data labeling method and device and readable storage medium - Google Patents

Info

Publication number
CN113919300A
Authority
CN
China
Prior art keywords
data
file
information
labeling
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111028299.8A
Other languages
Chinese (zh)
Inventor
裴芝林
邱墨桐
何鑫
金基勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yonyou Network Technology Co Ltd
Original Assignee
Yonyou Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yonyou Network Technology Co Ltd
Priority to CN202111028299.8A
Publication of CN113919300A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a data annotation method, a data annotation device and a readable storage medium. The data annotation method is used on a client side and comprises the following steps: sending a request for creating a data set to a data service; uploading a file to an object storage service; and, after the file and annotation information have been uploaded to the object storage service, sending a data verification request to the data service. In the technical solution of the invention, uploading a file involves no code and can use a graphical, guided form, which simplifies operation. The user needs no additional knowledge, so the barrier to entry is low, and interaction is guided along a fixed flow. This is simpler and friendlier for users in traditional industries that started late and have a weak technical foundation, and it improves the user experience.

Description

Data labeling method and device and readable storage medium
Technical Field
The invention relates to the technical field of data annotation, in particular to a data annotation method, a data annotation device and a readable storage medium.
Background
Data annotation is the process of processing raw data, including voice, pictures, text and video, and converting it into machine-recognizable information. The raw data is generally obtained through data acquisition; data annotation then processes that data and passes it to artificial intelligence algorithms and models for use.
Current annotation of artificial intelligence data sets has the following problems:
(1) Annotation formats are not uniform, so format conversion is needed before use, which requires a certain amount of code and manual work; reusing annotation strategy frameworks and label templates is cumbersome.
(2) Data annotation, training and model verification are separate processes and cannot be automated.
(3) With large data sets, data annotation and model training perform poorly.
(4) How to store unstructured data, and the associated performance issues.
Disclosure of Invention
The present invention is directed to solving or improving at least one of the above technical problems.
Therefore, a first object of the present invention is to provide a data annotation method.
A second object of the present invention is to provide a data annotation method.
A third object of the present invention is to provide a data annotation method.
A fourth object of the present invention is to provide a data annotation method.
A fifth object of the present invention is to provide a data annotation device.
A sixth object of the present invention is to provide a readable storage medium.
To achieve the first object of the present invention, the technical solution of the present invention provides a data annotation method, used on a client side, including: sending a request for creating a data set to a data service; uploading a file to an object storage service; and, after the file and annotation information have been uploaded to the object storage service, sending a data verification request to the data service.
In this embodiment, uploading a file involves no code and can use a graphical, guided form, which simplifies operation. The user needs no additional knowledge, so the barrier to entry is low, and interaction is guided along a fixed flow. This is simpler and friendlier for users in traditional industries that started late and have a weak technical foundation, and it improves the user experience.
In addition, the technical solution provided by the invention may also have the following additional technical features:
In the above technical solution, uploading a file to an object storage service specifically includes: selecting an upload folder, requesting upload pre-signed links from the data service in batches, and then uploading the file to the object storage service.
In this embodiment, the upload folder is selected first, upload pre-signed links are then requested from the data service in batches, and the file is then uploaded to the object storage service. Pre-signed links allow identity to be verified more reliably, ensure data security, and meet data security requirements in multi-tenant scenarios.
In any of the above technical solutions, the data annotation method further includes: requesting data set annotation information and file download pre-signed links from the data service in batches; receiving the download links returned by the data service; downloading files in parallel according to the download links; and training a model in batches based on the files and the annotation information.
In this embodiment, the model can be trained in batches directly once the files and annotation information are obtained. Data annotation and model training are seamlessly integrated, so that data annotation, training and model verification form one integrated process.
To achieve the second object of the present invention, the technical solution of the present invention provides a data annotation method, used on a data service side, including: receiving a request for creating a data set, and storing metadata information in a database according to the name of the data set; receiving a data verification request sent by a client, and performing data set verification on files stored in an object storage service; and receiving an unlabeled file list request sent by the annotation platform, searching for unlabeled files according to the metadata information to obtain an unlabeled set file, and returning the unlabeled set file to the annotation platform.
In this embodiment, the data service performs data set verification on the files stored in the object storage service and can convert third-party-format annotation information into self-format annotation information. This is simple to operate and automates the process to a certain extent.
In addition, the technical solution provided by the invention may also have the following additional technical features:
In the above technical solution, performing data set verification on the files stored in the object storage service specifically includes: verifying whether the file exists; when the file and the annotation information exist, checking whether the file includes exchangeable image file format (EXIF) information; when the file includes EXIF information, deleting the EXIF information from the file; and establishing a mapping relation between third-party annotation information and the annotated file, and converting the third-party annotation information into self-format annotation information based on the mapping relation.
This embodiment performs conversion and annotation automatically by establishing the mapping relation, which is friendlier for users in traditional industries that started late and have a weak technical foundation.
In any of the above technical solutions, converting the third-party annotation information into self-format annotation information specifically includes: starting a background timed task framework, which converts the third-party-format annotation information into self-format annotation information.
In this embodiment, the timed task framework uses Quartz. A timed task framework allows tasks to be managed effectively, including the start time of a task, the uniqueness of a task, and the strategy applied after a task fails. It also allows a large data set to be split into tasks that are verified in parallel.
In any of the above technical solutions, the data annotation method further includes: receiving a request for data set annotation information and file download pre-signed links, obtaining the list of files and annotation files under the path corresponding to the data set according to the metadata information, generating download links, and returning the download links to the client.
In this embodiment, pre-signed links allow identity to be verified more reliably, ensure data security, and meet data security requirements in multi-tenant scenarios.
To achieve the third object of the present invention, the technical solution of the present invention provides a data annotation method, used on an object storage service side, including: receiving and storing metadata information; receiving and storing a file; and receiving and storing annotation information.
In this embodiment, the object storage service may adopt MinIO, a currently mainstream open-source framework (an open-source distributed file storage system).
In addition, the technical solution provided by the invention may also have the following additional technical features:
In the above technical solution, receiving and storing the annotation information specifically includes: verifying the files corresponding to the annotation information through a batch query of the database, and then storing the annotation information.
In this embodiment, the files corresponding to the annotation information are verified first, and the annotation information is stored only after the verification passes.
To achieve the fourth object of the present invention, the technical solution of the present invention provides a data annotation method, used on an annotation platform side, including: requesting an unlabeled file list from a data service; receiving and verifying the unlabeled set file returned by the data service, parsing the unlabeled set file, adapting the annotation format, and annotating the unlabeled set file; and sending the annotation information to an object storage service.
In this embodiment, the annotation platform is highly compatible and supports the current mainstream annotation formats.
To achieve the fifth object of the present invention, the technical solution of the present invention provides a data annotation device, including a memory and a processor, wherein the memory stores programs or instructions and the processor executes the programs or instructions; when executing the programs or instructions, the processor implements the steps of the data annotation method according to any technical solution of the invention.
The data annotation device provided in this technical solution implements the steps of the data annotation method according to any technical solution of the present invention, and therefore has all the beneficial effects of that method, which are not repeated here.
To achieve the sixth object of the present invention, the technical solution of the present invention provides a readable storage medium storing a program or instructions which, when executed, implement the steps of the data annotation method of any of the above technical solutions.
The readable storage medium provided in this technical solution implements the steps of the data annotation method according to any technical solution of the present invention, and therefore has all the beneficial effects of that method, which are not repeated here.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a first flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 4 is a fourth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 5 is a fifth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 6 is a sixth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 7 is a seventh flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 8 is an eighth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 9 is a ninth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 10 is a tenth flowchart of a data annotation method according to an embodiment of the present invention;
FIG. 11 is a schematic block diagram of a data annotation device according to an embodiment of the present invention;
FIG. 12 is a first schematic diagram of data annotation according to an embodiment of the present invention;
FIG. 13 is a second schematic diagram of data annotation according to an embodiment of the present invention.
The correspondence between the reference numerals and the part names in FIG. 11 to FIG. 13 is:
100: browser workshops, 102: data service, 104: object storage service, 106: MinIO, 108: OSS, 110: MySQL, 112: annotation platform, 114: Python SDK, 200: data annotation device, 210: memory, 220: processor.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
A method, an apparatus, and a readable storage medium for annotating data according to some embodiments of the present invention are described below with reference to fig. 1 to 13.
Example 1:
As shown in FIG. 1, this embodiment provides a data annotation method, used on a client side, including the following steps:
Step S102: sending a request for creating a data set to a data service;
Step S104: uploading a file to an object storage service;
Step S106: after the file and annotation information have been uploaded to the object storage service, sending a data verification request to the data service.
In this embodiment, the client sends a request for creating a data set to the data service. The data service receives the request and stores metadata information to the object storage service according to the name of the data set. The client uploads the file to the object storage service and, once the file and annotation information have been uploaded, sends a data verification request to the data service. The data service receives the data verification request and performs data set verification on the files stored in the object storage service.
In this embodiment, files may be any unstructured data, including but not limited to pictures, video, audio and text, so that massive amounts of unstructured data can be stored and managed.
In this embodiment, uploading a file involves no code and can use a graphical, guided form, which simplifies operation. The user needs no additional knowledge, so the barrier to entry is low, and interaction is guided along a fixed flow. This is simpler and friendlier for users in traditional industries that started late and have a weak technical foundation, and it improves the user experience.
In this embodiment, files can be processed in parallel, which makes data annotation more efficient.
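The client-side flow of steps S102 to S106 can be sketched as follows. This is only an illustration under stated assumptions: `FakeDataService`, `FakeObjectStorage` and `import_dataset` are hypothetical in-memory stand-ins, not interfaces described in the patent, and a real client would call remote services instead.

```python
from dataclasses import dataclass, field


@dataclass
class FakeObjectStorage:
    """Hypothetical in-memory stand-in for the object storage service."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


@dataclass
class FakeDataService:
    """Hypothetical in-memory stand-in for the data service."""
    datasets: dict = field(default_factory=dict)
    verified: set = field(default_factory=set)

    def create_dataset(self, name: str) -> None:
        # S102: record metadata keyed by the data set name.
        self.datasets[name] = {"name": name}

    def verify(self, name: str, storage: FakeObjectStorage) -> bool:
        # S106: the data set must be known and have at least one stored object.
        ok = name in self.datasets and any(
            key.startswith(name + "/") for key in storage.objects
        )
        if ok:
            self.verified.add(name)
        return ok


def import_dataset(name, files, data_service, storage):
    """Run the three client steps in the fixed order S102, S104, S106."""
    data_service.create_dataset(name)            # S102
    for filename, data in files.items():         # S104
        storage.put(f"{name}/{filename}", data)
    return data_service.verify(name, storage)    # S106


service = FakeDataService()
storage = FakeObjectStorage()
result = import_dataset(
    "cats", {"a.jpg": b"\xff\xd8", "a.json": b"{}"}, service, storage
)
print(result)
```

The verification request is sent only after every upload has finished, which mirrors the ordering constraint stated in step S106.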
Example 2:
As shown in FIG. 2, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
Uploading a file to an object storage service specifically includes the following step:
Step S202: selecting an upload folder, requesting upload pre-signed links from the data service in batches, and uploading the file to the object storage service.
In this embodiment, the upload folder is selected first, upload pre-signed links are then requested from the data service in batches, and the file is then uploaded to the object storage service. Pre-signed links allow identity to be verified more reliably, ensure data security, and meet data security requirements in multi-tenant scenarios.
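The idea behind a pre-signed link can be illustrated with a minimal HMAC-based sketch. This is not the signing scheme of MinIO, S3 or OSS; the secret key, the base URL, the query parameter names and the helper functions below are all assumptions made for illustration, and object keys are assumed to contain only URL-safe characters.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical key, known only to the signing service
BASE_URL = "https://storage.example.invalid"  # hypothetical endpoint


def _sign(method: str, object_key: str, expires_at: int) -> str:
    # The signature binds the HTTP method, the object key and the expiry time.
    message = f"{method}\n{object_key}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(method: str, object_key: str, expires_at: int) -> str:
    query = urlencode(
        {"expires": expires_at, "signature": _sign(method, object_key, expires_at)}
    )
    return f"{BASE_URL}/{object_key}?{query}"


def presign_batch(object_keys, expires_at):
    # One request yields upload links for every file in the selected folder.
    return {key: presign("PUT", key, expires_at) for key in object_keys}


def is_valid(method: str, url: str, now: int) -> bool:
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    expires_at = int(params["expires"][0])
    if now > expires_at:
        return False
    object_key = parsed.path.lstrip("/")
    expected = _sign(method, object_key, expires_at)
    return hmac.compare_digest(expected, params["signature"][0])


links = presign_batch(["tenant-a/ds/a.jpg", "tenant-a/ds/b.jpg"], expires_at=2_000)
```

Because the object key is part of the signed message, a link issued for one tenant's path cannot be reused for another path, which is the property that matters in the multi-tenant scenario.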
Example 3:
As shown in FIG. 3, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
The data annotation method further includes the following steps:
Step S302: requesting data set annotation information and file download pre-signed links from the data service in batches;
Step S304: receiving the download links returned by the data service;
Step S306: downloading files in parallel according to the download links;
Step S308: training a model in batches based on the files and the annotation information.
In this embodiment, after the data has been annotated, the data set annotation information and file download pre-signed links can be requested from the data service in batches. After the download links returned by the data service are received, the files are downloaded, and the model is then trained in batches. Pre-signed links allow identity to be verified more reliably, ensure data security, and meet data security requirements in multi-tenant scenarios. Once the files and annotation information are obtained, the model can be trained in batches directly, so data annotation and model training are seamlessly integrated and data annotation, training and model verification form one integrated, automated process. With large data sets, data annotation and model training meet the performance requirements of massive data in annotation and training scenarios.
This embodiment can be applied to model training within a short time, creating real commercial value from AI.
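Steps S306 and S308 can be sketched with a thread pool and a simple batching helper. The `fetch` function is a hypothetical placeholder for an HTTP GET against a pre-signed download link, and the link names and labels are made-up sample data; no actual training is performed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(link: str) -> bytes:
    # Placeholder for downloading one file through its pre-signed link.
    return f"content-of:{link}".encode()


def download_all(links, max_workers=4):
    # S306: download in parallel; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, links))


def batches(items, batch_size):
    # S308: yield consecutive slices that a training loop would consume.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


links = [f"link-{i}" for i in range(5)]
labels = ["cat", "dog", "cat", "dog", "cat"]

files = download_all(links)
samples = list(zip(files, labels))
batch_sizes = [len(batch) for batch in batches(samples, 2)]
print(batch_sizes)  # [2, 2, 1]
```

Keeping the download results in input order lets each file be paired with its annotation by position before batching.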
Example 4:
As shown in FIG. 4, this embodiment provides a data annotation method, used on a data service side, including the following steps:
Step S402: receiving a request for creating a data set, and storing metadata information in a database according to the name of the data set;
Step S404: receiving a data verification request sent by a client, and performing data set verification on files stored in the object storage service;
Step S406: receiving an unlabeled file list request sent by the annotation platform, searching for unlabeled files according to the metadata information to obtain an unlabeled set file, and returning the unlabeled set file to the annotation platform.
In this embodiment, the client sends a request for creating a data set to the data service; the data service receives the request and stores metadata information in a database of the object storage service according to the name of the data set. The client then sends a data verification request to the data service; the data service receives it and performs data set verification on the files stored in the database of the object storage service. The annotation platform requests an unlabeled file list from the data service; the data service receives that request, searches for unlabeled files according to the metadata information, obtains the unlabeled set file, and returns it to the annotation platform.
In this embodiment, the data service performs data set verification on the files stored in the database of the object storage service and can convert third-party-format annotation information into self-format annotation information. This is simple to operate and automates the process to a certain extent.
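The unlabeled-file lookup of step S406 can be sketched as a filter over stored metadata. The metadata layout used here (a `files` mapping with a `labeled` flag per path) is an assumption made for illustration; the patent does not specify the schema.

```python
def list_unlabeled(metadata: dict, dataset: str) -> list:
    """Return the paths in `dataset` whose metadata marks them as unlabeled."""
    files = metadata.get(dataset, {}).get("files", {})
    return sorted(path for path, info in files.items() if not info["labeled"])


# Hypothetical metadata as the data service might keep it per data set name.
metadata = {
    "cats": {
        "files": {
            "cats/a.jpg": {"labeled": True},
            "cats/b.jpg": {"labeled": False},
            "cats/c.jpg": {"labeled": False},
        }
    }
}

unlabeled = list_unlabeled(metadata, "cats")
print(unlabeled)  # ['cats/b.jpg', 'cats/c.jpg']
```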
Example 5:
As shown in FIG. 5, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
Performing data set verification on the files stored in the object storage service specifically includes the following steps:
Step S502: verifying whether the file exists;
Step S504: when the file and the annotation information exist, checking whether the file includes exchangeable image file format (EXIF) information;
Step S506: when the file includes EXIF information, deleting the EXIF information from the file;
Step S508: establishing a mapping relation between third-party annotation information and the annotated file, and converting the third-party annotation information into self-format annotation information based on the mapping relation;
Step S510: counting the number of labels in the data set;
Step S512: adding thumbnails of pictures or videos based on the files.
In this embodiment, data set verification is performed on the files stored in the database of the object storage service. This includes verifying whether each file exists and checking whether it includes EXIF information. If a file already has annotation information, that information is third-party annotation information, which may not conform to the self-format. A mapping relation is therefore established between the third-party annotation information and the annotated file, and the third-party annotation information is converted into self-format annotation information based on that mapping relation. Part of the data can thus be annotated automatically. In enterprise operation scenarios such as marketing, purchasing, work, HR management, supply chain, finance and manufacturing, conversion and annotation are performed automatically by establishing the mapping relation, which is friendlier for users in traditional industries that started late and have a weak technical foundation.
In this embodiment, after the mapping relation has been established and the third-party annotation information has been converted into self-format annotation information, the number of labels in the data set can be counted. The counts show whether the labels in the data set are balanced, which helps avoid unbalanced training data.
In this embodiment, thumbnails of pictures, videos and the like can also be generated and added; they are used to display file details on the data set detail page.
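Steps S508 and S510 can be sketched as follows. The third-party record layout (`image`, `category`), the self-format layout (`file`, `label`) and the matching-by-file-name rule are all assumptions for illustration, and the sketch assumes at most one annotation per file.

```python
from collections import Counter
from pathlib import PurePosixPath


def build_mapping(files, third_party):
    """Map each stored file path to the third-party record naming that file."""
    by_name = {PurePosixPath(path).name: path for path in files}
    return {
        by_name[record["image"]]: record
        for record in third_party
        if record["image"] in by_name  # records for missing files are skipped
    }


def to_self_format(mapping):
    """Convert mapped third-party records into the (assumed) self-format."""
    return [
        {"file": path, "label": record["category"]}
        for path, record in mapping.items()
    ]


def label_counts(annotations):
    """Count labels so that class balance can be inspected before training."""
    return Counter(item["label"] for item in annotations)


files = ["ds/a.jpg", "ds/b.jpg", "ds/c.jpg"]
third_party = [
    {"image": "a.jpg", "category": "cat"},
    {"image": "b.jpg", "category": "dog"},
    {"image": "c.jpg", "category": "cat"},
    {"image": "missing.jpg", "category": "dog"},
]

converted = to_self_format(build_mapping(files, third_party))
counts = label_counts(converted)
print(counts["cat"], counts["dog"])  # 2 1
```

A large gap between the most and least frequent label in `counts` is the kind of imbalance that the label statistics are meant to reveal.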
Example 6:
As shown in FIG. 6, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
Converting the third-party annotation information into self-format annotation information specifically includes the following step:
Step S602: starting a background timed task framework, which converts the third-party-format annotation information into self-format annotation information.
In this embodiment, the timed task framework uses Quartz. A timed task framework allows tasks to be managed effectively, including the start time of a task, the uniqueness of a task, and the strategy applied after a task fails. It also allows a large data set to be split into tasks that are verified in parallel.
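The task splitting and parallel verification mentioned above can be illustrated without Quartz, which is a Java scheduler. The sketch below swaps in a Python thread pool purely to show the split-then-verify pattern; the chunk size, the `existing` set and the function names are hypothetical, and scheduling concerns such as start time, uniqueness and failure strategy are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, chunk_size):
    """Split a large list of object keys into fixed-size sub-tasks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def verify_chunk(chunk, existing):
    """Return the keys in this chunk that are missing from storage."""
    return [key for key in chunk if key not in existing]


def verify_dataset(keys, existing, chunk_size=2, max_workers=4):
    chunks = split(keys, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is verified as an independent sub-task.
        results = list(pool.map(lambda chunk: verify_chunk(chunk, existing), chunks))
    return [key for missing in results for key in missing]


keys = ["ds/1.jpg", "ds/2.jpg", "ds/3.jpg", "ds/4.jpg", "ds/5.jpg"]
existing = {"ds/1.jpg", "ds/2.jpg", "ds/4.jpg"}

missing = verify_dataset(keys, existing)
print(missing)  # ['ds/3.jpg', 'ds/5.jpg']
```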
Example 7:
As shown in FIG. 7, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
The data annotation method further includes the following step:
Step S702: receiving a request for data set annotation information and file download pre-signed links, obtaining the list of files and annotation files under the path corresponding to the data set according to the metadata information, generating download links, and returning the download links to the client.
In this embodiment, after receiving the request for data set annotation information and file download pre-signed links, the data service obtains the list of files and annotation files under the path corresponding to the data set according to the metadata information, generates download links, and returns them to the client, which makes downloading convenient for the client.
Example 8:
As shown in FIG. 8, this embodiment provides a data annotation method, used on an object storage service side, including the following steps:
Step S802: receiving and storing metadata information;
Step S804: receiving and storing a file;
Step S806: receiving and storing annotation information.
In this embodiment, the object storage service may adopt MinIO, a currently mainstream open-source framework (an open-source distributed file storage system). The framework supports the S3 (Simple Storage Service) protocol and can be seamlessly adapted to S3/OSS (Object Storage Service, a cloud storage service that is massive, secure, low-cost and highly durable). MinIO is an object storage server billed as the fastest in the world; it is well suited to large-scale private cloud environments with strict security requirements and can ensure high availability under each workload. MinIO is a free open-source service whereas OSS is a paid service, so using MinIO can effectively reduce cost.
In this embodiment, the object storage service may further include MySQL (a relational database management system) for data storage.
Example 9:
As shown in FIG. 9, this embodiment provides a data annotation method which, in addition to the technical features of the above embodiment, further includes the following technical features:
Receiving and storing the annotation information specifically includes the following step:
Step S902: verifying the files corresponding to the annotation information through a batch query of the database, and storing the annotation information.
In this embodiment, the files corresponding to the annotation information are verified first, and the annotation information is stored only after the verification passes.
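Step S902 can be sketched with a single `IN` query that checks many file paths at once. The sketch uses Python's built-in `sqlite3` with an in-memory database instead of MySQL, and the `files` table, its `path` column and the annotation layout are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO files (path) VALUES (?)", [("ds/a.jpg",), ("ds/b.jpg",)]
)


def existing_paths(conn, paths):
    """Look up all candidate paths with one batch query instead of one per file."""
    if not paths:
        return set()
    placeholders = ",".join("?" for _ in paths)
    rows = conn.execute(
        f"SELECT path FROM files WHERE path IN ({placeholders})", list(paths)
    )
    return {row[0] for row in rows}


def store_annotations(conn, annotations, store):
    """Store annotations whose file exists; return the ones that were rejected."""
    found = existing_paths(conn, [item["file"] for item in annotations])
    rejected = []
    for item in annotations:
        if item["file"] in found:
            store.append(item)
        else:
            rejected.append(item)
    return rejected


stored = []
rejected = store_annotations(
    conn,
    [{"file": "ds/a.jpg", "label": "cat"}, {"file": "ds/missing.jpg", "label": "dog"}],
    stored,
)
print(len(stored), len(rejected))  # 1 1
```

For very large batches the path list would need to be chunked, since databases limit the number of bound parameters per statement.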
Example 10:
as shown in fig. 10, the present embodiment provides a data annotation method, which is used for annotating a platform side, and includes the following steps:
step S1002, requesting an unmarked file list from a data service;
step S1004, receiving and verifying the unlabeled set file returned by the data service, analyzing the unlabeled set file, adapting to the labeling format, and labeling the unlabeled set file;
step S1006, the annotation information is sent to the object storage service.
In this embodiment, the annotation platform has a feature of high compatibility, and supports the current mainstream annotation format, including but not limited to: voc (visual Object classes), coco (Common Objects in COntext, a dataset available for image recognition from Microsoft corporation), labelme (label me), vott (visual Object Tagging tool), and so on.
According to the embodiment, parallel annotation and task management of the data set can be realized, the performance requirements of mass data in an annotation scene are met, and more efficient data annotation is realized.
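As an illustration of what adapting to a labeling format can involve, the sketch below converts bounding boxes between the VOC convention (xmin, ymin, xmax, ymax) and the COCO convention (x, y, width, height). The internal dictionary layout and the function names are assumptions made for this example; the patent does not specify its own format.

```python
def voc_to_coco_bbox(xmin, ymin, xmax, ymax):
    """VOC stores two corners; COCO stores the top-left corner plus width and height."""
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def coco_to_voc_bbox(x, y, width, height):
    """Inverse conversion: recover the bottom-right corner from width and height."""
    return [x, y, x + width, y + height]


def to_internal(source_format, bbox, label):
    """Normalize a third-party box into a hypothetical internal (self-format) record."""
    fmt = source_format.lower()
    if fmt == "voc":
        x, y, w, h = voc_to_coco_bbox(*bbox)
    elif fmt == "coco":
        x, y, w, h = bbox
    else:
        raise ValueError(f"unsupported annotation format: {source_format}")
    return {"label": label, "x": x, "y": y, "width": w, "height": h}
```

With this normalization, the same object labeled in either format maps to one internal record, so downstream steps need to handle only a single representation.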
Example 11:
as shown in fig. 11, the present embodiment provides a data annotation device 200, which includes: a memory 210 and a processor 220, the memory 210 storing programs or instructions, the processor 220 executing the programs or instructions; wherein the processor 220, when executing the program or instructions, implements the steps of the data annotation method according to any of the embodiments of the present invention.
Example 12:
the present embodiment provides a readable storage medium, which stores a program or instructions, and when the program or instructions are executed by a processor, the steps of the data annotation method of any one of the above embodiments are implemented.
The specific embodiment is as follows:
This embodiment provides a data labeling method that realizes parallel labeling and task management of large-scale data through low-threshold, visual, and wizard-style means, so that the labeled data can be applied to model training within a short time and deliver real commercial AI value.
The present embodiment is intended to solve at least one of the following key problems:
(1) how to store and manage large amounts of unstructured data, including but not limited to pictures, video, voice, text;
(2) how to perform parallel labeling and task management for the data set;
(3) how to meet the performance requirements of mass data in labeling and training scenarios;
(4) how to ensure data security in a multi-tenant scenario.
The data annotation method of this embodiment is implemented through an annotation platform. The annotation platform is a low-threshold platform for storing and annotating unstructured data; it comprises a client, a data service, and an object storage service, and it covers the complete process from data set import to data annotation.
The data labeling method specifically comprises the following steps:
(1) Importing a data set: as shown in fig. 12, a create-dataset request is issued from the browser studio 100 (or client), and the data service 102 saves the metadata information according to the dataset name to MySQL 110 (a relational database management system). The browser studio 100 selects a folder to upload, requests pre-signed upload URLs in batches from the data service 102, and then uploads the files directly to the object storage service 104.
The object storage service uses the currently mainstream open-source framework MinIO 106 (an open-source distributed object storage system), which supports the S3 (Simple Storage Service) protocol and can be seamlessly adapted to S3/OSS 108. MinIO is described here as the world's fastest object storage server; it is well suited to large-scale private cloud environments with strict security requirements and can ensure high availability under every workload.
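The idea behind a pre-signed URL is that the data service signs the object key and an expiry time, so the client can upload directly to storage without holding credentials. The sketch below illustrates this with a simplified HMAC scheme; it is not the actual S3 or MinIO signing algorithm, and the secret key, host name, and function names are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical secret shared by the data service and storage


def _sign(object_key, expires_at):
    """Sign the HTTP method, object key, and expiry time with HMAC-SHA256."""
    payload = f"PUT\n{object_key}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def presign_upload_url(base_url, object_key, expires_in=900, now=None):
    """Build an upload URL valid for expires_in seconds (base_url must have no path)."""
    expires_at = int(now if now is not None else time.time()) + expires_in
    query = urlencode({"expires": expires_at, "signature": _sign(object_key, expires_at)})
    return f"{base_url}/{object_key}?{query}"


def presign_batch(base_url, object_keys, expires_in=900, now=None):
    """Batch variant: one pre-signed URL per file in the selected folder."""
    return {key: presign_upload_url(base_url, key, expires_in, now) for key in object_keys}


def verify_upload_url(url, now=None):
    """Storage-side check: the signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    object_key = parsed.path.lstrip("/")
    expires_at = int(params["expires"][0])
    current = now if now is not None else time.time()
    signature_ok = hmac.compare_digest(_sign(object_key, expires_at), params["signature"][0])
    return signature_ok and current < expires_at
```

In a real deployment the object storage client library would generate these links; the sketch only shows why a tampered or expired link is rejected.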
(2) Data set verification: as shown in fig. 12, after all files are uploaded, the user clicks the validate-dataset button on the client. After receiving the request, the data service 102 starts a background Quartz job (Quartz is an open-source job scheduling project from the OpenSymphony organization), which begins converting the third-party format labeling information into self-format labeling information.
The data set verification covers the following:
1) Verifying whether the file exists.
2) Deleting the picture's EXIF information and checking the file signature to determine whether the file is a picture.
3) Converting the third-party labels into self-format labels and establishing a mapping relation between the third-party labels and the labeled files.
4) Counting the number of tags in the data set.
5) Adding a thumbnail for each file that is a picture or video.
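Two of the checks above, the file signature check and the tag count, can be sketched as follows. The signature table covers only PNG, JPEG, and GIF as examples, and the function names and the annotation record layout are assumptions for illustration.

```python
from collections import Counter

# Leading bytes (file signatures) of a few common picture formats.
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def detect_image_type(header):
    """Return the picture type implied by the first bytes of a file, or None."""
    for kind, prefixes in IMAGE_SIGNATURES.items():
        if header.startswith(prefixes):
            return kind
    return None


def count_labels(annotations):
    """Count how many times each tag occurs in a list of {'label': ...} records."""
    return Counter(record["label"] for record in annotations)
```

Checking the signature rather than the file extension guards against files that were renamed or uploaded with a misleading suffix.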
(3) Labeling process: as shown in fig. 13, the annotation platform 112 (annotation service) requests a list of unlabeled files from the data service 102. The data service 102 searches for the unlabeled files according to the metadata information of the data set and returns them to the annotation platform 112. Externally, parameters such as the label set file and the file set name are verified and received. Internally, the label set file is parsed and adapted to the labeling format. Whether the files corresponding to the labeling information exist is verified through a batch query of the database, and the labeling information is then stored.
(4) Model training process: as shown in fig. 13, the Python SDK 114 (software development kit) requests, in batches, the data set annotation information and the file download URLs (pre-signed links) from the data service 102. The data service 102 lists the files and the annotation files under the path corresponding to the data set according to the metadata information of the data set, generates download links, and returns them to the Python SDK 114. The Python SDK 114 can split up the returned results, download the files in parallel, and perform model training in batches.
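The parallel download and batching described for the SDK could look roughly like the sketch below. It is not the actual Python SDK of this patent: `download_all`, `batches`, and the injected `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP client retrieves a pre-signed link.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(links, fetch, max_workers=8):
    """Download every file in parallel.

    links: dict mapping file name -> download URL.
    fetch: callable taking a URL and returning the file content as bytes.
    """
    names = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so contents line up with names.
        contents = list(pool.map(fetch, (links[name] for name in names)))
    return dict(zip(names, contents))


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items for batch training."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library and makes the parallel logic easy to exercise with a stub.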
In this embodiment, no code is involved: the data set import, data set verification, and data annotation processes are all graphical and wizard-based. Compared with the free layout of canvas-style interaction, wizard-style interaction based on a fixed flow is simpler and friendlier for users in traditional industries that started late and have a weak foundation.
The annotation platform in this embodiment is highly compatible and supports the current mainstream annotation formats, including but not limited to: VOC, COCO, LabelMe, VoTT, etc.
Part of the data labeling in this embodiment is automated, and the embodiment is oriented to enterprise operation scenarios such as marketing, purchasing, work, HR management, supply chain, finance, and manufacturing.
The present embodiment supports all unstructured data, including but not limited to: pictures, video, audio, text, etc.
In this embodiment, data annotation and model training are seamlessly integrated.
In conclusion, the beneficial effects of the embodiment are as follows:
1. In this embodiment, files of all unstructured data types are supported, including but not limited to pictures, video, audio, and text, so massive unstructured data can be stored and managed.
2. In this embodiment, the file uploading process involves no code and can take a graphical, wizard-based form. This simplifies the operation process and lowers the threshold, since the user does not need to master additional knowledge. The wizard-style interaction based on a fixed flow is simpler and friendlier for users in traditional industries that started late and have a weak foundation, and it improves the user experience.
3. In this embodiment, the labeling process may be performed in parallel, thereby achieving more efficient data labeling.
In the present invention, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the term "plurality" means two or more unless expressly limited otherwise. The terms "mounted," "connected," "fixed," and the like are to be construed broadly, and for example, "connected" may be a fixed connection, a removable connection, or an integral connection; "coupled" may be direct or indirect through an intermediary. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "left", "right", "front", "rear", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the referred device or unit must have a specific direction, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for annotating data, which is used on a client side, is characterized by comprising the following steps:
sending a request for creating a data set to a data service;
uploading the file to an object storage service;
and sending a data verification request to the data service after the file has been uploaded to the object storage service.
2. The data annotation method of claim 1, wherein the uploading the file to an object storage service specifically comprises:
selecting a folder to upload, requesting pre-signed upload links in batches from the data service, and then uploading the file to the object storage service.
3. The method for annotating data according to claim 1, further comprising:
requesting data set annotation information and file download pre-signature links to the data service in batches;
receiving a download link returned by the data service;
according to the downloading link, downloading the files in parallel;
based on the files, the model is trained in batches.
4. A data annotation method is used for a data service side, and is characterized by comprising the following steps:
receiving a request for creating a data set, and storing metadata information to a database according to the name of the data set;
receiving a data verification request sent by a client, and performing data set verification on files stored in an object storage service;
receiving an unlabeled file list request sent by an annotation platform, searching for the unlabeled files according to the metadata information to obtain an unlabeled set file, and returning the unlabeled set file to the annotation platform.
5. The data annotation method according to claim 4, wherein the verifying the data set of the file stored in the object storage service specifically includes:
verifying whether the file exists;
checking whether the file includes exchangeable image file format information based on the file presence;
deleting the exchangeable image file format information of the file based on the file including the exchangeable image file format information;
establishing a mapping relation between third-party labeling information and a labeled file, and converting the third-party labeling information into self-format labeling information based on the mapping relation;
counting the number of the data set tags;
and adding a thumbnail of the picture or the video based on the fact that the file is the picture or the video.
6. The method for annotating data according to claim 5, wherein said converting said third party annotation information into self-formatted annotation information specifically comprises:
starting a background job of a scheduled-task framework, and converting the third-party format labeling information into self-format labeling information.
7. The method for annotating data according to claim 4, further comprising:
receiving data set marking information and a file downloading pre-signature link request, obtaining a file and marking file list under a path corresponding to the data set according to the metadata information, generating a downloading link, and returning the downloading link to the client.
8. A data labeling method is used for an object storage service side, and is characterized by comprising the following steps:
receiving and storing metadata information;
receiving and storing a file;
and receiving and storing the marking information.
9. The data annotation method of claim 8, wherein the receiving and storing annotation information specifically comprises:
and verifying the file corresponding to the labeling information through batch database query, and then storing the labeling information.
10. A data labeling method, used on an annotation platform side, characterized by comprising the following steps:
requesting an unlabeled file list from a data service;
receiving and verifying an unlabeled set file returned by the data service, analyzing the unlabeled set file, adapting to a labeling format, and labeling the unlabeled set file;
and sending the labeling information to an object storage service.
11. An apparatus (200) for annotating data, comprising:
a memory (210) storing programs or instructions;
a processor (220) that executes the program or instructions;
wherein the processor (220), when executing the program or instructions, carries out the steps of the method of annotating data according to any one of claims 1 to 3, or the steps of the method of annotating data according to any one of claims 4 to 7, or the steps of the method of annotating data according to claim 8 or 9, or the steps of the method of annotating data according to claim 10.
12. A readable storage medium, characterized in that it has stored thereon a program or instructions which, when executed by a processor, carry out the steps of a method of annotating data according to any one of claims 1 to 3, or of a method of annotating data according to any one of claims 4 to 7, or of a method of annotating data according to claim 8 or 9, or of a method of annotating data according to claim 10.
CN202111028299.8A 2021-09-02 2021-09-02 Data labeling method and device and readable storage medium Pending CN113919300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111028299.8A CN113919300A (en) 2021-09-02 2021-09-02 Data labeling method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111028299.8A CN113919300A (en) 2021-09-02 2021-09-02 Data labeling method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113919300A true CN113919300A (en) 2022-01-11

Family

ID=79233898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111028299.8A Pending CN113919300A (en) 2021-09-02 2021-09-02 Data labeling method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113919300A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492997A (en) * 2018-10-31 2019-03-19 四川长虹电器股份有限公司 A kind of image labeling plateform system based on SpringBoot
CN111666936A (en) * 2019-03-08 2020-09-15 北京市商汤科技开发有限公司 Labeling method, labeling device, labeling system, electronic equipment and storage medium
CN112381114A (en) * 2020-10-20 2021-02-19 广东电网有限责任公司中山供电局 Deep learning image annotation system and method
US11048979B1 (en) * 2018-11-23 2021-06-29 Amazon Technologies, Inc. Active learning loop-based data labeling service
WO2021164161A1 (en) * 2020-02-17 2021-08-26 平安国际智慧城市科技股份有限公司 Image data labeling method and apparatus, and computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination