CN103942122B - Method for identifying AVI-type blocks - Google Patents
- Publication number: CN103942122B
- Application number: CN201410164339.5A
- Authority: CN (China)
- Prior art keywords: block, avi, types, byte, file
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for identifying AVI-type blocks, based on the byte identifiers of the Audio Video Interleaved (AVI) format and on C4.5 decision trees. It identifies blocks belonging to AVI files on storage media such as hard disks and USB flash drives, and serves as a preprocessing step for carving deleted data from such media without relying on file system metadata. File carving generally proceeds in two stages, classification and recovery. The method first identifies the blocks that contain a format-specific byte identifier; for the blocks not yet identified, it trains a decision tree on a training set that simulates the disk's storage environment and then classifies those blocks with the tree. The scheme is suited to complex, large-capacity storage environments that hold many file types. The invention also achieves good recognition accuracy for blocks that originally belonged to AVI files, and is of practical value in fields such as forensic evidence collection and data recovery.
Description
Technical field
The present invention relates to the technical field of computer data mining, and in particular to a method for identifying AVI-type blocks.
Background
With the development of information technology, data recovery has become increasingly important as the last line of defense for information security, and demand for it in forensic, military, and civilian applications keeps growing. Traditional data recovery methods cannot recover fragmented data even when they make use of the remaining metadata. How to recover data that may be damaged and that lacks metadata is therefore a problem in urgent need of a solution. Damaged data is often highly valuable and sometimes contains information critical to a legal case. In the civilian field, video recovery also has a wide range of uses; for example, a wedding company may need to retrieve a client's wedding video that was deleted by accident. Video recovery therefore has considerable economic value for certain businesses. While information technology has let people create astonishing amounts of data, it has also posed the problem of data recovery to researchers.
Early data recovery relied heavily on the metadata provided by the file system. File carving methods, which recover data without depending on metadata, appeared gradually afterwards. File carving recovers data on the basis of a file's internal structure and content. The earliest file carving methods read the data between a file's header and footer marker sequences and are suitable only when files are stored contiguously. Research shows that about 15% to 20% of files larger than a few megabytes are fragmented, which means that a large number of fragmented files exist on disk. For a fragmented file, a carving method that reads contiguously will produce errors. It is therefore necessary to study carving methods suited to fragmented files.
Frameworks for carving fragmented files have been proposed; they consist mainly of two parts, block identification and recovery. However, existing identification methods for AVI (Audio Video Interleaved) generally suffer from a low recognition rate. The present invention proposes a new method for classifying AVI-type blocks.
Summary of the invention
The object of the present invention is to provide a method for identifying AVI-type blocks on storage media such as disks. The method first performs a preliminary identification using the byte identifiers inherent to the AVI format. It then applies the C4.5 decision tree method to the remaining blocks, using the byte frequency distribution (BFD) as the feature, to identify AVI-type blocks that contain no byte identifier. AVI-type blocks are thus identified in two successive rounds.
The technical solution adopted by the invention is as follows. On the basis of an analysis of the characteristics of AVI-type blocks, the invention extracts the byte identifiers and the byte frequency distribution information that a block may contain, and then identifies target blocks by byte identifier matching and by the C4.5 decision tree method. The method mainly comprises the steps of image backup, block extraction, byte identifier matching, and C4.5 decision tree identification.
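The two-round scheme can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `SIGNATURES`, `has_signature`, and `identify_avi_blocks` are hypothetical, the identifier list is only a subset of the one given in the text, and the second-round classifier is passed in as an arbitrary callable standing in for the trained decision tree.

```python
from typing import Callable, Iterable, List

# Assumption: a subset of the identifiers named in the text, spelled as raw bytes.
SIGNATURES = (b"hdrl", b"avih", b"strl", b"strf", b"movi", b"idx1")


def has_signature(block: bytes) -> bool:
    """Round 1: True if any identifier occurs anywhere in the block."""
    return any(sig in block for sig in SIGNATURES)


def identify_avi_blocks(
    blocks: Iterable[bytes],
    classify: Callable[[bytes], bool],
) -> List[int]:
    """Return the indices of blocks judged to be AVI.

    `classify` is a hypothetical stand-in for the decision tree of round 2.
    Because `or` short-circuits, it is only called for blocks that round 1
    did not already identify.
    """
    avi_indices = []
    for index, block in enumerate(blocks):
        if has_signature(block) or classify(block):
            avi_indices.append(index)
    return avi_indices
```

Passing the classifier as a parameter keeps the two rounds independent, so the signature list and the decision tree can be changed separately.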
Method flow:
Step 1: Image backup.
The content of the storage medium is backed up in full to another storage medium with a dedicated backup tool, so that the data source is not damaged during data recovery. The backup covers every sector from the first to the last. The backed-up data includes both the metadata and the actual data.
Step 2: Block extraction.
The storage medium is scanned and, according to the file table, the blocks that the file table does not record are marked. These unrecorded blocks comprise blocks that have never been written and data blocks whose metadata has been lost or damaged. The blocks not recorded in the file table are copied to another storage medium, where they become the candidates for target-block identification.
Step 3: Byte identifier matching.
The byte identifiers exclusive to AVI-type blocks include List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (where ## stands for a number such as 01, 02, 03), rec, and idx1. Each block is searched in turn; when any identifier from this set occurs in a block, the block is classified as an AVI fragment.
Step 4: C4.5 decision tree identification.
After the file types contained in the image have been determined, a training set is built from blocks of these types. Because the proportions of the file types are unknown, the same number of blocks is chosen for each type, and that number must be sufficiently large. The byte frequency distribution (BFD) of each block is then extracted as its feature, and a decision tree is built from the training set with the C4.5 algorithm. Each block in the test set is then classified with the decision tree.
The C4.5 algorithm builds the classification tree through the following steps: (1) compute the entropy of the class variable; (2) take each attribute in turn as the root and compute the resulting information gain; (3) select the attribute with the largest information gain as the root.
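The entropy and information gain computations named in these steps can be sketched in Python as follows. This is an illustration under assumptions, not the patent's implementation: the function names are hypothetical, and the split is a binary threshold on one numeric attribute, which fits BFD features. The text mentions only the gain; the full C4.5 algorithm additionally normalizes it into a gain ratio, which is omitted here.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(
    values: Sequence[float],
    labels: Sequence[str],
    threshold: float,
) -> float:
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    total = len(labels)
    # Weighted entropy of the two partitions after the split.
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder
```

For example, with values `[0.1, 0.2, 0.8, 0.9]` and labels `["Yes", "Yes", "No", "No"]`, a threshold of 0.5 separates the classes perfectly and yields a gain equal to the full class entropy of 1 bit.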
Beneficial effects:
1. The invention identifies AVI-type blocks with a high recognition rate.
2. The invention adapts to complex storage environments and can identify target blocks among blocks of many file formats, such as pictures, videos, and documents.
Brief description of the drawings:
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a flow chart of the C4.5 algorithm.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1 and Fig. 2, the present invention proposes a method for identifying AVI-type blocks, comprising the following steps:
Step 1: Image backup
The objects to be backed up include storage media such as hard disks, USB flash drives, and optical discs. Ghost is a tool for cloning hard disks. Software such as UBackUp and other USB flash drive backup tools can back up USB flash drives. Optical discs can be backed up with disc burning software. The backup here is a full backup: both the deleted and the undeleted data stored on the backup object are copied to another medium.
1) Select another storage medium.
2) Select a backup tool appropriate to the backup object and make a full backup of all of its data.
3) When the backup is complete, set the original storage medium aside for safekeeping. The data backed up on the other storage medium is used for identifying AVI-type blocks.
In Step 1, suitable backup software is selected according to the type of storage medium, and the original storage medium is preserved once the backup is complete. The backup covers every sector from the first to the last. The backed-up data includes both the metadata and the actual data.
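The text relies on existing backup tools for this step. Purely as an illustration of what a first-to-last-sector copy involves, the following Python sketch copies a source byte for byte into an image file and returns a checksum that can later confirm the image is unchanged. The function name `image_device`, the chunk size, and the use of SHA-256 are assumptions, not part of the patent; reading a raw device this way normally requires elevated privileges.

```python
import hashlib


def image_device(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Copy `src_path` to `dst_path` byte for byte.

    Returns the SHA-256 hex digest of the copied data so that the image
    can be verified against the source later.
    """
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of the source reached
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

Working only on the image, as the text requires, leaves the original medium untouched for the rest of the procedure.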
Step 2: Block extraction
1) Scan the image data and analyze the metadata to determine which blocks in the image are allocated and which are unallocated.
2) Data in allocated blocks does not need to be recovered, so the allocated blocks are marked. The unallocated blocks are then read out one by one and each is stored as a file (txt format is used here). Each block stored in txt format is an object to be identified.
In Step 2, the allocated blocks, that is, the blocks that need no recovery, are marked according to the metadata. Each unallocated block is saved as a separate txt file for subsequent identification.
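A minimal Python sketch of this extraction step follows. It assumes that the set of allocated block indices has already been obtained from the file system metadata; how that set is derived depends on the file system and is not shown. The function name `unallocated_blocks` and the default block size of 4096 bytes are hypothetical, and the sketch yields the blocks in memory instead of writing one txt file per block.

```python
from typing import Iterator, Set, Tuple


def unallocated_blocks(
    image: bytes,
    allocated: Set[int],
    block_size: int = 4096,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (block index, block data) for every block not marked allocated."""
    block_count = (len(image) + block_size - 1) // block_size  # round up
    for index in range(block_count):
        if index in allocated:
            continue  # allocated blocks need no recovery
        start = index * block_size
        yield index, image[start:start + block_size]
```

Keeping the block index alongside the data preserves each block's position in the image, which a later recovery stage would need.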
Step 3: Byte identifier matching
AVI is one of the file types that use the RIFF container format. RIFF file types contain various byte identifiers that distinguish data types. Analysis of RIFF-based files shows that, apart from the RIFF identifier itself, these file types share no identifiers. The type of a block can therefore be determined from byte identifiers other than RIFF.
1) Determine the byte identifiers specific to AVI files. Analysis of the file format shows that the following byte identifiers are exclusive to AVI files: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (where ## stands for a number such as 01, 02, 03), rec, and idx1.
2) Match the byte identifiers against each block stored in txt format using the KMP (Knuth-Morris-Pratt) method. As soon as one identifier is found in the txt file, matching stops and the block is classified as an AVI-type block.
3) The identified blocks form a set and are removed from the original set of txt files. The remaining txt files go to the second round, identification with the C4.5 decision tree method.
In Step 3, the byte identifiers specific to AVI files are: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db (where ## stands for a number such as 01, 02, 03), rec, and idx1. These identifiers are matched byte by byte against every block to be identified, using the KMP method on each block stored in txt format. As soon as one matching identifier is found in a txt file, matching stops and the block is classified as an AVI-type block.
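The matching described above can be sketched in Python with a plain KMP search over raw bytes. The sketch makes several assumptions that are not stated in the text: the identifiers are treated as four-byte RIFF FourCC codes, so List, avi, and rec are spelled `LIST`, `AVI ` and `rec ` (with a trailing space); ## is taken to be two ASCII digits; and all function and constant names are hypothetical.

```python
from typing import List

# Assumed FourCC spellings of the identifiers listed in the text.
FIXED_IDS = (b"LIST", b"AVI ", b"hdrl", b"avih", b"strl", b"strf", b"strd",
             b"JUNK", b"odml", b"movi", b"rec ", b"idx1")
STREAM_SUFFIXES = (b"wb", b"dc", b"db")  # preceded by two digits, e.g. 01wb


def failure_table(pattern: bytes) -> List[int]:
    """KMP prefix function: longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: bytes, pattern: bytes) -> int:
    """Index of the first occurrence of `pattern` in `text`, or -1."""
    if not pattern:
        return 0
    table = failure_table(pattern)
    k = 0
    for i, byte in enumerate(text):
        while k > 0 and byte != pattern[k]:
            k = table[k - 1]
        if byte == pattern[k]:
            k += 1
            if k == len(pattern):
                return i - k + 1
    return -1


def has_stream_id(block: bytes) -> bool:
    """True if the block contains two ASCII digits followed by wb, dc, or db."""
    for suffix in STREAM_SUFFIXES:
        start = 0
        while True:
            pos = kmp_find(block[start:], suffix)
            if pos < 0:
                break
            pos += start
            if pos >= 2 and block[pos - 2:pos].isdigit():
                return True
            start = pos + 1  # keep searching after this occurrence
    return False


def is_avi_block(block: bytes) -> bool:
    """Stop at the first matching identifier, as the text describes."""
    return any(kmp_find(block, fid) >= 0 for fid in FIXED_IDS) or has_stream_id(block)
```

Short identifiers can occur by chance in unrelated data, so the digit check on the ## prefix is one way to reduce accidental matches; whether the original method applies such a check is not stated.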
Step 4: C4.5 decision tree identification
After a preliminary survey of the data types on the storage medium, a training set that matches the medium's storage environment is built. The training set contains blocks of every file type present on the storage medium, with the same, sufficiently large number of blocks for each file type. These blocks are then preprocessed as follows:
1) Use Matlab to extract the BFD feature of each input block. The BFD features of all blocks form a matrix of (number of blocks) × 256, which is saved as a csv file. Each row holds the BFD feature of one block, and each column corresponds to one byte value used as a feature.
2) Assign the class attribute of each row according to the file type its block belongs to. If the row is the BFD feature of an AVI fragment, it is labeled Yes; otherwise it is labeled No.
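The text performs this preprocessing in Matlab. The same idea can be sketched in Python, shown here only as an illustration. The function names, the column headers, and the choice to normalize counts into relative frequencies are assumptions; the text does not say whether raw counts or frequencies are stored.

```python
import csv
from typing import Iterable, List, Tuple


def byte_frequency_distribution(block: bytes) -> List[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = [0] * 256
    for value in block:
        counts[value] += 1
    total = len(block) or 1  # avoid division by zero for an empty block
    return [count / total for count in counts]


def write_training_csv(path: str, samples: Iterable[Tuple[bytes, bool]]) -> None:
    """Write one row per block: 256 BFD columns plus a Yes/No class label."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([f"b{value}" for value in range(256)] + ["is_avi"])
        for block, is_avi in samples:
            row = byte_frequency_distribution(block)
            writer.writerow(row + ["Yes" if is_avi else "No"])
```

Normalizing by block length makes features comparable if blocks differ in size, for example a short final block at the end of an image.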
A decision tree is built from the preprocessed csv file with the C4.5 decision tree method. Each node of the decision tree is a byte value used as a feature. The blocks remaining after byte identifier matching are then identified in turn with the C4.5 decision tree. The specific steps are as follows:
1) Read a block to be identified and extract its BFD feature.
2) Using the C4.5 decision tree already built, take the BFD of the block and select a branch at each node according to that node's threshold. When a leaf node is reached, identification of the block is complete.
3) Repeat steps 1 and 2 to identify all remaining blocks.
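The traversal in step 2 can be sketched in Python as follows. The `Node` and `Leaf` classes and the `classify` function are hypothetical, and any tree built with them for illustration carries made-up thresholds; a real tree would come from running C4.5 on the training csv file.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # "Yes" for AVI, "No" otherwise


@dataclass
class Node:
    byte_value: int                # which BFD column this node tests
    threshold: float               # split point for that column
    low: Union["Node", Leaf]       # branch taken when feature <= threshold
    high: Union["Node", Leaf]      # branch taken when feature > threshold


def classify(tree: Union[Node, Leaf], bfd: Sequence[float]) -> str:
    """Follow threshold branches from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        if bfd[node.byte_value] <= node.threshold:
            node = node.low
        else:
            node = node.high
    return node.label
```

A usage example with invented thresholds: `Node(0x00, 0.10, Node(0xFF, 0.02, Leaf("No"), Leaf("Yes")), Leaf("No"))` labels a block Yes only when byte 0x00 makes up at most 10% of the block and byte 0xFF more than 2%.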
In Step 4, the C4.5 algorithm performs a second round of identification on the blocks remaining after byte identifier matching, so that blocks which contain no byte identifier but do belong to AVI files are also identified. To make the decision tree better reflect the actual storage environment of the medium, the file types mainly present on the storage medium are analyzed before the training set is prepared. The block types in the training set (that is, the file types the blocks belong to) are then made consistent with the file types on the storage medium, with the same, sufficiently large number of blocks for each type. Once the training set has been obtained, its BFD features are extracted with Matlab and the class attribute of each row is assigned according to the file type of the corresponding block, producing a csv file that represents the training set. The C4.5 decision tree method processes this training set to construct its decision tree. For each block to be identified, its BFD is extracted and a branch of the decision tree is selected at each node according to the node's threshold; identification is complete when a leaf node is reached.
Claims (8)
1. A method for identifying AVI-type blocks, characterized in that the method comprises the following steps:
Step 1: Image backup;
the backup is a full backup, in which both the deleted and the undeleted data stored on the backup object are copied to another medium, comprising:
1) selecting another storage medium;
2) selecting a backup tool appropriate to the backup object and making a full backup of all data of the backup object;
3) when the backup is complete, preserving the original storage medium; the data backed up on the other storage medium is used for identifying AVI-type blocks;
Step 2: Block extraction;
1) scanning the image data and analyzing the metadata to determine the allocated blocks and the unallocated blocks in the image;
2) data in allocated blocks need not be recovered; the allocated blocks are marked; the unallocated blocks are then read out in turn and stored in txt file format; each block stored in txt format is an object to be identified;
Step 3: Byte identifier matching;
AVI is one of the file types that use the RIFF container format; RIFF file types contain various byte identifiers used to distinguish data types; analysis of RIFF-based files shows that, apart from the RIFF identifier itself, these file types share no identifiers; the type of a block is determined from the byte identifiers other than RIFF;
Step 4: C4.5 decision tree identification;
a training set matching the storage environment of the storage medium is built, the training set containing blocks of every file type on the storage medium, with the same, sufficiently large number of blocks for each file type, and these blocks are then preprocessed, comprising:
1) using Matlab to extract the byte frequency distribution feature of each input block, the byte frequency distribution features of all blocks forming a matrix of (number of blocks) × 256, which is saved as a csv file; each row holds the byte frequency distribution feature of one block, and each column corresponds to one byte value used as a feature;
2) assigning the class attribute of each row according to the file type each block belongs to; if the row is the byte frequency distribution feature of an AVI fragment, it is labeled Yes; otherwise it is labeled No;
a decision tree is built from the preprocessed csv file with the C4.5 decision tree method, each node of the decision tree being a byte value used as a feature, and the blocks remaining after byte identifier matching are identified in turn with the C4.5 algorithm, comprising:
Step 4-2-1: reading a block to be identified and extracting its byte frequency distribution feature;
Step 4-2-2: using the C4.5 decision tree already built, taking the byte frequency distribution of the block to be identified and selecting a branch at each node according to that node's threshold; when a leaf node is reached, identification is complete;
Step 4-2-3: identifying all remaining blocks according to step 4-2-1 and step 4-2-2.
2. The method for identifying AVI-type blocks according to claim 1, characterized in that Step 1 of the method comprises: selecting suitable backup software according to the type of storage medium and, after the backup is complete, preserving the original storage medium; the backup covers every sector from the first to the last; the backed-up data includes the metadata and the actual data.
3. The method for identifying AVI-type blocks according to claim 1, characterized in that Step 2 of the method comprises: marking, according to the metadata, the allocated blocks, that is, the blocks that need no recovery; and saving each unallocated block as a separate txt file for subsequent identification.
4. The method for identifying AVI-type blocks according to claim 1, characterized in that in Step 3 of the method the byte identifiers specific to AVI files are: List, avi, hdrl, avih, strl, strf, strd, JUNK, odml, movi, ##wb, ##dc, ##db, rec, idx1; the identifiers are matched byte by byte against each block to be identified, where ## stands for a number 01, 02, 03, and so on.
5. The method for identifying AVI-type blocks according to claim 1, characterized in that in Step 3 of the method the KMP method is used to match the byte identifiers against each block stored in txt format; as soon as one matching byte identifier is found in a txt file, matching stops and the block is classified as an AVI-type block; the identified blocks form a set and are removed from the original set of txt files, and the remaining txt files go to the second round, identification with the C4.5 decision tree method.
6. The method for identifying AVI-type blocks according to claim 1, characterized in that in Step 4 of the method the C4.5 algorithm performs a second round of identification on the blocks remaining after byte identifier matching, so that blocks which contain no byte identifier but do belong to AVI files are identified; before the training set is prepared, the file types mainly present on the storage medium are analyzed, and the block types in the training set are then made consistent with the file types on the storage medium, with the same number of blocks for each type.
7. The method for identifying AVI-type blocks according to claim 1, characterized in that in Step 4 of the method, after the training set is obtained, its byte frequency distribution features are extracted with Matlab and the class attribute of each row is assigned according to the file type each block belongs to, producing a csv file that represents the training set; the C4.5 decision tree method processes the training set to construct its decision tree; for each block to be identified, its byte frequency distribution is extracted and a branch of the decision tree is selected at each node according to the node's threshold; identification is complete when a leaf node is reached.
8. The method for identifying AVI-type blocks according to claim 1, characterized in that the method identifies AVI-type blocks on the basis of byte identifiers and the C4.5 decision tree method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410164339.5A CN103942122B (en) | 2014-04-22 | 2014-04-22 | Method for identifying AVI-type blocks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410164339.5A CN103942122B (en) | 2014-04-22 | 2014-04-22 | Method for identifying AVI-type blocks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103942122A CN103942122A (en) | 2014-07-23 |
CN103942122B true CN103942122B (en) | 2017-09-29 |
Family
ID=51189795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410164339.5A Expired - Fee Related CN103942122B (en) | 2014-04-22 | 2014-04-22 | Method for identifying AVI-type blocks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103942122B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6511893B2 (en) * | 2015-03-23 | 2019-05-15 | 日本電気株式会社 | Image processing apparatus, image processing method, and program |
DE102016209032B3 (en) * | 2016-05-24 | 2017-09-14 | Siemens Healthcare Gmbh | Image-providing method for carrying out a medical examination together with the associated imaging system and associated computer program product |
CN109947760A (en) * | 2017-07-26 | 2019-06-28 | 华为技术有限公司 | It is a kind of excavate KPI root because method and device |
CN113032179B (en) * | 2021-02-25 | 2024-03-26 | 北京工业大学 | Third party data recovery software clearing effect evaluation and selection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101064158A (en) * | 2006-04-30 | 2007-10-31 | 凌阳科技股份有限公司 | Optical storage media recorded with audio-video staggered formation files and recording method |
US8374573B1 (en) * | 2009-03-30 | 2013-02-12 | Reno A & E | AVI system with improved receiver signal processing |
CN103165157A (en) * | 2011-12-16 | 2013-06-19 | 深圳市快播科技有限公司 | Method and device for locating playing position of no-indexing audio video interleaved (AVI) file and player |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011253589A (en) * | 2010-06-02 | 2011-12-15 | Funai Electric Co Ltd | Image/voice reproducing device |
Also Published As
Publication number | Publication date |
---|---|
CN103942122A (en) | 2014-07-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170929 |