Nothing Special   »   [go: up one dir, main page]

Varol et al., 2015 - Google Patents

Detecting near-duplicate text documents with a hybrid approach

Varol et al., 2015

View PDF
Document ID
11416167402255108457
Author
Varol C
Hari S
Publication year
Publication venue
Journal of Information Science

External Links

Snippet

Near duplicate data not only increase the cost of information processing in big data, but also increase decision time. Therefore, detecting and eliminating nearly identical information is vital to enhance overall business decisions. To identify near-duplicates in large-scale text …
Continue reading at www.researchgate.net (PDF) (other versions)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30705Clustering or classification
    • G06F17/3071Clustering or classification including class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30705Clustering or classification
    • G06F17/30707Clustering or classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30634Querying
    • G06F17/30657Query processing
    • G06F17/30675Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30613Indexing
    • G06F17/30619Indexing indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30286Information retrieval; Database structures therefor; File system structures therefor in structured data stores
    • G06F17/30289Database design, administration or maintenance
    • G06F17/30303Improving data quality; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20Handling natural language data
    • G06F17/21Text processing
    • G06F17/22Manipulating or registering by use of codes, e.g. in sequence of text characters
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20Handling natural language data
    • G06F17/27Automatic analysis, e.g. parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06KRECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62Methods or arrangements for recognition using electronic means
    • G06K9/6267Classification techniques
    • G06K9/6279Classification techniques relating to the number of classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06KRECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/36Image preprocessing, i.e. processing the image information without deciding about the identity of the image
    • G06K9/46Extraction of features or characteristics of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06KRECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/00852Recognising whole cursive words
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled

Similar Documents

Publication Publication Date Title
Watanabe et al. Theory-driven analysis of large corpora: Semisupervised topic classification of the UN speeches
US8073877B2 (en) Scalable semi-structured named entity detection
US20060206306A1 (en) Text mining apparatus and associated methods
US20100125447A1 (en) Language identification for documents containing multiple languages
Osorio et al. Supervised event coding from text written in spanish: Introducing eventus id
Mustafa et al. Kurdish stemmer pre-processing steps for improving information retrieval
Hasibi et al. On the reproducibility of the TAGME entity linking system
Goebel et al. Summary of the competition on legal information, extraction/entailment (COLIEE) 2023
Kılınç An accurate toponym-matching measure based on approximate string matching
Qian et al. Detecting new Chinese words from massive domain texts with word embedding
Sadeghi et al. Automatic identification of light stop words for Persian information retrieval systems
Yaseen et al. Extracting the roots of Arabic words without removing affixes
Han et al. Towards effective extraction and linking of software mentions from user-generated support tickets
Varol et al. Detecting near-duplicate text documents with a hybrid approach
Ehsan et al. Cross-lingual text alignment for fine-grained plagiarism detection
Figueroa et al. Contextual language models for ranking answers to natural language definition questions
Salama et al. A novel ensemble approach for heterogeneous data with active learning
Fragkou Applying named entity recognition and co-reference resolution for segmenting english texts
Yeniterzi et al. Turkish named-entity recognition
Zhang et al. Chinese-English mixed text normalization
Khan et al. Metadata for Efficient Management of Digital News Articles in Multilingual News Archives
Wei et al. Recognizing software names in biomedical literature using machine learning
Hlayel et al. An algorithm to improve the performance of string matching
Kaur et al. Assessing lexical similarity between short sentences of source code based on granularity
Elloumi et al. General learning approach for event extraction: Case of management change event