Varol et al., 2015 - Google Patents
Detecting near-duplicate text documents with a hybrid approachVarol et al., 2015
View PDF- Document ID
- 11416167402255108457
- Author
- Varol C
- Hari S
- Publication year
- Publication venue
- Journal of Information Science
External Links
Snippet
Near duplicate data not only increase the cost of information processing in big data, but also increase decision time. Therefore, detecting and eliminating nearly identical information is vital to enhance overall business decisions. To identify near-duplicates in large-scale text …
- 230000002708 enhancing 0 abstract description 2
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/3071—Clustering or classification including class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30705—Clustering or classification
- G06F17/30707—Clustering or classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30634—Querying
- G06F17/30657—Query processing
- G06F17/30675—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F17/30613—Indexing
- G06F17/30619—Indexing indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30289—Database design, administration or maintenance
- G06F17/30303—Improving data quality; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6267—Classification techniques
- G06K9/6279—Classification techniques relating to the number of classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/36—Image preprocessing, i.e. processing the image information without deciding about the identity of the image
- G06K9/46—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00852—Recognising whole cursive words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Watanabe et al. | Theory-driven analysis of large corpora: Semisupervised topic classification of the UN speeches | |
US8073877B2 (en) | Scalable semi-structured named entity detection | |
US20060206306A1 (en) | Text mining apparatus and associated methods | |
US20100125447A1 (en) | Language identification for documents containing multiple languages | |
Osorio et al. | Supervised event coding from text written in spanish: Introducing eventus id | |
Mustafa et al. | Kurdish stemmer pre-processing steps for improving information retrieval | |
Hasibi et al. | On the reproducibility of the TAGME entity linking system | |
Goebel et al. | Summary of the competition on legal information, extraction/entailment (COLIEE) 2023 | |
Kılınç | An accurate toponym-matching measure based on approximate string matching | |
Qian et al. | Detecting new Chinese words from massive domain texts with word embedding | |
Sadeghi et al. | Automatic identification of light stop words for Persian information retrieval systems | |
Yaseen et al. | Extracting the roots of Arabic words without removing affixes | |
Han et al. | Towards effective extraction and linking of software mentions from user-generated support tickets | |
Varol et al. | Detecting near-duplicate text documents with a hybrid approach | |
Ehsan et al. | Cross-lingual text alignment for fine-grained plagiarism detection | |
Figueroa et al. | Contextual language models for ranking answers to natural language definition questions | |
Salama et al. | A novel ensemble approach for heterogeneous data with active learning | |
Fragkou | Applying named entity recognition and co-reference resolution for segmenting english texts | |
Yeniterzi et al. | Turkish named-entity recognition | |
Zhang et al. | Chinese-English mixed text normalization | |
Khan et al. | Metadata for Efficient Management of Digital News Articles in Multilingual News Archives | |
Wei et al. | Recognizing software names in biomedical literature using machine learning | |
Hlayel et al. | An algorithm to improve the performance of string matching | |
Kaur et al. | Assessing lexical similarity between short sentences of source code based on granularity | |
Elloumi et al. | General learning approach for event extraction: Case of management change event |