KR20160149050A

KR20160149050A - Apparatus and method for selecting a pure play company by using text mining

Info

Publication number: KR20160149050A
Application number: KR1020150086039A
Authority: KR
Inventors: 유선희; 원동규
Original assignee: 한국과학기술정보연구원
Priority date: 2015-06-17
Filing date: 2015-06-17
Publication date: 2016-12-27

Abstract

The present invention relates to an apparatus and a method for selecting a representative company by utilizing text mining, which comprises: a document collection module for searching for text-based documents; a preprocessing module for converting the text content of the retrieved documents into an inputtable format; a conversion/extraction module for selecting and storing semantic information from the documents preprocessed by the preprocessing module; an analysis module for clustering or classifying the documents using a similarity coefficient based on the semantic information, reproducing them, and linking the business type, representative product, and product-producing company found in the document content; and a selection module for selecting a representative company for each business type using the document content reproduced by the analysis module.

Description

TECHNICAL FIELD The present invention relates to an apparatus and method for selecting a pure play company using text mining.

More particularly, the present invention relates to an apparatus and a method for selecting a pure play company using text mining that include a document collection module for searching for text-based documents, a preprocessing module for converting the text content of a retrieved document into an inputtable format, a conversion/extraction module for selecting and storing semantic information from the documents preprocessed by the preprocessing module, an analysis module for clustering or classifying documents using similarity coefficients based on the semantic information, reproducing them, and linking the business types, representative products, and product-producing companies found in the document content, and a selection module for selecting a representative pure play company for each business type using the reproduced document content.

In R&D decision support and technology valuation, industry-standard financial information is used to estimate profit. Such standard financial information consists of financial statement items suited to estimating the profit or cash flow in a technology valuation, and it can present the yearly average values and percentile composition ratios of the financial data of sample companies. For example, sample companies are selected by criteria such as a three-year operating profit margin (OPM) of 0 or more, for industries in which the number of companies with available information is statistically meaningful (roughly 10 to 20 or more), and the result can be used as basic information.

However, because standard financial information mainly uses industry averages, it is limited in conveying the profit structure that results from commercializing the target technology, and is therefore of limited use in the practice of technology valuation and R&D decision support.

Recently, advances in computers and media have increased the scale of data exponentially. As storage devices have grown and media have developed, text mining has been studied as a way to pick out meaningful information from the mass of meaningless information. Text mining is a type of data mining that deals with discovering new knowledge from large unstructured text sets, and it can be used for data collection, refinement, and transformation, information extraction, summarization, retrieval, and so on.

Through such a text mining process, it is possible to find the information a user wants in a large amount of data at the level of contextual meaning rather than at the keyword level. In other words, as the volume of information grows explosively, methods that can process much of it automatically are needed, and text mining methods that discover hidden patterns in large amounts of data and retrieve data associated with a specific topic are being studied.

Among the standard financial data currently available, data such as the Bank of Korea's 'Business Management Analysis' are divided into large corporations, small and medium-sized enterprises (SMEs), and all corporations. However, most of the data exist only for all corporations and for large corporations, financial data for SMEs are lacking, and the industrial classification is usable only at the level of major categories.

In addition, companies that are well known and listed as the representative companies of an industry have usually already diversified their businesses. If the financial information of such similar companies is used on an average basis, the figures include not only the operating profit of the business in question but also operating profit from other businesses or products. For example, if Samsung Electronics is selected as a company in a business type similar to the technology being evaluated and its financial information is used, the operating profit estimate may be distorted.

For this reason, in countries around the world, companies that concentrate on manufacturing or producing in only one industry or one product are defined as pure play companies by means of a technology evaluation matrix. Databases matching thousands of major companies to major industries are built and operated, and they are used to estimate the cost of capital for a project or production line. Pure play companies can thus serve as a basis for explaining the risks and opportunities of a business.

The present invention has been made against this technical background. It is intended not only to satisfy the technical needs described above but also to provide additional technical elements that a person having ordinary skill in the art could not easily devise.

The present invention has been made to solve the problems of the existing methods described above. Its object is to establish a model for selecting a representative pure play company for each business type by using text mining techniques, and to implement a pure play company selection system using text mining.

In addition, the present invention aims to provide a pure play company selection system using text mining that studies a technology-product-industry linkage model and establishes its evidential basis using text mining, an information analysis technique, and that uses the linkage model to build a model for selecting representative companies by industry and an information base for estimating business profit.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems that are apparent to a person skilled in the art from the following description may also be included.

According to an aspect of the present invention, there is provided an apparatus for selecting a pure play company using text mining, the apparatus comprising: a document collection module for searching for a text-based document; a preprocessing module for converting the text content of the retrieved document into an inputtable format; a conversion/extraction module for selecting and storing semantic information from the documents preprocessed by the preprocessing module; an analysis module for clustering or classifying documents using similarity coefficients based on the semantic information, reproducing them, and linking the business type, representative product, and product-producing company found in the document content; and a selection module for selecting a representative pure play company for each business type using the document content reproduced by the analysis module.

Also, in the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention, the document collection module may search for all electronic documents and text-based documents in a document storage device including at least one of an external storage device connected to the pure play company selection apparatus, an Internet search portal, and an external cloud device, and may repeat the search at a predetermined interval and update and store a text-based document when it has been modified.

In addition, in the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention, the preprocessing module may analyze morphemes, covering at least one of the words, phrases, and clauses expressed in the retrieved document, and convert the unstructured text content into an inputtable format.

In this case, when analyzing morphemes in the retrieved document, the preprocessing module may extract each morpheme as the smallest meaningful unit of the language and classify it according to whether it is free or bound and whether it carries substantive meaning. The preprocessing module may also express language expressions that have multiple variant patterns in normalized form through predetermined normalization, and verify the normalized expressions with a finite-state machine (FSM).

In addition, the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention may further include a dictionary database module that stores information on existing terminology and makes it available when the preprocessing module performs morphological analysis or when the conversion/extraction module selects semantic information.

In addition, in the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention, the analysis module may include a clustering unit that clusters documents using similarity coefficients based on the semantic information, and a text classification unit that, based on the semantic information, determines whether to assign a document to a given category according to the magnitude of its fitness value and ranks the categories according to the fitness value.

In this case, the clustering unit may cluster the documents based on the semantic information using at least one of hierarchical clustering, which uses at least one of single linkage, complete linkage, group average linkage, and Ward's method, and non-hierarchical clustering, which determines the initial cluster selection method and the stopping point of the relocation process.

In the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention, the text classification unit may input a document to be classified into a classifier and measure the classification accuracy using a holdout validation or cross-validation method, and the classifier may include at least one of probabilistic semantic analysis (PSA), Bayesian networks (LDA), a kNN classifier, a linear classifier, a Rocchio classifier, an SVM classifier, a neural network classifier, and a perceptron classifier.

Further, the apparatus for selecting a pure play company using text mining according to an embodiment of the present invention may further include an information output module that visualizes and outputs the reproduced data, and an evaluation module that evaluates the text mining process of the pure play company selection apparatus and corrects or supplements the parts where errors have occurred according to predetermined criteria.

In this case, the evaluation module may perform a cluster validity evaluation, which determines whether the data structure produced by the clustering of the analysis module provides statistical evidence for pure play company selection, and a cluster performance evaluation based on external and internal quality measures. The evaluation module may evaluate whether meaningful clusters have been generated using a clustering tendency test that includes at least one of tests such as a redundancy check and a nearest neighbor check. The evaluation module may also evaluate classification performance based on the classification recall and classification accuracy of the documents classified by the analysis module.

In addition, in the pure play company selection apparatus using text mining according to an embodiment of the present invention, the selection module may select the representative pure play company for each business type according to selection criteria including at least one of whether the company operates in a single business type, whether it produces a single product, whether it conducts research, its financial information, and its financial results.

Meanwhile, a pure play company selection method using text mining according to an embodiment of the present invention includes: searching for a text-based document; converting the text content of the retrieved document into an inputtable format; selecting and storing semantic information from the preprocessed document; clustering or classifying documents using a similarity coefficient based on the semantic information, reproducing them, and linking the business type, representative product, and product-producing company found in the document content; and selecting a representative pure play company for each business type using the reproduced document content.

The present invention allows clustering and matching of Korean terms by an information analysis technique (text mining) to be used for technology-product linkage, which enhances objectivity and provides a basis for automation. In addition, it makes it possible to implement a pure play company selection system using text mining that offers a practical application model for advanced term extraction, processing, conversion, and clustering through morphological analysis.

In addition, there are various approaches to estimating the capacity and effect of creating added value through the commercialization of ideas and technologies, but they require considerable effort, time, and cost. By contrast, the pure play company selection apparatus using text mining of the present invention can be used simply and effectively, by linking to representative companies, to compare risk and estimate profitability.

In addition, the pure play company selection apparatus using text mining of the present invention can serve as a core information base for measuring the economic value of a technology in R&D planning support, profitability analysis for technology commercialization, and technology valuation, and can be used as key reference information on the cost structure of implementing a technology, which is a main criterion in decision making.

In addition, the pure play company selection apparatus using text mining of the present invention provides key evidential information for connecting technology and the market, and is expected to be applicable to future applied research and practice, such as technology-market dynamics models and ecosystem research based on big data analysis.

In addition, because the established basic information on technologies and markets can be linked with profit-related information, the linking of benchmark targets can be improved in future technology analysis, commercialization feasibility analysis, and similar work concerning the commercialization of technologies in markets and industries.

FIG. 1 is a block diagram illustrating a pure play company selection apparatus using text mining according to an embodiment of the present invention.
FIG. 2 is a diagram showing how the preprocessing module of a pure play company selection apparatus using text mining according to an embodiment of the present invention performs preprocessing.
FIG. 3 is a diagram illustrating a method for selecting a pure play company by a pure play company selection apparatus using text mining according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of the pure play company selection criteria of a pure play company selection apparatus using text mining according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a pure play company selection method using text mining according to an embodiment of the present invention.

Hereinafter, an apparatus and method for selecting a pure play company using text mining according to the present invention will be described in detail with reference to the accompanying drawings. The present invention is not limited to the embodiments described, and various changes and modifications may be made without departing from the scope of the present invention. In addition, because the attached drawings are schematic drawings intended to make the embodiments easy to describe, the matters shown in them may differ from what is actually implemented.

Meanwhile, each constituent unit described below is only an example for implementing the present invention. Thus, in other implementations of the present invention, other components may be used without departing from the spirit and scope of the present invention. In addition, each component may be implemented purely in hardware or purely in software, or by a combination of various hardware and software configurations performing the same function. Also, two or more components may be implemented together by a single piece of hardware or software.

Also, the expression "comprising" is an open-ended expression that merely denotes that the stated elements are present, and should not be understood to exclude additional elements.

Also, expressions such as 'first' and 'second' are used only to distinguish between multiple elements, and do not limit their order or other features.

Text mining refers to the process of discovering new knowledge in unstructured text and is the counterpart of data mining, which is the discovery of knowledge in structured data. Only about 20% of the information that is created, stored, and reused in an enterprise is structured data in a readily usable form; the remaining 80% consists of unstructured data such as compound documents from word processors, e-mail, presentations, spreadsheets, and web pages. Because search engines return a large amount of information, there is a need to find the useful information among the unwanted information and to extract useful information from unstructured data.

Most digital information is unstructured data. Text mining aims to extract and process useful information by applying natural language processing (NLP) and document processing techniques to such unstructured or semi-structured data. Document summarization, feature extraction, and clustering are core research areas of text mining.

FIG. 1 is a block diagram illustrating a pure play company selection apparatus using text mining according to an embodiment of the present invention.

Referring to FIG. 1, a pure play company selection apparatus 100 using text mining according to the present invention includes a document collection module 110 for searching for a text-based document, a preprocessing module 120 for converting the text content of the retrieved document into an inputtable format, a conversion/extraction module 130 for selecting and storing semantic information from the documents preprocessed by the preprocessing module, an analysis module 140 for clustering or classifying documents based on the semantic information, reproducing them, and linking the business type, representative product, and product-producing company found in the document content, and a selection module 150 for selecting a representative pure play company for each business type using the document content reproduced by the analysis module 140.

The document collection module 110 searches for text-based documents. The document collection module can search all electronic documents and text-based documents in a document storage device including at least one of an external storage device connected to the pure play company selection apparatus, an Internet search portal, and an external cloud device.

For example, the document collection module can search all electronic and text-based documents on a USB drive or external hard drive connected directly to the pure play company selection apparatus, through an Internet search portal (e.g., Naver, Daum, Google), or on an external cloud device (Wi- Portal cloud). Documents subject to text mining are documents in any electronic format, such as text (txt), hwp, Word (doc), PowerPoint (ppt), PDF, and HTML, and include industry manuals and web documents.

The preprocessing module 120 converts the text content of the retrieved document into an inputtable format. The preprocessing module analyzes morphemes, covering at least one of the words, phrases, and clauses expressed in the retrieved document, and converts the unstructured text content into a form that can be input. During this morphological analysis, each morpheme is extracted as the smallest meaningful unit of the language and classified according to whether it is free or bound and whether it carries substantive meaning.

Text preprocessing refers to a technique for finding useful information within a document, rather than retrieving the document itself. It detects the parts of a given text that contain cue words and fills predefined slots with them. In other words, it converts unstructured text into a form that can be entered in a database table.

In this case, the preprocessing module analyzes the morphemes in the document. A morpheme is the smallest unit of a language that carries meaning; if it is analyzed any further, the meaning is lost. Like phonemes, morphemes are abstract entities and can be realized in various forms in actual speech. Morphemes can be divided into free and bound morphemes depending on whether they can stand alone, and into full and empty morphemes depending on whether they carry substantive meaning.

A free (independent) morpheme is one that can be used alone without other morphemes, and usually corresponds to a noun (e.g., wiki, encyclopedia, information). A bound (dependent) morpheme is one that is always used together with another morpheme when spoken, and includes, for example, the stems and endings of verbs and the particles attached to words.

Morphemes are also divided into lexical morphemes (full morphemes), which express a lexical meaning such as a specific object, action, or state, and grammatical morphemes (empty morphemes), which mark the formal relations between lexical morphemes or words. That is, a lexical morpheme has lexical meaning and indicates a certain object, state, or action, while a grammatical morpheme has grammatical meaning and functions to express the relationship between lexical morphemes.

The preprocessing module of the present invention analyzes the morphemes in the document and converts them. For example, FIG. 2 shows how the preprocessing module of the pure play company selection apparatus using text mining performs preprocessing. In a text-based document retrieved through a web search, the sentence "There is a lot of good information in KISTI" can be divided into lexical and grammatical morphemes through morphological analysis. In this case, the lexical morphemes are "KISTI", " , "Information" and "many -", and the grammatical morphemes may correspond to "-", "-", "-", "-", "-".

In addition, the apparatus may further include a dictionary database module 160 that stores information on existing terminology and makes it available when the preprocessing module performs morphological analysis or when the conversion/extraction module selects semantic information. For example, when distinguishing morphemes, the pre-stored values can be retrieved through the dictionary database module to check how a morpheme is generally used and whether it has been properly identified in the passage. Likewise, when the conversion/extraction module, described later, converts or extracts semantic information, the dictionary database module can be consulted during the task of collecting or classifying meaningful words.

Meanwhile, the preprocessing module can express language expressions that have multiple variant patterns in normalized form through predetermined normalization, and verify the normalized expressions with a finite-state machine (FSM).

A regular expression is a way of normalizing and representing language expressions that have various variant patterns in information extraction, and corresponds to the usual information extraction process using regex (e.g., Tokenizer -> POS Tagger -> Regex Matcher -> Template Filler -> Template Merger).
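The matcher and template-filler stages of such a pipeline can be sketched as follows. This is a minimal illustration rather than the implementation described in the patent; the patterns, the sample wording, and the names `normalize_company` and `fill_template` are assumptions introduced here.

```python
import re

# Hypothetical pattern: several surface variants of a legal suffix
# ("Co., Ltd.", "Corporation", "Corp.", "Inc.") are normalized away.
SUFFIX = re.compile(r"\s*,?\s*(Co\.,?\s*Ltd\.?|Corporation|Corp\.?|Inc\.?)$", re.IGNORECASE)

# Hypothetical relation pattern: "<company> produces <product>".
RELATION = re.compile(
    r"(?P<company>.+?)\s+(?:produces|manufactures|makes)\s+(?P<product>.+?)\.?$"
)


def normalize_company(name: str) -> str:
    """Strip a trailing legal suffix so variant spellings map to one key."""
    return SUFFIX.sub("", name.strip())


def fill_template(sentence: str):
    """Regex matcher plus template filler: return filled slots, or None."""
    match = RELATION.match(sentence.strip())
    if match is None:
        return None
    return {
        "company": normalize_company(match.group("company")),
        "product": match.group("product").strip(),
    }
```

A sentence that does not match the relation pattern yields `None`, so a later template-merger stage would only see filled templates.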

A finite-state machine (FSM) is a model consisting of state nodes and arrows (transitions), which makes it easy to represent the possible cases at each stage; FASTUS (Finite State Automaton Text Understanding System) is one example. Other techniques that can be used include dictionary matching; machine-learning and probabilistic models such as hidden Markov models (HMMs), which give the most probable sequence of states for a given sequence of observations based on the probability of each state; and rule-based taggers, which perform part-of-speech tagging based on rules such as the tag of a word, specific usage rules, whether a word starts with an uppercase letter, and whether it has a known suffix.
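A finite-state machine of this kind can be written as a transition table. The sketch below is only an illustration under assumed inputs: the tag names `NNP` and `SUFFIX`, the state names, and the accepted pattern (one or more proper-noun tags followed by an optional suffix tag) are hypothetical and do not come from the patent.

```python
# Transition table of a small finite-state machine (hypothetical example).
# It accepts tag sequences of the form: NNP+ followed by an optional SUFFIX.
TRANSITIONS = {
    ("START", "NNP"): "NAME",
    ("NAME", "NNP"): "NAME",
    ("NAME", "SUFFIX"): "DONE",
}
ACCEPTING = {"NAME", "DONE"}


def accepts(tags):
    """Run the FSM over a tag sequence; reject on any undefined transition."""
    state = "START"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no arrow leaves the current state for this tag
            return False
    return state in ACCEPTING
```

Because every state and arrow is listed explicitly, adding a new variant pattern means adding entries to the table rather than changing the control flow.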

The conversion/extraction module 130 selects and stores semantic information from the documents preprocessed by the preprocessing module. Semantic information refers to the relevant information a user wants from within the flood of available information. For this purpose, a feature generation method and a feature selection method can be used. Semantic information conversion selects and stores the meaningful information in the preprocessed data, and performs processing such as abstraction, case folding, and stemming. Semantic information extraction simplifies the representation of complex semantic information and stores the information suited to the domain as the semantic data (feature information) of the document.
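The conversion steps named above (case folding, stemming, and selection of meaningful terms) might look like the following sketch for English text. The stopword list and the suffix-stripping rule are deliberately crude assumptions for illustration; they are not a real stemmer and are not taken from the patent.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are"}  # illustrative only
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix list, not a real stemmer


def crude_stem(token: str) -> str:
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract_features(text: str) -> Counter:
    """Case-fold, tokenize, drop stopwords, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)
```

The resulting term-frequency counts are one possible form of the feature information that later clustering and classification steps could consume.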

The analysis module 140 clusters or classifies documents using similarity coefficients based on the semantic information, reproduces them, and links the business type, representative product, and product-producing company found in the document content. The analysis module may include a clustering unit 141 that clusters documents using similarity coefficients based on the semantic information, and a text classification unit 142 that, based on the semantic information, determines whether to assign a document to a given category according to the magnitude of its fitness value and ranks the categories according to the fitness value.

Similarity coefficients are used when clustering or classifying documents, and both distance coefficients and similarity coefficients can be used to measure the similarity between documents. Distance coefficients may include Euclidean distance, chi-square, binary Euclidean distance, and binary size difference. Similarity coefficients may include, among others, the Pearson correlation coefficient, cosine coefficient, binary Jaccard coefficient, binary Dice coefficient, binary Ochiai coefficient, weighted Dice coefficient, and inner product coefficient.
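A few of the coefficients listed above can be written down directly. The sketch below uses their standard textbook definitions; representing documents as numeric vectors (for Euclidean distance and cosine) or as sets of terms (for the binary Jaccard and Dice coefficients) is an assumption made here for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine coefficient; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a: set, b: set) -> float:
    """Binary Jaccard coefficient: shared terms over all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Binary Dice coefficient: twice the shared terms over the summed sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0
```

Distance coefficients grow as documents become less alike, while similarity coefficients grow as they become more alike, so a clustering routine has to be told which kind it is receiving.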

The clustering unit 141 clusters documents using similarity coefficients based on the feature information. It can cluster documents using hierarchical clustering or non-hierarchical clustering, the latter of which determines the initial cluster selection method and the stopping point of the relocation process.

Hierarchical clustering includes i) single linkage, based on the similarity between the closest neighbors belonging to two clusters; ii) complete linkage, based on the similarity between the farthest neighbors belonging to two clusters; iii) group average linkage, which takes either the between-group average, the average of the similarities between pairs of members drawn from the two clusters, or the within-group average, the average of the similarities between all pairs of members belonging to the two clusters to be merged; and iv) Ward's method, which finds and merges the pair of clusters whose merger least increases the sum of squared errors.
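The first three linkage rules can be compared in a small agglomerative routine. This is a sketch under assumptions: it works on Euclidean distances between points rather than on document similarities, uses the between-group form of the average, omits Ward's method, and the function name `agglomerate` is introduced here. It requires Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations

LINKAGE = {
    "single": min,                            # closest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # between-group average
}


def agglomerate(points, n_clusters, linkage="single"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    link = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:

        def gap(pair):
            a, b = pair
            return link(
                [math.dist(points[i], points[j]) for i in clusters[a] for j in clusters[b]]
            )

        # combinations yields a < b, so deleting index b leaves index a valid
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

The routine recomputes all pairwise gaps on every merge, which is fine for a handful of points but far too slow for a real document collection.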

Non-hierarchical clustering determines how to select the initial clusters and when to stop the relocation process. It includes i) the single-pass method, which takes incoming documents one at a time and either adds each to an already formed cluster by comparing its similarity with the existing cluster centroids or makes it the seed document of a new cluster; ii) K-means, which starts from K arbitrarily formed initial clusters and repeats the relocation procedure based on the similarity between each object and the centroids, stopping when the clusters are stable because few documents are relocated or when the sum of squared errors no longer decreases; and iii) expectation maximization (EM), which is a probability-based clustering technique.
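The K-means relocation loop described in item ii) can be sketched as below. The choice of the first k points as initial centroids, the use of Euclidean distance, and the stopping rule (stop when no point changes cluster) are simplifying assumptions made for this illustration. It requires Python 3.8 or later for `math.dist`.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means; the first k points act as initial centroids (assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Relocation step: send every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # nothing was relocated: clusters are stable
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids
```

Different initial centroids can lead to different final clusters, which is why the text treats the initial cluster selection method as part of the design.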

The text classification unit 142 can determine, based on the semantic information, whether to assign a document to a given category according to the magnitude of its fitness value, and can rank the categories according to the fitness value. A binary classifier in the text classification unit decides whether to assign a document to a specific category according to the magnitude of its fitness value for that category, and a ranking classifier ranks the categories according to their fitness values. Classification techniques include knowledge-based approaches, which use classification rules and expert systems, and machine-learning approaches, which use inductive learning classifiers.

In addition, the text classification unit divides the experimental document collection into a training set and a test set, applies a learning algorithm to the training set to generate a classification model, inputs the documents to be classified into the classifier, and then measures the classification accuracy. Validation methods such as holdout validation, cross-validation, k-fold cross-validation, and leave-one-out cross-validation can be used.
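The validation schemes named above differ mainly in how the documents are split. The following sketch shows k-fold cross-validation with contiguous folds; setting k equal to the number of samples gives leave-one-out. The function names and the caller-supplied `train_fn` and `predict_fn` hooks are assumptions for illustration, and k must not exceed the number of samples.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds; train_fn and predict_fn come from the caller."""
    scores = []
    for train, test in k_fold_splits(len(samples), k):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict_fn(model, samples[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

In practice the documents would usually be shuffled before splitting, so that the contiguous folds do not mirror the order in which the documents were collected.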

In addition, the classifier of the text classification unit may include: i) a probabilistic semantic analysis (PSA) classifier, which uses the training documents to calculate the probability that a word indicates a specific category and then predicts the category of an input document using the words appearing in it as clues; ii) a kNN classifier, for which the similarity coefficient and the value of k are chosen in advance, and which finds the k neighboring documents closest to the input document in the feature space from the training document set and selects one or more categories for the input document based on the categories assigned to those neighbors; iii) a linear classifier, which classifies using the inner product between a category vector, representing the classifier for a specific category, and the vector of the document to be classified; iv) a Rocchio classifier, in which the category vector that serves as the classifier for a specific category C is built during training from positive and negative examples; v) an SVM classifier, which finds the decision plane that separates the sets of positive and negative examples with the maximum margin; vi) a neural network classifier, which determines the weights of the links between nodes; vii) a multi-layer backpropagation classifier, which feeds the term weights of the training documents to the input nodes and minimizes the error through backpropagation when a classification error occurs; and viii) a perceptron classifier, which compares the result value with an output reference value.
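Of the classifiers listed, kNN is the simplest to show in code. The sketch below assumes documents are already represented as numeric term-weight vectors, uses the cosine coefficient as the similarity measure, and assigns a single category by majority vote; the function name `knn_classify` and the category labels in the usage note are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine coefficient; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(doc, training, k=3):
    """Return the majority category among the k most similar training documents.

    training is a list of (vector, category) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(doc, item[0]), reverse=True)
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]
```

For instance, with training vectors labeled "beverage" and "snack", an input document is assigned whichever label is most common among its k nearest neighbors; returning the full vote counts instead would give the ranked list of categories that the ranking classifier described earlier calls for.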

The selection module 150 can select a representative pure play company for each business type using the document content reproduced by the analysis module. The selection module can select the representative pure play company according to selection criteria including at least one of whether the company operates in a single business type, whether it produces a single product, whether it conducts research, its financial information, and its financial results.

A technology-product linked pure play company is a company that, in the industries and business types in which the technology to be evaluated can be commercialized, researches, manufactures, and sells only those products that can be implemented with the technology, that is, a company that concentrates on those products. It also refers to a company whose financial information has been reviewed by external experts and that has a stable financial base.

FIG. 3 illustrates how the pure company selection apparatus using text mining according to an embodiment of the present invention selects a pure company. For example, Coca-Cola (company) can be regarded as a pure company because it manufactures and sells only cola (representative product), whereas Pepsi (company) sells not only beverages (business type) but also Frito-Lay (representative product), which belongs to another business type. Therefore, Pepsi is not a pure company.

When such a pure company is selected for each business type, a company can be selected as a pure company when it satisfies all, or a predetermined number, of various criteria such as the pure company selection criteria shown in FIG. 4. For example, the selection criteria may include i) operating in a single business type as far as possible and ii) producing a single product, among various other selection criteria, and a company that meets these criteria can be selected as a pure company.
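
The criteria-based selection described above can be sketched as follows. This is a minimal illustration, not the selection logic of the invention: only the two example criteria (single business type, single product) are modeled, and the `CompanyProfile` class, the `required` threshold parameter, and the sample profiles (including the "snack" label) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompanyProfile:
    """Hypothetical record linking a company to its business types and products."""
    name: str
    business_types: List[str]
    products: List[str]


def criteria_met(company: CompanyProfile) -> int:
    """Count how many of the example selection criteria the company satisfies."""
    checks = [
        len(set(company.business_types)) == 1,  # i) single business type
        len(set(company.products)) == 1,        # ii) single product
    ]
    return sum(checks)


def is_pure_company(company: CompanyProfile, required: int = 2) -> bool:
    """Select the company when at least `required` criteria are satisfied."""
    return criteria_met(company) >= required


# Sample profiles loosely following the Coca-Cola / Pepsi example in the description.
coca_cola = CompanyProfile("Coca-Cola", ["beverage"], ["cola"])
pepsi = CompanyProfile("Pepsi", ["beverage", "snack"], ["cola", "Frito-Lay"])
print(is_pure_company(coca_cola), is_pure_company(pepsi))
```

Further criteria such as research activity or financial information could be appended to `checks` in the same way, with `required` set to the predetermined number of criteria.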

The information output module 170 can visualize and output the reproduced data. In this case, newly generated information can be displayed effectively to the end user by using a visualization tool.

The evaluation module 180 evaluates the text mining process of the pure company selection apparatus and can correct or supplement the parts where errors occur according to predetermined criteria. In this case, the evaluation module can perform a cluster feasibility evaluation, which determines whether the data structure generated by the clustering of the analysis module provides statistical evidence for pure company selection, and a cluster performance evaluation, which is determined through an external quality measure and an internal quality measure.

The cluster feasibility evaluation (cluster validity test) evaluates whether the data structure generated by the clustering provides statistical evidence for the particular phenomenon being studied. It examines, for example, whether the data set has a tendency to cluster rather than being randomly distributed, how well the hierarchical structure resulting from the clustering reflects the similarity relationships, and whether the document clustering result is effective in improving search performance.

In this case, the cluster feasibility evaluation can perform a clustering tendency test to check whether meaningful clusters can be generated as a result of clustering. The clustering tendency test may include i) a redundancy test that compares the similarity among all documents relevant to a query with the similarity between the relevant documents and the non-relevant documents, ii) a nearest neighbor test that examines the actual relevance among the n nearest neighbors of each document, and iii) a term density test that divides the total number of postings by (number of documents × number of index terms) and checks whether the document group is dense enough to be suitable for clustering.
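
The term density test in iii) can be illustrated with a short calculation. The sketch below assumes that each document is represented by its set of index terms and that a posting is one (document, index term) pair; the function name `term_density` is hypothetical.

```python
from typing import List, Set


def term_density(doc_terms: List[Set[str]]) -> float:
    """Total postings divided by (number of documents x number of index terms)."""
    vocabulary = set().union(*doc_terms)
    if not doc_terms or not vocabulary:
        return 0.0
    # One posting per distinct index term occurring in a document.
    postings = sum(len(terms) for terms in doc_terms)
    return postings / (len(doc_terms) * len(vocabulary))


# Illustrative documents: 4 postings, 3 documents, 3 index terms.
docs = [{"cola", "drink"}, {"cola"}, {"chips"}]
print(term_density(docs))
```

A value close to 1 means that most index terms occur in most documents, which this test treats as a dense document group; the threshold above which a group counts as suitable for clustering is not given in the description.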

The cluster performance evaluation considers an external quality measure and an internal quality measure. The external quality measure compares the document classes produced by manual classification with the document clusters produced by clustering, and the internal quality measure evaluates the similarity between different clustering results.
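
The description does not name a specific external quality measure. As one possible example, the sketch below computes purity, a commonly used measure that compares clusters with manually assigned classes; choosing purity here is an assumption, and the function name `purity` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], class_labels: Sequence[Hashable]) -> float:
    """Fraction of documents that belong to the majority manual class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        return 0.0
    by_cluster: Dict[Hashable, Counter] = {}
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    # For each cluster, count the documents carrying its most frequent manual class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Illustrative data: two clusters of three documents each.
print(purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))
```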

In addition, the evaluation module can evaluate classification performance through the classification recall and classification accuracy (precision) of the documents classified by the analysis module. The classification recall can be calculated by dividing the number of relevant categories assigned by the total number of relevant categories, and the classification accuracy can be calculated by dividing the number of relevant categories assigned by the total number of categories assigned.
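
The two ratios can be written out as a short calculation. The sketch below assumes that the categories assigned by the classifier and the relevant (correct) categories are available as sets; the function name `recall_and_accuracy` and the sample categories are hypothetical.

```python
from typing import Set, Tuple


def recall_and_accuracy(assigned: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (classification recall, classification accuracy) for one document.

    recall   = relevant categories assigned / all relevant categories
    accuracy = relevant categories assigned / all categories assigned
    """
    correct = len(assigned & relevant)
    recall = correct / len(relevant) if relevant else 0.0
    accuracy = correct / len(assigned) if assigned else 0.0
    return recall, accuracy


# Illustrative categories: 2 of the 3 assigned categories are relevant,
# and 2 of the 4 relevant categories were found.
assigned = {"beverage", "snack", "dairy"}
relevant = {"beverage", "snack", "tobacco", "retail"}
print(recall_and_accuracy(assigned, relevant))
```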

FIG. 5 is a flowchart illustrating a pure company selection method using text mining according to an embodiment of the present invention.

Referring to FIG. 5, a pure company selection method using text mining according to the present invention includes searching for a text-based document (S510), converting the text content of the retrieved document into an inputtable format (S520), selecting and storing semantic information from the documents preprocessed by the preprocessing module (S530), clustering or classifying the documents using similarity coefficients based on the semantic information to reproduce them (S540), linking the business type, the representative product, and the product-producing company among the contents of the documents (S550), and selecting a representative pure company for each business type using the document contents reproduced by the analysis module (S560).

Meanwhile, although the 'pure company selection method using text mining' according to the embodiment of the present invention described above differs from the 'pure company selection apparatus using text mining' according to an embodiment of the present invention, it may include substantially the same technical features.

Therefore, although not described in detail in order to avoid redundant description, the features described above in connection with the 'pure company selection apparatus using text mining' can of course be applied by analogy to the 'pure company selection method using text mining'. Conversely, the features described above in connection with the 'pure company selection method using text mining' can be applied by analogy to the 'pure company selection apparatus using text mining'.

The embodiments of the present invention described above are disclosed for the purpose of illustration, and the present invention is not limited thereto. It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit and scope of the invention.

100: Pure company selection device
110: Document collection module
120: Preprocessing module
130: Conversion / extraction module
140: Analysis module
141: Clustering unit
142: Text classification unit
150: Selection module
160: Dictionary database module
170: Information output module
180: Evaluation module

Claims

1. A pure company selection apparatus using text mining, comprising:
A document collection module for searching for a text-based document;
A preprocessing module for converting the text content of the retrieved document into an inputtable format;
A conversion / extraction module for selecting and storing semantic information from the documents preprocessed by the preprocessing module;
An analysis module for clustering or classifying documents using similarity coefficients based on the semantic information to reproduce the documents, and for linking business types, representative products, and product-producing companies among the contents of the documents; and
A selection module for selecting a representative pure company for each business type using the document contents reproduced by the analysis module.

2. The apparatus according to claim 1,
Wherein the document collection module
Searches for the text-based document in a document storage including at least one of an external storage device connected to the pure company selection apparatus, an Internet search portal, and an external cloud device.

3. The apparatus according to claim 1,
Wherein the document collection module
Searches for and stores updated text-based documents at every predetermined period or whenever an already retrieved document is modified and uploaded.

4. The apparatus according to claim 1,
Wherein the preprocessing module
Performs linguistic analysis of morphemes, including at least one of words, phrases, and clauses expressed in the retrieved document, to convert unstructured text content into an inputtable format.

5. The apparatus of claim 4,
Wherein the preprocessing module,
When analyzing morphemes in the retrieved document, extracts each morpheme as the smallest meaningful unit of the language and classifies the morphemes according to whether they can stand alone or whether they carry meaning.

6. The apparatus of claim 4,
Further comprising a dictionary database module for storing information on existing terms for use when the preprocessing module performs linguistic analysis of morphemes or when the conversion / extraction module selects semantic information.

7. The apparatus according to claim 1,
Wherein the preprocessing module
Expresses a linguistic expression that has a plurality of variant patterns through predetermined normalization, and verifies the normalized linguistic expression through a finite state machine (FSM).

8. The apparatus according to claim 1,
Wherein the analysis module comprises:
A clustering unit for clustering documents using the similarity coefficient based on the semantic information; and
A text classification unit for determining, based on the semantic information, whether to assign a document to preset categories according to the magnitude of a conformance value, or for ranking the preset categories according to the magnitude of the conformance value.

9. The apparatus of claim 8,
Wherein the clustering unit
Forms a plurality of clusters based on the semantic information and clusters the documents using hierarchical clustering according to the relationships among the plurality of clusters.

10. The apparatus of claim 9,
Wherein the clustering unit
Uses at least one hierarchical clustering technique among single linkage, complete linkage, group average linkage, and Ward's method.

11. The apparatus of claim 8,
Wherein the clustering unit
Clusters the documents using non-hierarchical clustering, for which the initial cluster selection method and the stopping point of the reallocation process are determined.

12. The apparatus of claim 8,
Wherein the text classification unit
Inputs the document to be classified into a classifier and measures the classification accuracy of the document using a holdout validation or cross-validation method.

13. The apparatus of claim 12,
Wherein the classifier
Comprises at least one of a probabilistic semantic analysis (PSA) classifier, naive Bayes networks (LDA), a kNN classifier, a linear classifier, a Rocchio classifier, an SVM classifier, and a neural network classifier.

14. The apparatus according to claim 1,
Further comprising an information output module for visualizing and outputting the reproduced data.

15. The apparatus according to claim 1,
Further comprising an evaluation module for evaluating the text mining process of the pure company selection apparatus and correcting or supplementing the parts where errors occur according to predetermined criteria.

16. The apparatus of claim 15,
Wherein the evaluation module
Performs a cluster feasibility evaluation for determining whether the data structure generated as a result of the clustering of the analysis module provides statistical evidence for pure company selection, and a cluster performance evaluation determined through an external quality measure and an internal quality measure.

17. The apparatus of claim 15,
Wherein the evaluation module
Evaluates whether meaningful clusters can be generated by using a clustering tendency test including at least one of a redundancy test, a nearest neighbor test, and a term density test.

18. The apparatus of claim 15,
Wherein the evaluation module
Evaluates classification performance through the classification recall and the classification accuracy of the documents classified by the analysis module.

19. The apparatus according to claim 1,
Wherein the selection module
Selects the representative pure company for each business type according to selection criteria including at least one of whether the company has a single business type, whether it produces a single product, whether it conducts research, financial information, and financial results.

20. A pure company selection method using text mining, comprising:
Searching for a text-based document;
Converting the text content of the retrieved document into an inputtable format;
Selecting and storing semantic information from the documents preprocessed by the preprocessing module;
Clustering or classifying the documents using the similarity coefficient based on the semantic information to reproduce the documents;
Linking business types, representative products, and product-producing companies among the contents of the documents; and
Selecting a representative pure company for each business type using the document contents reproduced by the analysis module.