M.E.-ISE-2023-25-60 PIS E31-RSA-Best Practices in Data Mining

BEST PRACTICES IN DATA MINING
What is Data Mining?

Data mining is a computational process for discovering patterns, correlations, and anomalies within
large datasets. It applies statistical analysis and machine learning (ML) techniques to extract
meaningful information and insights from data. Industrial organisations can use these insights to
make informed decisions, predict trends, and improve safety strategies.
Example for Data Mining:
Data mining can be helpful to human resources (HR) departments in identifying the characteristics of
their most successful employees. Information obtained – such as universities attended by highly
successful employees – can help HR focus recruiting efforts accordingly. Additionally, Strategic
Enterprise Management applications help a company translate corporate-level goals, such as profit
and margin share targets, into operational decisions, such as production plans and workforce levels.
Data mining is essential in transforming large volumes of raw data (structured, unstructured, or
semi-structured) into valuable, actionable knowledge. The steps 1 to 4 come under the data
preprocessing stage. Here, data mining is represented as a single step, but it refers to the entire
knowledge discovery process. While mining a database, we can search for trends and data patterns.
For example, the maximum marks scored by students, the minimum marks scored by students, or an
analysis of sales data can be obtained readily.
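As a minimal sketch of the marks example, the summary below is computed in plain Python. The student names, the marks, and the helper name marks_summary are hypothetical and chosen only for illustration.

```python
# Hypothetical records of (student, marks); the values are illustrative only.
marks = [("Asha", 78), ("Ravi", 91), ("Meena", 64), ("John", 85)]


def marks_summary(records):
    """Return the highest-scoring record, the lowest-scoring record, and the mean mark."""
    top = max(records, key=lambda r: r[1])
    bottom = min(records, key=lambda r: r[1])
    mean = sum(m for _, m in records) / len(records)
    return {"max": top, "min": bottom, "mean": mean}


summary = marks_summary(marks)
print(summary)
```

The same pattern applies to sales data: replace the marks with sale amounts and the summary gives the largest sale, the smallest sale, and the average.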
What kind of Data can be mined?

The most basic forms of data for mining are database data, data warehouse data, and transactional
data. Data mining techniques can also be applied to other forms, such as data streams, sequence
data, text data, and spatial data.
#1) Database Data: A database management system (DBMS) consists of a set of interrelated data and a
set of software programs to manage and access that data. A relational database is a collection of
tables, and each table consists of a set of attributes (columns) and tuples (rows).
Mining a relational database searches for trends and data patterns, e.g. the credit risk of
customers based on age, income, and previous credit risk. Mining can also find deviations from the
expected, e.g. a significant increase in the price of an item.
#2) Data Warehouse Data: A data warehouse (DW) is a collection of information gathered from multiple
data sources and stored under a unified schema at a single site. A DW is modelled as a
multidimensional data structure called a data cube, made up of cells and dimensions, which allows
precomputation and faster access to data.
#3) Transactional Data: Transactional data captures transactions. Each record has a transaction ID
and a list of the items involved in the transaction.
#4) Other kinds of Data: Other data can include: time-related data, spatial data, hypertext data, and
multimedia data.
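The three basic forms above can be sketched with plain Python structures. This is only an illustration: the attribute names, the values, and the tiny two-dimensional "cube" are assumptions, not data from a real system.

```python
from collections import Counter, defaultdict

# Relational data: each tuple follows the attributes
# (customer_id, age, income, credit_risk). All values are made up.
customers = [
    (1, 25, 30000, "high"),
    (2, 47, 82000, "low"),
    (3, 33, 51000, "low"),
    (4, 29, 28000, "high"),
]

# A simple pattern query: average income per credit-risk class.
income_by_risk = defaultdict(list)
for _cid, _age, income, risk in customers:
    income_by_risk[risk].append(income)
avg_income = {risk: sum(v) / len(v) for risk, v in income_by_risk.items()}

# Data warehouse data: a tiny cube with dimensions (region, quarter),
# where each cell holds a precomputed total of sales.
sales_facts = [
    ("North", "Q1", 120), ("North", "Q2", 150),
    ("South", "Q1", 90), ("South", "Q2", 110), ("North", "Q1", 30),
]
cube = defaultdict(int)
for region, quarter, amount in sales_facts:
    cube[(region, quarter)] += amount

# Roll up along the quarter dimension to get totals per region.
sales_by_region = defaultdict(int)
for (region, _quarter), total in cube.items():
    sales_by_region[region] += total

# Transactional data: transaction ID mapped to the list of items bought.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["milk"],
}
# Count how often each item appears across all transactions.
item_counts = Counter(item for items in transactions.values() for item in items)

print(avg_income, dict(sales_by_region), item_counts.most_common(1))
```

A real warehouse would precompute many more cells and dimensions, but the idea is the same: aggregate once, then answer queries from the stored cells.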
What Techniques are used in Data Mining?
Data mining is a highly application-driven domain. Many disciplines, such as statistics, machine
learning, pattern recognition, information retrieval, and visualization, influence the development
of data analysis methods.
 1. Statistics: Statistics is the study of the collection, analysis, interpretation, and
presentation of data, carried out using statistical models. For example, statistics can be used to
model noise and missing data, and this model can then be applied to a large data set to identify
noise and missing values in the data.
 2. Machine Learning: ML studies how computer programs can improve their performance based on
data. A main research area is for computer programs to learn automatically to recognize complex
patterns and make intelligent decisions based on the data. Machine learning focuses on accuracy,
whereas data mining also focuses on the efficiency and scalability of mining methods on large data
sets, complex data, etc.
 3. Information Retrieval: Information retrieval (IR) is the science of searching for documents or
for information in documents.
It uses two principles:
• The data to be searched is unstructured.
• The queries are formed mainly by keywords.
By using data analysis and IR together, we can find the major topics in a collection of documents
and also the major topics involved in each document.
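The keyword-query idea can be sketched as follows. This is a simplified illustration, not a production retrieval method: the documents, the stop-word list, and the function names are assumptions, and term frequency is used only as a rough stand-in for "major topics".

```python
import re
from collections import Counter

# Hypothetical unstructured documents.
documents = {
    "doc1": "Data mining finds patterns in large data sets.",
    "doc2": "Information retrieval searches documents using keywords.",
    "doc3": "Keywords help retrieval systems rank documents about data.",
}
# A very small, illustrative stop-word list.
STOP_WORDS = {"in", "using", "about", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def search(query, docs):
    """Rank documents by how often the query keywords occur in them."""
    keywords = tokenize(query)
    scores = {}
    for name, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[k] for k in keywords)
        if score:
            scores[name] = score
    # Highest score first; ties are broken by document name.
    return sorted(scores, key=lambda n: (-scores[n], n))


def top_terms(text, n=2):
    """Return the n most frequent terms as a rough proxy for a document's topics."""
    return [term for term, _ in Counter(tokenize(text)).most_common(n)]


print(search("data", documents))
print(top_terms(documents["doc1"], 1))
```

Real IR systems usually weight terms (for example, by how rare they are across the collection) rather than counting raw occurrences, but the query-by-keywords principle is the same.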
BEST PRACTICES IN DATA MINING:
Data collection and preprocessing are crucial to data mining. They involve cleaning and organizing
data to prepare it for evaluation so your data mining tools can understand it. These steps help
ensure your efforts produce results that match your objectives.
Gather relevant data from various sources, such as your databases, spreadsheets, and logs. Then
preprocess your raw data:
• Clean: Fix any errors, including typos, duplicate entries, and inconsistencies.
• Normalize and standardize: Make the variables in your data comparable, which is important
for many data mining techniques, especially those that compare distances between values.
Normalization scales data to a range between 0 and 1, and standardization transforms data to have
a mean of 0 and a standard deviation of 1.
• Transform: Adjust data to meet specific project needs. This could involve combining data,
creating new variables, or encoding (e.g., turning words or categories into numbers so a
computer can understand them better).
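The normalize and standardize step can be sketched with the standard library alone. The function names and the income values are illustrative assumptions. The standardization here uses the population standard deviation, and some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to have mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20000, 30000, 40000, 50000, 60000]  # made-up values
normalized = min_max_normalize(incomes)
z_scores = standardize(incomes)
print(normalized)
print(z_scores)
```

Normalization is sensitive to outliers because a single extreme value sets the range; standardization is often preferred when such values are present.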
Use a tool like Excel or Google Sheets for basic, manual cleaning to make it easier and quicker for
your team. Or you can use a more advanced platform like Trifacta for complex, automated data
preprocessing.
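When the clean and transform steps are scripted rather than done by hand in a spreadsheet, they might look like the sketch below. The column names, the sample rows, and the cleaning rules (trim spaces, unify casing, drop exact duplicates) are assumptions made for illustration; real projects need rules suited to their own data.

```python
# Hypothetical raw rows with inconsistent casing, stray spaces, and a duplicate.
raw_rows = [
    {"name": " Alice ", "dept": "Sales"},
    {"name": "alice", "dept": "sales"},
    {"name": "Bob", "dept": "HR"},
    {"name": "Carol", "dept": "hr "},
]


def clean(rows):
    """Trim whitespace, unify casing, and drop exact duplicates (first one kept)."""
    seen, cleaned = set(), []
    for row in rows:
        record = (row["name"].strip().title(), row["dept"].strip().upper())
        if record not in seen:
            seen.add(record)
            cleaned.append({"name": record[0], "dept": record[1]})
    return cleaned


def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


rows = clean(raw_rows)
dept_codes, mapping = label_encode([r["dept"] for r in rows])
print(rows)
print(dept_codes, mapping)
```

Integer codes imply an order between categories, so for unordered categories a one-hot encoding is often a better fit.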
