Unit-3
Feature extraction is the process of identifying and selecting the most important
information or characteristics from a dataset. It is akin to distilling the essential
elements: it simplifies the data and highlights its key aspects while filtering out
less significant details, focusing on what truly matters in the data.
In real-world machine learning problems, there are often many factors (features)
on which the final prediction is based. The higher the number of features, the
harder it becomes to visualize the training set and then work with it. Moreover,
many of these features are often correlated or redundant. This is where
dimensionality reduction algorithms come into play.
Dimensionality reduction can be done by feature selection or by feature extraction, which differ as follows:
Definition: Feature selection means selecting a subset of relevant features from the original set, whereas feature extraction means transforming the original features into a new set of features.
Output: Feature selection yields a subset of the selected original features; feature extraction yields a new set of transformed features.
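To make the distinction concrete, here is a minimal scikit-learn sketch; the iris dataset, SelectKBest and PCA are illustrative choices standing in for the two approaches, not the only options:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 original features

# Feature selection: keep the 2 original features most predictive of y.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: combine all 4 features into 2 new composite axes.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)

Both results have two columns, but the selected ones are original measurements while the extracted ones are new, transformed features.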
A machine learning pipeline chains the steps of a modeling workflow, from raw data to a trained and evaluated model, into a single repeatable process. Its main benefits include:
Efficiency: Pipelines automate many routine tasks, such as data preprocessing, feature
engineering and model evaluation. This efficiency can save a significant amount of time
and reduce the risk of errors.
Scalability: Pipelines can be easily scaled to handle large datasets or complex workflows.
As data and model complexity grow, you can adjust the pipeline without having to
reconfigure everything from scratch, which can be time-consuming.
Version control and documentation: You can use version control systems to track
changes in your pipeline's code and configuration, ensuring that you can roll back to
previous versions if needed. A well-structured pipeline encourages better documentation of
each step.
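As a concrete illustration, a minimal scikit-learn pipeline might look like the sketch below; the dataset and the particular steps are assumptions for demonstration, not a prescribed setup:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One object encapsulates scaling, feature extraction and the model, so the
# whole workflow can be refit, versioned and reused as a single unit.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))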
The stages of a machine learning pipeline:
Data collection: In this initial stage, new data is collected from various data sources, such
as databases, APIs or files. This ingestion usually yields raw data, which may require
preprocessing to be useful.
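A small sketch of this stage, assuming pandas as the ingestion tool; an in-memory CSV stands in for a real collected file, and the commented-out API URL is a hypothetical placeholder:

import io
import pandas as pd

# The same read_csv call would take a real path such as "data/records.csv".
raw_csv = io.StringIO("id,age,city\n1,25,Pune\n2,,Delhi\n3,40,Mumbai\n")
df = pd.read_csv(raw_csv)                               # ingest from a file
# df = pd.read_json("https://example.com/api/records")  # or from a JSON API
print(df)  # note the missing age value: raw data usually needs preprocessing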
Data preprocessing: This stage involves cleaning, transforming and preparing input data for
modeling. Common preprocessing steps include handling missing values, encoding
categorical variables, scaling numerical features and splitting the data into training and
testing sets.
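A compact sketch of these preprocessing steps, assuming a tiny hand-made table whose column names (age, income, city, label) are purely illustrative:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: age and income are numeric, city is categorical.
df = pd.DataFrame({
    "age":    [25, None, 40, 33, 29, 51],
    "income": [40000, 52000, None, 61000, 45000, 70000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
    "label":  [0, 1, 0, 1, 0, 1],
})

df["age"] = df["age"].fillna(df["age"].median())          # handle missing values
df["income"] = df["income"].fillna(df["income"].median())
df = pd.get_dummies(df, columns=["city"], dtype=float)    # encode categoricals

X, y = df.drop(columns="label"), df["label"]
X_train, X_test, y_train, y_test = train_test_split(      # split the data
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)                    # scale on train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Fitting the scaler on the training split alone, then applying it to both splits, avoids leaking information from the test set into training.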
Model selection: In this stage, you choose the appropriate machine learning algorithm(s)
based on the problem type (e.g., classification, regression), data characteristics, and
performance requirements. You may also consider hyperparameter tuning.
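One common way to compare candidate hyperparameters is a grid search, sketched below; the estimator, parameter grid and dataset are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# The candidate values are illustrative, not recommended defaults.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))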
Model training: The selected model(s) are trained on the training dataset using the chosen
algorithm(s). This involves learning the underlying patterns and relationships within the
training data. Pre-trained models can also be used, rather than training a new model.
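A minimal sketch of training a model and then persisting it, so that a previously trained model can be loaded later instead of training a new one; the dataset, estimator and file name are illustrative, and joblib is one common (assumed) way to persist scikit-learn models:

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)           # learn patterns from the training data

joblib.dump(model, "model.joblib")    # persist the fitted model to disk
model = joblib.load("model.joblib")   # later: load it instead of retraining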
Model evaluation: After training, the model's performance is assessed using a separate
testing dataset or through cross-validation. Common evaluation metrics depend on the
specific problem but may include accuracy, precision, recall, F1-score, mean squared error
or others.
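A short sketch of evaluation on a held-out test set and via cross-validation; the model and metrics here are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy: ", accuracy_score(y_test, pred))
print("macro F1: ", f1_score(y_test, pred, average="macro"))

# Alternatively, 5-fold cross-validation on the full data:
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())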
Model deployment: Once its performance is acceptable, the trained model is deployed so that
it can serve predictions on new, real-world data.
Monitoring and maintenance: After deployment, it's important to continuously monitor the
model's performance and retrain it as needed to adapt to changing data patterns. This step
ensures that the model remains accurate and reliable in a real-world setting.
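A toy sketch of such monitoring, assuming labeled feedback batches arrive from production; the accuracy threshold and the helper name are hypothetical:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

RETRAIN_THRESHOLD = 0.85  # assumed acceptable accuracy floor; tune per problem

def needs_retraining(model, X_batch, y_batch):
    """Score the deployed model on a fresh labeled batch and flag degradation."""
    score = accuracy_score(y_batch, model.predict(X_batch))
    return score < RETRAIN_THRESHOLD, score

# Toy demonstration: iris stands in for incoming labeled production data.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)
flag, score = needs_retraining(model, X, y)
print(f"accuracy={score:.2f}, retrain={flag}")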