Rocío Alaiz-rodríguez

Papers by Rocío Alaiz-rodríguez

Tool insert wear classification using statistical descriptors and neuronal networks

Use of contour signatures and classification methods to optimize the tool life in metal machining

Phishing websites detection using a novel multipurpose dataset and web technologies features

by Rocío Alaiz-rodríguez and Eduardo Fidalgo

Expert Systems with Applications, 2022

Phishing attacks are one of the most challenging social engineering cyberattacks due to the large number of entities involved in online transactions and services. In these attacks, criminals deceive users to hijack their credentials or sensitive data through a login form that replicates the original website and submits the data to a malicious server. Many anti-phishing techniques have been developed in recent years, using different resources such as the URL and HTML code from legitimate index websites and phishing ones. These techniques have some limitations when predicting legitimate login websites since, usually, no login forms are present in the legitimate class used for training the proposed model. Hence, in this work we present a methodology for phishing website detection in real scenarios, which uses URL, HTML, and web technology features. Since there is no up-to-date, multipurpose dataset for this task, we crafted the Phishing Index Login Websites Dataset (PILWD), an offline phishing dataset composed of 134,000 verified samples that offers researchers a wide variety of data to test and compare their approaches. Since approximately three-quarters of the collected phishing samples request the introduction of credentials, we decided to crawl legitimate login websites to match the phishing standpoint. The developed approach is independent of third-party services, and the method relies on a new set of features used for the very first time in this problem, some of them extracted from the web technologies used by each specific website. Experimental results show that phishing websites can be detected with 97.95% accuracy using a LightGBM classifier and the complete set of 54 selected features when evaluated on the PILWD dataset.
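As a rough illustration of the URL side of such a feature set, the sketch below derives a handful of lexical features from a URL using only the Python standard library. The abstract does not enumerate the 54 features, so the ones chosen here (and the function name `url_features`) are illustrative assumptions, not the PILWD feature set. In a full pipeline, a vector like this would presumably be combined with HTML and web technology features before being passed to a classifier such as LightGBM.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Return a few illustrative lexical features of a URL (hypothetical set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        # Many dots in the host can indicate deeply nested subdomains.
        "host_dots": host.count("."),
        "has_at_symbol": "@" in url,
        "digit_count": sum(ch.isdigit() for ch in url),
        # Number of non-empty path segments.
        "path_depth": len([p for p in parsed.path.split("/") if p]),
    }


# Invented example URL, used only to exercise the function.
example = url_features(
    "http://login.example-bank.com.evil.test/secure/login.php?id=123"
)
```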

Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

by Rocío Alaiz-rodríguez and Francisco Janez-Martino

Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient antispam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine, Naïve Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.
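To make the best-performing text representation concrete, here is a minimal TF-IDF sketch in plain Python. It uses raw term frequency and a smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1, which is one common variant and an assumption here; the abstract does not state which weighting or tokenizer the authors used. The toy corpus and the function name `tfidf` are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Map each tokenized document to a sparse {term: tf * idf} vector."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF (assumed variant): terms found everywhere get weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


# Invented toy corpus, already tokenized by whitespace.
corpus = [
    "win a free prize now".split(),
    "claim your free loan now".split(),
    "verify your bank account now".split(),
]
vectors = tfidf(corpus)
```

Terms that occur in every message (here "now") receive the lowest weight, while terms specific to one message (here "prize") receive the highest, which is what lets a linear classifier separate topics.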

A Comparative Study on Feature Selection for a Risk Prediction Model for Colorectal Cancer

Tool wear monitoring using an online, automatic and low cost system based on local texture

Mechanical systems and signal processing, 2018

In this work we propose a new online, low-cost and fast approach based on computer vision and machine learning to determine whether cutting tools used in edge profile milling processes are serviceable or disposable based on their wear level. We created a new dataset of 254 images of edge profile cutting heads which is, to the best of our knowledge, the first publicly available dataset with enough quality for this purpose. All the inserts were segmented and their cutting edges were cropped, obtaining 577 images of cutting edges: 301 functional and 276 disposable. The proposed method is based on (1) dividing the cutting edge image into different regions, called Wear Patches (WP), (2) characterising each one as worn or serviceable using texture descriptors based on different variants of Local Binary Patterns (LBP) and (3) determining, based on the state of these WP, whether the cutting edge (and, therefore, the tool) is serviceable or disposable. We proposed and assessed five different patch division configurations. The individual WP were classified by a Support Vector Machine (SVM) with an intersection kernel. The best patch division configuration and texture descriptor for the WP achieve an accuracy of 90.26% in the detection of the disposable cutting edges. These results show a very promising opportunity for automatic wear monitoring in edge profile milling processes.
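The two building blocks named above, a Local Binary Pattern histogram per patch and an intersection kernel for comparing histograms, can be sketched in plain Python as follows. This uses the basic 8-neighbour LBP on a grayscale patch given as a list of lists; the paper evaluates several LBP variants and patch configurations that are not reproduced here, and the function names and toy patches are assumptions. In practice the kernel values would be supplied to an SVM as a precomputed kernel.

```python
def lbp_histogram(patch):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes."""
    # Neighbour offsets, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(patch), len(patch[0])
    # Border pixels are skipped because they lack a full neighbourhood.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = patch[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if patch[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]


def intersection_kernel(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Invented toy patches: a uniform surface and one with a vertical edge.
flat = [[10] * 5 for _ in range(5)]
edge = [[0, 0, 50, 50, 50] for _ in range(5)]
h_flat, h_edge = lbp_histogram(flat), lbp_histogram(edge)
similarity = intersection_kernel(h_flat, h_edge)
```

For normalised histograms the kernel ranges from 0 (no shared texture codes) to 1 (identical histograms), so patches with similar texture yield larger kernel values.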

RankSum: An Unsupervised Extractive Text Summarization based on Rank Fusion

Expert Systems with Applications, 2022

In this paper, we propose Ranksum, an approach for extractive text summarization of single docume...

An Information Theoretic Approach to Quantify the Stability of Feature Selection and Ranking Algorithms

Knowledge-Based Systems, 2020

Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from the data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant features. A problem that arises in many of these applications is that ...

Visualizing Classifier Performance on Different Domains

Assessing the Impact of Changing Environments on Classifier Performance

Lecture Notes in Computer Science, 2008

A Visualization-Based Exploratory Technique for Classifier Comparison with Respect to Multiple Metrics and Multiple Domains

Lecture Notes in Computer Science, Aug 13, 2008

Visualizing High Dimensional Classifier Performance Data

Springer eBooks, 2009

... y de Sistemas y Automatica, Universidad de León, Campus de Vegazana, 24071 León, Spain e-mail...

Minimax strategies for training classifiers under unknown priors

Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing

Most supervised learning algorithms are based on the assumption that the training data set reflects the underlying statistical model of the real data. However, this stationarity assumption is not always satisfied in practice: quite frequently, class prior probabilities are not in accordance with the class proportions in the training data set. The minimax approach is based on selecting the classifier that minimizes the error probability under the worst-case conditions. We propose a two-step learning algorithm to train a neural network in order to ...

Visualizing Classifier Performance on Different Domains

2008 20th IEEE International Conference on Tools with Artificial Intelligence, 2008

Visualizing High Dimensional Classifier Performance Data

Advances in Data Management, 2009

Improving Classification under Changes in Class and Within-Class Distributions

Lecture Notes in Computer Science, 2009

The fundamental assumption that training and operational data come from the same probability distribution, which is the basis of most learning algorithms, is often not satisfied in practice. Several algorithms have been proposed to cope with classification problems where the class priors may change after training, but they can perform poorly when the class-conditional data densities also change. In this paper, we propose a re-estimation algorithm that makes use of unlabeled operational data to adapt ...
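For context, the kind of prior-shift correction the abstract refers to can be sketched as an EM-style re-estimation of class priors from classifier posteriors on unlabeled operational data. This is a widely used baseline for changes in class priors only. It is not the re-estimation algorithm proposed in the paper, whose description is truncated above, and it does not handle changes in the class-conditional densities. The function name and the toy numbers are assumptions.

```python
def reestimate_priors(posteriors, train_priors, n_iter=100):
    """Estimate operational class priors from posteriors on unlabeled data.

    posteriors: list of per-sample class posteriors produced by a classifier
    trained under train_priors.
    """
    n_classes = len(train_priors)
    priors = list(train_priors)
    for _ in range(n_iter):
        totals = [0.0] * n_classes
        for post in posteriors:
            # E-step: rescale each posterior by the ratio of the current
            # prior estimate to the training prior, then renormalise.
            weighted = [post[k] * priors[k] / train_priors[k]
                        for k in range(n_classes)]
            norm = sum(weighted)
            for k in range(n_classes):
                totals[k] += weighted[k] / norm
        # M-step: the new prior is the mean adjusted posterior.
        priors = [t / len(posteriors) for t in totals]
    return priors


# Invented toy data: a classifier trained on balanced classes is applied
# to operational data in which class 1 is more frequent.
posteriors = [[0.1, 0.9]] * 7 + [[0.9, 0.1]] * 3
estimated = reestimate_priors(posteriors, train_priors=[0.5, 0.5])
```

The adjusted posteriors computed in the E-step can then replace the original classifier outputs when making decisions on the operational data.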

Neural Minimax Classifiers

Lecture Notes in Computer Science, 2002

Many supervised learning algorithms are based on the assumption that the training data set reflects the underlying statistical model of the real data. However, this stationarity assumption may be partially violated in practice: for instance, if the cost of collecting data is class dependent, the class priors of the training data set may be different from those of the test set. A robust solution to this problem is selecting the classifier that minimizes the error probability under the worst-case conditions. This is known as the minimax strategy. In this ...
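The minimax idea can be illustrated on a binary score-threshold classifier. For a fixed threshold with class-conditional error rates e0 and e1, the error probability under a prior p for class 1 is p * e1 + (1 - p) * e0. This is linear in p, so its worst case over all priors is max(e0, e1). The sketch below picks the threshold that minimizes that worst case. It is a toy illustration of the selection criterion only, not the neural training procedure of the paper, and the function names and data are assumptions.

```python
def class_error_rates(scores, labels, threshold):
    """Per-class error rates when predicting 1 for score >= threshold."""
    errors = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for s, y in zip(scores, labels):
        counts[y] += 1
        pred = 1 if s >= threshold else 0
        if pred != y:
            errors[y] += 1
    return errors[0] / counts[0], errors[1] / counts[1]


def minimax_threshold(scores, labels):
    """Threshold minimising the worst-case error over all class priors."""
    candidates = sorted(set(scores))
    return min(
        candidates,
        key=lambda t: max(class_error_rates(scores, labels, t)),
    )


# Invented toy scores with overlapping classes.
scores = [0.1, 0.2, 0.3, 0.5, 0.65, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
best = minimax_threshold(scores, labels)
worst_case = max(class_error_rates(scores, labels, best))
```

At the selected threshold the two class-conditional error rates are equal on this data, which is the typical signature of a minimax solution: no choice of priors can make the classifier perform worse than that common value.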