Papers by Rocío Alaiz-rodríguez
Expert Systems with Applications, 2022
Phishing attacks are among the most challenging social engineering cyberattacks because of the large number of entities involved in online transactions and services. In these attacks, criminals deceive users in order to hijack their credentials or sensitive data through a login form that replicates the original website and submits the data to a malicious server. Many anti-phishing techniques have been developed in recent years, using different resources such as the URL and HTML code from legitimate index websites and phishing ones. These techniques have some limitations when predicting legitimate login websites, since usually no login forms are present in the legitimate class used for training the proposed model. Hence, in this work we present a methodology for phishing website detection in real scenarios, which uses URL, HTML, and web technology features. Since there is no updated, multipurpose dataset for this task, we crafted the Phishing Index Login Websites Dataset (PILWD), an offline phishing dataset composed of 134,000 verified samples, which offers researchers a wide variety of data to test and compare their approaches. Since approximately three-quarters of the collected phishing samples request the introduction of credentials, we decided to crawl legitimate login websites to match the phishing standpoint. The developed approach is independent of third-party services, and the method relies on a new set of features used for the very first time in this problem, some of them extracted from the web technologies used by each specific website. Experimental results show that phishing websites can be detected with 97.95% accuracy using a LightGBM classifier and the complete set of 54 selected features when evaluated on the PILWD dataset.
Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient antispam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach to classify spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them into 11 classes using agglomerative hierarchical clustering. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance for the English dataset is achieved with TF-IDF and LR, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.
Mechanical Systems and Signal Processing, 2018
In this work we propose a new online, low-cost and fast approach based on computer vision and machine learning to determine whether cutting tools used in edge profile milling processes are serviceable or disposable based on their wear level. We created a new dataset of 254 images of edge profile cutting heads which is, to the best of our knowledge, the first publicly available dataset of sufficient quality for this purpose. All the inserts were segmented and their cutting edges were cropped, yielding 577 images of cutting edges: 301 functional and 276 disposable. The proposed method is based on (1) dividing the cutting edge image into different regions, called Wear Patches (WP), (2) characterising each one as worn or serviceable using texture descriptors based on different variants of Local Binary Patterns (LBP), and (3) determining, based on the state of these WP, whether the cutting edge (and, therefore, the tool) is serviceable or disposable. We proposed and assessed five different patch division configurations. The individual WP were classified by a Support Vector Machine (SVM) with an intersection kernel. The best patch division configuration and texture descriptor for the WP achieve an accuracy of 90.26% in the detection of disposable cutting edges. These results show a very promising opportunity for automatic wear monitoring in edge profile milling processes.
Expert Systems with Applications, 2022
In this paper, we propose Ranksum, an approach for extractive text summarization of single documents based on the rank fusion of four multidimensional sentence features extracted for each sentence: topic information, semantic content, significant keywords, and position. The Ranksum obtains ...
Knowledge-Based Systems, 2020
Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from the data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant features. A problem that arises in many of these applications is that ...
Lecture Notes in Computer Science, 2008
Lecture Notes in Computer Science, Aug 13, 2008
Springer eBooks, 2009
... y de Sistemas y Automatica, Universidad de León, Campus de Vegazana, 24071 León, Spain e-mail: rocio.alaiz@unileon.es Nathalie Japkowicz School ... the goal of the analysis [3]: (a) complex icons (glyphs) with features that depend on the data values (e.g., Chernoff faces or ...
Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing
Most supervised learning algorithms are based on the assumption that the training data set reflects the underlying statistical model of the real data. However, this stationarity assumption is not always satisfied in practice: quite frequently, class prior probabilities are not in accordance with the class proportions in the training data set. The minimax approach is based on selecting the classifier that minimizes the error probability under the worst-case conditions. We propose a two-step learning algorithm to train a neural network in order to ...
Lecture Notes in Computer Science
Lecture Notes in Computer Science
2008 20th IEEE International Conference on Tools with Artificial Intelligence, 2008
Advances in Data Management, 2009
... y de Sistemas y Automatica, Universidad de León, Campus de Vegazana, 24071 León, Spain e-mail: rocio.alaiz@unileon.es Nathalie Japkowicz School ... the goal of the analysis [3]: (a) complex icons (glyphs) with features that depend on the data values (e.g., Chernoff faces or ...
Lecture Notes in Computer Science, 2009
The fundamental assumption that training and operational data come from the same probability distribution, which is the basis of most learning algorithms, is often not satisfied in practice. Several algorithms have been proposed to cope with classification problems where the class priors may change after training, but they can perform poorly when the class-conditional data densities also change. In this paper, we propose a re-estimation algorithm that makes use of unlabeled operational data to adapt ...
Lecture Notes in Computer Science, 2002
Many supervised learning algorithms are based on the assumption that the training data set reflects the underlying statistical model of the real data. However, this stationarity assumption may be partially violated in practice: for instance, if the cost of collecting data is class dependent, the class priors of the training data set may differ from those of the test set. A robust solution to this problem is to select the classifier that minimizes the error probability under the worst-case conditions. This is known as the minimax strategy. In this ...
2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information, 2007