Papers by Ioannis Tsamardinos
F1000Research
Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R as a package. The R package MXM is one such example, which not only offers a variety of feature selection algorithms, but also has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc.; b) it contains a variety of regression models to plug into the feature selection algorithms; c) it includes an algorithm for detecting multiple solutions (many sets of equivalent features); and d) it includes memory-efficient algorithms for high-volume data, i.e., data that cannot be loaded into R. In this paper we qualitatively compare MXM with other relevant packages and discuss its advantages and disadvantages. We also provide a demonstration of its algorithms using real high-dimensional data from various applications.
BMC Bioinformatics, 2018
Background: Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work we extend established constraint-based feature-selection methods to high-dimensional "omics" temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. Results: The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. Conclusions: The use of this algorithm is suggested for variable selection with high-dimensional temporal data.
Causal discovery algorithms can induce some of the causal relations from the data, commonly in the form of a causal network such as a causal Bayesian network. Arguably, however, all such algorithms lag far behind what is necessary for a true business application. We develop an initial version of a new, general causal discovery algorithm called ETIO with many features suitable for business applications. These include (a) ability to accept prior causal knowledge (e.g., taking senior driving courses improves driving skills), (b) admitting the presence of latent confounding factors, (c) admitting the possibility of (a certain type of) selection bias in the data (e.g., clients sampled mostly from a given region), (d) ability to analyze data with missing-by-design (i.e., not planned to measure) values (e.g., if two companies merge and their databases measure different attributes), and (e) ability to analyze data from different interventions (e.g., prior and posterior to an advertisement campaign). ETIO is an instance of the logical approach to integrative causal discovery that has been relatively recently introduced and enables the solution of complex reverse-engineering problems in causal discovery. ETIO is compared against the state-of-the-art and is shown to be more effective in terms of speed, with only a slight degradation in terms of learning accuracy, while incorporating all the features above. The code is available on the mensxmachina.org website.
The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature-selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect SES subsumes and extends previous feature selection algorithms, like the max-min parents and children algorithm. SES is implemented in a function of the same name included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data-analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real-world data.
2008 Computers in Cardiology, 2008
Cancer Informatics
Journal of Machine Learning Research
Feature selection is used extensively in biomedical research for biomarker identification and patient classification, both of which are essential steps in developing personalized medicine strategies. However, the structured nature of the biological datasets and high correlation of variables frequently yield multiple equally optimal signatures, thus making traditional feature selection methods unstable. Features selected based on one cohort of patients may not work as well in another cohort. In addition, biologically important features may be missed due to selection of other co-clustered features. We propose a new method, Tree-guided Recursive Cluster Selection (T-ReCS), for efficient selection of grouped features. T-ReCS significantly improves predictive stability while maintaining the same level of accuracy. Unlike group lasso, T-ReCS does not require a priori knowledge of the clusters, and it can also handle "orphan" features (not belonging to a cluster). T-ReCS can be used wi...
Journal of the American Medical Informatics Association
OBJECTIVE: Finding the best scientific evidence that applies to a patient problem is becoming exceedingly difficult due to the exponential growth of medical publications. The objective of this study was to apply machine learning techniques to automatically identify high-quality, content-specific articles for one time period in internal medicine and compare their performance with previous Boolean-based PubMed clinical query filters of Haynes et al. DESIGN: The selection criteria of the ACP Journal Club for articles in internal medicine were the basis for identifying high-quality articles in the areas of etiology, prognosis, diagnosis, and treatment. Naive Bayes, a specialized AdaBoost algorithm, and linear and polynomial support vector machines were applied to identify these articles. MEASUREMENTS: The machine learning models were compared in each category with each other and with the clinical query filters using area under the receiver operating characteristic curves, 11-point average ...
Temporal uncertainty is a feature of many real-world planning problems. One of the most successful formalisms for dealing with temporal uncertainty is the Simple Temporal Problem with Uncertainty (STP-u). A very attractive feature of STP-u's is that one can determine in polynomial time whether a given STP-u is dynamically controllable, i.e., whether there is a guaranteed means of execution such that all the constraints are respected, regardless of the exact timing of the uncertain events. Unfortunately, if the STP-u is not dynamically controllable, limitations of the formalism prevent further reasoning about the probability of legal execution. In this paper, we present an alternative formalism, called Probabilistic Simple Temporal Problems (PSTPs), which generalizes STP-u to allow for such reasoning. We show that while it is difficult to compute the exact probability of legal execution, there are methods for bounding the probability both from above and below, and we sketch alter...
This work presents a sound probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular, we focus on discovering two kinds of relationships among the labels. The first one concerns pairwise positive entailment: pairs of labels, where the presence of one implies the presence of the other in all instances of a dataset. The second concerns exclusion: sets of labels that do not coexist in the same instances of the dataset. These relationships are represented with a Bayesian network. Marginal probabilities are entered as soft evidence in the network and adjusted through probabilistic inference. Our approach offers robust improvements in mean average precision compared to the standard binary relevance approach across all 12 datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.
The Markov blanket of a target variable is the minimum conditioning set of variables that makes the target independent of all other variables. Markov blankets inform feature selection, aid in causal discovery and serve as a basis for scalable methods of constructing Bayesian networks. We apply decision tree induction to the task of Markov blanket identification. Notably, we compare (a) C5.0, a widely used algorithm for decision rule induction, (b) C5C, which post-processes C5.0's rule set to retain the most frequently referenced variables and (c) PC, a standard method for Bayesian network induction. C5C performs as well as or better than C5.0 and PC across a number of data sets. Our modest variation of an inexpensive, accurate, off-the-shelf induction engine mitigates the need for specialized procedures, and establishes baseline performance against which specialized algorithms can be compared.
Communications in Computer and Information Science, 2013
Lecture Notes in Computer Science, 2002
In Temporal Planning a typical assumption is that the agent controls the execution time of all events such as starting and ending actions. In real domains, however, this assumption is commonly violated and certain events are beyond the direct control of the plan's executive. Previous work on reasoning with uncontrollable events (Simple Temporal Problem with Uncertainty) assumes that we can bound the occurrence of each uncontrollable event within a time interval. In principle, however, there is no such bounding interval since there is always a non-zero probability the event will occur outside the bounds. Here we develop a new, more general formalism called the Probabilistic Simple Temporal Problem (PSTP), following a probabilistic approach. We present a method for scheduling a PSTP maximizing the probability of correct execution. Subsequently, we use this method to solve the problem of finding an optimal execution strategy, i.e., a dynamic schedule where scheduling decisions can be made on-line.
Lecture Notes in Computer Science, 2002
PLOS ONE, 2015
We address the problem of predicting the position of a miRNA duplex on a microRNA hairpin via the development and application of a novel SVM-based methodology. Our method combines a unique problem representation and an unbiased optimization protocol to learn from miRBase 19.0 an accurate predictive model, termed MiRduplexSVM. This is the first model that provides precise information about all four ends of the miRNA duplex. We show that (a) our method outperforms four state-of-the-art tools, namely MaturePred, MiRPara, MatureBayes, and MiRdup, as well as a Simple Geometric Locator, when applied on the same training datasets employed for each tool and evaluated on a common blind test set; (b) in all comparisons, MiRduplexSVM shows superior performance, achieving up to a 60% increase in prediction accuracy for mammalian hairpins, and can generalize very well on plant hairpins without any special optimization; and (c) the tool has a number of important applications, such as the ability to accurately predict the miRNA or the miRNA*, given the opposite strand of a duplex. Its performance on this task is superior to the 2-nt overhang rule commonly used in computational studies and similar to that of a comparative genomic approach, without the need for prior knowledge or the complexity of performing multiple alignments. Finally, it is able to evaluate novel, potential miRNAs found either computationally or experimentally. In relation to recent confidence evaluation methods used in miRBase, MiRduplexSVM was successful in identifying high-confidence potential miRNAs.