A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia
In this paper, we propose a survival modeling approach to promoting ranking diversity for biomedical information retrieval. The proposed approach is concerned with finding relevant documents that cover more distinct aspects of a query. First, two ...
Centroid-Based Actionable 3D Subspace Clustering
Actionable 3D subspace clustering from real-world continuous-valued 3D (i.e., object-attribute-context) data promises tangible benefits such as discovery of biologically significant protein residues and profitable stocks, but existing algorithms are ...
Constrained Text Coclustering with Supervised and Unsupervised Constraints
In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised ...
Crowdsourced Trace Similarity with Smartphones
- Demetrios Zeinalipour-Yazti,
- Christos Laoudias,
- Costantinos Costa,
- Michalis Vlachos,
- Maria I. Andreou,
- Dimitrios Gunopulos
Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, accelerometers, etc. This capability allows smartphone users to easily engage in crowdsourced computing services, which contribute to the solution of complex problems in a ...
Customized Policies for Handling Partial Information in Relational Databases
Most real-world databases have at least some missing data. Today, users of such databases are “on their own” in terms of how they manage this incompleteness. In this paper, we propose the general concept of partial information policy (PIP) operator to ...
Decision Trees for Mining Data Streams Based on the McDiarmid's Bound
In data stream mining, the most popular tool is the Hoeffding tree algorithm. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature, the same Hoeffding's bound was ...
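As a rough illustration of how a bound of this kind yields a sample-size threshold, the sketch below uses the classical Hoeffding inequality, epsilon = sqrt(R^2 ln(1/delta) / (2n)). It is not the McDiarmid-based method of the paper, whose details are not given in this listing. The function names and the example values of R, delta, and epsilon are assumptions chosen for illustration.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Classical Hoeffding bound: with probability at least 1 - delta, the
    mean of n observations of a variable with range `value_range` is within
    epsilon of its expectation."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding epsilon drops to `epsilon` or less
    (the bound above solved for n)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# Hypothetical setting: split measure with range 1, confidence 1 - 1e-7,
# and a required gap of 0.05 between the two best attributes.
n_needed = min_examples(value_range=1.0, delta=1e-7, epsilon=0.05)
print(n_needed)
```

In a Hoeffding-tree style learner, a node would be split once the observed gap between the two best candidate attributes exceeds the epsilon computed for the number of examples seen so far.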
Discovering Characterizations of the Behavior of Anomalous Subpopulations
We consider the problem of discovering attributes, or properties, accounting for the a priori stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given population (the inliers). To this aim, we introduce the ...
FoCUS: Learning to Crawl Web Forums
In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain information content that is the ...
Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy
Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, $\mathrm{PMI}_{\max}$, that augments ...
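For reference, standard PMI between two words x and y is log2(p(x, y) / (p(x) p(y))). The sketch below computes it from co-occurrence counts. It shows plain PMI only, not the polysemy-augmented metric of the paper, which this listing does not define. The function name and the counts are hypothetical.

```python
import math


def pmi_from_counts(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Standard PMI in bits, with probabilities estimated as count / total.

    log2( p(x, y) / (p(x) * p(y)) ) simplifies to
    log2( count_xy * total / (count_x * count_y) ).
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(count_xy * total / (count_x * count_y))


# Hypothetical counts: independent words score 0, while words that always
# co-occur score above 0.
print(pmi_from_counts(10, 20, 20, 40))
print(pmi_from_counts(20, 20, 20, 40))
```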
Incentive Compatible Privacy-Preserving Data Analysis
In many cases, competing parties who have private data may collaboratively conduct privacy-preserving distributed data analysis (PPDA) tasks to learn beneficial data models or analysis results. Most often, the competing parties have different ...
Nonnegative Matrix Factorization: A Comprehensive Review
Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains the parts-based representation as well as ...
On Identifying Critical Nuggets of Information during Classification Tasks
In large databases, there may exist critical nuggets—small collections of records or instances that contain domain-specific important information. This information can be used for future decision making such as labeling of critical, unlabeled data ...
Radio Database Compression for Accurate Energy-Efficient Localization in Fingerprinting Systems
Location fingerprinting is a positioning method that exploits the already existing infrastructures such as cellular networks or WLANs. Regarding the recent demand for energy efficient networks and the emergence of issues like green networking, we ...
Semi-Supervised Nonlinear Hashing Using Bootstrap Sequential Projection Learning
In this paper, we study an effective semi-supervised hashing method under the framework of regularized learning-based hashing. A nonlinear hash function is introduced to capture the underlying relationship among data points. Thus, the dimensionality of ...
Spatial Approximate String Search
This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial ...
SVStream: A Support Vector-Based Algorithm for Clustering Data Streams
In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which is based on support vector domain description and support vector clustering. In the proposed algorithm, the data elements of a stream are mapped into a kernel ...
The Move-Split-Merge Metric for Time Series
A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time ...
A User-Friendly Patent Search Paradigm
As an important operation for finding existing relevant patents and validating a new patent application, patent search has attracted considerable attention recently. However, many users have limited knowledge about the underlying patents, and they have ...