DOI: 10.1145/3159450.3162345

Summit Selection: Designing a Feature Selection Technique to Support Mixed Data Analysis (Abstract Only)

Published: 21 February 2018

Abstract

Since data size is continuously increasing, analyzing large-scale data is considered one of the major research challenges in computational data analysis. Although researchers have proposed numerous approaches, most still struggle to analyze such data efficiently. To overcome this limitation, identifying the optimal number of features is critical. In this paper, we introduce a newly designed feature selection technique, called Summit Selection, which boosts model performance by determining optimal features in noisy mixed data. First, all features are tested individually to determine an initial base feature that satisfies a pre-defined criterion: maintaining the highest performance score. Then, a continuous evaluation builds the model by successively adding or removing features based solely on the performance score obtained with chosen computational models. To show the effectiveness of the proposed technique, we conducted a performance evaluation study on detecting fraudulent activities in the UCSD Data Mining Contest 2009 dataset. We compared our technique with feature extraction techniques such as PCA, the ANOVA test, and Mutual Information (MI). Specifically, multiple machine learning techniques, such as Decision Tree, Random Forest, and k-Nearest Neighbors (KNN), were tested with each feature extraction technique to determine performance differences. As a result, our technique showed about an 8.78% performance improvement in detecting fraudulent activities. Since our technique can be extended to a cloud computing environment, we also performed scalability testing with a well-known distributed computing framework (Apache Spark).
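The abstract gives no pseudocode, but the described procedure (pick the best single base feature, then greedily add or remove one feature at a time whenever doing so raises the performance score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `summit_select` and `toy_score` are hypothetical, and the scoring function stands in for whatever computational model evaluates a feature subset.

```python
def summit_select(features, score, max_rounds=10):
    """Greedy sketch of the described procedure: start from the
    single best-scoring feature, then repeatedly apply whichever
    one-feature addition or removal most improves the score,
    stopping when no single change helps."""
    # Initial base feature: the one with the highest individual score.
    selected = {max(features, key=lambda f: score({f}))}
    best = score(selected)
    for _ in range(max_rounds):
        candidates = []
        # Try adding each unused feature.
        for f in features - selected:
            trial = selected | {f}
            candidates.append((score(trial), trial))
        # Try removing each selected feature (keep at least one).
        if len(selected) > 1:
            for f in selected:
                trial = selected - {f}
                candidates.append((score(trial), trial))
        top_score, top_set = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no single add/remove improves the score
        best, selected = top_score, top_set
    return selected, best

# Hypothetical score: rewards informative features 'a' and 'b',
# penalizes noisy extras (mimicking noisy mixed data).
def toy_score(subset):
    return len(subset & {"a", "b"}) - 0.5 * len(subset - {"a", "b"})

chosen, s = summit_select({"a", "b", "noise1", "noise2"}, toy_score)
# chosen keeps the informative features and drops the noise.
```

In a real setting, `score` would train and evaluate one of the models named above (e.g., a Random Forest) on the candidate feature subset, which makes each round expensive; this is where the paper's extension to a distributed environment such as Apache Spark becomes relevant.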


Published In

SIGCSE '18: Proceedings of the 49th ACM Technical Symposium on Computer Science Education
February 2018
1174 pages
ISBN:9781450351034
DOI:10.1145/3159450
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cloud computing
  2. data analysis
  3. feature selection

Qualifiers

  • Abstract

Conference

SIGCSE '18

Acceptance Rates

SIGCSE '18 Paper Acceptance Rate: 161 of 459 submissions, 35%
Overall Acceptance Rate: 1,595 of 4,542 submissions, 35%
