Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond

Authors:

Kun Zhang, Wei Fan

Knowledge and Information Systems, Volume 14, Issue 3

Pages 299 - 326

https://doi.org/10.1007/s10115-007-0095-1

Published: 18 March 2008

Abstract

Most work on skewed, stochastic, high-dimensional, and biased datasets implicitly addresses each of these problems separately. Recently, the Texas Commission on Environmental Quality (TCEQ) approached us to help build highly accurate ozone level alarm forecasting models for the Houston area, where these technical difficulties come together in a single problem. The characteristics that make this problem challenging and interesting are that the dataset (1) is sparse (72 features, and 2 or 5% positives depending on the criteria for “ozone days”), (2) evolves from year to year, (3) is limited in size (7 years, or around 2,500 data entries), (4) contains a large number of irrelevant features, and (5) is subject to “sample selection bias”, and that (6) the true model is stochastic as a function of measurable factors. Besides solving a difficult application problem, this dataset offers a unique opportunity to explore new and existing data mining techniques, and to provide experience, guidance and solutions for similar problems. Our main technical focus is on how to estimate reliable probabilities given both sample selection bias and a large number of irrelevant features, and how to choose the most reliable decision threshold for predicting an unknown future with a different distribution. On the application side, our chosen approach (bagging probabilistic decision trees and random decision trees) achieves 20% higher recall (correctly detecting 1–3 more ozone days, depending on the year) and 10% higher precision (15–30 fewer false alarm days per year) than state-of-the-art methods used by air quality control scientists, and these results are significant for TCEQ. On the technical side of data mining, extensive empirical results demonstrate that, at least for this problem, and probably for other problems with similar characteristics, these two straightforward non-parametric methods can provide significantly more accurate and reliable solutions than a number of sophisticated and well-known algorithms, such as SVM and AdaBoost, among many others.
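
The abstract names two steps, estimating probabilities with an ensemble and then choosing a decision threshold, but does not say how the threshold is selected. The following is only a minimal sketch of one plausible reading, not the authors' procedure: it averages the positive-class probabilities of several models (as bagging does) and then, on held-out data, picks the threshold with the best precision among those that still reach a required recall. The function names, the recall-constrained selection rule, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def average_probabilities(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average positive-class probabilities across models, as a bagging ensemble does."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def precision_recall(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when a day is flagged iff its probability >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def choose_threshold(
    probs: Sequence[float], labels: Sequence[int], min_recall: float
) -> Optional[Tuple[float, float, float]]:
    """Hypothetical selection rule (an assumption, not taken from the paper):
    among the observed scores, return (threshold, precision, recall) with the
    highest precision subject to recall >= min_recall, or None if no threshold
    qualifies. Ties in precision go to the higher threshold.
    """
    best: Optional[Tuple[float, float, float]] = None
    for t in sorted(set(probs)):
        precision, recall = precision_recall(probs, labels, t)
        if recall >= min_recall and (best is None or precision >= best[1]):
            best = (t, precision, recall)
    return best


# Toy validation scores and labels (invented for illustration; 1 = ozone day).
val_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
val_labels = [0, 0, 0, 1, 0, 0, 1, 1]

# Require every ozone day in the validation data to be caught.
strict = choose_threshold(val_probs, val_labels, min_recall=1.0)
# Accept missing some ozone days in exchange for fewer false alarms.
relaxed = choose_threshold(val_probs, val_labels, min_recall=0.6)
```

With these toy scores the strict setting selects a low threshold and accepts some false alarms, while the relaxed setting selects a higher threshold and misses one ozone day. This is the same recall versus false-alarm trade-off that the reported results quantify. Because the abstract stresses that the future distribution differs from the training data, a threshold tuned this way on past years should be treated as a starting point rather than a guarantee.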

Published In

Knowledge and Information Systems, Volume 14, Issue 3, March 2008, 143 pages

ISSN: 0219-1377

Publisher

Springer-Verlag, Berlin, Heidelberg
