Enhance classify.py script to build predictive models for classifying measurements #227

MBARIMike · 2015-12-23T22:57:49Z

This issue is a follow-on to Issue #49 and gets to the heart of performing Machine Learning on data stored in STOQS. There are comments in the createClassifier() method of https://github.com/stoqs/stoqs/blob/master/stoqs/contrib/analysis/classify.py that give pointers on what work needs to be done. Watch the video at https://www.youtube.com/watch?v=4ONBVNm3isI to learn the techniques that can be followed for creating classifiers and doing cross validation.

The key contribution of this issue is to implement a general-purpose predictive classification capability for any measurements in STOQS. The classify.py script provides a starting point for the implementation, and Jupyter Notebooks should be used to demonstrate its use.
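As a rough illustration of the create-a-classifier-then-cross-validate workflow, here is a minimal sketch using only the Python standard library. It is not the code in classify.py: the data are synthetic, the class centers are made up, and the nearest-centroid classifier and the helper names (`make_synthetic`, `fit_centroids`, `predict`, `cross_validate`) are assumptions chosen for brevity. In practice a library such as scikit-learn would supply both the classifiers and the cross-validation utilities.

```python
import random
from collections import defaultdict


def make_synthetic(n_per_class=50, seed=0):
    """Return (features, labels). Values are invented, not STOQS data."""
    rng = random.Random(seed)
    # Hypothetical class centers in (fluorescence, backscatter) space.
    centers = {"diatom": (8.0, 1.0), "sediment": (1.0, 6.0)}
    X, y = [], []
    for label, (fl, bb) in centers.items():
        for _ in range(n_per_class):
            X.append((rng.gauss(fl, 1.0), rng.gauss(bb, 1.0)))
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Train a nearest-centroid model: the mean feature vector per label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (fl, bb), label in zip(X, y):
        s = sums[label]
        s[0] += fl
        s[1] += bb
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda lab: (point[0] - centroids[lab][0]) ** 2
        + (point[1] - centroids[lab][1]) ** 2,
    )


def cross_validate(X, y, k=5, seed=0):
    """k-fold cross validation: train on k-1 folds, score on the held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(centroids, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return scores


X, y = make_synthetic()
scores = cross_validate(X, y, k=5)
print("fold accuracies:", scores)
```

Each fold's accuracy is computed on points the model never saw during training, which is what makes the score a fair estimate of how the classifier would behave on new measurements.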

MBARIMike · 2015-12-23T23:18:37Z

Dorado data from the stoqs_september2013 campaign include chlorophyll/backscatter measurements that are easily labeled, as shown in this figure:

For more background, read "Using STOQS (The spatial temporal oceanographic query system) to manage, visualize, and understand AUV, glider, and mooring data", especially Section V, which is excerpted here:

V. UNDERSTANDING DATA
Today’s oceanographic campaigns produce tens of millions
of diverse measurements; this volume of data is too great for
individual users to understand, even with the effective user
interface that STOQS provides. Though it sounds fanciful, we
expect to soon “teach” STOQS software to “understand” the
data for us. By this we mean that algorithms can be developed
to recognize patterns, classify data, inform us of features, and
make predictions. Doing this sort of work is called machine
learning. To enable machine learning we recently modified the
STOQS schema to support labeling (or tagging) of data. This
is accomplished by inserting records in the MeasuredParameterResource
and SampledParameterResource tables as shown
in Fig. 14.
Storing labeled data in STOQS allows us to use all of its visualization
capabilities to explore the results of the algorithms.
For example, data similar to that shown in Fig. 13 have been
labeled with names of diatom, dino1, dino2, and sediment.
These names are exposed through the UI as selectable items
that may be applied as a filter on the data selection, allowing
for easy spatial-temporal exploration of labeled data within
the UI. Development of machine learning algorithms and data
exploration go hand-in-hand – STOQS has the capabilities we
need to accomplish this task.
The STOQS platform is under continued development. Machine
learning approaches to assess relational patterns within
and among multi-platform physical, chemical, and biological
data already in hand is our primary focus. For example,
implementation of additional data labeling within STOQS will
empower machine learning methods to identify and associate
specific combinations of optical (e.g., backscatter, transmissometry)
or other measurements with biological signals from
specific groups of organisms detected in representative water
samples (e.g., phytoplankton or zooplankton taxonomic
groups). With further development, it may be possible to identify
groups of organisms based solely on their specific physical
and/or chemical signatures that are more easily measured by
in situ electronic sensors.

MBARIMike · 2016-01-19T17:43:35Z

Here is another resource to explore for performing classification in Python. With this you can actually execute the cells on the web without configuring your own STOQS development environment. You will need a Kaggle account, but that is easy to set up.

MBARIMike · 2016-02-03T02:38:08Z

Once a model is developed and cross-validated from the Dorado data shown above, it can be applied to other chlorophyll/backscatter data from the other vehicles that surveyed Monterey Bay during this same campaign. Here is an animated GIF of those data:

http://odss.mbari.org/data/canon/2013_Sep/Products/AUV_Gliders/stoqs_september2013_Fl_vs._bb__red_.gif

With these data classified we can then construct a picture of the spatial and temporal distribution of various kinds of plankton.
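One way to picture that last step is sketched below using only the standard library. Everything in it is assumed for illustration: the `CENTROIDS` dictionary stands in for a trained model, the platform names, dates, depths, and values in `records` are invented, and `classify` and `label_distribution` are hypothetical helpers, not STOQS functions. The sketch labels each measurement and then counts labels per platform and day.

```python
from collections import Counter

# Hypothetical class centroids in (fluorescence, backscatter) space,
# standing in for a model trained and cross-validated on labeled data.
CENTROIDS = {"diatom": (8.0, 1.0), "sediment": (1.0, 6.0)}


def classify(fl, bb, centroids=CENTROIDS):
    """Assign the label of the nearest centroid."""
    return min(
        centroids,
        key=lambda lab: (fl - centroids[lab][0]) ** 2
        + (bb - centroids[lab][1]) ** 2,
    )


# Invented records: (platform, ISO date, depth in m, fluorescence, backscatter)
records = [
    ("glider_a", "2013-09-16", 5.0, 7.5, 1.2),
    ("glider_a", "2013-09-16", 40.0, 1.3, 5.5),
    ("glider_b", "2013-09-17", 10.0, 8.4, 0.7),
    ("glider_b", "2013-09-17", 12.0, 7.9, 1.5),
]


def label_distribution(records):
    """Count predicted labels per (platform, day)."""
    counts = Counter()
    for platform, day, _depth, fl, bb in records:
        counts[(platform, day, classify(fl, bb))] += 1
    return counts


distribution = label_distribution(records)
for key, n in sorted(distribution.items()):
    print(key, n)
```

Grouping by depth or position bins instead of by platform and day would give the spatial side of the same picture.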

MBARIMike · 2016-02-07T04:52:07Z

Welcome @devonrusconi and @vitoupen to the STOQS project! Here is another resource for learning about classification using scikit-learn:

https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb

MBARIMike · 2016-04-14T16:08:15Z

A "comparing classifiers" plot has been added to the Jupyter Notebook at https://github.com/MBARIMike/stoqs/blob/capstone2016/stoqs/contrib/notebooks/classify_data.ipynb

MBARIMike · 2016-05-31T22:43:05Z

The Capstone 2016 contribution PR is a step toward the implementation of a general purpose predictive classification capability. This issue will remain open awaiting contributions toward this goal.

MBARIMike · 2018-12-26T22:26:00Z

For background on using measurement data as proxies for identifying plankton please see this paper:

https://www.sciencedirect.com/science/article/pii/S0079661118300478?via%3Dihub

stoqs added help wanted Bioinformatics capstone Intern Data Science and removed Bioinformatics labels Dec 23, 2015

MBARIMike mentioned this issue May 31, 2016

CSUMB Capstone 2016 contribution #293

Merged

MBARIMike mentioned this issue Aug 4, 2017

Add unsupervised learning capability to STOQS #624

Open

MBARIMike mentioned this issue Mar 10, 2021

Notebook for example use of Parquet data #1139

Merged
