Contents
Proceedings of SciPy 2024
The 23rd annual SciPy conference will be held in Tacoma, WA at the Tacoma Convention Center, July 8-14, 2024.
SciPy brings together attendees from industry, academia and government to showcase their latest projects, learn from skilled users and developers, and collaborate on code development.
Full proceedings, posters and slides, and organizing committee can be found at https://
Making Research Data Flow With Python
The increasing volume of research data in fields such as astronomy, biology, and engineering necessitates efficient distributed data management. This paper presents the Librarian, a custom framework for data transfer in large academic collaborations, developed for the Simons Observatory.
Josh Borrow, Paul La Plante, James Aguirre, +1
https://doi.org/10.25080/HWGA5253
Echostack: A flexible and scalable open-source software suite for echosounder data processing
Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. We introduce Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation.
Wu-Jung Lee, Valentina Staneva, Landung “Don” Setiawan, +5
https://doi.org/10.25080/WXRH8633
Python-Based GeoImagery Dataset Development for Deep Learning-Driven Forest Wildfire Detection
In recent years, leveraging satellite imagery with deep learning architectures has become an effective approach for environmental monitoring tasks, including forest wildfire detection. This paper presents a Python-based methodology for gathering and using a labeled high-resolution satellite imagery dataset for forest wildfire detection.
Valeria Martin, Derek Morgan, K. Brent Venable
https://doi.org/10.25080/YADT7194
Echodataflow: Recipe-based Fisheries Acoustics Workflow Orchestration
With the influx of large data from multiple instruments and experiments, scientists are wrangling complex data pipelines that are context-dependent and non-reproducible. Echodataflow provides transparent reproducible pipelines that can be edited with text "recipes", scaled and monitored.
Valentina Staneva, Soham Butala, Landung (Don) Setiawan, +1
https://doi.org/10.25080/JXDK4427
THEIA: An Offline Tool for Tradespace Visualization
Tradespace datasets are the result of large parameter sweeps run over numerous design options and can consist of thousands or even millions of design configurations and the corresponding performance metrics. THEIA has been developed for visualizing this complex tradespace data related to the acquisitions process.
Samuel Williams, Scott Christensen, Marvin Brown
https://doi.org/10.25080/RVRR7774
Mamba Models a possible replacement for Transformers?
The quest for more efficient and faster deep learning models has led to the development of various alternatives to Transformers, one of which is the Mamba model. This paper provides a comprehensive comparison between Mamba models and Transformers, focusing on their architectural differences, performance metrics, and underlying mechanisms.
Suvrakamal Das, Rounak Sen, Saikrishna Devendiran
https://doi.org/10.25080/XHDR4700
RoughPy
Rough path theory is a branch of mathematics arising out of stochastic analysis. One of the main tools of rough path analysis is the signature, which captures the evolution of an unparametrised path, including the order in which events occur. RoughPy is our new Python package that aims to change the way we think about sequential streamed data.
Sam Morley, Terry Lyons
https://doi.org/10.25080/DXWY3560
Orchestrating Bioinformatics Workflows Across a Heterogeneous Toolset with Flyte
While Python excels at prototyping and iterating quickly, it’s not always performant enough for whole-genome scale data processing. Flyte, an open-source Python-based workflow orchestrator, presents an excellent way to tie together the myriad tools required to run bioinformatics workflows.
Pryce Turner
https://doi.org/10.25080/DDJJ4932
Supporting Greater Interactivity in the IPython Visualization Ecosystem
Interactive visualizations are invaluable tools for building intuition and supporting rapid exploration of datasets and models. This paper explains the benefits of IPyVuetify with the ability to arbitrarily overlay widgets and plots on top of others to support more flexible details-on-demand techniques.
Nathan Martindale, Jacob Smith, Lisa Linville
https://doi.org/10.25080/GVHT1072
How the Scientific Python ecosystem helps answer fundamental questions of the Universe
The ATLAS experiment at CERN explores vast amounts of physics data to answer the most fundamental questions of the Universe. This paper will describe to a broad audience how a large scientific collaboration leverages the power of the Scientific Python ecosystem to tackle domain-specific challenges and advance our understanding of the Cosmos.
Matthew Feickert, Nikolai Hartmann, Lukas Heinrich, +6
https://doi.org/10.25080/KMXN4784
ITK-Wasm
In recent years, WebAssembly has emerged as a widely-supported technology that offers high performance, compact binary size, support for multiple languages, hardware independence, security, and universal platform support. ITK-Wasm brings WebAssembly’s capabilities to scientific computing by combining the Insight Toolkit (ITK) and WebAssembly to enable high-performance spatial analysis across programming languages and hardware architectures.
Matthew McCormick, Paul Elliott
https://doi.org/10.25080/TCFJ5130
Any notebook served: authoring and sharing reusable interactive widgets
Jupyter Widgets enable interactive code and data visualization in notebooks, but creating and distributing widgets across the Jupyter ecosystem is challenging. The anywidget project introduces a standard and toolset for portable, web-based widgets in various computing environments, simplifying development and extending compatibility beyond Jupyter. Its approach has fostered a rich widget ecosystem, driving the creation of new widgets and adoption of the standard by multiple platforms.
Trevor Manz, Nils Gehlenborg, Nezar Abdennur
https://doi.org/10.25080/NRPV2311
Scikit-build-core
Discover how scikit-build-core revolutionizes Python extension building with its seamless integration of CMake and Python packaging standards. Learn about its enhanced features for cross-compilation, multi-platform support, and simplified configuration, which enable writing binary extensions with pybind11, Nanobind, Fortran, Cython, C++, and more.
Henry Schreiner, Jean-Christophe Fillion-Robin, Matt McCormick
https://doi.org/10.25080/FMKR8387
Model Share AI
Machine learning is revolutionizing a wide range of research areas and industries, but many ML projects never progress past the proof-of-concept stage. To address this problem, we introduce Model Share AI, a platform designed to streamline collaborative model development, model provenance tracking, and model deployment.
Heinrich Peters, Michael Parrott
https://doi.org/10.25080/MDCE8355
Ecological and Spatial Influences on the Genetics of Cumacea (Crustacea: Peracarida) in the Northern North Atlantic
The peracarid taxon Cumacea is an essential indicator of benthic quality in marine ecosystems. This study investigated the influence of environmental (i.e., biological or ecosystemic), climatic (i.e., meteorological or atmospheric), and spatial (i.e., geographic or regional) variables on their genetic variability and adaptability in the Northern North Atlantic, focusing on Icelandic waters.
Justin Gagnon, Nadia Tahiri
https://doi.org/10.25080/NVYF1037
Funix - The laziest way to build GUI apps in Python
Presenting a model or algorithm as a GUI application is a common need in the scientific and engineering community. Funix was created to automatically launch apps from existing Python functions, selecting widgets based on the types of a function's arguments and return values according to the type-to-widget mapping defined in a theme.
Forrest Sheng Bao, Mike Qi, Ruixuan Tu, +1
https://doi.org/10.25080/JFYN3740
Cyanobacteria detection in small, inland water bodies with CyFi
Harmful algal blooms pose major health risks to human and aquatic life. CyFi is an open-source Python package that enables detection of cyanobacteria in inland water bodies using 10-30 m Sentinel-2 imagery and a computationally efficient tree-based machine learning model.
Emily Dorne, Katie Wetstone, Trista Brophy Cerquera, +1
https://doi.org/10.25080/PDHK7238
geosnap: The Geospatial Neighborhood Analysis Package
Understanding neighborhood context is critical for social science research, public policy analysis, and urban planning. We introduce geosnap, the Geospatial Neighborhood Analysis Package, a suite of tools for exploring, modeling, and visualizing the social context and spatial extent of neighborhoods and regions over time.
Elijah Knaap, Sergio Rey
https://doi.org/10.25080/FVWM4182
Continuous Tools for Scientific Publishing
Science requires new mediums to compose ideas and ways to share research findings iteratively, as early as possible and connected directly to software and data. In this paper we discuss two tools for scientific authoring and publishing, MyST Markdown and Curvenote, and illustrate examples of improving metadata, reimagining the reading experience, including computational content, and transforming publishing practices for individuals and societies through automation and continuous practices.
Rowan Cockett, Steve Purves, Franklin Koch, +1
https://doi.org/10.25080/NKVC9349
Improving Code Quality with Array and DataFrame Type Hints
This article demonstrates practical approaches to fully type-hinting generic NumPy arrays and StaticFrame DataFrames, and shows how the same annotations can improve code quality with both static analysis and runtime validation.
Christopher Ariza
https://doi.org/10.25080/WPXM6451
Predx-Tools
Histopathological images, which are digitized images of human or animal tissue, contain insights into disease state. We present PredX-Tools, a suite of simple, easy-to-use Python GUI applications that facilitate analysis of histopathological images and provide a no-code platform for data scientists and researchers to analyze raw and transformed data.
Brian Falkenstein, Shannon Quinn, Chakra Chennubhotla, +2
https://doi.org/10.25080/YCFW5807
Voice Computing with Python in Jupyter Notebooks
Jupyter is a popular platform for writing interactive computational narratives that contain computer code and its output interleaved with prose that describes the code and the output. It is possible to use one’s voice to interact with Jupyter notebooks.
Blaine H. M. Mooers
https://doi.org/10.25080/MCYV2126
Evaluating Probabilistic Forecasters with sktime and tsbootstrap — Easy-to-Use, Configurable Frameworks for Reproducible Science
Evaluating probabilistic forecasts is complex and essential across various domains, yet no comprehensive software framework exists to simplify this task. Despite extensive literature on evaluation methodologies, current practices are fragmented and often lack reproducibility. To address this gap, we introduce a reproducible experimental workflow for evaluating probabilistic forecasting algorithms using the sktime package.
Benedikt Heidrich, Sankalp Gilda, Franz Kiraly
https://doi.org/10.25080/VPNX1595
AI-Driven Watermarking Technique for Safeguarding Text Integrity in the Digital Age
Identifying the sources is vital for generative AI models, like ChatGPT and Bard, due to concerns about copyright infringement and plagiarism. In this paper, we explore text watermarking as a potential solution. We investigate techniques including physical watermarking and logical watermarking.
Atharva Rasane
https://doi.org/10.25080/DHKD1726
Algorithms to Determine Asteroid’s Physical Properties using Sparse and Dense Photometry, Robotic Telescopes and Open Data
Arushi Nath
https://doi.org/10.25080/TWCF2755
Computational Resource Optimisation in Feature Selection under Class Imbalance Conditions
Feature selection is crucial for reducing data dimensionality as well as enhancing model interpretability and performance in machine learning tasks. This study explores the possibility of performing feature selection on a subset of data to reduce the computational burden.
Amadi Gabriel Udu, Andrea Lecchini-Visintini, Steve R. Gunn, +3
https://doi.org/10.25080/TPGN6857
Training a Supervised Cilia Segmentation Model from Self-Supervision
Understanding cilia behavior is essential in diagnosing and treating such diseases, but the task of automatically analyzing cilia is often labor- and time-intensive. In this work, we overcome this bottleneck by developing a robust, self-supervised framework exploiting the visual similarity of normal and dysfunctional cilia.
Seyed Alireza Vaezi, Shannon Quinn
https://doi.org/10.25080/HXCJ6205
Mandala: Compositional Memoization for Simple & Powerful Scientific Data Management
We present mandala, a Python library that largely eliminates the accidental complexity of scientific data management and incremental computing. While most traditional and/or popular data management solutions are based on logging, mandala takes a fundamentally different approach, using memoization of function calls as the fundamental unit of saving, loading, querying and deleting computational artifacts.
Aleksandar Makelov
https://doi.org/10.25080/JHPV7385
multinterp
Multivariate interpolation is a fundamental tool in scientific computing used to approximate the values of a function between known data points in multiple dimensions. Despite its importance, the Python ecosystem offers a fragmented landscape of specialized tools for this task; the multinterp package was developed to address this challenge.
Alan Lujan
https://doi.org/10.25080/FGCJ9164