To celebrate International Love Data Week (Feb. 10–14, 2025), Marina Zhang, engineering and information librarian at the University of Iowa Lichtenberger Engineering Library, spoke to Andres Martinez about his experience in data sharing and reuse; what suggestions he has for those who may own, share and publish data; and what makes data sharing so important.
Martinez is an associate research engineer at IIHR–Hydroscience and Engineering and co-investigator of the National Institute of Environmental Health Sciences (NIEHS)-funded Iowa Superfund Research Program. He has extensive experience in data collection, reuse and analysis.
Like other researchers in the Iowa Superfund Research Program, Martinez is proactive about data sharing and reuse. For instance, he recently compiled and published a data collection with the Pangaea Data Repository (DOI: 10.1594/PANGAEA.9727050), used R to analyze the data (code published at DOI: 10.5281/zenodo.13887687), and published the results in ACS ES&T Water: “Spatial and Temporal Analysis, and Machine Learning-Based Prediction of PCB Water Concentrations in U.S. Natural Water Systems” (DOI: 10.1021/acsestwater.4c00542).
Q: Tell us about your research area.
A: My research focuses on investigating the behavior of persistent organic pollutants in the environment. To achieve this, I perform environmental sampling and develop new sampling methods to improve spatial and temporal resolution. Analytical methods are also a key aspect of my work. Using the data collected, I analyze the occurrence, distribution, and trends of these chemicals. Finally, I develop mathematical models to predict their behavior in various environments.
Q: It seems like the data underlying your paper is obtained from various external sources. Where did the data come from?
A: Yes, that is correct. Most of the data were obtained from U.S. Environmental Protection Agency reports, a few government environmental agency websites, such as those of state Departments of Environmental Quality, and, to a lesser extent, from scientific papers.
Q: Have you encountered any difficulties with data acquisition? If so, what are the difficulties?
A: To obtain most of the U.S. Environmental Protection Agency reports, I had to submit Freedom of Information Act requests and wait for a while, which was time-consuming. However, I think the most challenging part was finding the government agency websites and navigating them to locate the data. That part truly took a lot of time.
Q: Have you encountered any challenges when reusing the data you obtained? If so, what are the challenges?
A: I believe there were two main challenges. First, some of the reports were old PDF files that were not OCRed, so I had to manually extract the data. Second, the tables in these reports and on the websites were not suitable for my analysis. As a result, I had to spend additional time reformatting the data.
Q: What recommendations do you have for those who may own, share, and publish data so that others can have a smooth experience in data acquisition and reuse? What specific tools or platforms that facilitate smooth data sharing and reuse do you suggest?
A: This is a great question. I think the best advice is to talk about your project with someone knowledgeable in data management before you start collecting or working with the data. It can save a lot of time in the long run. R and Python are great tools for wrangling data, for example, for changing the format of the data.
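As an illustration of the kind of reformatting mentioned above, here is a minimal Python sketch that turns a wide table (one column per compound) into a long one (one row per measurement). It is not taken from Martinez's published scripts: the column names, site names, and values are invented for the example, and `wide_to_long` is a hypothetical helper.

```python
import csv
import io

# Hypothetical wide-format table: one row per site, one column per compound.
# All names and values here are made up for illustration.
WIDE_CSV = """site,date,PCB11,PCB52
River A,2020-06-01,0.12,0.30
River B,2020-06-02,0.05,
"""


def wide_to_long(text, id_columns):
    """Turn one-column-per-compound rows into one row per measurement."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column in id_columns or not value:
                continue  # skip identifier columns and missing measurements
            row = {key: record[key] for key in id_columns}
            row["compound"] = column
            row["concentration"] = float(value)
            rows.append(row)
    return rows


long_rows = wide_to_long(WIDE_CSV, ["site", "date"])
for row in long_rows:
    print(row)
```

The long layout keeps each measurement on its own row, which tends to be easier to filter, merge, and plot than a table with a separate column for every compound.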
Q: How have the Libraries assisted you with data management and sharing?
A: It was very helpful. The research center I belong to, the Iowa Superfund Research Program, holds weekly meetings where different projects and researchers present their work. In a few of those meetings, Zhang and Brian Westra from Research Data Services at the UI Libraries gave presentations about data management. From those sessions and from one-to-one conversations, I realized that doing things correctly from the beginning, even if it took more time, would ultimately pay off. For example, creating a unique ID for each sample, which included basic information such as location and date, proved to be very useful. Recording the source of the data as a reference was also very helpful, because it made it easy to go back and verify the data when needed. They also guided me on how and where I can share my data and scripts, ensuring the correct references are created.
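The sample-naming practice described above could look something like the following Python sketch. The ID layout (location, date, running number), the `make_sample_id` helper, and the example values are assumptions made for illustration, not the scheme Martinez actually used.

```python
import re
from datetime import date


def make_sample_id(location, sampled_on, sequence):
    """Build an ID from a location, a sampling date, and a running number."""
    # Keep only letters and digits so the ID is safe in file names and headers.
    location_code = re.sub(r"[^A-Za-z0-9]", "", location).upper()
    return f"{location_code}-{sampled_on:%Y%m%d}-{sequence:03d}"


# Hypothetical record: the ID travels with a note on where the value came from,
# so the entry can be traced back to its source and verified later.
record = {
    "sample_id": make_sample_id("Example River", date(2020, 6, 1), 3),
    "source": "hypothetical agency report, Table 2",
}
print(record["sample_id"])  # EXAMPLERIVER-20200601-003
```

Building the ID from fields that are already recorded means it can be regenerated and checked at any time, and storing the source next to it keeps the link back to the original report.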
Q: It seems like you already published the compiled data underlying your paper by making it available in a data repository. Is it required or encouraged by the journal publisher or your funder like NIEHS? If not, what motivates you to do that? What are the benefits you perceive from sharing the data?
A: This depends on the journal, but in this case, it is encouraged by the journal, whose author guidelines clearly state its Research Data Policy. NIEHS requirements have also led to significant changes in data management and sharing. I think it is important to publish data alongside the paper. It helps organize your data, and others can easily find it if they want to use it. It’s also useful for me to know where the final version of the data is located. Furthermore, you gain two references when you publish both the paper and the data, and if other researchers are only interested in the data, they can cite it separately from the paper.
Research Data Services provides support to researchers across the data lifecycle, from data management and sharing plans to managing data during research to sharing and preserving data and code. Learn how you can receive assistance on its website.