Introduction

Particulate matter (PM) is a major component of air pollution that impacts health and global climate. The fine fraction (<2.5 μm, PM2.5) is responsible for millions of premature deaths annually1,2 and is a risk factor for chronic illness and cancer3,4. Ambient levels in low- and middle-income countries of Asia have been declared public health emergencies1,5,6. PM2.5 has intercontinental spatial impacts and transports persistent organic contaminants from populated regions to remote global regions7,8,9. PM furthermore affects aerosol-sunlight and aerosol-cloud interactions10 and is generally believed to contribute to climate cooling through light-scattering and cloud condensation, although black carbon and brown carbon (BrC; light-absorbing organic matter) components may also lead to warming through absorption of solar radiation11,12. These opposing factors remain poorly constrained in climate models11,12,13. Organic molecules can be a major mass fraction of total PM14, thus a comprehensive molecular characterization of PM2.5 could contribute to improved understanding of global air pollution sources, climate impacts, and health effects.

High-resolution mass spectrometry (HRMS) is an established instrumental technique that can reveal the molecular complexity of PM2.5 organic compounds, however, most substances remain uncharacterized beyond assignment of molecular formula or the presence of certain functional groups15,16,17,18,19,20,21,22. Characterization of PM2.5 samples by HRMS also creates high demands on data processing which has limited previous detailed studies to only a few atmospheric samples. In metabolomics and proteomics, high-throughput workflows for batch-processing of full-scan HRMS chromatographic data (i.e. MS1) and the associated fragmentation spectra (MS2) are now applied to explore the chemical structures of ‘molecular dark matter’ in biological systems23,24,25,26. Such methods have yet to be applied in atmospheric research, but could open new molecular-level windows for studies of air pollution. We hypothesized that new insights into the molecular composition and effects of PM2.5 organic compounds could be achieved by combining comprehensive nontarget HRMS analysis with computational workflows23,24,25,26.

Here, PM2.5 was continuously collected throughout January-April by high-volume sampling (n = 40 samples) at the Maldives Climate Observatory at Hanimaadhoo (MCOH) as part of the South Asian Pollution Experiment, 2018 (SAPOEX-18)11. During these months, the MCOH enables sampling of highly polluted plumes originating from the Indian subcontinent1,11, and occasional pristine air from the Southern Indian Ocean27,28. Polluted air in this geographical hotspot leads to millions of premature deaths1,2,4 and the associated ‘Atmospheric Brown Cloud’ extends south of the equator, influencing atmospheric energy balances over a vast region29. Previous studies of PM in this region have highlighted the major fraction of sunlight-absorbing BrC30.

To achieve a broad molecular characterization of organic compounds, each PM2.5 sample was extracted with a range of solvents and analyzed by gas-chromatography (GC)-HRMS electron ionization (EI) and negative chemical ionization (NCI), or by high-performance liquid chromatography (LC)-HRMS electrospray ionization (ESI) in positive and negative mode. This approach resulted in six unique molecular profiles per sample (Fig. 1a), and revealed high molecular complexity (Fig. 1b–d). Known anthropogenic contaminants were confirmed (Level 1), including legacy persistent organic pollutants, polycyclic aromatic hydrocarbons (PAHs), plasticizers, pesticides, and associated transformation intermediates. However, these identifications represented only a minor portion of all detected molecules. Hence, by integration of open-source cheminformatics and computational workflows, including molecular networking25, MS2-guided in silico structural predictions31, and physicochemical property estimation of optical and toxicological relevance32,33,34, we proceeded to characterize thousands of structures among the remaining unknown molecules. The molecular properties of these structures were evaluated with consideration of potential health and climate impacts.

Fig. 1: Sampling and comprehensive nontarget HRMS analysis of PM2.5.
figure 1

a High-volume sampling (500 L min-1) of PM2.5 at MCOH onto quartz-fiber filters that were sectioned and extracted by three distinct protocols. b The nonpolar organic compound (NPOC) extract was analyzed with GC-HRMS by electron ionization (EI) and negative chemical ionization (NCI), shown here plotted by base peak m/z and Kovats retention index (RI). c Water-soluble organic compound (WSOC) extract and d polar organic compound (POC) extract were analyzed with LC-HRMS in data-independent MS2 acquisition (DIA) with electrospray positive (ESI+) and negative (ESI−) mode, shown here as Kendrick plots with color shading by retention time. e, f Venn diagrams showing the percent of molecular features that were unique to each extract and mode of analysis, demonstrating that a battery of approaches was important to achieve comprehensive molecular analysis of PM2.5.

Results and discussion

Comprehensive nontarget analysis of PM2.5

After quality control and field-blank correction, the combined analyses of 40 PM2.5 samples (120 extracts) revealed 60,030 molecular features (Supplementary Data 2—Dataset). Each feature is defined by a retention time (Rt) in the chromatographic dimension for GC and LC, and for GC analyses a mass spectrum dimension corresponding to mass-to-charge ratio (m/z) with a base-peak ion and deconvoluted MS1 spectrum (EI and NCI), and for LC analyses by a precursor MS1 (full-scan) and corresponding deconvoluted data-independent (DIA) MS2 spectrum.

The number of features detected in water-soluble, polar, and nonpolar organic compound extracts (WSOC, POC, and NPOC, respectively) across the four modes of instrumental analysis (Fig. 1b, c, d GC-EI, GC-NCI, LC-ESI + , LC-ESI-) indicated that a great proportion of molecules were unique to each dataset (Fig. 1e, f), and thus that a battery of approaches was important for achieving comprehensive analysis of PM2.5 (see criteria in Supplementary methods—Estimation of unique features).

The greatest molecular complexity was found in the WSOC and POC extracts (98% of all features). This was not unexpected because hydrophilic compounds represent the largest fraction (up to 50–100%) of aerosol organic compounds11. Moreover, the remote MCOH sampling location allows substantial atmospheric oxidation of transported pollution to occur prior to collection11. Altogether, the large number of samples, and the battery of extracts and analytical modes employed here resulted in a greater view of molecular complexity than reported in previous analyses of atmospheric PM18,22,35.

Back-trajectories and geographical sources of air pollution

Throughout the campaign, back-trajectories showed that sampled air originated from four geographical regions (Fig. 2a, b, Fig. S8 and Supplementary Data 3—Back-trajectories), including three regions to the north that cover reaches of the Indian subcontinent (i.e. Arabian Sea, Indo-Gangetic Plain, and Peninsular India), and a fourth region originating in the Southern Indian Ocean (Fig. 2a, b). The frequency contributions of air from these four back-trajectories (Fig. 2a) were used to model the chemical variation observed in each 48 hr sample; considering all combined features from all fractions (NPOC, POC, WSOC). The resulting multivariate model (Fig. 2c) explained variation among molecular profiles by geographical source, in particular, the first latent variable significantly separated the Southern Indian Ocean cluster from the three subcontinental clusters. Based on satellite data, the back-trajectories of air coming from the three subcontinental regions coincided with much higher tropospheric nitrogen dioxide (NO2) concentrations (Fig. 2d), a generic indicator of air pollution1,36. Consistent with this, samples dominated by air from any of the three subcontinental regions had significantly higher levels of combustion-derived polycyclic aromatic compounds (PACs) (Fig. 2e and Supplementary Data 4—Identifications GC—NPOC), in particular for air originating in the Indo-Gangetic Plain, a global hotspot for air pollution during the dry winter monsoon4,13.

Fig. 2: Geographical sources of PM2.5 and modelling of molecular profiles.
figure 2

a Relative contribution of air masses for each 48 hr sample (n = 40 time-points) originating from the Arabian Sea, Peninsular India, Indo-Gangetic Plain or Indian Ocean, and b the mean path of these four 10-day back-trajectory clusters around the receptor site MCOH (6.80°N, 73.20°E). The satellite image corresponds to the monitoring campaign day 2nd of February, 2018 (NASA Worldview; https://worldview.earthdata.nasa.gov/). c Scores of the orthogonal partial least square (OPLS; 2 + 1 + 0) multivariate model correlating the back-trajectories to the nontarget (LC+GC) HRMS chemical profiles of PM2.5 samples, with Indian Ocean back-trajectory cluster cross-validated [CV]-ANOVA p value = 0.0013. See also principal component analysis (PCA) in Fig. S10, and details in Supplementary data 3—Models and statistics OPLS 2 + 1). Each sample is colored according to the most dominant back-trajectory within the 48 h window. d Tropospheric nitrogen dioxide (NO2) levels measured along the back-trajectories by satellite remote sensing (OMI/Aura; NASA). e Polycyclic aromatic compound abundance (±SE) in GC-HRMS for analytes identified at Level 1 (dibenzo[a,i]pyrene, 4H-cyclopenta[def]phenanthren-4-one, benzo[b]naphto[1,2-d]thiophene). Error bars indicate S.E. Significant differences relative of levels in Arabian Sea (n = 14), Peninsular India (n = 13), and Indo-Gangetic Plain (n = 7) to the Indian Ocean (n = 6) are shown by the Welch t-test (two-sided): *p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001; and ****p ≤ 0.0001 (d.f. = 18, 17, 11).

Molecular annotation and identification

Based on spectral library searches, a total of 318 features (across both GC- and LC-HRMS analyses) were highlighted as putative anthropogenic or biogenic compounds, up to Level 2a identification37 (see criteria in Methods). Annotations of putative anthropogenic substances were selected for confirmation by comparing orthogonal evidence under identical analytical conditions (Rt, MS1 and MS2) to reference standards (Supplementary methods). Across GC and LC datasets, 89 compounds were ultimately confirmed with highest confidence (Level 1; 53 compounds with Rt shift < 0.2 min or RI < 30) or as closely related isomers (36 compounds with Rt shift < 0.4 min or RI < 250) (Supplementary data 4—‘Identifications’, and Supplementary data 6—‘Spectral matches’). For the NPOC extracts, these consisted mostly of n-alkanes and PACs detected by GC-EI-HRMS (Fig. 3), including four oxy-PACs (e.g. 4H-cyclopenta[def]phenanthren-4-one (C15H8O), 9-anthracenecarboxaldehyde (C15H10O), and 7H-benz[de]anthracene-7-one (C17H10O)), six sulfur-containing PACs (i.e. C16H10S, C18H12S, and C20H12S isomers), and a benzocarbazole isomer (C16H11N). Several persistent organic pollutants were detected in the same extracts using GC-NCI-HRMS, including 11 chlorinated compounds e.g., polychlorinated dioxins, and polychlorinated biphenyls (PCBs), and seven brominated flame retardants (BDEs) (Fig. 3). Compounds confirmed by LC-HRMS encompassed a wider variety of chemical classes, including parent commercial substances such as tris-2-butoxyethyl-phosphate (C18H39O7P) (Fig. 3), degradation products of commercial substances, or products of combustion and/or atmospheric oxidation, such as monoethyl phthalate (C10H10O4), phthalic acid (C8H6O4), 4-nitrophenol (C6H5NO3), 2,4-dinitrophenol (C6H4N2O5), benzimidazole (C7H6N2), and 2-hydroxybenzimidazole (C7H6N2O) (Fig. 3). Various herbicides and insecticides were also confirmed (Fig. 3), e.g., DEET (C12H17NO), prometon (C10H19N5O), malaoxon (C10H19O7PS), methamidophos (C2H8NO2PS), as well as their environmental transformation products, e.g. simazine-2-hydroxy (C7H13N5O) and atrazine-2-hydroxy (C8H15N5O)37.

Fig. 3: Anthropogenic substances confirmed in PM2.5.
figure 3

Examples of environmental contaminants detected across different extracts and modes of analyses, all of which were confirmed by comparison with authentic standards (i.e., identification Level-1). Dashed-line intersections indicate example analytes detected in multiple extracts or modes of analysis. Nonpolar hydrocarbons, including PAHs and oxy-PAHs, were detected in the NPOC extracts by GC-HRMS with electron ionization (EI) and/or by negative chemical ionization (NCI). Sulfur-containing PACs were detected and confirmed by GC-EI, while other persistent pollutants (e.g., PCBs, BDEs) by GC-NCI. The WSOC and POC fractions were analyzed by LC-HRMS with each extract injected twice for separate acquisitions in electrospray positive (ESI+) and negative (ESI−) modes. ESI− is optimal for weak organic acids, while ESI+ reveals nitrogenous bases and a range of polar neutral molecules such as alcohols, aldehydes, and ketones. These analyses combined allowed the detection of anthropogenic pollutants representing a wide variety of chemical classes, some potentially originating from atmospheric oxidation of incomplete combustion sources, others synthesized for commercial or industrial use, e.g., plasticizers, generic biocides, insecticides, herbicides, and their transformation products.

Many of the legacy persistent organic pollutants are semi-volatile and partition to the gas-phase, particularly at high ambient temperatures as recorded in the current campaign (mean = 32.9 °C), but several PCBs and BDEs were nevertheless detected sporadically from polluted back-trajectories, suggesting continued emissions in South Asia, with highest detection frequency for 2,2′,4,4′-tetrabromodiphenyl ether (BDE-47; C12H6Br4O) (Fig. 3 and Fig. S11). Samples with subcontinental back-trajectories, particularly those associated with the Arabian Sea and Indo-Gangetic Plain, consistently had higher levels of PACs, plasticizers, biocides, and herbicides; at least 2- to 10-fold higher than in air masses from the Southern Indian Ocean (Fig. 2e, and Fig. S11). Simazine-2-hydroxy (C7H13N5O) (Fig. 3) was detected at highest levels in two samples from the Indian Ocean, suggesting local use of simazine in the Maldives. Conversely, atrazine-2-hydroxy (C8H15N5O) was detected at higher levels (up to 10-fold) in samples from subcontinental regions (Fig. 3, Fig. S11 and S16S17), and we are not aware of any previous reports of this substance in ambient air.

Characterization of PM2.5 molecular dark matter

Only a minor proportion (0.5%) of all molecular features in polluted PM2.5 were identified or putatively annotated by MS2 spectral database matching (see criteria in Methods). Higher but still low annotation rates (up to 1–2%) have been reported for environmental water analysis38,39 and metabolomics24 despite larger and specialized databases. The vast majority of molecules in polluted air could not be matched to known compounds, not only because spectral databases mostly cover biogenic compounds (e.g., anthropogenic substances account for approximately 15% of records in both NIST20 and MassBankEU), but also because the major sources of air pollution include the incomplete combustion of complex fuels and heterogeneous biomass that are prone to rapid transformations in the atmosphere (oxidation, photolysis, hydrolysis) to yield molecular byproducts and secondary organic aerosols40. Hence, we combined cheminformatics and computational strategies that leverage data-rich MS1 and MS2 information of unknown features, and structural information of annotated molecules, to characterize the remaining unknown features in PM2.5 through in silico predictions of structure and physicochemical properties38,39. For all LC-HRMS features, we first calculated molecular formulae by combining two independent approaches, MFAssignR41 and SIRIUS42. Consensus between the two methods resulted in >19,000 features being assigned a molecular formula (i.e., 33% of the LC-HRMS datasets, Level 4 identification)37 (Fig. 4a). Next, features were clustered by molecular networks constructed in GNPS (Global Natural Products Social Molecular Networking)25,26, whereby neighboring molecular features (nodes) are linked by pairwise MS2 spectral similarity (edges) representing an inferred structural analogy (Fig. 4b, c). We then performed a high-throughput structural elucidation of all molecules in the networks with the MS2-guided in silico Network Annotation Propagation (NAP) GNPS workflow31. This workflow leverages the molecular network topology to re-rank the in silico predicted candidates based on joint similarity within a molecular family cluster (e.g., ten first-neighbors), and by combining in silico predictions via the core algorithm MetFrag43 to structural information from MS2 spectral library matches. The structures of 30,389 molecules were predicted in this way to achieve Level 3 identification37 (Supplementary data 4—Identifications NAP). While the NAP workflow already increases the reliability of the in silico first-candidates31, we further proceeded to consider only those predicted structures for which a matching molecular formula had been consistently assigned by all three computational steps (i.e., MFAssignR, SIRIUS, and NAP). By this conservative approach, 10,256 structures (out of 30,389 initial predictions; 34%) were carried forward as a relatively reliable in silico portfolio of small molecules in PM2.5 (Fig. 4b, c and Supplementary data 4—NAP + formula_consensus).

Fig. 4: Molecular-level characterization of complex PM2.5 extracts.
figure 4

High-throughput characterization of PM2.5 molecules was performed by combining information from MS1 and MS2 HRMS data. The integration of multiple cheminformatics approaches is illustrated here for molecular features detected in the WSOC extracts by LC-HRMS with ESI+ (see also Fig. S6). a Molecular formulae were assigned to a proportion of features (45% in this example, see Fig. S5) allowing color-coded visualization in van Krevelen diagrams, and in b molecular networks (GNPS). In the network, each feature is shown as a node linked to other nodes by edges indicating the degree of similarity among deconvoluted MS2 spectra (minimum cosine score = 0.65). In this example, 10,051 molecular features are clustered into 1064 molecular families having inferred structural analogy. c Zoom-in on a molecular family cluster of 15 nitrogen-containing benzenoids. The first-candidate structures from re-ranked in silico predictions (NAP/MetFrag) are shown in blue. The highlighted node (red outline) shows two putative annotations for the same molecular formula (C9H10N2), i.e., 5,6-dimethylbenzimidazole (red structure) from the GNPS library match (Fig. S18), and 1-indanonehydrazone (blue structure) as the top in silico first-candidate ranked by the network consensus (Fig. S20). The structure of 5,6-dimethylbenzimidazole was also predicted in silico, but ranked as the sixth candidate.

Molecular hallmarks of polluted and clean air

After formula assignment and thousands of structural predictions, the contrasting molecular profiles across back-trajectory regions (Fig. 2) presented an opportunity to investigate what types of organic molecules are most characteristic of clean and polluted air. Thus, in a second supervised multivariate model, we collapsed the four back-trajectory matrices into one vector expressing each sample’s polluted fraction (i.e., air originating from any of the three polluted subcontinental trajectories) versus clean air of the Southern Indian Ocean (Fig. 5a). The fraction of molecules that most significantly correlated with polluted or clean air back-trajectories were selected for further investigation (Fig. 5b). These top-VIP (variable importance for the projection) features accounted for 10% of the dataset (6088), 35% (2156) of which had been successfully assigned a molecular formula, and 17% (1049) of which had been assigned a structure consistent with the formula (Supplementary data 4—Identifications).

Fig. 5: Molecular markers of polluted and clean PM2.5.
figure 5

a Multivariate OPLS (1 + 1 + 0) model explaining chemical profiles of polluted (subcontinental) and clean air (Indian Ocean). Samples are clustered based on similarity of molecular feature profiles, and colored according to the relative contribution of polluted (brown) or clean (cyan) air back-trajectory frequencies in each sample (CV-ANOVA p value = 9e–06; see details in Supplementary data 3 – Models and statistics OPLS 1 + 1). b From the OPLS model, the contribution of each molecular feature (n = 60,030) to either polluted or clean air profiles is highlighted in the volcano plot, showing molecules with the highest model-correlation coefficients (|p[corr]| > 0.50) and variable importance for the projection scores (VIP > 1.0). Features are colored by their m/z values. The total number of top VIP features (i.e., VIPs > 1.00, |p[corr]| > 0.50) as significant markers of polluted and clean air are shown (red font) with the proportion of unambiguous molecular formulae assigned (in brackets). c Distribution of these top -VIP features in van Krevelen space, colored by their model-correlation with either polluted (brown) or clean (cyan) air. d Total counts of top-VIP features by heteroatomic formula classes (minimum of five features per class) ordered by difference between polluted and clean air. See also Fig. S12 for the relative contribution of each extract (NPOC, POC, WSOC) and data sources in Supplementary data 3 – Models and statistics, Top VIP x corr).

Overall, molecules in PM2.5 that correlated with polluted back-trajectories (4660 features; 1652 formulae; 775 structures) were three times more numerous than those correlating with clean air (1428 features; 504 formulae; 274 structures) (Fig. 5b), and occupied a broader and more oxidized chemical space (median; polluted = O/C 0.29 ± 0.28 SD, H/C 1.43 ± 0.41 S.D.; clean = O/C 0.20 ± 0.18 S.D., H/C 1.55 ± 0.30 S.D.) (Fig. 5c). The most numerous heteroatomic formula classes in polluted air corresponded to molecules containing one or two nitrogen atoms (N2, N2Ox classes, Fig. 5d), and were represented by in silico predicted structures of e.g., nitrophenols, N-heterocycles, imidazoles, quinazolines, and diazine derivatives (Fig. 6a). These are compound classes previously highlighted as contributors to the light-absorption of BrC in atmospheric aerosols15. Polluted air also included relatively more molecules containing sulfur (S and OxS classes) or mixed sulfur and nitrogen (NOxS), e.g., the predicted structure of the herbicide bentazone (C10H12N2O3S) (Fig. 6a), and highly oxygenated compounds (O4-O8 classes) (Fig. 5d) such as quinolacetic acid (C8H8O4) and dihydroxyterephthtalic acid (C8H6O6) (Fig. 6). The OxS class in polluted samples also included organosulfates (Fig. 6a) which are implicated in cloud condensation processes44.

Fig. 6: Representative molecular structures among the major heteroatomic formula classes in polluted and clean air.
figure 6

Molecular markers in polluted and clean air (OPLS; Fig. 5) were structurally characterized by MS2 spectral library matching (red structures; Level 2) or by MS2 in silico prediction (blue structures; Level 3). Representative example structures are shown to illustrate the results (see also Supplementary data 4 – Identifications). a Polluted air was distinguished by a mixture of nitrogen- and sulfur-containing compounds, and highly oxygenated organic molecules. Several of these structures corresponded to aromatic N-heterocycles (e.g. nitrophenols, imidazoles, and diazine derivatives), which are structures that have been previously linked to light-absorbing BrC. The three structures in red from polluted air (N2, N2O, N2O5) were confirmed Level 1 by authentic reference standards. b Molecular hallmarks of clean air were mono-, di-, and tri-oxygenated organic molecules (O-O3 class), many of which were predicted in silico (Level 3) or annotated on the GNPS spectral library (Level 2) as biogenic compounds. These included terpenes, such as derivatives of alpha-pinene (CH) and other plant volatiles. An interesting exception of a compound in clean air with higher oxygen content (O6). was confirmed as acetyl portentol (Level 1; Fig. S21), a polyketide produced by the marine lichens of Roccella sp. native to Indian coastal habitats. Another biogenic compound was predicted in silico (Level 3) as viscosumic acid, a sesquiterpene produced by the annual herbs of Polygonum sp. native to Nepal, Bangladesh, and north-eastern India.

In contrast, clean air PM2.5 was distinguished by mono-, di-, and tri-oxygenated molecules (O-O3 classes), of which many were annotated through the GNPS spectral library (Level 2a) or predicted in silico (Level 3) as derivatives of alpha-pinene (C10H16), such as dihydroactinidiolide (C11H16O2) and loliolide (C11H16O3), or other biogenic volatiles, e.g., mandelic acid-methyl ester (C9H10O3), and viscosumic acid (C15H20O3), a sesquiterpene produced by Polygonum sp45, native to South East Asia (Fig. 6b and Fig. S11). An interesting natural product with higher oxygen content was identified (Level 1) as acetyl portentol (C19H28O6; Fig. 6b and Fig. S21), a polyketide produced by marine lichens of Roccella sp. native to Indian coastal habitats46,47. Biogenic volatiles, such as (mono)terpenes, that are photochemically oxidized in the atmosphere (including to O-O3 class substances) can contribute to particle nucleation in the absence of pollution48, and are chromophoric components of secondary organic aerosols, for example, oxidized indole derivatives49.

Physicochemical properties of molecules in clean and polluted air

To gain molecular insights to the impacts of PM2.5 on human health and climate, we next employed the structures of top-VIP molecular markers to estimate physiochemical descriptors of toxicological and environmental relevance, e.g., lipophilicity (logP), topological polar surface area (TPSA), water solubility (logS), and molar refractivity (MR). A concordance between physicochemical properties and clusters of molecular families in the molecular networks was observed, whereby closely related structures showed similar values of e.g., logP (Fig. 7a, b). Moreover, for features detected by LC-HRMS (WSOC, POC), the computed logP of predicted molecular structures were strongly correlated with empirical measurements of their hydrophobicity: the reversed-phase HPLC retention times in our analyses (Pearson’s correlation, ρ = 0.64–0.75, Fig. 7c). These important results demonstrate the reliability of the molecular networking approach and of the in silico predicted structures (Supplementary data 4 – Identifications, Descriptors).

Fig. 7: Analysis of physicochemical properties of clean and polluted air molecules.
figure 7

Structures of molecules in clean and polluted air (i.e. VIPs > 1.00, p[corr] > ±0.50; Fig. 5) were used to compute physicochemical descriptors of toxicological and environmental relevance. a, b A subset of the WSOC + molecular network (see Fig. 4) is shown to highlight the tight relationship between structural analogy and physicochemical properties, with features color-coded according to (a) chemical group, and (b) predicted lipophilicity (logP). In b, example structures from the same network are reported with corresponding logP values. c Positive Pearson’s correlation between the predicted logP and the liquid chromatography retention time (Rt) on a C18 stationary phase for POC and WSOC extracts (ESI+ and ESI− data combined). d Two-dimensional density plots visualizing the distribution of logP and topological polar surface area (TPSA, Ų) in clean and polluted air. e Density distribution and median values of logP in clean and polluted air molecules. f, g Scatterplots with molecular features color-coded according to detection in the analyzed extracts, showing a negative Pearson’s correlation between the (f) LogP and O/C, and (g) between MR and solubility (logS). h Density distribution and median values of the index of refraction (nD) in clean and polluted air molecules. All Pearson’s correlation coefficients (ρ) are shown with p values < 2.2e–16. Significant differences between clean and polluted air profile distributions are reported for unpaired Student’s t-test p-values (two-sided).

Compared to clean air, molecules in polluted air occupied a broader physicochemical space, including for logP and TPSA (Fig. 7d) which influence bioavailability and tendency to cross biological membranes. Molecules in polluted air had lower median logP (polluted = 0.85 logP; clean = 2.02 logP; p value = 6.6e–09) (Fig. 7d), but the distribution was bimodal and polluted air also had a higher frequency of structures with extreme lipophilic values (> 5 logP, polluted 10.4%; clean = 5.8%) (Fig. 7e). The most highly lipophilic substances included n-alkanes and PACs detected by GC-HRMS in NPOC extracts (confirmed, Level 1), and related alcohols, aldehydes, fatty acids, and amides predicted in silico (Level 3) by LC-HRMS. The latter substances were often detected by LC-ESI+ in POC extracts, e.g. 8-dotriacontenoic acid (C32H62O2), 22-oxononacosanoic acid (C29H56O3), and docosanamide (C22H45NO), but also in other fractions and ionization modes, e.g. tricosanoylglycine (C25H49NO3) and dimethyl octadecanedioate (C20H38O4) detected by LC-ESI- in POC and WSOC, respectively.

The overall trend of lower logP (i.e. increased polarity) among molecules from polluted regions of the Indian subcontinent (Fig. 7d, e) is consistent with photochemical oxidation of water-soluble BrC during transport from the Indo-Gangetic Plain11. We reported a higher O/C among molecules in polluted air (Fig. 5c, d), and here we further observed a strong inverse correlation between logP and molecular content of oxygen (ρ = −0.62; Fig. 7f). As an illustrative example of oxidation, fluoranthene (C16H10; logP = 4.53) was among the PAHs detected by GC-EI-HRMS (Level 1), while by LC-HRMS (ESI-, WSOC) our in silico workflow predicted the hydroxy-PAH 9H-fluoren-9-ol (C13H10O; logP = 2.52, Level 3). The bulk of relatively polar substances revealed by LC-HRMS at this receptor site may constitute secondary organic aerosols derived from atmospheric processing and photooxidation of anthropogenic and biogenic precursors (Fig. 6). With the current approach, these distributions could be examined along spatial transects from source to receptor regions in future.

Significance to human health and global climate

In silico workflows may be useful for future research into the health impacts of PM2.5 exposure. While PAHs are lipophilic carcinogenic molecules50, semi-polar byproducts of their atmospheric processing span a wider range of physicochemical properties and can be more acutely toxic. Here, a quantitative structure-activity relationship analysis based on descriptors used to predict human absorption, distribution, metabolism, and excretion (e.g. MW, logP, logS, TPSA; Fig. 7c, d)33 revealed that more than a third of the top-VIP molecular markers in polluted air (35%, 272/774 structures) had high predisposition for gastro-intestinal absorption and permeation through the human blood-brain barrier, thus representing potential gut inflammatory51 and neurotoxic52 components of PM2.5. Of these, the majority (147 molecules) were also predicted to be inhibitors of cytochrome P450 enzymes, with representative structures highlighted in Fig. 8. Several of these were confirmed (Level 1) as sulfur-containing PACs (e.g. benzo[b]naphtho[1,2-d]thiophene), oxy-PAHs (e.g. benzanthrone and 4H-cyclopenta[def]phenanthren-4-one) and imidazoles (e.g. benzimidazole and 2-hydroxybenzimidazole), while other in silico predicted N-heterocycles included azoles and azaarenes, such as 11H-Indeno[1,2-b]quinoline (C16H11N), nonylpyrazole (C12H22N2), and 4-azapyrene (C15H9N) (Fig. 8) (Supplementary data 4 – Identifications, Descriptors).

Fig. 8: Molecular hallmarks of polluted air with climate and human health relevance.
figure 8

Example of structures and estimated toxicological and optical properties of organic compounds detected by GC- and LC-HRMS in PM2.5, as molecular markers of polluted air. Structures were confirmed with authentic standards (in red; Level 1 or Level 2 as close isomers) or predicted in silico (in blue; Level 3). Molecules are ordered in rows by decreasing light-scattering capacity expressed as the real component of the complex refractive index (nD). Some of the resulting molecular byproducts include sulfur-, oxygen- and nitrogen-containing molecules (highlighted in yellow) with predicted toxicity to multiple endpoints, i.e. simultaneously CYP450 inhibitors and with high predisposition to human gastro-intestinal absorption and blood-brain barrier permeability.

From a climate perspective, many top-VIP molecular markers of polluted air confirmed here (Level 1), such as PACs (detected by GC-HRMS), and imidazoles and nitrophenols (detected by LC-HRMS) are known light-absorbing chromophores in BrC aerosols53 (Fig. 8). Given that atmospheric aging rapidly alters the molecular components of such organic aerosols, leading to high uncertainty of their optical properties54, high-throughput workflows and in silico prediction of molecular structures could be exploited in future studies to gain further insight to the climate impact for complex mixtures of organic substances in PM2.5. For instance, molecules in polar and water-soluble (WSOC, POC) extracts of PM2.5 at this receptor site had predicted water solubilities (logS) that negatively correlated with molar refractivity (MR, ρ = −0.75; Fig. 7g); a measure of polarizability and tendency for molecules to interact with light (e.g. driving Rayleigh-scattering)55. A significant difference was evident in the optical properties of molecular hallmarks in clean and polluted air (p value = 1.3e–21) (Fig. 7h), highlighted by the predicted light-scattering capacity (i.e. expressed by the real part of the complex refractive index (nD)34,54). Molecules with high-scattering capacity (>1.55 nD) were nearly three times more abundant in polluted air (polluted = 48.1%; clean = 16.9%). Lower-scattering molecules (1.44–1.55 nD) were abundant in clean air profiles, but some were associated with polluted air (e.g. oxy-PAHs, indoles, and derivatives of nitrobenzene, benzimidazole, and quinoline), whereby the lowest nD values were for small-molecules and byproducts with linear aliphatic structures (Fig. 8). This latter trend may indicate atmospheric processing and photooxidation in secondary organic aerosols49, as suggested by previous laboratory experiments54.

Conclusions

Through a battery of comprehensive extractions and complementary LC- and GC- nontarget analyses of PM2.5 from South Asia (720 analyses of 120 extracts from 40 air samples), we resolved and characterized greater molecular complexity of atmospheric aerosols than previously reported. In the environmental context, the intensive analytical workflow facilitated molecular discoveries and high-throughput characterization of thousands of unidentifiable substances across wide spatial scales of air mass origins throughout the continuous 3-month SAPOEX-18 campaign. The molecular complexity and relative profile of 60,030 molecular features varied by source region and pollution levels, and chemical class hallmarks of polluted and clean air were revealed at this receptor site in the Indian Ocean. The dominant nitrogenous organic molecules in polluted air are likely of relevance to health and climate but must be confirmed through further studies using approaches such as those described here.

As anticipated at this remote receptor site, only a small fraction of molecular structures could be confidently annotated by matching to spectral databases (Level 2) or identified with authentic standards (Level 1). We demonstrated that in silico predictions based on the underlying information-rich MS2 spectra can be exploited for the high-throughput structural characterization of thousands of substances in atmospheric samples. High-throughput molecular structure prediction remains an imperfect tool, however, predictions herein were aided by the molecular network topology and validated by strong and statistically significant correlations between the structures’ predicted physicochemical properties (i.e. logP) and our LC retention times.

Overall, the ranges of molecular formula classes, predicted structures, and properties (physicochemical and toxicological) were wider in samples from polluted regions than in pristine air from the Indian Ocean. The molecular profile of organic chemicals in PM2.5 originating from polluted regions were more oxidized (higher O/C and lower H/C), reflecting atmospheric processing and secondary organic aerosol formation56,57. Persistent anthropogenic compounds were also confirmed, and their ubiquitous observation at a relatively remote site 2500 km from the outflow of the Indo-Gangetic Plain is evidence of their long-range atmospheric transport potential58. Higher levels of many compounds, including various PACs (oxy-, nitro-, and sulfur-containing) in polluted air from the Indian subcontinent likely come from massive emissions of incomplete combustion processes, characteristic of the South Asian Atmospheric Brown Cloud, such as from biomass burning (household biofuel and burning of agricultural crop residue) and small-scale fossil fuel combustion (e.g., traffic, kerosene lamps, and diesel generators)59,60,61. Sources of other anthropogenic contaminants may include fugitive releases from industry or urban areas, such as plasticizers, resins, textile dyes, and flame retardants. Biocides, herbicides, and their metabolites (e.g. atrazine-2-hydroxy) may derive directly from agricultural spraying and soil erosion, or burning of post-harvest biomass residues in agricultural regions across the Indian subcontinent62. Atrazine-2-hydroxy is a major metabolite of atrazine in soils63, but has to our knowledge never been reported in air. Previous work at MCOH and in South Asia demonstrate that marine sources of organic carbon are minor contributors to PM and its water-soluble organics11,64. It is plausible that marine sources could be of greater relative importance for molecular signals detected here in PM2.5 from air masses coming from southern Indian Ocean. The sources and global relevance of contaminants emitted from these high-emission regions of South Asia deserves much more attention.

In addition to sources, the present approach opens up multiple avenues for deeper understanding of the atmospheric chemistry of aerosols at the molecular level, of central relevance to health and climate. Simultaneous identification or characterization of known and unknown emissions and transformation products could allow the following of coupled reaction pathways, which may also be linked to mesoscale chemical information, e.g. from aerosol mass spectrometry14. PM contributes to some of the largest uncertainties in our current understanding of the climate, and thus molecular markers of photochemical aging or secondary formation (e.g., carboxylic acids or dicarbonyls) may be comprehensively tracked to better resolve complex photochemistry which is shown to attenuate light-absorption of climate warming BrC in the South Asian outflow11. Here, we observed several hallmark chromophores (e.g., nitro-phenols and PAHs), opening up possibilities for the broad-scale understanding of the molecular origins of light absorption beyond targeted analysis. Furthermore, the present work suggests links between the molecular composition and the real refractive index, a key to computing the scattering properties and overall climate cooling effects of organic aerosols. We also observed and semiquantified molecules with known strong impacts on cloud condensation, e.g. organosulfates, thus there is great potential to explore molecular-level connections to climate with present approaches.

From a health perspective, the wide variety of molecules discovered or described here for polluted South Asian air may contribute to mortality and physiological stress and disease65,66, including adverse birth outcomes67, asthma68, or even increased susceptibility to respiratory infections such as COVID-1969. The most toxic and bioavailable substances in polluted air have yet to be identified in toxicology or health studies, but may be confounded by toxicological interactions between primary emissions (e.g. lipophilic biocides and drug-like molecules) and secondary organics with diverse potentials to activate adverse outcome pathways. Comprehensive, detailed and high-throughput molecular analyses will be necessary to uncover these relationships. Altogether, these results highlight how nontarget analyses and in silico structure predictions can be implemented as advanced tools to explore deeper molecular-level insights and hypotheses on the health and climate impacts of complex organic compound mixtures in airborne PM.

Methods

High-volume PM2.5 sampling

PM2.5 was sampled continuously in 48 h intervals between January 11th and April 4th, 2018, at MCOH (Hanimaadhoo, Haa Dhalu atoll, Maldives, 6.77 °N, 73.18 °E) onto pre-cleaned quartz fiber filters (150 mm Ø) using a high-volume sampler equipped with a PM2.5 selective inlet (DH-77, Digitel Elektronik AG, Volketswil, Switzerland) operating at 500 L min−1. To minimize sampling of local air, a wind-censored system interrupted the sampling when the wind was below 1.2 m s−1 or coming from the southwest (180–270°)11. Field-blanks (n = 4) consisted of PM2.5 filters placed in the air samplers with the pump turned off. Samples and field-blanks were stored frozen in pre-cleaned aluminum envelopes inside sealed bags, and shipped to Stockholm University for analysis.

Sample preparation

PM2.5 samples were cut and extracted by three different protocols (Fig. 1a and Supplementary methods). An accelerated solvent extraction (ASE-350, Thermo Scientific Dionex ASE) was used with hexanes and toluene for nonpolar organic compounds (NPOCs), and with methanol and toluene for polar organic compounds (POCs), and extracts concentrated under nitrogen gas. Water-soluble organic compounds (WSOCs) were extracted by sonication in 40 mL HPLC grade water, followed by centrifugation (see Supplementary methods)11. Multiple isotope-labeled internal standards (Supplementary data 1 – Standards) were spiked to all PM2.5 samples, field blanks, and urban dust reference samples (NIST SRM 1649b) prior to extraction. Sample preparation was performed in a positive pressure clean laboratory.

GC- and LC-HRMS

After silica cleanup, the NPOC extracts (2 µL injection) were analyzed with gas-chromatography (DB5 column) and HRMS (Q Exactive GC Orbitrap, Thermo Scientific) using electron ionization (EI) or negative chemical ionization (NCI) with full scan (44–700 m/z) and 60,000 resolution full-width half-maxima (FWHM) at 200 m/z. For POC and WSOC, extracts were filtered (0.45 and 0.2 µm, respectively) and analyzed with ultra-high-pressure liquid chromatography (UHPLC, Ultimate 3000) and HRMS (Q Exactive Orbitrap HF-X, Thermo Fisher Scientific) using electrospray ionization (ESI) in positive and negative mode. POC extracts (10 µL) were injected directly to the column (Waters Acquity UPLC BEH C18), while WSOC extracts (1000 µL) were injected to online solid-phase extraction prior to analytical separation. The mobile phases were 10 mM ammonium acetate in water (A) and methanol (B) and flow rate 0.4 mL/min (Supplementary methods). LC-HRMS was operated with alternating full scan (90–1000 m/z, 120,000 resolution FWHM at 200 m/z) and four MS2 data-independent analysis (DIA) scans (30,000 FWHM) with variable m/z precursor windows.

Data pre-processing

GC- and LC-HRMS raw data were pre-processed using MS-DIAL (v4.24)70, allowing chromatographic alignment across all samples, basic data reduction (e.g. grouping of C13 isotopes), spectral deconvolution, peak integration, and field-blank filtering (Supplementary data 1 MS-DIAL parameters). All features were blank filtered in MS-DIAL based on a fivefold difference between sample maximum and the average in field blanks (n = 4). For semi-quantitative analysis, integrated peak areas from MS-DIAL were normalized using the areas of different isotope-labeled internal standards (Supplementary data 1 Standards). All normalized feature areas were blank-subtracted by the average area of the corresponding feature detected in the field blanks (negative values were set to zero). Finally, the feature areas were normalized to the air volume accounted by the portion of PM2.5-filter extracted each sample.

Molecular formula assignments

The R Package ‘MFAssignR’41 and the software SIRIUS (v.4.5)42 were used for molecular formula assignment (mass accuracy < 5 ppm). MFAssignR applies element heuristics on the MS1 -level, then subtracts non-oxygen heteroatoms to solve for low-mass moieties (CHO), and finally assigns formula extensions via nested loops of homologous series41. SIRIUS similarly generates molecular formulae for the MS1 and then leverages MS2 fragmentation decision trees (i.e. shared neutral losses) to rank the candidates42. Consensus results were retained, corresponding to every unambiguous formula assigned by MFAssignR (Fig. S5) that matched the first-candidate assigned by SIRIUS, and later in the workflow by NAP (Supplementary data 4 – Identifications, NAP + formula_consensus).

Spectral library annotations

For LC-HRMS (WSOC/POC; ESI + /ESI− modes), spectral library search was performed on the open-access platform GNPS (http://gnps.ucsd.edu) and third-party libraries (including MoNA, https://mona.fiehnlab.ucdavis.edu/; and MassBankEU; https://massbank.eu/) using a minimum of two shared MS2 fragments (cosine ≥ 0.60) and later filtered for an MS1 threshold of 5 ppm (See Fig. S15). For GC-HRMS (NPOC; EI/NCI modes), annotations were performed using a combination of high-throughput spectral library search (MSPepSearch; https://chemdata.nist.gov/) on the NIST20 and our in-house Orbitrap-HRMS library of environmental contaminants, and candidates were considered only for spectral match factors ≥ 700.

Molecular networks and in silico structural elucidation

Feature-based molecular networks25,26 were built in GNPS (ver. 28.2) and visualized using Cytoscape v.3.8.2. For LC-HRMS (WSOC and POC) datasets, parameters were: MS1 and MS2 tolerances of 0.02 Da, minimum spectra similarity cosines of ≥ 0.65, and a minimum of four shared spectral peaks. For in silico structural prediction with the GNPS/NAP workflow31, the following parameters were used: 10 first-neighbors, 5 ppm accuracy, cosine score ≥ 0.65, 10 maximum candidates from structural databases (GNPS, HMDB, SUPNAT, CHEBI), and Consensus + Fusion ranking algorithm. The above workflow was only partly applicable to GC-HRMS data. GC-EI molecular networks were built in GNPS (ver. 30)71 using an ion tolerance of 0.4 Da, spectra similarity cosines ≥ 0.50, and a minimum of five shared spectral peaks, and were used to assist identifications (Fig. S7), together with formula assignments (MFAssignR), Kovats RI, Lee index, and GC-NCI data (Supplementary data 4 - Identifications ‘GC-NPOC’).

Physicochemical properties

Molecular formulae from the in silico predicted structures were translated using the open-source cheminformatics API OpenBabel32 (http://openbabel.org). OpenBabel was also used to compute physicochemical descriptors and derive toxicological endpoints within the pharmacokinetics platform SwissADME33 (http://www.swissadme.ch/). The real (i.e. light-scattering) component of the complex refractive index was computed in Python using the model developed by Bouteloup & Mathieu34.

Back-trajectories and satellite measurements

Ten-day back-trajectories were calculated every six hr using the HYSPLIT model (version 4) of the National Oceanic and Atmospheric Administration (NOAA), at 0:00, 06:00, 12:00, and 18:00 h GMT for 10 d into the past and 100 m height at MCOH (6.80°N, 73.20°E) (Fig. S8 and Supplementary data 3- Back-trajectories). A model was selected with four mean back-trajectories using the clustering algorithm in HYSPLIT. Tropospheric NO2 concentrations were averaged over the period of the campaign and prior 10 days of the first back-trajectory in 0.25° resolution using the Giovanni web application (https://giovanni.gsfc.nasa.gov/giovanni/) to access the National Aeronautics and Space Administration (NASA) OMI/Aura NO2 Cloud-Screened Total and Tropospheric Column dataset72, for a region including all back-trajectories (40°E to 108°E, 40°N to 10°S). Aerosol optical density satellite measurements for the same period are reported in Fig. S9.

Statistics

In total, 41 PM2.5 samples were initially collected during the campaign, but one filter was excluded due to technical problems with the pump at the time of collection. For the 40 samples included in the analysis, quality assessment of sample variation by PCA showed no outliers (Fig. S10). Multivariate analyses (i.e. PCA and OPLS models) were performed in SIMCA v.16 (Umetrics/Satorius); See Supplementary Information, Chemometrics, and Supplementary data 3 – Models and statistics. The R Packages “ggpubr” and “ggplot2” were used for other statistics and data visualization.