The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

An Erratum to this article was published on 20 September 2017

This article has been updated

Abstract

Background

The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, the code base has grown significantly, however, resulting in many complex interdependencies among components and poor performance of many algorithms.

Results

We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism.

Conclusions

This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software.

CDK 2.0 provides new features and improved performance

Background

The open source cheminformatics community has made significant steps forward recently [1] as evidenced by the growing number of tools and underlying toolkits, along with the usage of these software components in a variety of applications. The Chemistry Development Kit (CDK) is one of the tools developed under the aegis of the Blue Obelisk, a movement promoting Open Data, Open Source, and Open Standards in chemistry [1, 2]. The CDK provides data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. Previously documented CDK versions have been widely adopted [3, 4]. Use of the CDK ranges from inclusion of CDK functionality in wrapper platforms such as Cinfony [5], incorporation within the R environment (rcdk [6]), and as plugins for Taverna [7], KNIME [8], Cytoscape (ChemViz2 [9]), and for Microsoft Excel (LICSS [10]). In contrast to scenarios that have made CDK functionality available in larger systems, a number of projects have employed the CDK as a general cheminformatics toolkit. Examples include jCompoundMapper [11], ScaffoldHunter [12, 13], OMG [14], PaDEL [15], ChemDes [16], ReactPRED [17], SMSD [18,19,20], WhichCyp [21], MetaPrint2D [22], MetFrag [23], and the IUPHAR/BPS Guide to Pharmacology [24], BRENDA [25] and QSAR DataBank [26] databases. A number of such tools were initially developed using older versions of the CDK and are updated to new releases as they are made available. Examples include Bioclipse [27, 28] and AMBIT [29,30,31]. The CDK has also played a role in a number of chemical studies, such as finding the maximally bridging rings in chemical structures [32], prediction of organic reactions [33], and bioactivities of compounds [34].

While the CDK has purported to be a general purpose cheminformatics toolkit, older versions were designed by a community with specific applications in mind, primary among them being structure elucidation. In addition, an implicit goal of previous versions was to have the CDK serve as an educational resource to enable students of cheminformatics to understand the underlying algorithms. This resulted in certain functionalities, such as molecular fingerprinting [35, 36], receiving more attention than others, such as stereochemistry. The outcome was significant variance in performance and features throughout the toolkit.

The growth of open source software over the last 10 years is evidence of the ability of communities of developers to develop systems and processes that lead to high quality software systems for long term use. The CDK is no different. The adoption of automatic build systems and quality control methodologies such as unit testing, automated source code validation, and peer review by fellow developers have greatly improved the stability of the library. While it has slowed development somewhat, it has allowed for cleaning up interdependencies between modules of functionality, and importantly, has improved the scalability of the development model. This has resulted in significant new functionality in core application programming interfaces (APIs) while maintaining the quality of code depending on those core APIs.

Examples of new features supported by the improved development model include InChI functionality [37], greatly improved ring detection algorithms [38], improvements to the core atom type perception module that now covers a much more comprehensive set of elements, charge states and radical species than previous versions, a more comprehensive fingerprinting API, new depiction functionality, and many speed and stability improvements.

Implementation and results

This section describes the specifics of new APIs and improvements to pre-existing methods that are available in the latest CDK. We then discuss how we have improved and formalized the development model for the project using unit testing, code review and guidelines for handling version control. Finally we report on the availability of binary distributions of the library, allowing users to include specific modules (and their dependencies) of the CDK in their own projects (as opposed to developers who work on the CDK library itself).

New APIs and improved implementations

We here outline various new and improved APIs in the CDK library since the two previous publications in 2003 and 2006 [3, 4].

Atom typing

Atom type perception is core cheminformatics functionality: the atom types describe chemical features of atoms, such as the number of neighbors, possible formal charges, (approximate) hybridization, electron distribution over orbitals and so on. However, previous versions of the CDK implemented atom type perception as part of different algorithms, resulting in duplicated and sometimes divergent typing schemes. As a result it was cumbersome to add new atom types and implement support for new charged and radical species in a consistent manner.

This CDK version has a new, centralized atom typing framework, removing the perception of atom types from various algorithms. This allows for a consistent and extensive typing scheme that can also be tested independently of other code. The new code defines the atom types using a list that specifies for each type the element symbol, hybridization, formal charge, number of lone pairs, and an enumeration of the bond orders (see Fig. 1). This list of properties captures the information needed for the various algorithms in the CDK. For example, hybridization information can be used in certain aromaticity models (see later), and the lone pair information is needed for resonance structure calculation, which is required, for example, for Gasteiger \(\pi\)-charges.

Fig. 1. Atom type information specified for an \(sp^3\)-hybridized carbon
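As a rough illustration of the property list described above, an atom type entry can be modelled as a small immutable value object. This is a minimal sketch: the class and field names are hypothetical and do not reproduce the CDK data model or the exact contents of Fig. 1. The values used for the example (neutral carbon, no lone pairs, four single bonds) are the textbook description of an \(sp^3\) carbon.

```java
/** Hypothetical value object for illustration; not the CDK atom type interface. */
final class AtomTypeSketch {
    final String symbol;         // element symbol, e.g. "C"
    final String hybridization;  // e.g. "sp3"
    final int formalCharge;
    final int lonePairs;
    private final int[] bondOrders; // one entry per bond: 1 = single, 2 = double, ...

    AtomTypeSketch(String symbol, String hybridization, int formalCharge,
                   int lonePairs, int... bondOrders) {
        this.symbol = symbol;
        this.hybridization = hybridization;
        this.formalCharge = formalCharge;
        this.lonePairs = lonePairs;
        this.bondOrders = bondOrders.clone(); // defensive copy keeps the object immutable
    }

    /** Number of neighbours implied by the bond order list. */
    int neighbourCount() {
        return bondOrders.length;
    }

    /** Sum of the listed bond orders, i.e. the valence used by those bonds. */
    int bondOrderSum() {
        int sum = 0;
        for (int order : bondOrders) sum += order;
        return sum;
    }

    /** Textbook sp3 carbon: neutral, no lone pairs, four single bonds. */
    static AtomTypeSketch carbonSp3() {
        return new AtomTypeSketch("C", "sp3", 0, 0, 1, 1, 1, 1);
    }
}
```

Keeping all properties in one declarative entry is what makes it possible to test the typing scheme separately from the algorithms that consume it.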

A reference implementation, CDKAtomTypeMatcher, has been written to perceive these atom types, and it validates the perception automatically against the properties defined by the ontology. This class handles a variety of types of missing information, as commonly results from various (file) formats; for example, it can handle undefined hydrogen counts and undefined double bond positions if hybridization information is provided instead. That makes the perception code flexible but also more complex. Alternative algorithms for atom typing have not been explored. This reference implementation can be used on a single atom:

figure b

And on a full molecule, in which case the list of types is ordered in the same order as the atoms in the molecule object:

figure c

Stereochemistry

Previous versions of the API represented stereochemistry in different ways. This hindered interconversion between and within file formats. CDK v2.0 standardizes upon a new core representation and procedures have been updated or added to enable duplicate checking, pattern matching, and interconversion.

The preferred representation of stereochemistry is now for it to be stored at the molecule level as a StereoElement. In abstract terms a stereo element describes local geometry using a type, focus, carriers, and configuration (Fig. 2). Currently the most common types of stereochemistry are supported: Tetrahedral, Cis-trans isomerism around a double bond, and Extended Tetrahedral. Rarer types of stereochemistry, such as: Square Planar, Trigonal Bipyramidal, Octahedral, could easily be incorporated into the chosen description given sufficient demand from the community.

Fig. 2. Relative storage of stereochemistry. The type and focus of stereochemistry are fixed for a given stereocenter description, but the carriers and configuration are relative. The multiple rows for each stereochemistry type are different internal representations that would be considered equivalent. In the tetrahedral types, hydrogens may be suppressed in a molecular graph, so the focus is reused in the carriers list as a placeholder

Along with the new stereochemistry representation, algorithms were required in several areas. Generally, a user does not need to invoke these procedures explicitly as they are called as needed within existing APIs:

  • perception from 2D coordinates,

  • perception from 3D coordinates,

  • wedge assignment,

  • graph (sub)isomorphism matching,

  • SMARTS matching, and

  • canonicalization.

The perception from coordinates and wedge assignment algorithms are fundamental for conversion between formats that store stereochemistry implicitly based on coordinates (e.g. molfile [Footnote 1], CML) and explicitly (e.g. SMILES, CML, InChI). Perception from 2D coordinates can optionally identify perspective projections, specifically: Fischer, Haworth, and Chair projections. With the perception of perspective projections enabled, database entries currently considered distinct can be merged (Fig. 3).

Fig. 3. The raw input files of CHEMBL23970 and CHEMBL444314 are displayed (ChEMBL 21). Without perceiving the stereochemistry indicated by the Haworth projection in CHEMBL23970, the database entries are incorrectly considered distinct. Downstream aggregation databases mirror this separation (PubChem CID 5280, CID 65119)
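One ingredient of perception from 2D coordinates can be sketched for the double bond case: two substituents are on the same side of a double bond when they fall on the same side of the line through the two double-bonded atoms. The snippet below is a minimal sketch under that assumption, using the sign of a 2D cross product. The class and method names are hypothetical, and the sketch does not handle wedges, perspective projections, or the many edge cases a production implementation has to consider.

```java
/** Hypothetical sketch: cis/trans perception from 2D coordinates. */
final class DoubleBond2D {
    enum Config { TOGETHER, OPPOSITE, UNDEFINED }

    /** z-component of (b - a) x (p - a): positive if p lies left of a->b, negative if right. */
    static double side(double[] a, double[] b, double[] p) {
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
    }

    /**
     * Configuration of substituent x (attached to atom a) relative to
     * substituent y (attached to atom b) around the double bond a=b.
     * Every point is given as {x, y}.
     */
    static Config perceive(double[] a, double[] b, double[] x, double[] y) {
        double sx = side(a, b, x);
        double sy = side(a, b, y);
        double eps = 1e-9;
        // a substituent that is collinear with the bond carries no information
        if (Math.abs(sx) < eps || Math.abs(sy) < eps) return Config.UNDEFINED;
        return (sx > 0) == (sy > 0) ? Config.TOGETHER : Config.OPPOSITE;
    }
}
```

The explicit `UNDEFINED` result reflects that coordinates only store stereochemistry implicitly: a degenerate layout simply does not encode a configuration.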

Pattern matching of stereochemistry with the described representation is straightforward. Given the atom–atom mapping from a query structure to a target molecule, the focus and carriers of the query stereochemistry are mapped to the target. The configurations are then compared using the permutation parity of this mapping. SMARTS matching requires some special handling for complex cases [39]. For canonicalization, a partial canonical ordering is used to assign an absolute label which can then be integrated into the ordering. The algorithms used for stereochemistry are thoroughly detailed in Chapter 6 of [40]. The perception from projections is based on an algorithm briefly described by [41].
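The parity comparison can be sketched for a tetrahedral centre as follows. Two carrier orderings describe the same arrangement when they differ by an even permutation and carry the same configuration label, or differ by an odd permutation and carry opposite labels. The encoding used here (atoms as integer indices, configuration as +1 or -1) and all names are assumptions made for illustration; they are not the CDK representation.

```java
/** Hypothetical sketch of comparing tetrahedral configurations via permutation parity. */
final class StereoParity {

    /**
     * Parity of the permutation that rearranges {@code from} into {@code to}:
     * +1 for even, -1 for odd. Both arrays must hold the same distinct values.
     */
    static int permutationParity(int[] from, int[] to) {
        int n = from.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            perm[i] = indexOf(to, from[i]);
            if (perm[i] < 0) throw new IllegalArgumentException("carrier sets differ");
        }
        // count inversions; carriers are few (four here), so O(n^2) is fine
        int inversions = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (perm[i] > perm[j]) inversions++;
        return inversions % 2 == 0 ? 1 : -1;
    }

    private static int indexOf(int[] values, int value) {
        for (int i = 0; i < values.length; i++)
            if (values[i] == value) return i;
        return -1;
    }

    /**
     * @param mapping mapping[q] is the target atom index matched to query atom q
     * @return true if the mapped query configuration equals the target configuration
     */
    static boolean sameConfiguration(int[] queryCarriers, int queryConfig,
                                     int[] targetCarriers, int targetConfig,
                                     int[] mapping) {
        int[] mapped = new int[queryCarriers.length];
        for (int i = 0; i < mapped.length; i++)
            mapped[i] = mapping[queryCarriers[i]];
        int parity = permutationParity(mapped, targetCarriers);
        return queryConfig * parity == targetConfig;
    }
}
```

For example, swapping two carriers in the target is an odd permutation, so the match succeeds only if the target's configuration label is also inverted.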

Atomic and molecular signatures

An implementation has been provided of the Signature structure descriptor for molecules [42]. Signatures act as a linear notation, like the SMILES format, for the whole molecule as well as for connected substructures rooted at a single atom. The descriptor can also be canonicalized to provide isomorphism-independent representations [43]. Signatures of depth two can be calculated for atoms with:

figure d

They can also be calculated for full molecules:

figure e

Finally, a signature fingerprint can be calculated for molecules, to allow similarity calculations. This can then be used in QSAR modeling [34, 44,45,46,47,48,49].

Rendering API

A new rendering API has been introduced to make the rendering code independent from Java widget toolkits. The previous code was tightly linked to the Swing toolkit, but other tools use different widget toolkits. For example, Bioclipse is based on Eclipse which uses the Standard Widget Toolkit (SWT) [27].

A second new design goal was introduced to balance between size restrictions of some use cases, such as Java applets, and the rendering functionality. In particular, some functionality, even after modularization, needed considerable parts of the CDK library, making creation of a small-sized applet unfeasible. Therefore, the rendering API was modularized to allow splitting up rendering functionality into modules, with varying CDK dependencies.

Rendering is split up into several generation steps: previous versions split up bond from atom rendering. Heteroatom symbols were simply drawn over lines representing bonds using a white rectangle to mask. A new StandardGenerator has been introduced that does bond and atom rendering at the same time. It incorporates many ideas described by Alex Clark [50, 51]. The depictions generated are of much higher quality and suitable for publication.

Moreover, a simplified high-level API has been introduced that addresses most of the common rendering needs, with the DepictionGenerator class. To depict a molecule loaded into a variable ‘benzene’ the following code can be used:

figure f

Many of the rendering options are available as parameters in the core API and as methods on the DepictionGenerator class. This includes substructure coloring, exemplified with an example reaction shown in Fig. 4. When missing, 2D coordinates are generated on the fly with the new structure diagram layout functionality.

Fig. 4. Integrated example showing the rendering and SMILES parsing functionality. Example from U.S. Patent US 2014 231770 A1 para 287

Structure diagram layout

The structure diagram layout has been improved and the new code solves a number of long standing issues. In particular, collision avoidance has been greatly improved. Figure 5 shows a difference in output between the old code base, with and without overlap resolving, and with the new refinement based implementation [52]. Generation of 2D coordinates is done as shown below:

Fig. 5. The improved structure diagram generation has improved code to resolve overlap. The original SDG code used general heuristics (left) and the OverlapResolver would fine tune the layout to ensure atoms would not be placed at the same location (middle). The new SDG algorithm is able to make more rigorous changes, making the final output much more pleasing (right)

figure g

While the API itself has not been significantly changed, the internals have been revamped. In addition to improved overlap resolution noted above, the engine appropriately handles large ring systems, maintains input stereochemistry, and makes use of a large template library. Templates are useful for laying out substructure. While previous CDK versions partially supported double bond stereochemistry the new engine is more efficient in using this information when generating 2D layouts. Furthermore, the engine assigns wedge bond information based on tetrahedral stereochemistry. These features are exemplified by the following code and the resulting layout depiction in Fig. 6:

Fig. 6. Structure diagram generation for structures with double bond and tetrahedral stereochemistry

figure h

Molecular formula

A chemical formula is the simplest chemical representation of a compound. It defines the number of isotopes or elements that compose a compound without describing how atoms are bonded. With the rise of metabolomics it has become increasingly relevant to have full support for these in cheminformatics libraries [23, 53,54,55,56].

The CDK interfaces can handle several concepts related to chemical formulas: the formula itself, sets of formulas, chemical formula ranges, adducts, isotope containers and patterns, and rules to filter formula sets. These new tools can be used for a number of tasks, including calculating the isotopic pattern from a given chemical formula, determining the possible elemental compositions for a given mass (mass decomposition), and calculating the exact mass from a given chemical formula.

The CDK contains two algorithms for the decomposition of mass ranges into possible elemental formulas. For most inputs, a Round Robin algorithm, originally developed for the SIRIUS metabolite identification tool [57], is used. The algorithm discretizes the real-value mass decomposition problem into an integer-value knapsack problem [58]. It first computes a dynamic programming table and then backtracks within it to generate matching formulas [59, 60]. Data for the Round Robin algorithm is stored in an extended residue table [61], resulting in a low memory footprint of several kilobytes. For certain problem instances, such as very large mass values (above 400,000 Da) or mass range span larger than 1 Da, the Round Robin algorithm is not suitable and CDK falls back to an optimized full enumeration search method, originally developed as part of the MZmine 2 framework for mass spectrometry data processing [54, 55].
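The idea of decomposing a mass into candidate elemental formulas can be illustrated with a deliberately naive full enumeration over C, H, N, and O. This is a minimal sketch only: it is neither the Round Robin algorithm nor the optimized enumeration used by the CDK, the class and method names are hypothetical, and no chemical plausibility rules are applied to the hits. The monoisotopic masses are rounded standard values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: brute-force mass decomposition over C, H, N and O. */
final class NaiveFormulaEnumerator {
    // Monoisotopic masses in Da of 12C, 1H, 14N and 16O (rounded).
    static final double C = 12.0, H = 1.00782503, N = 14.0030740, O = 15.9949146;

    static double mass(int c, int h, int n, int o) {
        return c * C + h * H + n * N + o * O;
    }

    /** All formulas with element counts in [0, max] whose mass lies within target +/- tol. */
    static List<String> decompose(double target, double tol,
                                  int maxC, int maxH, int maxN, int maxO) {
        List<String> hits = new ArrayList<>();
        for (int c = 0; c <= maxC; c++) {
            for (int n = 0; n <= maxN; n++) {
                for (int o = 0; o <= maxO; o++) {
                    double partial = mass(c, 0, n, o);
                    if (partial > target + tol) break; // more oxygens only add mass
                    for (int h = 0; h <= maxH; h++) {
                        double m = partial + h * H;
                        if (m > target + tol) break;   // more hydrogens only add mass
                        if (m >= target - tol) hits.add(format(c, h, n, o));
                    }
                }
            }
        }
        return hits;
    }

    static String format(int c, int h, int n, int o) {
        StringBuilder sb = new StringBuilder();
        append(sb, "C", c);
        append(sb, "H", h);
        append(sb, "N", n);
        append(sb, "O", o);
        return sb.toString();
    }

    private static void append(StringBuilder sb, String symbol, int count) {
        if (count == 0) return;
        sb.append(symbol);
        if (count > 1) sb.append(count);
    }
}
```

Calling `NaiveFormulaEnumerator.decompose(180.0634, 0.001, 10, 20, 5, 10)` searches for formulas near the monoisotopic mass of glucose. The nested loops make the cost grow with the product of the allowed element counts, which is the scaling that integer knapsack formulations such as Round Robin are designed to avoid.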

The following code calculates all possible chemical formulas for a given accurate mass, within allowed counts for each element:

figure i

This gives the following output:

figure j

To evaluate the performance of the CDK molecular formula generator, we compared its runtimes to those of the classic, full enumeration-based HR2 formula generator [62] and those of a recently developed Parallel Formula Generator (PFG) [63] (Table 1). As inputs, we used two sets of 10,000 small (<500 Da) and 20 large (>1500 and <3500 Da) molecular mass values downloaded from the Global Natural Products Social Molecular Networking database [64]. The mass tolerance was set to 0.001 or 0.01 Da. The CDK v2.0’s Round-Robin formula generator outperformed the other methods in all cases, despite running in a single thread (PFG utilizes multiple threads). The performance gain of the Round Robin algorithm was particularly apparent when narrow mass ranges were queried (e.g. ±0.001 Da), thus showing its suitability for applications in high-resolution mass spectrometry.

Table 1 Evaluation of molecular formula generators

SMILES parser and generator

The SMILES [65] parsing has been replaced by code from the external Beam project [66]. This BSD-licensed SMILES parser is a complete implementation of the SMILES and OpenSMILES (http://opensmiles.org/) specifications by one of the authors (including stereochemistry), and is independent of the CDK library. The SmilesParser API uses this library underneath, and the Beam API is hidden by this class. Basic usage is as follows:

figure k

The most significant functional change here is that the SMILES parser automatically locates the positions of double bonds in de-localised aromatic systems (Kekulisation). If this invariant cannot be met the SMILES is rejected as invalid. It is possible to override this check but this is strongly discouraged as rejected molecules do not have a fixed formula or tautomer [40].
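Kekulisation can be viewed as a perfect matching problem on the aromatic subgraph: every aromatic atom that needs a double bond must be paired with exactly one neighbour that also needs one. The following is a minimal backtracking sketch under the assumption that every atom passed in requires exactly one double bond (atoms that contribute a lone pair instead would be excluded by the caller). The names are hypothetical and the search is exponential in the worst case; it is shown only to make the invariant concrete and is not the algorithm used by Beam or the CDK.

```java
import java.util.Arrays;

/** Hypothetical sketch: kekulisation as perfect matching by backtracking. */
final class KekuleSketch {

    /**
     * @param adj adj[i] lists the neighbours of atom i in the aromatic subgraph
     * @return partner[i] = atom double-bonded to atom i, or null if no assignment exists
     */
    static int[] kekulise(int[][] adj) {
        int[] partner = new int[adj.length];
        Arrays.fill(partner, -1);
        return match(adj, partner) ? partner : null;
    }

    private static boolean match(int[][] adj, int[] partner) {
        // pick the first atom that still lacks a double bond
        int u = -1;
        for (int i = 0; i < partner.length; i++) {
            if (partner[i] < 0) { u = i; break; }
        }
        if (u < 0) return true; // every atom is matched
        for (int v : adj[u]) {
            if (partner[v] >= 0) continue;
            partner[u] = v;
            partner[v] = u;
            if (match(adj, partner)) return true;
            partner[u] = -1; // undo and try the next neighbour
            partner[v] = -1;
        }
        return false;
    }

    /** Adjacency lists of a simple ring with n atoms. */
    static int[][] ring(int n) {
        int[][] adj = new int[n][];
        for (int i = 0; i < n; i++)
            adj[i] = new int[] { (i + n - 1) % n, (i + 1) % n };
        return adj;
    }
}
```

A six-membered ring of such atoms (as in benzene) has a valid assignment, whereas a five-membered ring in which all five atoms need a double bond has none, which is the kind of input that is rejected as invalid.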

The SMILES generation API has also been simplified and made more flexible, and it can now produce several different flavours. The SmiFlavor flags are used to control the type of SMILES generated. Historically, the terms generic, isomeric, unique, and absolute have been used in other toolkits; these are also supported.

figure l

Support for ChemAxon Extended SMILES (CXSMILES) [67] layers has been added to CDK v2.0. CXSMILES provides a powerful means of including auxiliary information in a SMILES string such as 2D/3D coordinates, atom values, generic labels, repeat units, and positional variation. CXSMILES achieves this by placing additional information between pipe characters (‘|’) in the SMILES title field. Information is annotated based on the order of the atoms in the SMILES string. An example CXSMILES for a generic structure is shown below.

figure m
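The placement of the extension between pipe characters in the title field can be illustrated with a small splitter that separates the plain SMILES, the text between the pipes, and any remaining title. This is a sketch under simplifying assumptions: the class is hypothetical, the layer text is kept opaque rather than interpreted, and it is assumed that the layer text itself contains no pipe character. It is not the CDK parser.

```java
/** Hypothetical sketch: separating a SMILES line into SMILES, CXSMILES layers and title. */
final class CxSmilesSplit {
    final String smiles; // plain SMILES part
    final String layers; // raw text between the pipes, "" if there is no extension
    final String title;  // remaining title text, "" if absent

    private CxSmilesSplit(String smiles, String layers, String title) {
        this.smiles = smiles;
        this.layers = layers;
        this.title = title;
    }

    static CxSmilesSplit parse(String line) {
        String s = line.trim();
        int ws = indexOfWhitespace(s);
        if (ws < 0) return new CxSmilesSplit(s, "", "");
        String smiles = s.substring(0, ws);
        String rest = s.substring(ws).trim();
        if (rest.startsWith("|")) {
            int close = rest.indexOf('|', 1);
            if (close > 0) {
                return new CxSmilesSplit(smiles,
                                         rest.substring(1, close),
                                         rest.substring(close + 1).trim());
            }
        }
        // no well-formed extension: treat everything after the SMILES as the title
        return new CxSmilesSplit(smiles, "", rest);
    }

    private static int indexOfWhitespace(String s) {
        for (int i = 0; i < s.length(); i++)
            if (Character.isWhitespace(s.charAt(i))) return i;
        return -1;
    }
}
```

For an illustrative input such as `CCO* |$;;;R1$| example`, the splitter returns the SMILES `CCO*`, the layer text `$;;;R1$`, and the title `example`. Interpreting the layer text then relies on the atom order of the SMILES string, as described above.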

Substructure and SMARTS matching

Substructure matching is a fundamental cheminformatics operation and plays a key role in many other functions such as fingerprint and descriptor generation, and atom typing. Since CDK v1.2, functionality has been added to handle the SMARTS query language. The SMARTS language is well supported, including features such as stereochemistry, component grouping, and atom maps (to match reaction transformations). A new Pattern API has been added to CDK v2.0, which simplifies finding, filtering, and transforming search results. The API is immutable, allowing a pattern to be initialized once and then matched against several molecules or reactions across multiple threads. During initialization the pattern is inspected to determine which invariants will be needed (e.g. ring size), and only the required invariants are calculated. The internal matching algorithms provide a lazy iterator, such that the next match is only computed when it is needed. The API handles reactions in addition to molecules, and both can be specified as either queries or targets.

figure n

CDK v2.0 includes large improvements to algorithm efficiency. This is emphasised in the systematic benchmark of MACCS-like 166 key generation (Table 5). The efficiency improvements are a combination of optimising data structures and key molecule processing algorithms (e.g. kekulisation and aromaticity) needed before a SMARTS match can be run [40, 68, 69].

Ring finding

Ring finding is another key functionality in a cheminformatics library, and the CDK has a long history of ring finding [38, 70]. Specifically, non-redundant ring sets have seen particular interest, such as the smallest set of smallest rings, for which the CDK implements two classical algorithms [70, 71]. Recent work has implemented a new, faster algorithm, allowing searching for various types of (non-redundant) ring sets [38]. These are available via the new Cycles API:

figure o
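The number of rings in a smallest set of smallest rings is fixed by the graph itself: it equals the number of bonds minus the number of atoms plus the number of connected components (the cyclomatic number). The sketch below computes that count with a union-find structure. It is an illustration with hypothetical names; it does not find the rings themselves and is not part of the Cycles API.

```java
/** Hypothetical sketch: size of a smallest set of smallest rings from graph counts. */
final class RingCount {

    /**
     * @param atomCount number of atoms (vertices), indexed 0..atomCount-1
     * @param bonds     each entry is a pair {begin, end} of atom indices
     * @return bonds - atoms + connected components
     */
    static int cyclomaticNumber(int atomCount, int[][] bonds) {
        int[] parent = new int[atomCount];
        for (int i = 0; i < atomCount; i++) parent[i] = i;
        int components = atomCount;
        for (int[] bond : bonds) {
            int a = find(parent, bond[0]);
            int b = find(parent, bond[1]);
            if (a != b) {       // bond joins two previously separate fragments
                parent[a] = b;
                components--;
            }
        }
        return bonds.length - atomCount + components;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }
}
```

A benzene-like graph (six atoms, six bonds) gives 1 and a naphthalene-like graph (ten atoms, eleven bonds) gives 2, which is a convenient sanity check on the output of any ring perception routine.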

Aromaticity

Aromaticity has seen many definitions in the past and for cheminformatics it frequently is algorithmically defined. The outcome of an aromaticity calculation depends on a number of atom type features and heuristics, which are often ambiguously defined in the published literature. Based on the information used, several different algorithmic definitions of aromaticity can be defined. Older CDK versions had various aromaticity models implemented but the code was scattered throughout the library, resulting in an inconsistent API to compute aromaticity and a significant maintenance burden. The API was unified in the current version, resulting in three models, of which two are based on the CDK atom typer. The difference between these two models is how contributions from exocyclic double bonds are handled.

The current CDK version further generalizes the idea that aromaticity is a model, and provides an API that allows the user to select one of several aromaticity models, leading to greater interoperability with other toolkits. The new Aromaticity class allows a custom model to be built by selecting and combining options. For example, to reproduce the functionality of the previous CDKAromaticity class:

figure p

Here, the CDK model for counting donated electrons is used, along with the ring systems identified by the older algorithm from previous versions, which was limited in the number of fused ring systems it considered. However, an alternative aromaticity calculator that considers all possible ring systems can now be easily created with:

figure q

For SMARTS matching and SMILES generation a model based on Daylight [72] can be used and offers significant speed improvements over the one based on CDK Atom Types. This model has recently been documented as part of the OpenSMILES specification (http://opensmiles.org/):

figure r

The aromaticity algorithm is straightforward: the potential electron donation is calculated for each atom as \(-1\) (not aromatic), 0, 1, or 2. The set of cycles provided in the constructor is then generated and each is checked for Hückel’s rule (\(4n+2\)).
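The per-cycle check can be sketched in a few lines. The code below assumes the per-atom donation values have already been computed by some electron donation model and that a cycle is given as a list of atom indices; the class and method names are hypothetical and this is not the CDK Aromaticity class.

```java
/** Hypothetical sketch of the Hückel check applied to one cycle. */
final class HuckelSketch {

    /**
     * @param cycle    atom indices forming the cycle
     * @param donation per-atom pi electron donation: -1 (cannot be aromatic), 0, 1 or 2
     * @return true if no atom is excluded and the electron sum has the form 4n+2
     */
    static boolean isAromaticCycle(int[] cycle, int[] donation) {
        int electrons = 0;
        for (int atom : cycle) {
            if (donation[atom] < 0) return false; // one excluded atom breaks the cycle
            electrons += donation[atom];
        }
        return electrons >= 2 && (electrons - 2) % 4 == 0;
    }
}
```

With one electron from each of six carbons the sum is 6 and the cycle passes. A five-membered ring with four carbons donating one electron each and a heteroatom donating two also sums to 6, while a ring containing any atom marked \(-1\) is rejected regardless of the sum.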

CTfile format improvements

The molfile format is still very popular and despite it being a proprietary format, it has become a de facto standard. The format forms the core of the larger CTfile family which was originally developed by MDL Information Systems [73]. The current format specification is published by BIOVIA and available on request [74].

The CTAB block (connection table) of a molfile comes in two versions, V2000 and V3000. The V3000 provides several enhancements including but not limited to: removing atom and bond count limits, enhanced stereochemistry, and link nodes. For backwards compatibility V2000 is often preferred resulting in limited usage of V3000.

CDK v2.0 adds support for V3000 and has optimized and extended support for V2000. Currently these are considered separate formats requiring a user to know what version is being read beforehand. Future APIs will aim to simplify this and provide a unified reader. An overview of currently supported CTfile formats is given in Table 2.

Table 2 CTfile format support

CTfile Sgroups capture and organise high level information about sets of atoms and bonds [75]. There are four types of Sgroup: Display Short-cuts, Polymers, Mixtures, and Data. The most familiar Sgroups from an end user perspective are structure repeat units (e.g. bracketing) and abbreviations (Fig. 8). CDK v2.0 adds support for representation, reading, writing, and depiction of Sgroups.

New object builders

Originally, the CDK was developed as a shared library between JChemPaint [76] and Jmol [77, 78]. JChemPaint used an MVC approach with an event-passing mechanism to update the view when the model was changed. This can cause a cascade of change events being passed around, which was not always a desirable feature, especially for non-UI code. To address this, interfaces were introduced allowing multiple implementations of the core interfaces. With much code of the CDK library no longer based on the original data model, a builder is needed to create objects of that data model, such as an implementation of the IAtom. The new IChemObjectBuilders allow implementations of the interfaces to be instantiated without explicitly referencing those implementations. This way, any algorithm implementation in the CDK can use any of the data model interface implementations.

The CDK v1.0 and v1.2 implementations of the IChemObjectBuilder had, however, one method for each data object constructor, resulting in a very large interface. Moreover, this interface API had to be updated each time a new class was introduced, and when existing methods changed and constructors were updated. To simplify the API, the new IChemObjectBuilder collapses all methods into a single method, which takes as a first parameter the class of the interface that is to be constructed. All further parameters are passed as parameters to the class constructor.

For example, to construct a new atom from its element symbol, one previously wrote:

figure s

With the new builder, the code looks like:

figure t
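The design of a single construction method that takes the interface class followed by the constructor arguments can be sketched with plain Java reflection. Everything in this sketch is a stand-in: `IAtomSketch`, `AtomSketchImpl`, and `SketchBuilder` are hypothetical names, and the constructor lookup is deliberately simplistic (it does not handle primitive or null arguments). It is not the CDK IChemObjectBuilder implementation.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a data model interface. */
interface IAtomSketch {
    String getSymbol();
}

/** Stand-in for one concrete implementation of that interface. */
final class AtomSketchImpl implements IAtomSketch {
    private final String symbol;
    public AtomSketchImpl(String symbol) { this.symbol = symbol; }
    public String getSymbol() { return symbol; }
}

/** Hypothetical builder with one generic method instead of one method per class. */
final class SketchBuilder {
    private final Map<Class<?>, Class<?>> implementations = new HashMap<>();

    <T> void register(Class<T> iface, Class<? extends T> impl) {
        implementations.put(iface, impl);
    }

    /** Creates an instance of the registered implementation of {@code iface}. */
    <T> T newInstance(Class<T> iface, Object... params) {
        Class<?> impl = implementations.get(iface);
        if (impl == null)
            throw new IllegalArgumentException("no implementation for " + iface.getName());
        for (Constructor<?> ctor : impl.getDeclaredConstructors()) {
            if (!accepts(ctor.getParameterTypes(), params)) continue;
            try {
                return iface.cast(ctor.newInstance(params));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalArgumentException("no matching constructor in " + impl.getName());
    }

    // Simplistic match: same arity and every argument is an instance of the declared type.
    private static boolean accepts(Class<?>[] types, Object[] params) {
        if (types.length != params.length) return false;
        for (int i = 0; i < types.length; i++)
            if (!types[i].isInstance(params[i])) return false;
        return true;
    }
}
```

After `builder.register(IAtomSketch.class, AtomSketchImpl.class)`, a call such as `builder.newInstance(IAtomSketch.class, "C")` yields an atom without the caller naming the implementation class. Adding a new data class then requires no change to the builder interface, at the cost of moving constructor mismatches from compile time to run time.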

The CDK library is now mostly refactored and no longer depends on a specific implementation of the IChemObjectBuilder, allowing the user of the CDK to select a builder suitable to their software. Therefore, if software depends on event passing, the DefaultChemObjectBuilder can be used. In most cases this is not needed and the SilentChemObjectBuilder is preferred, resulting in a typical speed-up of 10–20%:

figure u

The third builder is the DataDebugChemObjectBuilder which generates debug information for all changes to the content of the data classes. This can be useful for debugging and other forms of code inspection.

Molecular fingerprints

Molecular fingerprints have also seen significant development in this CDK version. Previously, fingerprints were represented using the BitSet class from the Java library. While using this class allowed the use of pre-existing methods to manipulate bit strings, it keeps a vector of bits in memory. The solution was excellent for hashed, relatively small fingerprints, e.g., 1024 bits, i.e. with a \(2^{10}\) indexing space (128 B). However, implementing a fingerprint designed to avoid collisions with a \(2^{32}\) bit indexing space using this approach would be memory-inefficient (512 MiB). To allow for multiple fingerprint representations, a bit fingerprint interface was introduced: IBitFingerprint.

figure v
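The memory figures quoted above follow directly from storing one bit per index, and a sparse representation avoids that cost by storing only the indices of the set bits. The sketch below shows both points; the class names are hypothetical and the sparse class is only a stand-in for the idea, not an implementation of IBitFingerprint.

```java
import java.util.TreeSet;

/** Arithmetic behind the dense bit vector sizes. */
final class FingerprintMemory {
    /** Bytes needed to hold one bit per index in a dense vector. */
    static long denseBytes(long bits) {
        return bits / 8;
    }
}

/** Hypothetical sparse bit fingerprint: memory grows with the set bits, not the index space. */
final class SparseBitsSketch {
    private final TreeSet<Integer> on = new TreeSet<>();

    /** Any 32-bit value (including negative hash codes) is a valid index. */
    void set(int index) {
        on.add(index);
    }

    boolean get(int index) {
        return on.contains(index);
    }

    int cardinality() {
        return on.size();
    }
}
```

`FingerprintMemory.denseBytes(1024)` gives 128 bytes and `FingerprintMemory.denseBytes(1L << 32)` gives 536,870,912 bytes (512 MiB), whereas the sparse sketch needs storage only for the handful of features actually present in a molecule.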

Although fingerprints are traditionally bit vectors, a count fingerprint was also introduced, so that fingerprints based on integer vectors are supported in the CDK as well. Each count in the fingerprint represents how often the corresponding substructure is found in the molecule.

figure w
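The difference between a bit fingerprint and a count fingerprint can be made concrete with a map from feature hash to occurrence count. This is a minimal sketch with hypothetical names and an assumed integer hash per substructure; it does not reproduce the CDK count fingerprint interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical count fingerprint: feature hash -> number of occurrences. */
final class CountFingerprintSketch {
    private final Map<Integer, Integer> counts = new HashMap<>();

    /** Records one more occurrence of the feature with the given hash. */
    void add(int featureHash) {
        counts.merge(featureHash, 1, Integer::sum);
    }

    /** How often the feature was seen; 0 if it never occurred. */
    int getCount(int featureHash) {
        return counts.getOrDefault(featureHash, 0);
    }

    /** Bit fingerprint view: a feature is "on" if it occurred at least once. */
    boolean isSet(int featureHash) {
        return counts.containsKey(featureHash);
    }

    int numberOfPopulatedBins() {
        return counts.size();
    }
}
```

Collapsing the counts to `isSet` recovers the traditional bit vector, so the count form strictly adds information.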

The fingerprints currently provided by the CDK are listed in Table 3.

Table 3 The molecular fingerprints in CDK

Improved coding standards

As the CDK library grew over the years, so did the complexity of the maintenance. The main branch frequently failed to compile and bug fixes became more onerous due to unexpected side effects. Often, fixing a bug in one part of the code broke other code that had made incorrect assumptions about the fixed code. With the increased size of the CDK developer community, such issues were inevitable in the absence of any formal coding and testing standards.

To address these issues, we have adopted a number of coding standards. While not a comprehensive implementation of software engineering best practices, they attempt to find a balance between increasing code maintainability and being flexible enough to allow efficient code development. We appreciate the subjective nature of this statement, and some adopted guidelines have been heavily discussed and debated in the CDK community.

Arguably the biggest factor in improved code quality is a peer review process where any functionality-changing patch is required to be reviewed by one independent, senior CDK developer for the development branch, and by two reviewers for stable branches. This patch development system is supported by a number of automated validation steps as outlined below. The next sections describe some approaches the project has adopted that allow us to maintain the CDK library as it is today.

Stability and version identifier

Prior to CDK v2.0, the parity of the version identifier’s second digit indicated stability: even numbers (v1.2.x, v1.4.x) indicated API stability and odd numbers (v1.3.x, v1.5.x) indicated potential API instability. Versions v1.4.x and v1.5.x were developed in parallel and, where possible, patches were applied to both. As the APIs diverged, the effort needed to port patches from the development (but more robust) v1.5.x to v1.4.x became unmanageable for the core development team. This even-odd version scheme was adopted from old Linux kernel versioning, which was abandoned in 2004 in favour of time-based releases [79].
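The even-odd convention is simple enough to state as code. The helper below is a hypothetical illustration of the pre-v2.0 rule only (it is not part of the CDK) and says nothing about versions from v2.0 onwards.

```java
/** Hypothetical helper illustrating the pre-v2.0 even-odd convention. */
final class LegacyCdkVersion {

    /**
     * @param version identifier such as "v1.4.3" or "1.5.14"
     * @return true if the second digit is even, i.e. a stable-API series
     */
    static boolean isStableSeries(String version) {
        String v = version.startsWith("v") ? version.substring(1) : version;
        String[] parts = v.split("\\.");
        if (parts.length < 2)
            throw new IllegalArgumentException("expected at least major.minor: " + version);
        return Integer.parseInt(parts[1]) % 2 == 0;
    }
}
```

Under this rule `isStableSeries("v1.4.3")` is true and `isStableSeries("v1.5.14")` is false.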

At the time of writing the development branch is more than 3000 commits ahead of v1.4.x. As the v1.5.x API has become stable, it became time to release v1.6.x. Due to significant API changes in 2011 [Footnote 2], it was felt a larger digit increment was needed. This provided the opportunity to change to a more manageable and intuitive version identifier.

From CDK v2.0 onwards, a new sequence-based version scheme is used. The version identifier indicates change significance as follows:

[Figure: version identifier scheme]

Due to limited developer resources, we envision that releases will primarily increment the minor version, with the occasional patch release. As per Maven convention, development versions are suffixed with -SNAPSHOT. There are no API changes between v1.5.x and v2.0.
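As a rough illustration of how such a sequence-based identifier orders releases, the sketch below parses identifiers and sorts them. It is written in Python rather than the CDK's Java and assumes a `major.minor[.patch]` layout with an optional `-SNAPSHOT` suffix. The name `parse_version` is hypothetical, and the rule that a snapshot sorts before the release of the same number follows Maven convention rather than anything stated above.

```python
import re

# Assumed layout: major.minor[.patch][-SNAPSHOT]
_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(-SNAPSHOT)?$")

def parse_version(identifier):
    """Return a sortable tuple (major, minor, patch, is_release)."""
    match = _VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not a sequence-based version: {identifier!r}")
    major, minor, patch, snapshot = match.groups()
    # A missing patch component is treated as 0, so "2.0" equals "2.0.0".
    # is_release is False for snapshots, so they sort before the release.
    return (int(major), int(minor), int(patch or 0), snapshot is None)

versions = ["2.1", "2.0", "2.1-SNAPSHOT", "2.0.1"]
print(sorted(versions, key=parse_version))
# ['2.0', '2.0.1', '2.1-SNAPSHOT', '2.1']
```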

Modularization

One of the central approaches we have adopted is to make the CDK more modular. The CDK assigns every class to a module, and defines dependencies between modules. For example, core modules are not allowed to depend on modules with data classes implementing the CDK interfaces; instead, they may only depend on the interfaces themselves. This ensures that dependencies are minimized. Furthermore, it allows cherry-picking of CDK functionality, reducing the number of third-party library dependencies that are needed. An overview of key modules with description, important changes, and dependencies on third-party libraries is given in Table 4, and the dependencies between the CDK modules are depicted in Fig. 7.

Table 4 A selection of key CDK modules with major changes
Fig. 7 Dependencies between CDK modules. For example, cdk-core depends on the cdk-interfaces module. A few higher-level modules have been left out: cdk-builder3dtools, cdk-legacy, and cdk-depict
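The dependency rule described above can be checked mechanically once the module graph is written down. The following Python sketch is only an illustration and is not how the CDK enforces the rule: apart from cdk-core and cdk-interfaces, the module names, the `layer` labels, and the function `forbidden_dependencies` are hypothetical.

```python
def forbidden_dependencies(depends_on, layer):
    """List (module, dependency) pairs where a core module depends on a data module."""
    violations = []
    for module, dependencies in depends_on.items():
        for dependency in dependencies:
            if layer[module] == "core" and layer[dependency] == "data":
                violations.append((module, dependency))
    return violations

# Hypothetical layering; only cdk-core and cdk-interfaces are named in the text.
layer = {
    "cdk-interfaces": "interfaces",
    "cdk-core": "core",
    "example-data": "data",
    "example-algo": "core",
}
depends_on = {
    "cdk-interfaces": [],
    "cdk-core": ["cdk-interfaces"],
    "example-data": ["cdk-interfaces"],
    "example-algo": ["cdk-core", "example-data"],  # breaks the rule
}
print(forbidden_dependencies(depends_on, layer))
# [('example-algo', 'example-data')]
```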

Documentation

The quality of the JavaDocs was originally tested with DocCheck, which was later replaced by a custom-written tool called OpenJavaDocCheck. With the move to Maven (explained later), which does not have integration for this tool, we adopted CheckStyle (http://checkstyle.sourceforge.net/). This tool reports on missing documentation and on documentation which is not properly annotated in the Java source files. The new website lists a few resources to help new CDK users, including a book [80] and the Chemistry Toolkit Rosetta Wiki (http://ctr.wikia.com/wiki/Chemistry_Toolkit_Rosetta_Wiki).

Testing

Years of development of the CDK library have resulted in a large suite of tests of various kinds. These include unit tests, which test core APIs, and functional tests, which test higher-level functionality of the CDK. The latter include tests that check whether algorithm implementations calculate the expected values, as well as integration tests, which involve more than one algorithm, such as SMILES parsing. The suite consists of more than 23 thousand tests.

Code quality

The project continues to use PMD (http://pmd.sf.net/) for code quality checking, but deviates from the default rules. For example, we are more liberal with variable name length. Moreover, a number of additional PMD tests have been developed specifically for the CDK that, for example, test whether a class uses the core interfaces instead of implementations of those interfaces, that is, whether the code uses IAtom instead of Atom. However, these tests do generate a few false positives, as they check the class name only, and not the Java package the class is in.

Continuous integration

The CDK has had an automated build system for many years now. Originally, Nightly integrated various tools (building, testing, JavaDoc, etc.) [2]. After the move to Maven, the various steps could be run with Maven, and Jenkins was used to execute them (one instance is still running at https://jenkins.bigcat.unimaas.nl/job/cdk/). The online Travis-CI service is used to build all branches, including pull requests, to ensure everything properly compiles: https://travis-ci.org/cdk/cdk.

Git, branching, and patches

Older versions of the CDK employed Subversion for version control. A few years back, the project switched to the Git version control system. Key advantages of this shift are the ability to have distributed repositories, easier branching, and provision for patches. GitHub (https://cdk.github.io/) has replaced SourceForge as the main source code hosting service, where we can use novel approaches for commenting on code (peer review), pull requests, etc. These new features simplify our code review process.

Support

Besides the aforementioned sources of documentation, the project has additional sources of support. First, the issue tracker welcomes questions and other types of support requests, available at https://github.com/cdk/cdk/issues. The mailing list is another place where support can be requested, while the archives document many past user questions. The list and archives can be accessed from https://sourceforge.net/p/cdk/mailman/cdk-user/.

Binary distributions

Maven packages

The build system has been converted from Ant to Maven. The shift was motivated by easier dependency handling, cleaner separation of testing code from the main library, and automated packaging. The move to modules necessitated splitting the original monolithic source code tree into per-module source folders. While this makes the on-disk layout of the source code more complex, this is usually hidden by modern IDEs.

As a result, for many modules the test code is now more closely linked to the code being tested: both reside in the same folder, though we adhere to the Maven custom of having separate src/main/java and src/test/java folders. For a few modules, however, this solution introduces circular dependencies, in which case a separate Maven module is created for the tests.

The Maven packages for the CDK are available from Maven Central, which makes them easy for other projects to use. The full library can be included in other software by depending on the cdk artifact (http://search.maven.org/#search|ga|1|org.openscience), but dependencies can also be defined on individual CDK modules.

OSGi bundles

OSGi bundles are available for the CDK too, and are used by e.g. Bioclipse [27, 28] and KNIME [8]. However, because CDK Java packages are occasionally split between CDK modules, the CDK currently needs to be bundled as a single OSGi jar. The bundle is available from http://pele.farmbio.uu.se/bioclipse/cdk/cdk-1.5.13/. These incompatibilities between Java packages and bundles are currently being explored and constitute an area where modularization can be improved.

Systematic benchmark

A systematic benchmark was performed to evaluate and quantify performance improvements from v1.4.19 to v2.0. The benchmark is divided into several cheminformatics tasks for common use cases. Each task was evaluated on input from ChEBI 149 [81] and ChEMBL 22.1 [82] as both SMILES and SDF.

The benchmark was run on Java SE 8, CentOS 7, Intel Core i7-4790 CPU @ 3.60GHz with 16 GB of RAM. The code to run the benchmark is available in Additional file 1 allowing numbers to be recorded on the reader’s system.

The results of the benchmark are summarised in Tables 5 and 6. The total elapsed times are reported in Table 5; Table 6 subtracts the results of the first task (Count Heavy Atoms) to provide a comparable measure without the overhead of input read time. The throughput in molecules per minute is also reported, but is less accurate for very fast-running tasks.
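The arithmetic behind the two tables can be sketched as follows. This is a Python illustration with hypothetical function names and made-up numbers, not values from Tables 5 and 6 and not code from Additional file 1.

```python
def net_seconds(task_seconds, read_seconds):
    """Elapsed time of a task with the read-only baseline subtracted."""
    return task_seconds - read_seconds

def molecules_per_minute(n_molecules, seconds):
    """Throughput; unreliable when `seconds` is close to zero."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_molecules * 60.0 / seconds

# Hypothetical numbers for illustration only.
n_molecules = 100_000
read = 10.0   # seconds for the Count Heavy Atoms task (read-only baseline)
task = 40.0   # total seconds for some other task on the same input

print(net_seconds(task, read))                                     # 30.0
print(molecules_per_minute(n_molecules, task))                     # 150000.0
print(molecules_per_minute(n_molecules, net_seconds(task, read)))  # 200000.0
```

When the net time shrinks to a second or less, a timing error of a few tenths of a second changes the computed throughput substantially, which is why throughput is less reliable for the fastest tasks.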

Table 5 Summary of systematic benchmark comparing v1.4.19 to v2.0
Table 6 Summary of systematic benchmark comparing v1.4.19 to v2.0 without read times

Count heavy atoms

This task highlights improvements in raw read performance. Each record is read into a memory-resident connection table and the number of heavy (non-hydrogen) atoms is counted by iterating over the atoms sequentially.

The improvement on this task is most noticeable for SMILES input: previously it took more than 8 min to read ChEMBL 22.1, but this is reduced to less than 11 s. On top of this improvement, SMILES input is now validated and assigned a Kekulé structure. This identifies 9 invalid entries in ChEBI and another 9 in ChEMBL. Most of these rejected SMILES are due to incorrect encoding of cis/trans double-bond stereochemistry at ring closures. The ChEBI 149 SMILES input has 2107 empty records that v1.4.19 skips; v2.0 simply reads these as empty molecules. Input from SDF also improved, from ~3 to ~1 min for ChEMBL. The SDF input in v2.0 now includes perception of stereochemistry and reading of CTfile Sgroups (Fig. 8). There are 9 entries from ChEBI’s SDF that are rejected because they contain CTfile query features (e.g. any bond order).

Fig. 8 Examples of Sgroups now captured by the CDK and encoded in molfiles and CXSMILES. a Ethyl esterification fully expanded reaction. b Using Sgroup abbreviations allows display shortcuts and more compact depiction. c An example of a structure repeat unit in DNA 5′-phosphate (CHEBI:4294)

Rings

Ring perception is a fundamental step in many other algorithms. The rings task is divided into three subtasks: mark, sssr, and all.

-mark The first subtask measures the performance of marking ring membership and reporting the number of ring bonds in each record. This can be done with a linear-time algorithm based on a depth-first search, whereas the original code used a weighted spanning tree to compute the membership in linearithmic time. The run times are similar for these datasets (Table 6); larger differences are only seen for more complex cage molecules such as fullerenes [38].
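One way to mark ring bonds in linear time is to find the bridges of the molecular graph with a single depth-first search: a bond is a ring bond exactly when it is not a bridge. The Python sketch below illustrates that idea only. It is not the CDK's Java implementation, and the function name `count_ring_bonds` and the input layout (an atom count plus a list of atom-index pairs) are assumptions made for the example.

```python
def count_ring_bonds(n_atoms, bonds):
    """Count bonds that lie on at least one cycle (bonds that are not bridges)."""
    adjacency = [[] for _ in range(n_atoms)]
    for index, (a, b) in enumerate(bonds):
        adjacency[a].append((b, index))
        adjacency[b].append((a, index))

    order = [-1] * n_atoms  # DFS discovery time of each atom, -1 = unvisited
    low = [0] * n_atoms     # earliest discovery time reachable from the subtree
    bridges = 0
    clock = 0

    def visit(atom, via_bond):
        nonlocal clock, bridges
        order[atom] = low[atom] = clock
        clock += 1
        for neighbour, bond in adjacency[atom]:
            if bond == via_bond:
                continue  # do not walk straight back along the incoming bond
            if order[neighbour] == -1:
                visit(neighbour, bond)
                low[atom] = min(low[atom], low[neighbour])
                if low[neighbour] > order[atom]:
                    bridges += 1  # no back-edge spans this bond, so it is acyclic
            else:
                low[atom] = min(low[atom], order[neighbour])

    for atom in range(n_atoms):
        if order[atom] == -1:
            visit(atom, -1)
    return len(bonds) - bridges

benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
toluene = benzene + [(0, 6)]
print(count_ring_bonds(6, benzene))  # 6
print(count_ring_bonds(7, toluene))  # 6
```

The recursion keeps the sketch short; very large molecules would need an explicit stack to stay within Python's recursion limit.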

-sssr The second subtask computes the Smallest Set of Smallest Rings (SSSR) and reports the size of the SSSR (circuit rank) for each record. Although the circuit rank can be computed more efficiently with a linear traversal (counting DFS back-edges) or with Euler’s polyhedron formula, we are testing the time to enumerate the SSSR set. In general, the SSSR is considered unfavourable due to the non-uniqueness of the set and the need for Gaussian matrix elimination (cubic runtime). With some bookkeeping, the time spent in the matrix elimination has been reduced [38]. For ChEMBL, the time to generate the SSSR is now ~16 s, whereas it previously took ~3.5 min (Table 6).
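The remark about circuit rank can be made concrete. For a graph with E bonds, V atoms, and C connected components, the circuit rank is E - V + C, and it can be counted without enumerating any ring. The Python sketch below is not taken from the CDK; it swaps the DFS back-edge count for an equivalent union-find pass, in which every bond that joins two atoms that are already connected closes one independent ring. The name `circuit_rank` is hypothetical.

```python
def circuit_rank(n_atoms, bonds):
    """Number of independent rings, equal to bonds - atoms + components."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    rank = 0
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            rank += 1               # atoms already connected: bond closes a ring
        else:
            parent[root_a] = root_b  # bond merges two components
    return rank

cubane = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (7, 4),
          (0, 4), (1, 5), (2, 6), (3, 7)]
print(circuit_rank(8, cubane))  # 5, i.e. 12 bonds - 8 atoms + 1 component
```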

-all The third subtask counts all rings up to and including size 12. This includes rings that encompass other smaller rings; for example, 1H-indole has rings of size 5, 6, and 9. In general this problem is exponential, so an adjustable threshold or timeout is used to avoid problematic molecules. CDK v1.4.19 used a timeout-based threshold (default 5 s), whilst v2.0 uses a counter based on properties of the algorithm [38]. In v2.0, 15/17 (see Note 3) records that have complex cage-like ring systems (e.g. CHEBI:33611) were skipped from ChEBI, and no records in ChEMBL reached the threshold. By comparison, in v1.4.19, 14/16 records were skipped from ChEBI and 88/90 from ChEMBL due to reaching the timeout.

The speed-up in v2.0 is slightly better than for the SSSR task: ChEMBL previously took 4–5 min and now takes only ~12–14 s (Table 6). In v2.0, finding all rings (≤12 bonds) runs faster than the non-unique SSSR computation.
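To see why counting all rings needs a cut-off, the following Python sketch enumerates every simple cycle up to a size limit by growing paths from each atom, and aborts once a step counter is exceeded. It is a naive illustration, not the algorithm of [38]; the names `all_ring_sizes` and `max_steps`, and the use of a plain step counter as the threshold, are assumptions for the example.

```python
def all_ring_sizes(adjacency, max_size=12, max_steps=100_000):
    """Sizes of all simple cycles with at most max_size atoms."""
    sizes = []
    steps = 0

    def extend(start, path, on_path):
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise RuntimeError("ring search threshold reached")
        for nxt in adjacency[path[-1]]:
            if nxt == start:
                # Each cycle is walked in both directions; keep one of them.
                if len(path) >= 3 and path[1] < path[-1]:
                    sizes.append(len(path))
            elif nxt > start and nxt not in on_path and len(path) < max_size:
                # Only visit atoms numbered above the start atom, so every
                # cycle is found from its lowest-numbered atom alone.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(adjacency):
        extend(start, [start], {start})
    return sorted(sizes)

# 1H-indole heavy-atom skeleton: six-membered ring 0..5 fused at bond 0-5
# with a five-membered ring 0, 5, 6, 7, 8.
indole = {
    0: [1, 5, 8], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5],
    5: [4, 0, 6], 6: [5, 7], 7: [6, 8], 8: [7, 0],
}
print(all_ring_sizes(indole))  # [5, 6, 9]
```

The result matches the indole example in the text: the two fused rings plus the nine-membered envelope ring.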

Canonical SMILES

This task measures the generation of a Unique SMILES string. These can be used to compute dataset intersections and for exact lookup. For SMILES input, the total elapsed time in v2.0 is ~20 times shorter for both ChEBI and ChEMBL. For ChEMBL it now takes just under 41 s to read, reorder, and write the SMILES, compared to more than 14 min previously.

Convert

This task tests the non-canonical conversion between SDF and SMILES.

-ofmt smi SMILES is a very compact means of storing connection tables. Whereas v1.4.19 could only write canonical SMILES, v2.0 allows different SMILES flavours to be generated, including a non-canonical variant. This task outputs CXSMILES that includes additional fields such as repeat groups (used by some ChEBI entries). As expected, the v1.4.19 execution time is the same as for the Canonical SMILES task, but v2.0 can generate the non-canonical SMILES faster, taking less than 30 s for SMILES from ChEMBL.

Assigning double-bond configurations in SMILES is non-trivial and v2.0 has some safety checks. Because the SMILES output is Kekulé but the input was aromatic, an extra double bond may accidentally be encoded in the SMILES output when the bond orders are assigned. This is sometimes acceptable but currently reports an error.

-ofmt sdf For writing SDF output there is minimal improvement over v1.4.19; when discounting improvements in read performance, the SDF generation for ChEBI actually runs slightly slower than in v1.4.19 (Table 6). This can be partially explained by the more comprehensive SDF generation, which now writes Sgroups and computes values for the atom parity and valence columns.

-gen2d -ofmt sdf When writing SDF, the only portable way to store stereochemistry is with the inclusion of coordinates; this is specified with the -gen2d option. The overhaul in layout generation discussed earlier provides better layouts but also included performance tweaks: in CDK v1.4.19, generating coordinates and writing an SDF for ChEMBL took almost 3.5 h, but it now takes only ~18 min.

Fingerprint generation

This task tests the generation of fingerprints for similarity and substructure screening. Three different types of fingerprints were tested: a Daylight-like Hashed Path Fingerprint, MACCS-like 166 Keys, and a Pipeline Pilot-like Hashed Circular Fingerprint (ECFP4). The task generates a hexadecimal FPS file that can be used with chemfp [83].
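The two uses named here, similarity and substructure screening, reduce to simple bit operations on the fingerprints. The Python sketch below treats a fingerprint as an integer bitset. The Tanimoto coefficient is a common similarity measure that is assumed here rather than taken from the text, and the function names and example bit patterns are hypothetical.

```python
def may_contain(query_bits, target_bits):
    """Substructure screen: every query bit must also be set in the target."""
    return (query_bits & target_bits) == query_bits

def tanimoto(a_bits, b_bits):
    """Shared bits divided by bits set in either fingerprint."""
    union = bin(a_bits | b_bits).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a_bits & b_bits).count("1") / union

query = 0b00000110
target = 0b00101110
print(may_contain(query, target))  # True
print(tanimoto(query, target))     # 0.5
```

A passing screen only means the target may contain the query; a full substructure match is still needed to confirm it.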

-type path Path-based fingerprints encode paths of length 0–7 and can be used for both substructure and similarity screening. The algorithm was tweaked for v2.0 to hash the paths without pre-computing all paths upfront and without needing to generate character strings before hashing. Encoding ChEMBL previously took 42–47 min and now takes only 6–8 min.
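The idea of hashing paths while they are being grown, instead of first collecting every path as a string, can be sketched as follows. This is a simplified Python illustration under stated assumptions: it ignores bond orders, aromaticity, and hydrogens, the hash function is an arbitrary choice, and `path_fingerprint` is a hypothetical name. It is not the CDK's hashing scheme.

```python
def path_fingerprint(symbols, adjacency, max_bonds=7, n_bits=1024):
    """Set one bit per linear path of 0..max_bonds bonds, hashing as the path grows."""
    bits = 0

    def extend(atom, on_path, running_hash, n_bonds):
        nonlocal bits
        # Fold the new atom into the running hash; no path string is built.
        for char in symbols[atom]:
            running_hash = (running_hash * 31 + ord(char)) & 0xFFFFFFFF
        bits |= 1 << (running_hash % n_bits)
        if n_bonds == max_bonds:
            return
        for neighbour in adjacency[atom]:
            if neighbour not in on_path:
                on_path.add(neighbour)
                extend(neighbour, on_path, running_hash, n_bonds + 1)
                on_path.remove(neighbour)

    for start in range(len(symbols)):
        extend(start, {start}, 17, 0)
    return bits

ethane = path_fingerprint(["C", "C"], {0: [1], 1: [0]})
ethanol = path_fingerprint(["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
print((ethane & ethanol) == ethane)  # True
```

Because every path in the smaller graph also occurs in the larger one, all of its bits are set in the larger fingerprint, which is the property substructure screening relies on.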

-type maccs The CDK MACCS fingerprint uses 166 keys to encode features of a structure and can be used for similarity searching. This encoding uses different aspects of the library including ring perception and the new aromaticity perception but the speed is primarily dependent on SMARTS matching performance. Generating the fingerprint previously took ~1 day for ChEMBL and ~1.75 h for ChEBI. This has been reduced to less than 13.5 min for ChEMBL and ~20 s for ChEBI.

-type circ Circular fingerprints can only be used for similarity and could not be generated in v1.4.19. However, the fingerprints are known to give good retrieval performance [84]. The times are included here to show that they are faster to calculate than path or MACCS-like keys, and they are therefore recommended. The CDK includes two implementations, based on signatures or extended connectivity [35].

Benchmark summary

In all tasks, the total elapsed time is better in v2.0 compared to v1.4.19. On many tasks, v2.0 is more than ten times faster. Not only is the execution time improved, but improvements in robustness and correctness mean that v2.0 is often doing much more work than the equivalent procedures in v1.4.19.

Conclusions

Since the second CDK publication, in 2006, the library has been improved in many aspects including architecture, new functionality, improved code testing, management, peer review, and deployment. These changes have led to a more functionally rich cheminformatics library, with significant performance improvements. Updates on the common SMILES and molfile formats and the improved structure diagram generation are very visible and benefit many of the tools using the CDK. Furthermore, the stability of the development model has significantly improved, providing greater stability of the library over time. With more than 90 contributors, a long list of tools based on the CDK, and hundreds of article citations, the CDK is alive and kicking.

Availability and requirements

  • Project Name The Chemistry Development Kit.

  • Project home page https://cdk.github.io/.

  • Operating system(s) Windows, GNU/Linux, OS/X.

  • Programming language Java.

  • Other (optional) requirements JNI-InChI, Vecmath, Beam, Guava, JGraphT, Signatures, CMLXOM, XOM, JavaCC.

  • License LGPL v2.1 or later.

  • Any restrictions to use by non-academics None additional.

Change history

  • 20 September 2017

    An erratum to this article has been published.

Notes

  1. Molfiles can also store tetrahedral stereochemistry as a parity value; this is read if no coordinates are specified. In general there is no guarantee the parity value is read and the only portable way to store stereochemistry in a molfile is with coordinates.

  2. https://github.com/cdk/cdk/commit/2fc6b61972af834c1fea7fcb64287363ecbcb188.

  3. 2 records from SDF use query bond features and are skipped when read.

References

  1. O’Boyle N, Guha R, Willighagen E, Adams S, Alvarsson J, Bradley JC et al (2011) Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. J Cheminform 3(1):37

  2. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C et al (2006) The Blue Obelisk—interoperability in chemical informatics. J Chem Inf Model 46(3):991–998

  3. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E (2003) The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 43(2):493–500

  4. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL (2006) Recent developments of the Chemistry Development Kit (CDK)—an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12(17):2111–2120

  5. O’Boyle NM, Hutchison GR (2008) Cinfony–combining Open Source cheminformatics toolkits behind a common interface. Chem Cent J 2:24

  6. Guha R (2007) Chemical informatics functionality in R. J Stat Softw 18(5):1–16

  7. Truszkowski A, Jayaseelan KV, Neumann S, Willighagen EL, Zielesny A, Steinbeck C (2011) New developments on the cheminformatics open workflow environment CDK-Taverna. J Cheminform 3(1):1–10

  8. Beisken S, Meinl T, Wiswedel B, de Figueiredo L, Berthold M, Steinbeck C (2013) KNIME-CDK: workflow-driven cheminformatics. BMC Bioinform 14(1):257

  9. ChemViz2: Cheminformatics App for Cytoscape; 2016. http://www.rbvi.ucsf.edu/cytoscape/chemViz2/

  10. Lawson KR, Lawson J (2012) LICSS—a chemical spreadsheet in microsoft excel. J Cheminform 4(1):3

  11. Hinselmann G, Rosenbaum L, Jahn A, Fechner N, Zell A (2011) jCompoundMapper: an open source Java library and command-line tool for chemical fingerprints. J Cheminform 3(1):3

  12. Wetzel S, Klein K, Renner S, Rauh D, Oprea TI, Mutzel P et al (2009) Interactive exploration of chemical space with Scaffold Hunter. Nat Chem Biol 5(8):581–583

  13. Klein K, Koch O, Kriege N, Mutzel P, Schäfer T (2013) Visual analysis of biological activity data with Scaffold Hunter. Mol Inform 32(11–12):964–975

  14. Peironcely JE, Rojas-Chertó M, Fichera D, Reijmers T, Coulier L, Faulon JL et al (2012) OMG: open molecule generator. J Cheminform 4(1):1–13

  15. Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474

  16. Dong J, Cao DS, Miao HY, Liu S, Deng BC, Yun YH et al (2015) ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. J Cheminform 7(1):60

  17. Sivakumar TV, Giri V, Park JH, Kim TY, Bhaduri A (2016) ReactPRED: a tool to predict and analyze biochemical reactions. Bioinformatics 32:3522–3524

  18. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM (2009) Small Molecule Subgraph Detector (SMSD) toolkit. J Cheminform 1(1):12

  19. Rahman SA, Cuesta SM, Furnham N, Holliday GL, Thornton JM (2014) EC-BLAST: a tool to automatically search and compare enzyme reactions. Nat Methods 11(2):171–174

  20. Rahman SA, Torrance G, Baldacci L, Cuesta SM, Fenninger F, Gopal N et al (2016) Reaction Decoder Tool (RDT): extracting features from chemical reactions. Bioinformatics 32(13):2065–2066

  21. Rostkowski M, Spjuth O, Rydberg P (2013) WhichCyp: prediction of cytochromes P450 inhibition. Bioinformatics 29(16):2051–2052

  22. Carlsson L, Spjuth O, Adams S, Glen RC, Boyer S (2010) Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinform 11(1):362

  23. Wolf S, Schmidt S, Müller-Hannemann M, Neumann S (2010) In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinform 11(1):148

  24. Southan C, Sharman JL, Benson HE, Faccenda E, Pawson AJ, Alexander SPH et al (2016) The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands. Nucleic Acids Res 44(D1):D1054–D1068

  25. Placzek S, Schomburg I, Chang A, Jeske L, Ulbrich M, Tillack J et al (2017) BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic Acids Res 45(D1):D380–D388

  26. Ruusmann V, Sild S, Maran U (2015) QSAR DataBank repository: open and linked qualitative and quantitative structure activity relationship models. J Cheminform 7(1):35

  27. Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J et al (2007) Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinform 8(1):59

  28. Spjuth O, Alvarsson J, Berg A, Eklund M, Kuhn S, Mäsak C et al (2009) Bioclipse 2: a scriptable integration platform for the life sciences. BMC Bioinform 10(1):397

  29. Jeliazkova N, Jeliazkov V (2011) AMBIT RESTful web services: an implementation of the OpenTox application programming interface. J Cheminform 3(1):1–18

  30. Jeliazkova N, Kochev N (2011) AMBIT-SMARTS: efficient searching of chemical structures and fragments. Mol Inform 30(8):707–720

  31. Kochev NT, Paskaleva VH, Jeliazkova N (2013) Ambit-Tautomer: an open source tool for tautomer generation. Mol Inform 32(5–6):481–504

  32. Marth CJ, Gallego GM, Lee JC, Lebold TP, Kulyk S, Kou KGM et al (2015) Network-analysis-guided synthesis of weisaconitine D and liljestrandinine. Nature 528(7583):493–498

  33. Segler MHS, Waller MP (2017) Modelling chemical reasoning to predict and invent reactions. Chem. Eur. J. 23:6118–6128

  34. Alvarsson J, Lampa S, Schaal W, Andersson C, Wikberg JES, Spjuth O (2016) Large-scale ligand-based predictive modelling using support vector machines. J Cheminform. 8(1):39

  35. Clark A, Sarker M, Ekins S (2014) New target prediction and visualization tools incorporating open source molecular fingerprints for TB Mobile 2.0. J Cheminform 6(1):38

  36. Cannon E, Mitchell JBO (2006) Classifying the World Anti-Doping Agency’s 2005 prohibited list using the Chemistry Development Kit fingerprint. In: Berthold MR, Glen R, Fischer I (eds) Computational life sciences II. vol. 4216 of Lecture Notes in Computer Science. Springer, Berlin, pp 173–182

  37. Spjuth O, Berg A, Adams S, Willighagen EL (2013) Applications of the InChI in cheminformatics with the CDK and Bioclipse. J Cheminform 5(1):14

  38. May JW, Steinbeck C (2014) Efficient ring perception for the Chemistry Development Kit. J Cheminform 6(1):3

  39. May JW (2014) Mischievous SMARTS Queries. http://efficientbits.blogspot.co.uk/2014_03_01_archive.html

  40. May JW (2015) Cheminformatics for genome-scale metabolic reconstructions. University of Cambridge. https://www.repository.cam.ac.uk/handle/1810/246652

  41. Karapetyan K, Batchelor C, Sharpe D, Tkachenko V, Williams A (2015) The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets. J Cheminform 7:30

  42. Faulon JL, Visco J, Donald P, Pophale RS (2003) The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci 43(3):707–720

  43. Faulon JL, Collins MJ, Carr RD (2004) The signature molecular descriptor. 4. Canonizing molecules using extended valence sequences. J Chem Inf Comput Sci 44(2):427–436

  44. Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JES et al (2014) Ligand-Based target prediction with signature fingerprints. J Chem Inf Model 54(10):2647–2653

  45. Spjuth O, Eklund M, Ahlberg Helgee E, Boyer S, Carlsson L (2011) Integrated decision support for assessing chemical liabilities. J Chem Inf Model 51(8):1840–1847

  46. Moghadam BT, Alvarsson J, Holm M, Eklund M, Carlsson L, Spjuth O (2015) Scaling predictive modeling in drug development with cloud computing. J Chem Inf Model 55(1):19–25

  47. Alvarsson J, Eklund M, Andersson C, Carlsson L, Spjuth O, Wikberg JES (2014) Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model 54(11):3211–3217

  48. Spjuth O, Carlsson L, Alvarsson J, Georgiev V, Willighagen E, Eklund M (2012) Open source drug discovery with bioclipse. Curr Top Med Chem 12(18):1980–1986

  49. Norinder U, Ek ME (2013) QSAR investigation of NaV1.7 active compounds using the SVM/signature approach and the bioclipse modeling platform. Bioorg Med Chem Lett 23(1):261–263

  50. Clark AM (2010) Basic primitives for molecular diagram sketching. J Cheminform 2(1):8

  51. Clark AM (2013) Rendering molecular sketches for publication quality output. Mol Inform 32(3):291–301

  52. Helson HE (2007) Structure diagram generation. Wiley, Oxford

  53. Rojas-Chertó M, Kasper PT, Willighagen EL, Vreeken RJ, Hankemeier T, Reijmers TH (2011) Elemental composition determination based on MSn. Bioinformatics 27(17):2376–2383

  54. Pluskal T, Uehara T, Yanagida M (2012) Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal Chem 84(10):4396–4403

  55. Pluskal T, Castillo S, Villar-Briones A, Orešič M (2010) MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinform 11(1):1–11

  56. Dührkop K, Shen H, Meusel M, Rousu J, Böcker S (2015) Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci 112(41):12580–12585

  57. Böcker S, Letzel MC, Lipták Z, Pervukhin A (2009) SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25(2):218–224

  58. Martello S, Toth P (1990) Knapsack problems: algorithms and computer implementations. Wiley, New York

  59. Dührkop K, Ludwig M, Meusel M, Böcker S (2013) Faster mass decomposition. In: Proceedings of workshop on algorithms in bioinformatics (WABI 2013). Springer, pp 45–58. http://arxiv.org/abs/1307.7805

  60. Böcker S, Lipták Z, Martin M, Pervukhin A, Sudek H (2008) DECOMP from interpreting mass spectrometry peaks to solving the money changing problem. Bioinformatics 24(4):591–593

  61. Böcker S, Lipták Z (2005) Efficient mass decomposition. In: Proceedings of the 2005 ACM symposium on applied computing. ACM, pp 151–157

  62. Kind T, Fiehn O (2007) Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinform 8(1):1–20

  63. Zhang M, Zhang Z, Chen C, Lu H, Liang Y (2016) Parallel formula generator based on branch-and-bound algorithm for elucidating high resolution mass spectra. Chemometr Intell Lab Syst 153:106–109

  64. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y et al (2016) Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat Biotechnol 34(8):828–837

  65. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36

  66. May JW (2013) Beam. GitHub. https://github.com/johnmay/beam

  67. ChemAxon Extended SMILES. http://onlinelibrarystatic.wiley.com/marvin/help/formats/cxsmiles-doc.html

  68. May JW (2013) All the small things. http://efficientbits.blogspot.co.uk/2013/10/all-small-things.html

  69. May JW (2013) Improved substructure matching. http://efficientbits.blogspot.co.uk/2013/11/improved-substructure-matching.html

  70. Berger F, Flamm C, Gleiss PM, Leydold J, Stadler PF (2004) Counterexamples in chemical ring perception. J Chem Inf Comput Sci 44(2):323–331

  71. Figueras J (1996) Ring perception using breadth-first search. J Chem Inf Comput Sci 36(5):986–991

  72. Daylight Chemical Information Systems Inc. http://www.daylight.com

  73. Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA et al (1992) Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comput Sci 32(3):244–255

  74. CTfile Formats. http://accelrys.com/products/collaborative-science/biovia-draw/ctfile-no-fee.html

  75. Gushurst AJ, Nourse JG, Hounshell WD, Leland BA, Raich DG (1991) The substance module: the representation, storage, and searching of complex structures. J Chem Inf Comput Sci 31(4):447–454

  76. Krause S, Willighagen E, Steinbeck C (2000) JChemPaint—using the collaborative forces of the internet to develop a free editor for 2D chemical structures. Molecules 5(1):93–98

  77. Willighagen E, Howard M (2007) Fast and scriptable molecular graphics in web browsers without Java3D. Nature Precedings. doi:10.1038/npre.2007.50.1

  78. Hanson RM (2010) Jmol—a paradigm shift in crystallographic visualization. J Appl Crystallogr 43:1250–1260

  79. Linux kernel, Version numbering. https://en.wikipedia.org/wiki/Linux_kernel#Version_numbering

  80. Willighagen EL (2011) Groovy Cheminformatics with the Chemistry Development Kit. 1.4.1-0 ed. Figshare. https://doi.org/10.6084/m9.figshare.2057790.v1

  81. Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N et al (2013) The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 41(D1):D456.

  82. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M et al (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42(D1):D1083.

  83. Dalke A (2013) The FPS fingerprint format and chemfp toolkit. J Cheminform 5(1):P36.

  84. O’Boyle NM, Sayle RA (2016) Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform 8(1):36.

  85. Authors (2015) https://github.com/cdk/cdk/blob/master/pom.xml

  86. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754

  87. Hall LH, Kier LB (1995) Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Comput Sci 35:1039–1045

  88. Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525

  89. Vidal D, Thormann M, Pons M (2005) LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J Chem Inf Model 45(2):386–393

  90. PubChem Substructure Fingerprint v1.3. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt [cited Friday 4 July 2014]

  91. Murray-Rust P, Rzepa HS (2011) CML: Evolution and design. J Cheminform 3(1):44

  92. Ihlenfeldt WD, Gasteiger J (1994) Hash codes for the identification and classification of molecular structure elements. J Comput Chem 15(8):793–813

  93. Hicklin J, Moler C, Webb P, Boisvert RF, Miller B, Pozo R et al (2012) JAMA: a Java Matrix Package. http://math.nist.gov/javanumerics/jama/

Authors' contributions

All authors wrote and contributed source code or documentation to the CDK library. Some authors have peer-reviewed source code for the library. ELW, JWM, RG, and CS are project leaders. All authors have contributed to the content of this paper. All authors read and approved the final manuscript.

Acknowledgements

The authors acknowledge the many people who have made contributions, small and large, to the CDK library. A full list of contributors can be found in the Maven parent POM [85]. OS acknowledges support from the Swedish strategic research programs eSSENCE and Swedish e-Science Research Center (SeRC). TP is a Simons Foundation Fellow of the Helen Hay Whitney Foundation. We also thank K. Dührkop for his contributions during the writing of this paper.

Competing interests

JWM and NJ work for companies that sell solutions based on the CDK. ELW sells a book describing the CDK functionality.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Egon L. Willighagen.

Additional information

The original version of this article was revised. It was noticed that the graphical abstract was not included as requested.

An erratum to this article is available at https://doi.org/10.1186/s13321-017-0231-1.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Willighagen, E.L., Mayfield, J.W., Alvarsson, J. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 9, 33 (2017). https://doi.org/10.1186/s13321-017-0220-4

  • DOI: https://doi.org/10.1186/s13321-017-0220-4

Keywords