Monday, 23 January 2017
Let's say we want to compare N methods, and we have their scores for 10 diverse datasets. Because they are diverse, we think to ourselves that it wouldn't be an accurate evaluation if we used the scores directly, e.g. taking the mean scores or some such, because some of the datasets are easy (all of the scores are high but spread out) and some are hard (all of the scores are low, but close together). As I reference in the paper, Bob Sheridan has discussed this problem, and among other solutions suggested using the mean rank.
So here's the thing. Consider two methods, A and B, where A outperforms B on 9 out of the 10 datasets, i.e. A has rank 1 nine times and rank 2 once, while vice versa for B. So the mean rank for A is 1.1 while that for B is 1.9, implying that A is better than B. So far so good.
Now let's suppose I start tweaking the parameters of B to generate 9 additional methods (B1 to B9). Unfortunately, while they have similar performance to B, they are all slightly worse on each dataset. So the rank order for 9 of the 10 datasets is now A > B > B1...B9, and that for the 10th dataset is B > B1...B9 > A. So A has rank 1 nine times (as before) and rank 11 once (giving a mean rank of 2.0), while B has the same ranks as before (a mean rank of 1.9). So now B appears to be better than A.
Wait a second. Neither A nor B has changed, so how can our evaluation of their relative performance have changed? And therein lies the problem.
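To make the arithmetic concrete, here is a minimal sketch that reproduces the rank reversal. The function name mean_ranks and the way the orderings are encoded (one best-first list per dataset) are my own choices for illustration, not anything taken from the paper.

```python
def mean_ranks(orderings):
    """Return the mean rank of each method.

    orderings: one list per dataset, with methods listed best-first.
    """
    totals = {}
    for order in orderings:
        for rank, method in enumerate(order, start=1):
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(orderings) for method, total in totals.items()}


# Two methods: A beats B on 9 datasets, B beats A on the 10th.
two_methods = [["A", "B"]] * 9 + [["B", "A"]]

# Add nine variants of B, each slightly worse than B on every dataset.
variants = ["B%d" % i for i in range(1, 10)]
eleven_methods = [["A", "B"] + variants] * 9 + [["B"] + variants + ["A"]]

before = mean_ranks(two_methods)
after = mean_ranks(eleven_methods)

print("before: A=%.1f B=%.1f" % (before["A"], before["B"]))
print("after:  A=%.1f B=%.1f" % (after["A"], after["B"]))
```

Nothing about A or B differs between the two runs; only the set of other methods being ranked has changed.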
So what's the solution? Even a cat knows the answer: as I stated at the start, on 9 out of 10 datasets A outperformed B. So A is better than B, or A>B. It's as simple as that. We don't need to calculate the mean anything. Our goal is not to find a summary value for the overall performance for a method, and then to compare those. It is to evaluate relative performance, and for diverse datasets each dataset has an equal vote towards that answer. So on dataset 1 (taking a different imaginary dataset), each method dukes it out giving A>B, A>C, A>D, B>C, B>D, C=D. Then round 2, on dataset 2. Toting up the values over all datasets will yield something like A>B 10 times, while B>A 5 times, and A=B twice, so A>B overall (and similar results for other pairwise comparisons).
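The tallying step might look something like the following sketch. The scores are invented purely for illustration (the first dataset is set up to give the A>B, A>C, A>D, B>C, B>D, C=D outcome described above), and pairwise_tally and overall are hypothetical names. Note that overall just compares win counts; it makes no attempt to assess significance.

```python
from itertools import combinations


def pairwise_tally(datasets):
    """Count wins and ties for every pair of methods.

    datasets: one dict per dataset mapping method name -> score
              (higher is better).
    Returns {(a, b): [a_wins, b_wins, ties]} with a < b alphabetically.
    """
    tally = {}
    for scores in datasets:
        for a, b in combinations(sorted(scores), 2):
            counts = tally.setdefault((a, b), [0, 0, 0])
            if scores[a] > scores[b]:
                counts[0] += 1
            elif scores[a] < scores[b]:
                counts[1] += 1
            else:
                counts[2] += 1
    return tally


def overall(tally):
    """Reduce each pair's tally to '>', '<' or '=' by comparing win counts."""
    verdicts = {}
    for pair, (a_wins, b_wins, _ties) in tally.items():
        if a_wins > b_wins:
            verdicts[pair] = ">"
        elif a_wins < b_wins:
            verdicts[pair] = "<"
        else:
            verdicts[pair] = "="
    return verdicts


# Made-up scores for three imaginary datasets.
datasets = [
    {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.5},
    {"A": 0.4, "B": 0.3, "C": 0.35, "D": 0.1},
    {"A": 0.6, "B": 0.7, "C": 0.2, "D": 0.3},
]

tally = pairwise_tally(datasets)
verdicts = overall(tally)
for (a, b), symbol in sorted(verdicts.items()):
    print(a, symbol, b, tally[(a, b)])
```

Because each pair is compared only with each other, adding or removing other methods cannot change the verdict for A versus B.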
The nice thing about all of these pairs is that you can then plot a Hasse diagram to summarise the info. There are some niggly details like handling incomparable values (e.g. A>B, B>C, C>A) but when it's all sorted out, the resulting diagram is quite information-rich.
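For the curious, the edges of a Hasse diagram are just the "better than" relations that are not already implied by a chain of other relations. The sketch below is one simple way to find them, not necessarily how I did it for the paper; hasse_edges is a hypothetical name, and the code assumes the cycles mentioned above have already been resolved.

```python
def transitive_closure(pairs):
    """Return all (x, y) such that x > y follows from the given pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def hasse_edges(pairs):
    """Keep only relations x > y with no z such that x > z > y.

    Assumes the relations contain no cycles (e.g. A>B, B>C, C>A).
    """
    closure = transitive_closure(pairs)
    nodes = {x for pair in closure for x in pair}
    return {
        (a, b)
        for a, b in closure
        if not any((a, z) in closure and (z, b) in closure for z in nodes)
    }


# Overall results where C and D are tied, and hence not comparable.
better_than = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}
edges = hasse_edges(better_than)
print(sorted(edges))
```

Here A>C and A>D are dropped because they already follow from A>B together with B>C and B>D, leaving A at the top, B in the middle, and C and D side by side at the bottom.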
I didn't mention statistical significance, but the assessment of whether you can say A>B overall requires a measurement of significance. In my case, I could generate multiple datasets from the same population, and so I used a basic t-test over the final results from each of the datasets (and corrected for multiple testing).
I've gone on and gone on a bit about what's really quite a simple idea. In fact, it's so simple that I was sure it must be how people already carry out performance evaluations across diverse datasets. However, I couldn't find any evidence of this, nor could I find any existing description of this method. Any thoughts?
Saturday, 14 January 2017
So let's fix the world.
What is written down (i.e. on the Daylight website, the OpenSMILES spec) is the following:
1. For atoms in brackets, the number of hydrogens is either listed or zero, e.g. [Na] has 0 hydrogens, [NaH5] has 5.
2. For atoms outside brackets (which means they must be in the so-called 'organic subset'), if they are not written lowercase (indicating aromaticity), the number of hydrogens is described by the SMILES valence model (see the OpenSMILES specification for the details), e.g. the carbons in CC have 3 implicit hydrogens.
What is not written down (apart from references to pyrrole-/pyridine-type hydrogens) is what to do about (unbracketed) aromatic atoms when reading. These should be handled as follows (as far as I can tell - see also John's investigations below):
1. For elements other than carbon, there are zero implicit hydrogens. (The corollary is that if there were a hydrogen present, it would be bracketed.)
2. For carbon, apply the valence model of rounding up to 3 explicit neighbours. In practice, this means 1 hydrogen if there are 2 neighbours, and 0 otherwise (i.e. for 3 neighbours).
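The two rules for unbracketed aromatic atoms are small enough to write down as a sketch. This is my reading of the rules above rather than code from any toolkit: implicit_h_aromatic is a hypothetical function, and I am assuming the caller has already counted the explicit neighbours of the atom.

```python
def implicit_h_aromatic(symbol, explicit_neighbours):
    """Implicit hydrogen count for an unbracketed aromatic atom.

    symbol: the lowercase SMILES symbol, e.g. "c", "n", "o", "s".
    explicit_neighbours: number of atoms explicitly bonded to this atom.
    """
    if symbol != "c":
        # Rule 1: only carbon gets implicit hydrogens; a hydrogen on any
        # other aromatic element would have to be written in brackets.
        return 0
    # Rule 2: round carbon up to 3 neighbours.
    return max(0, 3 - explicit_neighbours)


print(implicit_h_aromatic("c", 2))  # ring carbon with two ring neighbours
print(implicit_h_aromatic("c", 3))  # ring carbon bearing a substituent
print(implicit_h_aromatic("n", 2))  # ring nitrogen written without brackets
```

Note that the function needs only the symbol and the neighbour count, with no perception of aromatic systems and no kekulisation.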
Probably because of the lack of clarity around these latter rules, not every writer or reader follows them, and Wikipedia in particular is rife with generated SMILES from
This raises the question as to what to do when presented with, for example, the SMILES c1cncc1. A correct SMILES for pyrrole would be c1c[nH]cc1. As written, the first SMILES is not kekulisable according to the Daylight aromaticity model (a neutral 'n' without a hydrogen must have a double bond, or alternatively, radicals contribute only a single electron) but one could infer that the structure intended was pyrrole. This is a slippery slope, though, once you consider aromatic rings with multiple nitrogens where it may not be possible to unambiguously assign hydrogens. Also, I would argue that it is not the job of a reader to change the structure because "it knows best".
For related reading, see John Mayfield's posts on SMILES implicit valence of aromatic atoms and New SMILES behaviour - parsing (CDK 1.5.4). Also worth noting is that at no point in identifying hydrogen locations was determination of aromatic systems or kekulisation required. Of course, if you go down the route of editing erroneous structures that may be a different story.
Image credit: Licensed CC-BY-NC by Sean Davis (image on Flickr)
Friday, 30 December 2016
As a simple example, let's take the molecule represented by the SMILES F[C@@H](Br)Cl. I want to write a SMARTS pattern that matches this, as well as all superstructures. In this context, given that the halogens are typically monovalent, such superstructures are replacements of the hydrogen by arbitrary R groups.
Now, you may be aware that every SMILES string is also a valid SMARTS pattern. Unfortunately, it is also true that this is rarely the SMARTS pattern that you want. In this particular case, the original SMILES string, when interpreted as a SMARTS query, requires that the C has exactly 1 hydrogen attached. In other words, it won't match any superstructures (except for the elusive 5-valent carbon).
So let's leave out the H to give F[C@@](Br)Cl. This cannot be read as SMILES (at least not without warnings) since a chiral carbon requires four neighbours, but it is a valid SMARTS pattern. The question is what does it match?
The answer is more subtle than I, at least, expected. It will only match molecules that correspond to the following pseudosmiles, F[C@@](X)(Br)Cl, where X is anything including an implicit H.
Equally relevant is what it won't match. If X is F, thereby losing the chirality, then you are out of luck, but I would consider that a perfectly reasonable superstructure. And following on from this, it also won't match any other cases where the stereo is not defined at the carbon.
So in the end, I have come to the view that the best SMARTS pattern to use is F[C@@?](Br)Cl, which also matches the case where stereo is undefined. Better to cast the net wide and if someone really doesn't want to match those cases it is easy to do a search-and-replace.
Friday, 23 September 2016
I'm pleased to announce that Open Babel 2.4.0 has finally been released.
This release represents a major update and should be a stable upgrade, strongly recommended for all users.
We intend to move to an annual major release every September, with bug fix releases as needed.
A sample of major new features:
- Integration of the confab conformer generator
- Improved partial charge models, including EEM methods, EQeq
- ECFP radial fingerprints
- Initial support for ring rotamer / conformer sampling
- Improved GAFF atom typing and parameterization
- New PHP scripting bindings
- Many new file formats, features and bug fixes
For a full list of changes and to download:
Thanks to a cast of many for this release, including:
Alexandr Fonari, Anders Steen Christensen, Andreas Kempe, arkose, Benoit Leblanc, Björn Grüning, Casper Steinmann, Chris Morley, Christoph Willing, Craig James, Dagmar Lenk, David Hall, David Koes, David Lonie, David van der Spoel, Dmitriy Fomichev, Fulvio Ciriaco, Fredrik Wallner, Geoff Hutchison, Heiko Becker, Itay Zandbank, Jean-Noel Avila, Jeff Janes, Joaquin Peralta, Joshua Swamidass, Julien Nabet, Karol Langner, Karthik Rajagopalan, Katsuhiko Nishimra, Kevin Horan, Kirill Okhotnikov, Lee-Ping, Matt Harvey, Maciej Wójcikowski, Marcus Hanwell, Mathias Laurin, Matt Swain, Mohamad Mohebifar, Mohammad Ghahremanpour, Noel O'Boyle, Patrick Avery, Patrick Fuller, Paul van Maaren, Peng Bai, Philipp Thiel, Reinis Danne, Roger Sayle, Ronald Cohen, Scott McKechnie, Stefano Forli, Steve Roughley, Steffen Moeller, Tim Vandermeersch, Tomas Racek, Tomáš Trnka, Tor Colvin, Torsten Sachse, Yi-Shu Tu, Zhixiong Zhao
On a personal note, in particular I'd like to thank Stefano, Steve and Steffen for contributing; not to mention Matt, Matt, Mathias and Maciej and David, David, David and David. Without them, ordering the authors by first name would not have so richly paid off.
As Chris Morley has taken a step back from the project, I did the Windows release for the first time. One change is that now we have a 64-bit version along with the 32-bit. I've also moved to supporting "pip install openbabel" as the primary means of installing the Python bindings - there's already been some work on this for Mac/Linux by Matt Swain and others.
If you have any comments/criticisms or need help, the best place to go is to our mailing list (firstname.lastname@example.org) or file bugs on Github (click the green "New Issue").
Sunday, 21 August 2016
You see, John had worked me over. At the start, I thought of a PDF as the bad boy of the journal publishing scene, the hamburger and not the cow. What appears at first as text arranged into sentences, is just a haphazard arrangement of glyphs which through some trickery of the eye coalesces into scientific discourse. To generate a PDF myself would be to add to this madness.
But the thing is, when you strip a PDF down to its essentials, it's a relative of a PostScript file (details omitted due to ignorance), a vector graphics format. A more popular vector graphics format is an SVG file, but this is not supported by most publishers and so I spend a lot of time calculating DPI and inches per column and then generating a PNG. But they do often support PDFs, and these can readily be generated (with a bit of care) from many different programs. And all other things being equal, the best quality images will be generated by providing a vector graphics format as the publisher can resize it without any loss of quality.
Below I provide details about how I generated the PDFs, but let's look at their handling by Journal of Cheminformatics. This journal provides three views of the paper: an HTML page, an ePUB (which I won't discuss further) and a PDF. The HTML version contains embedded PNGs; they are a little small for my taste (maybe my fault - I don't know) but they are readable. So somehow they were able to convert the PDFs to images of whatever size they wanted for the HTML page. The PDF is a bit more interesting, as the images are now included as vector graphics. That is, if you keep zooming in on an image in the PDF, the lines remain sharp (in contrast to the PNGs in the HTML version).
So, in short, there seems little downside to providing PDFs, and much to gain. I'd be interested in hearing the viewpoint of anyone involved with the publishing side of things.
1. When using matplotlib to generate graphs, just give the file a .pdf extension, e.g. plt.savefig("overallperformance_%s.pdf" % benchmark, dpi=300)
2. When using Inkscape, save as PDF.
3. The hardest part was the chemical structures. I tried a variety of recipes with two different commercial programs. In the end, although ChemDraw's SVG export had the heteroatoms all over the place, the EMF export was openable by Inkscape and then I could save as PDF. (Apparently you can go direct to PDF from ChemDraw on a Mac.)
Tuesday, 2 August 2016
- Lilly MedChem Rules - Strictly speaking, this is not obviously a toolkit, but a commandline Ruby application. However, there is a C++ cheminf toolkit sitting behind that application, which was developed by Ian Watson at Eli Lilly.
Let me know if I've missed anything. For a more comprehensive overview of Open Source Molecular Modelling see the very recent paper by Pirhadi et al, which has an associated Github repo for keeping the information up to date.
Monday, 18 July 2016
Yes, that's right, you have no clue what I'm talking about. What I'm saying is, why not bung in an ASCII depiction of the molecule in a property field? Well, apart from it being a bonkers idea that'll bloat the SD file to hitherto unimagined sizes (but think of the improved compression!), I can't think of any reason not to do this. It is my belief that this could finally unleash the untapped potential of ASCII depiction. And so I've added an option to the SD file writer in Open Babel to do exactly this.
John is fond of quoting Jurassic Park's "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should." I don't know why I just mentioned that.