Guest Commentary: Data visualization--New directions or just familiar routes?

Data visualization tools make it very easy to represent our data graphically and present it in a way that clearly communicates patterns and trends. But there is a risk that, in practice, visualizations are used to confirm or justify our own hypotheses and biases. Can data visualizations instead bring to light patterns in our data, drive new hypotheses and show us things we weren’t expecting?

Given the efficiency with which we can process visual information, it is easy to explain the appeal of data visualization. At its best, a visualization can highlight patterns which numerical analyses might otherwise miss. Anscombe’s quartet (Anscombe, 1973) is a good example: four data sets which have nearly identical summary statistics, but which show very different relationships when visualized. This sanity check can be invaluable, and yet it should be remembered that inappropriate choices of plot type, axis scale or direction, and color can result in visualizations which are uninformative or misleading at first glance. Our ability to spot visual patterns quickly can work against us when an inappropriate visualization is presented, whether or not the creator was deliberately attempting to mislead us.
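To make the point concrete, the short Python sketch below computes summary statistics for the four data sets published in Anscombe (1973). Only the data values come from that paper; the names `QUARTET`, `pearson_r` and `summarize` are our own choices for illustration.

```python
from statistics import mean, variance

# Data values from Anscombe (1973); sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(xs, ys):
    """Return the summary statistics usually quoted for the quartet."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"var_x={s['var_x']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f}")
```

The four rows of output are nearly identical, so the differences between the data sets only become apparent when each one is plotted.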

If we consider a drug discovery project for which we have measured potency data, a common question to ask might be “On which compounds should I focus my attention?” We will illustrate this using an example set of 264 compounds, representing six different chemical series, for which 5-HT1A activities have been measured. The examples in Figure 1 show the importance of choosing an appropriate way to represent the range of potencies across the chemistries. The simple histogram shown in A gives a good overview but tells us nothing about how the potencies are distributed across the chemical series until we introduce some color, as in B. Even this isn’t very clear, because the number of compounds differs between chemical series. In the two-dimensional histograms in C and D, by contrast, the height of each bar shows the average potency of a chemical series. The choice of y-axis scale also influences our view of the data: in C the chemical series look almost identical, whereas in D there appear to be significant differences. But what is the “right” range? In E, having chosen a range based on potency levels we might consider to span “inactive to very active,” we can also see the importance of adding error bars to give some indication of the distribution of potencies. This highlights the value of representing these data using a box plot instead, as in F. Now it becomes clear that each series contains some potent compounds, but the indole-3-alkylamines certainly appear to be the most active.

Of course, potency is just one of the properties we would need to consider in order to identify a high-quality lead compound, and in our data set we also have predicted values for a number of typical absorption, distribution, metabolism and excretion (ADME) and physicochemical properties. While we could create a box plot for each property of interest, this would require us to look at and, most importantly, make sense of a large number of visualizations. We could attempt to put many dimensions of data into a single scatter plot (three dimensions plotted, with others represented by color, size, transparency, etc.), but unless there are some very obvious outliers, the result is likely to be very hard to interpret.

It is worth considering, therefore, the kind of information with which we typically make decisions in drug discovery. All of the properties used to analyze and select compounds are derived from models of the ultimate human patient in which we are interested, whether those models are in vivo, in vitro or in silico. All measured data, however accurate, will contain some degree of uncertainty due to experimental variability, while in silico models will contain some statistical error. As an example, a good root-mean-square error for an aqueous solubility prediction represented as logS (µM) is approximately 0.6. In practice, this means that a predicted logS of 1 (corresponding to 10 µM) suggests a fairly soluble compound, but all we know with 95-percent confidence (roughly two root-mean-square errors either side of the prediction) is that its actual aqueous solubility lies somewhere between 630 nM and 0.16 mM. Knowing that the real value lies towards one end of the range or the other might have a significant impact upon a decision we make about selecting this compound.
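The arithmetic behind that interval can be sketched in a few lines of Python. This is a minimal illustration which assumes that the prediction error is approximately Gaussian in log units and that a 95-percent interval spans about two root-mean-square errors either side of the prediction; the function name `solubility_interval_uM` and the parameter `z` are hypothetical.

```python
def solubility_interval_uM(log_s, rmse, z=2.0):
    """Approximate confidence interval for solubility, in micromolar.

    log_s : predicted log10 of solubility in micromolar
    rmse  : root-mean-square error of the model, in log units
    z     : half-width of the interval in multiples of rmse
            (assumption: z = 2 approximates 95-percent confidence)
    """
    lower = 10 ** (log_s - z * rmse)
    upper = 10 ** (log_s + z * rmse)
    return lower, upper


lower, upper = solubility_interval_uM(log_s=1.0, rmse=0.6)
print(f"{lower:.2f} uM to {upper:.0f} uM")
```

For a predicted logS of 1, the interval runs from about 0.63 µM (630 nM) to about 158 µM (0.16 mM), a span of roughly 250-fold around a nominal value of 10 µM.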

And this is just a single property. It is common to base compound selection decisions on criteria for multiple properties. At their simplest, these criteria might just be cutoffs, where we believe that an acceptable compound will have a property value on the “right” side of a threshold. When we consider the uncertainty around our data points, even a value on the “right” side of the threshold has some probability of being unacceptable. In some cases, poor property values with high uncertainties may even represent better opportunities for optimization than very accurate values which fall just short of the necessary criteria.
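A minimal sketch of this reasoning is shown below. It assumes Gaussian uncertainty around each value and a hypothetical criterion of logS greater than 1; the function name `prob_acceptable` and all of the numbers are invented for illustration and are not taken from the example data set.

```python
import math


def prob_acceptable(value, sigma, threshold):
    """Probability that the true value exceeds `threshold`, assuming the
    error on `value` is Gaussian with standard deviation `sigma`."""
    z = (value - threshold) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


THRESHOLD = 1.0  # hypothetical criterion: logS > 1

# A value on the "right" side of the threshold can still fail.
p_pass = prob_acceptable(1.3, 0.6, THRESHOLD)

# A poor but uncertain value versus an accurate value that falls just short.
p_uncertain = prob_acceptable(0.5, 0.6, THRESHOLD)
p_accurate = prob_acceptable(0.9, 0.05, THRESHOLD)

print(f"above threshold, uncertain: {p_pass:.2f}")
print(f"poor but uncertain:         {p_uncertain:.2f}")
print(f"accurate, just short:       {p_accurate:.2f}")
```

Under these assumptions, the value above the threshold has only about a 69 percent chance of truly being acceptable, while the poor but uncertain value (about 20 percent) has a better chance of meeting the criterion than the accurate value that falls just short (about 2 percent).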

Adding this information about uncertainty into any data visualization will improve its representation of the true nature of the data, but this comes at the cost of interpretability. Just adding error bars to our plots isn’t likely to solve this problem.

One approach to dealing with this is to use multiparameter optimization (MPO) to generate a score that encapsulates multiple properties. There are several approaches to MPO (Segall, 2012), but by using one that explicitly considers the uncertainty, we can significantly reduce the number of dimensions we need to visualize. Applying this to the set of 5-HT1A compounds, a single visualization can now represent all of the underlying data, giving a more comprehensive picture of which chemical series have the greatest potential. In Figure 2, the compounds have been scored and plotted from left to right in order of descending score. The error bars indicate where we can select between compounds with confidence, and the two highlighted series (arylpiperazines and aminotetralines), represented by green and pink points, dominate the left-hand side of the plot where the highest scores lie. This is despite these series not including the most potent compounds, as shown in the histograms below the representative compound displayed for each series.
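As a rough illustration of how uncertainty can be folded into a single score, the sketch below multiplies together the probabilities that each property meets its criterion. This is one simple scheme under stated assumptions (Gaussian, independent errors and hard thresholds), not necessarily the method described in Segall (2012) or used to produce Figure 2. The criteria, compound names and property values are all hypothetical.

```python
import math


def prob_meets(value, sigma, threshold, higher_is_better=True):
    """Probability that the true value is on the desired side of `threshold`,
    assuming a Gaussian error with standard deviation `sigma`."""
    z = (value - threshold) / sigma
    p_above = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return p_above if higher_is_better else 1.0 - p_above


# Hypothetical criteria: property -> (threshold, higher_is_better)
CRITERIA = {
    "pKi": (7.0, True),
    "logS": (1.0, True),
    "logP": (3.5, False),
}

# Hypothetical compounds: property -> (value, standard deviation)
COMPOUNDS = {
    "A": {"pKi": (8.5, 0.3), "logS": (0.4, 0.6), "logP": (4.2, 0.4)},
    "B": {"pKi": (7.4, 0.3), "logS": (1.8, 0.6), "logP": (2.5, 0.4)},
    "C": {"pKi": (6.8, 0.3), "logS": (2.0, 0.6), "logP": (2.0, 0.4)},
}


def score(compound):
    """Product of per-property probabilities (assumes independent errors)."""
    total = 1.0
    for name, (threshold, higher_is_better) in CRITERIA.items():
        value, sigma = compound[name]
        total *= prob_meets(value, sigma, threshold, higher_is_better)
    return total


SCORES = {name: score(props) for name, props in COMPOUNDS.items()}
RANKED = sorted(SCORES, key=SCORES.get, reverse=True)

for name in RANKED:
    print(f"{name}: {SCORES[name]:.3f}")
```

In this toy example the most potent compound, A, receives the lowest score because it is unlikely to meet the other two criteria, which mirrors the observation that the highest-scoring series need not contain the most potent compounds.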

The visualization of any complex data set is problematic, but even much smaller sets can present challenges. If we consider a subset of the 5-HT1A compounds, comprising the drug Buspirone and a small number of analogs, we might hope to determine relationships between their structures and two important properties, potency and metabolic stability. The goal in this case is to identify structure-activity relationships (SAR) in order to design a potent, stable compound. Carrying out an R-group analysis and creating an SAR table for each property (Figure 3 A and B), we can quickly see that the combinations of R-groups which result in the highest potency do not give the best stability, and vice versa (in both cases, the greenest circles represent the best values). On the other hand, if we create an activity neighborhood diagram (Figure 3 C), we can visualize the relationships between the compounds in a different way. A representative compound with average potency and stability values is placed at the center, and the other compounds are organized in a spiral with the most structurally similar closest to the middle. The cards are colored by their metabolic stabilities, with green being the highest. The links show the difference in potency, with green showing the greatest difference and the arrow showing the direction in which the potency increases. A green arrow pointing towards a green card, indicating a more potent and more stable compound, therefore stands out easily. This approach quickly highlights the original problem: simply moving a single functional group can change a compound from being “stable and not potent” to “unstable and potent.” However, the compound that stands out as both potent and stable, which is not easily apparent from the SAR tables, results from a combination of changes, an important point that might otherwise have been overlooked.

At each decision point, the choice of visualization can be pivotal. The way we perceive a compound depends upon those around it: Have we explored the surrounding chemical space thoroughly enough to adequately evaluate a series? Do we have data of sufficient quality to confidently distinguish the good compounds from the rest? When it comes to making decisions about compounds, it is often the relationships between compounds that will influence the way we choose the next compound. Any visualizations which simplify or hide these relationships have the potential to bias our perception of the data.

Edmund J. Champness is chief scientific officer of Optibrium Ltd. With a background in mathematics, he joined GlaxoWellcome in 1995, working as part of a pioneering team building predictive pharmaceutical tools. He developed the first graphical user interfaces for working with predictive models, which were adopted globally within GlaxoWellcome. He was a core member of the team which established the U.K. operation of Camitro in 2001 and remained with that company through a series of mergers and acquisitions (ArQule, Inpharmatica and BioFocus DPI) until 2008. During this time, he designed and built the StarDrop software and, in 2009, co-founded Optibrium.

References

Anscombe, F. (1973). Graphs in Statistical Analysis. American Statistician, 27(1), 17-21.
Segall, M. (2012). Multi-Parameter Optimization: Identifying high quality compounds with a balance of properties. Current Pharmaceutical Design, 18(9), 1292-1310.

http://www.optibrium.com/