Friday, 21 June 2013

Discussion meeting vs conference: in praise of slower science

Pompeii mosaic
Plato conversing with his students
As time goes by, I am increasingly unable to enjoy big conferences. I'm not sure how much it's a change in me or a change in conferences, but my attention span shrivels after the first few talks. I don't think I'm alone. Look around any conference hall and everywhere you'll see people checking their email or texting. I usually end up thinking I'd be better off staying at home and just reading stuff.

All this made me start to wonder, what is the point of conferences?  Interaction should be the key thing that a conference can deliver. I have in the past worked in small departments, grotting away on my own without a single colleague who is interested in what I'm doing. In that situation, a conference can reinvigorate your interest in the field, by providing contact with like-minded people who share your particular obsession. And for early-career academics, it can be fascinating to see the big names in action. For me, some of the most memorable and informative experiences at conferences came in the discussion period. If X suggested an alternative interpretation of Y's data, how did Y respond: with good arguments or with evasive arrogance? And how about the time that Z noted important links between the findings of X and Y that nobody had previously been aware of, and the germ of an idea for a new experiment was born?

I think my growing disaffection with conferences is partly fuelled by a decline in the amount and standard of discussion at such events. There's always a lot to squeeze in, speakers will often over-run their allocated time, and in large meetings, meaningful discussion is hampered by the acoustic limitations of large auditoriums. And there's a psychological element too: many people dislike public discussion, and are reluctant to ask questions for fear of seeming rude or self-promotional (see comments on this blogpost for examples). Important debate between those doing cutting-edge work may take place at the conference, but it's more likely to involve a small group over dinner than those in the academic sessions.

Last week, the Royal Society provided the chance for me, together with Karalyn Patterson and Kate Nation, to try a couple of different formats that aimed to restore the role of discussion in academic meetings. Our goal was to bring together researchers from two fields that were related but seldom made contact: acquired and developmental language disorders. Methods and theories in these areas have evolved quite separately, even though the phenomena they deal with overlap substantially.

The Royal Society asks for meeting proposals twice a year, and we were amazed when they not only approved our proposal, but suggested we should have both a Discussion Meeting at the Royal Society in London, and a smaller Satellite meeting at their conference centre at Chicheley Hall in the Buckinghamshire countryside.

We wanted to stimulate discussion, but were aware that if we just had a series of talks by speakers from the two areas, they would probably continue as parallel, non-overlapping streams. So we gave them explicit instructions to interact. For the Discussion meeting, we paired up speakers who worked on similar topics with adults or children, and encouraged them to share their paper with their "buddy" before the meeting. They were asked to devote the last 5-10 minutes of their talk to considering the implications of their buddy's work for their own area. We clearly invited the right people, because the speakers rose to this challenge magnificently. They were also remarkable in that all of them kept to their allotted 30 minutes, allowing adequate time for discussion. And the discussion really did work: people seemed genuinely fired up to talk about the implications of the work, and the links between speakers, rather than scoring points off each other.

After two days in London, a smaller group of us, feeling rather like a school party, were wafted off to Chicheley in a special Royal Society bus. Here we were going to be even more experimental in our format. We wanted to focus more on early-career scientists, and thanks to generous funding from the Experimental Psychology Society, we were able to include a group of postgrads and postdocs. The programme for the meeting was completely open-ended. Apart from a scheduled poster session, giving the younger people a chance to present their work, we planned two full days of nothing but discussion. Session 1 was the only one with a clear agenda: it was devoted to deciding what we wanted to talk about.

We were pretty nervous about this: it could have been a disaster. What if everyone ran out of things to say and got bored? What if one or two loud-mouths dominated the discussion? Or maybe most people would retire to their rooms and look at email. In fact, the feedback we've had concurs with our own impressions that it worked brilliantly. There were a few things that helped make it a success.
  • The setting, provided by the Royal Society, was perfect. Chicheley Hall is a beautiful stately home in the middle of nowhere. There were no distractions, and no chance of popping out to do a bit of shopping. The meeting spaces were far more conducive to discussion than a traditional lecture theatre.
  • The topic, looking for shared points of interest in two different research fields, encouraged a collaborative spirit, rather than competition.
  • The people were the right mix. We'd thought quite carefully about who to invite; we'd gone for senior people whose natural talkativeness was powered by enthusiasm rather than self-importance. People had complementary areas of expertise, and everyone, however senior, came away feeling they'd learned something.
  • Early-career scientists were selected from those applying, on the basis that their supervisor indicated they had the skills to participate fully in the experience. Nine of them were selected as rapporteurs, and were required to take notes in a break-out session, and then condense 90 minutes of discussion into a 15-minute summary for the whole group. All nine were quite simply magnificent in this role, and surpassed our expectations. The idea of rapporteurs was, by the way, stimulated by experience at Dahlem conferences, which pioneered discussion-based meetings, and subsequent Strüngmann forums, which continue the tradition.
  • Kate Nation noted that at the London meeting, the discussion had been lively and enjoyable, but largely excluded younger scientists. She suggested that for our discussions at Chicheley, nobody over the age of 40 should be allowed to talk for the first 10 minutes. The Nation Rule proved highly effective - occasionally broken, but greatly appreciated by several of the early career scientists, who told us that they would not have spoken out so much without this encouragement.
I was intrigued to hear from Uta Frith that there is a Slow Science movement, and I felt the whole experience fitted with their ethos: encouraging people to think about science rather than frenetically rushing on to the next thing. Commentary on this has focused mainly on the day-to-day activities of scientists and publication practices (Lutz, 2012). I haven't seen anything specifically about conferences from the Slow Science movement (and since they seem uninterested in social media, it's hard to find out much about them!), but I hope that we'll see more meetings like this, where we all have time to pause, ponder and discuss ideas.  

Reference
Lutz, J. (2012). Slow science. Nature Chemistry, 4(8), 588-589. doi: 10.1038/nchem.1415

Monday, 17 June 2013

Research fraud: More scrutiny by administrators is not the answer

I read this piece in the Independent this morning and an icy chill gripped me. Fraudulent researchers have been damaging Britain's scientific reputation and we need to do something. But what? Sadly, it sounds like the plan is to do what is usually done when a moral panic occurs: increase the amount of regulation.

So here is my, very quick, response – I really have lots of other things I should be doing, but this seemed urgent, so apologies for typos etc.

According to the account in the Independent, universities will not be eligible for research funding unless they sign up to a Concordat for Research Integrity which entails, among other things, that they "will have to demonstrate annually that each team member’s graphs and spreadsheets are precisely correct."

We already have massive regulation around the ethics of research on human participants that works on the assumption that nobody can be trusted, so we all have to do mountains of paperwork to prove we aren't doing anything deceptive or harmful. 

So, you will ask, am I in favour of fraud and sloppiness in research? Of course not. Indeed, I devote a fair part of my blog to criticisms of what I see as dodgy science: typically, not outright fraud, but rather over-hyped or methodologically weak work, which is, to my mind, a far greater problem. I agree we need to think about how to fix science, and that many of our current practices lead to non-replicable findings. I just don't think more scrutiny by administrators is the solution. To start scrutinising datasets is just silly: this is not where the problem lies.

So what would I do? The answers fall into three main categories: incentives, publication practices, and research methods.

Incentives is the big one. I've been arguing for years that our current reward system distorts and damages science. I won't rehearse the arguments again: you can read them here. The current Research Excellence Framework is, to my mind, an unnecessary exercise that further incentivizes researchers to avoid slow and careful work. My first recommendation is therefore that we ditch the REF and use simpler metrics to allocate research funding to universities, freeing up a great deal of time and money, and improving the security of research staff. Currently, we have a situation where research stardom, assessed by REF criteria, is all-important. Instead of valuing papers in top journals, we should be valuing research replicability.

Publication practices are problematic, mainly because the top journals prioritize exciting results over methodological rigour. There is therefore a strong temptation to do post hoc analyses of data until an exciting result emerges. Pre-registration of research projects has been recommended as a way of dealing with this - see this letter to the Guardian, of which I am a signatory. It might be even more effective if research funders adopted the practice of requiring researchers to specify the details of their methods and analyses in advance on a publicly-available database. And once the research was done, the publication should contain a link to a site where data are openly available for scrutiny – with appropriate safeguards about conditions for re-use.

As regards research methods, we need better training of scientists to become more aware of the limitations of the methods that they use. Too often statistical training is a dry and inaccessible discipline. All scientists should be taught how to generate random datasets: nothing is quite as good at instilling a proper understanding of p-values as seeing the apparent patterns in data that will inevitably arise if you look hard enough at some random numbers. In addition, not enough researchers receive training in best practices for ensuring quality of data entry, or in exploratory data analysis to check the numbers are coherent and meet assumptions of the analytic approach.

In my original post on expansion of regulators, I suggested that before a new regulation is introduced, there should be a cold-blooded cost-benefit analysis that considers, among other things, the cost of the regulation both in terms of the salaries of people who implement it, and the time and other costs to those affected by it. My concern is that among the 'other costs' is something rather nebulous that could easily get missed. Quite simply, doing good research takes time and mental space of the researchers. Most researchers are geeks who like nothing better than staring at data and thinking about complicated problems. If you require them to spend time satisfying bureaucratic requirements, this saps the spirit and reduces creativity.

I think we can learn much from the way ethics regulations have panned out. When a new system was first introduced in response to the Alder Hey scandal, I'm sure many thought it was a good idea. It has taken several years for the full impact to be appreciated. The problems are documented in a report by the Academy of Medical Sciences, which noted: "Urgent changes are required to the regulation and governance of health research in the UK because unnecessary delays, bureaucracy and complexity are stifling medical advances, without additional benefits to patient safety".

If the account in the Independent is to be believed, then the Concordat for Research Integrity could lead to a similar outcome. I'm glad I will retire before it is fully implemented.

Sunday, 16 June 2013

Overhyped genetic findings: the case of dyslexia

A press release by Yale University Press Office was recently recycled on the Research Blogging website*, announcing that their researchers had made a major breakthrough. Specifically they said "A new study of the genetic origins of dyslexia and other learning disabilities could allow for earlier diagnoses and more successful interventions, according to researchers at Yale School of Medicine. Many students now are not diagnosed until high school, at which point treatments are less effective." The breathless account by the Press Office is hard to square with the abstract of the paper, which makes no mention of early diagnosis or intervention, but rather focuses on characterising a putative functional risk variant in the DCDC2 gene, named READ1, and establishing its association with reading and language skills.

I've discussed why this kind of thing is problematic in a previous blogpost, but perhaps a figure will help. The point is that in a large sample you can have a statistically strong association between a condition such as dyslexia and a genetic variant, but this does not mean that you can predict who will be dyslexic from their genes.

Proportions with risk variants estimated from Scerri et al (2011)
In this example, based on one of the best-replicated associations in the literature, you can see that most people with dyslexia don't have the risk version of the gene, and most people with the risk version of the gene don't have dyslexia. The effect sizes of individual genetic variants can be very small even when the strength of genetic association is large.
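A back-of-envelope Bayes calculation makes the point concrete. The figures below are illustrative assumptions chosen to resemble the pattern in the figure; they are not the actual Scerri et al (2011) frequencies:

```python
# Illustrative assumptions only - NOT the actual Scerri et al (2011) values
prevalence = 0.07     # assumed population rate of dyslexia
p_risk_dys = 0.30     # assumed proportion of dyslexics carrying the risk variant
p_risk_ctrl = 0.15    # assumed proportion of non-dyslexics carrying it

# Bayes' rule: how likely is dyslexia, given that a child carries the variant?
p_carrier = prevalence * p_risk_dys + (1 - prevalence) * p_risk_ctrl
p_dys_given_carrier = prevalence * p_risk_dys / p_carrier

print(f"carriers in the population:     {p_carrier:.1%}")
print(f"P(dyslexia | carrier):          {p_dys_given_carrier:.1%}")
print(f"dyslexics who are NOT carriers: {1 - p_risk_dys:.0%}")
```

Even with a carrier rate that is doubled in the affected group, about 87% of carriers are unaffected and 70% of dyslexics lack the variant: a strong association in a large sample, but hopeless as a screening test.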

So what about the results from the latest Yale press release? Do they allow for more accurate identification of dyslexia on the basis of genes? In a word, no. I was pleased to see that the authors reported the effect sizes associated with the key genetic variants, which makes it relatively easy to estimate their usefulness in screening. In addition to identifying two sequences in DCDC2 associated with risk of language or reading problems, the authors noted an interaction with a risk version of another gene, KIAA0319, such that children with risk versions in both genes were particularly likely to have problems.  The relevant figure is shown here.

Update: 30th December 2014 - The authors have published an erratum indicating that Figure 3A was wrong. The corrected and original versions are shown below and I have amended conclusions in red.
Corrected Fig 3A from Powers et al (2013)

Original Fig 3A from Powers et al (2013)



There are several points to note from this plot, bearing in mind that dyslexia or SLI would normally only be diagnosed if a child's reading or language scores were at least 1.0 SD below average.
  1. For children who have either KIAA0319 or DCDC2 risk variants, but not both, the average score on reading and language measures is no more than 0.1 SD below average.
  2. For those who have both risk factors together, some tests give scores that are from 0.2 to 0.3 SD below average, but this is only a subset of the reading/language measures. On nonword reading, often used as a diagnostic test for dyslexia, there is no evidence of any deficit in those with both risk versions of the genes. On the two language measures, the deficit hovers around 0.15 SD below the mean.
  3. The tests that show the largest deficits in those with two risk factors are measures of IQ rather than reading or language. Even here, the degree of impairment in those with two risk factors together indicates that the majority of children with this genotype would not fall in the impaired range.
  4. The number of children with the two risk factors together is very small, around 2% of the population.
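Point 3 can be made concrete with a normal approximation. Assuming scores are normally distributed with SD 1, and taking the roughly 0.3 SD group deficit at face value, the proportion of double-risk children falling below the -1 SD diagnostic cutoff can be computed directly:

```python
from statistics import NormalDist

# Assumes normally distributed scores (SD = 1) and a 0.3 SD group deficit
cutoff = -1.0                              # diagnostic threshold: 1 SD below mean
base = NormalDist().cdf(cutoff)            # general population below cutoff
shifted = NormalDist(mu=-0.3).cdf(cutoff)  # group mean shifted down by 0.3 SD

print(f"general population below cutoff:   {base:.1%}")
print(f"double-risk genotype below cutoff: {shifted:.1%}")
```

That gives roughly 24% against a 16% base rate: an elevated risk, but over three-quarters of children with the double-risk genotype would still score above the cutoff.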
In sum, I think this is an interesting paper that might help us discover more about how genetic variation works to influence cognitive development by affecting brain function. The authors present the data in a way that allows us to appraise the clinical significance of the findings quite easily. However, the results indicate that, far from indicating translational potential for diagnosis and treatment, genetic effects are subtle and unlikely to be useful for this purpose.

*It is unclear to me whether the Yale University Press Office are actively involved in gatecrashing Research Blogging, or whether this is just an independent 'blogger' who is recycling press releases as if they are blogposts.

Reference
Powers, N., Eicher, J., Butter, F., Kong, Y., Miller, L., Ring, S., Mann, M., & Gruen, J. (2013). Alleles of a polymorphic ETV6 binding site in DCDC2 confer risk of reading and language impairment. The American Journal of Human Genetics. doi: 10.1016/j.ajhg.2013.05.008
Scerri, T. S., Morris, A. P., Buckingham, L. L., Newbury, D. F., Miller, L. L., Monaco, A. P., . . . Paracchini, S. (2011). DCDC2, KIAA0319 and CMIP are associated with reading-related traits. Biological Psychiatry, 70, 237-245. doi: 10.1016/j.biopsych.2011.02.005
 

Friday, 7 June 2013

Interpreting unexpected significant results

©www.cartoonstock.com
Here's a question for researchers who use analysis of variance (ANOVA). Suppose I set up a study to see if one group (e.g. men) differs from another (women) on brain response to auditory stimuli (e.g. standard sounds vs deviant sounds – a classic mismatch negativity paradigm). I measure the brain response at frontal and central electrodes located on two sides of the head. The nerds among my readers will see that I have here a four-way ANOVA, with one between-subjects factor (sex) and three within-subjects factors (stimulus, hemisphere, electrode location). My hypothesis is that women have bigger mismatch effects than men, so I predict an interaction between sex and stimulus, but the only result significant at p < .05 is a three-way interaction between sex, stimulus and electrode location. What should I do?

a) Describe this as my main effect of interest, revising my hypothesis to argue for a site-specific sex effect
b) Describe the result as an exploratory finding in need of replication
c) Ignore the result as it was not predicted and is likely to be a false positive

I'd love to do a survey to see how people respond to these choices; my guess is many would opt for a) and few would opt for c). Yet in this situation, the likelihood of the result being a false positive is very high – much higher than many people realise.   
Many people assume that if an ANOVA output is significant at the .05 level, there's only a one in twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather than numerous t-tests because ANOVA adjusts for multiple comparisons. But this interpretation is quite wrong. ANOVA adjusts for the number of levels within a factor, so, for instance, the probability of finding a significant effect of group is the same regardless of how many groups you have. ANOVA makes no adjustment to p-values for the number of factors and interactions in your design. The more of these you have, the greater the chance of turning up a "significant" result.
So, for the example given above, what is the probability of finding something significant at .05? The four-way ANOVA has 15 terms (four main effects, six 2-way interactions, four 3-way interactions and one 4-way interaction). If all the null hypotheses are true, the probability of finding no significant effect is .95^15 = .46. It follows that the probability of finding at least one significant result is 1 - .46 = .54.
And for a three-way ANOVA there are seven terms (three main effects, three 2-way interactions and one 3-way interaction), so the probability of at least one significant result is 1 - .95^7 = .30.
So, basically, if you do a four-way ANOVA, and you don't care what results come out, provided something is significant, you have a slightly greater than 50% chance of being satisfied. This might seem like an implausible example: after all, who uses ANOVA like this? Well, unfortunately, this example corresponds rather closely to what often happens in electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in comparing a clinical and a control group, and so some results are more interesting than others: the main effect of group, and the seven interactions with group, are the principal focus of attention. But hypotheses about exactly what will be found are seldom clearcut: excitement is generated by any p-value associated with a group term that falls below .05. There's a one in three chance that at least one of the terms involving group will have a p-value this low. This means that the potential for 'false positive psychology' in this field is enormous (Simmons et al, 2011).
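These familywise figures are easy to verify. Here is a quick sketch in Python (it treats the terms as independent, which is only approximately true for F ratios that share an error denominator):

```python
def p_at_least_one(n_terms, alpha=0.05):
    """Chance that at least one of n_terms independent tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_terms

print(f"4-way ANOVA, all 15 terms: {p_at_least_one(15):.2f}")  # 0.54
print(f"3-way ANOVA, all 7 terms:  {p_at_least_one(7):.2f}")   # 0.30
print(f"8 terms involving group:   {p_at_least_one(8):.2f}")   # 0.34, about 1 in 3
```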
A corollary of this is that researchers can modify the likelihood of finding a "significant" result by selecting one ANOVA design rather than another. Suppose I'm interested in comparing brain responses to standard and deviant sounds. One way of doing this is to compute the difference between ERPs to the two auditory stimuli and use this difference score as the dependent variable:  this reduces my ANOVA from a 4-way to a 3-way design, and gives fewer opportunities for spurious findings. So you will get a different risk of a false positive, depending on how you analyse the data.

Another feature of ERP research is that there is flexibility in how electrodes are handled in an ANOVA design: since there is symmetry in electrode placement, it is not uncommon to treat hemisphere as one factor, and electrode site as another. The alternative is just to treat electrode as a single repeated measure. This is not a neutral choice: the chance of spurious findings is greater with the first approach, simply because it adds a factor to the analysis, plus all the interactions with that factor.

I stumbled across these insights into ANOVA when I was simulating data using a design adopted in a recent PLOS One paper that I'd commented on. I was initially interested in looking at the impact of adopting an unbalanced design in ANOVA: this study had a group factor with sample sizes of 20, 12 and 12. Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be the reason why simulated random numbers were giving such a lot of "significant" p-values. However, when I modified the simulation to use equal sample sizes across groups, the analysis continued to generate far more low p-values than I had anticipated, and I eventually twigged that this was because this is what you get if you use 4-way ANOVA. For any one main effect or interaction, the probability of p < .05 was one in twenty: but the probability that at least one term in the analysis would give p < .05 was closer to 50%.
The analytic approach adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have seen papers where 5-way or even 6-way repeated measures ANOVA is used. When you do an ANOVA and it spews out the results, it's tempting to home in on the results that achieve the magical significance level of .05 and then formulate some kind of explanation for the findings. Alas, this is an approach that has left the field swamped by spurious results.
There have been various critiques of analytic methods in ERP, but I haven't yet found any that have focussed on this point. Kilner (2013) has noted the bias that arises when electrodes or windows are selected for analysis post hoc, on the basis that they give big effects. Others have noted problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly correlated. More generally, statisticians are urging psychologists to move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for instance, with unbalanced designs. However, we're not going to fix the problem of "false positive ERP" by adopting a different form of analysis. The problem is not just with the statistics, but with the use of statistics for what are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need educating in the perils of post hoc interpretation of p-values and the importance of a priori specification of predictions.
I've argued before that the best way to teach people about statistics is to get them to generate their own random data sets. In the past, this was difficult, but these days it can be achieved using free statistical software, R. There's no better way of persuading someone to be less impressed by p < .05 than to show them just how readily a random dataset can generate "significant" findings. Those who want to explore this approach may find my blog on twin analysis in R useful for getting started (you don't need to get into the twin bits!).
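For anyone wanting to try this, here is a minimal sketch (in Python rather than R, but the equivalent few lines work in either). Both groups are drawn from identical distributions, so every "effect" is pure noise, yet around 5% of comparisons come out "significant". A z-approximation to the two-sample test is used here purely to keep the example self-contained:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2013)
crit = NormalDist().inv_cdf(0.975)   # two-tailed 5% cutoff, ~1.96
n_per_group, n_measures = 30, 1000

sig = 0
for _ in range(n_measures):
    # Two groups drawn from the SAME distribution: any "difference" is noise
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    se = (stdev(a) ** 2 / n_per_group + stdev(b) ** 2 / n_per_group) ** 0.5
    z = (mean(a) - mean(b)) / se
    if abs(z) > crit:
        sig += 1

print(f"{sig} of {n_measures} null comparisons reached p < .05")
```

Run it a few times with different seeds: the "discoveries" keep coming, at a steady rate of about one in twenty.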
The field of ERP is particularly at risk of spurious findings because of the way in which ANOVA is often used, but the problem of false positives is not restricted to this area, nor indeed to psychology. The mindset of researchers needs to change radically, with a recognition that our statistical methods only allow us to distinguish signal from noise in the data if we understand the nature of chance.
Education about probability is one way forward. Another is to change how we do science to make a clear distinction between planned and exploratory analyses. This post was stimulated by a letter that appeared in the Guardian this week, of which I was a signatory. The authors argued that we should encourage a system of pre-registration of research, to avoid the kind of post hoc interpretation of findings that is so widespread yet so damaging to science.

Reference

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. doi: 10.1177/0956797611417632

This article (Figshare version) can be cited as:
Bishop, Dorothy V M (2014): Interpreting unexpected significant findings. figshare.
http://dx.doi.org/10.6084/m9.figshare.1030406




PS. 2nd July 2013
There's remarkably little coverage of this issue in statistics texts, but Mark Baxter pointed me to a 1996 manual for SYSTAT that does explain it clearly. See: http://www.slideshare.net/deevybishop/multiway-anova-and-spurious-results-syt
The authors noted "Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multi-factorial design not corrected for the experiment-wise error rate." 
They recommend doing a Q-Q plot to see if the distribution of p-values is different from expectation, and using Bonferroni correction to guard against type I error.
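For the four-way design discussed in the post, the Bonferroni arithmetic looks like this (a sketch; Bonferroni assumes the worst case, and is conservative when the F ratios are correlated):

```python
alpha, n_terms = 0.05, 15           # 15 terms in a four-way ANOVA
bonf_threshold = alpha / n_terms    # per-term criterion, ~.0033

# With the corrected threshold, the familywise error rate drops back near .05
familywise = 1 - (1 - bonf_threshold) ** n_terms
print(f"per-term threshold:    {bonf_threshold:.4f}")
print(f"familywise error rate: {familywise:.3f}")
```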

They also note that the different outputs from an ANOVA are not independent if they are based on the same mean squares denominator, a point that is discussed here:
Hurlburt, R. T., & Spiegel, D. K. (1976). Dependence of F Ratios Sharing a Common Denominator Mean Square. The American Statistician, 30(2), 74-78. doi: 10.2307/2683798
These authors conclude (p 76)
It is important to realize that the appearance of two significant F ratios sharing the same denominator should decrease one's confidence in rejecting either of the null hypotheses. Under the null hypothesis, significance can be attained either by the numerator mean square being "unusually" large, or by the denominator mean square being "unusually" small. When the denominator is small, all F ratios sharing that denominator are more likely to be significant. Thus when two F ratios with a common denominator mean square are both significant, one should realize that both significances may be the result of unusually small error mean squares. This is especially true when the numerator degrees of freedom are not small compared to the denominator degrees of freedom.