The lack of robust reproducibility in the scientific literature is both shocking and troubling, and has been a widely covered topic over the past couple of years.
One of the earliest posts here at LifeSciVC covered the dirty secret that more than half of academic work can't be replicated in an industrial setting, and how that has shaped the way we, as venture investors, view starting new companies. It got the attention of BioCentury/SciBX as well as the Wall Street Journal.
Later in 2011, some real data was added to strengthen the case: a Bayer Healthcare team published work showing that only 25% of the academic studies they examined could be replicated (Prinz et al. Nat. Rev. Drug Discov. 10, 712, 2011). Then earlier this year, Glenn Begley (formerly Amgen) and Lee Ellis (MDACC) showed that of 53 “landmark” oncology studies from 2001-2011, each highlighting an apparent major advance in the field, only 11% (just 6!) could be robustly replicated in work done at Amgen (Begley & Ellis, Nature 483, 531–533, 2012). Adding insult to injury, the unreproducible findings were actually cited more than the reproducible ones: an average of 248 vs 231 citations, respectively, for papers in high-impact-factor journals, and an even more astonishing 169 vs 13 citations for papers from other journals.
These are frightening statistics for an industry predicated upon building on the prior work of others and the integrity of peer review for sorting the good from the bad.
As we think about starting new companies, initiating drug discovery campaigns, or even launching clinical studies, how do we deal with this issue?
First and foremost, we need to get better at assessing the scientific literature, and all of us involved in translational medicine need to hold ourselves to a higher standard – investigators, their institutions, journal editors, grant-funding bodies, VCs, Pharma, and others. From an industrial perspective, we should also do better diligence, assessing science with sharper filters for what's robust and what's not.
To that end, Glenn Begley has come up with a great list of rules for what makes a robust, high-quality paper with the hallmarks of reproducibility, based on their review of scores of papers. Like Lipinski’s Rule of Five for predicting the oral bioavailability of small molecule drugs, I’d like to propose calling these Begley’s Six Rules for Reproducibility:
1) Were studies blinded?
2) Were all results shown?
3) Were experiments repeated?
4) Were positive and negative controls shown?
5) Were reagents validated?
6) Were the statistical tests appropriate?
Let’s take each of these in turn.
1) Most studies with experimental and control arms aren’t blinded. Furthermore, by my estimate, fewer than 20% of Methods sections even mention whether the work was blinded to prevent experimenter bias, and in most cases the blinding methodology isn’t described.
2) Results from multiple studies are rarely shown in the same paper; usually only the “representative” example figure (read: the best single result) appears. Outliers often disappear from figures (a telltale sign is n’s that differ inexplicably between arms). Many western and northern blots show only a computer-generated slice of the gel, without size markers. It’s also often unclear whether the exposures were in the linear range of the staining.
3) N-of-1 experiments are sadly commonplace in the literature. Assays often lack replicate values, and aggregate n’s across repeat experiments are rarely reported. It’s true that some long-term animal models are a chore to run multiple times, and critical reagents are often expensive – but repeating studies before publishing should be the bar.
4) The use of both positive and negative controls to benchmark an experimental system is frequently skipped. In fairness, for a novel model there may not be a positive control, but if one exists it should be included and described. Selection of the right controls is also an issue: for example, when studying the role of a single kinase in a disease, a promiscuous (“dirty”) kinase inhibitor that happens to hit the target of interest is probably not a great control.
5) Validated reagents are essential for drawing robust conclusions. Unfortunately, Begley and his colleagues found this to be frequently overlooked, especially the validation of immunohistochemistry probes and western blot antibodies (e.g., species cross-reactivity). Authors should make clear where validated reagents were obtained.
6) Statistics is a big gap for most papers. Proper powering of animal studies, with a statistical plan agreed before the work starts, is a rarity. Showing n’s and SEM bars in figures is important. So is choosing the right p-value threshold: a cutoff of 0.05 isn’t appropriate for post hoc analyses hunting for signals across thousands of probes on a chip, where multiple-testing correction is needed (see the sketch below).
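To make that last point concrete, here is a minimal sketch in Python using numpy and statsmodels; the effect size, power target, and probe count are illustrative assumptions of mine, not figures from the Begley & Ellis or Prinz papers:

```python
# A minimal sketch (illustrative numbers only) of two practices the point above calls for:
# (a) powering a two-arm study before running it, and
# (b) correcting p-values when screening thousands of probes post hoc.
import numpy as np
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.multitest import multipletests

# (a) Pre-specified power analysis: animals per arm needed to detect a
# one-standard-deviation effect at alpha = 0.05 with 80% power.
n_per_arm = TTestIndPower().solve_power(effect_size=1.0, alpha=0.05, power=0.8)
print(f"animals per arm: {int(np.ceil(n_per_arm))}")  # ~17, far more than a typical n of 5

# (b) Post hoc screen across 10,000 probes with no real signal: an uncorrected
# p < 0.05 cutoff "discovers" roughly 500 false hits; FDR correction removes them.
rng = np.random.default_rng(0)
p_values = rng.uniform(size=10_000)  # null data: uniform p-values, no true effects
naive_hits = int((p_values < 0.05).sum())
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"uncorrected hits: {naive_hits}, hits after FDR correction: {int(reject.sum())}")
```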
These Six Rules are good guidelines for those of us in the business of finding and commercializing cutting-edge science. Thinking about them during diligence on an investigator’s work will undoubtedly improve the outcome of academic-to-industry translational efforts. Furthermore, Tech Transfer Offices should hold these Rules up when working on invention disclosures and external outreach. Lastly, more CROs should track the literature and propose reproducibility work, in line with these Rules, on high-impact science from top-tier academic centers; my guess is that many academic institutions would support those studies.
Importantly, adherence to these Rules won’t make translation work 100% of the time. Fundamentally, there are “language” differences between academic and industrial work. The often-used phrase “safe and well-tolerated” in an academic animal study means the animals didn’t look sick and didn’t die; it doesn’t mean that even gross organ pathology was ruled out, much less full histopathology, clinical chemistry and blood counts, liver enzyme levels, and so on. This language difference is an important factor in translation, but it is much more nuanced than Begley’s Six Rules and needs to be considered in any academic-to-industry transfer.
The entire ‘system’ of biomedical research needs to change in order to address these issues and raise the bar. Grant-funding bodies need to demand it, as do journals. Investigators should want to do it and be rewarded for their work’s robustness. As one effort aimed at this systemic issue, Science Exchange’s Reproducibility Initiative launched last month, and I’m honored to be on its Advisory Board. The Initiative aims to provide “both a mechanism for scientists to independently replicate their findings and a reward for doing so”. It has received lots of press in Science, Nature Biotech, Slate, BioCentury, and elsewhere. I’m hopeful that it will help create the momentum needed to address this troubling issue.
But, as a parting remark, let’s not forget that this is all about cutting-edge science. There will always be studies that can’t be repeated; that’s part of the iterative nature of the scientific method, of articulating and challenging hypotheses. But as a system we can’t continue to tolerate reproducibility ‘hit rates’ below 50% in the academic literature, especially from top-tier journals and biomedical institutions.