The origin of mistakes in research

1/31/2019

Mistakes can be costly. I have made several types of mistakes in my research. Some are not easy to identify; some, once identified, take a serious amount of time to fix; some, once fixed, evolve into literal nightmares that haunt me at night; and perhaps some are still hiding somewhere, yet to be exposed. Here I group the mistakes from my past six years of research into six types, to remind myself not to repeat them. Hopefully it will also offer something to the people who read it.
Type A: logic flaws

In proposing hypotheses/models. This type of mistake occurs when I propose a mechanistic model to explain my observations/data/results, or when I form a hypothesis for a question. Most of the time it is easy to find an approximate explanation for something. Quite often, however, something is missing. For example, if one of the steps has three possible scenarios, and only the first two are obvious and compatible with the observation, then it would be wrong to assume that all possible scenarios have been considered. Overlooking the third could lead to a wrong prediction, an overstated conclusion, or worse, a wrongly preferred hypothesis or an inaccurate model. Sometimes several competing models can be tested, and failing to identify the second-most probable model creates trouble, because fighting a paper tiger is neither informative nor convincing. I try to keep this in mind and keep asking "what else" and "how else" to avoid this type of mistake.
In learning and understanding. Sometimes I believe I have already understood something, when in fact there is a key part that I subconsciously filled in with my own assumptions and have not yet fact-checked. It is always necessary to study the counter-argument thoroughly, if there is one, before choosing which argument to believe. I should also keep in mind not to treat X as correct unless I can explain why X is correct, why everything else is wrong, and why every claim that X is wrong is itself wrong. Over the years, I have learned that it is okay to say X works well for xxx data but might not work when xxx, even if I am providing evidence to support X; likewise, it is okay to say X does not work for xxx data but could work in other cases, even if I am providing evidence against it. It is important to word things precisely and to recognize the difference.

Type B: applying suboptimal/biased statistics without recognizing it.

Suboptimal: To be fair, this type comes from lack of experience rather than from a mistake, and a suboptimal unbiased test is still better than a biased test. However, the result is quite similar to that of making a mistake. As I have gained more research experience, my judgment in choosing statistical tests has improved, but the change is gradual and slow. In the past two years, I have found it helpful to look at the tests other people use in published papers and think about how to improve them. It is surprisingly common to find room for improvement even in published papers.
Biased: An example of using biased statistics is given in Type E. It can be challenging to figure out whether something is unbiased when you invent a new way of testing things, for instance when the underlying distribution is unknown or when a closed form cannot be obtained. I find it helpful to test with a simulation. Once, a friend of mine did a test (something like: first pair the data, then estimate A and B, then take A/B, then average), and I thought it would be less biased to do the steps in a different order. Fortunately, I overcame Type E and decided to run a simulation first, and found that my presumption was wrong.
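
The kind of simulation I mean can be sketched roughly as follows. This is not the actual test my friend ran; the function name, the gamma distributions, the sample sizes and the assumption that A and B are independent are all made up for illustration. The idea is only to generate data where the true A/B is known, apply both orders of operations many times, and compare each average result to the truth.

```python
import random
import statistics


def simulate_bias(n_pairs=20, n_reps=2000, seed=0):
    """Estimate the bias of two ratio estimators under an ASSUMED model."""
    rng = random.Random(seed)
    true_ratio = 10.0 / 5.0  # E[A] / E[B] under the assumed model below
    mean_of_ratios = []
    ratio_of_means = []
    for _ in range(n_reps):
        # Assumption: A and B are independent, positive, gamma-distributed.
        a = [rng.gammavariate(25.0, 0.4) for _ in range(n_pairs)]  # mean 10
        b = [rng.gammavariate(25.0, 0.2) for _ in range(n_pairs)]  # mean 5
        # Order 1: take A/B within each pair, then average the ratios.
        mean_of_ratios.append(statistics.fmean(x / y for x, y in zip(a, b)))
        # Order 2: average A and B separately, then take one ratio.
        ratio_of_means.append(statistics.fmean(a) / statistics.fmean(b))
    # Bias = average estimate over replicates minus the known truth.
    return {
        "mean_of_ratios": statistics.fmean(mean_of_ratios) - true_ratio,
        "ratio_of_means": statistics.fmean(ratio_of_means) - true_ratio,
    }


if __name__ == "__main__":
    for name, bias in simulate_bias().items():
        print(f"{name}: estimated bias = {bias:+.4f}")
```

Which order comes out ahead depends entirely on the assumed data-generating model, so the result for one model should not be carried over to another. Changing the model (for example, making A depend on B within each pair) and rerunning is exactly the check that is worth doing before trusting an intuition.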

Type C: ascertainment bias in coding and in observations.
Here I assume that observations are analytical results/outputs from coded scripts, so I discuss coding and observation together. The most dangerous ascertainment bias in coding/scripting is when the observations agree with expectations only because of a bug in the code. More often, when the observations look weird, the code gets revisited until the results look plausible, whereas results that match expectations rarely get the same scrutiny. This is super dangerous. It can be even more dangerous for small tasks that only need to be done once or twice than for bigger tasks that are applied to data repeatedly. I find it helpful to do some testing even if everything runs smoothly the first time. Another thing I try to do is to avoid having any prediction or expectation, at least when I first analyze something. I also try to slow down and stay focused when I code very simple things, and to code only when I feel like coding; I find this helpful.
Observations can be more general. They could be experimental results, with which I have little familiarity. Ascertainment bias in observations can also arise when acquiring knowledge from the literature. For example, if there is a debate in the literature and a person reads more about one argument than the counter-argument, the person may well miss some valuable evidence on the other side. This is similar to the mistake in learning and understanding in Type A, but it is due not to logic but to biased exposure to the literature. The process could be entirely unintentional, or it could be subconscious. Another scenario is similar to Type F: not noticing existing literature on the topic of study can turn out to be quite a disaster.
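
As a small illustration of testing even when the first run looks fine, here is a sketch with a hypothetical analysis step, `mean_difference`, which is not from any of my real scripts. The checks use inputs whose answers are known before the code is run, so they do not depend on whether the real result matches what I hoped to see.

```python
import statistics


def mean_difference(group_a, group_b):
    """Hypothetical analysis step: difference between two group means."""
    return statistics.fmean(group_a) - statistics.fmean(group_b)


def sanity_checks():
    """Checks whose answers are known before the code is run."""
    # Hand-computed case: the means are 2 and 5, so the difference is -3.
    assert mean_difference([1, 2, 3], [4, 5, 6]) == -3
    # Identical groups must give exactly zero.
    assert mean_difference([7, 8], [7, 8]) == 0
    # Swapping the groups must flip the sign.
    assert mean_difference([1, 2], [5, 9]) == -mean_difference([5, 9], [1, 2])
    return True


if __name__ == "__main__":
    sanity_checks()
    print("all sanity checks passed")
```

Checks like these take a few minutes to write, and they are run regardless of whether the real output looks expected or weird.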
Type D: missed important information from data.
This is so far my most painful experience with mistakes, and fortunately it has occurred only once. This is a scary mistake, not only because all the analyses need to be redone, but also because the previous observations may no longer hold, nor any of the interpretations built on top of them. Fortunately, in my case, the main result stayed the same. There are at least two types of important information that can be missed, especially when using public data.

Part of the data is missed: for example, a potentially informative variable was not taken into account in the data processing, or a part of the data was presumed not to exist in the database. Being more careful with all the details of the database is thus important.
Part of the data pre-processing is missed: this can occur when the people who published the data did some pre-processing to remove potential bias, or for some other purpose. This is easily missed when the original publication is not read in great detail, so learning every single detail from the paper that published the data is key to avoiding this type of mistake.

Type E: pride and prejudice.
The worst thing about pride is that it makes me overlook my own mistakes. I struggle to eliminate my inner ego, because otherwise I could end up assessing my questions and methods ignorantly. I learnt this the hard way. Once, I ignored the first person who asked whether my test was unbiased and did not try to prove it with a thorough simulation; I only found that it was indeed biased when another person insisted on the same point. This experience made me realize that I have to constantly remind myself to question my own judgement before other people question it, and that I should definitely question my judgement if someone else questions it. If something can be settled by a simulation or a mathematical proof, it may well be worth the time.
Type F: reinventing the wheel.
Reinventing the wheel can waste a lot of time and can also undermine the novelty of a project. It first happened in the first project of my Ph.D., a method for detecting selection in overlapping genes. I did not find out from the literature that the wheel had already been invented until I was revising the first draft of the manuscript. Although the two methods are quite different, and although I eventually managed to publish mine as well, I had to spend a lot of extra time justifying the new method, including scanning for overlapping genes in the human genome, getting examples, and comparing the speed and accuracy of the two methods. This type of mistake can be avoided fairly easily by a more careful investigation of the literature. Nowadays there are so many journals and so many papers that it is getting harder and harder to keep up with the literature. I have adopted the following tactics to partially avoid this mistake: figure out all possible alternative terminologies and relevant concepts, go through the reference lists of all key relevant papers, go through all the papers that cite those papers, go through the work of relevant researchers, and, if possible, ask someone more senior and knowledgeable to assess the topic.
Epilogue
Most people would not make as many mistakes as I do, or may not have made as many types of mistakes. Very often mistakes are inevitable, but they can be fixed at an early stage to avoid future damage. Learning from the past, I have found triple-checking helpful, along with being patient and keeping calm when witnessing an exciting result.
Mistakes are not terrible; what is terrible is leaving a mistake uncorrected. It occurs to me that perhaps the most important thing is to keep in mind that mistakes are made unintentionally; therefore, intentionally and constantly triple-checking every possible step, and correcting mistakes at an early stage, can be more efficient than rushing for a quick result.

April 's blog