When Is It OK to Find a Bug?
"How many times have you seen an email praising the heroic efforts of the developer who fixed some last-minute major issues in a huge new feature that were discovered in final user acceptance testing? This should not be seen as a heroic deed—rather it should be seen as a tragic failure. " — Real World DevOps
Let's be clear: the best time to find an issue is before it exists. However, we have layers of testing precisely because we want to find issues. Finding the issue during testing is a good thing and a success.
A quick side note on User Acceptance Testing: at this point the system is in place, so I believe the tests should focus less on whether features meet business requirements and more on whether the system as a whole delivers the expected value proposition. The issues I hope to find here aren't ones where the behavior deviates from the specified requirements, but ones where the requirements themselves fell short of what was actually expected, or where expectations changed due to other factors.
The two main points I want to note are:
- Reproducing the issue quickly in lower environments
- Getting the fix to the environment quickly
If either of these things doesn't happen, then you have a failure. Together they eliminate the "heroic" efforts because they reduce the time taken to fix a bug.
Some of the hardest issues to reproduce in lower environments are data-driven. The quantity of data, the types of data, and the information expected from the data all play a role in making reproduction difficult. Other issues tend to be environmental: how many servers are behind the load balancer, live vs. dark servers, and any number of security sniffers or extra data transfer protocols and authentication.
I don't want to say that every issue needs to be reproducible on the dev machine, but I do want to emphasize that reducing the barriers for the dev to reproduce an issue in their own environment will give you some of the fastest turnaround on bug resolution.
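One way to lower that barrier for data-driven issues is to hand the dev a small, masked sample of the data that triggered the problem. The following is only a minimal sketch of that idea, not a prescribed tool: the function name `build_repro_fixture`, its parameters, and the record layout are all hypothetical, and the truncated hash merely obscures values rather than providing real anonymization.

```python
import hashlib
import random


def build_repro_fixture(records, sensitive_fields, sample_size, seed=0):
    """Return a small, masked sample of records for local reproduction.

    Hypothetical helper: `records` is assumed to be a list of dicts
    exported from the environment where the bug was seen.
    """
    rng = random.Random(seed)  # fixed seed so every dev gets the same sample
    sample = rng.sample(records, min(sample_size, len(records)))

    fixture = []
    for record in sample:
        masked = dict(record)  # copy so the source records are not modified
        for field in sensitive_fields:
            if masked.get(field) is not None:
                # Replace the value with a short, stable digest. This hides
                # the raw value but is NOT strong anonymization.
                digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
                masked[field] = digest[:12]
        fixture.append(masked)
    return fixture
```

Because the sample is deterministic for a given seed, two people looking at the same bug are looking at the same rows, and fields that were empty in the source stay empty in the fixture, which matters when the bug depends on the shape of the data.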
Now that a dev has produced a fix and verified it, the path for getting that fix to the highest environment where the bug exists needs to be as streamlined as possible. This encompasses more than just DevOps or automated verification; it is about not adding new gates and about reducing the size of the change.
Say the development cycle which introduced the issue ended up not running regression tests. I believe there are valid conditions under which this choice would be made; one of them is that the tests were already run in a lower lane environment. Requiring that the new build go through extra testing steps which weren't in place or used for the original cycle is not a solution to the problem at hand. Unless the issue was severe enough that all environments rolled back to the prior version, you're probably not providing value by adding the extra gate.
Two main concerns drive the desire to add a gate. One is that the bug fix will introduce new issues. The other is that nothing was tested earlier, resulting in a cycle where all issues need to be identified and fixed high up in the release process.
The problem with the extra gate as a fix is that it is usually tacked on, and the real evaluation of how to address the cause is missed. The gate is also usually placed on the already taxed QA team. That team can always use more of your release cycle time; the work here is endless, and that is on top of the endless work the dev is already doing. Your QA team is most effective when they are good at trimming down their workload by identifying what not to test. Adding a gate which says to test more ties their hands and leaves them unable to trim down their work to a point where they can focus on finding issues in the software.
Remember how I started by saying that it was a success? You placed a UAT validation in your release pipeline to identify issues, and an issue was identified. UAT exercises the system the way a user would, on the environment which most resembles production, and generally covers a good portion of app functionality. Why would you put a gate lower down in the pipeline which does the same thing? This is an opportunity to analyze the issue: what caused it, and how best to identify and catch this and similar issues sooner. The needs of a UAT test make it harder to set up, require more of the system to be complete, and make it slower even when automated, and you're still going to want to execute it prior to production anyway.
Therefore, fix the issue by making the smallest change possible, and reduce the number of gates originally put in place so that the issue can be resolved in all environments quickly.