In the previous learning path on a heuristic approach to debugging, you learned about the top six common sources of problems. You also learned how to use these six sources as heuristics that help pinpoint an issue when there are too many possibilities, or when you don’t have enough information to know where to begin. One practical benefit of the heuristic approach is that it limits the scope of the problem space. However, it can fail to account for other factors that are at play in the system at the same time, such as NTP. Is there a way to avoid this? Yes. This learning path presents a concept and a technique for doing so.
In this section, you will learn:
The most important question to consider is, “Where do you look in a large system?” We describe the answer as the context. It is possible to search all of the areas previously described without context. However, sustainable debugging is best served by a good understanding of how a system works, followed by debugging practice to become proficient and efficient. Understanding how a system works means you know how the ‘engines’ run.
For example, what are the components involved? True understanding is not simply being able to describe the system in plain words, but also being able to draw it. Can you draw the system and its data flows? Even if the diagram is simple, draw it, because this reduces the cognitive load of reasoning about a large, complex problem. Typical diagramming tools that can assist include Miro, Lucidchart, and draw.io.
One way to start is top-down. Understand the bigger picture, then dive into the individual components. On much larger systems, focus on the areas of the architecture that are relevant to the scope of the problem. It is OK to not understand everything at once. More important is a mindset of determination and curiosity: wanting to understand the system with an open mind. Once a picture of the system is available, the focus shifts to identifying which parts of the system are in context for the problem, for example, components that are closely related or connected to where the issue is first seen. Once the context is identified, you can start forming hypotheses and prioritizing which issues to investigate and which tests to perform. It is not necessary to hypothesize within the context of the system, but doing so limits the number of hypotheses and tests to the areas that are likely to be the source of the issue.
In the following diagram, you can see that there are a number of places where an issue can occur. Having a diagram with its data flows on hand lets you form hypotheses, prioritize issues, and perform tests at each point to see whether a hypothesis holds. This is where the sources of problems described in the previous learning path can be assigned as hypotheses. With this picture, you can prioritize where to perform tests. The tests usually start with the components that are most closely related to where the issue originated. Always start small, and expand the scope as needed.
The previous section looked at forming the context of a system and hypothesizing where issues may originate. For most problems, a single person or a pair of people is sufficient. However, for larger-scale, complex, or unknown systems, a team is better suited to identifying and solving the problem. This is because the team's combined experience, knowledge, and perspectives can be used to form additional hypotheses to test. How does one conduct this effectively? The key is to use brainstorming techniques, such as the ones discussed here.
This section introduces the Dump In, Sink Out (DISO) approach, which was devised and used by S. Wittenkamp and D. Wu to solve a problem in a very large, complex software system with a team of developers. This article extends the idea by incorporating the concept of the context system diagram.
The following are the basic requirements for all participants:
A basic knowledge of the problem. For example, all participants know Kubernetes.
A good understanding of the system, a system expert, or an expert who can interact with internal or external components.
Access to the system, preferably in pairs.
If done in person, materials should include sticky notes, painter's tape, and markers. The assumption is that you are the facilitator, but you may assign this role to someone else.
To conduct DISO, do the following:
[...] Hypotheses area of where the problem might exist. Have the participants do this alone to avoid biases, and within a limited time, for example, 10 minutes. Encourage the participants to write down their ideas, reminding them that there are no ‘bad ideas’, and that it’s OK if other participants independently come up with the same idea.
[...] Grouping and Priority area. Group ideas that overlap together.
[...] Tested area. The pair must also document their testing methodology.
Eventually, the test results should provide an answer as to where, or what, the cause of the problem is. If the test results do not provide an answer, consider the other interacting components or systems. If there is a bug, reach out to support. For example, support is required from IaaS providers if they are not part of the current group of participants.
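For remote sessions, the three areas of the board can be tracked in a small script instead of on sticky notes. The following is a rough sketch only, under several assumptions that do not come from the original DISO description: each idea is tagged with a component from the system diagram, ideas "overlap" when they name the same component, and groups raised independently by more participants are tested first. The `Note` and `Board` names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Note:
    author: str
    component: str  # part of the system diagram the idea points at
    idea: str


@dataclass
class Board:
    hypotheses: list = field(default_factory=list)  # Hypotheses area
    grouped: list = field(default_factory=list)     # Grouping and Priority area
    tested: list = field(default_factory=list)      # Tested area

    def dump_in(self, note):
        """A participant adds an idea; duplicates are welcome."""
        self.hypotheses.append(note)

    def group_and_prioritize(self):
        """Group notes by component (assumed meaning of 'overlap')."""
        groups = defaultdict(list)
        for note in self.hypotheses:
            groups[note.component].append(note)
        # Assumption: more independent notes means higher priority.
        self.grouped = sorted(
            groups.items(), key=lambda item: len(item[1]), reverse=True
        )
        self.hypotheses = []

    def record_test(self, component, methodology, confirmed):
        """Move a group to the Tested area with its documented method."""
        if not methodology.strip():
            raise ValueError("the testing methodology must be documented")
        for index, (name, notes) in enumerate(self.grouped):
            if name == component:
                self.tested.append({
                    "component": name,
                    "notes": notes,
                    "methodology": methodology,
                    "confirmed": confirmed,
                })
                del self.grouped[index]
                return
        raise KeyError(component)


if __name__ == "__main__":
    board = Board()
    board.dump_in(Note("ana", "database", "connection pool exhausted"))
    board.dump_in(Note("ben", "network", "DNS lookups time out"))
    board.dump_in(Note("cy", "database", "slow queries hold locks"))
    board.group_and_prioritize()
    print([name for name, _ in board.grouped])
```

Requiring a non-empty `methodology` mirrors the rule that the pair must document how they tested. The prioritization rule is a placeholder that a facilitator would likely replace with the team's own judgment.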