Brandon Hitzel

Identify, Isolate, Repair - Network Troubleshooting Tales and Tips

Updated: Jan 5

Discussing network troubleshooting tips, stories, and insights including 2 troubleshooting scenarios


 

Hello, and as always, thanks for reading. In this post I'd like to look at network troubleshooting and share some of my thoughts on it, since it's a topic a lot of people need help with. Lurking through Reddit on r/networking or r/sysadmin you often see posts asking for troubleshooting help. Some of these posts are very basic and some are very complex, often too much for the forums. Whether you're one of those people who often needs help, or you're someone in operations who troubleshoots a lot, I'll try to include things for different situations.


crazy cable mess
welcome to the team, as your first task...

Quick note: responding to these cases in the forums can help build your troubleshooting skills if you follow through with someone, and if you're a poster, for the love of packets, please edit your post with the fix if you find it!

 

Knowledge


I think the tech industry has a problem with troubleshooting and figuring out issues related to networking. I often see IT people skip the network basics, or network engineers who are not methodical in their process. A lot of this comes down to IT training, in my opinion. Looking at how I've studied and the programs I've gone through, I would say the majority of content revolves around knowing what to check if X isn't working (e.g. first do this if this thing isn't working). That can be fine to a certain extent, but it has some faults when it comes to advanced cases.


Take CCNP and CCIE studies: there is a lot revolving around troubleshooting, but most of it relates to identifying a bad configuration or a misconfiguration, like something that causes a routing adjacency to not form. This is a good foundation, but in real-world scenarios there are times the problem is related to a behavior or characteristic that cannot be replicated in an exam.


There are also approaches like top down, bottom up, shoot from the hip, and divide and conquer, which we will NOT really touch on since they have been covered a million times. You know: you have the OSI model layers 1-7, if you can ping the device then layers 1 and 2 are probably fine, if you can get to the web page then layers 1-7 are probably good and it could be something back-end related, yada yada, you have the solution! These are good to know for sure and work as a foundation for a troubleshooting mentality, but there are other related skills you can develop too.


Yes, as you're probably thinking, experience helps when troubleshooting, without a doubt, but there are many things one can do to set themselves up for success when the next incident comes around. A lot of the experience part comes from troubleshooting similar issues or the same systems, but what if this is something you are new to or don't understand? How do you tackle that? Learning and understanding is a big part of being a good troubleshooter.


 

My first exposure to troubleshooting was in college, when I was in the automotive program (yes, I went to school for cars once upon a time) before I got into networking. I took an engine performance class which included a lot about diagnosing problems in an internal combustion engine. What was good about it was that it took something complex and broke it down into different systems. For example, the basic subsystems of the engine are the electrical system, fuel system, and air system. Each of these has different components and performs different tasks, but when they all work together the engine operates properly. Knowing how the overall system operates at a high level lets you better identify where a problem might be, and knowing how the sub-systems work allows you to isolate certain components of that system to better determine the issue.


Don't be afraid to look at other industries or sources to develop ideas for how you can troubleshoot issues. I feel the fact that I learned troubleshooting methods outside of my IT training actually set me up better once I started that curriculum and was confronted with tickets.


Additionally, don't be afraid to research topics further; for example, if you are encountering a voice-related issue, study SIP and RTP a bit to get your understanding up to speed.


The Flow


This applies in networking as well, because different protocols and operations come into play at different portions of a network flow. We have (generalizing) the LAN side, with switching/MAC learning, ARP broadcasting, v6 multicast, VLANs, subnet masking, IP addresses, etc. The WAN side has routing decisions, routing protocols, tunnels/transport and everything that comes with it. Then there are the other technologies that apply to flows, like server services, stateful inspection on firewalls, or NAT and so forth.


Most people will only know a few components of any of these categories. If you're new, or someone who isn't in networking full time, then it's time to hit the books, talk to colleagues, search YouTube, or use paid sources like INE or Pluralsight and learn the basics of network communication.


diagram showing a basic network flow
Basic Network Flow Illustration

This is a basic network flow and you should know each step cold.


For networkers: you should know what a computer does when it needs to communicate, via ARP/broadcast for v4 (local LAN vs. default gateway) and multicast/neighbor discovery and link-local addressing for IPv6, through to the MAC address re-write and routing decision a router makes and how it makes those decisions based on its routing table and administrative distance. Then you should know what happens when the flow hits a firewall and how that is handled. Also consider fundamental TCP or UDP behavior. At each point in the path you should be able to identify the basic actions the device takes and what happens to the traffic.


Once you have the basics, it's good to dive deeper into topics of your choice like routing protocols, TCP, Wireshark, firewall rules, and other network technologies. Consider something like knowing how a routing table works (not even as in-depth as RIB vs. FIB) and being able to interpret the output of a routing table show command. You might not know each protocol in depth, or the entire network, but if you're looking at the first-hop router you should know where the traffic is being sent so you can move to the next hop and trace the flow.
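
To make that concrete, here is a minimal sketch of the longest-prefix-match lookup you are effectively doing in your head when you read a routing table. The prefixes and next hops are made up for illustration; it uses only Python's standard ipaddress module.

```python
# Minimal sketch of a longest-prefix-match lookup, the same decision you make
# mentally when reading routing table output. Prefixes/next hops are made up.
import ipaddress

routing_table = [
    ("0.0.0.0/0",   "203.0.113.1"),  # default route
    ("10.2.0.0/16", "10.0.0.2"),
    ("10.2.2.0/24", "10.0.0.6"),     # more specific, wins for 10.2.2.x
]

def next_hop(dst_ip: str) -> str:
    dst = ipaddress.ip_address(dst_ip)
    # Keep only routes whose prefix contains the destination...
    matches = [(ipaddress.ip_network(p), nh) for p, nh in routing_table
               if dst in ipaddress.ip_network(p)]
    # ...and pick the longest (most specific) prefix, just like the router does.
    best_prefix, best_nh = max(matches, key=lambda m: m[0].prefixlen)
    return f"{dst_ip} -> next hop {best_nh} via {best_prefix}"

print(next_hop("10.2.2.2"))   # matches the /24
print(next_hop("10.2.50.9"))  # falls back to the /16
print(next_hop("8.8.8.8"))    # falls back to the default route
```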


It only takes basic knowledge to accomplish this, but it's important for methodically tracing a network flow to troubleshoot something like packet loss or excessive latency due to routing, which are common network-related issues.


Looking at logs and outputs is an important skill to hone in some cases (as is taught), but unless you understand what you are looking at it might not mean much; therefore learning and reading are important to becoming better over time.


When explaining, I like to frame the decision as a question, as shown in some of the call-outs in the diagram above: is 10.2.2.2 on my local subnet? The device uses its subnet mask to determine that, then sends the frame/packet with a destination MAC of the default gateway to route out to the other, non-local network. The default gateway then receives the packet and asks "who do I need to send this packet to in order to reach 10.2.2.2?", and so on and so forth.
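
As a rough illustration of that first question, here is a small sketch of how a host uses its own address and mask to decide whether to ARP for the destination directly or for the default gateway. The addresses match the example above; the helper function name is mine.

```python
# Sketch of the "is 10.2.2.2 on my local subnet?" decision a host makes
# before it ever sends a frame.
import ipaddress

def arp_target(my_ip_with_mask: str, dest_ip: str, default_gw: str) -> str:
    my_net = ipaddress.ip_interface(my_ip_with_mask).network
    if ipaddress.ip_address(dest_ip) in my_net:
        # Local subnet: ARP for the destination itself and frame it directly.
        return f"ARP for {dest_ip} (local), dest MAC = {dest_ip}'s MAC"
    # Non-local: ARP for the gateway and set the destination MAC to the gateway,
    # while the destination IP in the packet stays the remote address.
    return f"ARP for gateway {default_gw}, dest MAC = gateway's MAC"

print(arp_target("10.1.1.50/24", "10.1.1.75", "10.1.1.1"))  # local
print(arp_target("10.1.1.50/24", "10.2.2.2", "10.1.1.1"))   # routed via gateway
```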

 

Methodologies 


Another thing I learned in the auto class was fault isolation, which is important when troubleshooting. As you probably know, in auto or in tech, replacing components can be costly. People often want to just replace parts when something is wrong without even diagnosing; my teacher would hold his hand up, point to his wedding ring, and say "you're going to marry the car if you keep doing that!" because of how costly it can be to just replace things without troubleshooting.


The same thing goes for troubleshooting a network incident. Are you just going to replace a line card because someone asks you to? Or reboot a $100k router because some packet loss is reported? No, I doubt you would as a first step, but some people would suggest that.


People will just throw out things to try, "do this and let me know if it fixes it", before even asking questions, a.k.a. "shooting from the hip". Doing this enough can bring about more issues, since you are just throwing changes at the problem, and something like rebooting before collecting local logs could leave you without key information.


identify issues with collaboration
Ask Questions

Problem statement and identification are the first steps when an issue arises; you first need to understand what is being reported and what is going on. From there I usually ask whether it's something new they are trying to do, or something they usually do that is no longer working (if it's a specific application-related issue), because based on that you can decide which path you will investigate.


I think all network engineers have gotten the ticket "the network is causing $problem or is down". Questions to ask back would be "what exactly is not working?", "when did it last work?", "is it one person or an entire location?", and then what steps they are taking to replicate the issue, and so forth. Don't be afraid to jump on a call to talk with them, or set up a troubleshooting session with users or colleagues, especially if it's a high-priority case.


If this is something new, "Help me understand how..." is a good way to start a question, as you need to understand the issue to be able to apply your knowledge of how the network components come into play (or know what to research). You might not know what the end user does, but you potentially know what the network will do based on their actions. This is why you also need to know your network and document it!


If this is something existing, then sometimes you might not even take any action, because you have enough experience or knowledge of how the network is set up to go immediately to documentation or straight to a tool to check (more on these later), and then form a possible diagnosis. This can save time, but sometimes it can also waste time or lead you in the wrong direction.


I've seen everyone from the most senior engineers to day-one juniors either ask these types of questions or fail to ask them; either way, don't be afraid to walk through the steps to understand what is going on before you even log into any tools or devices. Sometimes you will get additional info after asking more questions or questioning different affected parties, and that will completely change your thought process on how you want to start.


Once I had a Wi-Fi issue reported from an office that had made similar statements previously, but it had ended up being application-related each time. On about the fourth ticket I went straight to the app owners and asked them to look into the database. After they reported no trouble, I talked to a different individual, and it was indeed a wireless network problem, because they said it was isolated to a certain area.


It turned out an access point had been moved by a cable tech, without authorization, to a new switch port, and the port did not have all the VLANs allowed for the SSIDs, so the SSID was broadcasting but users would go nowhere once associated. It was a network issue after all, but I wasted time by not asking questions, or by not asking different affected users questions.


Walk through each step of the network path, or of the process being followed, so you know what information to gather and how to interpret the information you are getting, in order to determine whether the network is causing or having a problem. In the above case it was a certain physical area, so: which access point is in that area? Is it down? It looks normal, so continue tracing to the switch/port it is connected to, and so on.


Sometimes issues won't be network-related; a lot of tickets hit the network queue that shouldn't, but you need to gather info first before sending off the ticket. Yes, if you can ping you might not need to trace through, but frequently we're talking about more in-depth issues than basic connection problems.


Remember to repeat the problem statement back to the group or end user to confirm you understand. Also, if information gathering takes an extended period of time, make sure you are updating stakeholders on progress.


"...and Knowing is half the battle." - GI Joe
 

Next you need to isolate the issue by narrowing it down to a specific device, or to the behavior/event causing the issue. You need to gather information before performing a diagnosis on anything. You gather data and then you form your hypothesis on what is going on. I like using the word deduce for this step: you get info and deduce a rational explanation of where and what the event/cause is.


Sometimes this will involve the end user replicating their actions while you capture packets for analysis, or sometimes it will be looking at charts/graphs from your tools - the diagnosis is the result of the information you gather/discover.


For example, if this is something new you might reference the firewall, or if there is a certificate error you might check the load balancer which brokers the connection to look deeper. However, if it's something that was working before, and you know there was a recent router refresh for instance, or that there is a firewall in the mix, you might ask for a traceroute to see which routing hop the traffic dies on so you know where to check the routing table.


Make sure you are referencing the information you gathered and making a decision based on that, using actual evidence or best effort evidence vs. hunches or outside opinions (but seek outside perspectives if necessary).


bandwidth example chart

An additional example: packet loss is being reported and you see bandwidth is maxed out on an interface in the path. First, you need to find out what is causing the bandwidth to be maxed out. Are multiple users streaming 8K Netflix video? Is the Netflix prefix being learned over a peering session it shouldn't be, causing excessive video traffic? Or is the wrong QoS policy CIR applied?


You'd need to look at a few things before deciding which of those is the problem, and therefore what you need to plan as the next step to mitigate. If you have thousands of devices or systems, isolation is required to narrow down the scope of where you need to gather data, or what data you need.
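
For instance, here is a quick sketch of the arithmetic behind those bandwidth graphs, assuming you have pulled two samples of an interface's ifHCInOctets counter. The counter values and interval are illustrative.

```python
# Sketch: turn two SNMP byte-counter samples into a utilization percentage,
# which is all most "bandwidth maxed out" graphs are doing under the hood.
def utilization_pct(octets_t1: int, octets_t2: int,
                    seconds_between: float, if_speed_bps: int) -> float:
    delta_bytes = octets_t2 - octets_t1        # assumes no 64-bit counter wrap
    bps = (delta_bytes * 8) / seconds_between  # bytes -> bits per second
    return 100.0 * bps / if_speed_bps

# Hypothetical samples 5 minutes apart on a 10Gb/s interface:
print(round(utilization_pct(1_000_000_000_000, 1_180_000_000_000,
                            300, 10_000_000_000), 1), "% utilized")
```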


One time I had an issue where part of a large office was offline and couldn't get to the internet or outside apps. Checking the clients, everything looked normal: the switch saw the MAC addresses, there was an ARP entry for the gateway on the client machines, and the gateway had ARP entries for the clients, but nothing was working for them.


Gathering all the usual data and walking through the standard steps actually yielded the necessary information, but it was not noticed. The switches were rebooted, clients were rebooted, everyone was throwing their hands up.


Even after that there was still no communication. After looking at everything again, it was pointed out that the MAC address entry for the default gateway was STATIC in the switch stack's MAC address table and, after checking multiple times, was bouncing between ports. It turned out an end user had bridged two ports, and one of the ports had port-security/sticky MAC, which caused it to learn the gateway MAC address on an access port of one member of a 4-switch virtual stack. So when switch 1, for instance, was forwarding layer 2 traffic destined for the default gateway, it was black-holing on that other switch's access port, because that is where the MAC address existed (shared between the stack switches directly). The issue was isolated to that switch stack first, though, by getting a pool of affected users and tracing where they were plugged in (because not all users were affected).
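
If you were chasing something like this today, here is a hedged sketch of the kind of check that would have surfaced it, using the Netmiko library. The device details and gateway MAC are placeholders, not the actual environment.

```python
# Sketch: ask the switch stack where it has learned (or statically pinned)
# the default gateway's MAC. A STATIC entry on a non-uplink access port,
# as in the story above, is the red flag. Device details are placeholders.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_ios",
    "host": "10.1.1.10",
    "username": "netops",
    "password": "REDACTED",
}
gateway_mac = "aaaa.bbbb.cccc"  # placeholder for the real gateway MAC

with ConnectHandler(**switch) as conn:
    output = conn.send_command(f"show mac address-table address {gateway_mac}")
    print(output)
    # Expect the MAC on the uplink/port-channel as DYNAMIC; anything STATIC
    # on an access port deserves a very close look.
    if "STATIC" in output.upper():
        print("Warning: gateway MAC is static on this switch - check port-security.")
```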


The lesson is to make sure you analyze the data you gather before drawing a conclusion. Here the conclusion was that it must be a bug in the switch stack's data plane, but it was actually the static MAC on a port that was not the uplink.


 

Finally, if the situation requires a change to remedy the problem, you need to implement a repair and verify the fix. This can be something as simple as changing the gateway on a device with a static IP or opening a circuit ticket, or as complex as going through a change board to get approval for a new firewall allow rule or a BGP policy change.


Again, many people move to this step first, before even gathering information. What is important in this step is understanding what is needed to mitigate the issue, what the change entails, and what the behavior will be once you make the change. What is the expected outcome?


There will be times when you don't find anything, like a performance issue where you have no packet loss and network latency is acceptable (this is often hard for people to grasp: "we are talking a fraction of a second, people, not 5 minutes to deliver traffic"), so you kick it to another group to troubleshoot.


There will be tickets that are concluded without repairs; it could be said to have "cleared before isolation", or the favorite, "no trouble found".


After you implement your repair you need to verify the issue is resolved. For a basic fix, the end user might report that it fixed the issue and the ticket can be closed; for the circuit repair example, you would verify the interface is up and passing traffic where before it wasn't. As with any change, observing the before state, knowing the expected outcome, and being able to verify it with the applicable show commands is important.


Post-incident, be sure to appropriately document any changes in the environment, and do a lessons-learned if it was a major outage.


Remember I.I.R. - Identify, Isolate, and Repair


 

Tools


Tools are an important part of the troubleshooting process, whether CLI show commands or network monitoring software, for example. Dashboards are good for the times when a ticket comes in for application slowness: can you immediately pull a dashboard and show the top congested circuits or core connections, or better yet, application performance service-level monitors?


As we discussed in the first section, knowledge in my opinion will be your best tool, as it will guide you through any scenario you face, but having actual network tools is vital. Bandwidth monitoring, up/down status, and CPU/RAM are all essential, standard network monitoring, and can help identify performance issues without logging into individual devices, or even alert you before a ticket comes in.


For instance, having an automated tool that captures ARP, routing, and MAC tables every 30 minutes is a good thing, as you can compare the current tables to previous iterations to get a point-in-time capture of data if necessary.
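
A minimal sketch of that idea, assuming the Netmiko library and a placeholder device; in practice you would schedule it with cron or your automation platform rather than running it by hand.

```python
# Sketch: snapshot ARP / routing / MAC tables to timestamped files so you can
# diff "now" against "30 minutes ago" during an incident. Devices are placeholders.
import datetime
import pathlib
from netmiko import ConnectHandler

DEVICES = [
    {"device_type": "cisco_ios", "host": "10.0.0.1",
     "username": "netops", "password": "REDACTED"},
]
COMMANDS = ["show ip arp", "show ip route", "show mac address-table"]
OUTDIR = pathlib.Path("snapshots")

def snapshot() -> None:
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    OUTDIR.mkdir(exist_ok=True)
    for dev in DEVICES:
        with ConnectHandler(**dev) as conn:
            for cmd in COMMANDS:
                fname = OUTDIR / f"{dev['host']}_{cmd.replace(' ', '_')}_{stamp}.txt"
                fname.write_text(conn.send_command(cmd))

if __name__ == "__main__":
    snapshot()  # run every 30 minutes via cron/scheduler, then diff the files
```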


There are also automated diagnostic tools you can trigger from certain ticketing systems, like running show commands, generating a bandwidth report, or checking BGP neighbors. A lot of service providers utilize this type of automation, so some readers might already have this luxury.


These monitoring tools provide situational awareness, which helps triage issues. If you see a hub router down alarm, you know the tickets coming in for application down or internet down are probably related to it, which means step 1 (identify) and step 2 (isolate) are partially completed for ticket purposes. Then you'd need to find out why it is down, or dispatch the team that can.


Alternatively, maybe you have a full-stack application monitor and can see end-to-end performance of an application; going there when a related ticket comes in can give you good baseline data as you gather information to hypothesize a root cause.


In your monitoring tools, setting alarm thresholds for bandwidth congestion can be good, something like >80% for 15 minutes before alerting. In big networks this isn't as scalable, so maybe use something like >90% if you have a lot of circuits, or use groups so that only core/important circuits alarm. The same goes for interface flapping or device up/down: there could be a missed poll, so setting a threshold timer like 5 minutes before notifying helps avoid excessive alarms (because in 5 minutes you'd get 2-4 polling intervals, or time for fast polling to kick in).
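
Here is a rough sketch of that "sustained above threshold" logic, independent of any particular monitoring product; the sample polls, threshold, and window size are illustrative.

```python
# Sketch: only alert when utilization stays above the threshold for the whole
# window, which filters out short bursts and single bad polls.
from collections import deque

THRESHOLD_PCT = 80.0
WINDOW_POLLS = 3          # e.g. 3 x 5-minute polls = 15 minutes sustained

recent = deque(maxlen=WINDOW_POLLS)

def record_poll(utilization_pct: float) -> bool:
    """Return True if an alert should fire on this poll."""
    recent.append(utilization_pct)
    return (len(recent) == WINDOW_POLLS and
            all(sample > THRESHOLD_PCT for sample in recent))

for sample in [85.0, 92.0, 40.0, 88.0, 90.0, 95.0]:   # illustrative polls
    if record_poll(sample):
        print(f"ALERT: sustained congestion, last {WINDOW_POLLS} polls: {list(recent)}")
```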


Wireshark is the standard packet capture tool, as everyone knows. Some people go straight to doing a packet capture; I recommend first identifying the issue before jumping to a pcap, but it is one of the best tools in our arsenal, as you might have seen in my other posts like The case of the TCP Challenge ACK or The case of the Gratuitous ARP. Learn to use Wireshark and you will better learn the networking protocols.
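
As a small example of letting the tool do a first pass, here is a hedged sketch using the pyshark wrapper around tshark to count TCP retransmissions per conversation in a capture; "capture.pcap" is a placeholder file name and tshark must be installed.

```python
# Sketch: count TCP retransmissions per conversation in a capture file,
# a quick first pass before digging through the pcap by hand in Wireshark.
from collections import Counter
import pyshark

retrans = Counter()
cap = pyshark.FileCapture("capture.pcap",
                          display_filter="tcp.analysis.retransmission")
for pkt in cap:
    flow = f"{pkt.ip.src}:{pkt.tcp.srcport} -> {pkt.ip.dst}:{pkt.tcp.dstport}"
    retrans[flow] += 1
cap.close()

for flow, count in retrans.most_common(10):
    print(f"{count:5d} retransmissions  {flow}")
```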


That's enough on tools for this post. I know there is more ground to cover here, but you get the idea, I'm sure. Let's jump into the troubleshooting scenarios to piece things together.



 

Troubleshooting Scenario 1


Take a look at the diagram, then read the paragraphs below as we walk through a troubleshooting scenario.


Network Diagram for Troubleshooting Scenario 1
Network Diagram for Troubleshooting Scenario 1

Let's say a ticket arrives for degraded performance of a web application your company hosts for customers in two different geographies. This app is one your senior coworker, who is out on vacation, usually handles. Users are having to refresh timed-out pages, or pages are taking a while to load from time to time during all phases of their process (login, pulling data, inputting data), but when they finally are able to enter or pull data, the database doesn't seem to be having problems. DNS has been verified good by Tier 1. It's a very simple web application.


Without ever having worked on this app, you can see from this very high-level diagram how the network flow will operate. From here, if you wanted more detail, you might look at the POP or datacenter diagrams specifically, or maybe there is a more detailed flow diagram. This example shows why documentation is important: just looking here you can get an idea of where to check for network information to isolate the issue.


You might already know, or check the diagram and see, that this is an anycast application, so the next question I might ask is: is it all customers, or just customers from EAME or the Americas? Perhaps there is a routing issue where customers are routing outside of their region (that is an educated guess, though, not a deduction).


But no: you have an internet monitoring tool that shows all BGP routes have been stable around the internet. The ticket reports the affected customers are in North America, and all customers in the Europe, Africa, and Middle East region are operating fine.


You log in to your monitoring tool and see there are no down devices or bandwidth congestion alarms, but you do see excessive latency for traffic traveling through the Americas region POP2 from a probe in the USA (you have all the nice bells and whistles in this one). So you open a dashboard for POP2 which shows all the primary interfaces and stats of the routers/switches in that POP. This is where a pre-made dashboard for regions/areas of the network, or an automation script to check these items, comes in handy.


You notice the primary link to DC2 is 10Gb/s and traffic is hovering around 50% usage. You get a mobile call and are sidetracked for a few minutes; when you check back, the link is still around 50% utilization, just as before. This seems strange, as you'd expect traffic to drop a little or rise over a few minutes as flows are created and aged out. You refresh your dashboard and notice errors slowly incrementing on that interface.


You then log in and watch the traffic on the edge router which sends traffic to DC2, and the traffic seems pegged around ~4.8Gb/s. You run a traceroute from the internet edge router and it travels to DC2 as expected, although with high latency. The utilization hasn't been high enough to trigger any alerts, which you might typically get before a customer ticket. Hmm.


After reviewing another diagram you bring up the GUI for the load balancer cluster in DC2 because the internet routing and POP2's health look normal. You see a lot of stale TCP connections along with warnings of excessive TCP retransmissions from the load balancer cluster. Back end services are all green.


Based on this you suspect there is a circuit issue on the primary link from POP2 to DC2. A coworker suggests just bouncing the interface to see if that fixes it, but you reply that you can't, as there is still valid production traffic on the link.


Your hypothesis is that there is a circuit issue on the 10Gb/s link from POP2 to DC2, but with the carrier and not your equipment, as you have no other log errors and don't see any other interface problems.


"First we should check with the carrier of the circuit as it seems to be isolated in traffic from POP2."


You log in to the carrier's control center and run their automated tool, which comes back saying there is a potential problem, so you open a ticket with the carrier. It seems your hypothesis is correct, and you are looking to the carrier to further diagnose the issue in that path. While this is going on, the issue has gotten worse, with T1 reporting more tickets coming in; it seems marketing sent out a mass e-mail which has customers flocking to this web app.


"Lets re-route traffic away from DC2 to DC1 for America's users to alleviate the bandwidth congestion on the circuit with the fault". You engage the emergency CAB and ask to adjust local preference in BGP between DC1 and POP2 which will divert traffic so you can run load tests on the circuit without production traffic being impacted. You need to prove your hypothesis that there is a congestion issue on this circuit if the carrier comes back with no trouble found.


POP2's internet edge seems to be running fine, so you can still keep edge traffic entering the network in the region and avoid congestion in POP1. After running the local-pref change script, you see traffic traversing to DC1 from POP2. Next you bounce the 10Gb/s interface, as your coworker is insisting, and run iperf tests, but iperf only shows around ~3.5Gb/s. You notice in the app monitor that latency has decreased for North American users, along with your traceroute tests, after the diverting change; therefore the work-around is having a positive effect.
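
For the iperf part, here is a rough sketch of how you might wrap the test and pull out the throughput number, assuming iperf3 with JSON output, an iperf3 server listening at the far end, and a placeholder server address.

```python
# Sketch: run an iperf3 test against a far-end server and report the received
# throughput, so you can compare it to the circuit's expected 10Gb/s.
import json
import subprocess

def iperf_gbps(server: str, seconds: int = 10) -> float:
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],  # -J = JSON output
        capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    bps = data["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9

print(f"Measured throughput: {iperf_gbps('192.0.2.10'):.2f} Gb/s")  # placeholder server
```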


After about an hour, a NOC tech from the carrier calls regarding your ticket. It turns out the 10Gb/s circuit is on the backup protect path due to a fiber cut on the metro-E ring, and that protect path is congested, which means you're getting limited bandwidth and sub-optimal path traversal. They are working on repairing the fiber.


Later they report the issue is fixed. You run iperf and see the circuit is getting >9Gb/s, so you revert your change after hours, normalize the routing path from POP2 to DC2 again, and verify good metrics. The next day no issues are reported, so you document the changes and the carrier repair and set the ticket to auto-close. Issue resolved. Good job.


celebrate your troubleshooting wins
Celebrate your troubleshooting wins

Looking at this from the start, we first used documentation and then our tools and knowledge to form a hypothesis about what was causing the problem, as the incident statement was clear after asking some questions. As I mentioned with the static MAC ticket earlier, sometimes little details point you in the right direction.


Since we were unfamiliar with the setup, we first checked BGP routes since it's an anycast app, but once that proved not to be the case, we looked further at the flow diagram, which helped home in on a certain part of the network to isolate why latency was higher than normal.


After looking at the interfaces and devices in the network path, we identified the problem interface from anomalous data. We had to make a change in order to test and fully prove our theory, which also provided a temporary work-around for users experiencing excessive latency. Work-arounds can help buy time by pausing SLAs, but try to root-cause the ticket rather than leaving it in the work-around state permanently.


Once we opened a ticket with the carrier, it was clear there was a problem with the circuit in question. After the fix was in place we tested again to confirm it was indeed resolved, and then reverted to a normalized state since a work-around had been implemented. We stayed on task and worked methodically.


The reason I picked that incident is that I have personally had a very similar issue before. It also shows that sometimes when you are troubleshooting there will be unknowns where you don't have visibility; that is where experience can take over in your process, or maybe it means it's time to open a vendor or carrier ticket, or consult the business unit responsible for the area you need visibility into.



 

Troubleshooting Scenario 2


Take a look at the diagram, then read the paragraphs below as we walk through the second troubleshooting scenario.


Network Diagram for Troubleshooting Scenario 2
Network Diagram for Troubleshooting Scenario 2

Let's say you have the above topology as your network. You have multiple routers (NR) that border other areas of the network and have been assigned /60 IPv6 and /19 IPv4 BGP aggregated summaries, respectively. DCR1 and DCR2 only see those summary routes in the BGP routing table, as received from the carrier MPLS/NR routers. Networks 1, 2, and 3 all utilize OSPF for intra-network routing behind their respective NR routers.


A newer junior network technician has to deploy a new /24 and /64 as part of a network management test project behind NR2, on a test router in network 2. These were assigned by the other senior engineer, who is out sick, so you have no knowledge of the project.


Since it's a management-type network, the junior tech decides to use 10.100.100.0/24, "so it will be easy to remember!", for this network. The other senior engineer had already assigned a v6 subnet of 2001:db8:abcd:002f::/64, since those are all pre-allocated.


However, the tech reports that they cannot ping the new test router loopback 10.100.100.1 from either the DC1 or DC2 test servers, but they can from router NR2.


They say they see the /24 when viewing the routing table on NR2, but they don't have access to the DC routers to check, so they have engaged you, stating there is a problem on the DC routers for the new test network subnet.


Based on the info presented, do you see the issue? Using the diagram information and your knowledge, you should know what the likely issue is. What follow-up questions might you ask? Maybe "why do you think it's the DC routers?" The questions will vary between scenarios based on the type of issue and who you are talking to.




sometimes you can identify the issue quickly




Now that we have a reported problem and are in the issue identification stage, you might have already noticed that NR2 is only advertising 10.100.64.0/19, which covers only the networks 10.100.64.0/24 through 10.100.95.0/24. The summary that would cover the 10.100.100.0/24 subnet is 10.100.96.0/19, which is not present.
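
You can sanity-check that subnetting with a couple of lines of Python's ipaddress module rather than doing it in your head:

```python
# Sketch: confirm which /19 aggregate actually contains the new test subnet.
import ipaddress

new_subnet = ipaddress.ip_network("10.100.100.0/24")
advertised = ipaddress.ip_network("10.100.64.0/19")   # what NR2 advertises
would_need = ipaddress.ip_network("10.100.96.0/19")   # summary that would cover it

print(f"{advertised} covers {advertised.network_address} - {advertised.broadcast_address}")
print(new_subnet.subnet_of(advertised))   # False -> not reachable via the BGP summary
print(new_subnet.subnet_of(would_need))   # True  -> 10.100.96.0/19 would cover it
```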


As part of info gathering to make a determination, you might log in to check the routing tables of, say, DCR1 and NR2 (since the issue is isolated to those devices) to verify that the route is not in the BGP table, but is in the routing table via OSPF on NR2, which is expected based on what we know from the documentation.


The tech was right that the route was present, but it's an OSPF route, not a BGP route. You might check the BGP route filters or aggregate configuration to confirm it isn't allowed through there either, as the diagram states it's the /19 only. Thus you have isolated the issue to a bad deployment decision: the subnet is outside the allowed advertisement range.


As a repair, you might implement a different subnet on the test router, inside the existing /19 aggregated in BGP, which would fall within that summary and therefore work. Alternatively, you could create new configuration to advertise the 10.100.96.0/19 summary from NR2, if you didn't mind that space being assigned there, or you could allow only the /24 through. The last option is less optimal since it appears non-standard based on the current topology.
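
A small sketch of the first option: enumerate the /24s that already fall inside the aggregate NR2 advertises so the tech can pick one that isn't in use. The "in_use" set is a placeholder for whatever your IPAM says is taken.

```python
# Sketch: list candidate /24s inside the aggregate NR2 already advertises,
# minus whatever IPAM says is taken. The "in_use" set is a placeholder.
import ipaddress

aggregate = ipaddress.ip_network("10.100.64.0/19")
in_use = {ipaddress.ip_network(n) for n in
          ["10.100.64.0/24", "10.100.65.0/24", "10.100.66.0/24"]}  # placeholder data

available = [net for net in aggregate.subnets(new_prefix=24) if net not in in_use]
print(f"{len(available)} free /24s inside {aggregate}; first few: {available[:3]}")
```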


I like this example because it shows how a seemingly complex problem can be identified with basic subnetting knowledge and accurate documentation. Then, as we talked about earlier with network flows, you can isolate where you might need to make changes for the new network to work (DCR1/2 and NR2) as part of your repairs.



 

Chronic or Complex Scenarios


never give up, trust your instincts

Over the years there have been some complex issues I've had to solve (I might do more case posts in the future) and chronic situations that lasted weeks, months, or years. There's usually finger pointing and a lot of back and forth in these. It gets to the point where you want to give up or abandon the predicament. But I must tell you: never give up!


One frustrating situation is a SaaS or managed service scenario where you have no control or visibility and the issue keeps happening over time, or only in very niche situations. Always continue to drive the vendors and call for continuous review and data gathering to try to isolate the fix. There will be times when you go through the I.I.R. process and the repair doesn't remedy the problem, whether by you or the vendor; usually you'll want to analyze why, roll back, and start the process over. Providing more examples, more information, and escalations can often be fruitful here.


Still, there are times when you can't lean on a vendor and you are the ultimate point of escalation. It can get you down when you don't figure things out right away, or when you have to wait for the issue to reappear, but remember to be thorough, look at every step, and follow the methodical process. There will be times when you need to trust your instincts as well.


I encountered a chronic performance problem once that affected many users of a customer for a long time. The consensus at the time was that it was network-related, however no one had really proven that it was or was not. I therefore created a report that encompassed many aspects of the network troubleshooting process to showcase findings that there was actually not a network problem during the impacted occurrences.


Presenting the report to the customer also helped educate them on what a network problem does and doesn't look like, what the normal baseline was, and how different (or not) the baseline was during the chronic situation. Often you will need to educate more than convince. The post I did on the report can be found here on the blog, or you can download it here.


There are also special teams formed to solve chronic issues. We had the "Tiger team" at AT&T, which I was on for a chronic problem once, and I know some providers have called them tech "SWAT" teams (referring to the more specialized police units that handle special situations).


These special teams usually comprise some of the most senior or SME folks, who meet and discuss the issue at a high level first and then form a plan of action to chase it down to completion. They are often formed for the higher-impact, unsolved issues that T2 or T3 cannot close through standard processes and that require additional time and effort to resolve. It's something to think about if you are stuck and need help, or if you are senior level and looking to close out some of the plagues on your network.


When facing a chronic issue, I have found that a high percentage of the time the root cause comes from in-depth knowledge and the effort to analyze the data and logically walk through it. The data is often gathered by taking Wireshark captures at every step of the way, e.g. the client endpoint, intermediate routers or firewalls, and the far-end server. In this case it's not "pcaps or it didn't happen", it's "pcaps or you didn't troubleshoot".


 

Closing


I hope this post helped better equip you for troubleshooting by sharing some insight into network troubleshooting along with some of my experiences and how I handled them. The scenarios are two examples very similar to real-world issues I have encountered.


Remember to learn the basics and know how network flows operate. Also know your network and keep accurate documentation to assist in information gathering to form a diagnosis on where a root cause might be. Over time you will get better and get to encounter some fun ones.


I.I.R. - Identify, Isolate, and Repair - is an acronym that gives you something different to think about than the standard top-down, divide-and-conquer, and bottom-up methodologies you commonly see related to the OSI model. Identification and isolation are very important steps so you are not trying random changes to see what sticks.


When facing continuous or chronic problems, don't give up! The end users or customers are depending on you. Look at every step, take captures if the situation warrants it, and present the data to educate the stakeholders if needed.


Don't forget to use the tools at your disposal and to set up some dashboards or alerts that can help with proactive issue identification. Lastly, don't be afraid to research things related to what you are troubleshooting. And remember, "Knowing is half the battle!"


Good luck out there. Thank you


Would you like to know more?


Cisco CCNP TSHOOT book (non-affiliate link, just recommendation)



 


Copyright © 2024 by Brandon Hitzel
