Tuesday, December 24, 2024

What does it want to say? (An approach for chasing bugs)

Long ago when I was studying French, I came across an idiomatic way of asking about a word or thing. If you want to ask a French speaker to help you define a word’s meaning, you can ask, “What does it want to say?” (Qu'est-ce que ça veut dire?) I liked the idea that the word is personified as a willful entity in this question’s framing. The word has a desire or a will that needs to be considered. Over time, I came to think of technology problems in this way. If a computer wasn’t functioning properly, I’d frame it as “What is the computer wanting to say/do? And how am I seeing it try to do that? And what’s the result?” If you ever see strange behavior from an app or computer process, walking through the steps of how it communicates will sometimes help you troubleshoot. By narrowing down the steps in its way of reasoning/acting, you can often isolate the problem and facilitate the process so that the device can achieve its goal, which is a proxy for your goal in using the device.

If your phone or computer wants to find an internet connection, is there one? Is it LAN cable, wifi, LTE, Edge, 3G, 5G? (Each differs slightly in how it connects and in what volume of data it can transmit.) If your device “wants” to use that internet connection to reach a website to refresh its content, what happens or doesn’t happen when the server returns a response to that request? Gradually, from the keyboard to the screen to the operating system to the network to the connected service provider, you can observe each step in this flow. When I worked in Tokyo providing search engine services to internet portals, my clients would sometimes call me on my cell phone asking me to diagnose an issue. I could run an isotope query or traceroute through the network to test their servers’ response and connection to my company’s servers. It’s like the game of Poohsticks from the Winnie-the-Pooh stories. If you drop a stick on one side of a bridge, you can watch it come out on the other side and have races with your friends, or bears, to see whose stick comes out the fastest. This is the same process as doing device-level troubleshooting. Who is saying what, and how? Is your email stuck in your outbox, for instance?
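
As a minimal sketch of watching each step in that flow, the Python below runs a list of checks in order and stops at the first one that fails, so you can see how far the "stick" travels. The helper names (run_stages, web_stages) and the choice of three stages (name lookup, TCP connection, TLS handshake) are my own assumptions for illustration, not the tools I used in Tokyo.

```python
import socket
import ssl


def run_stages(stages):
    """Run (name, callable) pairs in order and stop at the first failure."""
    report = []
    for name, step in stages:
        try:
            report.append((name, "ok", str(step())))
        except OSError as exc:  # DNS, socket, timeout and TLS errors all derive from OSError
            report.append((name, "failed", str(exc)))
            break
    return report


def web_stages(host, port=443, timeout=5.0):
    """Build DNS -> TCP -> TLS checks for one host (hypothetical helper)."""

    def dns():
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})

    def tcp():
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return sock.getpeername()[0]

    def tls():
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as secure:
                return secure.version()

    return [("dns", dns), ("tcp", tcp), ("tls", tls)]
```

Calling run_stages(web_stages("example.com")) on a connected machine should return one row per stage. If the last row says "failed", that stage is the first place to look.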

I recently had to troubleshoot a perplexing issue. My father so enjoyed the process of going through this with me that he asked me to write about it. He could find no web documentation of the issue as he was experiencing it. My father worked as a systems engineer for many years at IBM and wrote programs in dozens of languages, from early mainframe computers to modern-day Macs. He knows pretty much every trick that a Mac can do and has followed computer web forums for years to expand his understanding of how the Apple operating systems have shifted from pre-System 10 (aka OS X), through the PowerPC and Intel chip phases, to modern “Apple Silicon” chips. So his reaching out to me about a technical bug is somewhat of a rare thing. It was usually the other way around. But I was able to narrow down the observable symptoms to several potential root causes until finally identifying the bug as a network issue rather than an operating system or hardware issue. In case this issue affects your home computer/network, or if you’re just curious about the steps involved, here is how we sorted it out.

My father’s computer had a peculiar symptom that he’d never seen before in 30+ years of working on Macs, and that I’d not seen either. From the inside perspective, the symptoms were that he couldn’t access bank websites on his new Mac devices. He couldn’t do a phone-home system re-install from Apple servers nor contact Apple support from the machine’s integrated communication system. The symptom didn’t happen on his wife’s older operating system on an Intel-chip Mac. So he narrowed his conclusion down to either his Apple Silicon Mac or the operating system, which had recently been upgraded, as the potential source of the issue. From the outside perspective, his FaceTime availability disappeared for me when I tried to reach him. I couldn’t call him with Voice over IP tools. So I suspected something was wrong with his ID account, or potentially a compromised account. Because I’d read in the press about phishing tactics and how to avoid them, I started by triaging whether there was a potential malware issue with his machine. That didn’t seem to be the case. Everything else on his Mac worked, except for those apps that required web resources to be fetched securely by an internal web-dependent function. He checked with his banks to ensure there had been no suspicious activity in his accounts nor recent attempts to reroute or reset his bank logins. We established a secondary channel of communication and confirmed that his account phone numbers had not been redirected either. Once we sorted out that he wasn’t at any immediate risk, we took to testing and ruling out other potential issues.

Cutting to the chase scene: Ultimately, after ruling out as many factors as we could, we isolated the issue as being an IPv6 problem. Over a decade prior I had attended a lecture on how the web industry was transitioning to a new process of generating IP addresses, to accommodate the broader range of computers expected to come online over future decades. The IPv4 process for issuing IP addresses to devices, so they can identify themselves as unique entities across the web, was going to reach a scaling issue akin to the old Y2K issue that took place at the turn of the 21st century. (More information about that elsewhere, as it’s analogous but not directly related: it had to do with date formats, not device namespace.) The new IPv6 process of self-identifying devices over a network would use a much wider range of values than IPv4 addresses, meaning that there would be lower risk of any two devices being confused with each other and creating network conflicts due to simultaneous inbound connection requests. This lecture was deep in my memory, but it was triggered by a comment someone made about the IPv6 transition resulting in higher security of networks in the future. It was relevant here because my father’s computer was somehow communicating in a way that wasn’t being accepted by banks and Apple itself. Could there be a difference in how Intel-chip Macs and their operating systems convey TLS (Transport Layer Security) traffic over the web? Sure enough, that was the issue. We found that banks were accepting the IPv4 traffic from the older Macs. But the newer Macs and their respective OSes were trying to transmit IPv6 values, which weren’t getting through the network. Once we configured his network to route the IPv6 values generated by the OS (and which we now suspect are anticipated by banks), his computer, browsers and applications started functioning flawlessly again.
You can read much more in depth about IPv4 and IPv6 elsewhere, but suffice it to say that there was nothing wrong with his Mac. It was the attempts to communicate secure device-unique values over the network that were failing.
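
To put a number on that "much wider range of values," here is a small illustration using Python's standard ipaddress module. The two sample addresses are reserved documentation addresses rather than anything from my father's network, and describe_address is a name I made up for this sketch.

```python
import ipaddress

# Total number of addresses each scheme can express.
v4_space = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
v6_space = ipaddress.ip_network("::/0").num_addresses       # 2**128


def describe_address(text):
    """Report which IP version a textual address belongs to and its bit width."""
    addr = ipaddress.ip_address(text)
    return f"IPv{addr.version}, {addr.max_prefixlen} bits"


print(v4_space)                         # 4294967296
print(v6_space // v4_space == 2 ** 96)  # True: each IPv4 address maps to 2**96 IPv6 ones
print(describe_address("192.0.2.1"))    # IPv4, 32 bits
print(describe_address("2001:db8::1"))  # IPv6, 128 bits
```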

I hope that you don’t run into a network routing problem like he had. Bank customer service, Apple customer service and even your internet provider may not be familiar with your home computing or network setup. They also may have difficulty understanding what issues you face based on how you describe the problem. But tracking down how the symptoms present will help you communicate with them to resolve whatever challenges you face.

We delegate our processes to these device "agents" that act on the web on our behalf. Just like humans, they can get tripped up on the way to saying things, or the channel through which to express them. Like studying a foreign language, we can examine the terms our agents use to help them communicate for us more effectively. When their speech breaks down, we have only to examine the vocabulary and steps they use to get across their "meaning" and thereby return them to functioning eloquently on our behalf.

For more on IPv6 see: https://en.wikipedia.org/wiki/IPv6

Special thanks to the Las Vegas Consumer Electronics Show for offering the lecture on IPv4 vs IPv6 that set us on the right track in this particular case. 

For those who want to follow the troubleshooting steps we used, here are the important clues and conclusions and the route to isolating the problem behavior:

  • Important steps in the investigation were first the non-working FaceTime VOIP service and his device’s inability to connect to Apple servers. Not only could he not access Apple’s network, Apple’s network could not reach him either. It was a two way problem across multiple applications, but non-sensitive web traffic was unhindered.
  • Testing the IP address configuration was the main key to resolving it. My computer registered an IPv6 address when querying https://whatismyipaddress.com from outside his home. His computer registered an IPv4 address but not an IPv6 address. Then, when I tested my other computer in his home environment, it suffered the same issue as his. (I used a more recent beta version of macOS than he did.) Replicating the bug with a different machine on a different version of the OS conclusively proved that the network was the gating source of the problem.
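
A rough local analogue of that address check can be sketched in Python. This is not what we ran (we used the website above). It only asks the operating system which local address it would use for each address family, which says nothing about whether the router actually passes the traffic. The probe targets are reserved documentation addresses, and the function names are my own.

```python
import socket

# Documentation-only addresses, used here purely as routing probes (an assumption of this sketch).
PROBES = {
    socket.AF_INET: ("192.0.2.1", 53),
    socket.AF_INET6: ("2001:db8::1", 53),
}


def local_address(family):
    """Return the local address the OS would use for this family, or None if it has no route."""
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.connect(PROBES[family])  # a UDP connect selects a route without sending a packet
            return sock.getsockname()[0]
    except OSError:
        return None


def describe_stack(v4, v6):
    """Summarize which address families appear to be available."""
    if v4 and v6:
        return "dual-stack"
    if v4:
        return "IPv4 only"
    if v6:
        return "IPv6 only"
    return "no route"
```

Running describe_stack(local_address(socket.AF_INET), local_address(socket.AF_INET6)) on the same laptop in two different locations gives a quick comparison of what each network offers.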

Questions and steps in our exploration that narrowed down to the answer:

  • Had his IP addresses been flagged as a phishing or malware source, leading to banks blocking traffic? (Confirmed not.)
  • Operating system issue? Try a fresh install of a base operating system. (Not possible in his case because the Apple Silicon OS doesn’t allow boot from terminal mode on an external machine the way Intel-based Macs did. Both of his Macs couldn’t revert to his wife’s OS because of OS incompatibility between Intel and Apple Silicon versions.)
  • Browser issue? He was accessing banks via several browsers; all failed, while regular non-logged-in sites functioned fine. (This means there was not a browser-dependent issue causing the problem. But because secure sites were failing to load, including Apple, it made me suspect that Transport Layer Security over TCP/IP was involved in some way, which it ultimately turned out to be.)
  • Cable internet access restrictions? Because some cable internet providers offer parental controls, I suspected that Comcast could have inadvertently rolled out a traffic throttling limit for some accounts or for all customers in a region. Did any of his friends who used this provider complain of loss of access? (It was just him. Specifically it was his newer Macs on the newest operating system release.)
  • We reset the DHCP settings of his computer to no avail.
  • Finally we bypassed his router and thereby resolved the issue. We resolved to let his router be used for non-sensitive traffic around the house, but not sensitive or secure traffic.
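
The reasoning behind moving a known-good laptop onto the suspect network can be written down as a tiny piece of logic. The sketch below is only an illustration of that elimination step. The Observation record and the gating_networks name are hypothetical, and the sample data paraphrases what we saw.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    device: str
    network: str
    works: bool


def gating_networks(observations):
    """Networks on which a device fails even though that same device works elsewhere."""
    works_somewhere = {o.device for o in observations if o.works}
    return sorted(
        {o.network for o in observations if not o.works and o.device in works_somewhere}
    )


seen = [
    Observation("father's Mac", "home", False),
    Observation("my Mac", "outside", True),
    Observation("my Mac", "home", False),
]
print(gating_networks(seen))  # ['home']
```

A device that changes from working to failing when only the network changes points at the network, which is why carrying my laptop into his house was the decisive test.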
