The bug that almost took me out I
It must have been a month now since I first started tackling this bug. I almost accepted defeat and claimed insanity like the main character in “The bug”. I did, however, develop an unfriendly scowl and slack off on maintaining my image. Not only is this a great example of some of the extremes I encounter, it is also “the bug that almost took me out but not quite”. Most of my colleagues tell me that it has the strangest twist they have ever encountered. I see it as a good lesson on how various assumptions and cost reductions across different products can eventually result in “The One”.
The first report of the bug appeared a month ago, in a customer complaint about some missed bytes over RS-232 serial communication. For those of you unfamiliar with this standard: we usually laugh such reports off, because serial communication has been done to death and any error should be recovered by the redundancy, handshaking and recovery routines that are inherent in the communication protocol. The problem, therefore, must lie in loose cables or chipped connectors somewhere. So the first thing we did was send the customer on a witch hunt. Of course, being a suspicious engineer by nature, I set up something similar on my end and ran some tests overnight just to make sure that we hadn’t accidentally “designed” it wrong.
Christmas came and went, and the memories of this bug slipped to the back of my mind as I tackled numerous demands from customers deploying their systems all over the place in order to get rid of their inventory and strategically position themselves for the coming recession. This particular client was in a different situation, however, as their systems were already deployed and were only now encountering this bug in the field. As time went by, more and more reports of the bug appeared, until eventually it happened on one of the development systems they keep in house. This presented them with an opportunity to hook up all sorts of monitoring devices at each point to track the flow of data. Lots of expensive equipment and probes later, they found that 8 bytes of data would randomly disappear after being received by the serial communication port. The nail in the coffin was that the problem only happened when the cable was connected to our product. By now, we had gathered several confusing pieces of evidence that didn’t make sense and were in conflict with each other:
- The serial communication works when the cable is connected to the COM port of another product
- It only happens on certain systems
- Errors are reported after 15 retries have failed
- On systems that do not have the problem, it didn’t fail even once
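For readers who have not worked with this kind of recovery logic, here is a minimal sketch of what "report an error after 15 failed tries" can look like. It is not our actual protocol: the additive checksum, the function names and the simulated receiver are all assumptions made up for illustration, and I am treating the 15 as a cap on total attempts.

```python
MAX_ATTEMPTS = 15  # from the list above; treated here as total attempts (assumption)


def checksum(payload: bytes) -> int:
    """Hypothetical 8-bit additive checksum; the real protocol's check is not described."""
    return sum(payload) & 0xFF


def send_with_retries(payload: bytes, transmit, max_attempts: int = MAX_ATTEMPTS) -> int:
    """Send a frame until it is acknowledged.

    `transmit` is a hypothetical callable: it takes the frame bytes and returns
    True if the receiver acknowledged them. Returns the number of attempts used,
    or raises IOError once every attempt has failed.
    """
    frame = payload + bytes([checksum(payload)])
    for attempt in range(1, max_attempts + 1):
        if transmit(frame):
            return attempt
    raise IOError(f"no acknowledgement after {max_attempts} attempts")


def make_lossy_receiver(drop_first: int):
    """Simulated receiver that loses the last 8 bytes of the first `drop_first` frames."""
    state = {"frames_seen": 0}

    def receive(frame: bytes) -> bool:
        state["frames_seen"] += 1
        if state["frames_seen"] <= drop_first:
            frame = frame[:-8]  # simulate 8 bytes vanishing after reception
        if len(frame) < 2:
            return False
        body, check = frame[:-1], frame[-1]
        return checksum(body) == check  # acknowledge only if the checksum matches

    return receive
```

With a receiver that drops bytes on its first three frames, `send_with_retries(bytes(range(1, 17)), make_lossy_receiver(3))` succeeds on the fourth attempt. With a receiver that always drops bytes, every attempt fails and the error finally surfaces, which is the only time the user would notice anything at all.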
So it seems obvious at this point that we only need to compare the “good” and “bad” systems and look at what is different in order to zero in on the problem. Here’s the kicker. All systems are exactly the same. Same hardware bought at the same time, same installation procedure, and the HDDs are all restored from the same disk image. In essence, they are clones of each other.
How can some clones have problems while others don’t? The mystery intensifies…