Monthly Archives: August 2006

Recovery from ethernet card failure

As part of my talk at the O’Reilly OSCON I had hoped to use examples from some off-site servers. Unfortunately the Internet connection to those servers went down and I had the only key to the server room with me in Portland!

I have Nagios set up for the servers along with remote power control. When Nagios notices that certain things are unreachable, it fires scripts that begin troubleshooting and reset devices along the way, in hopes that a reboot will bring the connection back to life. Unfortunately, the errors this time were apparently due to an ethernet card going bad. It appears that an electrical event may have damaged the interface, along with another device unrelated to the Internet connection.

In any event, when I returned from Portland I set about trying to figure out the issue. The kernel was reporting errors like:

eth1: mismatched read page pointers 4c vs 6f.
eth1: timeout waiting for Tx RDC.
eth1: mismatched read page pointers 0 vs 63
eth1: bogus packet size: 65280, status=0x0 nxpg=0x0

Whenever the bogus packet size error came up, the interface would essentially stop working. Bringing the interface down and then back up seemed to fix the issue temporarily. The issue also seemed to be load related: the heavier the load, the more frequent the errors. I didn’t determine (as in, I didn’t bother to tcpdump) whether something else on the network was triggering the bogus packet size, but that didn’t appear to be the case.
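Had I wanted to put numbers on how often the error showed up, something like the following would have done it. This is only a minimal sketch: the sample lines are the four kernel messages quoted above, the function name count_bogus_packets is made up for illustration, and I'm assuming the interface name is followed directly by the message text. Real log files usually carry timestamps and hostnames in front, which is why the pattern searches within the line rather than anchoring at the start.

```python
import re
from collections import Counter

# Matches e.g. "eth1: bogus packet size: 65280, status=0x0 nxpg=0x0".
# The pattern is an assumption based on the messages quoted in this post.
BOGUS_PACKET = re.compile(r"\b(?P<iface>eth\d+): bogus packet size: (?P<size>\d+)")


def count_bogus_packets(lines):
    """Count 'bogus packet size' kernel messages per interface."""
    counts = Counter()
    for line in lines:
        match = BOGUS_PACKET.search(line)
        if match:
            counts[match.group("iface")] += 1
    return counts


# The kernel messages quoted earlier in this post.
sample = [
    "eth1: mismatched read page pointers 4c vs 6f.",
    "eth1: timeout waiting for Tx RDC.",
    "eth1: mismatched read page pointers 0 vs 63",
    "eth1: bogus packet size: 65280, status=0x0 nxpg=0x0",
]

for iface, count in count_bogus_packets(sample).items():
    print(f"{iface}: {count} bogus packet size error(s)")
```

Fed the output of dmesg or a kernel log file line by line, the counts could be compared between quiet and busy periods to check the load-related hunch.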

After the interface reset had only a temporary effect, I followed this troubleshooting path:

1. I rebooted the server. I know it’s Linux and doesn’t really require reboots, but I thought it might help and it was painless anyway. It didn’t help.
2. I replaced the cable. Again, a simple and quick fix. It had no effect, so I put the old cable back on.
3. I tried a different interface in the firewall. The machine has four ethernet cards in it, so I plugged into a different one. This fixed the problem. I interpreted this to mean that neither the router’s interface nor malicious activity (bogus packets) was the cause.
4. After running on the other card for a while, I decided that my observations must be correct, so I replaced the bad ethernet card (an NE2k clone) with an Intel card that I had as a spare.

I hadn’t previously used the ‘ethtool’ program but I found it to be useful here, even though it wasn’t directly involved in the ultimate fix. ifconfig, arp -a, netstat -rn, and various others were my friends.

Suehring’s Law of Laptop Life Expectancy

I have a couple of partitions on my laptop, one with Microsoft Windows XP and one with Ubuntu Linux. The Windows partition has been locking up fairly regularly. When I say locking up, I mean a hard lock where I lose mouse, keyboard, everything. It never comes back. The only way to recover is to hold the power button to power the unit down. Naturally, the Linux side doesn’t have any of these behaviors, which leads me to believe that either it’s not hardware related or, if it is, the Linux side is more tolerant of the problem, whatever it may be.

Either way, the laptop is over four human years old. I use the term “human years” because laptops age at a different rate than humans and a different rate than other computers. This is of course due to their portable and largely non-upgradeable nature. They get beat on, kicked, and generally abused and there’s not a great way to replace their bits when they go bad.

I’ve come up with a formula, which I call “Suehring’s Law of Laptop Life Expectancy” or just “Suehring’s Law” for short. The formula is still being refined as I receive feedback, but here it is nonetheless:

laptop_age = years * (22.4 (+/- 5)) + (number_times_traveled * operating_system_factor)

operating_system_factor:
Linux = 0.7
MS Windows = 1.3

For example, my laptop is 4 years old and it has traveled about 40 times, more or less, and it primarily runs a *nix OS. Therefore, the formula looks like this:

laptop_age = 4 * 22.4 + (40 * 0.7)

This in turn can be expressed:

laptop_age = 89.6 + 28

Therefore my laptop’s age is: 117.6 laptop years old.
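For anyone who would rather not do the arithmetic by hand, here is a rough sketch of Suehring’s Law as a small function. The function name, the parameter names, and the dictionary keys are made up for illustration; the constants come straight from the formula above. I’ve left the +/- 5 fudge factor as an optional adjustment to the 22.4 multiplier, which is my own reading of how that term is meant to apply.

```python
# Operating system factors from the formula above.
OPERATING_SYSTEM_FACTOR = {
    "linux": 0.7,
    "windows": 1.3,
}


def laptop_age(years, number_times_traveled, operating_system, adjustment=0.0):
    """Return the laptop's age in laptop years.

    adjustment shifts the 22.4 multiplier and is assumed to stay
    within the +/- 5 range given in the formula.
    """
    if not -5 <= adjustment <= 5:
        raise ValueError("adjustment must be between -5 and 5")
    factor = OPERATING_SYSTEM_FACTOR[operating_system]
    return years * (22.4 + adjustment) + number_times_traveled * factor


# The worked example: 4 years old, about 40 trips, primarily Linux.
age = laptop_age(4, 40, "linux")
print(f"{age:.1f} laptop years")
```

Running the same numbers with "windows" instead of "linux" swaps the 0.7 factor for 1.3, so the travel term becomes 40 * 1.3 rather than 40 * 0.7.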