I have been hit by a problem caused by binary-only software. The program
used to run on my main home server. The catch is that I wrote the program
myself and unfortunately left the source in /tmp, where it was lost over six
months ago. The only traces of the source I could find on my system were some
object files in ~/.ccache.
The binary was used to retrieve cover images for the web front end of my
music player. The web front end is just a hacked-together PHP script that
lets me control the music player easily from my PDA as well.
After updating my server to Fedora Core 6 the binary stopped working
because it was linked against an older OpenSSL version and there was no
compat package in FC6. Luckily I found an SRPM for exactly this case
at http://people.redhat.com/tmraz/openssl097f/.
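This kind of breakage shows up in the output of ldd, which marks any shared
library it cannot resolve as "not found". Below is a minimal sketch in Python,
assuming ldd is installed and Python 3.7 or newer. The function names
missing_libraries and check_binary, the library name in the comment and the
path in the usage line are all made up for illustration and are not taken
from my actual setup.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # An unresolved dependency looks like "libssl.so.5 => not found"
        # (the library name here is only a hypothetical example).
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

Calling check_binary("/path/to/binary") with a hypothetical path would return
an empty list once a matching compat library is installed.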
I am not sure whether I should rewrite it or just do nothing. I will probably
rewrite it at some point.
So now we are also running our mirror server on Dell hardware: a PowerEdge
6950. It has two dual-core AMD CPUs (8214) and 8GB of RAM. It supports up to
4 CPUs and 64GB of RAM, so we can still upgrade this machine in the next few
months.
The server arrived 3 weeks ago, and after running memtest on it for two
days we moved to the new server on the third day. Everything went pretty
well and the first impressions were very positive. But then on the weekend
it crashed for the first time, with nasty ext3 errors, and we really did not
know where they came from. We rebooted it on Monday, and two days later it
crashed again with the same errors. Very disappointing. Our first idea was
that maybe x86_64 is after all not as stable as i386, but we did not really
believe that this was the cause.
We did a lot of filesystem checks, recreated some of the file systems and
tried to get the serial console running so that if it crashed over Christmas
we would be able to reboot it remotely. But it crashed again on the 26th of
December.
By then I had already done some research on Google to see whether other
people had reported similar problems. I found a Dell mailing list where
people were describing the same problems, though not on the same
hardware… except that all the servers had similar RAID controllers. So after
looking at the updates available for this server I found an error description
which matched ours pretty well: on systems with 8GB or more RAM the RAID
controller might get stuck, and the only thing you can do is reboot. So I
installed this update on the 27th, and so far the system has been running
without any errors. It would be great if the system is now stable and does
not need a reboot every second day.
The only disappointing thing about this is that Dell shipped a server which
they knew was broken, because the update was available before we had the
server in our hands. But this is probably the usual case where sales and
development have no connection in a company as big as Dell.