FOLLOWUP -- Re: Mysterious server lockups with Ubuntu...
Ignoramus3863...
Posted: Sun Aug 31, 2008 12:06 pm
Guest
If you recall, I was asking what I can do with an important production
server that would lock up every few days, sometimes even more
often. It was running Ubuntu hardy 64 bit. I tried a few things, HPET
disable, nohz, you name it. Nothing helped. It was running 2.6.24.

My reading on this matter suggested that 2.6.24 is not that great a
kernel, that Fedora chose 2.6.25 for a reason, etc., so I decided to try
the kernel route.

So, I finally tried one more thing, which is to download 2.6.25 from
kernel.org, compile it and use it without any special flags like
hpet=disable and so on. Just a standard compile, standard install,
no extra arguments, etc.

After that, with 2.6.25, it's been 13 days, the server is up, and does
not seem to crash (knock on wood). It is not yet conclusive, so I will
keep an eye on it.
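
One way to confirm that a test like this really runs "without any special flags" is to inspect the command line the kernel was booted with. The following is a minimal sketch, not something from the original post: the function names are made up for illustration, "hpet=disable" is quoted from the post, and "nohz=off" is an assumed spelling of the "nohz" option mentioned there.

```python
from pathlib import Path

# Workaround flags to look for. "hpet=disable" is quoted from the post;
# "nohz=off" is an assumed spelling of the "nohz" option mentioned there.
WORKAROUND_FLAGS = ("hpet=disable", "nohz=off")


def workaround_flags_in(cmdline, flags=WORKAROUND_FLAGS):
    """Return the workaround flags present in a kernel command line string."""
    tokens = cmdline.split()
    return [flag for flag in flags if flag in tokens]


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text().strip()
```

On a Linux machine, `workaround_flags_in(running_cmdline())` returning an empty list would suggest none of the listed workarounds are still in effect.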

i
Matt...
Posted: Tue Sep 02, 2008 6:19 am
Guest
Ignoramus3863 wrote:

Quote:
often. It was running Ubuntu hardy 64 bit. I tried a few things, HPET
disable, nohz, you name it. Nothing helped. It was running 2.6.24.


I didn't see you mention previously that it was 64 bit.
Ignoramus2176...
Posted: Tue Sep 02, 2008 7:28 am
Guest
On 2008-09-02, Matt <matt at (no spam) themattfella.xxxyyz.com> wrote:
Quote:
Ignoramus3863 wrote:

often. It was running Ubuntu hardy 64 bit. I tried a few things, HPET
disable, nohz, you name it. Nothing helped. It was running 2.6.24.


I didn't see you mention previously that it was 64 bit.

Yes... That's what it is, 64 bit....
--
Due to extreme spam originating from Google Groups, and their inattention
to spammers, I and many others block all articles originating
from Google Groups. If you want your postings to be seen by
more readers you will need to find a different means of
posting on Usenet.
http://improve-usenet.org/
Matt...
Posted: Fri Sep 05, 2008 12:51 am
Guest
Ignoramus2176 wrote:
Quote:
On 2008-09-02, Matt <matt at (no spam) themattfella.xxxyyz.com> wrote:
Ignoramus3863 wrote:

often. It was running Ubuntu hardy 64 bit. I tried a few things, HPET
disable, nohz, you name it. Nothing helped. It was running 2.6.24.

I didn't see you mention previously that it was 64 bit.

Yes... That's what it is, 64 bit....


I would expect more problems in 64-bit software because it is less used.

We bought a couple DEC Alpha-based Linux workstations in '95 or '96. I
don't remember all the details ... We were pretty excited to run them,
but our hello, world C programs dumped core every time, or maybe gcc
crashed every time---some gross problem like that. Meanwhile our
Pentium Linux systems worked fine. We didn't even mess with the Alphas
anymore---we just sent them back for a refund. I want to say that they
ran some (poorly) customized Red Hat, but that could be wrong.
Ignoramus29627...
Posted: Fri Sep 05, 2008 6:10 am
Guest
On 2008-09-05, Matt <matt at (no spam) themattfella.xxxyyz.com> wrote:
Quote:
I would expect more problems in 64-bit software because it is less used.

We bought a couple DEC Alpha-based Linux workstations in '95 or '96. I
don't remember all the details ... We were pretty excited to run them,
but our hello, world C programs dumped core every time, or maybe gcc
crashed every time---some gross problem like that. Meanwhile our
Pentium Linux systems worked fine. We didn't even mess with the Alphas
anymore---we just sent them back for a refund. I want to say that they
ran some (poorly) customized Red Hat, but that could be wrong.

Well, by now it's been 17 days and the server is still up and running
with custom compiled 2.6.25. So definitely, the cause was some server
bug in 2.6.24 Ubuntu kernel that I ran previously.

The Natural Philosopher...
Posted: Fri Sep 05, 2008 8:06 am
Guest
Ignoramus29627 wrote:
Quote:
Well, by now it's been 17 days and the server is still up and running
with custom compiled 2.6.25. So definitely, the cause was some server
bug in 2.6.24 Ubuntu kernel that I ran previously.

or the bad RAM bit or bad sequence of bus commands that caused the crash
is not being exercised by the new kernel..


I don't want to alarm you but many years at the blunt end of hardware
programming show that bad hardware can and does only show up under a
precise set of circumstances sometimes.

I spent several days tracing such: it only showed up when doing a DMA
transfer from floppy disk to a particular RAM location.

It was nothing to do with the disk, the RAM or the DMA controller. It
was a third party card on the IO bus that woke up and grabbed the bus
when a precise set of timings and addresses were presented to it in a
precise sequence.
Ignoramus29627...
Posted: Fri Sep 05, 2008 8:14 am
Guest
On 2008-09-05, The Natural Philosopher <a at (no spam) b.c> wrote:
Quote:
or the bad RAM bit or bad sequence of bus commands that caused the crash
is not being exercised by the new kernel..

Or maybe a tooth fairy.

Quote:
I don't want to alarm you but many years at the blunt end of hardware
programming show that bad hardware can and does only show up under
a precise set of circumstances sometimes.

Certainly, but the new kernel eliminated whatever it was.

Quote:
I spent several days tracing such: it only showed up when doing a DMA
transfer from floppy disk to a particular RAM location.

It was nothing to do with the disk, the RAM or the DMA controller. It
was a third party card on the IO bus that woke up and grabbed the bus
when a precise set of timings and addresses were presented to it in a
precise sequence.

I think that the most likely explanation is a kernel bug that was
fixed.

Jean-David Beyer...
Posted: Fri Sep 05, 2008 9:15 am
Guest
The Natural Philosopher wrote:

Quote:
I don't want to alarm you but many years at the blunt end of hardware
programming show that bad hardware can and does only show up under a
precise set of circumstances sometimes.

Boy is that ever true!


I wrote an operating system for a Computer Control DDP-224 computer over 25
years ago. At one point, the console typewriter would type out

Rdady $

instead of

Ready $

Otherwise it worked fine. I checked the RAM and it contained "Ready".

And it did not seem to be the typewriter because I could write a trivial
program and type out "Ready $" just fine. I ran the vendor-supplied memory
test program for hours and found no errors, on a machine with only 16384
words of RAM.

The hardware techie said he was not willing to debug my OS, and I should
write a 10-line program that showed it was the hardware of the computer.

After thinking about it, I diddled the loader and loaded the entire OS one
word further up in absolute memory than it used to go, and it worked fine.
(My OS was relocatable even though I always loaded it in the same place.) So
the code was OK, but the trouble varied depending on where in memory it was.

Then I took just the code from the typewriter driver and loaded it into the
computer without an OS. I put it in various absolute locations, and it
always worked fine until I put it where it actually would have been if the
OS had been in there. And it failed. Well, it was longer than 10 words, but
it convinced the hardware techie and he found and fixed the memory problem.
Back in the days of magnetic core memory and no memory management unit.
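
The relocation trick described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the memory is simulated, the fault is a single cell that flips one bit on every store, characters are stored one per word in ASCII, and all names are hypothetical. The real fault was evidently subtler, since the vendor memory test passed.

```python
class FaultyMemory:
    """Toy word-addressed memory with one cell that corrupts what is stored."""

    def __init__(self, size, bad_address, stuck_mask=0x01):
        self.words = [0] * size
        self.bad_address = bad_address
        self.stuck_mask = stuck_mask

    def store(self, address, value):
        if address == self.bad_address:
            value ^= self.stuck_mask  # hypothetical fault: one bit flips
        self.words[address] = value

    def load(self, address):
        return self.words[address]


def loads_cleanly(memory, base, image):
    """Store image at base, read it back, and report whether it survived."""
    for offset, word in enumerate(image):
        memory.store(base + offset, word)
    return all(memory.load(base + i) == w for i, w in enumerate(image))


def failing_bases(memory_size, bad_address, image):
    """Try every load address and return those where the image is corrupted."""
    bad = []
    for base in range(memory_size - len(image) + 1):
        memory = FaultyMemory(memory_size, bad_address)
        if not loads_cleanly(memory, base, image):
            bad.append(base)
    return bad
```

Calling `failing_bases(32, 10, [ord(c) for c in "Ready"])` lists only the load addresses that place some character on the bad cell; everywhere else the same image loads cleanly, which is the effect of moving the OS up by one word.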

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 10:00:01 up 29 days, 16:06, 4 users, load average: 4.01, 4.15, 4.12
Jean-David Beyer...
Posted: Fri Sep 05, 2008 9:20 am
Guest
Ignoramus29627 wrote:

Quote:
I don't want to alarm you but many years at the blunt end of hardware
programming show that bad hardware can and does only show up under
a precise set of circumstances sometimes.

Certainly, but the new kernel eliminated whatever it was.

If it really was a kernel bug, why no other people at least saying "me too"
about this bug?

My guess is that you do have a hardware problem and the new kernel just
hides it. For example, it may be a memory problem and the problem memory
cells are now used for an error message that is seldom or never produced?

If it is something like that, it could come and bite you later, for example
if it is a module that gets paged out, and later gets paged back somewhere else.

These problems are terribly difficult to find. I am often forced to use
intuition to find them. Luckily my intuition is fairly good, but does not
usually work "on demand."

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 10:15:01 up 29 days, 16:21, 4 users, load average: 4.83, 4.56, 4.36
The Natural Philosopher...
Posted: Fri Sep 05, 2008 10:17 am
Guest
Jean-David Beyer wrote:
Quote:
If it really was a kernel bug, why no other people at least saying "me too"
about this bug?

My guess is that you do have a hardware problem and the new kernel just
hides it. For example, it may be a memory problem and the problem memory
cells are now used for an error message that is seldom or never produced?

If it is something like that, it could come and bite you later, for example
if it is a module that gets paged out, and later gets paged back somewhere else.

These problems are terribly difficult to find. I am often forced to use
intuition to find them. Luckily my intuition is fairly good, but does not
usually work "on demand."

Worse still, it may be like the one that I had..it only corrupted two
bytes of something loaded off a floppy..now consider that somewhere
within a CD-ROM loaded kernel, two bytes are wrong..
Jean-David Beyer...
Posted: Fri Sep 05, 2008 11:06 am
Guest
The Natural Philosopher wrote:

Quote:
Worse still, it may be like the one that I had..it only corrupted two
bytes of something loaded off a floppy..now consider that somewhere
within a CD-ROM loaded kernel, two bytes are wrong..

I do not think that is likely, though it is not impossible. I use Red Hat
Enterprise Linux, and the CD-ROM images are checksummed. I verify the
checksum before I burn them to CD-ROMs. Then when I install from CD-ROMs,
the install program does its own checksum of the disks to ensure that they
are OK. So those things have survived two checksum tests. The first one is
MD5. I do not know what the install program uses. RPM files are signed and
the signature contains a checksum, so once an RPM file is read into the
machine, its signature can be checked before it is written to disk. Another
safeguard.
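
As an illustration of the first check (verifying a downloaded image against its published MD5), here is a minimal sketch using Python's standard `hashlib`. The function names and the idea of passing the published digest as a string are assumptions for the example, not anything Red Hat ships.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large image never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches(path, expected_hex):
    """Compare a file's MD5 with the published value (case-insensitive)."""
    return file_md5(path) == expected_hex.strip().lower()
```

A matching digest only shows that the file on disk equals what was published at the time it was read; it says nothing about what happens to the data afterwards.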

But if, at the last minute, the disk surface is of marginal quality, the
disk controller could write out something on the disk that was stored
incorrectly. I am unaware of disk controllers that read back every block
after writing it and compare it with what is stored in memory. (My tape
drive does that, but that is another story.) So if the controller writes
something that is recorded on a defective spot on disk, so it reads back
correctly sometimes, but not all the time, too bad.
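
The read-after-write check described above can be approximated in software. The sketch below is an assumption-laden illustration, not what any controller does: it writes a file, asks the OS to flush it with `os.fsync`, then reads it back and compares. The read will usually be served from the page cache, so passing this check does not prove the bits on the disk surface are good.

```python
import os


def write_and_verify(path, data):
    """Write data, flush it to the device, then read it back and compare."""
    with open(path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the device
    with open(path, "rb") as handle:
        return handle.read() == data
```

A stricter test would have to bypass or drop the cache before reading back, which is outside the scope of this sketch.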

I have never noticed problems like this on my new machines, but I am glad I
do not write real-time control software for an Intensive Care Unit in a
hospital. And I imagine the patients are glad too.

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 11:50:01 up 29 days, 17:56, 4 users, load average: 4.27, 4.40, 4.39
Ignoramus29627...
Posted: Fri Sep 05, 2008 9:58 pm
Guest
On 2008-09-05, Jean-David Beyer <jeandavid8 at (no spam) verizon.net> wrote:
Quote:
If it really was a kernel bug, why no other people at least saying "me too"
about this bug?

Check out ubuntuforums, many people are complaining about lockups.

Quote:
My guess is that you do have a hardware problem and the new kernel just
hides it. For example, it may be a memory problem and the problem memory
cells are now used for an error message that is seldom or never produced?

I think that it is the driver for a relatively new RAID card that I
had.

Quote:
If it is something like that, it could come and bite you later, for
example if it is a module that gets paged out, and later gets paged
back somewhere else.

These problems are terribly difficult to find. I am often forced to use
intuition to find them. Luckily my intuition is fairly good, but does not
usually work "on demand."


It is a pain in the ass problem, that I spent a while on.
Robert Riches...
Posted: Sat Sep 06, 2008 12:04 am
Guest
On 2008-09-05, Matt <matt at (no spam) themattfella.xxxyyz.com> wrote:
Quote:
I would expect more problems in 64-bit software because it is less used.

We bought a couple DEC Alpha-based Linux workstations in '95 or '96. I
don't remember all the details ... We were pretty excited to run them,
but our hello, world C programs dumped core every time, or maybe gcc
crashed every time---some gross problem like that. Meanwhile our
Pentium Linux systems worked fine. We didn't even mess with the Alphas
anymore---we just sent them back for a refund. I want to say that they
ran some (poorly) customized Red Hat, but that could be wrong.

A lot of 64-bit Linux systems, Alpha and otherwise, work
just fine. The X86-64 servers I do use to test my software
at work are rock solid. My Alpha home computer from 2000 to
2003 was also rather solid, except for a few buggy graphics
card drivers along the way and an eventual motherboard
failure a month or two out of the three-year warranty.

--
Robert Riches
spamtrap42 at (no spam) verizon.net
(Yes, that is one of my email addresses.)
The Natural Philosopher...
Posted: Sat Sep 06, 2008 1:38 am
Guest
Ignoramus29627 wrote:
Quote:
I think that it is the driver for a relatively new RAID card that I
had.


That has a fairly high probability of being the case: drivers are where
a bug can cause lockups for sure.

Well I hope it turns out to be the case as.....

<snip>

Quote:

It is a pain in the ass problem, that I spent a while on.

Indeed.
 
All times are GMT - 5 Hours