How-To Test Machine Stability
From Unofficial BOINC Wiki
Introduction
This was written by UK_Nick for people who had difficulty running Climateprediction. Climateprediction seems more sensitive to computer stability problems than almost anything else. The tests suggested here should, however, be a good test of stability whatever Projects you are running.
Climateprediction related review
If your computer is having difficulty running Climateprediction then model instability may happen at very frequent intervals, usually soon after commencing the first Phase of each new Climateprediction Model. If so, you should cease running Climateprediction and run some of the hardware stability checks detailed below to determine what is wrong with your machine, then fix it before resuming Climateprediction.
From personal experience (I have no solid figures):
1 crashed model is entirely possible.
2 consecutive crashed models would be too much of a coincidence.
The proportion of crashed models caused by 'bad Parameters' is most likely less than 1 in 25. Eg. I had a series of 25 consecutive completed full runs that was eventually broken by two crashed models, which were almost certainly due to machine errors. I then had another long series of 32 consecutive completed full runs that was only broken when I upgraded to v3.0.0 Beta (THC Slowdown), due to a bug in the first version.
3 crashed models in 5 runs would be beyond suspicious; I would have long since started running tests to find out what was wrong.
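The rules of thumb above can be checked with a little arithmetic. The following is only a rough sketch: it assumes each model run crashes independently, and it takes the "less than 1 in 25" figure as the chance that a healthy machine crashes a model because of bad Parameters. Both are assumptions made for illustration, not measured values, and the names `prob_at_least` and `p_bad` are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Chance of at least k crashes in n independent runs, each with crash chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed upper bound on the 'bad Parameters' crash rate (1 in 25).
p_bad = 1 / 25

# Two crashes in a row on a healthy machine.
two_in_a_row = p_bad**2

# Three or more crashes within five runs on a healthy machine.
three_of_five = prob_at_least(3, 5, p_bad)

print(f"1 crash in a single run:    {p_bad:.2%}")
print(f"2 consecutive crashes:      {two_in_a_row:.2%}")
print(f"3+ crashes in 5 runs:       {three_of_five:.2%}")
```

Under those assumptions a single crash is unremarkable, while two in a row or three in five are each well under 1% likely on a sound machine, which is why they point to hardware rather than bad luck.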
If within Guarantee
If you didn't assemble the machine yourself and it is still inside guarantee then calling the manufacturer is your first option if it fails any of the following tests. Any computer sold should be capable of passing all of these tests entirely free of errors over a reasonable period of time; Eg. 24 hours of Prime95's 'Torture Test'.
Memory sub-system
Memtest86+, available from its download page, is a really good standalone check for your memory sub-system.
(Select "Download - Pre-Compiled package for Floppy (DOS - Win)")
Note: Memtest86 runs outside Windows, directly off a floppy disk, so you cannot use your machine for anything else whilst it is running.
From Memtest86's 'Readme.txt'; To install Memtest86:
- Extract the files from the zip archive
- Open the directory where the files were extracted and click on "install.bat".
- The install program will prompt you for the floppy drive letter (usually 'a') and also prompt you to insert a blank floppy.
- To run Memtest86 leave the floppy in the drive and reboot.
When Memtest86 has loaded, Hit [c], [2], [3] & [Enter] for the full 11 test suite. This will take quite some time on an older machine but you cannot be sure your memory sub-system is error free until Memtest86 has run at least one clean pass of the full 11 test suite.
Memtest86 will loop continuously until you hit [Esc] to exit - remove the floppy disk and Windows will boot as normal.
If you see errors then you can try:
1) Slowing down memory timings in BIOS.
2) Increasing memory Voltage a notch or two - I have a lot of DDR memory that has run at 2.7V (0.2V above spec.) for years without problems - some PC3200 will not even run error free at spec' timings unless it is overVolted slightly.
3) Re-seating memory, perhaps in different slots.
4) Fitting better quality memory.
Note: Check your motherboard handbook and hardware specific websites or Usenet newsgroups for information relating to your particular hardware. Eg. The alt.comp.periphs.mainboard.abit Usenet newsgroup has many folks who can help with any problem regarding Abit motherboards.
CPU
Even an off-the-shelf 'whitebox' may be unstable at specification speed due to manufacturing errors (Eg. an incorrectly seated CPU heatsink) [3] but instability is more often due to enthusiasts pushing their overclock just a little tooooo far. [2]
Tests:
SuperPI, available as a download, is a good initial check for CPU and memory sub-system stability. The full 32M test may take some hours but, if it runs clean, then the CPU is at least reasonably stable. If you see any errors at all then you need to back off on the overclock [2] or do some basic maintenance. [3]
Note: The full 32M test may not run under Win98OE/SE/ME due to memory limitations but the 16M test is adequate.
Prime95, also available as a download, has a good CPU & memory sub-system stability check embedded in it (Prime95 / Options / Torture Test). If you see any error at all whilst running Prime95's 'Torture Test' overnight then there is something wrong with either the memory sub-system or CPU. Most often the CPU is being pushed a little too hard for either your cooling system or the CPU's own architectural limitations. See [2] or [3].
Using Climateprediction (or another project) to test a system after a number of fast 'short runs', BIOS tweaks or a hardware upgrade
I test my own systems thus;
1) Disconnect the machine from the internet so that it cannot send in dud results. [1]
2) Backup the complete C:\Program Files\BOINC folder so that I can revert to the backup if I see any errors at all - Eg. If the world turns to an iceball very quickly & tries to upload results then I would back off the overclock [2] to where it was, revert to the CPDN backup & check to see if it still turns to an iceball.
3) Run self-checking software like Prime95 & SuperPI at the same time as CPDN.
4) Run the CPDN '3D' visualisation for some time to see if it causes a lockup. (3D applications are prone to causing stability problems, as any 'gamer' knows all too well, but these are more usually driver related - updating to the latest drivers is the primary 'fix' if you have problems here.)
Prime95's execution thread priority can be changed using the password found in the Read-Me File - I run Prime95, SuperPI 32M & CPDN's Model.exe all at the same time with 33% of the CPU going to each - when SuperPI 32M completes a clean run, I then balance Prime95 & Model.exe at roughly 50% CPU each overnight. If Prime95 has run clean overnight then I reduce Prime95's priority a notch so that it takes under 10% of the CPU and leave it running for a few days, occasionally allowing CPDN to connect (if Prime95 is still running clean) & taking a new backup of CPDN immediately afterward.
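The backup in step 2) can be scripted so that every test session starts from a known-good copy. This is a minimal sketch, not part of BOINC itself: the function name `backup_folder` and the backup location are hypothetical, and it assumes BOINC has been shut down first so that no files change during the copy.

```python
import shutil
import time
from pathlib import Path


def backup_folder(source, backup_root):
    """Copy `source` into a new timestamped folder under `backup_root`.

    Returns the path of the new backup folder.
    """
    source = Path(source)
    if not source.is_dir():
        raise FileNotFoundError(f"No such folder: {source}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-backup-{stamp}"
    # copytree creates the target (and any missing parents) and
    # raises an error if the target already exists.
    shutil.copytree(source, target)
    return target


# Example call (paths are illustrative; D:\Backups is an assumption):
# backup_folder(r"C:\Program Files\BOINC", r"D:\Backups")
```

Keeping each backup in its own timestamped folder means an earlier good copy is never overwritten by a later one that may already contain a damaged model.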
Hard Disk Drives (HDD)
I always test a new HDD by running Windows XP's Checkdisk 'Scan for and attempt recovery of bad sectors' or Windows 98xx's Scandisk 'Thorough' full surface scan.
WinXP: Running 'checkdisk' - double click 'My Computer' / rightclick your hard drive (C: ) / Properties / Tools tab / Error-checking [Check Now] / tick 'Automatically fix file system errors' & 'Scan for and attempt recovery of bad sectors' / [Start]. Since it's your 'system disk', a box will come up telling you that the check can't be run now and asking if you want to schedule it for the next reboot; click [Yes] and then reboot. During boot the program 'chkdsk' will run, and the machine will automatically reboot itself again when it finishes.
Win98xx: Running 'scandisk' - double click 'My Computer' / rightclick your hard drive (C: ) / Properties / Tools tab / "Error-checking status" [Check Now] / select 'Thorough' & also tick 'Automatically fix file system errors' / [Start].
Warning: WinXP and Win98xx - this test can take a loooong time with the huge size of recent HDDs.
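On WinXP the same surface scan can also be started from a command prompt with `chkdsk` and its `/R` switch, which locates bad sectors and implies the file system fix. The sketch below only builds the command line; the helper name `build_chkdsk_command` is hypothetical, and actually running the command is assumed to need an administrator account. On the system disk it will offer to schedule the check for the next reboot, just as the dialog does.

```python
import string


def build_chkdsk_command(drive_letter):
    """Return the argument list for a chkdsk surface scan of one drive."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"Not a drive letter: {drive_letter!r}")
    # /R locates bad sectors and recovers readable information (implies /F).
    return ["chkdsk", f"{letter}:", "/R"]


# To run it (Windows only, from an administrator account):
# import subprocess
# subprocess.run(build_chkdsk_command("C"))
```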
If the HDD passes this test then it should be good for some years of continuous use as long as it remains reasonably cool in operation, preferably under 40C. HDD errors will show up in a number of ways; Eg. A warning from Windows that XXX file is missing or corrupt during bootup - WinXP can recover by itself from a lot of errors like this but not all.
Conclusion
If your machine has passed that lot then it should be good-to-go - have fun.
Footnotes
[1]
Disconnect a machine from the internet by any of the following:
a) Pull the plug, ie. remove the telephone cable from the modem or the Ethernet cable from your router / hub.
b) Win98OE/SE/ME - Rightclick on 'My Computer' / Properties / Device Manager tab / select 'Network adapters' and then 'Dialup Adapter' if you connect via a modem or 'XXXX Ethernet Adapter' if you connect via Local Area Network (LAN) - check 'Disable in this hardware profile' & [Okay] back to the desktop. (Some setups may require a reboot.)
c) WinXP - Rightclick 'My Network Places' / Properties / Rightclick 'Dialup adapter' or 'Local Area Connection' and select 'Disable' - I then leave this window active whilst testing for stability so that I remember I have the connection disabled.
d) It may be possible to block CPDN from accessing the internet by changing some settings in your firewall software.
Reverse the above to re-connect - Win98 systems may require a reboot. (Check the connection is good using CPDN GUI / Settings / 'Check Central Server Connection'.)
[2]
For Overclockers: Backing off only 2~3MHz on the FSB & memory bus can make a surprising difference to longer term overall stability. You lose less than 100MHz raw CPU speed but you get to see a model complete a full run, so that little loss will be well worthwhile.
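To see why the loss is so small: the CPU clock is the FSB speed times the CPU multiplier, so a few MHz off the FSB costs only a few tens of MHz at the CPU. The figures below (a 200MHz FSB with an 11x multiplier) are purely illustrative assumptions, not a recommendation for any particular chip, and the function names are hypothetical.

```python
def cpu_clock_mhz(fsb_mhz, multiplier):
    """CPU core clock in MHz for a given FSB speed and multiplier."""
    return fsb_mhz * multiplier


def clock_loss_mhz(fsb_mhz, multiplier, backoff_mhz):
    """CPU MHz given up by lowering the FSB by backoff_mhz."""
    before = cpu_clock_mhz(fsb_mhz, multiplier)
    after = cpu_clock_mhz(fsb_mhz - backoff_mhz, multiplier)
    return before - after


# Illustrative (assumed) settings: 200MHz FSB, 11x multiplier, back off 3MHz.
before = cpu_clock_mhz(200, 11)
loss = clock_loss_mhz(fsb_mhz=200, multiplier=11, backoff_mhz=3)
print(f"{before}MHz -> {before - loss}MHz, a loss of {loss}MHz ({loss / before:.1%})")
```

With these example settings the back-off costs 33MHz out of 2200MHz, about 1.5% of raw speed.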
Note: There are literally thousands of overclocking resources on the internet - try a search for 'overclocking' at Google..! Thus I've made no attempt at helping beginners to overclocking here, you'll have to do some work on that one yourselves. One good resource that I used a lot in the past, & fed back into as my experience widened, is the Usenet Newsgroup alt.comp.hardware.overclocking - lots of very knowledgeable folks there but do read through the FAQ first if you desire a polite reply.
[3]
Perhaps the CPU heatsink, other fans &/or filters simply need a good cleanup - see How-To Do Basic Hardware Maintenance.

