Over-Clocking - Unofficial BOINC Wiki

Over-Clocking

From Unofficial BOINC Wiki

General

Over-Clocking is the practice of setting the clock multiplier, CPU operating voltages, memory timings, and other system parameters so that the CPU will run faster than its rated speed. This is generally not a recommended practice.

The by-products of Over-Clocking are higher CPU heat, which may exceed allowed operating limits, and "soft-errors" in instruction processing.

If the clock frequency is increased too far, eventually some component in the system will not be able to cope and the system will stop working. This failure may be continuous (the system never works at the higher frequency) or intermittent (it fails more often but works some of the time) or, in the worst case, irreversible (a component is damaged by overheating). Over-Clocking may necessitate improved cooling to maintain the same level of reliability.

The Floating Point Unit is usually the first section of the CPU to start generating errors. These errors will normally not affect the Operating System, but when you do a heavy math problem and use those results in a new problem then errors become significant. So it may appear to run perfectly normally yet still return bad work. There have even been cases at Climateprediction.net (CPDN) where people had to underclock computers to get them to run without errors.

Oh, and you may void your warranty.

Implications Of Over-Clocking and BOINC Powered Projects

All BOINC Powered Projects are "iterative" in nature. Which means that we do something over, and over, and over, and over, and over ... I hope you get the idea ...

So, next assertion, Floating Point Numbers are inexact representations of numeric values and therefore have a certain amount of error "built-in".
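To make the "built-in" error concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not anything a BOINC project uses here). It shows the value actually stored for 0.1 in a binary double, and how the tiny error grows when the same operation is repeated many times. The function name `repeated_sum` and the iteration count are made up for the example.

```python
from decimal import Decimal
from fractions import Fraction

# The double closest to 0.1 is not exactly 0.1; Decimal shows the stored value.
stored_tenth = Decimal(0.1)
print("stored value of 0.1:", stored_tenth)

# The representation error is tiny but not zero.
tenth_error = Fraction(0.1) - Fraction(1, 10)
print("error in 0.1:", float(tenth_error))

# A familiar consequence of that error.
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)


def repeated_sum(step: float, count: int) -> float:
    """Add `step` to a running total `count` times (an iterative calculation)."""
    total = 0.0
    for _ in range(count):
        total += step
    return total


# Doing the same thing over and over lets the small errors pile up.
drift = repeated_sum(0.1, 1_000_000) - 100_000.0
print("drift after a million additions:", drift)
```

The drift is still small here, but it is many orders of magnitude larger than the error in a single 0.1, which is the point about iterative work.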

In theory, a computer will always return the exact same numbers when a calculation series is repeated. In practice this is not always the case, and it is most common to see divergence at the "end" of the calculated result (the least significant digits). Causes of the FPU not repeating its results can include improper initialization, bias in results because of interaction between other running processes that use the FPU, and so forth.

Next assertion, the design of FPUs, though compliant with IEEE 754 and its later revisions, does not guarantee that the outputs of those FPUs will be identical. This is true even within processor families and steppings of those processors.

With all of this, what we see is chaotic behavior in the calculation of values. Meaning, noise.
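A small sketch of why a difference in the last digit matters in iterative work. It uses the logistic map, a standard textbook example of a chaotic iteration; it is only a stand-in, not a model of any project's Science Application. The names `logistic_gap`, `GROWTH_RATE`, and the starting values are assumptions chosen for the demonstration.

```python
GROWTH_RATE = 3.9  # a parameter value in the chaotic regime of the logistic map


def logistic_gap(start: float, nudge: float, steps: int) -> list[float]:
    """Iterate two copies of x -> r*x*(1-x) that start `nudge` apart.

    Returns the absolute gap between the two copies at every step,
    beginning with the initial gap.
    """
    a, b = start, start + nudge
    gaps = [abs(a - b)]
    for _ in range(steps):
        a = GROWTH_RATE * a * (1.0 - a)
        b = GROWTH_RATE * b * (1.0 - b)
        gaps.append(abs(a - b))
    return gaps


# Two runs that differ only around the 15th decimal place.
gaps = logistic_gap(start=0.3, nudge=1e-15, steps=200)
print("initial gap:", gaps[0])
print("largest gap seen:", max(gaps))
```

The two runs start out indistinguishable and end up unrelated, which is the sense in which a last-digit difference between two FPUs can turn into "noise" after enough iterations.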

The LHC@Home Project, for example, has no plans for a Macintosh-compatible Science Application, not because Macintosh computers are "bad", but simply as a matter of pragmatic consideration because of the numerical consistency required. It is not that one answer is "better"; it is just that they have a better chance of getting comparable results by sticking with one basic architecture and compiler.

Ok, I have more in the Glossary under FPU, Floating Point Numbers, and the like that will give you a better feel for what I am talking about.

Conclusion, (in Paul's opinion) Over-Clocking is bad because the point is the science, and not how many Results we return. Returning more Results with questionable equipment is worse (remember this is Paul's opinion) because of the possibility of error. An over-clocker's assurance that the Results are accurate because of a test begs the question.

If the point is the science, then the accuracy, consistency, reliability, and repeatability of the process are of prime concern. Over-clocking is done for one reason and one reason only, to increase the absolute number of answers. But it does this at the cost of decreasing everything else related to the processing of work. If the thousands of engineers tell you that this processor should run at speed "x", how can I believe that a test of "x" hours of running program "y" is going to do anything to conclusively prove them wrong?

What Does Over-Clocking Do To the Computer

This, like many of the examples in the Unofficial BOINC Wiki, is greatly simplified, but here goes:

Over-Clocking & Transition Detection

The black line is a slightly exaggerated version of a signal transitioning from 0 to 1. The green line represents normal clocking.

A really good waveform is just beyond my graphic abilities, but there should be a little "ringing" at the top of the wave, and the edges aren't really that square, but it'll do.

The slope of the black line will change a bit with voltage and temperature. This is why you need to raise the voltage sometimes when you over-clock -- and one of the reasons overclockers are often obsessed with cooling. Running cold will make the black line steeper and you can move the green line to the left (by increasing the clock) and still "sample" on the top of the waveform.

The green line is where it is because the chip manufacturer has determined that, under virtually all circumstances (specified temperature range, etc.), the signal will be stable, and if you sample the waveform at that point you'll get a good solid reliable "1" or a good solid reliable "0".

The yellow line represents a machine that has been overclocked. This line is most of the way up the slope, and will probably always be a "1" -- sometimes, the rise time will be a little slow and the value may be interpreted as a zero.

The red line is a machine that has been overclocked too much. What should be a "1" will almost always be interpreted as a "0" and the machine likely won't run.

The main point is, as you increase the clock speed, you are trading margins for performance. Margins allow for reliable operation as environmental variables (voltage, temperature) change, and as quantum effects or whatever other randomness might sneak in.
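The green, yellow, and red sampling points can be mimicked with a toy model. Everything in it is an assumption made for illustration: the signal is treated as a simple exponential rise with time constant `tau`, the logic threshold is taken as half of full swing, and the three sample times are invented to land where the description puts the three lines. Real signals and real thresholds are more complicated.

```python
import math

THRESHOLD = 0.5  # hypothetical: fraction of full swing that counts as a "1"


def signal_level(t: float, tau: float) -> float:
    """Level (0..1) of an idealized 0-to-1 transition at time t."""
    return 1.0 - math.exp(-t / tau)


def reads_as_one(sample_time: float, tau: float) -> bool:
    """True if sampling at `sample_time` sees the signal above the threshold."""
    return signal_level(sample_time, tau) >= THRESHOLD


# Hypothetical sample times, in units of the typical rise time constant.
SAMPLE_POINTS = {
    "green (rated clock)": 5.0,
    "yellow (overclocked)": 1.0,
    "red (overclocked too far)": 0.5,
}

for label, sample_time in SAMPLE_POINTS.items():
    typical = reads_as_one(sample_time, tau=1.0)  # normal rise
    slow = reads_as_one(sample_time, tau=1.5)     # rise slowed by heat or low voltage
    print(f"{label}: typical rise reads 1? {typical}; slow rise reads 1? {slow}")
```

In this model the green point reads correctly in both cases, the yellow point reads correctly only when the rise is not slowed, and the red point misreads either way. The difference between green and yellow is the margin.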

Over-Clocking done right isn't necessarily even Over-Clocking: you may be lucky and get a part that performs well at higher clock speeds. If you properly characterize that part you may find that it clocks reliably at 20% over the marked speed -- if you run it at 95% of the fastest reliable speed, you'll get the "free" performance boost and still have enough margin for good results.

If you find the absolute top speed, and then stay there, you may fall off the corner once in a while and have a machine that is mostly reliable.
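Working the numbers for the 20% and 95% figures: the boost that is left after backing off is about 14%. The 2.0 GHz marked speed below is a made-up value; only the two percentages come from the text.

```python
RATED_GHZ = 2.0       # hypothetical marked speed of the part
HEADROOM = 0.20       # part characterized as reliable at 20% over marked speed
SAFETY_FACTOR = 0.95  # run at 95% of the fastest reliable speed

fastest_reliable_ghz = RATED_GHZ * (1 + HEADROOM)
operating_ghz = fastest_reliable_ghz * SAFETY_FACTOR
boost = operating_ghz / RATED_GHZ - 1

print(f"fastest reliable speed: {fastest_reliable_ghz:.2f} GHz")
print(f"operating speed:        {operating_ghz:.2f} GHz")
print(f"boost over marked:      {boost:.1%}")
```

The boost does not depend on the marked speed chosen, since 1.20 x 0.95 = 1.14 whatever the starting point.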


A Square Wave

A more technical look at some of the characteristics of a square waveform with distortions.

(Source: Time Domain Audio Measurements - www.tvhandbook.com/support/pdf_files/audio/Chapter13_4.pdf)

Note:
More information about clocks and what they mean for the Central Processing Unit can be found in the article "How a CPU Works".


In Support of the Practice of Over-Clocking

So, as stated above, the computational demands of each project vary such that Over-Clocking may pose considerably different risks for corruption. The Climateprediction.net (CPDN) project has been noted to be the most computationally intense of the active projects. So, while the prior comments regarding Over-Clocking and Result reliability are true, they are more true for some projects than for others. Thus, just because Over-Clocking is not causing problems with SETI@Home Results does not mean that that same over-clock will not cause invalid Results on other projects.

A Philosophical Discussion of the Implications of Over-Clocking

Let me discuss my perspective, and rather than being "general", I'll tie it specifically to the Rosetta@Home Project, and the redundancy issue.


First, computing is inherently chaotic. That may sound strange when everything is supposedly "binary" and deterministic, but at the levels of speed and size (both physical in the electronics, "very small", and in software complexity, "very large") we're dealing with, random effects make a difference in any system whether over-clocked or not. The very fact that you can run the same benchmarks on the same system ten times in a row and get ten different results shows this. The best we can hope for is some statistical level of "certainty" in a result.


The way we improve "certainty" is scaling, both in the software and in the processes. Integer mathematics is inherently more "stable" than floating point, because there are many numbers that cannot be represented exactly in binary floating point. This is why any good accounting package will deal solely in "cents" and will stick the decimal point in after all the mathematics is complete. Scientific computing doesn't have this luxury, and it is obvious from what we've seen overall in the BOINC Powered Projects that steps are necessary to counter the uncertainties involved; thus both general Redundancy and Homogeneous Redundancy, to counter differences created by running even the same software on different platforms. The experience of the optimizers of the SETI@Home Science Application points this out as well; changing a compiler switch can make the difference between the output being "strongly similar" or "weakly similar". Even the existence of the terms points out that no two systems can be guaranteed to produce the exact same result. From this perspective, over-clocking simply adds one more variable to the equation; but it's not really a new variable, it's just a change to the value of an existing variable, and that is the stability of the CPU/bus/RAM "system".
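The accounting trick of working in "cents" can be shown in a few lines. This is a sketch only; the helper `format_cents` is a hypothetical name, not part of any accounting package.

```python
# Ten payments of ten cents each, held as floating point dollars.
float_total = sum([0.10] * 10)
print("float dollars equal 1.0?", float_total == 1.0)

# The same payments held as integer cents: integer addition is exact.
cents_total = sum([10] * 10)
print("integer cents equal 100?", cents_total == 100)


def format_cents(cents: int) -> str:
    """Insert the decimal point only after all the arithmetic is done."""
    return f"{cents // 100}.{cents % 100:02d}"


print("total:", format_cents(cents_total))
```

The floating point total misses 1.0 by one unit in the last place, while the integer total is exact. Scientific codes cannot usually rescale their values this way, which is why they rely on redundancy instead.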


We don't know the value of this "stability" variable. We can attempt to measure it, by running diagnostics like Prime95, and RAM tests, and so on. When we reach a "failure point" with these diagnostics, we know the system is unstable at that speed. This could be at some extreme over-clock level, or it could be at "stock" speeds - there are too many physical variables, too many "parts" involved in a modern computer for it to be any other way. Below this unstable point, we cannot say that the system is "stable", we can only say that it is "less unstable". If my PC shows no visible errors during testing at 2.5GHz, and your PC shows no visible errors during testing at 2GHz, we have no way of knowing which is more likely to insert a minor error into a calculation tomorrow.


This is where the robustness and design of the software has to take over. The "easy way out" is to have redundancy. Have three or four or twenty-seven computers do the same calculation, and accept the "majority vote" as being correct. (That's a whole other topic, that I won't get into...) With this approach, the software "system" itself is irrelevant; you're basing your ability to certify the accuracy of the results on the redundancy. Rosetta's system, however, is different enough from, say, SETI's, that redundancy simply is not necessary to be able to certify the accuracy. The Rosetta@Home Project is not analyzing a set of data off of some radio telescope tape. They are running an algorithm thousands of times with a set of random inputs, in an attempt to locate the "best" input values. The output is not a "yes/no" "this signal exists" result, it is a statistical plot of "this input graphs to be here, this input graphs to be there". Effectively, every WU issued in a series is part of one giant "WU", and we want every returned value to be different, because each is based on a different random number input.
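The "majority vote" idea can be sketched as follows. This is not the BOINC validator and makes no claim about how any project actually compares results; the function name `canonical_result`, the quorum of 2, and the tolerance are all assumptions for illustration.

```python
import math
from typing import Optional, Sequence


def canonical_result(
    results: Sequence[float],
    min_quorum: int = 2,
    rel_tol: float = 1e-9,
) -> Optional[float]:
    """Return the value a strict majority of hosts agree on, or None.

    Two results "agree" if they match within `rel_tol`, since different
    machines may legitimately differ in the last few digits.
    """
    best_group: list[float] = []
    for candidate in results:
        group = [r for r in results if math.isclose(r, candidate, rel_tol=rel_tol)]
        if len(group) > len(best_group):
            best_group = group
    has_quorum = len(best_group) >= min_quorum
    is_majority = len(best_group) > len(results) / 2
    if has_quorum and is_majority:
        return sorted(best_group)[len(best_group) // 2]  # middle member of the group
    return None


# Three hosts, one of which returned a slightly wrong value.
print(canonical_result([3.14159265, 3.14159265, 3.14159300]))
# Three hosts that all disagree: nothing can be certified.
print(canonical_result([1.0, 2.0, 3.0]))
```

Note that the vote says nothing about why the odd host disagreed; an over-clocked machine and a machine with bad RAM look the same from here.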


It is not necessary to test every single random number in the input range to have a useful outcome. Once a few thousand have been tested, the ones clustering at the "lower left" are known to be the most "likely candidates". From those, the project can either re-run the algorithm with a tighter "range" of random input values, or move on to some other method of analysis.


Some input values will simply not be tested, just by the nature of a random number generator. Some input values that would have been tested will not be returned by the hosts assigned to test them. Some input values that are tested will be returned with the "wrong" result by the host assigned, for whatever reason. If that "wrong" result is "high and right" instead of "low and left", then it won't be part of the valuable set of results, and effectively it might as well not have been returned at all. Which is okay, because the system doesn't need every value to be tested. This leaves two possible concerns for an over-clocked system (or any other system that might not return a "perfect" answer).


If there are enough systems returning "wrong" answers to be statistically significant, so the graphs no longer show the expected clustering, but instead are returning effectively random results, then something is seriously wrong, either with the programs being used, or with our entire computer industry.


If a particular system returns a result that is THE lowest-left value for that run, then if no further checking was done, it would be a significant problem for that result to be wrong. The project has this covered. They re-run the algorithm with that particular input value, on their own systems, and see if they get the same result. If not, they throw it out and pick another. Thus, returning a wrong result that happens to be in "just the right spot", might cost the project some computer time, but will not affect the accuracy of the project's outcome. And if a Result is "wrong", the chance that it would be "wrong" in just such a way that it would be "the answer", is vanishingly small.
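The re-check of the best candidate might look something like this sketch. It is a guess at the shape of the process described, not the project's actual procedure: a single "score" stands in for the position on the plot, and `toy_score`, `pick_verified_best`, and the sample data are all hypothetical.

```python
from typing import Callable, Optional


def toy_score(seed: int) -> float:
    """Stand-in for the project's trusted re-run of the algorithm (lower is better)."""
    return (seed - 7) ** 2 / 10.0


def pick_verified_best(
    reported: dict[int, float],
    trusted_score: Callable[[int], float],
    tol: float = 1e-6,
) -> Optional[tuple[int, float]]:
    """Walk the reported results from best to worst and return the first
    one that the trusted re-run reproduces; throw out the ones it does not."""
    for seed, score in sorted(reported.items(), key=lambda item: item[1]):
        if abs(trusted_score(seed) - score) <= tol:
            return seed, score
    return None


# Results returned by volunteer hosts for a handful of random inputs.
reported = {seed: toy_score(seed) for seed in (1, 3, 7, 12, 20)}
reported[3] = -5.0  # one host returned a wrong value that happens to look "best"

print(pick_verified_best(reported, toy_score))
```

The bad result for input 3 costs one extra re-run and is then discarded; the answer that comes out is the verified one for input 7.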


So. If your computer, over-clocked or not, is "acceptably functional" (i.e., it doesn't crash, doesn't fail Prime95, etc.), then its Results running the Rosetta@Home Project are just as likely to be useful as those from any other system that is equally functional. Might over-clocking cause you to return a "wrong" result twice out of a thousand instead of once in a thousand? Sure. But the value of returning a thousand results, 998 of which are "perfect", instead of returning 800 results, 799 of which are "perfect", in the same time period, is greater than the "cost". The project has (MUST have) the necessary systems in place to deal with a small number of incorrect results.
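The trade in that example, worked through. The counts are the ones given in the paragraph; the function name `summarize` is just for the sketch.

```python
def summarize(returned: int, wrong: int) -> tuple[int, float]:
    """Return (number of good results, fraction of results that are wrong)."""
    return returned - wrong, wrong / returned


stock_good, stock_error_rate = summarize(returned=800, wrong=1)
oc_good, oc_error_rate = summarize(returned=1000, wrong=2)

extra_good = oc_good - stock_good
extra_bad = 2 - 1

print(f"stock:        {stock_good} good, error rate {stock_error_rate:.3%}")
print(f"over-clocked: {oc_good} good, error rate {oc_error_rate:.3%}")
print(f"difference:   {extra_good} more good results for {extra_bad} more bad one")
```

So in this example the over-clocked machine buys 199 additional good results at the price of one additional bad one, with the error rate going from 0.125% to 0.2%. Whether that price is acceptable depends on the project being able to catch the bad ones.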


If your PC is unstable enough to return a large number of incorrect results, or even any more than a very small percentage of incorrect results, then it is likely to be unstable enough to cause "errors" in the processing, instead of successfully running but giving the wrong output value.


(A comment: one way to test the stability of your system is to run one of the other BOINC Powered Projects that does use redundancy. If you have more than one "successful but invalid" results in a given month, then you know you are past the "unstable" point. I recommend this in addition to periodic "local" tests such as Prime95, RAM check, and so forth.)

