Because of a documented rounding issue in Delphi XE2, we are using a special rounding unit available on the Embarcadero site, DecimalRounding_JH1, to achieve true banker's rounding.
Using this unit's DecimalRound function with numbers containing a large number of decimal places, we get a floating point error.
This is the rounding routine from the DecimalRounding_JH1 unit. In our example we call DecimalRound with the parameters (166426800, 12, MaxRelErrDbl, drHalfEven), where MaxRelErrDbl = 2.2204460493e-16 * 1.234375 * 2:
Function DecimalRound(Value: extended; NDFD: integer; MaxRelErr: double;
Ctrl: tDecimalRoundingCtrl = drHalfEven): extended;
{ The DecimalRounding function is for doing the best possible job of rounding
floating binary point numbers to the specified (NDFD) number of decimal
fraction digits. MaxRelErr is the maximum relative error that will be allowed
when determining when to apply the rounding rule. }
var i64, j64: Int64; k: integer; m, ScaledVal, ScaledErr: extended;
begin
If IsNaN(Value) or (Ctrl = drNone)
then begin Result := Value; EXIT end;
Assert(MaxRelErr > 0,
'MaxRelErr param in call to DecimalRound() must be greater than zero.');
{ Compute 10^NDFD and scale the Value and MaxError: }
m := 1; For k := 1 to abs(NDFD) do m := m*10;
If NDFD >= 0
then begin
ScaledVal := Value * m;
ScaledErr := abs(MaxRelErr*Value) * m;
end
else begin
ScaledVal := Value / m;
ScaledErr := abs(MaxRelErr*Value) / m;
end;
{ Do the different basic types separately: }
Case Ctrl of
drHalfEven: begin
i64 := round((ScaledVal - ScaledErr)); { <-- the floating point error is raised here }
The last line is where we get a floating point error.
Any thoughts on why this error is occurring?
If you get an exception, that means your value cannot be represented as a double within the specified error range.
In other words, MaxRelErrDbl is too small.
Try MaxRelErrDbl = 0.0000000001 or something similar to test whether I am right.
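Before adjusting MaxRelErr, it may also be worth checking whether the scaled value can fit in an Int64 at all, since Round returns an Int64. The sketch below is a standalone console program, not part of the DecimalRounding_JH1 unit, and its program and variable names are illustrative. It repeats the scaling step for the parameters from the question and compares the result with High(Int64).

```pascal
program ScaledRangeCheck;
{$APPTYPE CONSOLE}

var
  ScaledVal: Extended;
  K: Integer;
begin
  { Same scaling as in DecimalRound: Value * 10^NDFD }
  { with Value = 166426800 and NDFD = 12 (taken from the question). }
  ScaledVal := 166426800;
  for K := 1 to 12 do
    ScaledVal := ScaledVal * 10;

  Writeln('Scaled value: ', ScaledVal);
  Writeln('High(Int64):  ', High(Int64));

  { Round() returns an Int64, so a value beyond this range cannot be converted. }
  if ScaledVal > High(Int64) then
    Writeln('The scaled value exceeds the Int64 range.')
  else
    Writeln('The scaled value fits in an Int64.');
end.
```

With these inputs the product is about 1.66 × 10^20, while High(Int64) is about 9.22 × 10^18, which suggests that the Round call could fail regardless of the value chosen for MaxRelErr.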
I need to implement Seidel method in Pascal. I tried this code but it gives the wrong answer. I don't understand what the mistake is. This is what the procedure for finding roots looks like:
procedure Seidel(n: Integer; var x: vector; a: matrix; e: Real);
var k, i, j, z: integer;
s: Real;
begin
for k := 1 to 100 do
begin
z := k;
for i := 1 to n do
begin
s := a[i, n + 1];
for j := 1 to n do s := s - a[i, j] * x[j];
s := s / a[i, i];
x[i] := x[i] + s;
if abs(s) > e then z := 0
end;
if z <> 0 then Break;
end;
end;
Procedure for variable 'a'
procedure ReadA;
var i, j: integer;
begin
for i := 1 to m do
for j := 1 to m + 1 do
a[i, j] := StrToFloat(Form1.StringGrid1.Cells[j, i])
end;
This is what the StringGrid looks like:
"Корни Х" - "Roots X"
When you click on the "Расчёт" (calculate) button, the answers are different, and after repeated clicking, the "Floating point overflow" error appears.
The mistakes are
using no comments
using more than 2 single-letter variable names
using anti-patterns: a counting loop (for loop) should be used only if you can predict the exact number of iterations. Break does/should not belong to your standard repertoire; I even consider it a variant of spaghetti code. There are very few exceptions to this rule, but here it's better to stick to a conditional loop (while … do or repeat … until).
omitting begin … end frames (for branches and loops) during development, when your program evidently is not finished yet
To be fair, the Seidel method can be confusing. On the other hand, Pascal is, provided a sufficient language proficiency, pretty well-suited for such a task.
I actually had to program that task myself in order to possibly understand why your procedure does not produce the right result. The following program uses some Extended Pascal (ISO 10206) features like schemata and type inquiries. You will need an EP-compliant compiler for that, such as the GPC (GNU Pascal Compiler). AFAIK, Delphi does not support those features, but it should be an easy task to resolve any deficiencies.
Considering all aforementioned “mistakes” you arrive at the following solution.
program seidel(output);
type
naturalNumber = 1..maxInt value 1;
All naturalNumber values below are initialized with 1 unless otherwise specified. This is an EP extension.
linearSystem(
coefficientCount: naturalNumber;
equationCount: naturalNumber
) = record
coefficient: array[1..equationCount, 1..coefficientCount] of real;
result: array[1..coefficientCount] of real;
solution: array[1..equationCount] of real;
end;
Of course you may structure that data type differently depending on your main usage scenario.
{
Approximates the solution of the passed linearSystem
using the Gauss-Seidel method.
system.solution should contain an estimate of the/a solution.
}
procedure approximateSolution(var system: linearSystem);
{ Returns `true` if no element along the main diagonal is zero. }
{ NB: There is a chance of false negatives. }
function mainDiagonalNonZero: Boolean;
var
product: real value 1.0;
n: naturalNumber;
begin
{ Take the product of all elements along the main diagonal. }
{ If any element is zero, the entire product is zero. }
for n := 1 to system.coefficientCount do
begin
product := product * system.coefficient[n, n];
end;
mainDiagonalNonZero := product <> 0.0;
end;
This function mainDiagonalNonZero serves as a reminder that you can “nest” routines in routines. Although it is only called once below, it cleans up your source code a bit if you structure units of code like that.
type
{ This is more readable than using plain integer values. }
relativeOrder = (previous, next);
var
approximation: array[relativeOrder] of type of system.solution;
Note, that approximation is declared in front of getNextApproximationResidual, so both this function and the main block of approximateSolution can access the same vectors.
{ Calculates the next approximation vector. }
function getNextApproximationResidual: real;
var
{ used for both, identifying the equation and a coefficient }
n: naturalNumber;
{ used for identifying one term, i.e. coefficient × solution }
term: 0..maxInt;
{ denotes a current error of this new/next approximation }
residual: real;
{ denotes the largest error }
residualMaximum: real value 0.0;
{ for simplicity, you could use `approximation[next, n]` instead }
sum: real;
begin
for n := 1 to system.equationCount do
begin
sum := 0.0;
for term := 1 to n - 1 do
begin
sum := sum + system.coefficient[n, term] * approximation[next, term];
end;
{ term = n is skipped, because that's what we're calculating }
for term := n + 1 to system.equationCount do
begin
sum := sum + system.coefficient[n, term] * approximation[previous, term];
end;
Here it becomes apparent, that your implementation does not contain two for loops. It does not iterate over all terms.
sum := system.result[n] - sum;
{ everything times the reciprocal of coefficient[n, n] }
approximation[next, n] := sum / system.coefficient[n, n];
{ finally, check for larger error }
residual := abs(approximation[next, n] - approximation[previous, n]);
if residual > residualMaximum then
begin
residualMaximum := residual;
end;
end;
getNextApproximationResidual := residualMaximum;
end;
I have outsourced this function getNextApproximationResidual so I could write a nicer abort condition in the loop below.
const
{ Perform at most this many approximations before giving up. }
limit = 1337;
{ If the approximation improved less than this value, }
{ we consider the approximation satisfactory enough. }
errorThreshold = 8 * epsReal;
var
iteration: naturalNumber;
begin
if system.coefficientCount <> system.equationCount then
begin
writeLn('Error: Gauss-Seidel method only works ',
'on a _square_ system of linear equations.');
halt;
end;
{ Values in the main diagonal later appear as divisors, }
{ that means they must be non-zero. }
if not mainDiagonalNonZero then
begin
writeLn('Error: supplied linear system contains ',
'at least one zero along main diagonal.');
halt;
end;
Do not trust user input. Before we calculate anything, ensure the system meets some basic requirements. halt (without any parameters) is an EP extension. Some compilers’ halt also accept an integer parameter to communicate the error condition to the OS.
{ Take system.solution as a first approximation. }
approximation[next] := system.solution;
repeat
begin
iteration := iteration + 1;
{ approximation[next] is overwritten by `getNextApproximationResidual` }
approximation[previous] := approximation[next];
end
until (getNextApproximationResidual < errorThreshold) or_else (iteration >= limit);
The or_else operator is an EP extension. It explicitly denotes “lazy/short-cut evaluation”. Here it wasn’t necessary, but I like it nevertheless.
{ Emit a warning if the previous loop terminated }
{ because of reaching the maximum number of iterations. }
if iteration >= limit then
begin
writeLn('Note: Maximum number of iterations reached. ',
'Approximation may be significantly off, ',
'or it does not converge.');
end;
{ Finally copy back our best approximation. }
system.solution := approximation[next];
end;
I used the following for testing purposes. protected (EP) corresponds to const in Delphi (I guess).
{ Suitable for printing a small linear system. }
procedure print(protected system: linearSystem);
const
totalWidth = 8;
fractionWidth = 3;
times = ' × ';
plus = ' + ';
var
equation, term: naturalNumber;
begin
for equation := 1 to system.equationCount do
begin
write(system.coefficient[equation, 1]:totalWidth:fractionWidth,
times,
system.solution[1]:totalWidth:fractionWidth);
for term := 2 to system.coefficientCount do
begin
write(plus,
system.coefficient[equation, term]:totalWidth:fractionWidth,
times,
system.solution[term]:totalWidth:fractionWidth);
end;
writeLn('⩰ ':8, system.result[equation]:totalWidth:fractionWidth);
end;
end;
The following example system of linear equations was taken from Wikipedia, so I “knew” the correct result:
{ === MAIN ============================================================= }
var
example: linearSystem(2, 2);
begin
with example do
begin
{ first equation }
coefficient[1, 1] := 16.0;
coefficient[1, 2] := 3.0;
result[1] := 11.0;
{ second equation }
coefficient[2, 1] := 7.0;
coefficient[2, 2] := -11.0;
result[2] := 13.0;
{ used as an estimate }
solution[1] := 1.0;
solution[2] := 1.0;
end;
approximateSolution(example);
print(example);
end.
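For readers without an Extended Pascal compiler, here is a minimal sketch of the same iteration in plain Object Pascal, assuming fixed-size arrays instead of schemata. The program name, the type names and the constants N, MaxIterations and Tolerance are illustrative and not taken from the program above. It uses the same 2 × 2 example system and a repeat … until loop instead of for with Break.

```pascal
program GaussSeidelSketch;
{$APPTYPE CONSOLE}

const
  N = 2;                 { size of the square system (illustrative) }
  MaxIterations = 1000;  { give up after this many sweeps }
  Tolerance = 1.0E-12;   { stop once no component changes more than this }

type
  TVector = array[1..N] of Double;
  TMatrix = array[1..N, 1..N] of Double;

const
  { Same example system as above: 16x + 3y = 11, 7x - 11y = 13 }
  A: TMatrix = ((16.0, 3.0), (7.0, -11.0));
  B: TVector = (11.0, 13.0);

var
  X: TVector;
  Row, Col, Iteration: Integer;
  Sum, NewValue, Change, LargestChange: Double;
begin
  { Initial estimate }
  X[1] := 1.0;
  X[2] := 1.0;

  Iteration := 0;
  repeat
    Inc(Iteration);
    LargestChange := 0.0;
    for Row := 1 to N do
    begin
      { Right-hand side minus all off-diagonal terms. X already holds }
      { the updated values for the rows processed in this sweep. }
      Sum := B[Row];
      for Col := 1 to N do
        if Col <> Row then
          Sum := Sum - A[Row, Col] * X[Col];
      NewValue := Sum / A[Row, Row];

      Change := Abs(NewValue - X[Row]);
      if Change > LargestChange then
        LargestChange := Change;
      X[Row] := NewValue;
    end;
  until (LargestChange < Tolerance) or (Iteration >= MaxIterations);

  Writeln('x1 = ', X[1]:0:4, ', x2 = ', X[2]:0:4,
          ' after ', Iteration, ' iterations');
end.
```

The exact solution of this system is x1 = 160/197 ≈ 0.8122 and x2 = −131/197 ≈ −0.6650, so the printed values should be close to those. The sketch omits the zero-diagonal check shown above; a real routine should keep it.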
I am using this library for big integers in Pascal, but I am having trouble using the modulo function. Can anyone help?
My code:
a = b modulo(c);
here is the library location: http://www.delphiforfun.org/programs/library/big_integers.htm
{ ***************** Modulo ************* }
procedure TInteger.Modulo(const I2: TInteger);
{ Modulo (remainder after division) - by TInteger }
var
k: int64;
imod3: TInteger;
begin
if high(I2.fDigits) = 0 then begin
divmodsmall(I2.Sign * I2.fDigits[0], k);
assignsmall(k);
end
else
begin
imod3:= GetNextScratchPad;
DivideRem(I2, imod3);
Assign(imod3);
ReleaseScratchPad(imod3);
end;
end;
Why does this not work? Also, why doesn't the following work?
var
P, Q, N, E, D,i: TInteger;
Eing, Cout: TInteger;
begin
E := 3;
D := 27;
N := 55;
writeln(N.Modulo(E));
The source code that you downloaded comes with an example of how to use the modulo function. I urge you to take time to read the example code that comes with a library. If you would do so then you'd be able to solve far more problems by yourself. The example code looks like this:
procedure Tbigints.ModBtnClick(Sender: TObject);
var
i1,i2,i3:Tinteger;
begin
i1:=TInteger.create(0);
i2:=TInteger.create(0);
Getxy(i1,i2);
i1.modulo(i2);
memo1.text:=i1.converttoDecimalString(true);
i1.free;
i2.free;
alloclbl.caption:=format('Allocated memory: %d',[allocmemsize]);
end;
The key point is that the modulo method acts in place. In the code above, the dividend is held in i1 and the divisor in i2. Then you call modulo on i1 passing i2 as the argument. The result of the operation is then placed in i1. So, this method replaces the dividend with the modulus of the division.
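Applied to the numbers from the question, the in-place behaviour could look like the sketch below. The unit name UBigIntsV3 is an assumption; use whatever unit name your download of the library contains. Create with an integer argument and ConvertToDecimalString are used as in the library's sample above, and Assign with a TInteger argument is used as it appears inside the Modulo source. The program name and the Remainder variable are illustrative.

```pascal
program ModuloInPlaceSketch;
{$APPTYPE CONSOLE}

uses
  UBigIntsV3; { ASSUMPTION: replace with the unit name shipped in your copy }

var
  N, E, Remainder: TInteger;
begin
  { TInteger is a class, so instances must be created, not assigned literals. }
  N := TInteger.Create(55);
  E := TInteger.Create(3);
  Remainder := TInteger.Create(0);
  try
    Remainder.Assign(N);  { work on a copy so that N keeps its value }
    Remainder.Modulo(E);  { Remainder is replaced by N mod E }
    Writeln(Remainder.ConvertToDecimalString(True));
  finally
    Remainder.Free;
    E.Free;
    N.Free;
  end;
end.
```

Since Modulo is a procedure that modifies its object, it cannot be used inside an expression such as writeln(N.Modulo(E)); the result has to be read from the object afterwards. Mathematically, 55 mod 3 is 1.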
Take the following record:
TVector2D = record
public
class operator Equal(const V1, V2: TVector2D): Boolean;
class operator Multiply(const D: Accuracy; const V: TVector2D): TVector2D;
class operator Divide(const V: TVector2D; const D: Accuracy): TVector2D;
class function New(const x, y: Accuracy): TVector2D; static;
function Magnitude: Accuracy;
function Normalised: TVector2D;
public
x, y: Accuracy;
end;
With the methods defined as:
class operator TVector2D.Equal(const V1, V2: TVector2D): Boolean;
var
A, B: Boolean;
begin
Result := (V1.x = V2.x) and (V1.y = V2.y);
end;
class operator TVector2D.Multiply(const D: Accuracy; const V: TVector2D): TVector2D;
begin
Result.x := D*V.x;
Result.y := D*V.y;
end;
class operator TVector2D.Divide(const V: TVector2D; const D: Accuracy): TVector2D;
begin
Result := (1.0/D)*V;
end;
class function TVector2D.New(const x, y: Accuracy): TVector2D;
begin
Result.x := x;
Result.y := y;
end;
function TVector2D.Magnitude;
begin
RESULT := Sqrt(x*x + y*y);
end;
function TVector2D.Normalised: TVector2D;
begin
Result := Self/Magnitude;
end;
and a constant:
const
jHat2D : TVector2D = (x: 0; y: 1);
I would expect the Boolean value of (jHat2D = TVector2D.New(0,0.707).Normalised) to be True. Yet it comes out as False.
In the debugger TVector2D.New(0,0.707).Normalised.y shows as exactly 1.
It cannot be the case that this is exactly 1, otherwise the Boolean value of (jHat2D = TVector2D.New(0,0.707).Normalised) would be True.
Any ideas?
Edit
Accuracy is a Type defined as: Accuracy = Double
Assuming that Accuracy is a synonym for a Double type, this is a bug in the visualization of floating point values by the debugger. Due to the inherent problems with internal representation of floating points, v1.Y and v2.Y have very slightly different values, though both approximate to 1.
Add watches for v1.y and v2.y. Ensure that these watch values are configured to represent as "Floating Point" values with Digits set to 18 for maximum detail.
At your breakpoint you will see that:
v1.y = 1
v2.y = 0.999999999999999889
(whosrdaddy provided the above short version in the comments on the question, but I am retaining the long form of my investigation - see below the line after Conclusion - as it may prove useful in other, similar circumstances as well as being of potential interest)
Conclusion
Whilst the debugger visualizations are strictly speaking incorrect (or at best misleading), they are nevertheless very nearly correct. :)
The question then is whether you require strict accuracy or accuracy to within a certain tolerance. If the latter then you can adopt the use of SameValue() with an EPSILON defined suitable to the degree of accuracy you require.
Otherwise you must accept that when debugging your code you cannot rely on the debugger to represent the values involved in your debugging to the degree of accuracy relied on in the code itself.
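A minimal sketch of the tolerance-based comparison follows. The value 1.0E-12 for Tolerance is an arbitrary assumption chosen for illustration, and the program and variable names are not from the original code. The second value is built from the bit pattern $3FEFFFFFFFFFFFFF, the largest Double below 1, so the example does not depend on how the normalisation happens to round.

```pascal
program SameValueSketch;
{$APPTYPE CONSOLE}

uses
  Math;

const
  Tolerance = 1.0E-12; { ASSUMPTION: pick a value suited to your data }

var
  Expected: Double;
  Computed: Double;
  ComputedBits: Int64 absolute Computed; { overlays the same 8 bytes }
begin
  Expected := 1.0;
  ComputedBits := $3FEFFFFFFFFFFFFF; { the largest Double below 1.0 }

  { Exact comparison: the two values differ in the last bit. }
  Writeln('Exact equality: ', Expected = Computed);

  { Comparison within a tolerance, using SameValue from the Math unit. }
  Writeln('SameValue:      ', SameValue(Expected, Computed, Tolerance));
end.
```

The first line should report FALSE and the second TRUE, because the two values differ by roughly 1.1 × 10^-16, which is far below the assumed tolerance.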
Option: Customise the Debug Visualization Itself
Alternatively you may wish to investigate creating a custom debug visualisation for your TVector2D type to represent your x/y values to the accuracy employed in your code.
For such a visualization, rather than using FloatToStr(), use Format() with a %f format specifier and a suitable number of decimal places. For example, the call below yields the same result as watching the variable as described above:
Format('%.18f', [v2.y]);
// Yields 0.999999999999999889
Long Version of Original Investigation
I modified the Equal operator to allow me to inspect the internal representation of the two values v1.y and v2.y:
type
PAccuracy = ^Accuracy;
class operator TVector2D.Equal(const V1, V2: TVector2D): Boolean;
var
A, B: Boolean;
ay, by: PAccuracy;
begin
ay := @V1.y;
by := @V2.y;
A := (V1.x = V2.x);
B := (V1.y = V2.y);
result := A and B;
end;
By setting watches in the debugger to provide a Memory Dump of ay^ and by^ we see that the two values are represented internally very differently:
v1.y : $3f f0 00 00 00 00 00 00
v2.y : $3f ef ff ff ff ff ff ff
NOTE: Byte order is reversed in the watch value results, as compared to the actual values above, due to the Little Endian nature of Intel.
We can then test the hypothesis by passing Doubles with these internal representations into FloatToStr():
var
a: Double;
b: Double;
ai: Int64 absolute a;
bi: Int64 absolute b;
s: string;
begin
ai := $3ff0000000000000;
bi := $3fefffffffffffff;
s := FloatToStr(a) + ' = ' + FloatToStr(b);
// Yields 's' = '1 = 1';
end;
We can conclude therefore that the evaluation of B is correct. v1.y and v2.y are different. The representation of the Double values by the debugger is incorrect (or at best misleading).
By changing the expression for B to use SameValue() we can determine the deviation between the values involved:
uses
Math;
const
EPSILON = 0.1;
B := SameValue(V1.y, V2.y, EPSILON);
By progressively reducing the value of EPSILON we find that v1.y and v2.y differ by an amount less than 0.000000000000001 since:
EPSILON = 0.000000000000001; // Yields B = TRUE
EPSILON = 0.0000000000000001; // Yields B = FALSE
Your problem stems from the fact that the two floating point values are not exactly equal and that the Debug Inspector rounds floating point values. To see the real value, you need to add a watch and specify floating point as the visualizer:
Using the memory dump visualizer also reveals the difference between the 2 values:
I have selected columns from a database table and want this data with two decimal places only. I have:
SQL.Strings = ('select '#9'my_index '#9'his_index,'...
What is that #9?
How can I deal with the data I selected to make it only keep two decimal places?
I am very new to Delphi.
#9 is the character with code 9, TAB.
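As a small illustration of that syntax, a #nn character code can be written directly next to quoted text inside a string literal, with no + operator. The program and constant names below are made up for this sketch.

```pascal
program TabLiteralSketch;
{$APPTYPE CONSOLE}

const
  { 'text'#9'text' embeds a TAB character between the quoted parts. }
  Query = 'select'#9'my_index'#9'his_index';
begin
  { The literal is the same string as an explicit concatenation with Chr(9). }
  Writeln(Query = 'select' + Chr(9) + 'my_index' + Chr(9) + 'his_index');

  { 'a', TAB and 'b' make three characters. }
  Writeln(Length('a'#9'b'));
end.
```

The first Writeln should print TRUE and the second 3.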
If you want to convert a floating point value to a string with 2 decimal places you use one of the formatting functions, e.g. Format():
var
d: Double;
s: string;
...
d := Sqrt(2.0);
s := Format('%.2f', [d]);
function Round2(aValue:double):double;
begin
Round2:=Round(aValue*100)/100;
end;
#9 is the tab character.
If f is a floating-point variable, you can do FormatFloat('#.##', f) to obtain a string representation of f with no more than 2 decimals.
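For N places behind the separator use
function round_n(f: double; n: nativeint): double;
var i, m: nativeint;
begin
{ m = 10^n; starting from 1 also handles n = 0 correctly }
m := 1;
for i := 1 to n do
m := m * 10;
{ scale up, round to the nearest integer, scale back down }
f := f * m;
f := round(f);
result := f / m;
end;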
For Float to Float (with 2 decimal places, say) rounding check this from documentation. Gives sufficient examples too. It uses banker's rounding.
x := RoundTo(1.235, -2); //gives 1.24
Note that there is a difference between formatting a value as a string with two decimal places (as Format() does), rounding to an integer, and rounding to a float.
Nowadays the SysUtils unit contains the solution:
System.SysUtils.FloatToStrF( singleValue, ffFixed, 7, 2 );
System.SysUtils.FloatToStrF( doubleValue, ffFixed, 15, 2 );
You can pass an additional TFormatSettings parameter if the required decimal/thousand separators differ from the current system locale settings.
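A minimal sketch of the TFormatSettings variant, assuming a Delphi XE or later (or Free Pascal) SysUtils in which the global FormatSettings variable is available. The program name, the variable names and the sample value are illustrative.

```pascal
program FixedFormatSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Settings: TFormatSettings;
  Value: Double;
begin
  { Start from the current locale settings and override the separator, }
  { so the output does not depend on the machine's regional settings. }
  Settings := FormatSettings;
  Settings.DecimalSeparator := '.';

  Value := 1234.5678;

  { Arguments: value, format, precision (significant digits), decimals, settings }
  Writeln(FloatToStrF(Value, ffFixed, 15, 2, Settings));
end.
```

This should print 1234.57 regardless of the system locale, because the decimal separator is set explicitly and ffFixed does not insert thousand separators.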
The internal float format routines only work with simple numbers > 1
You need to do something more complicated for a general purpose decimal place limiter that works correctly on both fixed point and values < 1 with scientific notation.
I use this routine
function TForm1.Flt2str(Avalue:double; ADigits:integer):string;
var v:double; p:integer; e:string;
begin
if abs(Avalue)<1 then
begin
result:=floatTostr(Avalue);
p:=pos('E',result);
if p>0 then
begin
e:=copy(result,p,length(result));
setlength(result,p-1);
v:=RoundTo(StrToFloat(result),-Adigits);
result:=FloatToStr(v)+e;
end else
result:=FloatToStr(RoundTo(Avalue,-Adigits));
end
else
result:=FloatToStr(RoundTo(Avalue,-Adigits));
end;
So, with digits=2, 1.2349 rounds to 1.23 and 1.2349E-17 rounds to 1.23E-17
This worked for me :
Function RoundingUserDefineDecaimalPart(FloatNum: Double; NoOfDecPart: integer): Double;
Var
ls_FloatNumber: String;
Begin
ls_FloatNumber := FloatToStr(FloatNum);
IF Pos('.', ls_FloatNumber) > 0 Then
Result := StrToFloat
(copy(ls_FloatNumber, 1, Pos('.', ls_FloatNumber) - 1) + '.' + copy
(ls_FloatNumber, Pos('.', ls_FloatNumber) + 1, NoOfDecPart))
Else
Result := FloatNum;
End;
Function RealFormat(FloatNum: Double): string;
Var
ls_FloatNumber: String;
Begin
ls_FloatNumber:=StringReplace(FloatToStr(FloatNum),',','.',[rfReplaceAll]);
IF Pos('.', ls_FloatNumber) > 0 Then
Result :=
(copy(ls_FloatNumber, 1, Pos('.', ls_FloatNumber) - 1) + '.' + copy
(ls_FloatNumber, Pos('.', ls_FloatNumber) + 1, 2))
Else
Result := FloatToStr(FloatNum);
End;
Look at the following code: why is the result of the Trunc function different?
procedure TForm1.Button1Click(Sender: TObject);
var
D: Double;
E: Extended;
I: Int64;
begin
D := Frac(101 / 100) * 100;
E := Frac(101 / 100) * 100;
I := Trunc(D);
ShowMessage('Trunc(Double): ' + IntToStr(I)); // Trunc(Double): 1
I := Trunc(E);
ShowMessage('Trunc(Extended): ' + IntToStr(I)); // Trunc(Extended): 0
end;
Formatting functions don't always display the actual numbers (data).
Real numbers and precision can be tricky.
Check out this code where I use more precision on what I want to see on the screen:
D := Frac(101 / 100);
E := Frac(101 / 100);
ShowMessage(FloatToStrF(D, ffFixed, 15, 20));
ShowMessage(FloatToStrF(E, ffFixed, 18, 20));
It appears that D is something like 0.010000000000 while E is like 0.00999999999.
Edit: Extended type has better precision than Double type.
If we try to display the values of D and E with FloatToStr() we'll probably get the same result, even though the actual values are not the same.
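To inspect this on a given compiler, something like the following sketch can be used. The program name is illustrative. No output is quoted here on purpose, because it depends on the platform: on targets where Extended is an 80-bit type the two values can differ, while on targets where Extended has the same size as Double (for example 64-bit Windows in Delphi) both lines are expected to agree.

```pascal
program FracPrecisionSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  D: Double;
  E: Extended;
begin
  { Same expressions as in the question }
  D := Frac(101 / 100) * 100;
  E := Frac(101 / 100) * 100;

  { Shows whether Extended is really wider than Double on this target }
  Writeln('SizeOf(Double)   = ', SizeOf(D));
  Writeln('SizeOf(Extended) = ', SizeOf(E));

  { Print many more digits than FloatToStr would, next to the Trunc result }
  Writeln('D = ', FloatToStrF(D, ffFixed, 18, 18), '  Trunc(D) = ', Trunc(D));
  Writeln('E = ', FloatToStrF(E, ffFixed, 18, 18), '  Trunc(E) = ', Trunc(E));
end.
```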
Note Nick D’s answer. He is right when saying that
It appears that D is something like
0.010000000000 while E is like 0.00999999999.
The answer, however, is not in the formatting function. This is how floating point calculations are done. Computers cannot represent real numbers exactly (there are infinitely many numbers between 0 and 1, while computers operate on a finite number of bits and bytes), so every Double or Extended variable in Delphi (and most other languages) is just an approximation (with some rare exceptions).
You can read more of it on Wikipedia: Floating point and Fixed-point