Binary floating point Subtraction - binary-data

I was working through a binary subtraction and got stuck. I don't understand how to subtract a larger number from a smaller one.
My operands are 0.111000*2^-3 and 1.0000*2^-3.
I subtracted the fractional part easily, but when I get to the MSB I don't know what to do. Where should I borrow from to perform the operation? I know that subtracting 1 from 0 requires a borrow and that it makes the sign bit negative. But how the result is stored is not my concern here; my problem is with the operation itself. Could anyone explain what the result is and how to compute it?

Very late, but I had the same question, so I'm putting this here for others with the same problem.
If your problem lies only in the fraction part, you could try this method:
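0.111 - 1.000
Step 1: Prepend a sign bit to both numbers and replace the number being subtracted (1.000) with its two's complement, so the subtraction becomes an addition.
0 0.111
1 1.000
------- +
1 1.111
The sign bit is 1, so the result is negative. Now invert and add one to convert from 2's complement to sign-magnitude:
1 1.111 -> 0.001, so the result is -0.001*2^-3.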
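
The same idea can be sketched in code. This is only an illustration in Python, assuming a 5-bit fixed-point layout (1 sign bit, 1 integer bit, 3 fraction bits) and ignoring the shared exponent 2^-3; the names to_fixed, format_fixed and subtract are made up for this example. It takes the two's complement of the number being subtracted, adds, and converts a negative result back to sign-magnitude.

```python
FRACTION_BITS = 3   # assumed: three bits after the binary point
WIDTH = 5           # assumed: 1 sign bit + 1 integer bit + 3 fraction bits
MASK = (1 << WIDTH) - 1


def to_fixed(bits: str) -> int:
    """Parse an unsigned binary string such as '0.111' into an integer
    that counts units of 2**-FRACTION_BITS."""
    whole, frac = bits.split(".")
    frac = frac.ljust(FRACTION_BITS, "0")[:FRACTION_BITS]
    return int(whole + frac, 2)


def format_fixed(value: int) -> str:
    """Format a non-negative fixed-point integer back into 'i.fff'."""
    whole = value >> FRACTION_BITS
    frac = value & ((1 << FRACTION_BITS) - 1)
    return f"{whole:b}.{frac:0{FRACTION_BITS}b}"


def subtract(a: str, b: str):
    """Compute a - b by adding the two's complement of b.
    Returns (is_negative, magnitude, raw two's-complement bits)."""
    x, y = to_fixed(a), to_fixed(b)
    neg_y = (~y + 1) & MASK            # invert and add one
    raw = (x + neg_y) & MASK           # add, discarding any carry out
    negative = bool(raw >> (WIDTH - 1))
    magnitude = ((~raw + 1) & MASK) if negative else raw
    return negative, format_fixed(magnitude), format(raw, f"0{WIDTH}b")


negative, magnitude, raw = subtract("0.111", "1.0000")
print(raw)                                   # raw two's-complement bits
print(("-" if negative else "") + magnitude)  # sign-magnitude mantissa
```

For the operands in the question the mantissa comes out as -0.001, so the value is -0.001*2^-3, which normalizes to -1.000*2^-6.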

Here is a great example that may help you:
If doing this programmatically, you can cheat and see which is larger before you perform the subtraction (which is what I do).


Nested iif statement to round to nickels! Will it work? AKA Penny Rounding

I have searched the web over and found very little dealing with this. I want to know whether there are any deeper issues that I am unaware of in getting the results this way. The [total] variable represents the calculated total owing. PayAmt represents what the customer will pay when paying cash only.
PayAmt: FormatCurrency(
This does on its face give the results as expected, I am just not sure IF I should approach this issue this way?
0.98 to 1.02 -> 1.00
1.03 to 1.07 -> 1.05
Having not seen anything like this, I suspect it can't be this easy. I just don't know why.
Thanks for any help!
Never use string handling for numbers.
Here is an article about serious rounding including all necessary code for any value and data type of VBA:
Rounding values up, down, by 4/5, or to significant figures
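
To illustrate rounding to nickels numerically rather than with string handling, here is a minimal sketch. It is written in Python instead of VBA, and it is not the code from the linked article; the function name round_to_nickel is hypothetical, and it assumes the total arrives as a decimal string or Decimal rather than a binary float.

```python
from decimal import Decimal, ROUND_HALF_UP

NICKEL = Decimal("0.05")


def round_to_nickel(total) -> Decimal:
    """Round a currency amount to the nearest multiple of 0.05.

    The amount is divided by 0.05, rounded to a whole number of nickels
    (halves round away from zero), and scaled back up.
    """
    nickels = (Decimal(total) / NICKEL).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return nickels * NICKEL


for amount in ["0.98", "1.02", "1.03", "1.07"]:
    print(amount, "->", round_to_nickel(amount))
```

With the values from the question, 0.98 and 1.02 both come out as 1.00, and 1.03 and 1.07 both come out as 1.05.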

objective-c looking for algorithm

In my application I need to determine which plates a user can load on their barbell to achieve the desired weight.
For example, the user might specify they are using a 45LB bar and have 45,35,25,10,5,2.5 pound plates to use. For a weight like 115, this is an easy problem to solve as the result neatly matches a common plate. (115 - 45) / 2 = 35.
So the objective here is to find the plate(s), from largest to smallest and taken from a selection, that the user needs to achieve the weight.
My starter method looks like this...
-(void)imperialNonOlympic:(float)barbellWeight workingWeight:(float)workingWeight {
float realWeight = (workingWeight - barbellWeight);
float perSide = realWeight / 2;
.... // lots of inefficient mod and division ....
My thought process is to first determine what the weight per side would be: (total weight - weight of the barbell) / 2. Then determine which plates are needed, from largest to smallest, and the number of each (e.g. 325 would be 45 * 3 + 5 per side, or 45,45,45,5).
While messing around with fmodf and a couple of other ideas, it occurred to me that there might be an algorithm that solves this problem. I was looking into BFS, and admit that it is over my head, but I'm still willing to give it a shot.
I'd appreciate any tips on where to look for algorithms or code examples.
Your problem is called the knapsack problem. You will find a lot of solutions for it, and there are several variants. It is basically a dynamic programming (DP) problem.
One common approach is to start by taking the largest weight that is no more than your desired weight, then take the largest that fits in what remains, and so on. It's easy. I am adding some more links ( Link 1, Link 2, Link 3 ) so that it becomes clear. Some of the problems may be hard to understand; skip them and try to focus on the basic knapsack problem. Good luck.. :)
Let me know if that helps.. :)
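
The largest-first approach described above can be sketched as follows. This is an illustration in Python rather than Objective-C, with a hypothetical function name plates_per_side, and it assumes an unlimited supply of each plate size. Greedy selection finds a valid loading for this plate set whenever the per-side weight is a multiple of the smallest plate, but it does not always use the fewest plates; for that you would need the dynamic programming approach.

```python
from decimal import Decimal


def plates_per_side(working_weight, barbell_weight, plates):
    """Return the plates for one side of the bar, largest first.

    Weights are converted to Decimal so that 2.5 lb plates do not
    introduce floating point error. Raises ValueError if the target
    cannot be reached with the given plate sizes.
    """
    per_side = (Decimal(str(working_weight)) - Decimal(str(barbell_weight))) / 2
    if per_side < 0:
        raise ValueError("working weight is less than the barbell weight")

    remaining = per_side
    chosen = []
    for plate in sorted((Decimal(str(p)) for p in plates), reverse=True):
        count = int(remaining // plate)   # how many of this plate still fit
        chosen.extend([plate] * count)
        remaining -= plate * count

    if remaining != 0:
        raise ValueError(f"cannot load {per_side} per side with these plates")
    return chosen


available = [45, 35, 25, 10, 5, 2.5]
print(plates_per_side(115, 45, available))
print(plates_per_side(325, 45, available))
```

For 115 this gives a single 35 per side, and for 325 it gives 45, 45, 45, 5, matching the examples in the question.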

Ruby Floating Point Math - Issue with Precision in Sum Calc

Good morning all,
I'm having some issues with floating point math, and have gotten totally lost in ".to_f"'s, "*100"'s and ".0"'s!
I was hoping someone could help me with my specific problem, and also explain exactly why their solution works so that I understand this for next time.
My program needs to do two things:
Sum a list of decimals, determine if they sum to exactly 1.0
Determine a difference between 1.0 and a sum of numbers - set the value of a variable to the exact difference to make the sum equal 1.0.
For example:
[0.28, 0.55, 0.17] -> should sum to 1.0, however I keep getting 1.xxxxxx. I am implementing the sum in the following fashion:
sum = array.inject(0.0){|sum,x| sum+ (x*100)} / 100
The reason I need this functionality is that I'm reading in a set of decimals that come from excel. They are not 100% precise (they are lacking some decimal points) so the sum usually comes out of 0.999999xxxxx or 1.000xxxxx. For example, I will get values like the following:
To fix this, I am ok taking the sum of the first n-1 numbers, and then changing the final number slightly so that all of the numbers together sum to 1.0 (must meet validation using the equation above, or whatever I end up with). I'm currently implementing this as follows:
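sum = 0.0
# add up every item except the last, scaled to whole hundredths
array[0...-1].each do |item|
  sum += item * 100.0
end
# overwrite the last item so that the total comes to 1.0
array[-1] = (100 - sum.round) / 100.0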
I know I could do this with inject, but was trying to play with it to see what works. I think this is generally working (from inspecting the output), but it doesn't always meet the validation sum above. So if need be I can adjust this one as well. Note that I only need two decimal precision in these numbers - i.e. 0.56 not 0.5623225. I can either round them down at time of presentation, or during this calculation... It doesn't matter to me.
Thank you VERY MUCH for your help!
If accuracy is important to you, you should not be using floating point values: they are binary fractions, so most decimal values such as 0.28 cannot be stored exactly. Ruby has some exact numeric types for arithmetic where accuracy is important. They are, off the top of my head, BigDecimal and Rational, depending on what you actually need to calculate.
It seems that in your case, what you're looking for is BigDecimal, which stores a number as decimal digits, so a value like 0.28 is held exactly as written (in contrast to a binary floating point, which holds only the nearest binary approximation).
When you read from Excel and deliberately cast those strings like "0.9987" to floating points, you're immediately losing the accurate value that is contained in the string.
require "bigdecimal"
BigDecimal("0.9987")
That value is precise. It is 0.9987. Not 0.998732109, or anything close to it, but 0.9987. You may use all the usual arithmetic operations on it. Provided you don't mix floating points into the arithmetic operations, the return values will remain precise.
If your array contains the raw strings you got from Excel (i.e. you haven't #to_f'd them), then this will give you a BigDecimal that is the difference between the sum of them and 1.
1 - array.map{|v| BigDecimal(v)}.reduce(:+)
continue using floats and round(2) your totals: 12.341.round(2) # => 12.34
use integers (i.e. cents instead of dollars)
use BigDecimal and you won't need to round after summing them, as long as you start with BigDecimal with only two decimals.
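
As a rough illustration of the exact-decimal option, here is the same idea sketched in Python, whose decimal.Decimal plays a role similar to Ruby's BigDecimal. The function names sums_to_one and fix_last are hypothetical, and the sketch assumes the values are still the raw strings read from Excel, not floats.

```python
from decimal import Decimal


def sums_to_one(raw_values):
    """True if the decimal strings add up to exactly 1."""
    return sum(Decimal(v) for v in raw_values) == Decimal(1)


def fix_last(raw_values):
    """Return the values as Decimals, with the last one replaced by
    whatever makes the total exactly 1."""
    values = [Decimal(v) for v in raw_values]
    values[-1] = Decimal(1) - sum(values[:-1])
    return values


print(sums_to_one(["0.28", "0.55", "0.17"]))
print(fix_last(["0.28", "0.55", "0.16"]))
```

Because every value is parsed from a string and never passes through a float, the comparison with 1 is exact and no rounding step is needed.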
I think that algorithms have a great deal more to do with accuracy and precision than a choice of IEEE floating point over another representation.
People used to do some fine calculations while still dealing with accuracy and precision issues. They'd do it by managing the algorithms they'd use and understanding how to represent functions more deeply. I think that you might be making a mistake by throwing aside that better understanding and assuming that another representation is the solution.
For example, no polynomial representation of a function will deal with an asymptote or singularity properly.
Don't discard floating point so quickly. It could be that being smarter about the way you use it will do just fine.

Project Euler -Prob. #20 (Lua)
I've written code to figure out this problem, however, it seems to be accurate in some cases, and inaccurate in others. When I try solving the problem to 10 (answer is given in question, 27) I get 27, the correct answer. However, when I try solving the question given (100) I get 64, the incorrect answer, as the answer is something else.
Here's my code:
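function factorial(num)
  if num >= 1 then
    return num * factorial(num - 1)
  end
  return 1
end

function getSumDigits(str)
  -- expand scientific notation into plain digits and strip the padding
  str = string.format("%18.0f", str):gsub(" ", "")
  local sum = 0
  for i = 1, #str do
    sum = sum + tonumber(str:sub(i, i))
  end
  return sum
end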
Since Lua converts large numbers into scientific notation, I had to convert it back to standard notation. I don't think this is a problem, though it might be.
Is there any explanation to this?
Unfortunately, the correct solution is more difficult. The main problem here is that Lua uses 64-bit floating point numbers, so the usual floating point precision limits apply.
Long story short: the number of significant digits in a 64-bit float is much too small to store a number like 100!. A 64-bit float has a 53-bit significand (52 bits stored explicitly), which gives you a little over 15 decimal digits, so integers above 2^53 generally cannot be represented exactly and get rounded. To store 100!, you'll need 158 decimal digits.
The number calculated by your factorial() function is reasonably close to the real value of 100! (i.e. the relative error is small), but you need the exact value to get the right solution.
What you need to do is implement your own algorithms for dealing with large numbers. I actually solved that problem in Lua by storing each number as a table, where each entry stores one digit of a decimal number. The complete solution takes a little more than 50 lines of code, so it's not too difficult and a nice exercise.
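
The digit-table idea can be sketched like this. The sketch is in Python rather than Lua and is not the answerer's actual solution; the function names are hypothetical. Python integers are already arbitrary precision, so the digit list is used here only to mirror what you would have to do by hand in Lua: the number is kept as a list of decimal digits, least significant first, and multiplied by a small integer with a carry.

```python
def multiply_digits(digits, factor):
    """Multiply a number stored as decimal digits (least significant
    digit first) by a small non-negative integer."""
    result = []
    carry = 0
    for d in digits:
        carry, digit = divmod(d * factor + carry, 10)
        result.append(digit)
    while carry:                      # append any remaining carry digits
        carry, digit = divmod(carry, 10)
        result.append(digit)
    return result


def factorial_digit_sum(n):
    """Sum of the decimal digits of n!, computed without floats."""
    digits = [1]
    for k in range(2, n + 1):
        digits = multiply_digits(digits, k)
    return sum(digits)


print(factorial_digit_sum(10))   # 10! = 3628800, digit sum 27
```

Calling factorial_digit_sum(100) then gives the exact digit sum, because no step ever rounds.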

(La)TeX Base 10 fixed point arithmetic

I'm trying to implement decimal arithmetic in (La)TeX. I'm trying to use dimens to store the values. I want the arithmetic to be exact to some (fixed) number of decimal places. If I use 1pt as my base unit, then this fails, because \divide rounds down, so 1pt / 10 gives 0.09999pt. If I use something like 1000sp as my base unit, then I get working fixed point arithmetic with 3 decimal places, but I can't figure out an easy way to format the numbers. If I try to convert them to pt, so I can use TeX's display mechanism, I have the same problem with \divide.
How do I fix this problem, or work around it?
The fp package provides fixed point arithmetic for LaTeX. The LaTeX3 Project are currently implementing something similar as part of the expl3 bundle. The code is currently not on CTAN, but can be grabbed from the SVN (or will appear when the next update from the SVN to CTAN takes place).
I would represent all the values as integers and scale them appropriately. For example, when you need three decimal digits, 0.124 would be represented as 124. This is nice because addition and subtraction are trivial. When multiplying two numbers a and b, you have to divide the result by 1000 to get the proper representation. Division works the same way in reverse: you multiply the result by 1000.
You still have to get the rounding issues correct, but this isn't very difficult, at least as long as you don't get near the maximum representable integer (2^31-1 for a TeX count register).
Here is some code:
\advance #1 by #3\relax
\advance #1 by #3\relax
\multiply #1 by #3\relax
\divide #1 by 1000\relax
\divide #1 by #3\relax
\multiply #1 by 1000\relax
The operations are modeled after a three register machine, where the first is the destination and the other two are the operands. The rounding after the multiplication and division, including corner cases for very large or very small numbers are left as an exercise to you.
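
To make the scaling rules concrete, here is a small sketch of the same scaled-integer scheme in Python rather than TeX. The function names are hypothetical. It deviates from the macros above in two labeled ways: products are rounded to the nearest thousandth instead of truncated, and division scales the dividend by 1000 before dividing (rather than the quotient afterwards) so that the fractional digits are not lost. It assumes a positive divisor and does not model TeX's integer range.

```python
SCALE = 1000  # three decimal places: 0.124 is stored as 124


def fp_add(a, b):
    return a + b


def fp_sub(a, b):
    return a - b


def fp_mul(a, b):
    """Multiply two scaled values; the raw product is scaled by
    SCALE**2, so divide once by SCALE, rounding halves upward."""
    return (a * b + SCALE // 2) // SCALE


def fp_div(a, b):
    """Divide two scaled values (b must be positive); scale the
    dividend first, then round the quotient to the nearest unit."""
    if b <= 0:
        raise ValueError("this sketch only handles a positive divisor")
    return (a * SCALE + b // 2) // b


def fp_format(x):
    """Render a scaled integer as a decimal string with three places."""
    sign = "-" if x < 0 else ""
    whole, frac = divmod(abs(x), SCALE)
    return f"{sign}{whole}.{frac:03d}"


print(fp_format(fp_mul(124, 2000)))   # 0.124 * 2.000
print(fp_format(fp_div(1000, 3000)))  # 1.000 / 3.000
```

Formatting is done with integer division and remainder only, which sidesteps the conversion back to pt that caused trouble in the question.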