Crash at any optimization level other than -O0 in iOS

The following two pieces of code work fine when the optimization level is -O0.
But when the optimization level is anything other than -O0, the first one crashes at some point, while the second does not. Could you please explain why?
1.
unsigned char* _pos = ...;
double result;
*((int*)&result) = *((int*)_pos);
2.
unsigned char* _pos = ...;
double result;
int* curPos = (int*)_pos;
int* resultPos = (int*)&result;
*resultPos = *curPos;
EDIT:
By the way, this code is in an inlined function. When the function is not inlined, there is no crash even with optimizations.

The code here actually yields several problems at once.
First, as was said before, the code violates the aliasing rules, so its behavior is undefined per the standard. Strictly speaking, the compiler can do all sorts of things while optimizing (which is exactly your case when the code above is inlined).
Second (and I believe this is the actual problem here), casting char* to int* increases the assumed alignment of the pointer. According to your platform ABI, char can be 1-byte aligned, but int needs at least 4 (double is 8-byte aligned, by the way). The system can tolerate unaligned loads, but not always: on arm/darwin, for example, it can tolerate unaligned 4-byte loads, but not 8-byte ones. The latter case can happen when the compiler decides to merge two consecutive loads/stores into one. Since you bumped the assumed alignment of the pointer, the compiler might deduce that everything is suitably aligned and generate such 8-byte loads.
So, in short: fix your code :) In this particular case memcpy / memmove will help you.
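As a minimal sketch of the memcpy approach (load_int is a hypothetical helper name, not something from the question), the bytes are copied into a properly typed local instead of dereferencing a cast pointer:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: read an int from a buffer that carries
   no alignment guarantee. */
static int load_int(const unsigned char *pos)
{
    int value;
    assert(pos != NULL);
    /* memcpy makes no alignment or aliasing assumptions about pos,
       so the compiler cannot infer a stricter alignment from it. */
    memcpy(&value, pos, sizeof value);
    return value;
}
```

Compilers typically turn a fixed-size memcpy like this into a plain load where the target allows it, so this usually costs nothing compared with the cast.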

Related

Outputting values from CAMPARY

I'm trying to use the CAMPARY library (CudA Multiple Precision ARithmetic librarY). I've downloaded the code and included it in my project. Since it supports both cpu and gpu, I'm starting with cpu to understand how it works and make sure it does what I need. But the intent is to use this with CUDA.
I'm able to instantiate an instance and assign a value, but I can't figure out how to get things back out. Consider:
#include <stdio.h>  // printf
#include <stdlib.h> // free
#include <time.h>
#include "c:\\vss\\CAMPARY\\Doubles\\src_cpu\\multi_prec.h"
int main()
{
const char *value = "123456789012345678901234567";
multi_prec<2> a(value);
a.prettyPrint();
a.prettyPrintBin();
a.prettyPrintBin_UnevalSum();
char *cc = a.prettyPrintBF();
printf("\n%s\n", cc);
free(cc);
}
Compiles, links, runs (VS 2017). But the output is pretty unhelpful:
Prec = 2
Data[0] = 1.234568e+26
Data[1] = 7.486371e+08
Prec = 2
Data[0] = 0x1.987bf7c563caap+86;
Data[1] = 0x1.64fa5c3800000p+29;
0x1.987bf7c563caap+86 + 0x1.64fa5c3800000p+29;
1.234568e+26 7.486371e+08
Printing each of the doubles like this might be easy to do, but it doesn't tell you much about the value of the 128-bit number being stored. Performing highly accurate computations is of limited value if there's no way to output the results.
In addition to just printing out the value, eventually I also need to convert these numbers to ints (I'm willing to try it all in floats if there's a way to print, but I fear that both accuracy and speed will suffer). Unlike MPIR (which doesn't support CUDA), CAMPARY doesn't have any associated multi-precision int type, just floats. I can probably cobble together what I need (mostly just add/subtract/compare), but only if I can get the integer portion of CAMPARY's values back out, which I don't see a way to do.
CAMPARY doesn't seem to have any docs, so it's conceivable these capabilities are there, and I've simply overlooked them. And I'd rather ask on the CAMPARY discussion forum/mail list, but there doesn't seem to be one. That's why I'm asking here.
To sum up:
Is there any way to output the 128-bit (multi_prec<2>) values from CAMPARY?
Is there any way to extract the integer portion from a CAMPARY multi_prec? Perhaps one of the (many) math functions in the library that I don't understand computes this?

There are really only 2 possible answers to this question:
There's another (better) multi-precision library that works on CUDA that does what you need.
Here's how to modify this library to do what you need.
The only people who could give the first answer are CUDA programmers. Unfortunately, if there were such a library, I feel confident talonmies would have known about it and mentioned it.
As for #2, why would anyone update this library if they weren't a CUDA programmer? There are other, much better multi-precision libraries out there. The ONLY benefit CAMPARY offers is that it supports CUDA. Which means the only people with any real motivation to work with or modify the library are CUDA programmers.
And, as the CUDA programmer with the most vested interest in solving this, I did figure out a solution (albeit an ugly one). I'm posting it here in the hopes that the information will be of value to future CAMPARY programmers. There's not much information out there for this library, so this is a start.
The first thing you need to understand is how CAMPARY stores its data. And, while not complex, it isn't what I expected. Coming from MPIR, I assumed that CAMPARY stored its data pretty much the same way: a fixed size exponent followed by an arbitrary number of bits for the mantissa.
But nope, CAMPARY went a different way. Looking at the code, we see:
private:
double data[prec];
Now, I assumed that this was just an arbitrary way of reserving the number of bits they needed. But no, they really do use prec doubles. Like so:
multi_prec<8> a("2633716138033644471646729489243748530829179225072491799768019505671233074369063908765111461703117249");
// Looking at a in the VS debugger:
[0] 2.6337161380336443e+99 const double
[1] 1.8496577979210756e+83 const double
[2] 1.2618399223120249e+67 const double
[3] -3.5978270144026257e+48 const double
[4] -1.1764513205926450e+32 const double
[5] -2479038053160511.0 const double
[6] 0.00000000000000000 const double
[7] 0.00000000000000000 const double
So, what they are doing is storing the max amount of precision possible in the first double, then the remainder is used to compute the next double and so on until they encompass the entire value, or run out of precision (dropping the least significant bits). Note that some of these are negative, which means the sum of the preceding values is a bit bigger than the actual value and they are correcting it downward.
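To make the "leading double plus correction" idea concrete, here is a small sketch in plain C. It is not CAMPARY's code: dd_parts and split_int64 are hypothetical names, only two components are used, and the input is a 64-bit integer rather than a decimal string. It assumes the default round-to-nearest mode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-component value: hi + lo equals the original exactly. */
typedef struct {
    double hi; /* nearest double to the value */
    double lo; /* what is left over; may be negative */
} dd_parts;

static dd_parts split_int64(int64_t x)
{
    dd_parts p;
    /* Keep |x| below 2^62 so converting hi back to int64_t cannot overflow. */
    assert(x > -(INT64_C(1) << 62) && x < (INT64_C(1) << 62));
    p.hi = (double)x;                   /* rounds to 53 significant bits */
    p.lo = (double)(x - (int64_t)p.hi); /* small remainder, stored exactly */
    return p;
}
```

When the first conversion rounds upward, the remainder comes out negative, which is the same downward correction visible in the debugger dump.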
With this in mind, we return to the question of how to print it.
In theory, you could just add all these together to get the right answer. But kinda by definition, we already know that C doesn't have a datatype to hold a value this size. But other libraries do (say MPIR). Now, MPIR doesn't work on CUDA, but it doesn't need to. You don't want to have your CUDA code printing out data. That's something you should be doing from the host anyway. So do the computations with the full power of CUDA, cudaMemcpy the results back, then use MPIR to print them out:
#define MPREC 8
void ShowP(const multi_prec<MPREC> &value)
{
// mpf_t and the mpf_* functions come from MPIR (mpir.org)
mpf_t mp, mp2;
mpf_init2(mp, value.getPrec() * 64); // Make sure we reserve enough room
mpf_init(mp2); // Only needs to hold one double.
const double *ptr = value.getData();
mpf_set_d(mp, ptr[0]);
for (int x = 1; x < value.getPrec(); x++)
{
// MPIR doesn't have a mpf_add_d, so we need to load the value into
// an mpf_t.
mpf_set_d(mp2, ptr[x]);
mpf_add(mp, mp, mp2);
}
// Using base 10, write the full precision (0) of mp, to stdout.
mpf_out_str(stdout, 10, 0, mp);
mpf_clears(mp, mp2, NULL);
}
Used with the number stored in the multi_prec above, this outputs the exact same value. Yay.
It's not a particularly elegant solution. Having to add a second library just to print a value from the first is clearly sub-optimal. And this conversion can't be all that speedy either. But printing is typically done (much) less frequently than computing. If you do an hour's worth of computing and a handful of prints, the performance doesn't much matter. And it beats the heck out of not being able to print at all.
CAMPARY has a lot of shortcomings (undocumented, unsupported, unmaintained). But for people who need mp numbers on CUDA (especially if you need sqrt), it's the best option I've found.

Delphi Warning W1073 Combining signed type and unsigned 64-bit type - treated as an unsigned type

I get the subject warning on the following line of code:
SelectedFilesSize := SelectedFilesSize +
UInt64(IdList.GetPropertyValue(TShellColumns.Size)) *
ifthen(Selected, 1, -1);
Specifically, the IDE highlights the third line.
SelectedFilesSize is declared as UInt64.
The code appears to work when I run it; if I select an item, its file size is added to the total, if I deselect a file its size is subtracted.
I know I can suppress this warning with {$WARN COMBINING_SIGNED_UNSIGNED64 OFF}.
Can someone explain? Will there be an unforeseen impact if SelectedFilesSize gets huge? Or an impact on a specific target platform?
Delphi 10.3, Win32 and Win64 targets

This will work here, but the warning is right.
If you multiply a UInt64 by -1, you are actually multiplying it by $FFFFFFFFFFFFFFFF. The full result is a 128-bit value, but the lower 64 bits are the same as for a signed multiplication (that is also why the code generator often produces an imul opcode, even for unsigned multiplication: the lower bits will be correct, only the unused higher bits won't be). The upper 64 bits aren't used anyway, so they don't matter.
If you add that (actually negative) value to another UInt64 (e.g. SelectedFilesSize), the 64 bit result will be correct again. The CPU does not discriminate between positive or negative values when adding. The resulting CPU flags (carry, overflow) will indicate overflow, but if you ignore that by not using range or overflow checks, your code will be fine.
Your code will likely produce a runtime error if range or overflow checks are on, though.
In other words, this works because any excess upper bits (bit 64 and above) can be ignored. Otherwise, the values would be wrong. See the example.
Example
Say your IdList.GetPropertyValue(TShellColumns.Size) is 420. Then you are performing:
$00000000000001A4 * $FFFFFFFFFFFFFFFF = $00000000000001A3FFFFFFFFFFFFFE5C
This is a huge positive number, but fortunately the lower 64 bits ($FFFFFFFFFFFFFE5C) can be interpreted as -420 (a truly negative 128-bit value would be $FFFFFFFFFFFFFFFFFFFFFFFFFFFFFE5C, or -420).
Now say your SelectedFilesSize is 100000 (or hex $00000000000186A0). Then you get:
$00000000000186A0 + $FFFFFFFFFFFFFE5C = $00000000000184FC
(or actually $100000000000184FC, but the top bit, the carry, is ignored).
$00000000000184FC is 99580 in decimal, so exactly the value you wanted.
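The same wraparound can be checked in C, where unsigned 64-bit arithmetic is defined to be modulo 2^64. This is only a sketch of the arithmetic, not a translation of the Delphi code; adjust_total is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add size when selected, subtract it otherwise,
   using only unsigned 64-bit arithmetic. */
static uint64_t adjust_total(uint64_t total, uint64_t size, int selected)
{
    /* (uint64_t)-1 is all ones; multiplying by it negates modulo 2^64. */
    uint64_t sign = selected ? UINT64_C(1) : (uint64_t)-1;
    /* A deselected item must have been counted in total beforehand. */
    assert(selected || total >= size);
    /* The carry out of bit 63 is simply discarded. */
    return total + size * sign;
}
```

With a total of 100000 and a size of 420, deselecting yields 99580 and selecting yields 100420. The assert plays the role that Delphi's range and overflow checks would play for a genuinely invalid total.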

Benefits of using NSInteger over int?

I am trying to comprehend how development is affected when developing for both 32-bit and 64-bit architectures. From what I have researched thus far, I understand an int is always 4 bytes regardless of the architecture of the device running the app. But an NSInteger is 4 bytes on a 32-bit device and 8 bytes on a 64-bit device. I get the impression NSInteger is "safer" and recommended but I'm not sure what the reasoning is for that.
My question is, if you know the possible value you're using is never going to be large (maybe you're using it to index into an array of 200 items or store the count of objects in an array), why define it as an NSInteger? That's just going to take up 8 bytes when you won't use it all. Is it better to define it as an int in those cases? If so, in what case would you want to use an NSInteger (as opposed to int or long etc)? Obviously if you needed to utilize larger numbers, you could with the 64-bit architecture. But if you needed it to also work on 32-bit devices, would you not use long long because it's 8 bytes on 32-bit devices as well? I don't see why one would use NSInteger, at least when creating an app that runs on both architectures.
Also I cannot think of a method which takes in or returns a primitive type - int, and instead utilizes NSInteger, and am wondering if there is more to it than just the size of the values. For example, (NSInteger)tableView:(UITableView *)tableView numberOfRowsInSection:(NSInteger)section. I'd like to understand why this is the case. Assuming it's possible to have a table with 2,147,483,647 rows, what would occur on a 32-bit device when you add one more - does it wrap around to a -2,147,483,647? And on a 64-bit device it would be 2,147,483,648. (Why return a signed value? I'd think it should be unsigned since you can't have a negative number of rows.)
Ultimately, I'd like to obtain a better understanding of actual use of these number data types, perhaps some code examples would be great!

I personally think that 64-bit is actually the reason NSInteger and NSUInteger exist; before 10.5, they did not exist. The two are simply defined as longs in 64-bit builds and as ints in 32-bit builds.
NSInteger/NSUInteger are defined as dynamic typedefs to one of these types, like this:
#if __LP64__ || NS_BUILD_32_LIKE_64
typedef long NSInteger;
typedef unsigned long NSUInteger;
#else
typedef int NSInteger;
typedef unsigned int NSUInteger;
#endif
Thus, use them in place of the more basic C types when you want the 'bit-native' size.
I suggest you thoroughly read this link.
CocoaDev has some more info.
For the proper format specifier to use for each of these types, see the String Programming Guide's section on Platform Dependencies.
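The usual way to print such a platform-sized integer is to cast it to long and use %ld, so one format string works on both 32-bit and 64-bit builds. The sketch below is plain C rather than Objective-C; my_integer is a hypothetical stand-in that mirrors the typedef shown above, and format_integer is likewise an illustrative name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for NSInteger, following the same pattern. */
#if defined(__LP64__)
typedef long my_integer;
#else
typedef int my_integer;
#endif

/* Formats value into buf; returns the number of characters written
   (excluding the terminator), as snprintf does. */
static int format_integer(char *buf, size_t size, my_integer value)
{
    assert(buf != NULL && size > 0);
    /* Casting to long makes "%ld" correct whichever typedef is active. */
    return snprintf(buf, size, "%ld", (long)value);
}
```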

I remember this from an iOS developer conference: you have to pay attention to data types in iOS 7. For example, say you use NSInteger on a 64-bit device and save it to iCloud. If you then sync to an older device (say, an iPad 2nd gen), your app will not behave the same, because there NSInteger is 4 bytes, not 8, and your calculations would be wrong.
But so far I use NSInteger, mostly because my app doesn't use iCloud or sync, and to avoid compiler warnings.

Apple uses int because for a loop control variable (which is only used to control the loop iterations) int datatype is fine, both in datatype size and in the values it can hold for your loop. No need for platform dependent datatype here. For a loop control variable even a 16-bit int will do most of the time.
Apple uses NSInteger for a function return value or for a function argument because in this case datatype [size] matters, because what you are doing with a function is communicating/passing data with other programs or with other pieces of code.
Apple uses NSInteger (or NSUInteger) when passing a value as an argument to a function or returning a value from a function.
The only thing I would use NSInteger for is passing values to and from an API that specifies it. Other than that it has no advantage over an int or a long. At least with an int or a long you know what format specifiers to use in a printf or similar statement.

As a continuation of Irfan's answer:
sizeof(NSInteger)
equals the processor's word size. It is simpler and faster for the processor to operate on words.

Beginner assembly programming memory usage question

I've been getting into some assembly lately and it's fun, as it challenges everything I have learned. I was wondering if I could ask a few questions.
When running an executable, does the entire executable get loaded into memory?
From a bit of fiddling I've found that constants aren't really constants? Is it just a compiler thing?
const int i = 5;
_asm { mov i, 0 } // i is now 0 and compiles fine
So are all variables assigned with a constant value embedded into the file as well?
Meaning:
int a = 1;
const int b = 2;
void something()
{
const int c = 3;
int d = 4;
}
Will I find all of these variables embedded in the file (in a hex editor or something)?
If the executable is loaded into memory, then "constants" are technically using memory? I've read people around the net saying that constants don't use memory. Is this true?

Your executable's text (i.e. code) and data segments get mapped into the process's virtual address space when the executable starts up, but the bytes might not actually be copied from the disk until those memory locations are accessed. See http://en.wikipedia.org/wiki/Demand_paging
C-language constants actually exist in memory, because you have to be able to take the address of them. (That is, &i.) Constants are usually found in the .rdata segment of your executable image.
A constant is going to take up memory somewhere--if you have the constant number 42 in your program, there must be somewhere in memory where the 42 is stored, even if that means that it's stored as the argument of an immediate-mode instruction.

The OS loads the code and data segments in order to prepare them for execution.
If the executable has a resource segment, the application loads parts of it on demand.
It's true that const variables take memory space, but compilers are free to optimize for memory usage and code size and embed their values in the code (in case they don't detect any address references for those variables).
C strings (const char *) are usually interned by compilers to save memory.

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine, we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!

In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory mapped register, which is not the case for all systems, but it sounds like is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values might be uint16_t, INT16U,
// or something)
uint16_t volatile* pReg = (uint16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"

Each register in this hardware is exposed as a two-byte array whose first element is aligned on a two-byte boundary (its address is even). memcpy() runs a loop and copies one byte per iteration, so it copies from these registers like this (all loops unrolled, char is one byte, reg is the register's address):
*((char*)target) = *((char*)reg); // evenly aligned - address is always even
*((char*)target + 1) = *((char*)reg + 1); // oddly aligned - address is always odd
However, the second line works incorrectly for hardware-specific reasons. If you copy two bytes at a time instead of one, it is done this way (short int is two bytes):
*((short int*)target) = *((short int*)reg); // evenly aligned
Here you copy two bytes in one operation, and the first byte is evenly aligned. Since there's no separate copy from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are evenly aligned and copies in two-byte chunks if they are.

If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side effects, depending on the register and its function, of course, so it's important to use the properly sized access so you can read the entire register in one go.

Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
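#include <stddef.h> /* size_t */
/* Copies bytecount/2 16-bit words; a trailing odd byte is not copied. */
void wordcpy(short *dest, const short *src, size_t bytecount)
{
size_t i; /* unsigned, to match bytecount */
for (i = 0; i < bytecount/2; ++i)
*dest++ = *src++;
}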

I think all the detail is contained in that thread you posted, so I'll try to break it down a little.
Specifically:
If you access a 16-bit hardware register using two 8-bit
accesses, the high-order byte doesn't read correctly (it
always read as 0xFF for me). This is fair enough since
TI's docs state that 16-bit hardware registers must be
read and written using 16-bit-wide instructions, and
normally would be, unless you're using memcpy() to
read them.
So the problem here is that the hardware registers only report the correct value if they are read in a single 16-bit read. This would be equivalent to doing:
uint16 value = *(regAddress);
This reads from the address into value using a single 16-bit read. On the other hand, you have memcpy, which copies data a single byte at a time. Something like:
while (n--)
{
*(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8 bits (1 byte) at a time, resulting in invalid values.
The solution posted in that thread is to use a version of memcpy that copies the data using 16-bit reads wherever the source and destination are 16-bit aligned.
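A minimal sketch of such a word-wise copy is shown below. It is not the code from the linked thread: copy_regs16 is a hypothetical name, and the sketch assumes the compiler emits one 16-bit load for each read through a volatile uint16_t pointer, which is the usual behavior but is ultimately up to the toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `words` 16-bit registers starting at src
   into an ordinary buffer at dest. */
static void copy_regs16(void *dest, const volatile uint16_t *src, size_t words)
{
    unsigned char *out = dest;
    size_t i;
    assert(dest != NULL && src != NULL);
    /* The register block must start on an even address. */
    assert(((uintptr_t)src & 1u) == 0);
    for (i = 0; i < words; ++i) {
        uint16_t w = src[i];               /* one 16-bit read per register */
        memcpy(out + 2 * i, &w, sizeof w); /* dest is plain RAM, bytes are fine */
    }
}
```

Only the source side needs 16-bit accesses, so the destination is written with memcpy; that keeps the buffer free of alignment requirements.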

What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.

Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory-mapped hardware registers. That is a limitation of the processor: there's simply no way to get a valid value by reading a byte at a time.
The suggested solution is to write your own memcpy() that only works on word-aligned addresses and reads 16-bit words at a time. This is fairly straightforward; the link gives both a C and an assembly version. The only gotcha is to make sure you always do the 16-bit copies from validly aligned addresses. You can do that in two ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.
