The importance of using a 16bit integer

The importance of using a 16bit integer - memory

How seriously do developers think about using a 16bit integer when writing code? I've been using 32bit integers ever since I've been programming and I don't really think about using 16bit.
Its so easy to declare a 32bit int because its the default for most languages.
Whats the upside of using a 16bit integer apart from a little memory saved?

Now that we have cars, we don't walk or ride horses as much, but we still do walk and ride horses.
There is less need to use shorts these days. In a lot of situations the cost of disk space and availability of RAM mean that we no longer need to squeeze every last bit of storage out of computers as we did 20 years ago, so we can sacrifice a bit of storage efficiency in order to save on development/maintenance costs.
However, where large amounts of data are used, or we are working with systems with small memories (e.g. embedded controllers) or when we are transmitting data over networks, using 32 or 64 bits to represent a 16-bit value is just a waste of memory/bandwidth. It doesn't matter how much memory you have, wasting half or three quarters of it would just be stupid.

APIs/interfaces (e.g. TCP/IP port numbers) and algorithms that require manipulation (e.g. rotation) of 16-bit values.

I was interested in the relative performance so I wrote this small test program to perform a very simple test of the speed of allocating, using, and freeing a significant amount of data in both int and short format.
I run the tests several times in case caching and so on are affected.
#include <iostream>
#include <windows.h>
using namespace std;
const int DATASIZE = 1000000;
template <typename DataType>
long long testCount()
{
long long t1, t2;
QueryPerformanceCounter((LARGE_INTEGER*)&t1);
DataType* data = new DataType[DATASIZE];
for(int i = 0; i < DATASIZE; i++) {
data[i] = 0;
}
delete[] data;
QueryPerformanceCounter((LARGE_INTEGER*)&t2);
return t2-t1;
}
int main()
{
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
cout << "Test using int : " << testCount<int>() << " ticks.\n";
cout << "Test using short : " << testCount<short>() << " ticks.\n";
}
and here are the results on my system (64 bit quad core system running windows7 64 bit, but the program is a 32 bit program built using VC++ express 2010 beta in release mode)
Test using short : 3672 ticks.
Test using int : 7903 ticks.
Test using short : 4321 ticks.
Test using int : 7936 ticks.
Test using short : 3697 ticks.
Test using int : 7701 ticks.
Test using short : 4222 ticks.
This seems to show that there are significant performance advantages at least in some cases to using short instead of int when there is a large amount of data. I realise that this is far from being a comprehensive test, but it's some evidence that not only do they use less space but they can be faster to process too at least in some applications.

when there is memory constraints short can help u lot. for e.g. while coding for embedded systems, u need to consider the memory.

16-bit values are still in great demand (though unsigned would do - don't really need signed).
For example,
16 bit Unicode - UTF-16/UCS-2.
16 bit graphics - especially for embedded devices.
16 bit checksums - for UDP headers and similar.
16 Bit devices - e.g. many norflash devices are 16 bit.

You might need to wrap at 65535.
You might need to work with a message sent from a device which includes fields which are 16 bit. Using 32 bit integers in this case would cause you to be accessing bits at the wrong offset in the message.
You might be working on an embedded 16 bit micro, or an embedded 8 bit micro. Hint: not all processors are x86, 32 bit.

This is really important in database development, because sometimes people are using a lot more space than is really needed (e.g. using int when small would have been sufficient). When you have tables with millions of rows this can be important factor in e.g. database size and queries. I would recommend people using always the appropriate datatype for columns.
I also try to use the correct datatype for other development, I know it can be a pain dealing with long and small (pretty convenient to have everyting int) but I think it pays off in the end, for example when serializing objects.

you ask: Any good reason to keep them around?
Since you say 'language-agnostic' the answer is a 'certainly yes'.
The computer CPU still works with bytes, words, full registers and whatnot, no matter how much these 'data types' are abstracted by some programming languages. There will always be situations where the code needs to 'touch the metal'.

It's hardly a little memory saved [read: 50%] when you allocate memory for a large number of numeric values. Common uses are:
COM and external device interop
Reducing memory consumption for large arrays where each number will never exceed a couple thousands in magnitude
Unique hashes for pairs of objects, where no more than ~65K objects are needed (hash values can only be 32-bit ints, but note that hash table types must transform the value for internal representations so collisions are still likely, but equality can be based on exact hash matches)
Speed up algorithms that rely on structs (smaller sized value types translates to increased performance when they are copied around in memory)

In large arrays, "little memory saved" could instead be "much memory saved".

The use of 16 bit integers is primarily for when you need to encode things for transmission over a network, for saving on hard disk, etc. without using up any more space than necessary. It might also occasionally be useful to save memory if you have a very large array of integers, or a lot of objects that contain integers.
Use of 16 bit integers without there being a good memory saving reason is pretty pointless. And 16 bit local variables are most often silently implemented with 32 or 64 bit integers anyway.

you have probably been using the 16 bit datatype more often than you knew. The char datatype in both C# and Java are 16 bit. Unicode is typically stored in a 16bit datatype.

The question should really be why we need a 16-bit primitive data type, and the answer would be that there is an awful lot of data out there which is naturally represented in 16 bits. One ubiquitous example is audio, e.g. CD audio is represented as streams of 16 bit signed integers.

16 bits is still plenty big enough to hold pixel channel values (e.g. R, G, or B). Most pixels only use 8 bits to store a channel, but Photoshop has a 16-bit mode that professionals use.
In other words, a pixel might be defined as struct Pixel16 { short R, G, B, A; } or an image might be defined as separate channels of struct Channel16 { short channel[]; }

I think most people use the default int on their platform. However there are times when you have to communicate with older systems or libraries that are expecting 16 bit or even eight bit integers (thank god we don't have to worry about 12 bit integers any more). This is especially true for databases. Also, if you're doing bit masking or bit shifting, you might have an algorithm that specifies the length of the integer. By default, and on platforms where memory is cheap, you should probably use integers sized to your processor.

Those 2 bytes add up. Your data types eventually become part of array or databases or messages, they go into data files. It adds up to a lot of wasted space and on embedded systems it can make a huge difference.
When we do peer review of our code at work, if something is sized incorrectly, it will be written as a discrepancy and must be corrected. If we find something that has a range of 1-1000 using an int32_t, it has to be corrected. The range must also be documented in a comment. Our department does not allow use of int, long, etc, we must use int32_t, int16_t, uint16_t, etc. so that the expected size is documented.
uint16_t conicAngle; // angle in tenths of a degree (range 0..3599)
or in Ada:
type Amplitude is range 0 .. 255; // signal amplitude from FPGA
Get in the habit of using what you need and no more and documenting what you need (if the language doesn't support it).
We are currently in the process of fixing a performance problem by resizing the data types in several messages, they have 32 bit fields that could be 8 or 16 bit. By resizing them appropriately we can reduce the message rate in half and improve our data throughput to meet the requirements.

Once upon a time, in the land of Earth, there existed devices called computers.
In the early days following the invention of "computers," there was limited storage in memory for fancy things like numbers and strings.
Billy, a programmer, was encouraged by the evil Wizard (his boss) to use the least amount of memory that he could!
Then one day, memory sizes got large enough that everyone could use 32-bit numbers if they wanted!
I could continue on, but all the other obvious things were already covered.

Related

What is the use of bit manipulation?

I searched for clear explanation with some example close to real life.
What is bit manipulation?
Why we need to use bit manipulation?
We can use bit manipulation in image processing as far as I know. Can anyone show me a simple problem which can be solved using bit manipulation?
I read about bit manipulation from some link:
Link 1
Link 2
In Link 2 Data compression is done using bit packing. Are there any difference between bit manipulation and bit packing?
It will be appreciable If anyone explain me with very simple example which have resemble to real life problem.

What is bit manipulation?
Bit manipulation usually refers to changing data using bit operators.
I think Wikipedia expains it good enough so I won't write another article.
https://en.wikipedia.org/wiki/Bit_manipulation
Bit manipulation is the act of algorithmically manipulating bits or
other pieces of data shorter than a word. Computer programming tasks
that require bit manipulation include low-level device control, error
detection and correction algorithms, data compression, encryption
algorithms, and optimization. For most other tasks, modern programming
languages allow the programmer to work directly with abstractions
instead of bits that represent those abstractions. Source code that
does bit manipulation makes use of the bitwise operations: AND, OR,
XOR, NOT, and possibly other operations analogous to the boolean
operators; there are also bit shifts and operations to count ones and
zeros, find high and low one or zero, set, reset and test bits,
extract and insert fields, mask and zero fields, gather and scatter
bits to and from specified bit positions or fields. Integer arithmetic
operators can also effect bit-operations in conjunction with the other
operators.
Bit manipulation, in some cases, can obviate or reduce the need to
loop over a data structure and can give many-fold speed ups, as bit
manipulations are processed in parallel.
Why we need to use bit manipulation?
Because it is fast and often we don't have another choice. For example in microcontrollers, pretty much everything is controlled by manipulating the bits of 8 bit registers. So an output would go high if you set a certain bit 1.
Bit packing is a compression technique that tries to minimize the number of bits necessary to represent a number. While you'll use bit operators to implement it, it is not the same as "bit-manipulation". It's just one of many many use cases for bit-manipulation.
Can anyone show me a simple problem which can be solved using bit manipulation?
Let's say you have a rgb touple rgb = 0xa1fc03 and you want to make the green channel 0.
rgb_without_green = rgb & 0xFF00FF
We've bitwise ANDed the value with 0xFF00FF.
Now rgb is 0xa10003.
Basically any operation boils down to bit manipulation. For most of them you just have convenient solutions. Say instead of 0x00000011 << 0x0000101 you write 3 * 32
Or have a look at this where the addition of two integers is implemented using bit operations. Add two integers using only bitwise operators?
Edit due to comment
How bitwise AND operation between 0xa1fc03 and 0xFF00FF gives 0xa10003
? Just I need to see how to do this calculation
Bitwise AND means that you AND all the bits of both numbers.
1 AND 1 -> 1
0 AND 1 -> 0
1 AND 0 -> 0
0 AND 0 -> 0
So
0xa1fc03 -> 0b101000011111110000000011
0xff00ff -> 0b111111110000000011111111
AND -> 0b101000010000000000000011
0b101000010000000000000011 -> 0xa10003
With a bit more expierience you know that 0xFF is 0b11111111 so you instantly know that 0xa1fc03 AND 0xff00ff is 0xa1003 becaue you keep everything that is masked with FF and set everything 0 that is masked with 00.
There are countless resources available. You should not have to ask me how to bitwise AND two numbers. Please do your own research.

I am busy designing a new barcode symbology for real-life applications. It uses a checksum value, which is computed on slices of k bits of large numbers. Hence intense bit manipulation.

Outputting values from CAMPARY

I'm trying to use the CAMPARY library (CudA Multiple Precision ARithmetic librarY). I've downloaded the code and included it in my project. Since it supports both cpu and gpu, I'm starting with cpu to understand how it works and make sure it does what I need. But the intent is to use this with CUDA.
I'm able to instantiate an instance and assign a value, but I can't figure out how to get things back out. Consider:
#include <time.h>
#include "c:\\vss\\CAMPARY\\Doubles\\src_cpu\\multi_prec.h"
int main()
{
const char *value = "123456789012345678901234567";
multi_prec<2> a(value);
a.prettyPrint();
a.prettyPrintBin();
a.prettyPrintBin_UnevalSum();
char *cc = a.prettyPrintBF();
printf("\n%s\n", cc);
free(cc);
}
Compiles, links, runs (VS 2017). But the output is pretty unhelpful:
Prec = 2
Data[0] = 1.234568e+26
Data[1] = 7.486371e+08
Prec = 2
Data[0] = 0x1.987bf7c563caap+86;
Data[1] = 0x1.64fa5c3800000p+29;
0x1.987bf7c563caap+86 + 0x1.64fa5c3800000p+29;
1.234568e+26 7.486371e+08
Printing each of the doubles like this might be easy to do, but it doesn't tell you much about the value of the 128 number being stored. Performing highly accurate computations is of limited value if there's no way to output the results.
In addition to just printing out the value, eventually I also need to convert these numbers to ints (I'm willing to try it all in floats if there's a way to print, but I fear that both accuracy and speed will suffer). Unlike MPIR (which doesn't support CUDA), CAMPARY doesn't have any associated multi-precision int type, just floats. I can probably cobble together what I need (mostly just add/subtract/compare), but only if I can get the integer portion of CAMPARY's values back out, which I don't see a way to do.
CAMPARY doesn't seem to have any docs, so it's conceivable these capabilities are there, and I've simply overlooked them. And I'd rather ask on the CAMPARY discussion forum/mail list, but there doesn't seem to be one. That's why I'm asking here.
To sum up:
Is there any way to output the 128bit ( multi_prec<2> ) values from CAMPARY?
Is there any way to extract the integer portion from a CAMPARY multi_prec? Perhaps one of the (many) math functions in the library that I don't understand computes this?

There are really only 2 possible answers to this question:
There's another (better) multi-precision library that works on CUDA that does what you need.
Here's how to modify this library to do what you need.
The only people who could give the first answer are CUDA programmers. Unfortunately, if there were such a library, I feel confident talonmies would have known about it and mentioned it.
As for #2, why would anyone update this library if they weren't a CUDA programmer? There are other, much better multi-precision libraries out there. The ONLY benefit CAMPARY offers is that it supports CUDA. Which means the only people with any real motivation to work with or modify the library are CUDA programmers.
And, as the CUDA programmer with the most vested interest in solving this, I did figure out a solution (albeit an ugly one). I'm posting it here in the hopes that the information will be of value to future CAMPARY programmers. There's not much information out there for this library, so this is a start.
The first thing you need to understand is how CAMPARY stores its data. And, while not complex, it isn't what I expected. Coming from MPIR, I assumed that CAMPARY stored its data pretty much the same way: a fixed size exponent followed by an arbitrary number of bits for the mantissa.
But nope, CAMPARY went a different way. Looking at the code, we see:
private:
double data[prec];
Now, I assumed that this was just an arbitrary way of reserving the number of bits they needed. But no, they really do use prec doubles. Like so:
multi_prec<8> a("2633716138033644471646729489243748530829179225072491799768019505671233074369063908765111461703117249");
// Looking at a in the VS debugger:
[0] 2.6337161380336443e+99 const double
[1] 1.8496577979210756e+83 const double
[2] 1.2618399223120249e+67 const double
[3] -3.5978270144026257e+48 const double
[4] -1.1764513205926450e+32 const double
[5] -2479038053160511.0 const double
[6] 0.00000000000000000 const double
[7] 0.00000000000000000 const double
So, what they are doing is storing the max amount of precision possible in the first double, then the remainder is used to compute the next double and so on until they encompass the entire value, or run out of precision (dropping the least significant bits). Note that some of these are negative, which means the sum of the preceding values is a bit bigger than the actual value and they are correcting it downward.
With this in mind, we return to the question of how to print it.
In theory, you could just add all these together to get the right answer. But kinda by definition, we already know that C doesn't have a datatype to hold a value this size. But other libraries do (say MPIR). Now, MPIR doesn't work on CUDA, but it doesn't need to. You don't want to have your CUDA code printing out data. That's something you should be doing from the host anyway. So do the computations with the full power of CUDA, cudaMemcpy the results back, then use MPIR to print them out:
#define MPREC 8
void ShowP(const multi_prec<MPREC> value)
{
multi_prec<MPREC> temp(value), temp2;
// from mpir at mpir.org
mpf_t mp, mp2;
mpf_init2(mp, value.getPrec() * 64); // Make sure we reserve enough room
mpf_init(mp2); // Only needs to hold one double.
const double *ptr = value.getData();
mpf_set_d(mp, ptr[0]);
for (int x = 1; x < value.getPrec(); x++)
{
// MPIR doesn't have a mpf_add_d, so we need to load the value into
// an mpf_t.
mpf_set_d(mp2, ptr[x]);
mpf_add(mp, mp, mp2);
}
// Using base 10, write the full precision (0) of mp, to stdout.
mpf_out_str(stdout, 10, 0, mp);
mpf_clears(mp, mp2, NULL);
}
Used with the number stored in the multi_prec above, this outputs the exact same value. Yay.
It's not a particularly elegant solution. Having to add a second library just to print a value from the first is clearly sub-optimal. And this conversion can't be all that speedy either. But printing is typically done (much) less frequently than computing. If you do an hour's worth of computing and a handful of prints, the performance doesn't much matter. And it beats the heck out of not being able to print at all.
CAMPARY has a lot of shortcomings (undoced, unsupported, unmaintained). But for people who need mp numbers on CUDA (especially if you need sqrt), it's the best option I've found.

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua and i tried several things but i see myself confronted with the following problem:
The tables of Lua are to small for this scenario. If I just want to create a table with all numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big I admit)...
max = 600851475143
numbers = {}
for i=1, max do
table.insert(numbers, i)
end
If I execute this script on my Windows machine there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. With Lua 5.3 running on my Linux machine I tried that too, error was just killed. So it is pretty obvious that lua can´t handle the amount of entries.
I don´t really know whether it is just impossible to store that amount of entries in a lua table or there is a simple solution for this (tried it by using a long string aswell...)? And what exactly is the largest amount of entries in a Lua table?
Update: And would it be possible to manually allocate somehow more memory for the table?
Update 2 (Solution for second question): The second question is an easy one, I just tested it by running every number until the program breaks: 33.554.432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? Because 64 Bit per number * 2^25 = 2147483648 Bits which are exactly 2 GB. This seems to be the standard memory allocation size for the Lua for Windows 32 Bit compiler.
P.S. You may have noticed that this number is from the Euler Project Problem 3. Yes I am trying to accomplish that. Please don´t give specific hints (..). Thank you :)

The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 73 gigabytes, which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.

Lua uses double precision floats to represent numbers. That's 64bits per number.
600851475143 numbers result in almost 4.5 Terabytes of memory.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you would have read the linked Wikipedia article carefully you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
...

Why is the smallest value that can be stored is a Byte(8bit) & not a Bit(1bit)?

Why is the smallest value that can be stored a Byte(8bit) & not a Bit(1bit) in memory?
Even booleans are stored as Bytes. Will we ever bump the smallest number to 32 or 64bits like register's on the CPU?
EDIT: To clarify as many answers seemed confused about the nature of questing. This question is about why isn't a byte 7-bit, 1-bit, 32-bit, etc (not why lower bit primitives must fit within the hardware's byte at min). Is the 8-bit byte simply historical as some hardware has 10-bit bytes for example. Or is there a mathematical reason 8-bit is ideal vs say 10-bit for general processing?

The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays the data is most often aligned to dword (32-bits) boundary anyway. Moreover, some hardware (ARM, for example), can't access misaligned multibyte variables, i.e. 16-bit word can't "cross" dword boundary - exception will be thrown.

Because computers address memory at the byte level, so anything smaller than a byte is not addressable.

The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
Some older exotic architectures actually did have different a "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.

Assume that we have native language that consist of 2 character such as a , b
to distinguish two characters we need at least 1 bit for example 0 to represent char a and 1 to represent char b
so that if we count number of characters and special characters and symbols, there are 128 character and to distinguish one character from another, you need log2(128) = 7 bit and 8th bit for transmission

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine, we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!

In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory mapped register, which is not the case for all systems, but it sounds like is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values might be uint16_t, INT16U,
// or something)
uint16_t volatile* pReg = (int16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"

Each register in this hardware is exposed as a two-byte array, the first element is aligned at a two-byte boundary (its address is even). memcpy() runs a cycle and copies one byte at each iteration, so it copies from these registers this way (all loops unrolled, char is one byte):
*((char*)target) = *((char*)register);// evenly aligned - address is always even
*((char*)target + 1) = *((char*)register + 1);//oddly aligned - address is always odd
However the second line works incorrectly for some hardware specific reasons. If you copy two bytes at a time instead of one at a time, it is instead done this way (short int is two bytes):
*((short int*)target) = *((short*)register;// evenly aligned
Here you copy two bytes in one operation and the first byte is evenly aligned. Since there's no separate copying from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are venely aligned and copies in tow bytes chunks if they are.

If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side affects, depending on the register and its function, of course, so it's important to access hardware registers with the proper sized access so you can read the entire register in one go.

Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
void wordcpy(short *dest, const short *src, size_t bytecount)
{
int i;
for (i = 0; i < bytecount/2; ++i)
*dest++ = *src++;
}

I think all the detail is contained in that thread you posted so I'll try and break it down a little;
Specifically;
If you access a 16-bit hardware register using two 8-bit
accesses, the high-order byte doesn't read correctly (it
always read as 0xFF for me). This is fair enough since
TI's docs state that 16-bit hardware registers must be
read and written using 16-bit-wide instructions, and
normally would be, unless you're using memcpy() to
read them.
So the problem here is that the hardware registers only report the correct value if their values are read in a single 16-bit read. This would be equivalent to doing;
uint16 value = *(regAddress);
This reads from the address into the value register using a single 16-byte read. On the other hand you have memcpy which is copying data a single-byte at a time. Something like;
while (n--)
{
*(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8-bits (1 byte) at a time, resulting in the values being invalid.
The solution posted in that thread is to use a version of memcpy that will copy the data using 16-bit reads whereever the source and destination are a6-bit aligned.

What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.

Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory mapped hardware registers. That is a limitation of the processor - there's simply no way to get a valid value reading a byte at at time.
The suggested solution is to write your own memcpy() which only works on word-aligned addresses, and reads 16-bit words at a time. This is fairly straightforward - the link gives both a c and an assembly version. The only gotcha is to make sure you always do the 16 bit copies from validly aligned address. You can do that in 2 ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart