String constructor differences between operating systems - character-encoding

I have the following code:
byte[] b = new byte[len]; //len is preset to 157004 in this example
//fill b with data by reading from a socket
String pkt = new String(b);
System.out.println(b.length + " " + pkt.length());
On Ubuntu this prints two different values, 157004 and 147549, but on OS X the two values are the same. The string is actually an image produced by the ImageIO library. On OS X I am able to decode the string back into an image just fine, but on Ubuntu I am not.
I am using version 1.6.0_45 on OS X, and tried the same version on Ubuntu, in addition to Oracle jdk 7 and the default openjdk.
I noticed that I can get the string length to equal the byte array length by decoding with Latin-1:
String pkt = new String(b,"ISO-8859-1");
However this does not make it possible to decode the image, and understanding what's going on can be difficult as the string looks like garbage to me.
I'm perplexed by the fact that I'm using the same jdk version, but a different OS.

This string is actually an image being transmitted by the ImageIO library.
And that's where you're going wrong.
An image is not text data - it's binary data. If you really need to encode it in a string, you should use base64. Personally I like the public domain base64 encoder/decoder at iharder.net.
This isn't just true for images - it's true for all binary data which isn't known to be text in a particular encoding... whether that's sound, movies, Word documents, encrypted data etc. Never just treat it as if it were just encoded text - it's a recipe for disaster.
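As a rough illustration of the base64 suggestion, here is a minimal sketch. It assumes Java 8 or later, where java.util.Base64 is part of the JDK (the question uses Java 6 and 7, where a third-party encoder such as the one mentioned above would be needed instead). The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of ImageIO or of the original code.
final class BinaryAsText {
    // Encode arbitrary bytes as ASCII-only text that survives any charset.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the exact original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // First bytes of a typical JPEG file, used here only as sample binary data.
        byte[] image = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10};
        String pkt = toText(image);
        System.out.println(pkt + " round-trips: " + Arrays.equals(image, fromText(pkt)));
    }
}
```

The encoded string is about a third longer than the raw bytes, which is the price of making it safe to treat as text.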

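Ubuntu uses UTF-8 as its default encoding, and new String(b) decodes the bytes with the platform default. UTF-8 is a variable-length encoding: a multi-byte sequence decodes to a single character, and bytes that are not valid UTF-8 are typically replaced with a substitution character. The string length therefore differs from the byte count, and the original bytes cannot be recovered from the string. This is the source of the difference, but for the solution I defer to Jon's answer.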
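The effect can be reproduced on any platform by naming the charset explicitly instead of relying on the default. This is only a small sketch; the class name and the four sample bytes are invented for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class showing why the lengths differ.
final class DecodeLengths {
    // Number of chars produced when the bytes are decoded as UTF-8.
    static int utf8Length(byte[] b) {
        return new String(b, StandardCharsets.UTF_8).length();
    }

    // Number of chars produced when the bytes are decoded as ISO-8859-1 (one char per byte).
    static int latin1Length(byte[] b) {
        return new String(b, StandardCharsets.ISO_8859_1).length();
    }

    // True only if decoding as UTF-8 and encoding again gives back the same bytes.
    static boolean survivesUtf8(byte[] b) {
        byte[] again = new String(b, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(b, again);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is a valid two-byte UTF-8 sequence; 0xFF is never valid in UTF-8.
        byte[] b = {(byte) 0xC3, (byte) 0xA9, (byte) 0xFF, 0x41};
        System.out.println(b.length + " " + utf8Length(b) + " " + latin1Length(b));
        System.out.println("bytes survive UTF-8 round trip: " + survivesUtf8(b));
    }
}
```

ISO-8859-1 keeps the length equal because every byte value maps to exactly one character, but the result is still binary data dressed up as text.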

Related

How to convert hexadecimal data (stored in a string variable) to an integer value

Edit (abstract)
I tried to interpret Char/String data as Byte, 4 bytes at a time. This was because I could only get TComport/TDatapacket to interpret streamed data as String, not as any other data type. I still don't know how to get the Read method and OnRxBuf event handler to work with TComport.
Problem Summary
I'm trying to get data from a mass spectrometer (MS) using some Delphi code. The instrument is connected with a serial cable and follows the RS232 protocol. I am able to send commands and process the text-based outputs from the MS without problems, but I am having trouble with interpreting the data buffer.
Background
From the user manual of this instrument:
"With the exception of the ion current values, the output of the RGA are ASCII character strings terminated by a linefeed + carriage return terminator. Ion signals are represented as integers in units of 10^-16 Amps, and transmitted directly in hex format (four byte integers, 2's complement format, Least Significant Byte first) for maximum data throughput."
I'm not sure whether (1) hex data can be stored properly in a string variable. I'm also not sure how to (2) implement 2's complement in Delphi and (3) the Least Significant Byte first.
Following @David Heffernan's advice, I went and revised my data types. Attempting to harvest binary data from characters doesn't work, because not all values from 0-255 can be properly represented. You lose data along the way, basically, especially if your data is represented 4 bytes at a time.
The solution for me was to use the Async Professional component instead of Denjan's Comport lib. It handles data streams better and has a built-in log that I could use to figure out how to interpret streamed responses from the instrument. It's also better documented. So, if you're new to serial communications (like I am), rather give that a go.
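For readers wondering what "four byte integers, 2's complement format, Least Significant Byte first" means in practice, here is a small sketch in Java rather than Delphi. It assumes the four raw bytes have already been received as bytes (not as characters); the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; it is not part of TComport or Async Professional.
final class IonCurrentReader {
    // Interpret 4 bytes starting at offset as a signed 32-bit integer,
    // least significant byte first (little-endian). Java's int is already
    // two's complement, so negative values come out correctly.
    static int readInt32LE(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] positive = {0x10, 0x27, 0x00, 0x00};
        byte[] negative = {(byte) 0xFE, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        System.out.println(readInt32LE(positive, 0) + " " + readInt32LE(negative, 0));
    }
}
```

According to the manual excerpt, the resulting integer is in units of 10^-16 Amps.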

Compression of 2-D array on the fly with iOS

I am currently using Swift to store some data on iOS. The values come as a 2-D integer array, defined as an [[Int]]. I need to save these integer arrays to disk. Currently, I am using the following function to do so:
func writeDataToFile(data: [[Int]], filename: String){
let fullfile = NSString(string: self.folderpath).stringByAppendingPathComponent(filename+".txt")
var fh = NSFileHandle(forWritingAtPath: fullfile)
if fh == nil{
NSFileManager.defaultManager().createFileAtPath(fullfile, contents: nil, attributes: nil)
fh = NSFileHandle(forWritingAtPath: fullfile)
}
fh?.writeData("Time: \(filename)\n".dataUsingEncoding(NSUTF16StringEncoding)!)
fh?.writeData("\(data)".dataUsingEncoding(NSUTF16StringEncoding)!)
fh?.closeFile()
}
Currently this function works just fine, but it produces files that are relatively large (1.1mb each - which when you are writing them at 1 Hz, gets huge fast). The arrays written have a fixed size and the values will be from 20000 < x < 35000. Is there a way to compress this data on the fly such that I can later read the data into say Python or some other language? Would it just be easier to use some library like Zip to compress the files into zips after writing? Is there some way to transform the data (without loss of data/fidelity) into an image (for compression purposes, not viewing purposes). There is some metadata that I would like to store along with the 2-D array, such as a timestamp.
Since you are currently saving those as string values, the simplest and fastest size reduction would be to save them as binary values (or base64 encoded strings). Then you could convert each of your int values into 2 bytes (an unsigned 2-byte value can store up to 65535) and save the values that way. That would go from 5 bytes per int value down to 2 bytes per int value. Immediate savings of 60%.
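The packing idea can be sketched as follows. The example is in Java rather than Swift, and it assumes a rectangular array whose values all fit in 0..65535; the class name and the little-endian byte order are choices made for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical packer: 2 bytes per value, little-endian, row after row.
final class Pack16 {
    static byte[] pack(int[][] grid) {
        int rows = grid.length;
        int cols = rows == 0 ? 0 : grid[0].length;
        ByteBuffer out = ByteBuffer.allocate(rows * cols * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (int[] row : grid) {
            for (int v : row) {
                if (v < 0 || v > 0xFFFF) {
                    throw new IllegalArgumentException("value does not fit in 16 bits: " + v);
                }
                out.putShort((short) v); // keeps the low 16 bits
            }
        }
        return out.array();
    }

    // The reader has to know the dimensions, since they are not stored in the data.
    static int[][] unpack(byte[] data, int rows, int cols) {
        ByteBuffer in = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int[][] grid = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                grid[r][c] = in.getShort() & 0xFFFF; // undo the sign extension
            }
        }
        return grid;
    }
}
```

A file holding only these bytes could presumably be read in Python with numpy.fromfile(path, dtype='<u2'); a timestamp or other metadata would need its own fixed-size header in front of the values.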
For the Base64 encoding I use something I found on the internet called NSData+Base64. But in looking that up I just read:
In the iOS 7 and Mac OS 10.9 SDKs, Apple introduced new base64 methods on NSData that make it unnecessary to use a 3rd party base 64 decoding library. What's more, they exposed access to private base64 methods that are retrospectively available back as far as iOS 4 and Mac OS X 10.6.
You could go much further into the compression by realizing that data from one element to the next will likely not change by the entire range, since heat maps will always be gradients. Then you could save each array as the difference from the previous element and likely get that down to a single signed byte (-128 to 127) per element. But that may lose precision if you are viewing something with a very fast heat gradient (or using a low resolution camera), because a larger jump no longer fits in one byte.
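A minimal sketch of that difference encoding, again in Java for illustration. It assumes values in 0..65535 and, instead of silently losing precision, it refuses rows where a jump does not fit in one signed byte, so the caller can fall back to the plain 2-byte format. All names are hypothetical.

```java
// Hypothetical delta coder for one row of values.
final class DeltaRow {
    // Layout: first value as 2 bytes (high byte first), then one signed byte per later value.
    static byte[] encode(int[] row) {
        if (row.length == 0) {
            return new byte[0];
        }
        byte[] out = new byte[row.length + 1];
        out[0] = (byte) (row[0] >>> 8);
        out[1] = (byte) row[0];
        for (int i = 1; i < row.length; i++) {
            int delta = row[i] - row[i - 1];
            if (delta < Byte.MIN_VALUE || delta > Byte.MAX_VALUE) {
                throw new IllegalArgumentException("delta " + delta + " does not fit in one byte");
            }
            out[i + 1] = (byte) delta;
        }
        return out;
    }

    static int[] decode(byte[] data) {
        if (data.length == 0) {
            return new int[0];
        }
        int[] row = new int[data.length - 1];
        row[0] = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        for (int i = 1; i < row.length; i++) {
            row[i] = row[i - 1] + data[i + 1]; // the byte is signed, so negative deltas work
        }
        return row;
    }
}
```

Runs of small deltas also tend to compress well with a general-purpose compressor afterwards.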
If you eventually need to get into compression, I use GTMNSData+zlib and decompress it in a c# webservice. So with a little bit of work it is cross platform.
A proper answer for this would require more information about the problem domain. Most likely, 2D arrays are the wrong data structure for this but it's hard to tell without more info.
What's the data stored in these arrays?
Apple has had a compression library since last year:
https://developer.apple.com/library/ios/documentation/Performance/Reference/Compression/index.html

Can I define in what endianness I read from NSData?

I have some files written on an Android device, which wrote the bytes in big endian.
Now I am trying to read these files on iOS, where I need them in little endian.
I can make a for loop and
int temp;
for(...) {
[readFile getBytes:&temp range:NSMakeRange(offset, sizeof(int))];
target_array[i] = CFSwapInt32BigToHost(temp);
// read more like that
}
However, it feels silly to read every single value and swap it before I can store it. Can I tell the NSData that I want the values read with a certain byte order, so that I can store them directly where they should be?
(and save some time, as the data can be quite large)
I also worry about errors when some datatype changes and I forget to use the 16-bit swap instead of the 32-bit one.
No, you need to swap every value. NSData is just a series of bytes with no value or meaning. It is your app that understands the meaning so it is your code logic that must swap each set of bytes as needed.
The data could be filled with all kinds of values of different sizes. 8-bit values, 16-bit values, 32-bit values, etc. as well as string data or just a stream of bytes that don't need any ordering at all. And the NSData can contain any combination of these values.
Given all of this, there is no simple way to tell NSData that the bytes need to be treated in a specific endianness.
If your data is, for example, nothing but 32-bit integer values stored in a specific endianness and you want to extract an array of bytes, create a helper class that does the conversion.
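To show the shape such a helper might take, here is a sketch in Java (not the NSData API). It assumes the data is nothing but 32-bit integers written in big-endian order; the class name is made up. The byte order is stated once, in one place, which also addresses the worry about mixing up 16-bit and 32-bit swaps.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical helper that hides the byte-order handling from the caller.
final class BigEndianInts {
    // Decode all complete 4-byte groups; trailing bytes that do not form a full int are ignored.
    static int[] decode(byte[] raw) {
        IntBuffer view = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN).asIntBuffer();
        int[] out = new int[view.remaining()];
        view.get(out);
        return out;
    }
}
```

Every value is still converted individually under the hood; the helper only moves that loop out of the calling code.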

Reading TIFF files

I need to read and interpret a binary file containing a TIFF image. I know there exist readers for doing this but I want to go the hard way. I found the TIFF format description and need to parse the binary file in small chunks. Assume I was able to read in memory the complete binary file. This means that I have a variable containing one long list of bytes.
I know via the format definition what the meaning is of the different groups of n bytes.
How can one define character variables with different lengths (sometimes 2, sometimes 3, sometimes 4 etc.) so that the variable address points to the right position in the image variable array?
In other words, assume my image is loaded into an array Image containing all bytes of the file.
The first 2 bytes I want to load in a string with length 2 bytes so that I can just link the address pointer to the first position in the Image array and automatically the first 2 bytes are associated with the first character string. A second string of 4 bytes would have another meaning and so I make the address for the second string of 4 bytes point to the 3rd position of the Image array.
Is this feasible in C++? I remember that this was a normal way of working for dynamical memory allocation in Fortran 77 in a simulation code I analysed a long time ago.
Thanks in advance for the hints!
Regards,
Stefan
The C++ language is easily capable of processing TIFF files from a byte array. The idea you have in mind is basically correct, but there are a few problems with it. C strings are zero-terminated, and the strings which appear in TIFF files are not necessarily zero-terminated since their length is specified explicitly. It really is simpler to create a dedicated data structure to hold the TIFF-specific data fields and then parse the binary data into the structure. Your method will immediately run into trouble with the Motorola/Intel byte order issue (big-endian "MM" files versus little-endian "II" files) if your machine has the opposite endianness.
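As an illustration of parsing into a dedicated structure, here is a sketch of reading just the 8-byte TIFF header. It is written in Java rather than C++ to keep the byte-order handling short, and the class and field names are invented for the example. The layout it relies on comes from the TIFF specification: two byte-order characters, the number 42 as a 16-bit value, then the 32-bit offset of the first image file directory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical holder for the fixed-size TIFF header fields.
final class TiffHeader {
    final ByteOrder order;      // byte order used by every other field in the file
    final long firstIfdOffset;  // offset of the first image file directory, unsigned 32-bit

    private TiffHeader(ByteOrder order, long firstIfdOffset) {
        this.order = order;
        this.firstIfdOffset = firstIfdOffset;
    }

    static TiffHeader parse(byte[] image) {
        if (image.length < 8) {
            throw new IllegalArgumentException("too short for a TIFF header");
        }
        ByteOrder order;
        if (image[0] == 'I' && image[1] == 'I') {
            order = ByteOrder.LITTLE_ENDIAN;   // Intel order
        } else if (image[0] == 'M' && image[1] == 'M') {
            order = ByteOrder.BIG_ENDIAN;      // Motorola order
        } else {
            throw new IllegalArgumentException("missing II/MM byte-order mark");
        }
        ByteBuffer buf = ByteBuffer.wrap(image).order(order);
        int magic = buf.getShort(2) & 0xFFFF;
        if (magic != 42) {
            throw new IllegalArgumentException("bad TIFF magic number: " + magic);
        }
        long offset = buf.getInt(4) & 0xFFFFFFFFL;
        return new TiffHeader(order, offset);
    }
}
```

The same pattern extends to the directory entries: read each field at its known offset with the file's byte order, rather than overlaying variables on the raw array.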

how to convert an image's stream to GUID

We ingest a lot of images from external sources. I would like to ensure that already ingested images are not re-ingested in the backend. For this I was thinking of generating a GUID based on the image's stream, as follows:
File.ReadAllBytes()
or
public byte[] imageToByteArray(System.Drawing.Image imageIn)
{
MemoryStream ms = new MemoryStream();
imageIn.Save(ms,System.Drawing.Imaging.ImageFormat.Gif);
return ms.ToArray();
}
I was then thinking of making this into a CLR (if at all necessary) then save the GUID with the metadata of the image in SQL server. Not sure how accurately unique that GUID would be.
Any inputs?
Thanks
As @Mark Ransom suggested, you're confusing a GUID and a hash. A GUID is an identifier that is supposed to be unique. It's independent of any inputs, and is just something you can generate. A hash is derived from its input: identical inputs always produce identical hashes, and different inputs produce different hashes in the vast majority of cases.
A common hash algorithm to use is MD5. Here's a link to a similar question on SO.
Alternatively, you could avoid writing code by using existing command-line utilities, such as md5sum, sort and uniq.
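For a rough idea of what the hashing step looks like in code, here is a sketch in Java rather than C#. It assumes the raw file bytes are already in memory, and the class and method names are made up. Note that it only detects byte-identical files; the same picture saved in another format hashes differently.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that turns file bytes into a fixed-length key for duplicate checks.
final class ImageHash {
    // Returns the MD5 digest of the data as 32 lowercase hex characters.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

The hex string can be stored next to the image metadata and looked up before each ingest.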
Here's one solution for a "fingerprint string" algorithm.
As the post says, you will often want the same visual to map to the same string even if the file formats are different, or it's a different size. So this algorithm squashes the image into an 8x8 thumbnail with a 62-color palette (you could probably achieve the same thing with ImageMagick).
This transform leaves you with an image of 64 values ranging from 1 to 62.
In other words, a short base-62 string.

Resources