Convert bits to files - storage

I am currently working on something for storage and I am having issues. Let's say I have a text file full of 1s and 0s; how would one convert these back to their original file? I don't mind if I have to use a program.

Read every 8 bits as one byte, and write those raw bytes out as a file.
Here is a simplified Node.js example:
const fs = require('fs');

// Read the text file and keep only the '0'/'1' characters (drop newlines and spaces).
const bits = fs.readFileSync('./input.txt', 'utf8').replace(/[^01]/g, '');

// Convert each group of 8 bits into one byte value.
const bytes = [];
for (let i = 0; i < bits.length; i += 8) {
  bytes.push(parseInt(bits.slice(i, i + 8), 2));
}

// Write the raw bytes out as the reconstructed file.
fs.writeFileSync('./output.bin', Buffer.from(bytes));
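As a quick round-trip check (using the same file names, and assuming the number of bits is a multiple of 8), you can re-expand the written bytes into a bit string and compare it with the cleaned-up input:
const fs = require('fs');

// Round-trip check: re-expand output.bin into a bit string and compare it
// with the input after stripping any non-bit characters.
const written = fs.readFileSync('./output.bin');
const roundTrip = [...written].map(b => b.toString(2).padStart(8, '0')).join('');
const original = fs.readFileSync('./input.txt', 'utf8').replace(/[^01]/g, '');
console.log(roundTrip === original); // true when the conversion was lossless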

Related

How to read large portions of a file without exhausting memory in Rust?

I'm trying to re-write a portion of the GNU coreutils 'split' tool, to split a file into multiple parts of approximately the same size.
Part of my program reads large portions of a file just to write them into another file. On the memory side, I don't want to hold these portions in memory, because they can be anywhere from zero bytes up to several gigabytes long.
Here's an extract of the code I wrote using a BufReader:
let file = File::open("myfile.txt")?;
let mut buffer = Vec::new();
let mut reader = BufReader::new(&file);
let mut handle = reader.take(length); // here length can be 10 bytes or 1 GB!
let read = handle.read_to_end(&mut buffer)?;
I feel like I'm mapping the whole chunk of the file into memory because of the read_to_end(&mut buffer) call. Am I? If not, does it mean the BufReader is doing its job, and can I just accept that it's doing some kind of magic (abstraction) allowing me to "read" an entire portion of a file without really mapping it into memory? Or am I misusing these concepts in my code?
Yes, you're reading the whole chunk into memory. You can inspect buffer to confirm. If it has length bytes then there you go; there are length bytes in memory. There's no way BufReader could fake that.
Yes, if we look into the source of the read_to_end function we can see that the buffer you give it will be extended to hold the new data as it comes in if the available space in the vector is exhausted.
And even just in the docs, Rust tells us that it reads everything until EOF into the buffer:
Read all bytes until EOF in this source, placing them into buf
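A tiny illustration of that behaviour (hypothetical file name): the Vec grows to hold everything the take-limited reader yields.
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    // The buffer ends up holding every byte read_to_end pulled in.
    let mut buffer = Vec::new();
    File::open("myfile.txt")?.take(1_000_000).read_to_end(&mut buffer)?;
    println!("{} bytes are now held in memory", buffer.len());
    Ok(())
}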
You can also take a look at the code presented in this question as a starting point using a BufReader:
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    const CAP: usize = 1024 * 128;
    let file = File::open("my.file")?;
    let mut reader = BufReader::with_capacity(CAP, file);

    loop {
        let length = {
            let buffer = reader.fill_buf()?;
            // do stuff with buffer here
            buffer.len()
        };
        if length == 0 {
            break;
        }
        reader.consume(length);
    }

    Ok(())
}
A better approach might be to set up an unbuffered Reader and read bytes directly into a small buffer, checking that you are not exceeding whatever byte or line bounds the user specified, and writing the buffer contents to the output file.
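A minimal sketch of that idea (file names, the part size, and the 64 KiB buffer are placeholders): stream up to length bytes through a small fixed buffer, so memory use stays constant no matter how large the part is.
use std::cmp;
use std::fs::File;
use std::io::{self, Read, Write};

// Copy up to `length` bytes from `reader` to `writer` through a small,
// fixed-size scratch buffer.
fn copy_part<R: Read, W: Write>(reader: &mut R, writer: &mut W, mut length: u64) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024]; // 64 KiB scratch space
    let mut copied = 0u64;
    while length > 0 {
        let want = cmp::min(buf.len() as u64, length) as usize;
        let n = reader.read(&mut buf[..want])?;
        if n == 0 {
            break; // hit EOF before `length` bytes
        }
        writer.write_all(&buf[..n])?;
        copied += n as u64;
        length -= n as u64;
    }
    Ok(copied)
}

fn main() -> io::Result<()> {
    let mut input = File::open("myfile.txt")?;
    let mut part = File::create("part_0")?;
    copy_part(&mut input, &mut part, 10 * 1024 * 1024)?; // first 10 MiB of the input
    Ok(())
}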

How to encode to UTF16 Little Endian in Dart?

I am attempting to manipulate some system variables used by a program using Dart. I have run into the problem that Dart's utf package has been discontinued, and I have not found any way to encode to UTF-16 Little Endian for a File.write. Is there a library that can do a byte to UTF-16 LE conversion in Dart? I would use the utf package anyway, but it is not null safe. I may end up reusing the utf package source code, but I am checking here to see if there is a native (or pub) implementation I have missed, as I am new to the world of UTF and byte conversions.
My goal:
encodeAsUtf16le(String s);
I do not need to write a BOM.
Dart Strings internally use UTF-16. You can use String.codeUnits to get the UTF-16 code units and then write them in little-endian form:
var s = '\u{1F4A9}';
var codeUnits = s.codeUnits;
var byteData = ByteData(codeUnits.length * 2);
for (var i = 0; i < codeUnits.length; i += 1) {
  byteData.setUint16(i * 2, codeUnits[i], Endian.little);
}
var bytes = byteData.buffer.asUint8List();
await File('output').writeAsBytes(bytes);
or assume that you're running on a little-endian system:
var s = '\u{1F4A9}';
var codeUnits = s.codeUnits;
var bytes = Uint16List.fromList(codeUnits).buffer.asUint8List();
await File('output').writeAsBytes(bytes);
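Wrapped into the helper shape the question asks for, a small sketch based on the endian-safe version above:
import 'dart:typed_data';

// Sketch of the requested helper: encode `s` as UTF-16LE bytes (no BOM),
// regardless of the host byte order.
Uint8List encodeAsUtf16le(String s) {
  var codeUnits = s.codeUnits;
  var byteData = ByteData(codeUnits.length * 2);
  for (var i = 0; i < codeUnits.length; i++) {
    byteData.setUint16(i * 2, codeUnits[i], Endian.little);
  }
  return byteData.buffer.asUint8List();
}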
Also see https://stackoverflow.com/a/67802971/, which is about encoding UTF-16LE to Strings.
I also feel compelled to advise against writing UTF-16 to disk unless you're forced to by external requirements.

Writing Uint16List via IOSink.add, what's the result?

Trying to write audio samples to a file.
I have a list of 16-bit ints:
Uint16List _samples = Uint16List(0);
I add elements to this list as samples come in.
Then I can write to an IOSink like so:
IOSink _ios = ...
List<int> _toWrite = [];
_toWrite.addAll(_samples);
_ios.add(_toWrite);
or
_ios.add(_samples);
just works, with no type issues, despite the signature of add taking List<int> and not Uint16List.
From what I've read, Dart's int type is 64-bit.
Are both writes above identical? Do they produce packed 16-bit ints in this file?
A Uint16List is-a List<int>. It's a list of integers which truncates writes to 16 bits and always reads out 16-bit integers, but it is a list of integers.
If you copy those integers to a plain growable List<int>, it will contain the same integer values.
So, doing ios.add(_samples) will do the same as ios.add(_toWrite), and most likely neither does what you want.
The IOSink's add method expects a list of bytes. So, it will take a list of integers and assume that they are bytes. That means that it will only use the low 8 bits of each integer, which will likely sound awful if you try to play that back as a 16-bit audio sample.
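A quick way to see the effect described above (illustrative values; the masking mirrors what happens when 16-bit samples are handed to a byte sink):
import 'dart:typed_data';

void main() {
  // Each 16-bit sample keeps only its low byte when treated as a list of bytes.
  var samples = Uint16List.fromList([0x1234, 0xABCD]);
  var lowBytesOnly = samples.map((v) => v & 0xFF).toList();
  print(lowBytesOnly); // [52, 205] -> 0x34, 0xCD; the high bytes 0x12 and 0xAB are gone
}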
If you want to store all 16 bits, you need to figure out how to store each 16-bit value in two bytes. The easy choice is to just assume that the platform byte order is fine, and do ios.add(_samples.buffer.asUint8List(_samples.offsetInBytes, _samples.lengthInBytes)). This will make a view of the 16-bit data as twice as many bytes, then write those bytes.
The endianness of those bytes (is the high byte first or last) depends on the platform, so if you want to be safe, you can convert the bytes to a fixed byte order first:
if (Endian.host == Endian.little) {
  ios.add(
      _samples.buffer.asUint8List(_samples.offsetInBytes, _samples.lengthInBytes));
} else {
  var byteData = ByteData(_samples.length * 2);
  for (int i = 0; i < _samples.length; i++) {
    byteData.setUint16(i * 2, _samples[i], Endian.little);
  }
  var littleEndianData = byteData.buffer.asUint8List(0, _samples.length * 2);
  ios.add(littleEndianData);
}

TIFF: algorithm to decode PackBits

While working on a routine to open TIFF files generated by microscope software in Matlab, I got stuck on reading compressed images. At this point it's a matter of honour for me to understand how to decode TIFF data in the PackBits format.
With little experience in real computer science, I have trouble understanding the guidelines in the TIFF documentation, more specifically:
In the inverse routine, it is best to encode a 2-byte repeat run as a replicate run except when preceded and followed by a literal run. In that case, it is best to merge the three runs into one literal run. Always encode 3-byte repeats as replicate runs. That is the essence of the algorithm. Here are some additional rules:
• Pack each row separately. Do not compress across row boundaries.
• The number of uncompressed bytes per row is defined to be (ImageWidth + 7) / 8. If the uncompressed bitmap is required to have an even number of bytes per row, decompress into word-aligned buffers.
• If a run is larger than 128 bytes, encode the remainder of the run as one or more additional replicate runs.
source: https://www.fileformat.info/format/tiff/corion-packbits.htm
I understand how to implement the pseudocode, and I can decode a sample string compressed with PackBits in Matlab. However, I'm lost when parsing a chunk of a 16-bit greyscale TIFF file. My question is: how do I go about it? I don't really understand what a replicate run means, nor what a word-aligned buffer is.
When I start decoding the data from the first byte, I just get nonsense.
Help with understanding the logic of the decompression would be appreciated; a link to code that decompresses TIFF PackBits data would also be helpful.
~Jakub
Edit: I got the decompression algorithm to work; my error was interpreting the bytes wrongly. Here is the code, in case anyone is interested in a similar problem in the future.
Tiff_file = 'compressed.tiff';
% open and read the TIFF file
imInfo = imfinfo(Tiff_file);
fId = fopen(Tiff_file);
im = fread(fId);
fclose(fId);
% parse the file strip by strip
output = zeros(1, imInfo.Width * imInfo.Height * 2); % preallocate (2 bytes per 16-bit pixel)
thisLoc = 1;
for strip = 1:length(imInfo.StripOffsets)
    thisLength = imInfo.StripByteCounts(strip);
    thisOffset = imInfo.StripOffsets(strip);
    thisStrip = im(thisOffset + 1 : thisOffset + thisLength);
    pntr = 1; % start at the first byte
    % loop through the coded data
    while pntr < thisLength
        key = thisStrip(pntr);
        if key >= 129 % replicate run: repeat the next byte (257 - key) times
            key = 257 - key;
            datTmp = repmat(thisStrip(pntr + 1), [1 key]);
            output(thisLoc : thisLoc + key - 1) = datTmp;
            thisLoc = thisLoc + key;
            pntr = pntr + 2;
        elseif key == 128 % no-op key, skip it
            pntr = pntr + 1;
        else % literal run: copy the next (key + 1) bytes verbatim
            datTmp = thisStrip(pntr + 1 : pntr + 1 + key);
            output(thisLoc : thisLoc + key) = datTmp;
            thisLoc = thisLoc + key + 1;
            pntr = pntr + key + 2;
        end
    end
end
im = typecast(uint8(output), 'uint16');
% reshape decoded data
im = reshape(im, [imInfo.Width imInfo.Height])';
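A quick way to sanity-check the key-byte logic is to apply the same rules to a commonly cited PackBits sample stream (the bytes below are illustrative test data, not from the microscope file):
% Sanity check: decode the sample stream FE AA 02 80 00 2A FD AA 03 80 00 2A 22 F7 AA.
packed = uint8([254 170 2 128 0 42 253 170 3 128 0 42 34 247 170]);
out = uint8([]);
p = 1;
while p <= numel(packed)
    key = double(packed(p));
    if key >= 129 % replicate run: repeat the next byte (257 - key) times
        out = [out, repmat(packed(p + 1), 1, 257 - key)];
        p = p + 2;
    elseif key == 128 % no-op
        p = p + 1;
    else % literal run: copy the next (key + 1) bytes
        out = [out, packed(p + 1 : p + 1 + key)];
        p = p + key + 2;
    end
end
dec2hex(out) % 24 bytes: AA AA AA, 80 00 2A, AA AA AA AA, 80 00 2A 22, then ten AA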

How to serve Array[Byte] from spray.io

I am using the following path in my spray-can server (using spray 1.2):
path("my"/"path"){
get{
complete{
val buf:Array[Byte] = functionReturningArrayofByte()
println(buf.length)
buf
}
}
}
The length of the buffer (and what is printed by the code above) is 2,263,503 bytes. However, when accessing my/path from a web browser, it downloads a file that is 10,528,063 bytes long.
I thought spray set the content type to application/octet-stream, and the content length, automatically when completing with an Array[Byte]. I don't see what I may be doing wrong.
EDIT
I've run a small test and have seen that the array of bytes is output as a String. So if I had two bytes, say 0xFF and 0x01, the output, instead of just the two bytes, would be the string [ 255, 1 ]. I just don't know how to make it output the raw content instead of a string representation of it.
Wrapping the buf as HttpData solves the problem:
path("my"/"path"){
get{
complete{
val buf:Array[Byte] = functionReturningArrayofByte()
HttpData(buf)
}
}
}
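If you also want to pin the content type explicitly, something along these lines should work with the spray 1.2 HTTP model (a hedged sketch, not from the original answer; it assumes spray.http._ is imported):
path("my" / "path") {
  get {
    complete {
      val buf: Array[Byte] = functionReturningArrayofByte()
      // Wrap the bytes and state the content type explicitly.
      HttpEntity(ContentType(MediaTypes.`application/octet-stream`), HttpData(buf))
    }
  }
}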
