process a signal from a .wav and turn it into binary data - signal-processing

I recorded a radio signal into a .wav file. I can open it in Audacity and see that there is binary data encoded in it using some algorithm. Does anyone know of a way to process the signal contained in the .wav so that I can extract the binary data from it?
I know that I need to know the encoding algorithm for this to work properly. Does anyone know of a program that does something like that?
Thanks

sox will convert most audio formats to most other audio formats - including raw binary.
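For example, a one-liner along these lines dumps the samples as headerless signed 16-bit PCM (the file names are illustrative, and the exact flags depend on your recording):
sox input.wav -t raw -e signed-integer -b 16 output.raw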

The .wav format is generally very simple and wav files usually don't have compressed data. It's quite feasible to parse it yourself, but much easier to use something already made. So the short answer is to find something that can read wav files in your language of choice.
Here's an example in Python, using the wave module:
import wave
# Open the file and pull every frame out as raw bytes
w = wave.open("myfile.wav", "rb")
binary_data = w.readframes(w.getnframes())
w.close()
Where you go from here depends on what else you want to do. binary_data is now a Python byte string of the raw bytes. If you just want to chop this up and repackage it, it's probably easiest to leave it in this form. If you want to manipulate the data (scale it, interpolate, filter, etc.), you'll probably want to convert it into a sequence of numbers; for that, in Python, you'd convert it to a numpy array. You could do this yourself using the struct module, which interprets strings as packed binary data, or you could read the data in with the scipy.io.wavfile module, which does this for you. As you can see, most of this becomes fairly language-dependent quickly.
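For example, a minimal sketch of both routes, assuming the file holds 16-bit PCM samples (check w.getsampwidth() if unsure):
import numpy as np
# Reinterpret the raw bytes as 16-bit signed samples
samples = np.frombuffer(binary_data, dtype=np.int16)

# Or let scipy read and convert in one step
from scipy.io import wavfile
rate, samples = wavfile.read("myfile.wav")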

Related

trying to figure out the charset

I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?
It is UTF-8. You can tell by decoding it as UTF-8: it produces the correct characters.
UTF-8 also has a unique and very distinctive pattern: just three bytes with the high bit set that form a valid UTF-8 sequence are enough to tell that something is UTF-8 with 99% confidence. Even two such bytes forming a valid sequence get you to about 90%.
If it weren't UTF-8 but some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute-force it: decode it in various 8-bit encodings and see which result looks correct. The other possibility is an algorithm that goes through the encodings automatically and checks whether the result makes sense in any language.
With more information, such as which operating system and locale the file was saved under, you could narrow down the number of encodings to try considerably.
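For example, a quick sketch in Python of both checks, using the bytes from the question:
raw = b"\xE2\x80\x9C"
print(raw.decode("utf-8"))  # prints the left curly quote if the data really is UTF-8

# Brute force: try several candidate encodings and eyeball the results
for enc in ("utf-8", "cp1252", "latin-1"):
    try:
        print(enc, "->", raw.decode(enc))
    except UnicodeDecodeError:
        print(enc, "-> not a valid byte sequence in this encoding")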

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use Mahout. I'm not a Java programmer, however, so I'm trying to stay away from having to use the Java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and uninstructive. What exactly does the regex option specify, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list uses regexconverter to convert HTTP log requests to sequence files, I believe. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regex to capture each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any Java coding.
Incidentally, the arff.vector command seems to let me convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert CSV log files into. Should I use this method instead and skip the sequence-file step completely?
Thanks for the help.

How to write and read float data fast, not using string?

I have a lot of float data generated from an image. I want to store it in a file, like XX.dat (as is typical in C), and read it back in again to do further processing.
I have a method that represents each float as an NSString and writes it into a .txt file, but it is too slow. Is there some function equivalent to fwrite(*data, *pfile) and fread(*buf, *pfile) in C? Or some other idea?
many thanks!
In iOS you can still make use of the standard low-level file (and socket, among other things) APIs. So you can use fopen(), fwrite(), fread(), etc. just as you would in any other C program.
This question has some examples of using the low-level file API on iOS: iPhone Unzip code
Another option to consider is writing your floats into something like an NSMutableData instance and then writing that to a file. That will be faster than converting everything to strings (you'll get a binary file instead of a text one), though probably still not as fast as using the low-level APIs. And you'd probably have to use something like this to convert between floats and byte arrays.
If you are familiar with lower level access, you could mmap your file, and then access the data directly just as you would any allocated memory.
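If it helps to see the idea, here is a minimal sketch of raw binary float I/O versus text (written in Python for brevity; the names and values are made up, and the C fopen()/fwrite()/fread() calls follow the same pattern):
from array import array

floats = array("f", [0.25, 0.5, 0.75])   # 32-bit floats, matching C's float
with open("data.dat", "wb") as f:
    floats.tofile(f)                     # one bulk binary write, no string conversion

back = array("f")
with open("data.dat", "rb") as f:
    back.fromfile(f, len(floats))        # read the raw floats straight back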

Parsing binary data

I got interested in parser generators, but I don't have the theoretical background; I've just read a few things on the internet.
Currently I'm trying to do something with ANTLR.
I have a special format for my data frames:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can itself contain data frames, and data frames can be listed one after the other
I hope my description is clear. My questions:
Can I create a parser with ANTLR that reads the length of the frame and then knows when the frame ends?
In ANTLR can I load the different tags I use from a generated file?
Thank you!
I'm not 100% sure about this, but:
Parser generators like ANTLR require a grammar that is at least context-free.
Using length fields in your data makes your grammar not context-free (context-sensitive, I think).
It's the latter point I'm not sure about; you may want to research that further.
You'll probably have to write a packet "parser" yourself (which would then be a parser for your context-sensitive packet grammar).
Alternatively, you could drop the length field and use something like S-expressions, JSON, or XML; these would be parseable by something generated with ANTLR.
I think you will be better off creating a hand-written binary parser instead of using ANTLR, because ANTLR is primarily intended to read and make sense of text, not binary data. The lexer part is focused on tokenizing text, so trying to make it read binary data instead would be an uphill battle.
It sounds as if your structure needs some kind of recursive reading of the data, although it could be handled more simply by building a tree structure and filling it in as you read the file.
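For illustration, here is a minimal hand-written parser sketch in Python, assuming exactly the tag/length/data layout described in the question; NESTED_TAGS is a placeholder for whichever tags mark container frames:
NESTED_TAGS = {0x30}  # hypothetical: the tags whose payload is itself a list of frames

def parse_frames(buf, offset=0, end=None):
    # Parse consecutive tag/length/data frames from a bytes object.
    if end is None:
        end = len(buf)
    frames = []
    while offset < end:
        tag = buf[offset]          # first byte: tag describing the data
        length = buf[offset + 1]   # second byte: payload length in bytes
        start = offset + 2
        if tag in NESTED_TAGS:
            # Container frame: recurse into the payload
            data = parse_frames(buf, start, start + length)
        else:
            data = buf[start:start + length]
        frames.append((tag, data))
        offset = start + length    # frames are listed back to back
    return frames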

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that it could also be suitable for parsing binary file formats: complex formats that involve conditional segments, out-of-order segments, etc.
Is this possible, or is there a similar alternative package that does it? If not, what is the best way to parse binary file formats in Haskell?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Data.Binary is the most general solution, cereal can be great for limited data sizes, and attoparsec is perfectly fine for things like packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on Hackage as well.
You might be interested in attoparsec, which I think was designed for this purpose.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand, if your binary data doesn't fit into this kind of world, then you will need attoparsec. I've never used it, so I can't comment further, but this blog post is very positive.
