Defining datatypes at execution time? - parsing

I'm writing a PLY file parser and serialisation library as part of a research project on point cloud compression. The PLY file specifies the data format in a header that precedes a payload section with all the data. The data is described by means of elements containing properties. For example:
element vertex 12
property float x
property float y
property float z
tells us that the payload will contain a list of 12 elements called "vertex", each with three floating-point properties named x, y and z. For an arbitrary file, I don't know in advance which elements it contains, nor the name, type or number of properties each one has. I solve this by parsing the header to create a list of elements
data Element = Element Name Qty [Property]
data Property = Property PropType Name
which tells me the overall structure of the payload, along with some functions to check whether a given element is present and to locate/parse its data in the payload section.
I have a working version, but I'm thinking about a new possibility (just for fun right now). What if I wanted to build a data structure that represents the data at the type level? Something like
data Vertex = Vertex { elX :: Float
, elY :: Float
, elZ :: Float }
but with the catch that I can only discover which fields (properties) my datatype would have (or datatypes, in case the file describes more than one kind of element) at execution time, after parsing the file header. Could I then build a function parseVertex :: Parser [Vertex] which takes the payload data and parses it?
Could someone give me some advice on where to start looking to solve such a problem?

Related

Accessing field on original C struct in Go

I'm trying to use OpenCV from Go. OpenCV defines a struct CvMat that has a data field:
typedef struct CvMat
{
    ...
    union
    {
        uchar* ptr;
        short* s;
    } data;
} CvMat;
I'm using the go bindings for opencv found here. This has a type alias for CvMat:
type Mat C.CvMat
Now I have a Mat object and I want to access the data field on it. How can I do this? If I try to access _data, it doesn't work. I printed out the fields on the Mat object with the reflect package and got this:
...
{data github.com/lazywei/go-opencv/opencv [8]uint8 24 [5] false}
...
So there is a data field on it, but it's not even the same type. It's an array of 8 uint8s! I'm looking for a uchar* that is much longer than 8 characters. How do I get to this uchar?
The short answer is that you can't do this without modifying go-opencv. There are a few impediments here:
When you import a package, you can only use identifiers that have been exported. In this case, data does not start with an upper case letter, so is not exported.
Even if it was an exported identifier, you would have trouble because Go does not support unions. So instead the field has been represented by a byte array that matches the size of the underlying C union (8 bytes in this case, which matches the size of a 64-bit pointer).
Lastly, it is strongly recommended not to expose cgo types from packages. So even in cases like this where it may be possible to directly access the underlying C structure, I would recommend against it.
Ideally go-opencv would provide an accessor for the information you are after (presumably one that could check which branch of the union is in use, rather than silently returning bad data). I would suggest you either file a bug report on the package (possibly with a patch), or create a private copy with the required modifications if you need the feature right away.

Is there a Design Pattern for parsing binary data like this?

I'm working on parsing some input from a UDP stream. The protocol is sort of like a binary query string. It'll send a code byte that tells you how to read the following bytes. For example a code value of 1 might mean that the next 4 bytes are an int intended to be an ID, a value of 2 might mean the next 4 bytes are an int meant to be a Velocity, a value of 3 might mean a float for latitude, a value of 4 might mean the next bytes are a string with a length prepended as an int.
Is there a design pattern for parsing things with these kinds of rules? I'm sure there has to be some approach that's better than a large switch on the code value. I'm using a BinaryReader in C#, but I imagine there's a language agnostic solution.
You probably want the Strategy pattern. Each strategy instance will know how to parse its type of data and how many bytes to consume, and will hand the value it reads to some kind of callback or builder object that handles the relevant data.
interface IReadStrategy
{
    void Read(BinaryReader reader, MyObject obj);
}

class VelocityReader : IReadStrategy
{
    public void Read(BinaryReader reader, MyObject obj)
    {
        // Read 4 bytes as an int.
        int value = reader.ReadInt32();
        obj.SetVelocity(value);
    }
}
You would also need a factory class that reads the code byte at the start of each record to decide which strategy to use (this could be implemented as a switch). Or, if you want to use even more patterns, add a method to each strategy that recognizes its own code value, and use Chain of Responsibility to poll the strategies for the first one that can handle it.

How are hashtables (maps) stored in memory?

This question is specifically for hashtables, but might also cover other data structures such as linked lists or trees.
For instance, if you have a struct as follows:
struct Data
{
    int value1;
    int value2;
    int value3;
};
And each integer is 4-byte aligned and stored in memory sequentially, are the key and value of a hash table stored sequentially as well? If you consider the following:
std::map<int, string> list;
list[0] = "first";
Is that first element represented like this?
struct ListNode
{
    int key;
    string value;
};
And if the key and value are 4-byte aligned and stored sequentially, does it matter where the next pair is stored?
What about a node in a linked list?
Just trying to visualize this conceptually, and also see if the same guidelines for memory storage also apply for open-addressing hashing (the load is under 1) vs. chained hashing (load doesn't matter).
It's highly implementation-specific. And by that I am not only referring to the compiler, CPU architecture and ABI, but also the implementation of the hash table.
Some hash tables use a struct that contains a key and a value next to each other, much like you have guessed. Others have one array of keys and one array of values, so that values[i] is the associated value for the key at keys[i]. This is independent of the "open addressing vs. separate chaining" question.
A hash table is a data structure in its own right. Here's something to help you visualize it:
http://en.wikipedia.org/wiki/Hash_table
http://en.wikipedia.org/wiki/Hash_function
Using a (language-specific) hash function, the keys are turned into positions, and the values are placed at those positions (in an array).
For linked lists I'm not as sure, but I would expect the nodes to sit wherever the allocator placed them; each node holds a pointer to the next, so consecutive nodes need not be adjacent in memory. Obviously, if what a node holds grows in size, the node would need to be moved and the pointer to it updated.
Usually, when the value is small (an int, say), it's best to store it together with the key (which by default shouldn't be too big either); otherwise only a pointer to the value is kept next to the key.
The simplest representation of a hash table is an array (the table).
A hash function generates a number between 0 and the size of the array. That number is the index for the item.
There is more to it than this, but that's the general concept, and it explains why lookups are so fast.

multi dimension dynamic array allocation

I want to use dynamic allocation for a multi-block CFD code, where the index ranges (i,j,k) vary between blocks. I really don't know how to allocate arrays with arbitrary index extents for n blocks and pass them to subroutines. I have given a sample code, which gives the error message "Error: Expression at (1) must be scalar" on compilation using gfortran.
common/iteration/nb
integer, dimension (:),allocatable::nib,njb,nkb
real, dimension (:,:,:,:),allocatable::x,y,z
allocate (nib(nb),njb(nb),nkb(nb))
do l=1,nb
ni=nib(l)
nj=njb(l)
nk=nkb(l)
allocate (x(l,ni,nj,nk),y(l,ni,nj,nk),z(l,ni,nj,nk))
enddo
call gridatt (x,y,z,nib,njb,nkb)
deallocate(x,y,z,nib,njb,nkb)
end
subroutine gridatt (x,y,z,nib,njb,nkb)
common/iteration/nb
integer, dimension (nb)::nib,njb,nkb
real, dimension (nb,nib,njb,nkb)::x,y,z
do l=1,nb
read(7,*)nib(l),njb(l),nkb(l)
read(7,*)(((x(l,i,j,k),i=1,nib(l)),j=1,njb(l)),k=1,nkb(l)),
$ (((y(l,i,j,k),i=1,nib(l)),j=1,njb(l)),k=1,nkb(l)),
$ (((z(l,i,j,k),i=1,nib(l)),j=1,njb(l)),k=1,nkb(l))
enddo
return
end
The error message gfortran gives is as good as they get. It points to nib in the line
real, dimension (nb,nib,njb,nkb)::x,y,z
nib is declared as an array. This is not allowed. (What would the size of x, y, and z be in this dimension?)
Apart from this, I don't really understand your description of what it is that you are trying to do, and the sample code you show doesn't make much sense to me.
common/iteration/nb
integer, dimension (:),allocatable::nib,njb,nkb
real, dimension (:,:,:,:),allocatable::x,y,z
allocate (nib(nb),njb(nb),nkb(nb))
When writing new code, using modules to communicate between program units is highly preferred. Old style common blocks are to be avoided.
You are trying to allocate nib, njb, and nkb with size nb. The problem is that nb hasn't been given a value yet (and won't be given one anywhere in the code).
do l=1,nb
ni=nib(l)
nj=njb(l)
nk=nkb(l)
allocate (x(l,ni,nj,nk),y(l,ni,nj,nk),z(l,ni,nj,nk))
enddo
Again the problem with nb not having a value. This loop runs for an unknown number of times. You're also using the arrays nib, njb, and nkb, which don't contain any values yet.
In each iteration of the loop x, y, and z get allocated. This will lead to a runtime error in the second iteration, because you can not allocate an already allocated variable. Even if the allocations would work, this loop would be useless, because the three arrays would be reset in each iteration and would eventually be set to the dimensions of the last allocation.
Now that I'm writing this, I'm starting to think that what you are trying to do is create so-called 'jagged arrays': you want the block in x(1,:,:,:) to differ in size in the second, third, and/or fourth dimension from the block in x(2,:,:,:), and so on. This is simply not possible in Fortran.
One way to achieve this would be to create a user defined type with an allocatable, three dimensional array component, and create an array of this type. You can then allocate the array component to the desired size for each element of the array of user defined type.
This would look something like the following (disclaimer: untested and just one possible way of achieving your goal).
type :: blocktype
    real, dimension(:, :, :), allocatable :: x, y, z
end type blocktype

type(blocktype), dimension(nb) :: myblocks
You can then run a loop to allocate x, y, and z to a different size for each array element. This is assuming nb has been set to the required value, and nib, njb, and nkb contain the desired sizes for the different blocks.
do block = 1, nb
    ni = nib(block)
    nj = njb(block)
    nk = nkb(block)
    allocate(myblocks(block)%x(ni, nj, nk))
    allocate(myblocks(block)%y(ni, nj, nk))
    allocate(myblocks(block)%z(ni, nj, nk))
enddo
If you want to do it like this, you will definitely want to put your procedures in modules, because that way you automatically get explicit interfaces, which are required for passing around such arrays of user defined type.
One afterthought: Don't use implicit typing, not even in sample code. Always use implicit none.

Efficiently Reorganize or Reference Large Data in MATLAB

I am currently bringing large (tens of GB) data files into Matlab using memmapfile. The file I'm reading in is structured with several fields describing the data that follows it. Here's an example of how my format might look:
m.format = { 'uint8' [1 1024] 'metadata'; ...
'uint8' [1 500000] 'mydata' };
m.repeat = 10000;
So, I end up with a structure m where one sample of the data is addressed like this:
single_element = m.data(745).mydata(26);
I want to think of this data as a matrix of, from the example, 10,000 x 500,000. Indexing individual items in this way is not difficult though somewhat cumbersome. My real problem arises when I want to access e.g. the 4th column of every row. MATLAB will not allow the following:
single_column = m.data(:).mydata(4);
I could write a loop to slowly piece this whole thing into an actual matrix (I don't care about the metadata by the way), but for data this large it's hard to overemphasize how prohibitively slow that will be... not to mention the fact that it will double the memory required. Any ideas?
Simply map it to a matrix:
m.format = { 'uint8' [1024 500000] 'x' };
m.Data(1).x will be your data matrix.
