Is it possible to pipe HDF5 formatted data? - hdf5

Is it possible to write HDF5 to stdout and read it from stdin (via H5::File file("/dev/stdout",H5F_ACC_RDONLY) or otherwise)?
What I want is to have a program foo to write to an HDF5 file (taken to be its first argument, say) and another program bar to read from an HDF5 file and then instead of
command_prompt> foo temp.h5
command_prompt> bar temp.h5
command_prompt> rm temp.h5
simply say
command_prompt> foo - | bar -
where the programs foo and bar understand the special file name - to mean stdout and stdin, respectively. In order to write those programs, I want to know 1) whether this is possible at all and 2) how to implement it, i.e. what to pass to H5Fcreate() and H5Fopen(), respectively, when the file name is -.
I tried, and it seems impossible (not a big surprise). HDF5 only has H5Fcreate(), H5Fopen(), and H5Freopen(), none of which seems to support I/O on stdin/stdout.

I do not think you can use stdin as an HDF5 input file. The library needs to seek back and forth between the header contents and the data, and you cannot seek on stdin.
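Since the library cannot seek on a pipe, one workaround sometimes used (a sketch, not something the HDF5 API offers directly for -) is to assemble the whole file in memory with the core driver and ship the finished file image through the pipe; the reader then slurps stdin into a buffer and opens that. In h5py (the helper names here are mine) this looks roughly like:

```python
import io
import sys

import h5py
import numpy as np

def make_file_image():
    """Build an HDF5 file entirely in memory with the 'core' driver
    (backing_store=False means nothing touches the disk) and return
    the raw bytes of the complete file image."""
    f = h5py.File("unused", "w", driver="core", backing_store=False)
    f["data"] = np.arange(10)
    f.flush()                      # make sure the image is complete
    image = f.id.get_file_image()  # bytes of the whole HDF5 file
    f.close()
    return image

def open_file_image(image):
    """Open an HDF5 file image from a bytes buffer; h5py (>= 2.9)
    accepts any file-like object, so no seekable on-disk file is needed."""
    return h5py.File(io.BytesIO(image), "r")

if __name__ == "__main__":
    # Writer side of `foo - | bar -`: dump the image to stdout in one shot.
    # The reader side would slurp stdin (sys.stdin.buffer.read()) into a
    # buffer and call open_file_image() on it.
    sys.stdout.buffer.write(make_file_image())
```

The same idea exists at the C level: H5Pset_fapl_core() plus H5Fget_file_image() on the writer, and H5LTopen_file_image() on the reader. The price is that the whole file must fit in memory on both ends.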

Related

Count Lines, grep, head, and tail inside Feather Files

Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straightforward and work well with CSV files, but not with feather: if I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.
Feather files must be read with a program such as Python or R.
With CSV you can use any of the common text manipulation utilities available to Linux/Unix users.
Linux text manipulation tools
reader: less
search: grep
converters: awk, sed
extractor: split
editor: vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.

Does io.lines() stream or slurp the file?

For algorithms that support line-by-line processing, the Lua documentation suggests that using io.lines() is more efficient than io.read("*line") in a while loop.
The call io.read("*line") returns the next line from the current input
file, without the newline character. (...) However, to iterate on a
whole file line by line, we do better to use the io.lines iterator. (21.1 – The Simple I/O Model)
I can imagine three possible reasons that the io.lines() call is preferred.
The iterator is more efficient than the while loop
The file reading is handled more efficiently
It's easier to read/maintain the code
The Lua documentation also promotes slurping files:
(Y)ou should always consider the alternative of reading the whole file
with option "*all" from io.read and then using gfind to break it up (21.1 – The Simple I/O Model)
Hypothesis: io.read("*line") streams the file. If slurping is more efficient in Lua, and io.lines() slurps the file, then io.lines() might be more efficient for that reason.
However, the unofficial Lua FAQ has the following to say about io.lines()
Note that it is an iterator, this does not bring
the whole file into memory initially.
This suggests streaming instead of slurping.
TL;DR: Does io.lines() ever hold the whole file in memory, or does it only hold one line in memory at a time? Does its memory usage differ from io.read("*line") in a while loop?
io.lines() does not hold the whole file in memory: it reads the file one line at a time, not the whole file at once, so its memory usage matches io.read("*line") in a loop. To slurp the whole file instead, use io.read("*all").

Storing applicative version info in SPSS sav file

I'm using the C SPSS I/O library to write and read .sav files.
I need to store my own version number in the .sav file. The requirements are:
1) The version should not be visible to the user when he/she uses regular SPSS programs.
2) Obviously, regular SPSS programs and the I/O module should not overwrite the number.
Please advise where to store it, or which function to use.
Regards,
There is a header field in the .sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would be visible with commands such as SYSFILE INFO.
Another approach would be to create a custom file attribute using a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICT and, I think, CODEBOOK. It could be overwritten with the DATASET ATTRIBUTE command, but would not be changed just by resaving the file.
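A sketch of the custom-attribute approach in SPSS syntax (the attribute name MyAppVersion and the value are hypothetical; pick a name unlikely to collide):

```
* Attach a custom attribute to the active dataset, then resave.
DATASET ATTRIBUTE ATTRIBUTE=MyAppVersion('1.2.3').
SAVE OUTFILE='mydata.sav'.
```

DISPLAY DICT will show the attribute, but an ordinary resave of the file leaves it intact.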

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use Mahout. I'm not a Java programmer, however, so I'm trying to stay away from having to use the Java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and uninstructive. What exactly does specifying a regex option do, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list uses regexconverter to convert HTTP log requests to sequence files, I believe. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regex expression to capture each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any Java coding.
Incidentally, the arff.vector command seems to let me convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert CSV log files into. Should I use this method instead and skip the sequence-file step completely?
Thanks for the help.

What is appropriate for me: generateAllGrams(), or is generateCollocations() enough?

I am developing a WordNet-based document summarizer, and for that I need to extract collocations. I tried to research as much as I could, but since I have not worked with Mahout before, I am having difficulty understanding how CollocDriver.java works (in an API context).
While scouring the web, I landed on this:
Mahout Collocations
This is the problem: I have POS-tagged input text and need to identify collocations in it. I have the CollocDriver.java code; now I need to know how to use it. Should I use the generateAllGrams() method, or is generateCollocations() enough for my subtask within the summarizer?
And most importantly, how do I use it? I raise this question because, I admit, I don't know the API well.
I also found a grepcode version of CollocDriver; the two implementations seem to be slightly different: the inputs are strings in the grepcode version and Path objects in the original.
My questions: what is the Configuration object among the input parameters, and how do I use it? Will the source/destination be strings (as in grepcode) or Paths (as in the original)?
What will the output be?
I have done some further R&D on the CollocDriver program and found that it uses a sequence file and then vector generation. I want to know how this sequence file / vector generation works. Please help.
To get collocations using Mahout, you need to follow some simple steps:
1) Make a sequence file from your input text file:
/bin/mahout seqdirectory -i /home/developer/Desktop/colloc/ -o /home/developer/Desktop/colloc/test-seqdir -c UTF-8 -chunk 5
2) There are two ways to generate collocations from a sequence file:
a) Convert the sequence file to sparse vectors and find the collocations.
b) Find the collocations directly from the sequence file (without creating the sparse vectors).
3) Here I am considering choice b:
/bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i /home/developer/Desktop/colloc/test-seqdir -o /home/developer/Desktop/colloc/test-colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
Just check the output folder; the files you need are there (in sequence file format).
/bin/mahout seqdumper -s /home/developer/Desktop/colloc/test-colloc/ngrams/part-r-00000 >> out.txt will give you a text output.