Comparing h5 files - hdf5

I often have to compare HDF5 files. The way I do it is either with a binary diff (which tells me the files are different even though the actual numbers inside are the same) or by dumping the content into a text file with h5dump and then comparing the content of the two files (which is also quite annoying).
I was wondering if there is a more clever way to do this, perhaps a feature of HDF5 or of software like HDFView or Panoply.

Perhaps hdiff is what you require? Some examples here.

h5diff can be used to compare HDF5 files, and on Ubuntu it can be installed with
apt-get install hdf5-tools
then it's simply
h5diff file1.hdf5 file2.hdf5
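If plain h5diff still reports spurious differences, it also accepts tolerance options (-d for an absolute delta, -p for a relative one). For a programmatic check, here is a minimal sketch using h5py and numpy (these are not part of the HDF5 tools and must be installed; it assumes both files share the same layout and contain numeric datasets):
import sys
import h5py
import numpy as np

def compare_h5(path_a, path_b, rtol=1e-9, atol=0.0):
    # Assumes both files share the same group/dataset layout and that the
    # datasets are numeric; string datasets would need separate handling.
    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb:
        names = []
        fa.visit(names.append)                 # collect every object name
        for name in names:
            if isinstance(fa[name], h5py.Dataset):
                same = np.allclose(fa[name][...], fb[name][...],
                                   rtol=rtol, atol=atol)
                print(f"{name}: {'OK' if same else 'DIFFERS'}")

compare_h5(sys.argv[1], sys.argv[2])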

Related

Count Lines, grep, head, and tail inside Feather Files

Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straightforward and work well with csv files, but not with feather. If I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.
To work with feather files you must use Python or R programs.
To work with csv you can use any of the common text manipulation utilities available to Linux/Unix users.
Linux text manipulation tools:
reader: less
search: grep
converters: awk, sed
extractor: split
editor: vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.
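For example, a small Python helper built on pyarrow can reproduce the wc -l / head / grep workflow for a Feather file. This is only a sketch: it assumes pyarrow (and pandas, for display) is installed, and that the column being searched, here called "message", is a text column in your data.
import sys
import pyarrow.feather as feather
import pyarrow.compute as pc

table = feather.read_table(sys.argv[1])        # loads (memory-maps) the Feather file

# wc -l equivalent
print("rows:", table.num_rows)

# head equivalent: first 5 rows (to_pandas() needs pandas installed)
print(table.slice(0, 5).to_pandas())

# grep equivalent on one text column; "message" is a placeholder column name
mask = pc.match_substring(table.column("message"), "search_string")
print(table.filter(mask).to_pandas())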

Hyper directory - Can a (Linux) directory in addition to a list of subdirectories and files also contain text itself?

Under Linux, are there ways to add comments, description (text, rich text, hypertext .. ) to a directory itself, rather than by means of auxiliary files in such a directory, like README.txt, INSTALL.txt, NOTE_ON_WHY_WE_DID_THIS_THIS_WAY.txt, .. ?
In such a generalized directory, a directory entry (subdirectory/file) would be represented as a (hyper)link, at least in one view of such a generalized directory. A "classical directory view" may also be available for generalized directories, in which the comments and description mentioned above would be omitted, or be available through an auxiliary file. I am aware this may require either special formatting of the storage medium, or a software layer on top of a classical disk formatting structure. The views would have to be derived from the generalized directory and not vice versa (in order to avoid consistency problems between the views).
Not in general, but some file-systems have extended file attributes. You could use getfattr(1), setfattr(1). See attr(5), listxattr(2), setxattr(2) etc...
AFAIK, few utilities are using these extended file attributes (and that surprises me; I would imagine that desktop environments would use them to store e.g. the MIME type of files, but they usually don't). There is a significant (file-system-specific) size limit on these extended attributes, e.g. 255 bytes.
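As a quick illustration of the xattr route, here is a minimal Python sketch (Linux-only, Python 3.3+; the directory name and note text are placeholders, and the filesystem must support user.* attributes):
import os

d = "some_directory"   # placeholder; must live on a filesystem with user xattrs (e.g. ext4)
os.makedirs(d, exist_ok=True)

# roughly: setfattr -n user.comment -v "note on why we did it this way" some_directory
os.setxattr(d, "user.comment", b"note on why we did it this way")

# roughly: getfattr -n user.comment some_directory
print(os.listxattr(d))                          # ['user.comment']
print(os.getxattr(d, "user.comment").decode())  # the stored note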
A more practical and traditional way would be to store your additional metadata in some hidden directory (with a name starting with a dot, like the .git/ used by git).
I can't speak for all filesystems, but at least in extX a directory contains only the names of the files/dirs inside it, their inode numbers, and the offset to where the next (name, inode) pair starts. Generally, the data that describe a directory are kept in the inode structure (not inside the directory itself): the owner of the dir, atime, ctime, extended attributes, number of links and so on all live there. You can look at this structure in the kernel source, and there is no field that allows you to put "labels" on a file/dir. In theory you could use some "unused" fields of this structure, but only in theory, since that space is very limited.
Interesting question, but I believe not. From what I remember, directories are just pointers to files and other directories, so I don't think it would be possible to store text in them. Maybe if you re-engineer the whole filesystem...

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use mahout. I'm not a java programmer however, so I'm trying to stay away from having to use the java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and not very instructive. Exactly what does specifying a regex option do, and what do the transformer class and formatter class do? The mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list uses regexconverter to convert http log requests to sequence files, I believe. I have a csv file with slightly altered http log requests that I'm hoping to convert to sequence files. Do I simply change the regex expression to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any need for java coding.
Incidentally, the arff.vector command seems to allow me to convert an arff file directly to vectors. I'm unfamiliar with arff, though it seems to be something I can easily convert csv log files into. Should I use this method instead, and skip the sequence file step completely?
Thanks for the help.

Fast Search to see if a String Exists in Large Files with Delphi

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.
If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:
var
  FileContents: TStringList;
begin
  FileContents := TStringList.Create;
  try
    FileContents.LoadFromFile(Filepath);
    if Pos(TextToFind, FileContents.Text) = 0 then
      Found := false
    else
      Found := true;
  finally
    FileContents.Free;
  end;
The above code is simple, and it generally works okay. But it has two problems:
It fails for very large files (e.g. 300 MB)
I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?
I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.
So how can I speed this search up and also make it work for very large files?
Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?
Solution:
Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.
But he got me thinking that I had done this before, and I had. I had built a block-reading routine for large files that breaks them into 32 MB blocks. I use that to read the input file of my program, which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.
So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.
That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).
So I'm off and running. Thanks once again SO community.
The best approach here is probably to use memory mapped files.
First you need a file handle, use the CreateFile windows API function for that.
Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.
To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4kb, and at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary)
To do the actual searching once (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the "(Str1[I] <> #0) and" part.
Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.
Once you are done with each file, call UnmapViewOfFile, then CloseHandle on the file mapping handle, and finally CloseHandle on the file handle.
EDIT:
A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.
Normally, on file access, first Windows reads the bytes into the OS file cache. Then copies them from there into the application memory.
If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for the copy and halving memory usage).
Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case-insensitive search.
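The steps above describe the Win32 API from Delphi; as a language-neutral sketch of the same memory-mapped idea, here is a hedged Python version (mmap maps the file read-only and find() scans it without an extra copy; the file name and search bytes are placeholders):
import mmap

def file_contains(path, needle: bytes) -> bool:
    # Map the file read-only and let the OS page it in; no second copy is made.
    with open(path, "rb") as f:
        if f.seek(0, 2) == 0:          # mmap cannot map an empty file
            return False
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle) != -1

# For the "ignore case" bonus one could search for both an ANSI and a UTF-16
# encoding of the needle, mirroring the Ansi/Wide split discussed above.
print(file_contains("big.log", b"ERROR"))   # "big.log" is a placeholder path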
If you are looking for text string searches, look at the Boyer-Moore search algorithm. It uses memory mapped files and a really fast search engine. There are some Delphi units around that contain implementations of this algorithm.
To give you an idea of the speed: I currently search through 10-20 MB files and it takes on the order of milliseconds.
Oh, I just read that it might be Unicode; I'm not sure if it supports that, but definitely look down this path.
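To make the skip-table idea behind Boyer-Moore(-Horspool) concrete, here is a compact illustrative sketch in Python over byte strings; it is not the tuned Delphi unit referred to above:
def bmh_find(haystack: bytes, needle: bytes) -> int:
    # Boyer-Moore-Horspool: precompute how far the window may shift based on
    # the last byte of the current window, then slide in large steps.
    m = len(needle)
    if m == 0:
        return 0
    skip = {needle[i]: m - 1 - i for i in range(m - 1)}
    i = m - 1
    while i < len(haystack):
        if haystack[i - m + 1:i + 1] == needle:
            return i - m + 1                    # index of the match
        i += skip.get(haystack[i], m)           # unknown byte: jump a full window
    return -1

print(bmh_find(b"some large buffer with ERROR inside", b"ERROR"))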
This is a problem connected with your previous question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and the same answers apply. If you don't read the files completely but in blocks then large files won't pose a problem. There's also a big speed-up to be had for files containing the text, in that you should cancel the search upon the first match. Currently you read the whole files even when the text to be found is in the first few lines.
May I suggest a component? If so, I would recommend ATStreamSearch.
It handles ANSI and UNICODE (and even EBCDIC and Korean and more).
Or the class TUTBMSearch from JclUnicode (JEDI JCL). It was mainly written by Mike Lischke (of VirtualTreeView). It uses a tuned Boyer-Moore algorithm that ensures speed. The drawback in your case is that it works fully in Unicode (WideStrings), so the conversion from String to WideString risks being a penalty.
It depends on what kind of data you are going to search. To achieve really efficient results you will need to let your program parse the interesting directories, including all the files in them, and keep the data in a database that you can query each time for a specific word in a specific list of files (which can be generated from the search path). A database query can give you results in milliseconds.
The issue is that you will have to let it run and parse all the files after installation, which may take more than an hour depending on the amount of data you wish to index.
This database should be updated each time your program starts; that can be done by comparing the MD5 value of each file to see whether it has changed, so you don't have to re-parse all your files every time.
This way of working is interesting if your data sits in a constant place and you analyse the same files repeatedly rather than completely new files each time; some code analysers work like this and they are really efficient. You invest some time in parsing and saving the interesting data, and afterwards you can jump to the exact place where a search word appears and provide a list of all the places it appears in a very short time.
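A small sketch of that "only re-parse what changed" bookkeeping, with hypothetical file names and a JSON file standing in for the database:
import hashlib
import json
import os

STATE_FILE = "index_state.json"    # hypothetical location for the remembered hashes
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        state = json.load(f)
else:
    state = {}

def changed_since_last_run(path):
    # Hash the file and compare with the hash remembered from the previous run.
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    is_changed = state.get(path) != digest
    state[path] = digest
    return is_changed

for path in ["a.txt", "b.txt"]:    # placeholder file list
    if changed_since_last_run(path):
        print("re-parse and re-index:", path)

with open(STATE_FILE, "w") as f:
    json.dump(state, f)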
If the files are to be searched multiple times, it could be a good idea to use a word index.
This is called "Full Text Search".
It will be slower the first time (text must be parsed and indexes must be created), but any future search will be immediate: in short, it will use only the indexes, and not read all text again.
You have the exact parser you need in The Delphi Magazine Issue 78, February 2002:
"Algorithms Alfresco: Ask A Thousand Times
Julian Bucknall discusses word indexing and document searches: if you want to know how Google works its magic this is the page to turn to."
There are several FTS implementations for Delphi:
Rubicon
Mutis
ColiGet
Google is your friend..
I'd like to add that most DBs have an embedded FTS engine. SQLite3 even has a very small but efficient implementation, with page ranking and such.
We provide direct access from Delphi, with ORM classes, to this Full Text Search engine, named FTS3/FTS4.
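As a minimal sketch of that index-once, search-instantly idea (not the Delphi ORM mentioned above), SQLite's built-in full-text search can be driven from Python; whether FTS4/FTS5 is compiled in depends on the SQLite build, and the file and table names below are placeholders:
import sqlite3

con = sqlite3.connect("search_index.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts4(path, body)")

# Index a file once (the slow part, done up front).
with open("big.log", encoding="utf-8", errors="replace") as f:    # placeholder file
    con.execute("INSERT INTO docs VALUES (?, ?)", ("big.log", f.read()))
con.commit()

# Later searches hit only the index, not the raw text.
for (path,) in con.execute("SELECT path FROM docs WHERE body MATCH ?", ("error",)):
    print(path)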

How do I "diff" multiple files against a single base file?

I have a configuration file that I consider to be my "base" configuration. I'd like to compare up to 10 other configuration files against that single base file. I'm looking for a report where each file is compared against the base file.
I've been looking at diff and sdiff, but they don't completely offer what I am looking for.
I've considered diff'ing the base against each file individually, but my problem then becomes merging those into a report. Ideally, if the same line is missing in all 10 config files (when compared to the base config), I'd like that reported in an easy-to-visualize manner.
Notice that some rows are missing in several of the config files (when compared individually to the base). I'd like to be able to put those on the same line (as above).
Note, the screenshot above is simply a mockup, and not an actual application.
I've looked at using some Delphi controls for this and writing my own (I have Delphi 2007), but if there is a program that already does this, I'd prefer it.
The Delphi controls I've looked at are TDiff, and the TrmDiff* components included in rmcontrols.
For people who are still wondering how to do this, diffuse is the closest answer: it does an N-way merge by displaying all files and doing three-way merges among neighbours.
None of the existing diff/merge tools will do what you want. Based on your sample screenshot you're looking for an algorithm that performs alignments over multiple files and gives appropriate weights based on line similarity.
The first issue is weighting the alignment based on line similarity. Most popular alignment algorithms, including the one used by GNU diff, TDiff, and TrmDiff, do an alignment based on line hashes, and just check whether the lines match exactly or not. You can pre-process the lines to remove whitespace or change everything to lower-case, but that's it. Add, remove, or change a letter and the alignment thinks the entire line is different. Any alignment of different lines at that point is purely accidental.
Beyond Compare does take line similarity into account, but it really only works for 2-way comparisons. Compare It! also has some sort of similarity algorithm, but it is also limited to 2-way comparisons. Similarity scoring can slow down the comparison dramatically, and I'm not aware of any other component or program, commercial or open source, that even tries.
The other issue is that you also want a multi-file comparison. That means either running the 2-way diff algorithm a bunch of times and stitching the results together or finding an algorithm that does multiple alignments at once.
Stitching will be difficult: your sample shows that the original file can have missing lines, so you'd need to compare every file to every other file to get a bunch of alignments, and then you'd need to work out the best way to match those alignments up. A naive stitching algorithm is pretty easy to do, but it will get messed up by trivial matches (blank lines, for example).
There are research papers that cover aligning multiple sequences at once, but they're usually focused on DNA comparisons, so you'd definitely have to code it up yourself. Wikipedia covers a lot of the basics, then you'd probably need to switch to Google Scholar.
Sequence alignment
Multiple sequence alignment
Gap penalty
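To get a feel for what a line-similarity score (as opposed to an exact hash match) looks like, here is a throwaway Python sketch using difflib's SequenceMatcher; the config lines are made up:
import difflib

a = "max_connections = 100"
b = "max_connections = 250"
c = "log_destination = stderr"

# ratio() is 1.0 for identical lines and drops toward 0.0 for unrelated ones;
# a hash-based diff would treat both pairs below as simply "different".
print(difflib.SequenceMatcher(None, a, b).ratio())   # high: same setting, new value
print(difflib.SequenceMatcher(None, a, c).ratio())   # much lower: unrelated lines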
Try Scooter Software's Beyond Compare. It supports 3-way merge and is written in Delphi / Kylix for multi-platform support. I've used it pretty extensively (even over a VPN) and it's performed well.
for f in file1 file2 file3 file4 file5; do printf '%s\n\n' "$f" >> outF; diff "$f" baseFile >> outF; printf '\n\n' >> outF; done
Diff3 should help. If you're on Windows, you can use it from Cygwin or from diffutils.
I made my own diff tool, DirDiff, because I didn't want matching parts shown twice on screen, and I wanted differing parts placed above each other for easy comparison. You could use it in directory mode on a directory with an equal number of copies of the base file.
It doesn't export diff reports, but I'll note that as a feature request.
You might want to look at some Merge components as what you describe is exactly what Merge tools do between the common base, version control file and local file. Except that you want more than 2 files (+ base)...
Just my $0.02
SourceGear DiffMerge is nice (and free) for Windows-based file diffing.
I know this is an old thread but vimdiff does (almost) exactly what you're looking for with the added advantage of being able to edit the files right from the diff perspective.
But still, none of these solutions handles more than 3 files.
What I did is messier, but serves the same purpose (comparing the contents of multiple config files, with no limit except memory and Bash array variables).
While loop to read a file into an array:
loadsauce () {
  SRCCNT=()
  index=0
  while IFS= read -r "SRCCNT[$index]"
  do let index=index+1
  done < "$SRC"
}
Again for the target file
loadtarget () {
  TRGCNT=()
  index=0
  while IFS= read -r "TRGCNT[$index]"
  do let index=index+1
  done < "$TRG"
}
string comparison
brutediff () {
  # Brute force string compare, probably duplicates diff
  # This is very ugly but it will compare every line in SRC against every line in TRG
  # Grep might do better; this version is included for completeness
  for selement in $(seq 0 $((${#SRCCNT[@]} - 1)))
  do for telement in $(seq 0 $((${#TRGCNT[@]} - 1)))
    do [[ "${SRCCNT[$selement]}" == "${TRGCNT[$telement]}" ]] && echo "\"${SRCCNT[$selement]}\" is in ${SRC} and ${TRG}" >> "$OUTMATCH"
    done
  done
}
and finally a loop to do it against a list of files
for sauces in $(cat "$SRCLIST")
do echo "Checking ${sauces}..."
  SRC="$sauces"
  loadsauce
  loadtarget
  brutediff
  echo -n "Done, "
done
It's still untested/buggy and incomplete (e.g. sorting out duplicates, or compiling, for each line, a list of the files it appears in), but it's definitely a move in the direction the OP was asking for.
I do think Perl would be better for this though.
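In the same spirit, here is a short Python sketch with difflib that diffs several files against one base and prints a combined report; the file names are placeholders, and it does plain 2-way diffs rather than the multi-file alignment discussed above:
import difflib

base_path = "base.conf"                              # placeholder base file
others = ["node1.conf", "node2.conf", "node3.conf"]  # placeholder config files

with open(base_path) as f:
    base = f.readlines()

for path in others:
    with open(path) as f:
        other = f.readlines()
    # One unified diff per file, collected into a single report on stdout.
    print(f"=== {base_path} vs {path} ===")
    for line in difflib.unified_diff(base, other, fromfile=base_path, tofile=path):
        print(line, end="")
    print()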

Resources