How do I "diff" multiple files against a single base file?


I have a configuration file that I consider to be my "base" configuration. I'd like to compare up to 10 other configuration files against that single base file. I'm looking for a report where each file is compared against the base file.
I've been looking at diff and sdiff, but they don't completely offer what I am looking for.
I've considered diff'ing the base against each file individually, but my problem then becomes merging those results into a single report. Ideally, if the same line is missing in all 10 config files (when compared to the base config), I'd like that reported in a way that is easy to visualize.
Notice that some rows are missing in several of the config files (when compared individually to the base). I'd like to be able to put those on the same line (as above).
Note, the screenshot above is simply a mockup, and not an actual application.
I've looked at using some Delphi controls for this and writing my own (I have Delphi 2007), but if there is a program that already does this, I'd prefer it.
The Delphi controls I've looked at are TDiff, and the TrmDiff* components included in rmcontrols.

For people that are still wondering how to do this, diffuse is the closest answer. It does an N-way merge by displaying all files and doing three-way merges among neighbors.

None of the existing diff/merge tools will do what you want. Based on your sample screenshot you're looking for an algorithm that performs alignments over multiple files and gives appropriate weights based on line similarity.
The first issue is weighting the alignment based on line similarity. Most popular alignment algorithms, including the ones used by GNU diff, TDiff, and TrmDiff, do an alignment based on line hashes, and just check whether the lines match exactly or not. You can pre-process the lines to remove whitespace or change everything to lower-case, but that's it. Add, remove, or change a letter and the alignment thinks the entire line is different. Any alignment of different lines at that point is purely accidental.
Beyond Compare does take line similarity into account, but it really only works for 2-way comparisons. Compare It! also has some sort of similarity algorithm, but it is also limited to 2-way comparisons. Similarity matching can slow down the comparison dramatically, and I'm not aware of any other component or program, commercial or open source, that even tries.
The other issue is that you also want a multi-file comparison. That means either running the 2-way diff algorithm a bunch of times and stitching the results together or finding an algorithm that does multiple alignments at once.
Stitching will be difficult: your sample shows that the original file can have missing lines, so you'd need to compare every file to every other file to get a bunch of alignments, and then you'd need to work out the best way to match those alignments up. A naive stitching algorithm is pretty easy to do, but it will get messed up by trivial matches (blank lines, for example).
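For what it's worth, the naive version of that stitching can be sketched in a few lines of Python with the standard difflib module. This is only a rough illustration, not a finished tool: it assumes every file is compared against the base only, it counts exact line matches only (no similarity weighting), and it ignores lines that exist in the other files but not in the base. The names missing_from_base and format_report are made up for this example.

```python
import difflib

def missing_from_base(base_lines, others):
    """Map each base line index to the names of files that lack that line.

    `others` maps a file name to its list of lines. Only exact line
    matches count, as with a hash-based diff.
    """
    report = {i: [] for i in range(len(base_lines))}
    for name, lines in others.items():
        matcher = difflib.SequenceMatcher(None, base_lines, lines, autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            # "delete" and "replace" both mean base lines i1..i2-1 have
            # no exact counterpart in this file.
            if tag in ("delete", "replace"):
                for i in range(i1, i2):
                    report[i].append(name)
    return report

def format_report(base_lines, report):
    """Render one row per base line, listing the files that lack it."""
    rows = []
    for i, line in enumerate(base_lines):
        absent = ", ".join(report[i]) or "-"
        rows.append(f"{i + 1:>4}  {line:<30}  missing in: {absent}")
    return "\n".join(rows)
```

Putting every file's result on the row of the base line is what gives the "same line missing in several files" view. Repeated trivial lines such as blank lines can still be aligned badly, which is exactly the weakness described above.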
There are research papers that cover aligning multiple sequences at once, but they're usually focused on DNA comparisons, so you'd definitely have to code it up yourself. Wikipedia covers a lot of the basics; after that you'd probably need to switch to Google Scholar.
Sequence alignment
Multiple sequence alignment
Gap penalty

Try Scooter Software's Beyond Compare. It supports 3-way merge and is written in Delphi / Kylix for multi-platform support. I've used it pretty extensively (even over a VPN) and it's performed well.

for f in file1 file2 file3 file4 file5; do printf '%s\n\n' "$f" >> outF; diff "$f" baseFile >> outF; printf '\n\n' >> outF; done

Diff3 should help. If you're on Windows, you can use it from Cygwin or from diffutils.

I made my own diff tool, DirDiff, because I didn't want matching parts shown twice on screen, and I wanted differing parts placed above each other for easy comparison. You could use it in directory mode on a directory with an equal number of copies of the base file.
It doesn't render exports of diffs, but I'll list it as a feature request.

You might want to look at some Merge components as what you describe is exactly what Merge tools do between the common base, version control file and local file. Except that you want more than 2 files (+ base)...
Just my $0.02

SourceGear DiffMerge is nice (and free) for Windows-based file diffing.

I know this is an old thread but vimdiff does (almost) exactly what you're looking for with the added advantage of being able to edit the files right from the diff perspective.

Still, none of these solutions handles more than 3 files.
What I did was messier, but for the same purpose (comparing contents of multiple config files, no limit except memory and BASH variables)
While loop to read a file into an array:
loadsauce () {
# Read $SRC line by line into the SRCCNT array (reset first so files don't mix)
SRCCNT=()
while IFS= read -r line
do SRCCNT+=("$line")
done < "$SRC"
}
Again for the target file
loadtarget () {
# Read $TRG line by line into the TRGCNT array (reset first so files don't mix)
TRGCNT=()
while IFS= read -r line
do TRGCNT+=("$line")
done < "$TRG"
}
string comparison
brutediff () {
# Brute force string compare, probably duplicates diff
# This is very ugly but it will compare every line in SRC against every line in TRG
# Grep might do better, version included for completeness
# Loop over the array contents (not the indices) so the lines themselves are compared
for selement in "${SRCCNT[@]}"
do for telement in "${TRGCNT[@]}"
do [[ "$selement" == "$telement" ]] && echo "${selement} is in ${SRC} and ${TRG}" >> "$OUTMATCH"
done
done
}
and finally a loop to do it against a list of files
for sauces in $(cat "$SRCLIST")
do echo "Checking ${sauces}..."
# Assumption: each listed file is the source; TRG and OUTMATCH are set beforehand
SRC=$sauces
loadsauce
loadtarget
brutediff
echo -n "Done, "
done
It's still untested/buggy and incomplete (it doesn't sort out duplicates or compile, for each line, a list of the files that share it), but it's definitely a move in the direction OP was asking for.
I do think Perl would be better for this though.


Count Lines, grep, head, and tail inside Feather Files

Setup: I am contemplating switching from writing large (~20GB) data files with csv to feather format, since I have plenty of storage space and the extra speed is more important. One thing I like about csv files is that at the command line, I can do a quick
wc -l filename
to get a row count, even for large data files. Also, I can quickly search for a simple string with
grep search_string filename
The head and tail commands are also very useful at times. These are straight-forward and work well with csv files, but not with feather. If I try any of them on a feather file, I do not get results that make sense or are helpful.
While I certainly can read a feather file into, say, Python or R, and analyze it then, the hassle of writing out the path and importing the necessary libraries is something I'd rather dispense with.
My Question: Does there exist either a cross-platform (at least Mac and Linux) feather file reader I can use to quickly read in and view feather data (this would be in tabular format) with features corresponding to row count, grep, head, and tail? Or are there simple CLI utilities I could install that would enable me to do the equivalent of line count, grep, head, and tail?
I've seen this question, but it is very incomplete relative to my question.

To work with feather files you need a Python or R program.
With csv you can use any of the common text manipulation utilities available to Linux/Unix users.
Linux text manipulation tools
reader: less
search: grep
converters: awk, sed
extractor: split
editor: vim
Each of the above tools requires some learning and practice.
Suggestion
If you have programming skill, create a program to manipulate your feather file.

How to match efficiently against keys in a table in Lua?

Available in my Lua 5.1 environment are obviously the default Lua pattern matching, but also a reasonably recent version of PCRE and LPEG. I don't honestly care which of these is used; as long as my problem is tackled in an efficient manner I'm happy. (My personal knowledge of LPEG especially is next to non-existent, but I hear it has some very good qualities.)
I have a table with certain string patterns as keys; the accompanying values are to be used once a key matches, which means they aren't really important for this matter.
Suppose you have:
tbl = { ["aaa"] = 12, ["aab"] = 452, ["aba"] = -2 }
Now my goal is to find out which one of these matches first in a particular string like "accaccaacaadacaabacdaaba".
In reality, the keys are more numerous and the match string is considerably lengthier. This means that simply matching against all keys one by one and comparing the column each match begins at is a very inefficient solution that is not viable for me.
Parts of the match strings can have considerable overlaps, too. From the theory, I know one state machine per key pattern would be ideal in this regard; just go through the motions on every pattern and the moment you have a complete match on one of them you are done.
But I would be crazy to go code something like that myself when there are so many pattern matching libraries in my environment. The only one I know is technically capable is PCRE; just append the keys like "aaa|aab|aba" and you'll get the first feasible match.
But that is also where the problem lies. For one, I am unsure how intelligent it is when compiling such a match. (I think it first tries 'aaa', unwinds completely once it fails, then completely tries 'aab', but I haven't tested this.) That wouldn't be too efficient compared to matching it like "a(a[ab]|ba)", where similarities get resolved faster.
Additionally, I'd like to have the capacity to put in some flexibility ("a.ad" where the second character doesn't matter, or matches a number.. basic stuff like that). With a pattern like that in such an additive approach, I do not see a way to regain the original pattern that matched so I can use the value that goes with it.
(Worst case, I could just generate a lot of entries in the table to match every possible wildcard variation and do away with the pattern requirement, but I honestly don't want to.)
Which library is the right tool for the job, and to boot, how to best use said library to achieve above-stated goals without reinventing the wheel?

A comment to your question mentioned the Aho–Corasick algorithm.
If your environment has access to os.execute or io.popen, you can call fgrep -o -f patterns filename, where patterns is the name of a file that contains patterns separated with newlines, and filename is the name of your input. -o means that only matches will be output, one per line. You can replace filename with - so that fgrep reads from standard input: echo "String to match" | fgrep -o -f patterns.
fgrep implements the Aho–Corasick algorithm.
However, remember that the Aho–Corasick algorithm does not recognise metacharacters.

Just as Alexander Mashin's answer said, the Aho–Corasick algorithm is an efficient algorithm that will solve your problem. In Lua land, cloudflare/lua-aho-corasick is an implementation for LuaJIT using FFI. There's also a pure Lua implementation, jgrahamc/aho-corasick-lua, which might be slower.
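To make the idea concrete, here is a minimal sketch of Aho–Corasick in Python (not Lua, and not the API of either library above). It assumes plain, non-empty literal keys with no wildcards, and it takes "matches first" to mean the match with the smallest start column. The function names build_automaton and first_match are invented for this illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output/depth tables for a set of literal patterns."""
    goto, fail, out, depth = [{}], [0], [[]], [0]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                depth.append(depth[state] + 1)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: failure links point to the longest proper suffix
    # of the current prefix that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out, depth

def first_match(text, table):
    """Return (start, key, value) for the earliest-starting key, or None."""
    goto, fail, out, depth = build_automaton(table)
    state, best = 0, None
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            start = i - len(pat) + 1
            if best is None or start < best[0]:
                best = (start, pat)
        # Any match still in progress starts at i - depth[state] + 1 or
        # later, so once that is not before the best start we can stop.
        if best is not None and best[0] <= i - depth[state] + 1:
            break
    return None if best is None else (best[0], best[1], table[best[1]])
```

Because each state's output list records which keys end there, the original key (and therefore its value) is recovered directly, which is the part that is awkward with a single "aaa|aab|aba" alternation. The text is scanned once, regardless of how many keys there are.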

Flip doxygen's graphs from top-to-bottom orientation to left-to-right

The doxygen graphs for "includes" and "is included by" are created with nesting depth increasing from top to bottom (using 1.8.5).
Since we have mostly shallow graphs with many nodes, this leads to very wide graphs with ugly horizontal scroll bars. Is there a way to teach doxygen to create these graphs in a left-to-right orientation, the way it creates caller/call graphs?
I know that graphviz/dot supports this, but can't find a way to tell doxygen my preference.

A similar question was asked recently, and I am duplicating my answer from it:
Doxygen: Is it possible to control the orientation of dependency graphs?
After looking for the same myself and finding nothing, the best I can offer is a hack using the graph attribute rankdir.
Step 1) Make sure Doxygen keeps the dot files. Put DOT_CLEANUP=NO in your config file.
Step 2) Find the dot files that Doxygen generated. They should be in the form *__incl.dot. In the steps below I will refer to this file as <source>.dot
Step 3a) Assuming the dot file did not explicitly specify rankdir (it is "TB" by default), regenerate the output with this command.
dot -Grankdir="LR" -Tpng -o<source>.png -Tcmapx -o<source>.map <source>.dot
Step 3b) If for some reason rankdir is specified in the dot file, edit the file and set rankdir="LR" (by default rankdir is "TB").
digraph "AppMain"
{
rankdir="LR";
...
Then regenerate the output with:
dot -Tpng -o<source>.png -Tcmapx -o<source>.map <source>.dot
You need to redo this after every run of Doxygen. A batch file might be handy, especially if you want to process all files. For step 3b, batch replacing text is outside of the scope of this answer :). But here seems to be a good answer:
How can you find and replace text in a file using the Windows command-line environment?
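If a scripting language is available, the text replacement for step 3b can also be sketched in Python. This is a rough helper under a few assumptions: the function names set_rankdir and rewrite_all are made up here, the first "{" in the file is taken to be the graph's opening brace, and only the first rankdir assignment is touched. You would still run dot on each file afterwards as shown above.

```python
import re
from pathlib import Path

# Matches an existing assignment such as rankdir="TB"; or rankdir=TB
RANKDIR = re.compile(r'rankdir\s*=\s*"?\w+"?;?')

def set_rankdir(dot_source, direction="LR"):
    """Return dot_source with its graph-level rankdir set to `direction`."""
    replacement = f'rankdir="{direction}";'
    if RANKDIR.search(dot_source):
        return RANKDIR.sub(replacement, dot_source, count=1)
    # No rankdir yet: insert one right after the first opening brace.
    return dot_source.replace("{", "{\n  " + replacement, 1)

def rewrite_all(directory, direction="LR"):
    """Rewrite every *__incl.dot file in `directory` in place."""
    for path in Path(directory).glob("*__incl.dot"):
        path.write_text(set_rankdir(path.read_text(), direction))
```

Since Doxygen regenerates the dot files on every run, this has to be repeated after each run, just like the manual steps.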

Splitting a bibliography: how to overcome Multibib/Latex's file limitation (16 max)

I'm writing a thesis where I'm asked to split the bibliography into different sections, and so far I've been using Multibib which was really perfect for what I wanted to do:
\newcites{ltex}{\TeX\ and \LaTeX\ References}
...
\bibliographystyleltex{alpha}
\bibliographyltex{lit}
But I'm now facing a limitation regarding the number of files I'm allowed to use, as described in the Multibib documentation:
The tiny \newcites command is not limited to one bibliography. In fact, you
can generate as much bibliographies as you like (only limited by the maximum
number of TEX’s output files, usually 16).
Is there any way to easily bypass this limitation? (I can't reduce the number of sections, and I would like to keep Bibtex --- AFAIU splitbib doesn't)
Many thanks,

I'd look at biblatex, which does a single read of all of the data (at the start of your file). (BTW, you might get more answers at the new TeX-specific site http://tex.stackexchange.com)

Bibtopic seems to match what I was looking for, and it's even simpler than Multibib!
\usepackage{bibtopic}
...
\begin{btSect}{BIB_FILENAME}
\section{SECTION TITLE}
\btPrintAll || \btPrintCited || \btPrintNotCited
\end{btSect}
I'll accept the answer if it works properly within my document.
EDIT: Bibtopic suffers from the same problem as Multibib ... any other solution than copying the .bbl files directly into my latex files?

console output formatting

Are there any conventions for formatting console output from a command line app for readability and consistency? For instance, do you indent sub-information? When do you print a blank line, if ever? How should you accent important statements?
I've found output can quickly degenerate into a chaotic blur. I'm interested in hearing about what other people do.
Update: Really this is for embedded software which spits debug status out to a terminal, but it's pretty much like a console app, and I figured everyone would be more familiar with that. Thanks so far.

I'd differentiate two kinds of programs:
Do you print information that might be used by a script (i.e. it should be parseable)? Then define a pretty strict format and use only that (for example fixed field separators).
Do you print information that need not be parsed by a script (or is there an alternative script-parseable format already)? Then write what comes natural:
My suggestions:
write it so that you would like to read it
indent sub-information 2 or 4 spaces, definitely not more
separate blocks of information by one empty line at most
respect the COLUMNS environment variable (and possibly LINES if it applies to your output).
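As a small illustration of the indentation and width suggestions, here is a sketch in Python. It is only an example for a hosted console program, not something you would run on an embedded target, and the function name format_block is hypothetical. shutil.get_terminal_size consults the COLUMNS and LINES environment variables before querying the terminal.

```python
import shutil
import textwrap

def format_block(title, details, indent=2):
    """Format a heading plus indented sub-information, wrapped to the terminal width."""
    # Honors COLUMNS if set; falls back to 80 columns when nothing is known.
    width = shutil.get_terminal_size(fallback=(80, 24)).columns
    pad = " " * indent
    lines = [title]
    for detail in details:
        lines.extend(
            textwrap.wrap(detail, width=width,
                          initial_indent=pad, subsequent_indent=pad)
        )
    return "\n".join(lines)
```

Printing the result of format_block for each block, separated by a single empty line, keeps the output consistent without hard-coding a width.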

If this is for a *nix environment, then I'd recommend reading Basics of Unix Philosophy. It's not specific to output but there are some good guidelines for command line programs in general.
Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
