For (incremental) loading performance I want to split a huge (believe me) generated BUILD.bazel into smaller .bzl files.
In each .bzl I then plan to have a macro foo which contains the actual rule calls:
def foo():
    foorule("a")
    foorule("b")
    ...
In the BUILD.bazel I then would have (a lot of) loads like:
load("foo.bzl", foo_0 = "foo")
load("other/foo.bzl", foo_1 = "foo")
...
and then trigger the rules in BUILD.bazel via:
foo_0()
foo_1()
Is this supposed to be faster than evaluating all rules inside of a symbol in .bzl?
foo = [
    foorule("a"),
    foorule("b"),
]
Or is there even a better way to load all the info in parallel?
If I am not mistaken, this should be slightly faster, but the best approach would be to split the rules into several packages so that the package loading itself can be parallelized.
Here, only the Skylark loading would be parallelized.
Each .bzl file can be loaded and evaluated in parallel. Evaluating a .bzl file consists of evaluating all of its top-level statements. If you have CPU-intensive computations, you can do them at the top level and store the result in a global value.
However, if .bzl files simply define some functions, there's almost nothing to parallelize (just loading the file from the disk and parsing it). I would expect no visible speedup in your case. Macros will not be evaluated in parallel.
Do you have more data? How many rules are there in your file (run bazel query :all | wc -l)? How long does it take to load the file? Are you sure the bottleneck is the loading phase?
If your BUILD file is huge, I would encourage you to split it. Create multiple BUILD files instead, if it's possible. You'll get more parallelism.
Related
I am learning Bazel and confused by many basic concepts.
load("//bazel/rules:build_tools.bzl", "build_tools_deps")
build_tools_deps()  // is build_tools_deps a macro or a rule?
load("@bazel_gazelle//:deps.bzl", "gazelle_dependencies")
gazelle_dependencies()  // what does the @ mean exactly? where is bazel_gazelle?
native.new_git_repository(...)  // what does native mean?
What definition is called a function? What definition is a rule?
A macro is a regular Starlark function that wraps (and expands to) rules.
def my_macro(name = ..., ...):
    native.cc_library(...)
    android_library(...)
    native.genrule(...)
Think of macros as a way to chain and group several rules together, which allows you to pipe the output of some rules into the input of others. At this level, you don't think about how a rule is implemented, but what kinds of inputs and outputs they are associated with.
On the other hand, a rule's declaration is done using the rule() function. cc_library, android_library and genrule are all rules. The rule implementation is abstracted in a regular function that accepts a single parameter for the rule context (ctx).
my_rule = rule(
    attrs = { ... },
    implementation = _my_rule_impl,
)

def _my_rule_impl(ctx):
    outfile = ctx.actions.declare_file(...)
    ctx.actions.run(...)
    return [DefaultInfo(files = depset([outfile]))]
Think of actions as a way to chain and group several command lines together, which works at the level of individual files and running your executables to transform them (ctx.actions.run with executable, args, inputs and outputs arguments). Within a rule implementation, you can extract information from rule attributes (ctx.attr) or from dependencies through providers (e.g. ctx.attr.deps[0][DefaultInfo].files).
Note that rules can only be called in BUILD files, not WORKSPACE files.
@ is the notation for a repository namespace. @bazel_gazelle is an external repository fetched in the WORKSPACE by a repository rule (not a regular rule), typically http_archive or git_repository. This repository rule can also be called from a macro, like my_macro above or build_tools_deps in your example.
native.<rule name> means that the rule is implemented in Java within Bazel and built into the binary, and not in Starlark.
I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it
might also be worth trying to parse multiple sections within a single
file in parallel. For this to be efficient there must be a fast way to
find the start and end points of such sections. For example, if you
are parsing a large serialized data structure, the format might allow
you to easily skip over segments within the file, so that you can chop
up the input into multiple independent parts that can be parsed in
parallel. Another example could be a programming language whose
grammar makes it easy to skip over a complete class or function
definition, e.g. by finding the closing brace or by interpreting the
indentation. In this case it might be worth not to parse the
definitions directly when they are encountered, but instead to skip
over them, push their text content into a queue and then to process
that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
The "obvious" thing that comes to mind, would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (your PDF looked like a record is a few lines long), then you can make a seq of records, or of batches of 1000 records or so, using normal FileStream reading. This would not need to know the details of the record format, but it would be convenient if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
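The batching idea can be sketched in Python (for illustration only, since the question is about FParsec): stream the file and hand out fixed-size batches of lines, so only one batch is ever in memory. The batch size of 1000 matches the number mentioned in the question.

```python
from itertools import islice

def line_batches(path, size=1000):
    # Lazily yield lists of up to `size` lines. The file is streamed;
    # it is never held in memory as a whole, so a 61 GB input is fine.
    with open(path) as f:
        while True:
            batch = list(islice(f, size))
            if not batch:
                return
            yield batch
```

Each batch can then be pushed to a queue and parsed by a pool of workers, which is exactly the pre-parse-then-parallelize scheme the FParsec docs describe.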
I wonder how multiple PDBs can be written into a single PDB file using the Biopython libraries. For reading multiple PDBs, such as NMR structures, there is content in the documentation, but I cannot find anything about writing. Does anybody have an idea?
Yes, you can. It's documented here.
Imagine you have a list of structure objects; let's name it structures. You might want to try:
from Bio import PDB

pdb_io = PDB.PDBIO()
target_file = 'all_struc.pdb'
with open(target_file, 'w') as open_file:
    for struct in structures:
        pdb_io.set_structure(struct[0])
        pdb_io.save(open_file)
That is the simplest solution for this problem. Some important things:
Different protein crystal structures have different coordinate systems, so you probably need to superimpose them, or apply some transformation function to compare them.
With pdb_io.set_structure you can select an entity, a chain, or even a bunch of atoms.
pdb_io.save has a secondary argument, which is a Select class instance. It will help you remove waters, heteroatoms, unwanted chains...
Be aware that NMR structures contain multiple models. You might want to select one. Hope this can help you.
Mithrado's solution may not actually achieve what you want. With his code, you will indeed write all the structures into a single file. However, it does so in a way that might not be readable by other software: it adds an "END" line after each structure, and many pieces of software will stop reading the file at that point, as that is how the PDB file format is specified.
A better solution, but still not perfect, is to remove a chain from one Structure and add it to a second Structure as a different chain. You can do this by:
# Get a list of the chains in a structure
chains = list(structure2.get_chains())
# Rename the chain (in my case, I rename from 'A' to 'B')
chains[0].id = 'B'
# Detach this chain from structure2
chains[0].detach_parent()
# Add it onto structure1
structure1[0].add(chains[0])
Note that you have to be careful that the name of the chain you're adding doesn't yet exist in structure1.
In my opinion, the Biopython library is poorly structured or non-intuitive in many respects, and this is just one example. Use something else if you can.
Inspired by Nate's solution, but adding multiple models to one structure, rather than multiple chains to one model:
ms = PDB.Structure.Structure("master")
i = 0
for structure in structures:
    for model in list(structure):
        new_model = model.copy()
        new_model.id = i
        new_model.serial_num = i + 1
        i += 1
        ms.add(new_model)

pdb_io = PDB.PDBIO()
pdb_io.set_structure(ms)
pdb_io.save("all.pdb")
I would like to succinctly diff a tree of configuration files, most of which are flat (i.e. key/value pairs) but some of which are XML, bash scripts or custom formats. The configuration information is almost always not ordered, and can contain whitespace and comments.
For flat files, doing a diff without whitespace or comments on sorted output gets very close to what I would like to do. For XML there are some tools available. However some custom formats have e.g. nested configuration. Order of the keys is not important, order of the subkeys is not important, but the tree structure is (much like XML). Others are very order-dependent.
How would you go about doing this if you had to do it often? Are there any tools out there that are general enough? What about rolling my own solution? The number of formats is not enormous (certainly not as bad as /etc), and the default is flat - perhaps some libmagic and filename matching, combined with custom parsers? Has anyone tried something like that?
One approach would be to try to solve 95% of the problem by doing a decent job on files with nested but unnordered structure and special-casing a few other types with existing tools. Can you suggest a mostly-works approach to simple nested files?
Some examples:
com.example.resource.host=foo
com.example.resource.port=8080
vs
com.example.resource.port=8080
com.example.resource.host=bar
//com.example.network.timeout=600
com.example.network.timeout=300
Should produce
< com.example.resource.host=foo
---
> com.example.resource.host=bar
> //com.example.network.timeout=600
> com.example.network.timeout=300
or optionally:
< com.example.resource.host=foo
---
> com.example.resource.host=bar
> com.example.network.timeout=300
as you would expect. However, something like:
Conf com.example.resource =
    Conf host = foo;
    Conf port = 8080;
;
vs
Conf com.example.resource =
    Conf port = 8080;
    Conf host = bar;
;
Conf com.example.network =
    Conf timeout = 300;
;
Should also work:
< Conf host = foo
---
> Conf host = bar
> Conf com.example.network =
>     Conf timeout = 300;
> ;
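For the flat case, a minimal sketch of the order-independent comparison looks like this (the parser and the `//` comment syntax are assumptions read off the examples above, not any existing tool): parse each file into a dict, ignoring comments and whitespace, then diff the dicts.

```python
def parse_flat(text):
    # key=value lines; lines starting with '//' are treated as comments,
    # matching the examples above
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs

def diff_flat(a, b):
    # (key, old, new) triples for every key whose value differs;
    # None marks a key missing on one side
    return [(k, a.get(k), b.get(k))
            for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)]
```

The nested case is the same idea one level down: parse each `Conf ... = ...;` block into a dict of dicts and recurse, so sibling order stops mattering while the tree structure is preserved.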
Each of your configuration files has a syntax and an implied semantics. It appears that what you want to do, is to compare configuration files by the implied semantics rather than the text.
The only way to do this is to have a custom parser for each configuration file type. Then you need to compare the files according to the implied semantics.
In general, this is pretty hard to do, certainly for real programming languages. We offer a compromise solution, called SmartDifferencers, that parse code according to the precise language syntax and then compare files according to the language structures (e.g., expressions, statements, declarations, methods, ...), reporting differences as abstract editing operations (move, copy, delete, insert, rename-identifier within a block). This gives succinct deltas (which is what you asked for) in terms that make sense to programmers, rather than the "this block of lines changed somehow" typical of diff.
These tools know the language syntax, and they know a tiny bit of the semantics; in particular, we try (and we're not completely there for all of them) to handle the notion of commutative language elements. In Java, the order of methods in a class is immaterial. In your case, the order of some configuration elements is likely immaterial. Our machinery can take this into account.
To do this for what you want, you need a separate parser for each type of configuration file, and separate knowledge for each type of when you can shuffle entries around safely. You need a separate tool for each type of configuration file to do this right. (Those files that are XML-based in effect need separate tools, because you are trying to distinguish semantics as well as syntax.) Your ideal solution, I think, is SmartDifferencers for each of your configuration file types.
I have a configuration file that I consider to be my "base" configuration. I'd like to compare up to 10 other configuration files against that single base file. I'm looking for a report where each file is compared against the base file.
I've been looking at diff and sdiff, but they don't completely offer what I am looking for.
I've considered diff'ing the base against each file individually, but my problem then becomes merging those into a report. Ideally, if the same line is missing in all 10 config files (when compared to the base config), I'd like that reported in an easy-to-visualize manner.
Notice that some rows are missing in several of the config files (when compared individually to the base). I'd like to be able to put those on the same line (as above).
Note, the screenshot above is simply a mockup, and not an actual application.
I've looked at using some Delphi controls for this and writing my own (I have Delphi 2007), but if there is a program that already does this, I'd prefer it.
The Delphi controls I've looked at are TDiff, and the TrmDiff* components included in rmcontrols.
For people who are still wondering how to do this, diffuse is the closest answer: it does N-way merging by displaying all files and doing three-way merges among neighbors.
None of the existing diff/merge tools will do what you want. Based on your sample screenshot you're looking for an algorithm that performs alignments over multiple files and gives appropriate weights based on line similarity.
The first issue is weighting the alignment based on line similarity. Most popular alignment algorithms, including the one used by GNU diff, TDiff, and TrmDiff, do an alignment based on line hashes, and just check whether the lines match exactly or not. You can pre-process the lines to remove whitespace or change everything to lower-case, but that's it. Add, remove, or change a letter, and the alignment thinks the entire line is different. Any alignment of different lines at that point is purely accidental.
Beyond Compare does take line similarity into account, but it really only works for 2-way comparisons. Compare It! also has some sort of similarity algorithm, but it is also limited to 2-way comparisons. Similarity checking can slow down the comparison dramatically, and I'm not aware of any other component or program, commercial or open source, that even tries.
The other issue is that you also want a multi-file comparison. That means either running the 2-way diff algorithm a bunch of times and stitching the results together or finding an algorithm that does multiple alignments at once.
Stitching will be difficult: your sample shows that the original file can have missing lines, so you'd need to compare every file to every other file to get a bunch of alignments, and then you'd need to work out the best way to match those alignments up. A naive stitching algorithm is pretty easy to do, but it will get messed up by trivial matches (blank lines, for example).
There are research papers that cover aligning multiple sequences at once, but they're usually focused on DNA comparisons, so you'd definitely have to code it up yourself. Wikipedia covers a lot of the basics, then you'd probably need to switch to Google Scholar.
Sequence alignment
Multiple sequence alignment
Gap penalty
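The line-similarity weighting described above can be sketched with Python's standard difflib. This only illustrates the idea of scoring near-matches; it is not what GNU diff or the Delphi components actually do (they compare line hashes, so similarity is all-or-nothing):

```python
import difflib

def line_similarity(a, b):
    # Returns 1.0 for identical lines and values near 0.0 for unrelated
    # ones. A hash-based aligner would only ever see 1.0 or 0.0 here.
    return difflib.SequenceMatcher(None, a, b).ratio()
```

A multi-file aligner would use such a score as the match weight in its alignment matrix, so that a line with one changed character still pairs up with its counterpart instead of being treated as a wholly different line.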
Try Scooter Software's Beyond Compare. It supports 3-way merge and is written in Delphi / Kylix for multi-platform support. I've used it pretty extensively (even over a VPN) and it's performed well.
for f in file1 file2 file3 file4 file5; do printf '%s\n\n' "$f" >> outF; diff "$f" baseFile >> outF; printf '\n\n' >> outF; done
Diff3 should help. If you're on Windows, you can use it from Cygwin or from diffutils.
I made my own diff tool, DirDiff, because I didn't want matching parts shown twice on screen, and I wanted differing parts above each other for easy comparison. You could use it in directory mode on a directory with an equal number of copies of the base file.
It doesn't render exports of diff's, but I'll list it as a feature request.
You might want to look at some Merge components as what you describe is exactly what Merge tools do between the common base, version control file and local file. Except that you want more than 2 files (+ base)...
Just my $0.02
SourceGear Diffmerge is nice (and free) for windows based file diffing.
I know this is an old thread but vimdiff does (almost) exactly what you're looking for with the added advantage of being able to edit the files right from the diff perspective.
But none of the solutions does more than 3 files still.
What I did was messier, but for the same purpose (comparing contents of multiple config files, no limit except memory and BASH variables)
While loop to read a file into an array:
loadsauce () {
    index=0
    while read SRCCNT[$index]
    do let index=index+1
    done < $SRC
}
Again for the target file
loadtarget () {
    index=0
    while read TRGCNT[$index]
    do let index=index+1
    done < $TRG
}
string comparison
brutediff () {
    # Brute force string compare, probably duplicates diff
    # This is very ugly but it will compare every line in SRC against every line in TRG
    # Grep might do better; version included for completeness
    for selement in $(seq 0 $((${#SRCCNT[@]} - 1)))
    do for telement in $(seq 0 $((${#TRGCNT[@]} - 1)))
       do [[ "${SRCCNT[$selement]}" == "${TRGCNT[$telement]}" ]] && echo "${SRCCNT[$selement]} is in ${SRC} and ${TRG}" >> $OUTMATCH
       done
    done
}
and finally a loop to do it against a list of files
for sauces in $(cat $SRCLIST)
do echo "Checking ${sauces}..."
   SRC=$sauces
   loadsauce
   loadtarget
   brutediff
   echo -n "Done, "
done
It's still untested/buggy and incomplete (like sorting out duplicates, or compiling for each line a list of the files that contain it), but it's definitely a move in the direction OP was asking for.
I do think Perl would be better for this though.