replicating trees between ACID RDB using CRDT - crdt

I'm interested in replicating "hierachies" of data say similar to addresses.
Area
District
Sector
Unit
but you may have different pieces of data associated to each layer, so you may know the area of Sectors, but not of units, and you may know the population of a unit, basically its not a homogenious tree.
I know little about replication of data except brushing Brewers theorem/CAP, and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB, into other ACID RDBs, systemically the system needs to eventually converge, and obviously each RDB will enforce its own local consistent view, but any 2 nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simple store all the data in a single message from some designated leader and distribute it...like an overnight dump and load process, but thats too big.
So the next simplest thing (I thought) was if something inside an area changes, I can export the complete set of data inside an area, and load it into the nodes, thats still quite a coarse algorithm.
The next step was if, say an 'object' at any level changed, was to send all the data in the path to that 'object', i.e. if something in a sector is amended, you would send the data associated to the sector, its parent the district, and its parent the sector (with some sort of version stamp and lets say last update wins)....what i wanted to do was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which potentially would be created if it didn't exist).
then i stumbled on CRDTs and thought....ah...I'm reinventing the wheel here, and the algorithms are allegedly easy in principle, but tricky to get correct in practice
are there standards accepted patterns to do this sort of thing?
In my use case the hierarchies are quite shallow, and there is only a single designated leader (at this time), I'm quite attracted to state based CRDTs because then I can ignore ordering.
Simplicity is the key requirement.

Actually it appears I've reinvented (in a very crude naive way) the SHELF algorithm.
I'll write some code and see if I can get it to work, and try to understand whats going on.

Related

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running by -dbc.filter, but I would like to look at the original data records and not the normalized ones in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now).
only a few normalizations ever supported this, most would fail.
there is no longer a well defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
it requires carrying over normalization information along, which makes data structures more complex (albeit the hierarchical approach we have now would allow this again)
due to numerical imprecision of floating point math, you would frequently not get out the exact same values as you put in
keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter "keep non-normalized data"; furthermore you would need to choose which (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: Add a (non-numerical) label column, and you can identify the original objects, in your original data, by this label.

NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number

I was not able to find why we should have a global innovation number for every new connection gene in NEAT.
From my little knowledge of NEAT, every innovation number corresponds directly with an node_in, node_out pair, so, why not only use this pair of ids instead of the innovation number? Which new information there is in this innovation number? chronology?
Update
Is it an algorithm optimization?
Note: this more of an extended comment than an answer.
You encountered a problem I also just encountered whilst developing a NEAT version for javascript. The original paper published in ~2002 is very unclear.
The original paper contains the following:
Whenever a new
gene appears (through structural mutation), a global innovation number is incremented
and assigned to that gene. The innovation numbers thus represent a chronology of the
appearance of every gene in the system. [..] ; innovation numbers are never changed. Thus, the historical origin of every
gene in the system is known throughout evolution.
But the paper is very unclear about the following case, say we have two ; 'identical' (same structure) networks:
The networks above were initial networks; the networks have the same innovation ID, namely [0, 1]. So now the networks randomly mutate an extra connection.
Boom! By chance, they mutated to the same new structure. However, the connection ID's are completely different, namely [0, 2, 3] for parent1 and [0, 4, 5] for parent2 as the ID is globally counted.
But the NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem.
Because the paper states:
In composing the offspring, genes are randomly chosen from veither parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents.
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5]. Which means that some nodes have double connections... Removing global innovation counters, and just assign id's by looking at node_in and node_out, you avoid this problem.
So when you have equally fit parents, yes you have optimized the algorithm. But this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here.
By the way, you can solve this problem by instead of assigning innovation ID's, assign ID based on node_in and node_out using pairing functions. This creates quite interesting neural networks when fitness is equal:
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a gene), as well as allowing crossover between the variable length genomes. Crossover is not necessary in NEAT, but it can be done, due to the innovation number.
I got all my answers from here:
http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
It's a good read
During crossover, we have to consider two genomes that share a connection between the two same nodes in their personal neural networks. How do we detect this collision without iterating both genome's connection genes over and over again for each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they are connecting the same two nodes because they received that connection from the same common ancestor.
Easy Example:
If I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would cause the phenotype to probably loop and crash, killing the genotype.
When I created my first implementation of NEAT I thought the same... why would you keep a innovation number tracker...? and why would you use it only for one generation? Wouldn't be better to not keep it at all and use a key value par with the nodes connected?
Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it will start its optimization in that moment. It marks its origin. If the same connection pops out in another generation, that will start its optimization then. Generation numbers try to separate the ones which come from a common ancestor, so the ones that have been optimized for many generations are not put side to side that one that was just generated. If a same connection is found in two genomes, that means that that gene comes from the same origin and thus, can be aligned.
Imagine then that you have your generation champion. Some of their genes will have 50 percent chance to be lost due that the aligned genes are treated equally.
What is better...? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current
generation?
In my implementation of NEAT, the record is only kept for a generation, but there
is nothing wrong with keeping them around forever. In fact, it may work better.
Here is the long explanation:
The reason I didn't keep the record around for the entire run in my
implementation of NEAT was because I felt that calling something the same
mutation that happened under completely different circumstances was not
intuitive. That is, it is likely that several generations down the line, the
"meaning" or contribution of the same connection relative to all the other
connections in a network is different than it would have been if it had appeared
generations ago. I used a single generation as a yardstick for this kind of
situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with
keeping innovations around forever. The main effect is to generate fewer species.
Conversely, not keeping them around leads to more species..some of them
representing the same thing but separated nonetheless. It is not currently clear
which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a
different name than one that appeared earlier in another just increases the
incompatibility of the species. This doesn't change things much since they were
incompatible to begin with. On the other hand, if the same species adds a
connection that it added in an earlier generation, that must mean some members of
the species had not adopted that connection yet...so now it is likely that the
first "version" of that connection that starts being helpful will win out, and
the other will die away. The third case is where a connection has already been
generally adopted by a species. In that case, there can be no mutation creating
the same connection in that species since it is already taken. The main point is,
you don't really expect too many truly similar structures with different markings
to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental
results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.

Does it help to duplicate original data in order to make more data for building model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider e.g. variance. The data set with the duplicated data will have the exact same variance - you don't have a more precise estimate of the distrbution afterwards.
There are, however, some exceptions. For example bootstrap validation helps when evaluating your model, but you have very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result since the log likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log likelihood, and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVM's, are such models. In that case you'll have to work out a different argument.)
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Omniture: Creating Specific Context Variables

Was wondering if anyone out there can help.......
My company works in the travel industry and one of the product we provide is the function of buying a flight and hotel together.
One of the advantages of this is that sometimes a visitor can save on a hotel if they buy the package together.
What I want to be able to track is the following:
The hotel which has the saving on it (accomodation code); the saving that they will make; the price of the package that they will pay.
I am new to implementing but have been told by a colleague that I can use a context variable.
Would anyone be able to tell me how I should write this please?
Kind Regards
Yaser
Here is the document entry for Context Data Variables
For example, in the custom code section of the on-page code, within s_doPlugins or via some wrapper function that ultimately makes a s.t() or s.tl() call, you would have:
s.contextData['package.code'] = "accommodation code";
s.contextData['package.savings'] = "savings";
s.contextData['package.price'] = "price";
Then in the interface you can go to processing rules and map them to whatever props or eVars you want.
Having said that...processing rules are pretty basic at the moment, and to be honest, it's not really worth it IMO. Firstly, you have to get certified (take an exam and pass) to even access processing rules. It's not that big a deal, but it's IMO a pointless hoop to jump through (tip: if you are going to go ahead and take this step, be sure to study up on more than just processing rules. Despite the fact that the exam/certification is supposed to be about processing rules, there are several questions that have little to nothing to do with them)
2nd, context data doesn't show up in reports by themselves. You must assign the values to actual props/eVars/events through processing rules (or get ClientCare to use them in a vista rule, which is significantly more powerful than a processing rule, but costs lots of money)
3rd, the processing rules are pretty basic. Seriously, you're limited to just simple stuff like straight duping, concatenating values, etc.
4th, processing rules are limited in setting events, and won't let you set the products string. IOW, You can set a basic (counter) event, but not a numeric or currency event (an event with a custom value associated with it). Reason I mention this is because those price and savings values might be good as a numeric or currency event for calculated metrics. Well since you can't set an event as such via processing rules, you'd have to set the events in your page code anyways.
The only real benefit here is if you're simply looking to dupe them into a prop/eVar and that prop/eVar varies from report suite to report suite (which FYI, most people try to keep them consistent across report suites anyways, and people rarely repurpose them).
So if you are already being consistent across multiple report suites (or only have like 1 report suite in the first place), since you're already having to put some code on the site, there's no real incentive to just pop the values in the first place.
I guess the overall point here is that since the overall goal is to get the values into actual props, eVars and possibly events, and processing rules fail on a lot of levels, there's no compelling reason not to just pop them in the first place.

Why do we use data structures? (when no dynamic allocation is needed)

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in it's own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way in which the calculations are done if the order of the items changes because the new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to both be more organized and extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it's first created keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help to understand. You could, for example, organize all of your important papers by putting each of them into separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store each in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimum, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for it's intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, using the appropriate method depending on the use of the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple though before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficent for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possible, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code sizes are very useful for performance to allow the code to be in the cache. Also, argument passing would result in excessive copying unless you'd figure out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to being broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit to datastructures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a datastructure that groups them all. This grouping allows for certain operations to be performed because they are structured together.
Not to mention, having datastructures can potentially enforce type security, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);

Resources