Reintroducing spaces into a document - alignment

Imagine we have some reference text on hand:
Four score and seven years ago our fathers brought forth on this
continent a new nation, conceived in liberty, and dedicated to the
proposition that all men are created equal. Now we are engaged in a
great civil war, testing whether that nation, or any nation, so
conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that
field, as a final resting place for those who here gave their lives
that that nation might live. It is altogether fitting and proper that
we should do this. But, in a larger sense, we can not dedicate, we can
not consecrate, we can not hallow this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our poor
power to add or detract. The world will little note, nor long remember
what we say here, but it can never forget what they did here. It is
for us the living, rather, to be dedicated here to the unfinished work
which they who fought here have thus far so nobly advanced. It is
rather for us to be here dedicated to the great task remaining before
us—that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion—that we
here highly resolve that these dead shall not have died in vain—that
this nation, under God, shall have a new birth of freedom—and that
government of the people, by the people, for the people, shall not
perish from the earth.
and we receive snippets of that text back with no spaces or punctuation, and with some characters deleted, inserted, and substituted:
ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn
Using the reference text, what are some tools (in any programming language) we can use to try to properly space the words?
ield as a final rTsting place for who fought here gave their liZes that that n
Correcting errors is not necessary, just spacing.

Weird problem you've got there :)
If you can't rely on capitalization for hints, just lowercase everything to start with.
Then get a dictionary of words. Maybe just a wordlist, or you could try Wordnet.
And a corpus of similar, correctly spaced, material. If suitable, download the Wikipedia dump. You'll need to clean it up and break into ngrams. 3 grams will probably suit the task. Or save yourself the time and use Google's ngram data. Either the web ngrams (paid) or the book ngrams (free-ish).
Set a maximum word length. Let's say 20 characters.
Take the first char of your mystery string and look it up in the dictionary. Then take the first 2 chars and look them up. Keep doing this until you get to 20. Store all matches you get, but the longest one is probably the best. Move the starting point 1 char at a time, through your string.
You'll end up with an array of valid word matches.
Loop through this new array and pair each value up with the following value, comparing it to the original string, so that you identify all possible valid word combinations that don't overlap. You might end up with 1 output string, or several.
If you've got several, break each output string into 3-grams. Then look them up in your new ngram database to see which combinations are most frequent.
There might also be some time-saving techniques, like starting with stop words, checking them in a dictionary combined with additional letters on either side, and adding spaces there first.
... or I'm over-thinking the whole issue and there's an awk one-liner that someone will humble me with :)
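The lookup-and-combine steps above could be sketched in Python roughly like this. It is only a minimal sketch under a few assumptions: the word list is built from the reference text instead of a general dictionary, the helper names (build_wordlist, segmentations) are made up for illustration, and the n-gram ranking step is left out.

```python
import re
from functools import lru_cache

MAX_WORD_LEN = 20  # the length cap suggested above


def build_wordlist(reference):
    """Lowercased set of words from the reference (stand-in for a real dictionary)."""
    return set(re.findall(r"[a-z]+", reference.lower()))


def segmentations(text, words, max_len=MAX_WORD_LEN):
    """Return every way to split `text` into non-overlapping dictionary words."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split the empty remainder
        results = []
        # Try every candidate word starting at `start`, up to the length cap.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            candidate = text[start:end]
            if candidate in words:
                for rest in split_from(end):
                    results.append([candidate] + rest)
        return results

    return [" ".join(parts) for parts in split_from(0)]


reference = "as a final resting place for those who here gave their lives"
words = build_wordlist(reference)
print(segmentations("asafinalrestingplace", words))
```

Note that exact dictionary lookups return no segmentation at all for a span containing a corrupted word (such as rTsting), and the number of candidate splits can grow quickly on long inputs, so this is more of a starting point than a full solution.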

You can do this using edit distance and finding the minimum edit distance substring of the reference. Check out my answer (PHP implementation) to a similar question here:
Longest Common Substring with wrong character tolerance
Using the shortest_edit_substring() function from the above link, you can add this to do the search after stripping out everything but letters (or whatever you want to keep in: letters, numbers, etc.) and then correctly map the result back to the original version.
// map a stripped down substring back to the original version
function map_substring($haystack_letters, $start, $length, $haystack, $regexp)
{
    $r_haystack = str_split($haystack);
    $r_haystack_letters = $r_haystack;
    // drop every character that matches the strip-out pattern, keeping the original keys
    foreach ($r_haystack as $k => $l)
    {
        if (preg_match($regexp, $l))
        {
            unset($r_haystack_letters[$k]);
        }
    }
    // key_map[i] is the position in $haystack of the i-th kept character
    $key_map = array_keys($r_haystack_letters);
    $real_start = $key_map[$start];
    $real_end = $key_map[$start + $length - 1];
    $real_length = $real_end - $real_start + 1;
    return array($real_start, $real_length);
}
$haystack = 'Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.';
$needle = 'ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn';
// strip out all non-letters
$regexp_to_strip_out = '/[^A-Za-z]/';
$haystack_letters = preg_replace($regexp_to_strip_out,'',$haystack);
list($start,$length) = shortest_edit_substring($needle,$haystack_letters);
list($real_start,$real_length) = map_substring($haystack_letters,$start,$length,$haystack,$regexp_to_strip_out);
printf("Found |%s| in |%s|, matching |%s|\n",substr($haystack,$real_start,$real_length),$haystack,$needle);
This will do the error correction as well; it's actually easier to do it than to not do it. The minimum edit distance search is pretty straightforward to implement in other languages if you want something faster than PHP.
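For readers who want the same idea outside PHP, here is a rough Python sketch of a minimum-edit-distance substring search followed by the mapping back to the original text. It is not the shortest_edit_substring() function from the linked answer; it is a standard approximate-matching dynamic program written from scratch, and the names best_substring and locate_in_reference are made up for illustration. It also lowercases both strings before comparing, which is an extra assumption.

```python
def best_substring(needle, haystack):
    """Return (start, end, distance) for a substring haystack[start:end]
    with minimum edit distance to `needle`."""
    m = len(needle)
    # prev_cost[i]: cheapest way to align needle[:i] with a substring ending
    # at the previous haystack position; prev_start[i]: where it begins.
    prev_cost = list(range(m + 1))
    prev_start = [0] * (m + 1)
    best = (0, 0, m)  # matching the empty substring costs m edits
    for j, ch in enumerate(haystack, start=1):
        cost = [0] * (m + 1)   # row 0 is free: a match may start anywhere
        start = [j] * (m + 1)
        for i in range(1, m + 1):
            sub = prev_cost[i - 1] + (needle[i - 1] != ch)
            ins = cost[i - 1] + 1     # needle char with no haystack partner
            dele = prev_cost[i] + 1   # haystack char with no needle partner
            cost[i] = min(sub, ins, dele)
            if cost[i] == sub:
                start[i] = prev_start[i - 1]
            elif cost[i] == ins:
                start[i] = start[i - 1]
            else:
                start[i] = prev_start[i]
        if cost[m] < best[2]:
            best = (start[m], j, cost[m])
        prev_cost, prev_start = cost, start
    return best


def locate_in_reference(needle, reference):
    """Find the needle in the letters-only reference, then map the match
    back to the original (spaced, punctuated) reference text."""
    letter_positions = [k for k, c in enumerate(reference)
                        if c.isascii() and c.isalpha()]
    letters = "".join(reference[k] for k in letter_positions).lower()
    start, end, dist = best_substring(needle.lower(), letters)
    if start == end:
        return "", dist
    real_start = letter_positions[start]
    real_end = letter_positions[end - 1] + 1
    return reference[real_start:real_end], dist


reference = ("We have come to dedicate a portion of that field, as a final "
             "resting place for those who here gave their lives that that "
             "nation might live.")
needle = "ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn"
print(locate_in_reference(needle, reference))
```

The search runs in time proportional to len(needle) * len(haystack), which should be fine for a reference of this size.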

daily quote for each page with latex

I want to create a calendar planner from 1/1/2019 to 31/2/2019:
In Main.tex I use \pgfcalendar{cal}{2019-01-01}{2019-02-31} to create the PDF.
One page = one day.
Now I want to add a different quote to each day.
How do I write the code in quote.tex and connect it to Main.tex?
Thanks
\pgfcalendar{cal}{2019-01-01}{2019-02-31}
{%
\LARGE\bfseries
\pgfcalendarweekdayname{\pgfcalendarcurrentweekday},
\pgfcalendarcurrentday{}.
\pgfcalendarmonthname{\pgfcalendarcurrentmonth}
\pgfcalendarcurrentyear{}
%}
\pagebreak
==========
I'd suggest setting up a daily quote in Excel with all the bells and whistles of formulas to extract the weekday, day, month and year from a date. I pulled some random quotes from the Internets to compile with each of the first 100 days...
\documentclass{article}
\usepackage{filecontents}
\begin{filecontents*}{daily_quote.csv}
Number,Date,Day,Month,Year,DayOfWeek,Quote,Author
1,2019-01-01,1,January,2019,Tuesday,"We are what we repeatedly do. Excellence, therefore, is not an act but a habit.",Aristotle
2,2019-01-02,2,January,2019,Wednesday,The best way out is always through.,Robert Frost
3,2019-01-03,3,January,2019,Thursday,Do not wait to strike till the iron is hot; but make it hot by striking.,William B Sprague
4,2019-01-04,4,January,2019,Friday,Great spirits have always encountered violent opposition from mediocre minds.,Albert Einstein
5,2019-01-05,5,January,2019,Saturday,"Whether you think you can or think you can't, you're right.",Henry Ford
6,2019-01-06,6,January,2019,Sunday,I know for sure that what we dwell on is who we become.,Oprah Winfrey
7,2019-01-07,7,January,2019,Monday,"I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times, I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed.",Michael Jordan
8,2019-01-08,8,January,2019,Tuesday,You must be the change you want to see in the world.,Mahatma Gandhi
9,2019-01-09,9,January,2019,Wednesday,What you get by achieving your goals is not as important as what you become by achieving your goals.,Goethe
10,2019-01-10,10,January,2019,Thursday,You can get everything in life you want if you will just help enough other people get what they want.,Zig Ziglar
11,2019-01-11,11,January,2019,Friday,"Whatever you do will be insignificant, but it is very important that you do it.",Mahatma Gandhi
12,2019-01-12,12,January,2019,Saturday,"Desire is the starting point of all achievement, not a hope, not a wish, but a keen pulsating desire which transcends everything.",Napoleon Hill
13,2019-01-13,13,January,2019,Sunday,Failure is the condiment that gives success its flavor.,Truman Capote
14,2019-01-14,14,January,2019,Monday,Vision without action is daydream. Action without vision is nightmare.,Japanese Proverb
15,2019-01-15,15,January,2019,Tuesday,"In any situation, the best thing you can do is the right thing; the next best thing you can do is the wrong thing; the worst thing you can do is nothing.",Theodore Roosevelt
16,2019-01-16,16,January,2019,Wednesday,"If you keep saying things are going to be bad, you have a chance of being a prophet.",Isaac B Singer
17,2019-01-17,17,January,2019,Thursday,Success consists of doing the common things of life uncommonly well.,Unknown
18,2019-01-18,18,January,2019,Friday,"Keep on going and the chances are you will stumble on something, perhaps when you are least expecting it. I have never heard of anyone stumbling on something sitting down.","Charles F Kettering, Engineer and Inventor"
19,2019-01-19,19,January,2019,Saturday,Twenty years from now you will be more disappointed by the things that you didn't do than by the ones you did do. So throw off the bowlines. Sail away from the safe harbor. Catch the trade winds in your sails. Explore. Dream. Discover.,Mark Twain
20,2019-01-20,20,January,2019,Sunday,Losers visualize the penalties of failure. Winners visualize the rewards of success.,Unknown
21,2019-01-21,21,January,2019,Monday,Some succeed because they are destined. Some succeed because they are determined.,Unknown
22,2019-01-22,22,January,2019,Tuesday,Experience is what you get when you don't get what you want.,Dan Stanford
23,2019-01-23,23,January,2019,Wednesday,Setting an example is not the main means of influencing others; it is the only means.,Albert Einstein
24,2019-01-24,24,January,2019,Thursday,"A happy person is not a person in a certain set of circumstances, but rather a person with a certain set of attitudes.",Hugh Downs
25,2019-01-25,25,January,2019,Friday,"If you're going to be able to look back on something and laugh about it, you might as well laugh about it now.",Marie Osmond
26,2019-01-26,26,January,2019,Saturday,"Remember that happiness is a way of travel, not a destination.",Roy Goodman
27,2019-01-27,27,January,2019,Sunday,"If you want to test your memory, try to recall what you were worrying about one year ago today.",E Joseph Cossman
28,2019-01-28,28,January,2019,Monday,What lies behind us and what lies before us are tiny matters compared to what lies within us.,Ralph Waldo Emerson
29,2019-01-29,29,January,2019,Tuesday,We judge of man's wisdom by his hope.,Ralph Waldo Emerson
30,2019-01-30,30,January,2019,Wednesday,The best way to cheer yourself up is to try to cheer somebody else up.,Mark Twain
31,2019-01-31,31,January,2019,Thursday,"Age is an issue of mind over matter. If you don't mind, it doesn't matter.",Mark Twain
32,2019-02-01,1,February,2019,Friday,"Whenever you find yourself on the side of the majority, it's time to pause and reflect.",Mark Twain
33,2019-02-02,2,February,2019,Saturday,"Keep away from people who try to belittle your ambitions. Small people always do that, but the really great make you feel that you, too, can become great.",Mark Twain
34,2019-02-03,3,February,2019,Sunday,The surest way not to fail is to determine to succeed.,Richard B Sheridan
35,2019-02-04,4,February,2019,Monday,"Take the first step in faith. You don't have to see the whole staircase, just take the first step.",Dr Martin Luther King Jr
36,2019-02-05,5,February,2019,Tuesday,Act or accept.,Anonymous
37,2019-02-06,6,February,2019,Wednesday,"Many great ideas go unexecuted, and many great executioners are without ideas. One without the other is worthless.",Tim Blixseth
38,2019-02-07,7,February,2019,Thursday,The world is more malleable than you think and it's waiting for you to hammer it into shape.,Bono
39,2019-02-08,8,February,2019,Friday,Sometimes you just got to give yourself what you wish someone else would give you.,Dr Phil
40,2019-02-09,9,February,2019,Saturday,"Motivation is a fire from within. If someone else tries to light that fire under you, chances are it will burn very briefly.",Stephen R Covey
41,2019-02-10,10,February,2019,Sunday,People become really quite remarkable when they start thinking that they can do things. When they believe in themselves they have the first secret of success.,Norman Vincent Peale
42,2019-02-11,11,February,2019,Monday,Whenever you find whole world against you just turn around and lead the world.,Anonymous
43,2019-02-12,12,February,2019,Tuesday,Being defeated is only a temporary condition; giving up is what makes it permanent.,"Marilyn vos Savant, Author and Advice Columnist"
44,2019-02-13,13,February,2019,Wednesday,I can't understand why people are frightened by new ideas. I'm frightened by old ones.,John Cage
45,2019-02-14,14,February,2019,Thursday,Setting an example is not the main means of influencing others; it is the only means.,Albert Einstein
46,2019-02-15,15,February,2019,Friday,The difference between ordinary and extraordinary is that little extra.,Unknown
47,2019-02-16,16,February,2019,Saturday,The best way to predict the future is to create it.,Unknown
48,2019-02-17,17,February,2019,Sunday,Anyone can do something when they WANT to do it. Really successful people do things when they don't want to do it.,Dr Phil
49,2019-02-18,18,February,2019,Monday,"There are two primary choices in life: to accept conditions as they exist, or accept the responsibility for changing them.",Dr Denis Waitley
50,2019-02-19,19,February,2019,Tuesday,Success is the ability to go from failure to failure without losing your enthusiasm.,Sir Winston Churchill
51,2019-02-20,20,February,2019,Wednesday,Success seems to be connected with action. Successful people keep moving. They make mistakes but don't quit.,Conrad Hilton
52,2019-02-21,21,February,2019,Thursday,Attitudes are contagious. Make yours worth catching.,Unknown
53,2019-02-22,22,February,2019,Friday,Do not let what you cannot do interfere with what you can do.,John Wooden
54,2019-02-23,23,February,2019,Saturday,"There are only two rules for being successful. One, figure out exactly what you want to do, and two, do it.",Mario Cuomo
55,2019-02-24,24,February,2019,Sunday,"Sooner or later, those who win are those who think they can.",Richard Bach
56,2019-02-25,25,February,2019,Monday,Vision doesn't usually come as a lightening bolt. Rather it comes as a slow crystallization of life challenges that we one day recognize as a beautiful diamond with great value to ourselves and others.,Dr Michael Norwood
57,2019-02-26,26,February,2019,Tuesday,"Success is a state of mind. If you want success, start thinking of yourself as a success.",Dr Joyce Brothers
58,2019-02-27,27,February,2019,Wednesday,Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.,Samuel Beckett
59,2019-02-28,28,February,2019,Thursday,Flops are a part of life's menu and I've never been a girl to miss out on any of the courses.,Rosalind Russell
60,2019-03-01,1,March,2019,Friday,Cause change and lead. Accept change and survive. Resist change and die.,"Ray Norda, Chairman, Novell"
61,2019-03-02,2,March,2019,Saturday,"Winners lose much more often than losers. So if you keep losing but you're still trying, keep it up! You're right on track.",Matthew Keith Groves
62,2019-03-03,3,March,2019,Sunday,"An idea can turn to dust or magic, depending on the talent that rubs against it.",Bill Bernbach
63,2019-03-04,4,March,2019,Monday,An obstacle is often a stepping stone.,Prescott
64,2019-03-05,5,March,2019,Tuesday,Life is trying things to see if they work,Ray Bradbury
65,2019-03-06,6,March,2019,Wednesday,"If you worry about yesterday's failures, then today's successes will be few.",Anonymous
66,2019-03-07,7,March,2019,Thursday,Life is 10% what happens to us and 90% how we react to it.,Dennis P Kimbro
67,2019-03-08,8,March,2019,Friday,"We are all inventors, each sailing out on a voyage of discovery, guided each by a private chart, of which there is no duplicate. The world is all gates, all opportunities.",Ralph Waldo Emerson
68,2019-03-09,9,March,2019,Saturday,Knowing is not enough; we must apply. Willing is not enough; we must do.,Johann Wolfgang von Goethe
69,2019-03-10,10,March,2019,Sunday,"In matters of style, swim with the current; in matters of principle, stand like a rock.",Thomas Jefferson
70,2019-03-11,11,March,2019,Monday,"I think and think for months and years. Ninety-nine times, the conclusion is false. The hundredth time I am right.",Albert Einstein
71,2019-03-12,12,March,2019,Tuesday,"Where the willingness is great, the difficulties cannot be great.",Machiavelli
72,2019-03-13,13,March,2019,Wednesday,Strength does not come from physical capacity. It comes from an indomitable will.,Mahatma Gandhi
73,2019-03-14,14,March,2019,Thursday,You are what you think about all day long.,Dr Robert Schuller
74,2019-03-15,15,March,2019,Friday,What you do speaks so loudly that I cannot hear what you say.,Ralph Waldo Emerson
75,2019-03-16,16,March,2019,Saturday,"Success is not to be measured by the position someone has reached in life, but the obstacles he has overcome while trying to succeed.",Booker T Washington
76,2019-03-17,17,March,2019,Sunday,"Talent is formed in solitude, character in the bustle of the world.",Johann Wolfgang von Goethe
77,2019-03-18,18,March,2019,Monday,"To avoid criticism do nothing, say nothing, be nothing.",Elbert Hubbard
78,2019-03-19,19,March,2019,Tuesday,"If you want to make your dreams come true, the first thing you have to do is wake up.",JM Power
79,2019-03-20,20,March,2019,Wednesday,By working faithfully eight hours a day you may eventually get to be boss and work twelve hours a day,Robert Frost
80,2019-03-21,21,March,2019,Thursday,"I've learned that no matter what happens, or how bad it seems today, life does go on, and it will be better tomorrow.",Maya Angelou
81,2019-03-22,22,March,2019,Friday,The art of being wise is the art of knowing what to overlook.,William James
82,2019-03-23,23,March,2019,Saturday,"When I hear somebody sigh, ‘Life is hard,' I am always tempted to ask, ‘Compared to what?'",Sydney Harris
83,2019-03-24,24,March,2019,Sunday,Don't let life discourage you; everyone who got where he is had to begin where he was.,Richard L Evans
84,2019-03-25,25,March,2019,Monday,In three words I can sum up everything I've learned about life: It goes on.,Robert Frost
85,2019-03-26,26,March,2019,Tuesday,"You gain strength, courage and confidence by every experience in which you stop to look fear in the face.",Eleanor Roosevelt
86,2019-03-27,27,March,2019,Wednesday,Sometimes even to live is an act of courage.,Seneca
87,2019-03-28,28,March,2019,Thursday,"Do first things first, and second things not at all.",Peter Drucker
88,2019-03-29,29,March,2019,Friday,The only people who find what they are looking for in life are the fault finders.,Foster's Law
89,2019-03-30,30,March,2019,Saturday,Defeat is not bitter unless you swallow it.,Joe Clark
90,2019-03-31,31,March,2019,Sunday,I am an optimist. It does not seem too much use being anything else.,Winston Churchill
91,2019-04-01,1,April,2019,Monday,Positive anything is better than negative thinking.,Elbert Hubbard
92,2019-04-02,2,April,2019,Tuesday,People seem not to see that their opinion of the world is also a confession of character.,Ralph Waldo Emerson
93,2019-04-03,3,April,2019,Wednesday,"Those who wish to sing, always find a song.",Swedish Proverb
94,2019-04-04,4,April,2019,Thursday,"If you're going through hell, keep going.",Winston Churchill
95,2019-04-05,5,April,2019,Friday,"The sun shines and warms and lights us and we have no curiosity to know why this is so; but we ask the reason of all evil, of pain, and hunger, and mosquitoes and silly people.",Ralph Waldo Emerson
96,2019-04-06,6,April,2019,Saturday,Life is a shipwreck but we must not forget to sing in the lifeboats.,Voltaire
97,2019-04-07,7,April,2019,Sunday,"Enduring habits I hate…. Yes, at the very bottom of my soul I feel grateful to all my misery and bouts of sickness and everything about me that is imperfect, because this sort of thing leaves me with a hundred backdoors through which I can escape from enduring habits.","Friedrich Nietzsche, The Gay Science, 1882"
98,2019-04-08,8,April,2019,Monday,There is no education like adversity.,Disraeli
99,2019-04-09,9,April,2019,Tuesday,He who has a why to live can bear almost any how.,Friedrich Nietzsche
100,2019-04-10,10,April,2019,Wednesday,Adversity introduces a man to himself.,Unknown
\end{filecontents*}
\usepackage{datatool}
\begin{document}
\DTLloadrawdb[keys = {
Number,
Date,
Day,
Month,
Year,
DayOfWeek,
Quote,
Author
}]
{DailyQuote} % DB name
{daily_quote.csv} % Filename
\raggedright
\DTLforeach
{DailyQuote}
{\Day = Day,
\Month = Month,
\Year = Year,
\DayOfWeek = DayOfWeek,
\Quote = Quote,
\Author = Author
}{
\clearpage
\section*{\DayOfWeek, \Month~\Day, \Year}
\Quote~\textit{\Author}
}
\end{document}

Cluster Analysis for crowds of people

I have location data from a large number of users (hundreds of thousands). I store the current position and a few historical data points (minute data going back one hour).
How would I go about detecting crowds that gather around natural events like birthday parties etc.? Even smaller crowds (let's say starting from 5 people) should be detected.
The algorithm needs to work in almost real time (or at least once a minute) to detect crowds as they happen.
I have looked into many cluster analysis algorithms, but most of them seem like a bad choice. They either take too long (I have seen O(n^3) and O(2^n)) or need to know how many clusters there are beforehand.
Can someone help me? Thank you!
Let each user be its own cluster. When a user gets within distance R of another user, form a new cluster, and split it again when the person leaves. You have your event when:
The number of people is greater than N
They are in the same place for a time greater than T
The party is not moving (movement might indicate public transport)
It's not located in a public service building (hospital, school, etc.)
(a good number of other conditions)
One minute is plenty of time to get it done, even for hundreds of thousands of people. A naive implementation would be O(n^2), but mind that there is no point in comparing the locations of all individuals, only those in a close neighbourhood. As a first approximation you can divide the "world" into sectors, which also makes it easy to run the task in parallel and, in turn, to scale. More users? Just add a few more nodes, and downscale at times of limited activity.
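The sector idea could look roughly like the Python sketch below. It assumes positions have already been projected to metres on a flat plane, it only covers the "more than N people within R" condition (not the time, movement or building checks), and the function name find_crowds is made up for illustration.

```python
from collections import defaultdict
from math import hypot


def find_crowds(positions, radius, min_people):
    """positions: dict user_id -> (x, y) in metres on a flat plane.
    Returns {user_id: set of neighbours within radius} for every user that
    has at least min_people - 1 neighbours."""
    # Bucket users into square sectors whose side equals the radius, so every
    # neighbour of a user lies in the user's sector or one of the 8 around it.
    grid = defaultdict(list)
    for uid, (x, y) in positions.items():
        grid[(int(x // radius), int(y // radius))].append(uid)

    crowds = {}
    for uid, (x, y) in positions.items():
        cx, cy = int(x // radius), int(y // radius)
        near = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in grid.get((cx + dx, cy + dy), ()):
                    ox, oy = positions[other]
                    if other != uid and hypot(x - ox, y - oy) <= radius:
                        near.add(other)
        if len(near) + 1 >= min_people:
            crowds[uid] = near
    return crowds


positions = {"a": (0, 0), "b": (10, 0), "c": (0, 10), "d": (10, 10),
             "e": (5, 5), "f": (1000, 1000)}
print(sorted(find_crowds(positions, radius=50, min_people=5)))
```

Each user is compared only with users in nine sectors, so the cost depends on local density instead of the total number of users, and separate sectors can be handed to separate workers.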
How to avoid the "single-linkage problem" mentioned by the author in the comments? One idea would be to think in terms of 'mass' and centre of gravity. First of all, do not mark something as an event until its mass is greater than, e.g., 15 units. Sure, location is imprecise, but in the case of events it should average out around the centre of the event. If your cluster grows in any direction without adding substantial mass, then most likely it isn't right. Look at methods like DBSCAN (density-based clustering); good inspiration can also be taken from physical systems, even the Ising model (here you think in terms of temperature and "flipping" someone to join the crowd). It is not a novel problem and I am sure there are papers that cover it (partially), e.g. Is There a Crowd? Experiences in Using Density-Based Clustering and Outlier Detection.
There is little use in doing a full clustering.
Just use a good database index.
Keep a database of the current positions.
Whenever you get a new coordinate, query the database with the desired radius, say 50 meters. A good index will do this in O(log n) for a small radius. If you get enough results, this may be an event, or someone joining an ongoing event.

NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number

I was not able to find why we should have a global innovation number for every new connection gene in NEAT.
From my little knowledge of NEAT, every innovation number corresponds directly to a node_in, node_out pair, so why not use just this pair of IDs instead of the innovation number? What new information is there in the innovation number? Chronology?
Update
Is it an algorithm optimization?
Note: this is more of an extended comment than an answer.
You encountered a problem I also just encountered whilst developing a NEAT version for JavaScript. The original paper, published around 2002, is very unclear.
The original paper contains the following:
Whenever a new
gene appears (through structural mutation), a global innovation number is incremented
and assigned to that gene. The innovation numbers thus represent a chronology of the
appearance of every gene in the system. [..] ; innovation numbers are never changed. Thus, the historical origin of every
gene in the system is known throughout evolution.
But the paper is very unclear about the following case. Say we have two 'identical' (same structure) networks:
The networks above were initial networks; the networks have the same innovation ID, namely [0, 1]. So now the networks randomly mutate an extra connection.
Boom! By chance, they mutated to the same new structure. However, the connection IDs are completely different, namely [0, 2, 3] for parent1 and [0, 4, 5] for parent2, as the ID is counted globally.
But the NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem.
Because the paper states:
In composing the offspring, genes are randomly chosen from either parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents.
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5], which means that some nodes have double connections... If you remove the global innovation counter and just assign IDs by looking at node_in and node_out, you avoid this problem.
So when you have equally fit parents, yes you have optimized the algorithm. But this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here.
By the way, you can solve this problem by assigning IDs based on node_in and node_out using pairing functions, instead of assigning innovation IDs. This creates quite interesting neural networks when fitness is equal:
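As a small illustration of the pairing-function idea, here is a Python sketch using the Cantor pairing function. That is just one possible choice of pairing function, picked here for illustration, and it assumes node IDs are non-negative integers.

```python
def cantor_pair(node_in, node_out):
    """Map an ordered pair of non-negative integers to a unique integer."""
    s = node_in + node_out
    return s * (s + 1) // 2 + node_out


# The same connection always gets the same ID, no matter when it appears,
# while the reversed connection gets a different one.
print(cantor_pair(2, 3), cantor_pair(3, 2))
```

With IDs like this, two genomes that independently mutate the same connection end up with matching genes, at the cost of losing the chronological information a global counter carries.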
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a gene), as well as allowing crossover between the variable length genomes. Crossover is not necessary in NEAT, but it can be done, due to the innovation number.
I got all my answers from here:
http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
It's a good read
During crossover, we have to consider two genomes that share a connection between the same two nodes in their respective neural networks. How do we detect this collision without iterating over both genomes' connection genes again and again at each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they are connecting the same two nodes because they received that connection from the same common ancestor.
Easy Example:
If I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would cause the phenotype to probably loop and crash, killing the genotype.
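A minimal Python sketch of that alignment, assuming each genome is stored as a dict keyed by innovation number and that one parent is strictly fitter. The gene layout (node_in, node_out, weight) and the function name crossover are made up for illustration, and the equal-fitness case is not handled.

```python
import random


def crossover(fitter, other, rng=random):
    """fitter, other: dict mapping innovation number -> gene.
    Matching genes are picked at random from either parent; disjoint and
    excess genes are taken from the fitter parent only."""
    child = {}
    for innovation, gene in fitter.items():
        if innovation in other:
            # Same innovation number: same historical origin, so align them.
            child[innovation] = rng.choice((gene, other[innovation]))
        else:
            child[innovation] = gene
    return child


# Genes are (node_in, node_out, weight) tuples in this sketch.
parent_a = {0: (0, 2, 0.5), 1: (1, 2, -0.3), 3: (0, 3, 0.1)}
parent_b = {0: (0, 2, 0.9), 2: (1, 3, 0.4)}
print(crossover(parent_a, parent_b))
```

Because the lookup is by dictionary key, finding the matching gene takes constant time per gene instead of a scan over the other genome.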
When I created my first implementation of NEAT I thought the same... why would you keep an innovation number tracker...? And why would you use it only for one generation? Wouldn't it be better to not keep it at all and use a key-value pair with the connected nodes?
Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it starts its optimization at that moment. That marks its origin. If the same connection pops up in another generation, it will start its optimization then. Generation numbers try to separate the ones which come from a common ancestor, so that the ones that have been optimized for many generations are not put side by side with one that was just generated. If the same connection is found in two genomes, that means that the gene comes from the same origin and thus can be aligned.
Imagine then that you have your generation champion. Some of its genes will have a 50 percent chance of being lost, because aligned genes are treated equally.
What is better...? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current
generation?
In my implementation of NEAT, the record is only kept for a generation, but there
is nothing wrong with keeping them around forever. In fact, it may work better.
Here is the long explanation:
The reason I didn't keep the record around for the entire run in my
implementation of NEAT was because I felt that calling something the same
mutation that happened under completely different circumstances was not
intuitive. That is, it is likely that several generations down the line, the
"meaning" or contribution of the same connection relative to all the other
connections in a network is different than it would have been if it had appeared
generations ago. I used a single generation as a yardstick for this kind of
situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with
keeping innovations around forever. The main effect is to generate fewer species.
Conversely, not keeping them around leads to more species..some of them
representing the same thing but separated nonetheless. It is not currently clear
which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a
different name than one that appeared earlier in another just increases the
incompatibility of the species. This doesn't change things much since they were
incompatible to begin with. On the other hand, if the same species adds a
connection that it added in an earlier generation, that must mean some members of
the species had not adopted that connection yet...so now it is likely that the
first "version" of that connection that starts being helpful will win out, and
the other will die away. The third case is where a connection has already been
generally adopted by a species. In that case, there can be no mutation creating
the same connection in that species since it is already taken. The main point is,
you don't really expect too many truly similar structures with different markings
to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental
results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.
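For illustration, the two record-keeping options discussed above could be sketched like this in Python. This is a minimal sketch under stated assumptions, not Stanley's code: the class name InnovationTracker, the keep_forever flag, and identifying a connection by its (in_node, out_node) pair are all hypothetical choices made for the example.

```python
class InnovationTracker:
    """Hands out innovation numbers for new connections (hypothetical helper)."""

    def __init__(self, keep_forever=False):
        # keep_forever is an assumed switch between the two options:
        # False -> the record is wiped after every generation,
        # True  -> the record is kept for the whole run.
        self.keep_forever = keep_forever
        self.next_number = 0
        self.record = {}  # (in_node, out_node) -> innovation number

    def number_for(self, in_node, out_node):
        """Return the recorded number for this connection, or assign a new one."""
        key = (in_node, out_node)
        if key not in self.record:
            self.record[key] = self.next_number
            self.next_number += 1
        return self.record[key]

    def end_generation(self):
        """Forget the record unless it is kept for the whole run."""
        if not self.keep_forever:
            self.record.clear()
```

With keep_forever=False, the same connection created in two different generations gets two different numbers, so the genes are not aligned; with keep_forever=True it gets the same number both times and the genes can be aligned.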

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
  for each (RATING that PERSON has made) do
    if (USER has rated the item that RATING refers to) do
      MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
    end
  end
end
lowest values in MATCHES are the best matches for USER
The problem is that as the number of items, ratings, and people increases, this code will take a very long time to run. Ignoring caching for now, it is also code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically, your task is matching long strings made of characters from a 6-letter alphabet (the ratings 0 to 5). This kind of problem is researched extensively in computational biology (typically with 4-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With rating, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of operations per comparison (one per item). However, a key trick of all the Hadoop solutions is to transpose the matrix and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will theoretically be 10000 times faster.
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem is still quadratic.
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.
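The transposition trick can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name match_scores and the in-memory layout {person: {item: rating}} are made up for the example, and the score is the summed absolute difference from the question's pseudo-code. In a database, the per-item index would correspond to an index on the item column of the ratings table.

```python
from collections import defaultdict


def match_scores(ratings, user):
    """Score every person who shares at least one rated item with `user`.

    ratings: {person: {item: rating}} (assumed layout, sparse by construction)
    Returns {person: (summed absolute difference, number of shared items)}.
    """
    # Transpose: item -> list of (person, rating), holding only real ratings.
    by_item = defaultdict(list)
    for person, items in ratings.items():
        for item, value in items.items():
            by_item[item].append((person, value))

    totals = defaultdict(int)
    shared = defaultdict(int)
    # Only walk the columns (items) that `user` actually rated.
    for item, user_value in ratings[user].items():
        for person, value in by_item[item]:
            if person != user:
                totals[person] += abs(value - user_value)
                shared[person] += 1
    return {person: (totals[person], shared[person]) for person in totals}
```

Returning the number of shared items next to the total is a design choice: a low total built on a single shared item is weaker evidence than the same total over many items, so the caller may want to normalise or apply a minimum overlap.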

Best way to detect and store path combinations for analysing purpose later

I am searching for ideas/examples on how to store path patterns from users - with the goal of analysing their behaviours and optimizing on "most used path" when we can detect them somehow.
E.g. which action they perform after which, so that we can later check whether certain actions are done over and over again, and then develop a shortcut or combine some of the actions into a single multi-action.
My first guess would be some sort of "simple log", perhaps stored in some SQL-manner, where we can keep each action as an index and then just record everything.
Problem is that the path/action might be dynamically changed - even while logging - so we need to be able to take care of this fact too, when looking for patterns later.
Would you log everything in bulk first and then post-process the details later, or do you have good experience with other tactics?
My worry is that this is going to take up a lot of space when logging 1000 users each day for a month or more.
Hope this makes sense. I am curious to see if anyone can provide sample code, pseudocode or perhaps links to something useful.
Our tools will be C#, SQL-database, XML and .NET 3.5 - clients could also get .NET 4.0 if needed.
Patterns examples as we expect them
...
User #1001: A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A
User #1002: B-A-A-B-C-E-F
User #1003: F-B-B-A-E-C-A-A-A
User #1002: C-E-F
...
etc. There is no real way to know what they will do next, how many actions they will use, or how often they will do it.
A secondary goal, if possible: if we later add a new action called G (just a sample to illustrate; there will be hundreds of actions), how could we detect this new behaviour's influence on the previous patterns?
To explain it better, my thought here is to find some way to detect "patterns within patterns", somewhat like how compression works, so that repetitive patterns are spotted. We don't know how long these patterns might be, nor how often they might occur. How do we break this down into small pieces? What is the best approach, do you think?
I am not sure what you mean by path, but if you gave every action in a path a unique symbol, you could reduce the problem to longest common substring or subsequence.
Or keep a map from each path to the number of times that path occurred. Every time a certain path happens, increment the count for that path. Then sort to find the most common.
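As a rough illustration of the longest-common-substring idea, here is a small Python sketch using the standard dynamic-programming approach. The function name longest_common_run is made up for the example, and each action is assumed to already be a single symbol in a list.

```python
def longest_common_run(a, b):
    """Return the longest run of consecutive actions that occurs in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one step earlier in both sequences.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


user_1001 = "A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A".split("-")
user_1002 = "B-A-A-B-C-E-F".split("-")
print("-".join(longest_common_run(user_1001, user_1002)))  # prints A-A-B-C-E-F
```

This compares two sequences at a time in O(len(a) * len(b)) time, so it suits spot checks between users better than a scan over the whole log.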
Pseudo idea/implementation so far
Log every user's actions into a list/series of actions, bulk style (text files/SQL, whatever; just store the whole thing for post-processing)
Start counting every "1 action", "2 actions", "3 actions" sequence up to a certain length (let's say 30 levels)
Sort them all, giving weights of importance to some of the actions (perhaps those producing end results)
A useful result perhaps?
If we count all [A], [A-A], [A-B], [A-C], [A-A-A], [A-A-B] etc., it is going to make a long, fine-grained list of which actions are frequently used in a row, and that is in the right direction: if some of these counts get too high, we might need a shorter path. The problem then is: what is too few actions to be worth optimizing, and what is the longest action list we need to search for? My guess is that we need to do this counting first, then examine the numbers.
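The counting step could look roughly like the Python sketch below. It is only an illustration: the function name count_runs and the max_len parameter (the "levels" above) are assumptions, the sessions are taken from the sample patterns in the question, and no importance weighting is applied.

```python
from collections import Counter


def count_runs(sessions, max_len=30):
    """Count every run of 1..max_len consecutive actions across all sessions."""
    counts = Counter()
    for actions in sessions:
        for n in range(1, max_len + 1):
            # Slide a window of n actions over the session.
            for start in range(len(actions) - n + 1):
                counts[tuple(actions[start:start + n])] += 1
    return counts


sessions = [
    "B-A-A-B-C-E-F".split("-"),
    "C-E-F".split("-"),
]
counts = count_runs(sessions, max_len=3)
print(counts[("C", "E", "F")])  # prints 2
```

counts.most_common() then gives the sorted list. Single actions will always dominate the raw counts, so comparing runs of the same length, or weighting by length, is probably needed before deciding what deserves a shortcut.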
The problem is that this would be part of an analysis tool we are developing, and we don't have data until it is implemented, so we don't know what to look for before it is actually done. Hmm... I wonder if there really IS an answer to this one.

Resources