Reintroducing spaces into a document - alignment
Imagine we have some reference text on hand:
Four score and seven years ago our fathers brought forth on this
continent a new nation, conceived in liberty, and dedicated to the
proposition that all men are created equal. Now we are engaged in a
great civil war, testing whether that nation, or any nation, so
conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that
field, as a final resting place for those who here gave their lives
that that nation might live. It is altogether fitting and proper that
we should do this. But, in a larger sense, we can not dedicate, we can
not consecrate, we can not hallow this ground. The brave men, living
and dead, who struggled here, have consecrated it, far above our poor
power to add or detract. The world will little note, nor long remember
what we say here, but it can never forget what they did here. It is
for us the living, rather, to be dedicated here to the unfinished work
which they who fought here have thus far so nobly advanced. It is
rather for us to be here dedicated to the great task remaining before
us—that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion—that we
here highly resolve that these dead shall not have died in vain—that
this nation, under God, shall have a new birth of freedom—and that
government of the people, by the people, for the people, shall not
perish from the earth.
and we receive snippets of that text back with no spaces or punctuation, and with some characters deleted, inserted, and substituted:
ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn
Using the reference text, what are some tools (in any programming language) we can use to try to properly space the words?
ield as a final rTsting place for who fought here gave their liZes that that n
Correcting errors is not necessary, just spacing.
Weird problem you've got there :)
If you can't rely on capitalization for hints, just lowercase everything to start with.
Then get a dictionary of words. Maybe just a wordlist, or you could try Wordnet.
And a corpus of similar, correctly spaced, material. If suitable, download the Wikipedia dump. You'll need to clean it up and break into ngrams. 3 grams will probably suit the task. Or save yourself the time and use Google's ngram data. Either the web ngrams (paid) or the book ngrams (free-ish).
Set a maximum word length. Let's say 20 characters.
Take the first char of your mystery string and look it up in the dictionary. Then take the first 2 chars and look them up. Keep doing this until you get to 20. Store all matches you get, but the longest one is probably the best. Move the starting point 1 char at a time, through your string.
You'll end up with an array of valid word matches.
Loop through this new array and pair each value up with the following value, comparing it to the original string, so that you identify all possible valid word combinations that don't overlap. You might end up with 1 output string, or several.
If you've got several, break each output string into 3-grams. Then look them up in your new ngram database to see which combinations are most frequent.
There might also be some time-saving techniques like starting with stop words, checking them in a dictionary combined with incremental letter either side, and adding spaces there first.
... or I'm over-thinking the whole issue and there's an awk one-liner that someone will humble me with :)
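The dictionary-lookup steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the function name `segment`, the tiny `WORDS` set, and the cost values are all assumptions made for the example. It prefers segmentations with fewer (hence longer) dictionary words and treats any character that fits no word as an unknown chunk. It does not do the n-gram ranking step.

```python
MAX_WORD_LEN = 20  # the cap suggested in the answer

# Hypothetical mini word list; in practice build it from a real dictionary
# or from the reference text itself.
WORDS = {"a", "as", "final", "resting", "place", "for", "those", "who",
         "here", "gave", "their", "lives", "that", "nation", "field"}

def segment(text, words, unknown_cost=10):
    """Split text into dictionary words, keeping unmatched characters as chunks."""
    text = text.lower()
    n = len(text)
    # best[i] = (cost, start of last piece, piece is a known word) for text[:i]
    best = [(0, 0, True)] + [None] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            piece = text[j:i]
            if piece in words:
                cost, known = best[j][0] + 1, True
            elif i - j == 1:
                # a single character that matches no word: expensive fallback
                cost, known = best[j][0] + unknown_cost, False
            else:
                continue
            if best[i] is None or cost < best[i][0]:
                best[i] = (cost, j, known)
    # walk back from the end to recover the chosen pieces
    pieces = []
    i = n
    while i > 0:
        _, j, known = best[i]
        pieces.append((text[j:i], known))
        i = j
    pieces.reverse()
    # merge runs of unknown single characters into one chunk
    out, prev_known = [], True
    for piece, known in pieces:
        if out and not known and not prev_known:
            out[-1] += piece
        else:
            out.append(piece)
        prev_known = known
    return " ".join(out)

print(segment("asafinalrestingplace", WORDS))  # as a final resting place
```

With corrupted input such as "rTsting" this sketch will still split badly, because short dictionary words can match inside the damaged word. That is where the n-gram ranking, or an alignment against the reference text, would have to take over.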
You can do this using edit distance and finding the minimum edit distance substring of the reference. Check out my answer (PHP implementation) to a similar question here:
Longest Common Substring with wrong character tolerance
Using the shortest_edit_substring() function from the above link, you can add this to do the search after stripping out everything but letters (or whatever you want to keep in: letters, numbers, etc.) and then correctly map the result back to the original version.
// map a stripped down substring back to the original version
function map_substring($haystack_letters,$start,$length,$haystack, $regexp)
{
    $r_haystack = str_split($haystack);
    $r_haystack_letters = $r_haystack;
    foreach($r_haystack as $k => $l)
    {
        if (preg_match($regexp,$l))
        {
            unset($r_haystack_letters[$k]);
        }
    }
    $key_map = array_keys($r_haystack_letters);
    $real_start = $key_map[$start];
    $real_end = $key_map[$start+$length-1];
    $real_length = $real_end - $real_start + 1;
    return array($real_start,$real_length);
}
$haystack = 'Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.';
$needle = 'ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn';
// strip out all non-letters
$regexp_to_strip_out = '/[^A-Za-z]/';
$haystack_letters = preg_replace($regexp_to_strip_out,'',$haystack);
list($start,$length) = shortest_edit_substring($needle,$haystack_letters);
list($real_start,$real_length) = map_substring($haystack_letters,$start,$length,$haystack,$regexp_to_strip_out);
printf("Found |%s| in |%s|, matching |%s|\n",substr($haystack,$real_start,$real_length),$haystack,$needle);
This will do the error correction as well; it's actually easier to do it than to not do it. The minimum edit distance search is pretty straightforward to implement in other languages if you want something faster than PHP.
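For readers who want the same idea outside PHP, here is a rough Python sketch. It is an assumption-laden stand-in, not the linked implementation: `shortest_edit_substring` below is my own dynamic-programming version (approximate substring matching with a free starting position), and `find_in_reference` is a hypothetical helper name that combines the search with the mapping back to the spaced text.

```python
import re

def shortest_edit_substring(needle, haystack):
    """Return (start, length, distance) of the haystack substring closest to needle."""
    m = len(haystack)
    # prev[j] = (cost, start) for matching the needle prefix seen so far
    # against some substring of haystack that ends at position j
    prev = [(0, j) for j in range(m + 1)]  # empty needle matches anywhere for free
    for i in range(1, len(needle) + 1):
        cur = [(i, 0)]
        for j in range(1, m + 1):
            sub = prev[j - 1][0] + (needle[i - 1] != haystack[j - 1])
            dele = prev[j][0] + 1     # extra character in the needle
            ins = cur[j - 1][0] + 1   # character missing from the needle
            cost = min(sub, dele, ins)
            if cost == sub:
                start = prev[j - 1][1]
            elif cost == dele:
                start = prev[j][1]
            else:
                start = cur[j - 1][1]
            cur.append((cost, start))
        prev = cur
    end = min(range(m + 1), key=lambda j: prev[j][0])
    cost, start = prev[end]
    return start, end - start, cost

def find_in_reference(needle, reference):
    """Return (matching slice of the original reference, edit distance)."""
    # positions of the letters we keep, so results can be mapped back
    positions = [mt.start() for mt in re.finditer(r"[A-Za-z]", reference)]
    letters = "".join(reference[p] for p in positions)
    start, length, cost = shortest_edit_substring(needle, letters)
    if length == 0:
        return "", cost
    return reference[positions[start]:positions[start + length - 1] + 1], cost

reference = "a final resting place for those who here gave their lives"
print(find_in_reference("finalrTstingplace", reference))
# ('final resting place', 1)
```

The table has len(needle) x len(haystack) cells, so a snippet of this size against the full address is only some tens of thousands of steps.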
Related
daily quote for each page with latex
I want to create a calendar planner from 1/1/2019 to 31/2/2019. In Main.tex I use \pgfcalendar{cal}{2019-01-01}{2019-02-31} to create the PDF, one page per day. Now I want to add a different quote to each day. How do I write the code in quote.tex and connect it to Main.tex? Thanks.
\pgfcalendar{cal}{2019-01-01}{2019-02-31}
{%
\LARGE\bfseries
\pgfcalendarweekdayname{\pgfcalendarcurrentweekday}, \pgfcalendarcurrentday{}. \pgfcalendarmonthname{\pgfcalendarcurrentmonth} \pgfcalendarcurrentyear{}
%}
\pagebreak
I'd suggest setting up a daily quote in Excel with all the bells and whistles of formulas to extract the weekday, day, month and year from a date. I pulled some random quotes from the Internets to compile with each of the first 100 days...
\documentclass{article}
\usepackage{filecontents}
\begin{filecontents*}{daily_quote.csv}
Number,Date,Day,Month,Year,DayOfWeek,Quote,Author
1,2019-01-01,1,January,2019,Tuesday,"We are what we repeatedly do. Excellence, therefore, is not an act but a habit.",Aristotle
2,2019-01-02,2,January,2019,Wednesday,The best way out is always through.,Robert Frost
3,2019-01-03,3,January,2019,Thursday,Do not wait to strike till the iron is hot; but make it hot by striking.,William B Sprague
4,2019-01-04,4,January,2019,Friday,Great spirits have always encountered violent opposition from mediocre minds.,Albert Einstein
5,2019-01-05,5,January,2019,Saturday,"Whether you think you can or think you can't, you're right.",Henry Ford
6,2019-01-06,6,January,2019,Sunday,I know for sure that what we dwell on is who we become.,Oprah Winfrey
7,2019-01-07,7,January,2019,Monday,"I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times, I've been trusted to take the game winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed.",Michael Jordan
8,2019-01-08,8,January,2019,Tuesday,You must be the change you want to see in the world.,Mahatma Gandhi
9,2019-01-09,9,January,2019,Wednesday,What you get by achieving your goals is not as important as what you become by achieving your goals.,Goethe
10,2019-01-10,10,January,2019,Thursday,You can get everything in life you want if you will just help enough other people get what they want.,Zig Ziglar
11,2019-01-11,11,January,2019,Friday,"Whatever you do will be insignificant, but it is very important that you do it.",Mahatma Gandhi
12,2019-01-12,12,January,2019,Saturday,"Desire is the starting point of all achievement, not a hope, not a wish, but a keen pulsating desire which transcends everything.",Napoleon Hill
13,2019-01-13,13,January,2019,Sunday,Failure is the condiment that gives success its flavor.,Truman Capote
14,2019-01-14,14,January,2019,Monday,Vision without action is daydream. Action without vision is nightmare.,Japanese Proverb
15,2019-01-15,15,January,2019,Tuesday,"In any situation, the best thing you can do is the right thing; the next best thing you can do is the wrong thing; the worst thing you can do is nothing.",Theodore Roosevelt
16,2019-01-16,16,January,2019,Wednesday,"If you keep saying things are going to be bad, you have a chance of being a prophet.",Isaac B Singer
17,2019-01-17,17,January,2019,Thursday,Success consists of doing the common things of life uncommonly well.,Unknown
18,2019-01-18,18,January,2019,Friday,"Keep on going and the chances are you will stumble on something, perhaps when you are least expecting it. I have never heard of anyone stumbling on something sitting down.","Charles F Kettering, Engineer and Inventor"
19,2019-01-19,19,January,2019,Saturday,Twenty years from now you will be more disappointed by the things that you didn't do than by the ones you did do. So throw off the bowlines. Sail away from the safe harbor. Catch the trade winds in your sails. Explore. Dream. Discover.,Mark Twain
20,2019-01-20,20,January,2019,Sunday,Losers visualize the penalties of failure. Winners visualize the rewards of success.,Unknown
21,2019-01-21,21,January,2019,Monday,Some succeed because they are destined. Some succeed because they are determined.,Unknown
22,2019-01-22,22,January,2019,Tuesday,Experience is what you get when you don't get what you want.,Dan Stanford
23,2019-01-23,23,January,2019,Wednesday,Setting an example is not the main means of influencing others; it is the only means.,Albert Einstein
24,2019-01-24,24,January,2019,Thursday,"A happy person is not a person in a certain set of circumstances, but rather a person with a certain set of attitudes.",Hugh Downs
25,2019-01-25,25,January,2019,Friday,"If you're going to be able to look back on something and laugh about it, you might as well laugh about it now.",Marie Osmond
26,2019-01-26,26,January,2019,Saturday,"Remember that happiness is a way of travel, not a destination.",Roy Goodman
27,2019-01-27,27,January,2019,Sunday,"If you want to test your memory, try to recall what you were worrying about one year ago today.",E Joseph Cossman
28,2019-01-28,28,January,2019,Monday,What lies behind us and what lies before us are tiny matters compared to what lies within us.,Ralph Waldo Emerson
29,2019-01-29,29,January,2019,Tuesday,We judge of man's wisdom by his hope.,Ralph Waldo Emerson
30,2019-01-30,30,January,2019,Wednesday,The best way to cheer yourself up is to try to cheer somebody else up.,Mark Twain
31,2019-01-31,31,January,2019,Thursday,"Age is an issue of mind over matter. If you don't mind, it doesn't matter.",Mark Twain
32,2019-02-01,1,February,2019,Friday,"Whenever you find yourself on the side of the majority, it's time to pause and reflect.",Mark Twain
33,2019-02-02,2,February,2019,Saturday,"Keep away from people who try to belittle your ambitions. Small people always do that, but the really great make you feel that you, too, can become great.",Mark Twain
34,2019-02-03,3,February,2019,Sunday,The surest way not to fail is to determine to succeed.,Richard B Sheridan
35,2019-02-04,4,February,2019,Monday,"Take the first step in faith. You don't have to see the whole staircase, just take the first step.",Dr Martin Luther King Jr
36,2019-02-05,5,February,2019,Tuesday,Act or accept.,Anonymous
37,2019-02-06,6,February,2019,Wednesday,"Many great ideas go unexecuted, and many great executioners are without ideas. One without the other is worthless.",Tim Blixseth
38,2019-02-07,7,February,2019,Thursday,The world is more malleable than you think and it's waiting for you to hammer it into shape.,Bono
39,2019-02-08,8,February,2019,Friday,Sometimes you just got to give yourself what you wish someone else would give you.,Dr Phil
40,2019-02-09,9,February,2019,Saturday,"Motivation is a fire from within. If someone else tries to light that fire under you, chances are it will burn very briefly.",Stephen R Covey
41,2019-02-10,10,February,2019,Sunday,People become really quite remarkable when they start thinking that they can do things. When they believe in themselves they have the first secret of success.,Norman Vincent Peale
42,2019-02-11,11,February,2019,Monday,Whenever you find whole world against you just turn around and lead the world.,Anonymous
43,2019-02-12,12,February,2019,Tuesday,Being defeated is only a temporary condition; giving up is what makes it permanent.,"Marilyn vos Savant, Author and Advice Columnist"
44,2019-02-13,13,February,2019,Wednesday,I can't understand why people are frightened by new ideas. I'm frightened by old ones.,John Cage
45,2019-02-14,14,February,2019,Thursday,Setting an example is not the main means of influencing others; it is the only means.,Albert Einstein
46,2019-02-15,15,February,2019,Friday,The difference between ordinary and extraordinary is that little extra.,Unknown
47,2019-02-16,16,February,2019,Saturday,The best way to predict the future is to create it.,Unknown
48,2019-02-17,17,February,2019,Sunday,Anyone can do something when they WANT to do it. Really successful people do things when they don't want to do it.,Dr Phil
49,2019-02-18,18,February,2019,Monday,"There are two primary choices in life: to accept conditions as they exist, or accept the responsibility for changing them.",Dr Denis Waitley
50,2019-02-19,19,February,2019,Tuesday,Success is the ability to go from failure to failure without losing your enthusiasm.,Sir Winston Churchill
51,2019-02-20,20,February,2019,Wednesday,Success seems to be connected with action. Successful people keep moving. They make mistakes but don't quit.,Conrad Hilton
52,2019-02-21,21,February,2019,Thursday,Attitudes are contagious. Make yours worth catching.,Unknown
53,2019-02-22,22,February,2019,Friday,Do not let what you cannot do interfere with what you can do.,John Wooden
54,2019-02-23,23,February,2019,Saturday,"There are only two rules for being successful. One, figure out exactly what you want to do, and two, do it.",Mario Cuomo
55,2019-02-24,24,February,2019,Sunday,"Sooner or later, those who win are those who think they can.",Richard Bach
56,2019-02-25,25,February,2019,Monday,Vision doesn't usually come as a lightening bolt. Rather it comes as a slow crystallization of life challenges that we one day recognize as a beautiful diamond with great value to ourselves and others.,Dr Michael Norwood
57,2019-02-26,26,February,2019,Tuesday,"Success is a state of mind. If you want success, start thinking of yourself as a success.",Dr Joyce Brothers
58,2019-02-27,27,February,2019,Wednesday,Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.,Samuel Beckett
59,2019-02-28,28,February,2019,Thursday,Flops are a part of life's menu and I've never been a girl to miss out on any of the courses.,Rosalind Russell
60,2019-03-01,1,March,2019,Friday,Cause change and lead. Accept change and survive. Resist change and die.,"Ray Norda, Chairman, Novell"
61,2019-03-02,2,March,2019,Saturday,"Winners lose much more often than losers. So if you keep losing but you're still trying, keep it up! You're right on track.",Matthew Keith Groves
62,2019-03-03,3,March,2019,Sunday,"An idea can turn to dust or magic, depending on the talent that rubs against it.",Bill Bernbach
63,2019-03-04,4,March,2019,Monday,An obstacle is often a stepping stone.,Prescott
64,2019-03-05,5,March,2019,Tuesday,Life is trying things to see if they work,Ray Bradbury
65,2019-03-06,6,March,2019,Wednesday,"If you worry about yesterday's failures, then today's successes will be few.",Anonymous
66,2019-03-07,7,March,2019,Thursday,Life is 10% what happens to us and 90% how we react to it.,Dennis P Kimbro
67,2019-03-08,8,March,2019,Friday,"We are all inventors, each sailing out on a voyage of discovery, guided each by a private chart, of which there is no duplicate. The world is all gates, all opportunities.",Ralph Waldo Emerson
68,2019-03-09,9,March,2019,Saturday,Knowing is not enough; we must apply. Willing is not enough; we must do.,Johann Wolfgang von Goethe
69,2019-03-10,10,March,2019,Sunday,"In matters of style, swim with the current; in matters of principle, stand like a rock.",Thomas Jefferson
70,2019-03-11,11,March,2019,Monday,"I think and think for months and years. Ninety-nine times, the conclusion is false. The hundredth time I am right.",Albert Einstein
71,2019-03-12,12,March,2019,Tuesday,"Where the willingness is great, the difficulties cannot be great.",Machiavelli
72,2019-03-13,13,March,2019,Wednesday,Strength does not come from physical capacity. It comes from an indomitable will.,Mahatma Gandhi
73,2019-03-14,14,March,2019,Thursday,You are what you think about all day long.,Dr Robert Schuller
74,2019-03-15,15,March,2019,Friday,What you do speaks so loudly that I cannot hear what you say.,Ralph Waldo Emerson
75,2019-03-16,16,March,2019,Saturday,"Success is not to be measured by the position someone has reached in life, but the obstacles he has overcome while trying to succeed.",Booker T Washington
76,2019-03-17,17,March,2019,Sunday,"Talent is formed in solitude, character in the bustle of the world.",Johann Wolfgang von Goethe
77,2019-03-18,18,March,2019,Monday,"To avoid criticism do nothing, say nothing, be nothing.",Elbert Hubbard
78,2019-03-19,19,March,2019,Tuesday,"If you want to make your dreams come true, the first thing you have to do is wake up.",JM Power
79,2019-03-20,20,March,2019,Wednesday,By working faithfully eight hours a day you may eventually get to be boss and work twelve hours a day,Robert Frost
80,2019-03-21,21,March,2019,Thursday,"I've learned that no matter what happens, or how bad it seems today, life does go on, and it will be better tomorrow.",Maya Angelou
81,2019-03-22,22,March,2019,Friday,The art of being wise is the art of knowing what to overlook.,William James
82,2019-03-23,23,March,2019,Saturday,"When I hear somebody sigh, ‘Life is hard,' I am always tempted to ask, ‘Compared to what?'",Sydney Harris
83,2019-03-24,24,March,2019,Sunday,Don't let life discourage you; everyone who got where he is had to begin where he was.,Richard L Evans
84,2019-03-25,25,March,2019,Monday,In three words I can sum up everything I've learned about life: It goes on.,Robert Frost
85,2019-03-26,26,March,2019,Tuesday,"You gain strength, courage and confidence by every experience in which you stop to look fear in the face.",Eleanor Roosevelt
86,2019-03-27,27,March,2019,Wednesday,Sometimes even to live is an act of courage.,Seneca
87,2019-03-28,28,March,2019,Thursday,"Do first things first, and second things not at all.",Peter Drucker
88,2019-03-29,29,March,2019,Friday,The only people who find what they are looking for in life are the fault finders.,Foster's Law
89,2019-03-30,30,March,2019,Saturday,Defeat is not bitter unless you swallow it.,Joe Clark
90,2019-03-31,31,March,2019,Sunday,I am an optimist. It does not seem too much use being anything else.,Winston Churchill
91,2019-04-01,1,April,2019,Monday,Positive anything is better than negative thinking.,Elbert Hubbard
92,2019-04-02,2,April,2019,Tuesday,People seem not to see that their opinion of the world is also a confession of character.,Ralph Waldo Emerson
93,2019-04-03,3,April,2019,Wednesday,"Those who wish to sing, always find a song.",Swedish Proverb
94,2019-04-04,4,April,2019,Thursday,"If you're going through hell, keep going.",Winston Churchill
95,2019-04-05,5,April,2019,Friday,"The sun shines and warms and lights us and we have no curiosity to know why this is so; but we ask the reason of all evil, of pain, and hunger, and mosquitoes and silly people.",Ralph Waldo Emerson
96,2019-04-06,6,April,2019,Saturday,Life is a shipwreck but we must not forget to sing in the lifeboats.,Voltaire
97,2019-04-07,7,April,2019,Sunday,"Enduring habits I hate…. Yes, at the very bottom of my soul I feel grateful to all my misery and bouts of sickness and everything about me that is imperfect, because this sort of thing leaves me with a hundred backdoors through which I can escape from enduring habits.","Friedrich Nietzsche, The Gay Science, 1882"
98,2019-04-08,8,April,2019,Monday,There is no education like adversity.,Disraeli
99,2019-04-09,9,April,2019,Tuesday,He who has a why to live can bear almost any how.,Friedrich Nietzsche
100,2019-04-10,10,April,2019,Wednesday,Adversity introduces a man to himself.,Unknown
\end{filecontents*}
\usepackage{datatool}
\begin{document}
\DTLloadrawdb[keys = { Number, Date, Day, Month, Year, DayOfWeek, Quote, Author }]
  {DailyQuote} % DB name
  {daily_quote.csv} % Filename
\raggedright
\DTLforeach
  {DailyQuote}
  {\Day = Day, \Month = Month, \Year = Year, \DayOfWeek = DayOfWeek, \Quote = Quote, \Author = Author }{
  \clearpage
  \section*{\DayOfWeek, \Month~\Day, \Year}
  \Quote~\textit{\Author}
}
\end{document}
Cluster Analysis for crowds of people
I have location data from a large number of users (hundreds of thousands). I store the current position and a few historical data points (minute data going back one hour). How would I go about detecting crowds that gather around natural events like birthday parties etc.? Even smaller crowds (let's say starting from 5 people) should be detected. The algorithm needs to work in almost real time (or at least once a minute) to detect crowds as they happen. I have looked into many cluster analysis algorithms, but most of them seem like a bad choice. They either take too long (I have seen O(n^3) and O(2^n)) or need to know how many clusters there are beforehand. Can someone help me? Thank you!
Let each user be its own cluster. When she gets within distance R of another user, form a new cluster, and separate again when the person leaves. You have your event when: the number of people is greater than N; they are in the same place for a time greater than T; the party is not moving (movement might indicate public transport); it's not located in a public service building (hospital, school, etc.); plus a good number of other conditions.
One minute is plenty of time to get this done, even for hundreds of thousands of people. A naive implementation would be O(n^2), but there is no point in comparing the location of every individual with every other, only with those in the close neighbourhood. As a first approximation you can divide the "world" into sectors, which also makes it easy to parallelize the task and, in turn, to scale. More users? Just add a few more nodes, and downscale at times of limited activity.
How to avoid the "single-linkage problem" mentioned by the author in the comments? One idea would be to think in terms of 'mass' and centre of gravity. First of all, do not mark something as an event until the mass is greater than, e.g., 15 units. Sure, location is imprecise, but in the case of events it should average around the centre of the event. If your cluster grows in any direction without adding substantial mass, then most likely it isn't right. Look at methods like DBSCAN (density-based clustering); good inspiration can also be taken from physical systems, even the Ising model (here you think in terms of temperature and "flipping" someone to join the crowd).
It is not a novel problem and I am sure there are papers that cover it (partially), e.g. Is There a Crowd? Experiences in Using Density-Based Clustering and Outlier Detection.
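The "divide the world into sectors" idea can be sketched roughly like this. It is only an illustration under stated assumptions: `find_crowds` is a hypothetical name, positions are taken to be already projected to planar metres, and only the distance R and head-count N conditions are checked (not the time T, the movement, or the building rules). Users are bucketed into square cells of side R, so each user is compared only with users in the same or adjacent cells.

```python
from collections import defaultdict

def find_crowds(positions, radius, min_people):
    """positions: {user_id: (x, y)} in metres. Returns a list of sets of user ids."""
    # bucket users into square cells whose side equals the linking radius
    grid = defaultdict(list)
    for uid, (x, y) in positions.items():
        grid[(int(x // radius), int(y // radius))].append(uid)

    # union-find over users: linked users end up with the same root
    parent = {uid: uid for uid in positions}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    r2 = radius * radius
    for (cx, cy), members in grid.items():
        # anyone within `radius` must be in this cell or one of its 8 neighbours
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for v in grid.get((cx + dx, cy + dy), ()):
                    vx, vy = positions[v]
                    for u in members:
                        ux, uy = positions[u]
                        if (ux - vx) ** 2 + (uy - vy) ** 2 <= r2:
                            parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for uid in positions:
        clusters[find(uid)].add(uid)
    return [c for c in clusters.values() if len(c) >= min_people]

positions = {"a": (0, 0), "b": (3, 4), "c": (6, 8), "d": (100, 100), "e": (103, 100)}
print(find_crowds(positions, radius=5, min_people=3))  # one crowd: a, b and c
```

Note that this links users transitively ("a" and "c" are 10 m apart but joined through "b"), which is exactly the single-linkage behaviour discussed above; a mass or density test would go on top of it.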
There is little use in doing a full clustering. Just use a good database index. Keep a database of the current positions. Whenever you get a new coordinate, query the database with the desired radius, say 50 meters. A good index will do this in O(log n) for a small radius. If you get enough results, this may be an event, or someone joining an ongoing event.
NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number
I was not able to find out why we should have a global innovation number for every new connection gene in NEAT. From my limited knowledge of NEAT, every innovation number corresponds directly to a (node_in, node_out) pair, so why not use this pair of IDs instead of the innovation number? What new information is there in this innovation number? Chronology? Update: is it an algorithm optimization?
Note: this is more of an extended comment than an answer. You encountered a problem I also just encountered whilst developing a NEAT version for JavaScript. The original paper, published around 2002, is very unclear on this point. It contains the following:
"Whenever a new gene appears (through structural mutation), a global innovation number is incremented and assigned to that gene. The innovation numbers thus represent a chronology of the appearance of every gene in the system. [..] innovation numbers are never changed. Thus, the historical origin of every gene in the system is known throughout evolution."
But the paper is very unclear about the following case. Say we have two 'identical' (same structure) networks. These are the initial networks, and they have the same innovation IDs, namely [0, 1]. Now the networks randomly mutate an extra connection. Boom! By chance, they mutate to the same new structure. However, the connection IDs are completely different, namely [0, 2, 3] for parent1 and [0, 4, 5] for parent2, as the ID is globally counted. The NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem, because the paper states:
"In composing the offspring, genes are randomly chosen from either parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents."
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5], which means that some nodes have double connections. If you remove the global innovation counter and just assign IDs by looking at node_in and node_out, you avoid this problem. So when you have equally fit parents, yes, you have optimized the algorithm. But this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here. By the way, you can solve this problem by assigning IDs based on node_in and node_out using pairing functions instead of assigning innovation IDs. This creates quite interesting neural networks when fitness is equal.
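As a small illustration of the pairing-function idea, here is a sketch using the Cantor pairing function. Which pairing function the answerer actually used is not stated, so this choice and the name `connection_id` are assumptions. The point is that the ID depends only on the ordered (node_in, node_out) pair, so two genomes that independently grow the same connection get the same ID.

```python
def connection_id(node_in, node_out):
    """Cantor pairing: maps an ordered pair of non-negative ints to a unique int."""
    s = node_in + node_out
    return s * (s + 1) // 2 + node_out

# Two parents that independently add a connection from node 2 to node 3
# end up with the same ID, so crossover can recognise the genes as matching.
parent1_gene = connection_id(2, 3)
parent2_gene = connection_id(2, 3)
print(parent1_gene, parent2_gene)   # 18 18

# Direction matters: 3 -> 2 is a different connection with a different ID.
print(connection_id(3, 2))          # 17
```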
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a gene), as well as allowing crossover between the variable length genomes. Crossover is not necessary in NEAT, but it can be done, due to the innovation number. I got all my answers from here: http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf It's a good read
During crossover, we have to consider two genomes that share a connection between the same two nodes in their respective neural networks. How do we detect this collision without iterating over both genomes' connection genes again and again for each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they connect the same two nodes because they received that connection from the same common ancestor. Easy example: if I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would probably cause the phenotype to loop and crash, killing the genotype.
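A rough sketch of that lookup, under assumptions of my own: genomes are stored as dictionaries keyed by innovation number, each gene is a small dict, and `crossover` is a hypothetical helper rather than any NEAT library's API. Matching genes are found with one dictionary lookup each, and disjoint or excess genes come only from the fitter parent, so the same gene can never be taken twice.

```python
import random

def crossover(fitter, weaker, rng=random):
    """Build a child genome from two parents keyed by innovation number."""
    child = {}
    for innovation, gene in fitter.items():
        other = weaker.get(innovation)  # O(1) match on the innovation number
        if other is not None:
            # matching gene: inherit one of the two alleles at random
            child[innovation] = dict(rng.choice((gene, other)))
        else:
            # disjoint or excess gene: take it from the fitter parent
            child[innovation] = dict(gene)
    return child

fitter = {
    1: {"in": 0, "out": 2, "weight": 0.5},
    2: {"in": 1, "out": 2, "weight": 0.1},
    4: {"in": 0, "out": 3, "weight": -0.7},
}
weaker = {
    1: {"in": 0, "out": 2, "weight": -0.3},
    3: {"in": 1, "out": 3, "weight": 0.9},
}
child = crossover(fitter, weaker)
print(sorted(child))  # [1, 2, 4]
```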
When I created my first implementation of NEAT I thought the same... why would you keep an innovation number tracker? And why would you use it only for one generation? Wouldn't it be better to not keep it at all and use a key-value pair with the nodes connected? Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it will start its optimization in that moment. It marks its origin. If the same connection pops out in another generation, that will start its optimization then. Generation numbers try to separate the ones which come from a common ancestor, so the ones that have been optimized for many generations are not put side by side with one that was just generated. If the same connection is found in two genomes, that means that the gene comes from the same origin and thus can be aligned. Imagine then that you have your generation champion. Some of its genes will have a 50 percent chance of being lost because aligned genes are treated equally. What is better? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current generation?
In my implementation of NEAT, the record is only kept for a generation, but there is nothing wrong with keeping them around forever. In fact, it may work better. Here is the long explanation:
The reason I didn't keep the record around for the entire run in my implementation of NEAT was because I felt that calling something the same mutation that happened under completely different circumstances was not intuitive. That is, it is likely that several generations down the line, the "meaning" or contribution of the same connection relative to all the other connections in a network is different than it would have been if it had appeared generations ago. I used a single generation as a yardstick for this kind of situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with keeping innovations around forever. The main effect is to generate fewer species. Conversely, not keeping them around leads to more species..some of them representing the same thing but separated nonetheless. It is not currently clear which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a different name than one that appeared earlier in another just increases the incompatibility of the species. This doesn't change things much since they were incompatible to begin with. On the other hand, if the same species adds a connection that it added in an earlier generation, that must mean some members of the species had not adopted that connection yet...so now it is likely that the first "version" of that connection that starts being helpful will win out, and the other will die away. The third case is where a connection has already been generally adopted by a species. In that case, there can be no mutation creating the same connection in that species since it is already taken. The main point is, you don't really expect too many truly similar structures with different markings to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.
Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?
My platform here is Ruby - a webapp using Rails 3.2 in particular.

I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.

A quick illustration -

The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -

    MATCHES = {}
    for each (PERSON in (people except USER)) do
      for each (RATING that PERSON has made) do
        if (USER has rated the item that RATING refers to) do
          MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
        end
      end
    end
    lowest values in MATCHES are the best matches for USER

The problem here being that as the number of items, ratings, and people increase, this code will take a very significant time to run, and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.

I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.

If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on). You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions. That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine... Hope that helps!
Technically, your task is matching long strings made out of characters of a 6-letter alphabet (the ratings 0 to 5). This kind of problem is researched extensively in the area of computational biology (typically with 4-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With ratings, most of the time not every user rates every object. Naively comparing each object to every other is O(n*n*d), where d is the number of dimensions. However, a key trick of all the Hadoop solutions is to transpose the matrix and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will theoretically be 10000 times faster. Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem is still quadratic. The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.
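As a rough illustration of the transposition trick, here is a small Python sketch (the question is about Ruby, but the idea carries over directly). It assumes ratings are held in a nested dict and that "difference" means absolute difference; the function name `match_scores` and the sample data are made up for the example.

```python
from collections import defaultdict


def match_scores(ratings, user):
    """Compare `user` only with people who rated at least one of the same items.

    ratings: {person: {item: rating}}  (assumed layout, not from the question)
    Returns {other_person: (total_absolute_difference, shared_item_count)}.
    """
    # Transpose the sparse matrix: item -> {person: rating}.
    by_item = defaultdict(dict)
    for person, items in ratings.items():
        for item, rating in items.items():
            by_item[item][person] = rating

    # Walk only the columns (items) that the user actually rated.
    scores = defaultdict(lambda: [0, 0])
    for item, user_rating in ratings[user].items():
        for other, other_rating in by_item[item].items():
            if other == user:
                continue
            entry = scores[other]
            entry[0] += abs(other_rating - user_rating)
            entry[1] += 1
    return {person: tuple(entry) for person, entry in scores.items()}


sample = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 4, "c": 1},
    "cat": {"a": 5, "b": 3, "c": 2},
    "dan": {"d": 0},
}
result = match_scores(sample, "ann")
```

People with no items in common with the user (here `dan`) never appear in the result, which is where the saving comes from. Keeping the shared-item count next to the total lets you tell a low score from few shared items apart from a low score over many.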
Best way to detect and store path combinations for analysing purpose later
I am searching for ideas/examples on how to store path patterns from users, with the goal of analysing their behaviour and optimizing the "most used paths" when we can detect them somehow. E.g. which action do they perform after which, so that we can later check whether certain actions are done over and over again, and then develop a shortcut or assemble some of the actions into a combined multi-action.

My first guess would be some sort of "simple log", perhaps stored in SQL, where we keep each action as an index and then just record everything. The problem is that the paths/actions might change dynamically, even while logging, so we need to be able to handle that too when looking for patterns later.

Would you log everything "bigtime" first and then post-process every detail after some time, or do you have good experience with other tactics? My worry is that this is going to take up a lot of space when logging 1000 users each day for a month or more.

Hope this makes sense. I am curious to see if anyone can provide sample code, pseudocode or links to something useful. Our tools will be C#, an SQL database, XML and .NET 3.5 - clients could also get .NET 4.0 if needed.

Pattern examples as we expect them:

...
User #1001: A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A
User #1002: B-A-A-B-C-E-F
User #1003: F-B-B-A-E-C-A-A-A
User #1002: C-E-F
... etc.

There is no real way to know what they will do next, how many actions they will use, or how often they will do it.

A secondary goal, if possible: if we later add a new "action" called G (just a sample to illustrate, there will be hundreds of actions), how could we detect the new behaviour's influence on the previous patterns?

To explain it better, my thought here would be some way to detect "patterns within patterns", sort of like how compression works, so that repetitive patterns are spotted. We don't know how long these patterns might be, nor how often they might occur.
How do we break this down into "small bits and pieces"? What do you think is the best approach?
I am not sure what you mean by path, but if you gave every action in a path a unique symbol, you could reduce the problem to longest common substring or subsequence. Or keep a map from each path to the number of times that path occurred. Every time a certain path happens, increment the count for that path. Then sort to find the most common.
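To make the longest-common-substring idea concrete, here is a minimal Python sketch (the question targets C#, but the dynamic-programming approach is the same in any language). The function name `longest_common_run` is made up for the example, and the two paths are the ones listed for users #1001 and #1002 in the question.

```python
def longest_common_run(a, b):
    """Return the longest run of consecutive actions that appears in both paths.

    Classic longest-common-substring dynamic programming, keeping only
    the previous row of the table to save memory.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous positions.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


user_1001 = "A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A".split("-")
user_1002 = "B-A-A-B-C-E-F".split("-")
shared = longest_common_run(user_1001, user_1002)
# shared == ['A', 'A', 'B', 'C', 'E', 'F']
```

This compares one pair of paths at a time; for many users the counting approach (a map from path to count) is probably the cheaper place to start.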
Pseudo idea/implementation so far:

- Log every user's actions into a list/series of actions, bulk style (text files/SQL, whatever; just store the whole thing for post-processing).
- Start counting every "1 action", "2 actions", "3 actions" run, up to a certain length (let's say 30 levels).
- Sort them all, giving values of importance to some of the actions (perhaps those producing end results).

A useful result, perhaps? If we count all [A], [A-A], [A-B], [A-C], [A-A-A], [A-A-B] etc., it is going to make a LONG and fine-grained list of which actions are frequently used in a row, and that's a step in the right direction, because if some of these counts get too high, we might need a shorter path.

The problem is then: what is too few actions to be worth optimizing, and what is the longest action list we need to search for? My guess is that we need to do this counting first, then examine the numbers. The catch is that this would be part of an analysis tool we are developing, and we don't have data until it is implemented, so we don't know what to look for before it's actually done.

Hmm... wondering if there really IS an answer to this one.
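The counting step described here could look something like the sketch below. It is in Python rather than C# purely for brevity; the function name `count_runs` and the `max_len` parameter are made up, and the sample paths are the ones from the question.

```python
from collections import Counter


def count_runs(paths, max_len=30):
    """Count every run of 1..max_len consecutive actions across all paths."""
    counts = Counter()
    for path in paths:
        for n in range(1, max_len + 1):
            # Slide a window of length n over the path.
            for start in range(len(path) - n + 1):
                counts[tuple(path[start:start + n])] += 1
    return counts


logged = [
    "A-B-A-A-A-B-C-E-F-G-H-A-A-A-C-B-A",
    "B-A-A-B-C-E-F",
    "F-B-B-A-E-C-A-A-A",
    "C-E-F",
]
paths = [line.split("-") for line in logged]
counts = count_runs(paths, max_len=3)

# Most frequent runs of at least two actions, most common first.
frequent = [(run, n) for run, n in counts.most_common() if len(run) >= 2]
```

On this sample both `('C', 'E', 'F')` and `('A', 'A', 'A')` occur three times. Weighting by importance, as in the last step of the list, could be done by multiplying each count by a per-action weight before sorting.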