Adding nodes/connections to the structure in NEAT algorithm - machine-learning

Some sources (linked below) say that mutation can add a new node or subtract a node or add a connection between two existing nodes. But if we do that doesn't it change the number of genes throughout the population.
Let's say there are 2 creatures and one of them mutates and adds a connection between 2 existing nodes this creature now has more genes than the other one right?
So how is it possible for us to make crossover between the two?
Or does adding the connection between the nodes mean there was already a connection between the two but the weight of that connection was 0 so it didn't have any effect and mutation changed it enabling it to have some sort of effect?
This has been in the back of my mind for quite a bit of time now.
Any help is appreciated.
Thanks
https://towardsdatascience.com/neat-an-awesome-approach-to-neuroevolution-3eca5cc7930f

In brief: of course creating or destroying nodes and connections changes the size of the genome and this makes different genomes incompatible for comparison (and thus for sexual reproduction and crossover).
The key idea behind NEAT is to carefully label each new structure with its own ID. Typically, only nodes have unique IDs and connections are defined by the nodes they link together. In this way, crossover knows which parts of the genome are compatible and which are not.
Imagine these two genomes:
Genome A:
Nodes: 1, 2, 3, 5, 7
Connections: 1-2, 1-3, 2-5, 5-7, 3-7
Genome B:
Nodes: 1, 2, 3, 7
Connections: 1-2, 1-7, 2-3, 3-7
We know that nodes 1, 2, 3 and 7 are common, so crossover can happen directly (although very often nodes have no parameters so this step would do nothing really). Node 5 is unique to Genome A and would not take part in crossover.
Connections 1-2 and 3-7 are common, the rest are not, so only 1-2 and 3-7 would take place directly in crossover (typically taking the weight from one parent at random for each connection).
Normally there is a probability for the child genome to also inherit (after standard crossover) all unique elements from one of the two parents (or from the fittest parent or however you define this step).

Related

Why so many db hit in neo4j?

There is total 1 Category node and 2 Template node in my case. I put an * in [*] to support more further scenarios. But why there are so many db hit in this cypher for current data?
It's probably the * in the relationship part of your query that's doing it.
While you've got only one Category node and two Template nodes, you've asked Neo4j to hop through any number of relationships to get from one to the other and not given it any help to narrow down the search besides specifying the starting node.
For example, if your Category was connected to 100,000 other nodes (of any label, not just Template) you've forced Neo4j to jump through every single one of them looking to see if there's a path to a Template node - and if those nodes have their own connections then they all need to be explored too, because the depth of the traversal isn't constrained.
If you know how Category and Template nodes can be connected in ways you're interested in (for example, if there's only every some specific set of relationships you want to traverse) then you'll radically improve the performance of the query. Equally, reducing the maximum length of the path will help.

Cypher query to find nodes with shared properties, and formulate as input and output

I have searched to find a way to query my data in order to build a table to visualize it properly. The data is an import of a SysML model. The general structure of the data is this:
(node1:Type1)<-[:Reference]-(node2:Type1)-[:Property]->(node3:Type1)<-[:Property]-(node4:Type1)-[:Reference]->(node5:Type1)
Nodes 2 and 4 represent processes, with data being exchanged between them. The data exchanged is represented by node 3. Nodes 1 and 5, represent the tools in which these processes are performed. My ideal situation would be to have a table with columns
node1.name | node2.name | node3.name | node4.name | node5.name
allowing me to view the inputs/outputs of my processes and which tools perform those processes, effectively evaluating the interfaces. However, the queries I'm using are causing duplicates in the rows of the tables, as it's reading 'front to back' as well as 'back to front'. Is there a way to separate each 'step' of the path, and make a column for each step (perhaps by relationship direction)? There are also cases where there is more than one "tool" (node 1 or 5) in which these processes are performed, so a row for each (many to one, one to one, one to many, many to many) would be ideal. Lastly, there are cases where a process (nodes 2 and 4) may have more than just a single relationship. I would like to be able to show all of the process interfaces.
Any help is much appreciated. Thank you in advance for your time.
To fix the mirrored results issue, it helps to use an inequality predicate on the ids of the nodes to ensure a single result: WHERE id(node1) < id(node5)
As for showing all possible paths, that's what Cypher does, so you should see all possible paths that fit the pattern.

(neo4j) Best practice for the number of properties in relationships and nodes

I've just started using neo4j and, having done a few experiments, am ready to start organizing the database in itself. Therefore, I've started by designing a basic diagram (on paper) and came across the following doubt:
Most examples in the material I'm using (cypher and neo4j tutorials) present only a few properties per relationship/node. But I have to wonder what the cost of having a heavy string of properties is.
Q: Is it more efficient to favor a wide variety of relationship types (GOODFRIENDS_WITH, FRIENDS_WITH, ACQUAINTANCE, RIVAL, ENEMIES, etc) or fewer types with varying properties (SEES_AS type:good friend, friend, acquaintance, rival, enemy, etc)?
The same holds for nodes. The first draft of my diagram has a staggering amount of properties (title, first name, second name, first surname, second surname, suffix, nickname, and then there's physical characteristics, personality, age, jobs...) and I'm thinking it may lower the performance of the db. Of course some nodes won't need all of the properties, but the basic properties will still be quite a few.
Q: What is the actual, and the advisable, limit for the number of properties, in both nodes and relationships?
FYI, I am going to remake my draft in such a way as to diminish the properties by using nodes instead (create a node :family names, another for :job and so on), but I've only just started thinking it over as I'll need to carefully analyse which 'would-be properties' make sense to remain, even because the change will amplify the number of relationship types I'll be dealing with.
Background information:
1) I'm using neo4j to map out all relationships between the people living in a fictional small town. The queries I'll perform will mostly be as follow:
a. find all possible paths between 2 (or more) characters
b. find all locations which 2 (or more) characters frequent
c. find all characters which have certain types of relationship (friends, cousins, neighbors, etc) to character X
d. find all characters with the same age (or similar age) who studied in the same school
e. find all characters with the same age / first name / surname / hair color / height / hobby / job / temper (easy to anger) / ...
and variations of the above.
2) I'm not a programmer, but having self-learnt HTML and advanced excel, I feel confident I'll learn the intuitive Cypher quickly enough.
First off, for small data "sandbox" use, this is a moot point. Even with the most inefficient data layout, as long as you avoid Cartesian Products and its like, the only thing you will notice is how intuitive your data is to yourself. So if this is a "toy" scale project, just focus on what makes the most organizational sense to you. If you change your mind later, reformatting via cypher won't be too hard.
Now assuming this is a business project that needs to scale to some degree, remember that non-indexed properties are basically invisible to the Cypher planner. The more meaningful and diverse your relationships, the better the Cypher planner is going to be at finding your data quickly. Favor relationships for connections you want to be able to explore, and favor properties for data you just want to see. Index any properties or use labels that will be key for finding a particular (or set of) node(s) in your queries.

Neo4j: Node versus Node property

I am developing a Neo4j database that will contain genomic and clinical data for cancer patients. A common design issue in developing graph databases is whether a data item should be represented by a Node or by a property within a Node. In my case, patients will have hundreds of clinical and demographic measurements (e.g. sex, medications, tumor size). Some of these data will be constant (e.g. sex) while others will be subject to variation with each patient visit. The responses I've seen to previous node vs property questions have recommended using the anticipated queries against the data to make the decision. I think I can identify some properties that will be common search criteria and should be nodes (e.g. smoking history, sex, cancer type) but that still leaves me with hundreds of other properties. Is there a practical limit in Neo4j for the number of properties that a Node should contain? Also, a hybrid approach, where some data are properties and others are Nodes would seem to make both loading data from source files and subsequent queries more complicated.
The main idea behind "look at your queries to decide", is that how data relates to each other effects whether a node or property is better. Really, the main point of a graph database is to make walking relationships easier to query. So the real question you should ask yourself is "Does (a)-->()<--(b) have any significant meaning?" In other words, do I need to be able to find other nodes that share this property?
Here are some quick rule-of-thumb guidlines
Node
Has it's own sub-values or relations
Multiple nodes sharing this value has meaning, and you need to be able to walk along this shared value between them
Changes infrequently
If more than 1 value can apply at the same time
Properties
Has a large range of possible values
Changes over time
If more than 1 value can apply, values are usually updated as a set (rather than individually)
Label
Has a small, finite range of mutually exclusive values
Almost never changes
So lets go through the thought process of a few of your properties...
Sex
Will either be "Male" or "Female", and everyone will be connected to one of the two, so they will both end up being super nodes (overloaded). Also, if you do end up needing to find two people that share the same sex, almost any other method would be more efficient than finding them through the super node. However these are mutually exclusive, immutable, genetic traits so making this a label is also perfectly acceptable (and sometimes preferred).
Address
This is a variable value with sub-properties, won't be shared by very many nodes, and the walk from one person to another at the same address (or, by extension, live in an area) has valuable meaning. So this should almost definitely be a node.
Height and Weight
These change constantly with time, have no sub values, and two people sharing this value has little to no meaning. The range of values is far too wide, so Labels make no since either, so this should be a property.
Blood type
While has more options then Sex does, all the same logic applies, except that the relation does matter now (because people must share a blood type to donate). The problem is that this value will be so overloaded, that you will need to filter on area first, and than just verifying blood type. Could be a property or label. The case for node is if you include an "Can_Donate_To" or "Can_Accept" relation between the blood types. While you likely won't walk these relations to find a potential donor (because they are too overloaded, and you will have to filter by area first), you can use them to verify someone can be a donor.
Social Security Number
Is highly sensitive, and a lawsuit waiting to happen. Keep out of the DB if at all possible. If you absolutely have to; this property is immutable, but will be unique to every person, so because of the lack of reuse, is a bad label and will be pointless as a node. Definitely a property. (But should be salted+hashed if only for verification purposes only)
Mother's maiden name
The possible values are endless, and two nodes sharing this value has no real meaning. Definitely a property.
First born child
Since the child is already their own node, with it's own sub properties, just create a relation between the two. While the value of this info is questionable, any time you need to reference another node, always use a relationship for it. Definitely a node.

How does the Minimum spanning tree in neo4j work

I am playing around with some graph theory algorithms in neo4j. I am trying to find the minimum spanning tree (mst) within my network. I synthetically created a network of 10 000 people. Each person has 12 relationship types each one linking him back to the other 9999 and each relationship with its own weight assigned.
The problem I have however is the fact that according to the definition the results must be a tree spanning over the ENTIRE network. The neo4j function however only returns a very small sub-graph (only about 12 nodes) of the entire network.
The code I am using looks like this:
MATCH (a:Name {Name:"Dillon Snow"})
CALL algo.mst(a,"Weight",{stats:true})
YIELD loadMillis, computeMillis, writeMillis, weightSum, weightMin, weightMax, relationshipCount
RETURN loadMillis, computeMillis, writeMillis, weightSum, weightMin, weightMax, relationshipCount
What can I change to get the function to return the mst spreading through the entire network
algo.mst.* has not been adapted to the matured Neo4j-Graph-Algorithms-CoreAPI in its current release (3.2.5.2/3.3.0.0 # Dec 2017) which might lead to unexpected results. But there is a pull request in the pipe, you can expect some changes in the next release.
Anyway.. The procedure should add a new relationship-type (default mst) to your nodes. In a connected graph each node should be connected as well while a disconnected graph leads to connections only between the nodes of this particular connected component (from your startNode).
If i understand you right you have multiple relationship types and more then one of them between a pair of nodes? E.g. Node A is connected to Node B with several relations, each of them with a different type and property value. This is a problem. In general the Graph-Algorithms-API does not support multible releationships. Each pair of nodes can only have one connection per direction. Although you can import multible types the core-api itself has no idea of the underlying type. If multible relationships between a pair of nodes get imported usualy the last one wins. This has been mentioned in the documentation ;)
To overcome this limitation you could replace your relationship types with some kind of artificial nodes. When traversing over the result tree the occurence of one of those nodes would indicate the original relationship.

Resources