Finding consecutive wins in the Cypher query language - neo4j

From the figure we can see that Arsenal have won three matches consecutively, but I could not write the query.

Here is a query that should return the maximum number of consecutive wins for Arsenal:
MATCH (a:Club {name:'Arsenal FC'})-[r:played_with]-(:Club)
WITH ((CASE a.name WHEN r.home THEN 1 ELSE -1 END) * (TOINT(r.score[0]) - TOINT(r.score[1]))) > 0 AS win, r
ORDER BY TOINT(r.time)
RETURN REDUCE(s = {max: 0, curr: 0}, w IN COLLECT(win) |
  CASE WHEN w
    THEN {
      max: CASE WHEN s.max < s.curr + 1 THEN s.curr + 1 ELSE s.max END,
      curr: s.curr + 1}
    ELSE {max: s.max, curr: 0}
  END
).max AS result;
The WITH clause sets the win variable to true iff Arsenal won a particular game. Notice that the ORDER BY clause converts the time property to an integer, because ordering numeric strings does not work properly when the strings can have different lengths (I am being a bit picky here, admittedly). The REDUCE function then calculates the maximum number of consecutive wins.
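In case the accumulator logic is easier to follow outside Cypher, here is a minimal Python sketch of the same fold, assuming the ordered win flags have already been collected into a list:

def max_consecutive_wins(wins):
    # wins is an ordered list of booleans: True = Arsenal won that game
    best = 0      # corresponds to s.max in the REDUCE accumulator
    current = 0   # corresponds to s.curr
    for w in wins:
        current = current + 1 if w else 0
        if current > best:
            best = current
    return best

print(max_consecutive_wins([False, True, True, True, False, True]))  # 3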
======
Finally, here are some suggestions for improving your data model:
It looks like your played_with relationship always points from the home team to the away team. If so, you can drop the redundant home and away properties, and you can also rename the relationship type to HOSTED to make the direction of the relationship clearer.
The scores and time should be stored as integers, not strings. That would make your queries more efficient, and easier to write and understand.
You could also consider splitting the scores property into two scalar properties, say homeScore and awayScore, which would make your code clearer. There seems to be no advantage to storing the scores in an array.
If you made all the above suggested changes, then you would just need to change the beginning of the above query to this:
MATCH (a:Club {name:'Arsenal FC'})-[r:HOSTED]-(:Club)
WITH ((CASE a WHEN STARTNODE(r) THEN 1 ELSE -1 END) * (r.homeScore - r.awayScore)) > 0 AS win, r
ORDER BY r.time
...

Related

What is a better way than a for loop to implement an algorithm that involves sets?

I'm trying to create an algorithm along these lines:
-Create 8 participants
-Each participant has a set of interests
-Pair them with another participant with the least amount of interests
So what I've done so far is create 2 classes, Participant and Interest, where Interest is Hashable so that I can create a Set with it. I manually created 8 participants with different names and interests.
I've made an array of selected participants and used a basic for-in loop to somewhat pair them together using the intersection() function of sets. Somehow my index always goes out of range, and I'm positive there's a better way of doing this, but it's just so messy and I don't know where to start.
for i in 0..<participantsSelected.count {
    if participantsSelected[i].interest.intersection(participantsSelected[i+1].interest) == [] {
        participantsSelected.remove(at: i)
        participantsSelected.remove(at: i+1)
        print (participantsSelected.count)
    }
}
My other issue is that using a for loop for this specific algorithm seems a bit off too, because what if they all share at least one interest? Then the intersection will never equal [] / nil.
Basically, the output I'm trying to get is to remove participants from the selected array once they're paired up, and for them to be paired up each participant should be matched with the other participant they share the fewest interests with.
EDIT: Updated code, here's my attempt to improve my algorithm logic
for participant in 0..<participantsSelected {
    var maxInterestIndex = 10
    var currentIndex = 1
    for _ in 0..<participantsSelected {
        print (participant)
        print (currentIndex)
        let score = participantsAvailable[participant].interest.intersection(participantsAvailable[currentIndex].interest)
        print ("testing score, \(score.count)")
        if score.count < maxInterestIndex {
            maxInterestIndex = score.count
            print ("test, \(maxInterestIndex)")
        } else {
            pairsCreated.append(participantsAvailable[participant])
            pairsCreated.append(participantsAvailable[currentIndex])
            break
            // participantsAvailable.remove(at: participant)
            // participantsAvailable.remove(at: pairing)
        }
        currentIndex = currentIndex + 1
    }
}
for i in 0..<pairsCreated.count {
    print (pairsCreated[i].name)
}
Here is a solution for the case where what you are looking for is to pair your participants (all of them) optimally with respect to your criteria:
The way to go is to find a perfect matching in a graph of participants.
Create a graph with n vertices, n being the number of participants. We can denote by u_p the vertex corresponding to participant p.
Then, create weighted edges as follows:
For each pair of participants p, q (p != q), create the edge (u_p, u_q), and weight it with the number of interests these two participants have in common.
Then, run a minimum weight perfect matching algorithm on your graph, and the job is done. You will obtain an optimal result (meaning the best possible, or one among the best possible matchings) in polynomial time.
Minimum weight perfect matching algorithm: the problem is strictly equivalent to the maximum weight perfect matching problem. Find the edge of maximum weight (let's denote its weight by C). Then replace the weight w of each edge by C-w, and run a maximum weight matching algorithm on the resulting graph.
I would suggest that you use Edmonds' blossom algorithm to find a perfect matching in your graph. First because it is efficient and well documented, second because I believe you can find implementations in most existing languages, but also because it truly is a very, very beautiful algorithm (it ain't called blossom for nothing).
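If it helps to see the whole recipe end to end, here is a rough Python sketch, assuming the networkx library is available; the participant data is made up for illustration, and the weight flip is the C-w trick described above:

import networkx as nx
from itertools import combinations

# interests per participant (made-up data, not from the question)
interests = {
    "Alice": {"chess", "hiking"},
    "Bob": {"chess", "movies"},
    "Carol": {"hiking", "cooking"},
    "Dave": {"movies", "cooking"},
}

G = nx.Graph()
for p, q in combinations(interests, 2):
    # weight = number of interests the two participants have in common
    G.add_edge(p, q, weight=len(interests[p] & interests[q]))

# Turn minimum-weight perfect matching into maximum-weight matching:
# replace each weight w by C - w, where C is the largest edge weight.
C = max(w for _, _, w in G.edges(data="weight"))
for p, q, w in list(G.edges(data="weight")):
    G[p][q]["weight"] = C - w

# maxcardinality=True forces a perfect matching when one exists
pairs = nx.max_weight_matching(G, maxcardinality=True)
print(pairs)  # the pairs with the fewest shared interests, e.g. Alice-Dave and Bob-Carol here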
Another possibility, if you are sure that your number of participants will be small (you mention 8), you can also go for a brute-force algorithm, meaning to test all possible ways to pair participants.
Then the algorithm would look like:
find_best_matching(participants, interests, pairs):
    if all participants have been paired:
        return sum(cost(p1, p2) for p1, p2 in pairs), pairs // cost(p1, p2) = number of interests in common
    else:
        p = some participant which is not paired yet
        best_sum = + infinite
        best_pairs = empty_set
        for all participants q which have not been paired, q != p:
            add (p, q) to pairs
            current_sum, current_pairs = find_best_matching(participants, interests, pairs)
            if current_sum < best_sum:
                best_sum = current_sum
                best_pairs = current_pairs
            remove (p, q) from pairs
        return best_sum, best_pairs
Call it initially with find_best_matching(participants, interests, empty_set_of_pairs) to get the best way to match your participants.
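For completeness, here is a small Python sketch of that brute-force recursion; the interests dict and helper names are illustrative, and the not-yet-paired participants are passed around as a set rather than recomputed:

import math

def cost(interests, p, q):
    # cost(p, q) = number of interests the two participants have in common
    return len(interests[p] & interests[q])

def find_best_matching(interests, unpaired, pairs):
    if not unpaired:                       # all participants have been paired
        return sum(cost(interests, p, q) for p, q in pairs), list(pairs)
    p = next(iter(unpaired))               # some participant not paired yet
    best_sum, best_pairs = math.inf, []
    for q in unpaired - {p}:
        current_sum, current_pairs = find_best_matching(
            interests, unpaired - {p, q}, pairs + [(p, q)])
        if current_sum < best_sum:
            best_sum, best_pairs = current_sum, current_pairs
    return best_sum, best_pairs

interests = {
    "Alice": {"chess", "hiking"},
    "Bob": {"chess", "movies"},
    "Carol": {"hiking", "cooking"},
    "Dave": {"movies", "cooking"},
}
print(find_best_matching(interests, set(interests), []))
# total shared interests 0; pairs Alice-Dave and Bob-Carol (pair order may vary)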

Neo4j List Consecutive

I am currently working with a football match data set and trying to get Cypher to return the teams with the most consecutive wins.
At the moment I have a collect statement which creates a list, e.g. [0,1,1,0,1,1,1], where '0' represents a loss and '1' represents a win. I am trying to return the team with the most consecutive wins.
Here is what my code looks like at the moment:
MATCH(t1:TEAM)-[p:PLAYS]->(t2:TEAM)
WITH [t1,t2] AS teams, p AS matches
ORDER BY matches.time ASC
UNWIND teams AS team
WITH team.name AS teamName, collect(case when ((team = startnode(matches)) AND (matches.score1 > matches.score2)) OR ((team = endnode(matches)) AND (matches.score2 > matches.score1)) then +1 else 0 end) AS consecutive_wins
RETURN teamName, consecutive_wins
This returns a list for each team showing their win/loss record in the form explained above (e.g. [0,1,0,1,1,0]).
Any guidance or help in regards to calculating consecutive wins would be much appreciated.
Thanks
I answered a similar question here.
The key is using apoc.coll.split() from APOC Procedures, splitting on 0, which will yield a row per winning streak (list of consecutive 1's) as value. The size of each of the lists is the number of consecutive wins for that streak, so just get the max size:
// your query above
CALL apoc.coll.split(consecutive_wins, 0) YIELD value
WITH teamName, max(size(value)) as consecutiveWins
ORDER BY consecutiveWins DESC
LIMIT 1
RETURN teamName, consecutiveWins
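The splitting idea itself is easy to see outside Cypher; here is a plain Python sketch of the same split-on-0, take-the-longest-run logic (illustrative only):

def longest_win_streak(results):
    # results is a chronological list of 0/1 flags, e.g. [0, 1, 1, 0, 1, 1, 1]
    streaks, current = [], []   # streaks plays the role of the rows from apoc.coll.split(..., 0)
    for r in results:
        if r == 0:
            if current:
                streaks.append(current)
            current = []
        else:
            current.append(r)
    if current:
        streaks.append(current)
    return max((len(s) for s in streaks), default=0)

print(longest_win_streak([0, 1, 1, 0, 1, 1, 1]))  # 3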
Your use case does not actually require the detection of the most consecutive 1s (and it also does not need to use UNWIND).
The following query uses REDUCE to directly calculate the maximum number of consecutive wins for each team (consW keeps track of the current number of consecutive wins, and maxW is the maximum number of consecutive wins found thus far):
MATCH (team:TEAM)-[p:PLAYS]-(:TEAM)
WITH team, p
ORDER BY p.time ASC
WITH team,
     REDUCE(s = {consW: 0, maxW: 0}, m IN COLLECT(p) |
       CASE WHEN (team = startnode(m) AND (m.score1 > m.score2)) OR (team = endnode(m) AND (m.score2 > m.score1))
         THEN {consW: s.consW+1, maxW: CASE WHEN s.consW+1 > s.maxW THEN s.consW+1 ELSE s.maxW END}
         ELSE s
       END).maxW AS most_consecutive_wins
RETURN team.name AS teamName, most_consecutive_wins;

SET in combination with CASE statement in cypher

I am trying to set two different relationship properties to a count, with a CASE construct depending on the value of another relationship property. There is a console at http://console.neo4j.org/?id=rt1ld5
The cnt column contains the number of times r.value occurs. The first two rows of the initial query in the console indicate that the term "Car" is linked to one document that is considered relevant, and to two documents that are considered not relevant.
I want to SET a property on the [:INTEREST] relation between (user) and (term) with two properties, indicating how many times an interest is linked to a document that is considered relevant or not. So for (John)-[r:INTEREST]->(Car) I want r.poscnt=1 and r.negcnt=2
I'm struggling with the CASE construct. I tried various ways; this was the closest I got.
MATCH (u:user)-[int:INTEREST]->(t:term)<-[:ISABOUT]-(d:doc)<- [r:RELEVANCE]-(u)
WITH int, t.name, r.value, count(*) AS cnt
CASE
WHEN r.value=1 THEN SET int.poscnt=cnt
WHEN r.value=-1 THEN SET int.negcnt=cnt
END
But it's returning an error
Error: Invalid input 'A': expected 'r/R' (line 3, column 2)
"CASE"
^
This did it! Also see console at http://console.neo4j.org/?id=rq2i7j
MATCH (u:user)-[int:INTEREST]->(t:term)<-[:ISABOUT]-(d:doc)<-[r:RELEVANCE]-(u)
WITH int, t,
SUM(CASE WHEN r.value= 1 THEN 1 ELSE 0 END ) AS poscnt,
SUM(CASE WHEN r.value= -1 THEN 1 ELSE 0 END ) AS negcnt
SET int.pos=poscnt,int.neg=negcnt
RETURN t.name,int.pos,int.neg
Is it important for you to keep positive and negative count separate? It seems you could have a score property summing positive and negative values.
MATCH (u:user)-[int:INTEREST]->(t:term)<-[:ISABOUT]-()<-[r:RELEVANCE]-(u)
WITH int, t, SUM(r.value) AS score
SET int.score = score
RETURN t.name, int.score
You already seem to have found a working solution but I'll add a note about CASE as I understand it. While CASE provides branching, I think it's correct to say that it is an expression and not a statement. It resembles a ternary operator more than a conditional statement.
As the expression
a > b ? x : y;
is resolved to a value, either x or y, that can be used in a statement, so also
CASE WHEN a > b THEN x ELSE y END
resolves to a value. You can then assign this value
result = CASE WHEN a > b THEN x ELSE y END
Your original query used the CASE expression like a conditional statement
CASE WHEN a > b THEN result = x ELSE result = y END
which resembles if-else
if a > b { result = x; } else { result = y; }
Someone may want to correct the terminology; the point is that in your working query you correctly let CASE resolve to a value that SUM can consume, rather than putting a conditional assignment inside CASE.
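To make the analogy concrete, here is the same expression-inside-an-aggregate pattern sketched in Python (purely illustrative, not Cypher):

values = [1, -1, -1, 1, 1]  # hypothetical r.value samples

# The conditional expression resolves to a value, and the aggregate consumes it,
# exactly like SUM(CASE WHEN r.value = 1 THEN 1 ELSE 0 END) in the working query.
poscnt = sum(1 if v == 1 else 0 for v in values)
negcnt = sum(1 if v == -1 else 0 for v in values)
print(poscnt, negcnt)  # 3 2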

How to efficiently select a random row using a biased prob. distribution with Rails 2.3 and PostgreSQL 8?

I have written a few simple Rails applications, accessing the database via the ActiveRecord abstraction, so I am afraid that I don't know much about the inner workings of the PostgreSQL engine.
However, I am writing a Rails app that needs to support 100000+ rows with dynamically updated content, and I am wondering whether I am using the random functions efficiently:
Database migration schema setting:
t.float :attribute1
t.integer :ticks
add_index :mytable, :attribute1
add_index :mytable, :ticks
Basically, I want to have the following random selection distribution:
a) rows with the top 10% of attribute1 values = 30% chance of being selected
b) rows in the middle 60% (by attribute1) = 50% chance of being selected
c) rows in the lowest 30% (by attribute1) with fewer than 100 ticks = 15% chance of being selected,
d) and rows in the lowest 30% (by attribute1) with X or more ticks (use X = 100 in this question) = 5% chance of being selected.
At the moment I have the following code:
#rand = rand()
if #rand>0.7
#randRowNum = Integer(rand(0.1) * Mytable.count )
#row = Mytable.find(:first, :offset =>#randRowNum , :order => '"attribute1" DESC')
else
if #rand>0.2
#randRowNum = Integer((rand(0.6)+0.1) * Mytable.count)
#row= Mytable.find(:first, :offset =>#randRowNum , :order => '"attribute1" DESC')
else
#row= Mytable.find(:first, :offset =>Integer(0.7 * Mytable.count), :order => '"attribute1" DESC')
if !(#row.nil?)
if (#rand >0.05)
#row= Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= '"#{#row.attribute1}", '"ticks" < 100' ] )
else
#row= Mytable.find(:first, :order => 'RANDOM()', :conditions => ['"attribute1" <= '"#{#row.attribute1}", '"ticks" >= 100' ] )
end
end
end
end
1) One thing I would like to do is avoid the use of :order => 'RANDOM()', because according to my research it seems that each time it is called, the SQL engine first scans through all the rows and assigns each of them a random value. Hence the use of @randRowNum = Integer(rand(0.1) * Mytable.count) and an offset of @randRowNum for a) and b). Am I actually improving the efficiency? Is there any better way?
2) Should I be doing the same as 1) for c) and d)? And by using :conditions => ['"attribute1" <= ? AND "ticks" >= 100', @row.attribute1], am I forcing the SQL engine to scan through all the rows anyway? Is there anything, apart from indexing, that I can do to improve the efficiency of this call (with as little space/storage overhead as possible)?
3) There is a chance that the total number of entries in Mytable may have changed between the Mytable.count and Mytable.find calls. I could put the two calls within a transaction, but it seems excessive to lock the entire table just for a read operation (at the moment, I have extra code to fall back to a simple random row selection if I get a nil @row from the above code). Is it possible to move that .count call into a single atomic SQL query in Rails? Or would it have the same efficiency as locking via a transaction in Rails?
4) I have also been reading up on stored procedures in PostgreSQL... but for this particular case, is there any efficiency gain worth moving the code from the ActiveRecord abstraction to a stored procedure?
Many thanks!
P.S. development/deployment environment:
Ruby 1.8.7
Rails 2.3.14
PostgreSQL 8 (on Heroku)
Your question seems a bit vague, so correct me if my interpretation is wrong.
If you didn't split (c) and (d), I would just convert the uniform random variable over [0,1) to a biased random variable over [0,1) and use that to select your row offset.
Unfortunately, (c) and (d) is split based on the value of "ticks" — a different column. That's the hard part, and also makes the query much less efficient.
After you fetch the value of attribute1 at 70% from the bottom, also fetch the number of (c) rows; something like SELECT COUNT(*) FROM foo WHERE attribute1 <= partition_30 AND ticks < 100. Then use that to find the offset into either the ticks < 100 or the ticks >= 100 case. (You probably want an index on (attribute1, ticks) or something; which order is best depends on your data.)
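Here is a rough Python sketch of that bookkeeping; count_c_rows and fetch_at_offset are hypothetical helpers standing in for whatever query layer you use:

import random

def pick_biased_row(total_rows, count_c_rows, fetch_at_offset):
    # total_rows: overall row count
    # count_c_rows: rows in the bottom 30% of attribute1 with ticks < 100
    # fetch_at_offset(offset, ticks_lt_100=None): hypothetical helper returning the
    #   row at that offset, ordered by attribute1 ascending, optionally filtered on ticks
    r = random.random()
    if r < 0.30:                              # (a) top 10% of attribute1
        offset = int(total_rows * (0.9 + 0.1 * random.random()))
        return fetch_at_offset(offset)
    elif r < 0.80:                            # (b) middle 60%
        offset = int(total_rows * (0.3 + 0.6 * random.random()))
        return fetch_at_offset(offset)
    elif r < 0.95:                            # (c) bottom 30%, ticks < 100
        offset = int(count_c_rows * random.random())
        return fetch_at_offset(offset, ticks_lt_100=True)
    else:                                     # (d) bottom 30%, ticks >= 100
        count_d_rows = int(total_rows * 0.3) - count_c_rows
        offset = int(count_d_rows * random.random())
        return fetch_at_offset(offset, ticks_lt_100=False)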
If the "ticks" threshold is known in advance and doesn't need to change often, you can cache it in a column (BOOL ticks_above_threshold or whatever) which makes the query much more efficient if you have an index on (ticks_above_threshold, attribute1) (note the reversal). Of course, every time you change the threshold you need to write to every row.
(I think you can also use a "materialized view" to avoid cluttering the main table with an extra column, but I'm not sure what the difference in efficiency is.)
There are obviously some efficiency gains possible by using stored procedures. I wouldn't worry about it too much, unless latency to the server is particularly high.
EDIT:
To answer your additional questions:
Indexing (BOOL ticks_above_threshold, attribute1) should work better than (ticks, attribute1) (I may have the order wrong, though) because it lets you compare the first index column for equality. This is important.
Indexes generally use some sort of balanced tree to do a lookup on what is effectively a sorted list. For example, take A4 B2 C3 D1 (ordered by letter,number) and look up "number of rows with letter greater than B and number greater than 2". The best you can do is start after the Bs and iterate over the rest of the table. If you order by number,letter instead, you get 1D 2B 3C 4A. Again, start after the 2s and scan.
If instead your index is on is_greater_than_2,letter, it looks like false,B false,D true,A true,C. You can ask the index for the position of (true,B), which falls between (true,A) and (true,C), and count the number of entries from there to the end. Counting the number of items between two index positions is fast.
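You can mimic the position arithmetic with a sorted list and binary search; the following Python toy uses the example data above just to show why the equality-first column helps:

from bisect import bisect_right

# index on (is_greater_than_2, letter), i.e. the example data above kept sorted
index = sorted([(False, "B"), (False, "D"), (True, "A"), (True, "C")])

# "number greater than 2 AND letter greater than B":
# seek just past (True, "B") and count entries up to the end of the True block
start = bisect_right(index, (True, "B"))
end = len(index)          # the is_greater_than_2 == True block runs to the end of this list
print(end - start)        # 1, i.e. only C3 qualifies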
Google App Engine's Restrictions on Queries goes one step further:
Inequality Filters Are Allowed on One Property Only
A query may only use inequality filters (<, <=, >=, >, !=) on one property across all of its filters.
...
The query mechanism relies on all results for a query to be adjacent to one another in the index table, to avoid having to scan the entire table for results. A single index table cannot represent multiple inequality filters on multiple properties while maintaining that all results are consecutive in the table.
If none of your other queries benefit from an index on ticks, then yes.
In some cases, it might be faster to index a instead of a,b (or b,a) if the clause including b is almost always true and you're fetching row data (not just getting a COUNT()) (i.e. if 1000 <= a AND a <= 1010 matches a million rows and b > 100 only fails for two rows, then it might end up being faster to do two extra row lookups than to work with the bigger index).
As long as rows aren't being removed between the call to count() and the call to find(), I wouldn't worry about transactions. Definitely get rid of all calls ordering by RANDOM(), as there is no way to optimize them. Make sure that attribute1 has an index on it. I haven't tested it, but something like this should be pretty quick:
total_rows = Mytable.count
r = rand()
if r > 0.7    # 90-100%
  lower = total_rows * 0.9
  upper = total_rows
elsif r > 0.2 # 30-89%
  lower = total_rows * 0.3
  upper = total_rows * 0.89
else          # 0-29%
  lower = 0
  upper = total_rows * 0.29
end
offset = [lower + (upper - lower) * rand(), total_rows - 1].min.to_i
@row = Mytable.find(:first, :offset => offset, :order => 'attribute1 ASC')

How to calculate mean based on number of votes/scores/samples/etc?

For simplicity say we have a sample set of possible scores {0, 1, 2}. Is there a way to calculate a mean based on the number of scores without getting into hairy lookup tables etc for a 95% confidence interval calculation?
dreeves posted a solution to this here: How can I calculate a fair overall game score based on a variable number of matches?
Now say we have 2 scenarios ...
Scenario A) 2 votes of value 2 result in SE=0, giving a mean of 2
Scenario B) 10000 votes of value 2 result in SE=0, giving a mean of 2
I wanted Scenario A to give some value less than 2 because of the low number of votes, but it doesn't seem like this solution handles that (dreeves's equations only help when the values in your set are not all equal to each other). Am I missing something, or is there another algorithm I can use to calculate a better score?
The data available to me is:
n (number of votes)
sum (sum of votes)
{set of votes} (all vote values)
Thanks!
You could just give it a weighted score when ranking results, as opposed to just displaying the average vote so far, by multiplying with some function of the number of votes.
An example in C# (because that's what I happen to know best...) that could easily be translated into your language of choice:
double avgScore = Math.Round(sum / n);
double rank = avgScore * Math.Log(n);
Here I've used the (natural) logarithm of n as the weighting function, but it will only work well if the number of votes is neither too small nor too large. Exactly how large is "optimal" depends on how much you want the number of votes to matter.
If you like the logarithmic approach but the natural logarithm doesn't really suit your vote counts, you could easily use another base. For example, to do it in base 3 instead:
double rank = avgScore * Math.Log(n, 3);
Which function you should use for weighing is probably best decided by the order of magnitude of the number of votes you expect to reach.
You could also use a custom weighting function by defining
double rank = avgScore * w(n);
where w(n) returns the weight value depending on the number of votes. You then define w(n) as you wish, for example like this:
double w(int n) {
    // caution! ugly example code ahead...
    // if you even want this approach, at least use a switch... :P
    if (n > 100) {
        return 10;
    } else if (n > 50) {
        return 8;
    } else if (n > 40) {
        return 6;
    } else if (n > 20) {
        return 3;
    } else if (n > 10) {
        return 2;
    } else {
        return 1;
    }
}
If you want to use the idea in my other referenced answer (thanks!) of using a pessimistic lower bound on the average then I think some additional assumptions/parameters are going to need to be injected.
To make sure I understand: With 10000 votes, every single one of which is "2", you're very sure the true average is 2. With 2 votes, each a "2", you're very unsure -- maybe some 0's and 1's will come in and bring down the average. But how to quantify that, I think is your question.
Here's an idea: Everyone starts with some "baggage": a single phantom vote of "1". The person with 2 true "2" votes will then have an average of (1+2+2)/3 = 1.67, whereas the person with 10000 true "2" votes will have an average of (1+2*10000)/10001 ≈ 1.9999. That alone may satisfy your criteria. Or, to add the pessimistic lower bound idea, the person with 2 votes would have a pessimistic average score of 1.333 and the person with 10k votes would be 1.99948.
(To be absolutely sure you'll never have the problem of zero standard error, use two different phantom votes. Or perhaps use as many phantom votes as there are possible vote values, one vote with each value.)
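A tiny Python sketch of the phantom-vote adjustment (the single phantom "1" is the one suggested above; everything else is illustrative):

def adjusted_mean(votes, phantom_votes=(1,)):
    # prepend one or more phantom votes so small samples get pulled toward the
    # phantom value(s) while large samples are barely affected
    padded = list(phantom_votes) + list(votes)
    return sum(padded) / len(padded)

print(round(adjusted_mean([2, 2]), 2))        # 1.67
print(round(adjusted_mean([2] * 10000), 4))   # 1.9999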
