Calculating self-citation counts in DBLP using neo4j

I have imported the DBLP database, with referenced publications from the Crossref API, into neo4j.
The goal is to calculate a self-citation quotient for each author in the database.
I'd like to calculate this quotient as follows:
find authors who have written publications that reference another publication by the same author
for each of these publications, count the referenced publications written by the same author
divide the number of self-references by the number of all references
set this number as a property scq (self-citation quotient) on the publication
sum all scq values and divide the sum by the total number of publications written by the author
set this value as a property scq on the Author
As an example I have the following sub-graph for the author "Danielle S. Bassett":
From the graph you can see that she has 2 publications that contain self-references.
In words:
Danielle wrote Publication 1, 2, 3, 4
Publication 1 references publication 2
Publication 3 references publication 4
My attempt was to use the following cypher query:
match (a:Author{name:"Danielle S. Bassett"})-[:WROTE]->(p1:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(a)
with count(p2) as ssc_per_publ,
count(p1) as main_publ_count,
collect(p2) as self_citations,
collect(p1) as main_publ,
collect(r) as refs,
a as author
return author, main_publ, ssc_per_publ, self_citations, main_publ_count, refs
The result of this query as a table looks like this:
As you can see from the table, main_publ_count is calculated correctly, since she has written 2 publications that contain self-references. But ssc_per_publ (self-citation count per publication) is wrong, because it counts ALL self-references, whereas I need the count of self-references for EACH PUBLICATION.
Calculating the quotients will not be the problem, but getting the right values from neo4j is.
I hope I've expressed myself clearly enough for you to understand the issue.
Maybe one of you knows a way of getting this right. Thanks!

Your WITH clause is using author as the sole aggregation function "grouping key", since it is the only term in that clause not using an aggregation function. So, all the aggregation functions in that clause are aggregating over just that one term.
To get a "self citation count" per publication (by that author), you'd have to do something like the following (for simplicity, this query ignores all the other counts and collections). author and publ together form the "grouping key" in this query.
MATCH (author:Author{name:"Danielle S. Bassett"})-[:WROTE]->
(publ:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(author)
RETURN author, publ, COUNT(p2) as self_citation_count;
[Aside: your original query has other issues as well. For example, you should use COUNT(DISTINCT p1) as main_publ_count so that multiple self-citations from the same p1 instance will not inflate the count of "main" publications.]
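To make the arithmetic behind the quotient concrete, here is a minimal sketch in plain Python rather than Cypher. The toy data is hypothetical: it mirrors the example (P1 cites P2, P3 cites P4) and adds one invented external reference X1 so that the per-publication ratio is not trivially 1. Grouping by the citing publication plays the same role as putting publ in the grouping key.

```python
from collections import defaultdict

# Hypothetical toy data: the author wrote P1..P4; P1 cites P2 (own work)
# and X1 (someone else's, invented for illustration); P3 cites P4 (own work).
wrote = {"Danielle S. Bassett": {"P1", "P2", "P3", "P4"}}
references = [("P1", "P2"), ("P1", "X1"), ("P3", "P4")]


def self_citation_quotients(author, wrote, references):
    """Return (scq per citing publication, scq for the author)."""
    own = wrote[author]
    total = defaultdict(int)      # all references, grouped by citing publication
    self_refs = defaultdict(int)  # references to the author's own publications
    for citing, cited in references:
        if citing not in own:
            continue
        total[citing] += 1
        if cited in own:
            self_refs[citing] += 1
    # Per-publication quotient: self-references / all references.
    per_publication = {p: self_refs[p] / total[p] for p in total}
    # Author quotient: sum of the quotients / all publications by the author.
    author_scq = sum(per_publication.values()) / len(own)
    return per_publication, author_scq


per_publication, author_scq = self_citation_quotients(
    "Danielle S. Bassett", wrote, references
)
print(per_publication)
print(author_scq)
```

Publications without any references never appear in per_publication, but they still count in the denominator of the author quotient, which matches the definition given in the question.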


Co-occurrence analysis in Neo4j database

Let's say I have a database with nodes of two types, Candyjars and Candies. Every Candyjar (Candyjar1, Candyjar2, ...) has a different number of candies of different types: CandyRed, CandyGreen, etc.
Now let's say the end goal is to find the probability that the various types of candies occur together, and the covariance among them. Then I want relationships between each pair of candy types carrying the associated probability of co-occurrence and the covariance. Let's call these relationships OCCURS_WITH and COVARIES, so that Candytype1 -[OCCURS_WITH]->Candytype2 and Candytype1 -[COVARIES]->Candytype2.
I'd make a database with Candytypes and Candyjars as nodes, and a relationship (cj:Candyjar)-[r:CONTAINS]->(ct:Candytype), where r can have an attribute recording how many candies of that type are contained in the jar.
Now my problem is that I don't understand how, in Cypher, I can write a query that assigns the OCCURS_WITH relationship in an optimal manner. Would I have to iterate over every pair of candies, dividing the number of candy jars in which the pair co-occurs by the total number of candy jars? Is there a way to do it for all possible pairs at once?
When I try to do:
MATCH (ct1:Candytype)<-[r1:CONTAINS]-(cj:Candyjar)-[r2:CONTAINS]->(ct2:Candytype)
WHERE ct1<>ct2 AND ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,r1,count(r1),cj1,ct2,r2,count(r2)
LIMIT 5
I cannot get the count of the relationships of the co-occurring candies that I would need to express the probability of co-occurrence.
Would I have to use something like python to do the calculations rather than try to make a statement in Cypher?
To get the count of how many times CandyRed and CandyBlue co-occur, you can use the following Cypher statement:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
If you want a query that will compare all the candy types, you can use:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE id(ct1) < id(ct2)
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
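The calculation the question describes (number of jars containing both types, divided by the number of jars) can be sketched in plain Python. The jar contents below are hypothetical. Sorting the types before pairing them serves the same purpose as the id(ct1) < id(ct2) filter: each unordered pair is counted once.

```python
from collections import Counter
from itertools import combinations

# Hypothetical jars and the candy types each one contains.
jars = {
    "Candyjar1": {"CandyRed", "CandyBlue"},
    "Candyjar2": {"CandyRed", "CandyGreen"},
    "Candyjar3": {"CandyRed", "CandyBlue", "CandyGreen"},
    "Candyjar4": {"CandyGreen"},
}


def co_occurrence_probabilities(jars):
    """Map each unordered pair of candy types to P(both in the same jar)."""
    pair_counts = Counter()
    for types in jars.values():
        # sorted() gives every pair a canonical order, so (a, b) and (b, a)
        # are never counted as two different pairs.
        for pair in combinations(sorted(types), 2):
            pair_counts[pair] += 1
    jar_count = len(jars)
    return {pair: count / jar_count for pair, count in pair_counts.items()}


probabilities = co_occurrence_probabilities(jars)
for pair, probability in sorted(probabilities.items()):
    print(pair, probability)
```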

ActiveRecord group with alias

First question ever on here, and pretty new to coding full apps/Rails.
I was creating a method to get the counts for titles by author, and noticed that if the author is cased differently, it would count as different authors. I wanted to place some sort of validation/check to disregard the casing and count it together. I don't care about the casing of the book titles in this particular case.
So I have table like this:
Author               Book Title                                Year  Condition
William Shakespeare  Hamlet                                    1599  Poor
Stephen King         The Shining                               1977  New
Edgar Allen Poe      The Raven                                 1845  Good
JK Rowling           Harry Potter and the Sorcerer's Stone     2001  New
edgar allen poe      The Tell-Tale Heart                       1843  Good
JK Rowling           Fantastic Beasts and Where to Find Them   2001  New
I want to output this:
Author               Count
William Shakespeare  1
Stephen King         1
Edgar Allen Poe      2
JK Rowling           2
My method was originally something like this:
def self.book_counts
  distinct_counts = []
  Book.group(:author).count.each do |count|
    distinct_counts << count
  end
  distinct_counts
end
To ignore casing, I referenced this page and came up with these, which didn't end up working out, unfortunately:
1) With this one I get "undefined method lower":
Book.group(lower('author')).count.each do |count|
distinct_counts << count
2) This runs, but with the select method in general, I get a bunch of ActiveRecord results/Record id: nil. I am using Rails 6 and it additionally notes "DEPRECATION WARNING: Dangerous query method (method whose arguments are used as raw SQL) called with non-attribute argument(s) ... Non-attribute arguments will be disallowed in Rails 6.1. This method should not be called with user-provided values, such as request parameters or model attributes. Known-safe values can be passed by wrapping them in Arel.sql(). (called from irb_binding at (irb):579)":
Book.select("lower(author) as dc_auth, count(*) as book_count").group("dc_auth").order("book_count desc")
3) I even tried to test a different, simplified function to see if it'd work, but I got "ActiveRecord::StatementInvalid (PG::GroupingError: ERROR: column "books.author" must appear in the GROUP BY clause or be used in an aggregate function)":
Book.pluck('lower(author) as dc_auth, count(*) as book_count')
4) I've tried various other ways, with additional different errors, e.g. "undefined local variable or method 'dc_auth'", "undefined method 'group' did you mean group_by?", and "wrong number of arguments (given 1, expected 0)" (with group_by), etc.
This query works exactly how I want it to in PostgreSQL. The same SQL actually shows up in the terminal when I run #2, but, as mentioned, the result doesn't come back in a usable form through ActiveRecord in Rails.
SELECT lower(author) as dc_auth, count(*) as book_count FROM books GROUP BY dc_auth;
Is there even a way to run what I want through Rails??
You can try:
Book.group("LOWER(author)").count
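What this grouping computes can be checked outside Rails. The sketch below runs equivalent SQL against the sample rows from the question, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is self-contained). The two-column books table is a simplification of the one in the question.

```python
import sqlite3

# Sample rows from the question (author, title); year and condition omitted.
rows = [
    ("William Shakespeare", "Hamlet"),
    ("Stephen King", "The Shining"),
    ("Edgar Allen Poe", "The Raven"),
    ("JK Rowling", "Harry Potter and the Sorcerer's Stone"),
    ("edgar allen poe", "The Tell-Tale Heart"),
    ("JK Rowling", "Fantastic Beasts and Where to Find Them"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, title TEXT)")
conn.executemany("INSERT INTO books VALUES (?, ?)", rows)

# Grouping on lower(author) merges differently cased spellings of a name.
query = """
    SELECT lower(author) AS dc_auth, count(*) AS book_count
    FROM books
    GROUP BY lower(author)
"""
counts = dict(conn.execute(query).fetchall())
conn.close()

print(counts)
```

The keys come back lowercased, so the original capitalization has to be restored separately if it matters for display.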
You can execute your query using ActiveRecord. I suggest putting the SQL in a heredoc block:
book_count_query = <<-SQL
SELECT lower(author) as dc_auth, count(*) as book_count
FROM books
GROUP BY dc_auth;
SQL
1- result = ActiveRecord::Base.connection.execute(book_count_query)
or
2- result = ActiveRecord::Base.connection.exec_query(book_count_query)
What is the difference between line 1 and line 2?
exec_query returns an ActiveRecord::Result object, which has handy methods like .columns and .rows to access the headers and values.
The array of hashes from .execute can be troublesome to deal with, and it gave me redundant results when I ran a query with a SUM and a GROUP BY clause.
If you need to read more about this topic:
the exec_query example in api.rubyonrails
active_record_querying in the Rails documentation
This resource has an example of a query and its output.
Why do you store authors in the same table as books? A better solution is to add a separate table for authors and an author_id foreign key to the books table. With counter_cache you can easily count the number of books for each author.
Here is a guide with books and authors examples https://guides.rubyonrails.org/association_basics.html

Auto increment id Neo4j to retrieve elements in insert order

Recently, I have been experimenting with Neo4j. I like the idea, but I am facing a problem that I have never faced with relational databases.
I want to perform these inserts and then return them exactly in the insertion order.
Insert elements:
create(p1:Person {name:"Marc"})
create(p2:Person {name:"John"})
create(p3:Person {name:"Paul"})
create(p4:Person {name:"Steve"})
create(p5:Person {name:"Andrew"})
create(p6:Person {name:"Alice"})
create(p7:Person {name:"Bob"})
And to return them:
match(p:Person) return p order by id(p)
I receive the elements in the following order:
Paul
Andrew
Marc
John
Steve
Alice
Bob
I note that these elements are not returned respecting the query insertion order (through the id function).
In fact the id of my elements are the following:
Marc: 18221
John: 18222
Paul: 18208
Steve: 18223
Andrew: 18209
Alice: 18224
Bob: 18225
How does the Neo4j id function work? I read that it generates an auto-incrementing id, but its mechanism seems a little strange. How do I return items in insertion order? I thought about creating a timestamp attribute for each node, but I don't think that's the best choice.
If you're looking to generate sequence numbers in Neo4j then you need to manage this yourself using a strategy that works best in your application.
In ours we maintain sequence numbers in key/value pair nodes where Scope is the application name given to the sequence number range, and Value is the last sequence number used. When we generate a node of a given type, such as Product, then we increment the sequence number and assign it to our new node.
MERGE (n:Sequence {Scope: 'Product'})
SET n.Value = COALESCE(n.Value, 0) + 1
WITH n.Value AS seq
CREATE (product:Product)
SET product.UniqueId = seq
With this you can create as many sequence numbers as you need just by creating sequence nodes with unique scope names.
For more examples and tests see the AutoInc.Neo4j project https://github.com/neildobson-au/AutoInc/blob/master/src/AutoInc.Neo4j/Neo4jUniqueIdGenerator.cs
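The idea of one counter per scope can be sketched in plain Python. SequenceStore and next_value are hypothetical names for illustration, not part of Neo4j or the linked project, and this toy version ignores concurrency entirely. The get-with-default step corresponds to COALESCE(n.Value, 0) + 1 in the Cypher above.

```python
class SequenceStore:
    """Toy in-memory stand-in for the Sequence nodes (hypothetical helper)."""

    def __init__(self):
        self._last_value = {}  # scope name -> last sequence number used

    def next_value(self, scope):
        # Start at 0 when the scope is new, then increment.
        value = self._last_value.get(scope, 0) + 1
        self._last_value[scope] = value
        return value


store = SequenceStore()
names = ["Marc", "John", "Paul", "Steve", "Andrew", "Alice", "Bob"]
people = [{"name": name, "seq": store.next_value("Person")} for name in names]

# Scramble the records, then recover insertion order from the stored number.
scrambled = sorted(people, key=lambda person: person["name"])
restored = [p["name"] for p in sorted(scrambled, key=lambda person: person["seq"])]
print(restored)
```

Ordering by the stored sequence number, rather than by the internal id, is what brings back the insertion order.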
The id in Neo4j is maintained internally, and your business code should not depend on it.
Generally it auto-increments, but after a delete operation the deleted id may be reused, according to the reuse policy of the Neo4j server.

How to concatenate three columns into one and obtain count of unique entries among them using Cypher neo4j?

Using Cypher in Neo4j, I can query the Panama database for the countries of three types of identity holders (I define that term), namely Entities (companies), Officers (shareholders) and Intermediaries (middle companies), as three attributes/columns. Each column has one or two entries separated by a semicolon (e.g. British Virgin Islands;Russia). We want to concatenate the countries in these columns into a unique set of countries, and then obtain the number of countries as a new attribute.
For this, I tried the following code from my understanding of Cypher:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
SET BEZ4.countries= (BEZ1.countries+","+BEZ2.countries+","+BEZ3.countries)
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,DISTINCT count(BEZ4.countries) AS NoofConnections
The relevant parts are the SET statement in the 7th line and the DISTINCT count in the last line. The query fails with an error that makes no sense to me: Invalid input 'u': expected 'n/N'. I guess it means I should use COLLECT, but we tried that as well and the error just swaps 'u' and 'n'. Please help us obtain the output we want; it would make our job a lot easier. Thanks in advance!
EDIT: Since I hadn't defined the variable, as pointed out by @cybersam, I tried the CREATE command as follows, but it shows the error "Invalid input 'R':" at RETURN. I can't make sense of this. Help is really needed, thank you.
CODE 2:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-
[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND
BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved",
"Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections{countries:
split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries),";")
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress, AS TOTAL, collect (DISTINCT
COUNT(p.countries)) AS NumberofConnections
Lines 8 and 9 are the new ones in question.
First Query
You never defined the identifier BEZ4, so you cannot set a property on it.
Second Query (which should have been posted in a separate question):
You have several typos and a syntax error.
This query should not get an error (but you will have to determine if it does what you want):
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR (BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections {countries: split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries), ";")})
RETURN BEZ3.countries AS IntermediaryCountries,
BEZ3.name AS Intermediaryname,
BEZ2.countries AS OfficerCountries ,
BEZ2.name AS Officername,
BEZ1.countries as EntityCountries,
BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,
SIZE(p.countries) AS NumberofConnections;
Problems with the original:
The CREATE clause was missing a closing } and also a closing ).
The RETURN clause had a dangling AS TOTAL term.
collect (DISTINCT COUNT(p.countries)) was attempting to perform nested aggregation, which is not supported. In any case, even if it had worked, it probably would not have returned what you wanted. I suspect that you actually wanted the size of the p.countries collection, so that is what I used in my query.
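One point to keep in mind: a list produced by splitting the concatenated string keeps repeated entries, so its size counts duplicates. If a count of distinct countries is what is wanted, the values need to be de-duplicated first. A minimal Python sketch of that step, using made-up field values in the semicolon-separated format described in the question (unique_countries is a hypothetical helper name):

```python
def unique_countries(*fields):
    """Split semicolon-separated country fields and drop repeats, keeping order."""
    seen = []
    for field in fields:
        for country in field.split(";"):
            country = country.strip()
            if country and country not in seen:
                seen.append(country)
    return seen


# Made-up values for an entity, an officer and an intermediary.
countries = unique_countries("British Virgin Islands;Russia", "Russia", "Belize")
print(countries, len(countries))

# For comparison: the raw split keeps the repeated "Russia".
raw = ";".join(["British Virgin Islands;Russia", "Russia", "Belize"]).split(";")
print(len(raw))
```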

NEO4j 3.0 retrieve data between certain period

I'm using NEO4J 3.0 and it seems that HAS function was removed.
The type of my relationship is a date, and I want to retrieve all relationships between two dates for which my property "a" is greater than a certain value.
How can I test this using NEO4j?
Thank you
[EDITED to add info from comments]
I have tried this:
MATCH p=(n:origin)-[r]->()
WHERE r>'2015-01'
RETURN AVG(r.amount) as totalamout;
I created relationship per date and each one has a property, amount, and I am looking to compute the average amount for certain period. As example, average amount since 2015-04.
To answer the issue raised by your first sentence: in neo4j 3.x, the HAS() function was replaced by EXISTS().
[UPDATE 1]
This version of your query should work:
MATCH p=(n:origin)-[r]->()
WHERE TYPE(r) > '2015-01'
RETURN AVG(r.amount) as totalamout;
However, it is a bad idea to give your relationships different types based on a date. It is better to just use a date property.
[UPDATE 2]
If you changed your data model to add a date property to your relationships (to which I will give the type FOO), then the following query will find the average amount, per p, of all the relationships whose date is after 2015-01 (assuming that all your dates follow the same strict YYYY-MM pattern):
MATCH p=(n:origin)-[r:FOO]->()
WHERE r.date > '2015-01'
RETURN p, AVG(r.amount) as avg_amout;
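The reason the strict YYYY-MM pattern matters is that the comparison is done on strings, character by character. A small Python sketch with hypothetical data shows both the filter-and-average step and what goes wrong when a month is not zero-padded (average_amount_after is a hypothetical helper name):

```python
# Hypothetical relationships, each with a YYYY-MM date string and an amount.
relationships = [
    {"date": "2014-11", "amount": 10.0},
    {"date": "2015-01", "amount": 20.0},
    {"date": "2015-04", "amount": 30.0},
    {"date": "2015-12", "amount": 50.0},
]


def average_amount_after(relationships, cutoff):
    """Average the amounts whose date string sorts strictly after cutoff."""
    amounts = [r["amount"] for r in relationships if r["date"] > cutoff]
    return sum(amounts) / len(amounts) if amounts else None


average = average_amount_after(relationships, "2015-01")
print(average)

# Without zero-padding, string order no longer matches calendar order:
# "2015-4" sorts after "2015-12" because "4" > "1".
unpadded_sorts_later = "2015-4" > "2015-12"
print(unpadded_sorts_later)
```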
