Neo4j percentage query

I recently had a homework problem that I could not figure out how to even start, let alone how to sum the trades for each one-letter purpose code and return each purpose's percentage. I have included the Neo4j model as well as the first few rows of the data. The data shows the taxon (Class, Order, Family, Genus), whether the trade was imported or exported, and the purpose (for example, C. Aves imported for commercial purposes).
What % of trades fall under each purpose type (bred, captured, wild, zoo, etc.)?
CLASS      ORDER          IMPORT/EXPORT   PURPOSE
Aves       Falconiformes  43 Exported     S (Science)
Mammalia   Carnivora       2 Exported     H (Hunting)
Reptilia   Crocodylia     10 Imported     S (Science)
Aves       Anseriformes    2 Imported     T (Commercial)
Mammalia   Primates       700 Exported    T (Commercial)
Neo4j Graph Dataset
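Since the graph model isn't shown, here is a sketch of one way to do it in Cypher, assuming a hypothetical model with one (:Trade) node per row, a purpose property holding the letter code, and a count property holding the number of animals traded (adjust labels and properties to your actual model):

```cypher
// Hypothetical model: (:Trade {purpose, count}) — one node per table row.
// First compute the grand total, then aggregate per purpose against it.
MATCH (t:Trade)
WITH sum(t.count) AS total
MATCH (t:Trade)
RETURN t.purpose AS purpose,
       100.0 * sum(t.count) / total AS percentage
ORDER BY percentage DESC
```

The key trick is the WITH clause: the first MATCH/WITH pair reduces everything to a single total row, and the second MATCH re-expands the trades so each purpose group can be divided by that total.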

how to select SpatRaster layers from their names?

I've got a SpatRaster of (150 x 150 x 1377) that shows the temporal evolution of precipitation. Each layer is a given hour in a 2-month interval, but some hours are missing, so the dataset isn't continuous. The layer names are strings of the form "YYYYMMDDhhmm".
I need to find the mean value every three hours, both on complete intervals and on intervals with missing data. On complete ones I want to average three layers; where one is missing I would like to average the remaining two; and where two are missing, to take the single remaining layer as the mean.
How can I use the layer names to decide how to act?
I've already tried the code below, but it averages three consecutive layers by index, not by hour. How can I convert the names to date-times (e.g. with "tidyverse"/lubridate) so I can check, with rollapply() or otherwise, whether the layer two steps back has the timestamp I expect? Is there any other method to check this?
files <- paste0(resfolder, c(
  "HSAF_final1_5.tif",    "HSAF_final6_10.tif",    "HSAF_final11_15.tif",
  "HSAF_final16_20.tif",  "HSAF_final21_25.tif",   "HSAF_final26_30.tif",
  "HSAF_final31_N04.tif", "HSAF_finalN05_N08.tif", "HSAF_finalN09_N13.tif",
  "HSAF_finalN14_N18.tif","HSAF_finalN19_N23.tif", "HSAF_finalN24_N28.tif",
  "HSAF_finalN29_N30.tif"))
HSAF <- rast(files)
index <- names(HSAF)
j <- 2

# First window (layers 1-3) initialises the result raster
for (i in seq(1, 3, by = 3)) {
  third_el  <- HSAF[[index[i + j]]]      # [[ ]] selects a layer by name
  second_el <- HSAF[[index[i + j - 1]]]
  first_el  <- HSAF[[index[i + j - 2]]]
  newraster <- c(first_el, second_el, third_el)
  newraster <- mean(newraster, filename = paste0(tempfile(), ".tif"))
  names(newraster) <- paste0(index[i + j - 2], index[i + j - 1], index[i + j])
}

# Remaining windows get appended to the result
for (i in seq(4, 1374, by = 3)) {
  third_el  <- HSAF[[index[i + j]]]
  second_el <- HSAF[[index[i + j - 1]]]
  first_el  <- HSAF[[index[i + j - 2]]]
  subraster <- c(first_el, second_el, third_el)
  subraster <- mean(subraster, filename = paste0(tempfile(), ".tif"))
  names(subraster) <- paste0(index[i + j - 2], index[i + j - 1], index[i + j])
  add(newraster) <- subraster
}
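One way to group by real timestamps instead of by index is to parse the layer names into date-times and let terra::tapp() average whatever layers fall into each 3-hour window. A sketch (untested), assuming the terra and lubridate packages and that HSAF is the SpatRaster built above:

```r
# Windows are counted from the first timestamp; adjust the offset if you
# need them aligned to clock hours (00, 03, 06, ...).
library(terra)
library(lubridate)

times  <- ymd_hm(names(HSAF))                 # "YYYYMMDDhhmm" -> POSIXct
hrs    <- as.numeric(difftime(times, min(times), units = "hours"))
window <- floor(hrs / 3)                      # 3-hour window id per layer

# tapp() averages all layers sharing a window id, so each mean automatically
# uses however many layers (3, 2 or 1) actually survived in that window
HSAF3h <- tapp(HSAF, index = window, fun = mean, na.rm = TRUE)

wins <- sort(unique(window))
names(HSAF3h) <- format(min(times) + hours(wins * 3), "%Y%m%d%H%M")
```

This sidesteps rollapply() entirely: missing hours simply leave fewer layers in a window, and the mean adapts.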

[Google Data Studio]: Can't create histogram as bin dimension is interpreted as metric

I'd like to make a histogram of some data that is saved in a nested BigQuery table. In a simplified manner, the table can be created in the following way:
CREATE TEMP TABLE Sessions (
  id int,
  hits ARRAY<STRUCT<pagename STRING>>
);
INSERT INTO Sessions (id, hits)
VALUES
( 1,[STRUCT('A'),STRUCT('A'),STRUCT('A')]),
( 2,[STRUCT('A')]),
( 3,[STRUCT('A'),STRUCT('A')]),
( 4,[STRUCT('A'),STRUCT('A')]),
( 5,[]),
( 6,[STRUCT('A')]),
( 7,[]),
( 8,[STRUCT('A')]),
( 9,[STRUCT('A')]),
(10,[STRUCT('A'),STRUCT('A')]);
and it looks like

id   hits.pagename
1    A, A, A
2    A
3    A, A
and so on for the other ids. My goal is to obtain a histogram showing the distribution of A-occurrences per id in Data Studio. The report for the MWE can be seen here: link
So far I created a calculated field called pageviews that evaluates the wanted occurrences for each session via SUM(IF(hits.pagename="A",1,0)). Looking at a table showing id and pageviews I get the expected result:
table showing the number of occurrences of page A for each id
However, the output of the calculated field is a metric, which might cause trouble for what follows. In the next step I wanted to follow the procedure presented in this post. Therefore, I created another field, bin, to assign my sessions to bins according to their pageviews:
CASE
WHEN pageviews = 0 OR pageviews = 1 THEN "bin 1"
WHEN pageviews = 2 OR pageviews = 3 THEN "bin 2"
END
According to this bin definition I hope to obtain a histogram with 6 counts in bin 1 and 4 counts in bin 2. Well, in this particular example it will actually have 4 counts in bin 1, as ids 5 and 7 have "null" entries, but never mind: this won't happen in my real-world table.
As you can see in the next image, showing the same table as above but now with a bin column, this assignment works as well: each id is assigned the correct bin, but the output field is a metric of type text. Therefore, the bar chart won't let me use it (it needs a dimension).
Assignment of each id to a bin
Somewhere I read the workaround of creating a self-joined blend, which outputs metrics as dimensions. This works only by name: my field is now a dimension and I can use it as such for the bar chart, but the chart won't load and shows a configuration error of the data source, which can be seen in this picture:
bar chart of id-count over bin. In the configuration of the chart one can see that "bin" is now a dimension. The chart won't plot, however, as it shows a data configuration error (sorry for the German version of Data Studio).
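One workaround (a sketch, untested against this report) is to do both the counting and the binning in BigQuery itself, so the field reaches Data Studio as a plain STRING dimension rather than a calculated metric:

```sql
-- Count 'A' hits per session, then label the bin, all before Data Studio.
SELECT
  id,
  pageviews,
  CASE
    WHEN pageviews <= 1 THEN 'bin 1'
    WHEN pageviews <= 3 THEN 'bin 2'
  END AS bin
FROM (
  SELECT
    id,
    (SELECT COUNTIF(h.pagename = 'A') FROM UNNEST(hits) AS h) AS pageviews
  FROM Sessions
);
```

Because bin is an ordinary STRING column of the data source, the bar chart can use it as a dimension directly, with Record Count as the metric.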

Calculating self citation counts in DBLP using neo4j

I have imported the DBLP database with referenced publications from Crossref API into neo4j.
The goal is to calculate a self-citation-quotient for each author in the database.
The way I'd like to calculate this quotient is the following:
find authors that have written publications referencing another publication written by the same author
for each of these publications, count the referenced publications written by the same author
divide the number of self-references by the number of all references
set this number as a property scq (self-citation quotient) on the publication
sum all scq values and divide by the total number of publications written by the author
set this value as a property scq on the Author
As an example I have the following sub-graph for the author "Danielle S. Bassett":
From the graph you can see that she has 2 publications that contain self-references.
In Words:
Danielle wrote Publication 1, 2, 3, 4
Publication 1 references publication 2
Publication 3 references publication 4
My attempt was to use the following cypher query:
match (a:Author{name:"Danielle S. Bassett"})-[:WROTE]->(p1:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(a)
with count(p2) as ssc_per_publ,
count(p1) as main_publ_count,
collect(p2) as self_citations,
collect(p1) as main_publ,
collect(r) as refs,
a as author
return author, main_publ, ssc_per_publ, self_citations, main_publ_count, refs
The result of this query as a table looks like this:
As you can see from the table, the main_publ_count is calculated correctly, since there are 2 publications she has written that contain self-references, but the ssc_per_publ (self-citation count per publication) is wrong because it counted ALL self-references. I need the count of self-references for EACH PUBLICATION.
Calculating the quotients will not be the problem; getting the right values out of Neo4j is.
I hope I've expressed myself clearly enough for you to understand the issue.
Maybe someone knows a way of getting this right. Thanks!
Your WITH clause is using author as the sole aggregation function "grouping key", since it is the only term in that clause not using an aggregation function. So, all the aggregation functions in that clause are aggregating over just that one term.
To get a "self citation count" per publication (by that author), you'd have to do something like the following (for simplicity, this query ignores all the other counts and collections). author and publ together form the "grouping key" in this query.
MATCH (author:Author {name: "Danielle S. Bassett"})-[:WROTE]->
      (publ:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(author)
RETURN author, publ, COUNT(p2) AS self_citation_count;
[Aside: your original query has other issues as well. For example, you should use COUNT(DISTINCT p1) as main_publ_count so that multiple self-citations to the same p1 instance will not inflate the count of "main" publications.]
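Building on that grouping idea, the whole procedure from the question could be sketched as follows (untested against the actual data; it assumes every outgoing [:REFERENCES] relationship counts toward a publication's reference total):

```cypher
// Per-publication self-citation quotient, then averaged onto the author.
MATCH (author:Author {name: "Danielle S. Bassett"})-[:WROTE]->(publ:Publication)
OPTIONAL MATCH (publ)-[r:REFERENCES]->(:Publication)
WITH author, publ, count(r) AS all_refs
OPTIONAL MATCH (publ)-[sr:REFERENCES]->(:Publication)<-[:WROTE]-(author)
WITH author, publ, all_refs, count(sr) AS self_refs
// count(sr) ignores nulls, so publications without self-references get 0;
// a SET publ.scq = ... could be added here before the aggregation below
WITH author, CASE WHEN all_refs = 0 THEN 0.0
                  ELSE toFloat(self_refs) / all_refs END AS scq
WITH author, avg(scq) AS author_scq
SET author.scq = author_scq
RETURN author.name AS name, author_scq
```

The two OPTIONAL MATCH steps keep publications with zero (self-)references in the grouping, which a plain MATCH would silently drop.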

Data Warehouse Design of Fact Tables

I'm pretty new to data warehouse design and am struggling with how to design the fact table given very similar, but somewhat different, metrics. Let's say you were evaluating the metrics below (in this case, company is a subset of client). How would you break up the fact tables? Could you use one table for all of this, would each metric being measured warrant its own fact table, or would each part of the metric being measured be its own column in one fact table?
Total company daily/monthly/yearly # of files processed
Total company daily/monthly/yearly file sizes processed
Total company daily/monthly/yearly # files errored
Total company daily/monthly/yearly # files failed
Total client daily/monthly/yearly # of files processed
Total client daily/monthly/yearly file sizes processed
Total client daily/monthly/yearly # files errored
Total client daily/monthly/yearly # files failed
By the looks of the measure names, I think you'll be best served by a single fact table with a record for each file and a link back to a date_dim:
create table date_dim (
  date_sk int,
  calendar_date date,
  month_ordinal int,
  month_name nvarchar,
  year int
  -- etc.; you've got your own one of these
)
create table fact_file_measures (
  date_sk int,
  file_sk int,      -- ref the file_dim with additional file info
  company_sk int,   -- ref the company_dim with the client details
  processed int,    -- should always be one; optional, depending on how your reporting team likes to work
  size_Kb decimal,  -- specify a size measurement, ambiguity is bad
  error_count int,  -- 1 if the file had an error, 0 if fine
  failed_count int  -- 1 if the file failed, 0 if fine
)
so now you should be able to construct queries to get everything you asked for
for example, for your monthly stats:
select
    c.company_name,
    c.client_name,
    sum(f.processed)    as total_files,
    sum(f.size_Kb)      as total_files_size_Kb,
    sum(f.error_count)  as total_errors,
    sum(f.failed_count) as total_failures
from
    fact_file_measures f
    inner join company_dim c on f.company_sk = c.company_sk
    inner join date_dim d on f.date_sk = d.date_sk
where
    d.month_name = 'January' and d.year = 1984
group by
    c.company_name,
    c.client_name
If you need the side-by-side day/month/year stuff, you can construct year and month fact tables to do the roll-ups and join back via date_dim's month/year fields. (You could include month and year fields in the daily fact table, but these values may end up being misused by less experienced report builders.) It all goes back to what your users actually want: design your fact tables to their requirements, and don't be afraid to have separate fact tables. A data warehouse is not about normalization; it's about presenting the data in a way that it can be used.
Good luck
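If it helps, a monthly roll-up of that daily fact table could be sketched like this (column names follow the tables above; treat it as a starting point rather than a finished design):

```sql
-- Sketch: pre-aggregated monthly fact built from the daily grain,
-- joined back through date_dim's month/year fields.
create table fact_file_measures_monthly as
select
    d.year,
    d.month_ordinal,
    f.company_sk,
    sum(f.processed)    as total_files,
    sum(f.size_Kb)      as total_size_Kb,
    sum(f.error_count)  as total_errors,
    sum(f.failed_count) as total_failures
from fact_file_measures f
inner join date_dim d on f.date_sk = d.date_sk
group by d.year, d.month_ordinal, f.company_sk;
```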

Index Match? Or some other function?

In ‘Student Needs’!, columns I through O contain information on when each student attends an intervention class. Intervention classes take place during the second half of regular classes (science or social studies) or during the second half of co-taught classes (math or ELA). Science and social studies interventions are done on either Monday/Wednesday or Tuesday/Friday (Thursdays have a special schedule that we do not need to consider). Math and ELA interventions occur on all four days.
In ‘Student Master’!, each student’s schedule is listed for both MW and TF. In Columns E, G, K, and M, I would like to populate any of the interventions that are listed in the ‘Student Needs’! sheet. For instance, Lindsey Lukowski has Social Skills on MW2 (Mondays and Wednesdays 2nd hour). So in cell ‘Student Master’! G31 should return “Social Skills”.
William Watters is getting Read Naturally and Reading Comp during his 5th Hour Co-Taught ELA. So ‘Student Master’! K51 and K52 should both return Read Naturally & Reading Comp (in the same cell).
Here is the workbook:
https://docs.google.com/spreadsheets/d/1aW7ExATzMn9Rf8IFLI4v-CQiqsXnxyDm8PxqMW999bY/edit?usp=sharing
Here is a complex function that seems to do what you want. I have tested it in a copy of your sheet.
Just change E$2 for different columns.
=IFERROR(INDIRECT("'Student Needs'!"&CHAR(72+IFERROR(MATCH("HR "&E$2,ARRAYFORMULA(REGEXEXTRACT(INDIRECT("'Student Needs'!I"&REGEXEXTRACT($A3,"[0-9]+")+5&":N"&REGEXEXTRACT($A3,"[0-9]+")+5),"[A-Z ]+[0-9]")),0),MATCH($C3,ARRAYFORMULA(REGEXEXTRACT(INDIRECT("'Student Needs'!I"&REGEXEXTRACT($A3,"[0-9]+")+5&":N"&REGEXEXTRACT($A3,"[0-9]+")+5),"[A-Z]+")),0)))&5))
Also, I am not sure where "6- Science" should go. Is this also HR 6?
In order for it to work with the actual sheet, looking up each student's row by name (via VLOOKUP) instead of by the number in column A:
=IFERROR(INDIRECT("'Student Needs'!"&CHAR(72+IFERROR(MATCH("HR "&E$2,ARRAYFORMULA(REGEXEXTRACT(INDIRECT("'Student Needs'!I"&ROW(VLOOKUP($A3,'Student Needs'!$A$1:$B,2,false))&":N"&ROW(VLOOKUP($A3,'Student Needs'!$A$1:$B,2,false))),"[A-Z ]+[0-9]")),0),MATCH($C3,ARRAYFORMULA(REGEXEXTRACT(INDIRECT("'Student Needs'!I"&ROW(VLOOKUP($A3,'Student Needs'!$A$1:$B,2,false))&":N"&ROW(VLOOKUP($A3,'Student Needs'!$A$1:$B,2,false))),"[A-Z]+")),0)))&5))
