Cypher request to extract genealogy - neo4j

I'm parsing a book in Neo4j and I'd like to extract the genealogy from it. I have sentences like:
"A begat B,C and D"
"X begat Y, and Y begat Z, ..."
and I store that as
(A:word)-[:subj]->(begat:word)-[:obj]-> (B:word)
(A:word)-[:subj]->(begat:word)-[:comp]-> (C:word)
(X:word)-[:subj]->(begat:word)-[:obj]-> (Y:word)
(Y:word)-[:subj]->(begat:word)-[:obj]-> (Z:word)
(X:word)-[:NNP]->(sentence:word)
(Y:word)-[:NNP]->(sentence:word)
(Z:word)-[:NNP]->(sentence:word)
(begat:word)-[:VBG]->(sentence:word)
How could I write my Cypher query so that the Neo4j server visualization gives me a tree, instead of one "begat" node with all the other nodes linking to it? My genealogy spans several sentences, and when linking words together I add the sentenceId to the relationship; maybe we could use that.
The result would look like
      A
 _____|_____
 |    |    |
 B    C    D
 |
 X
 |
 Y
 |
 Z
One more piece of information: the words are stored only once, to limit memory consumption.
Here is a sample of my data :
http://console.neo4j.org/r/xzsazf
Many thanks
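For what it's worth, the structure the query needs to surface can be made explicit outside Neo4j too. Below is a minimal sketch in plain Python (not Cypher; the names and sentence ids are illustrative) of deriving the parent -> child pairs from the subj/obj relations of "begat" and printing them as a tree:
# Each relation is (subject, verb, object, sentence_id); word nodes are shared,
# as in the question, and the sentence id rides on the relationship.
relations = [
    ("A", "begat", "B", 1),
    ("A", "begat", "C", 1),
    ("A", "begat", "D", 1),
    ("X", "begat", "Y", 2),
    ("Y", "begat", "Z", 2),
]
# The genealogy is just the (parent, child) pairs of every "begat" relation.
children = {}
for subj, verb, obj, sentence_id in relations:
    if verb == "begat":
        children.setdefault(subj, []).append(obj)
def print_tree(person, depth=0):
    print("  " * depth + person)
    for child in children.get(person, []):
        print_tree(child, depth + 1)
print_tree("A")  # A -> B, C, D
print_tree("X")  # X -> Y -> Z (the second chain from the sample sentences)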

Related

Esttab: Create new row with logarithm of a beta coefficient

I am using Stata and I'm currently trying to figure out how to create a new row that shows the relative effect of a certain coefficient. To give sample code, this is what I have:
eststo, title(log_total[1]): reg log_total a b
eststo, title(log_total[2]): reg log_total a b c
esttab using total.tex
However, in the end, besides the rows for a, b, and c, I want a row labelled "Effect" for a, where I calculate exp(a)-1 and print it as exp(a)-1 %.
The table should look like the following:
| | total[1]| total[2]|
|:---- |:------:| -----:|
| a| 0.014| 0.021|
| b| 0.031| 0.005|
| c| | 0.082|
| Effect| 1.4 %| 2.1 %|
How can I add this "Effect" row to my table using esttab? I tried using estadd, which works for fixed values, but I was not able to figure out how to include a calculation there.
Thank you a lot!
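For reference, the arithmetic behind the desired "Effect" row is just exp(coefficient) - 1, expressed as a percentage; here is a minimal sketch of that calculation in Python (the esttab/estadd wiring is the open question), using the coefficients for a from the table above:
import math
# Coefficients for a from the two models, as shown in the table above
coef_a = {"total[1]": 0.014, "total[2]": 0.021}
# Relative effect of a one-unit change in a: exp(coef) - 1, as a percentage
for model, b in coef_a.items():
    effect = math.exp(b) - 1
    print(f"{model}: {effect:.3f} -> {effect:.1%}")
# total[1]: 0.014 -> 1.4%
# total[2]: 0.021 -> 2.1%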

Splitting file processing by initial keys

Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂 state-data/
|-📜 AL.zip
|-📜 AK.zip
|-📜 ...
|-📜 WY.zip
📂 county-data/
|-📂 AL/
|  |-📜 COUNTY1.csv
|  |-📜 COUNTY2.csv
|  |-📜 ...
|  |-📜 COUNTY68.csv
|-📂 AK/
|  |-📜 ...
|-📂 .../
|-📂 WY/
|  |-📜 ...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
| uid    | census_plot  | flood_zone |
|--------|--------------|------------|
| abc121 | ACVB-1249575 | R50        |
| abc122 | ACVB-1249575 | R50        |
| abc123 | ACVB-1249575 | R51        |
| abc124 | ACVB-1249599 | R51        |
| abc125 | ACVB-1249599 | R50        |
| ...    | ...          | ...        |
County Level Data - thousands of these (~300 cols wide)
| uid    | county | subdivision    | tax_id |
|--------|--------|----------------|--------|
| abc121 | 04021  | Roland Heights | 3t4g   |
| abc122 | 04021  | Roland Heights | 3g444  |
| abc123 | 04021  | Roland Heights | 09udd  |
| ...    | ...    | ...            | ...    |
So we join many county-level records to a single state-level record, and thus have an aggregated, more complete state-level data set.
Then we aggregate all the states, and we have a national-level data set.
Desired Outcome
I can successfully merge one state at a time (many counties to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keys = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    merged = ({"state": state_rows_keys, "county": county_rows_keyed}
              | beam.CoGroupByKey()
              | beam.Map(merge_logic))
    merged | WriteToParquet()
The generalized flow would be:
1) Start with a list of states.
2) By state, generate filepatterns to the source data.
3) By state, load and merge the files behind those filepatterns.
4) Flatten the output from each state into a US data set.
5) Write to a Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at the element.
But to express how the data flows, I need access to the Pipeline so I can apply a PTransform to it, e.g. df = p | read_csv(...).
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
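A rough sketch of that side-input suggestion, in Python with Apache Beam and pandas; the element shape, the paths, and the merge_logic helper here are assumptions for illustration, not a finished pipeline:
import apache_beam as beam
import pandas as pd

def merge_logic(county_df, state_df):
    # Hypothetical merge: join county rows onto the state rows by uid.
    return county_df.merge(state_df, on="uid", how="left")

class MergeCountyFile(beam.DoFn):
    """Takes a (state, county_csv_path) element, opens the CSV with pandas,
    and merges it with that state's data, whose path arrives as a side input
    mapping state -> state-level file."""
    def process(self, element, state_paths):
        state, county_path = element
        county_df = pd.read_csv(county_path)
        state_df = pd.read_csv(state_paths[state], compression="zip")
        yield state, merge_logic(county_df, state_df)

with beam.Pipeline() as p:
    # (state, county CSV path) pairs; listing the real files is elided here.
    county_files = p | "CountyFiles" >> beam.Create([
        ("AL", "county-data/AL/COUNTY1.csv"),
        ("AL", "county-data/AL/COUNTY2.csv"),
    ])
    # state -> path of the state-level zip, used as a side input.
    state_paths = p | "StatePaths" >> beam.Create([("AL", "state-data/AL.zip")])
    merged = county_files | "MergePerFile" >> beam.ParDo(
        MergeCountyFile(), state_paths=beam.pvalue.AsDict(state_paths))
Note that, as written, the state zip is re-read for every county file; in practice you would cache it per worker or pre-load one state DataFrame per key, but the shape of the DoFn-plus-side-input flow is the point here.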

Add categories in MDS plot

I) PROBLEM
Let's say I have a matrix like this with distances (in kilometers) between the homes of different people.
| | Person 1 | Person 2 | Person 3 |
|----------|----------|----------|----------|
| Person 1 | | | |
| Person 2 | 24 | | |
| Person 3 | 17 | 153 | |
And I have a data table like this:
| Person | Party |
|----------|----------|
| Person 1 | Party A |
| Person 2 | Party B |
| Person 3 | Party C |
I want to do multidimensional scaling (dissimilarity by distance) to visualize i) how close each person lives to the others; ii) which party each person votes for (a different color for each party).
II) CURRENT RESULT
My current MDS plot (made with SPSS) looks like this (I don't use syntax, but menu commands in SPSS):
[image: current MDS plot]
III) EXPECTED RESULT
I want to add a different color for each person depending on which party this person votes for.
IV) QUESTION(S)
Can I do this in SPSS? How do I add the data about votes to the matrix, and how do I show it in the MDS plot?
EDIT
There is a very similar problem and solution for R:
R) Create double-labeled MDS plot
But I want to do it in SPSS.
I don't believe it's possible to create a plot like the one you show directly from either of the MDS procedures currently available in SPSS Statistics, PROXSCAL or ALSCAL. What you'd need to do is save the common space coordinates to a new dataset or file, add the Party variable to that dataset, define it as Nominal in the measurement level designation in the Data Editor, and then use the Grouped Scatter option under Scatter/Dot in the chart Gallery in the Chart Builder, defining the groups by the Party variable.
The PROXSCAL procedure lets you save the coordinates from the dialogs, in the Output sub-dialog. The ALSCAL procedure only supports saving out the common space coordinates and other results via command syntax, specifically the OUTFILE subcommand (you can paste the command from the dialogs, then add this subcommand).
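For comparison only (the question is specifically about SPSS), the same idea sketched in Python: run MDS on the precomputed distance matrix and color the points by party, which is what the grouped scatter would do with the saved coordinates. The values are taken from the tables above.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import MDS

# Symmetric distance matrix (km) from the table above
D = np.array([
    [0, 24, 17],
    [24, 0, 153],
    [17, 153, 0],
])
labels = ["Person 1", "Person 2", "Person 3"]
parties = ["Party A", "Party B", "Party C"]

# MDS on the precomputed dissimilarities
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

# One color per party, analogous to a grouped scatter in SPSS
for (x, y), person, party in zip(coords, labels, parties):
    plt.scatter(x, y, label=f"{person} ({party})")
    plt.annotate(person, (x, y))
plt.legend()
plt.show()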

Postgres vs MongoDB on modeling recursive multi-child relations

I am considering using either of the following stacks for a personal project:
Nodes.js/MongoDB (learning)
Rails/Postgres (more familiar)
I would like to give MongoDB a try for learning purposes, but I am unsure whether it is suitable for this problem. I would like to hear about the trade-offs, with examples, based on the following problem description; some specific questions are at the bottom:
There is a list of Products, let's say p1, p2, p3, and each product has fields for some environmental impact, let's say A, B, C.
p1                      p2
|- p3                   |- p3
|  |- p7                |- p6
|  |- p8
|- p4
|  |- p2
|  |- p9
|- p5
   |- p10
   |- p11
p1.A = p3.A + p4.A + p5.A
p1.B = p3.B + p4.B + p5.B
p3.A = p7.A + p8.A
Product Table would look something like this:
| id  | A   | B   | C   | parents | children  |
|-----|-----|-----|-----|---------|-----------|
| 1   | 4   | 5   | 6   | []      | [3, 4, 5] |
| 2   | 10  | 11  | 12  | [4]     | [3, 6]    |
| 3   | 6   | 7   | 8   | [1, 2]  | [7, 8]    |
| 4   | 3   | 9   | 6   | [1]     | [2, 9]    |
| 5   | 3   | 3   | 10  | [1]     | [10, 11]  |
| 6   | 3   | 1   | 2   | [2]     | []        |
| 7   | 4   | 5   | 0   | [3]     | []        |
| ... | ... | ... | ... | ...     | ...       |
The update process would look like this:
p1 is made of p2 and p3.
p2 is also made of p3.
If p3's A, B, or C updates, it would trigger a p1 update to recalculate its A, B, C, although maybe still with the old p2 values. Then, when p3 updates p2, the p2 update will trigger the p1 update again. There could be some redundant operations in the updates depending on the ordering; I am guessing that is ok.
Since the environmental impact is not critical data, I am just looking for the data to become eventually consistent.
In terms of scale, maybe tens of thousands of products at some point.
Questions:
1) I need a way to prevent an infinite update cycle in a circular graph.
2) Can this kind of two-way association be handled easily in MongoDB, where a product has parents that are products and children that are products?
3) What are the different ways I could structure my data instead of parent and child arrays, and design this update process efficiently? If I design it such that one product update triggers another update, which triggers another update, and the chain goes on, couldn't that make for a very long web request cycle?
Thanks.
Your model is best described as a directed graph G = (V, E), where the vertices V are your products and the edges E ⊆ V × V are the parent/child relations.
Therefore neither PostgreSQL nor MongoDB is really a great fit for your use case.
The biggest advantage of MongoDB over a traditional RDBMS like PostgreSQL is its dynamic schema: you can add new records with varying structure without redefining the database schema.
But the model you have described looks pretty static to me, so that argument doesn't carry much weight for your problem.
As far as I am concerned the best technology decision in your case is to use a graph database like Neo4j.
As alternative inspiration, you could take a look at graph data structures; one efficient way to model a graph is an adjacency matrix.
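Regarding question 1), whichever store you pick, the usual guard against an infinite update cycle is to carry a set of already-visited products while propagating. A minimal in-memory sketch in Python, reusing the parents arrays from the table above (the recompute step is a hypothetical placeholder):
from collections import deque

# Adjacency lists, mirroring the parents column of the product table above.
products = {
    1: {"parents": []},
    2: {"parents": [4]},
    3: {"parents": [1, 2]},
    4: {"parents": [1]},
}

def recompute(pid):
    # Hypothetical: in the real app this would re-sum A, B, C from the children.
    print(f"recomputing p{pid}")

def propagate_update(start_id):
    # Walk up the parent links breadth-first, recomputing each ancestor once.
    # The seen set is what prevents an infinite loop if the graph ever has a cycle.
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        pid = queue.popleft()
        for parent_id in products[pid]["parents"]:
            if parent_id in seen:
                continue  # already handled in this pass -> cycle guard
            seen.add(parent_id)
            recompute(parent_id)
            queue.append(parent_id)

propagate_update(3)  # p3 changed: p1, p2 and p4 each get recomputed exactly once
The same "recompute each ancestor at most once per pass" idea applies whether the propagation runs in application code, a background job, or database triggers.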

Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string + Mahout

I am trying to create a file descriptor using the command:
$ MAHOUT_HOME/core/target/mahout-core--job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
from the link https://mahout.apache.org/users/classification/partial-implementation.html on my data file, but whichever file I take, and however I change the attributes descriptor string N 3 C 2 N C 4 N C 8 N 2 C 19 N L,
I get the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string
Please help!
There are a couple of reasons why you might get an error like that...
Wrong descriptor: Including this for the sake of completeness; you must have already checked it. You may simply have given a wrong descriptor for the data. Re-check the number and type of columns and describe them correctly in the descriptor.
Bad separator: Re-check the delimiter used in the data; that can also cause trouble. Maybe the data has a misplaced delimiter in some records. Make sure of that.
Special characters: In my few experiments, I have noticed Mahout does not cope well if there are certain special characters, or if the data contains characters from a language other than English (unless, of course, you tweak the code). So make sure you have a way of handling them, and you should be good to go.
Anyway, all this fighting is just so you can create a descriptor of the data. ATB.
Old question, but I have a more specific answer that I discovered after landing here with the same problem.
In this particular case, the problem I found was that the format of the data file (from http://nsl.cs.unb.ca/NSL-KDD/) seems to have changed since the example was written for the Mahout Random Forest example page.
The example lists a line format with the specifier
N 3 C 2 N C 4 N C 8 N 2 C 19 N L
but there's an extra element at the end of the lines; for example:
13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,0.00,0.00,0.12,0.30,guess_passwd,2
which has one more field. Adding another numeric attribute (N) to the end of the specifier, i.e.
N 3 C 2 N C 4 N C 8 N 2 C 19 N L N
resolved the error for me. I also had luck using just the plain .txt file format instead of the .arff file format.
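If it helps, a quick way to sanity-check a descriptor against the data is to expand it and compare the attribute count with the number of comma-separated fields in a line; a small sketch in Python (a hypothetical helper, not part of Mahout):
def descriptor_length(descriptor):
    # Expand a descriptor like "N 3 C 2 N ... L" into an attribute count:
    # a number token repeats the letter token that follows it.
    count = 0
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)
        else:  # 'N', 'C' or 'L'
            count += repeat
            repeat = 1
    return count

line = ("13,tcp,telnet,SF,118,2425,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
        "0.00,0.00,0.00,0.00,1.00,0.00,0.00,26,10,0.38,0.12,0.04,0.00,"
        "0.00,0.00,0.12,0.30,guess_passwd,2")

print(descriptor_length("N 3 C 2 N C 4 N C 8 N 2 C 19 N L"))    # 42 -> mismatch
print(descriptor_length("N 3 C 2 N C 4 N C 8 N 2 C 19 N L N"))  # 43 -> matches
print(len(line.split(",")))                                      # 43 fields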
