Combine nodes by categorical attribute in Gephi

I'm working on a visualization of organizational structure in Gephi. I have a graph of individuals, connected by whether or not they have worked together in the past. Graphing individuals looks good, but I would like to combine nodes (individuals) based on a categorical attribute (department; string). The new graph -- or at least a visualization -- would have a node for every department, preferably with a numerical weight proportional to how many individuals comprise it.
I could do this in the scripts that generate the graph files before importing. But I did exactly this about a year ago entirely in Gephi. Either the functionality was removed (like the pie charts!) or I've just forgotten (more likely).
I'm using Gephi 0.9.1. Any help is much appreciated.
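For reference, the pre-aggregation I would otherwise do in the generating scripts looks roughly like the sketch below (assuming networkx and a 'department' node attribute; all names here are illustrative, not my actual code):

import networkx as nx

# Minimal sketch: collapse a person-level graph G into a department-level graph H,
# with one node per department weighted by how many individuals it contains, and
# edges weighted by how many person-person edges cross between departments.
# The 'department' attribute name is an assumption.
def collapse_by_department(G, attr="department"):
    H = nx.Graph()
    for _, data in G.nodes(data=True):
        dept = data.get(attr, "unknown")
        if H.has_node(dept):
            H.nodes[dept]["weight"] += 1
        else:
            H.add_node(dept, weight=1)
    for u, v in G.edges():
        du = G.nodes[u].get(attr, "unknown")
        dv = G.nodes[v].get(attr, "unknown")
        if du == dv:
            continue  # skip within-department edges; keep them if you prefer self-loops
        if H.has_edge(du, dv):
            H[du][dv]["weight"] += 1
        else:
            H.add_edge(du, dv, weight=1)
    return H

# The result can then be exported for Gephi, e.g. nx.write_gexf(H, "departments.gexf")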

Related

Keep a projected graph in sync with the persisted graph in Neo4j GDS

I have a large dataset to run a specific Graph Data Science algorithm on.
The functional requirement is that the algorithm will be run often and that the dataset changes in real-time.
As I understand, in order to run an algorithm I have to project the persistent graph into memory first.
But GDS only provides a projection of the whole dataset once (as a (filtered) snapshot). Therefore, on each change to my dataset (e.g. a new relationship added between two nodes), I have to rerun the projection, which seems quite inefficient.
Is there a generic way to circumvent this and keep the projection properly in sync with the persistent graph?
As per Tomaž Bratanič's comment, it isn't possible at the moment.
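Until that changes, the workaround is simply to drop and re-create the projection after the data changes. A minimal sketch using the official Neo4j Python driver, assuming GDS 2.x procedure syntax; the connection details, graph name, node label, and relationship type are illustrative:

from neo4j import GraphDatabase

# Sketch of the drop-and-reproject workaround. Connection details are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def reproject(graph_name="myGraph"):
    with driver.session() as session:
        # failIfMissing = false, so the first run doesn't error when no projection exists yet
        session.run("CALL gds.graph.drop($name, false)", name=graph_name)
        # Label 'Person' and relationship type 'WORKS_WITH' are assumptions.
        session.run(
            "CALL gds.graph.project($name, 'Person', 'WORKS_WITH')",
            name=graph_name,
        )

# Call reproject() whenever the persisted graph has changed enough to matter,
# then run the algorithm against the fresh projection.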

Repeated measures for 3+ groups comparing percentages

I'm new to SPSS. I have data on skin cancer diagnoses for the years 2004-2018. I want to compare how the distribution of new cases across body parts changes between years. I've managed to create a crosstab and a grouped bar graph that show the percentages, but I would like to run a statistical analysis to see whether the changes in distribution over time are significant. The groups I have are face, trunk, arm, leg, or not specified. The number of cases varies greatly from year to year, which is why I'm looking to compare the ratios (percentages) between the different body sites. The only explanations I've found all refer to repeated observations of the same subject, which is not the case here (a person is included only with their first diagnosis, so they can appear in only one of the years).
The analysis would be similar to comparing the percentage shares of 3+ parties in an election and how that distribution changes over the years, but I haven't found any tutorials for that. Please help!
The CTABLES or Custom Tables procedure, if you have access to it, will let you create a crosstabulation like the one you mention, and then test both for overall changes in the distribution of types and for differences between each pair of columns within each row.
More generally, problems like this would usually be handled as loglinear or logit models.
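Outside SPSS, the overall test of whether the body-site distribution differs across years amounts to a chi-square test of independence on the year-by-site contingency table. A minimal sketch in Python with scipy; the counts below are invented purely for illustration:

import numpy as np
from scipy.stats import chi2_contingency

# Rows = years, columns = body sites (face, trunk, arm, leg, not specified).
# These counts are made up for illustration only.
table = np.array([
    [120,  80, 40, 30, 10],   # 2004
    [150,  95, 55, 35, 12],   # 2005
    [210, 130, 70, 50, 15],   # 2018
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
# A small p-value suggests the distribution over body sites differs between years;
# pairwise column comparisons (as CTABLES offers) would then need follow-up tests.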

How to realize a multi-client capability in Neo4j?

Initial situation
I have several independent and disconnected graphs, each of which has a hierarchy-like structure with a local root element. Each of these graphs consists of approximately 8 million nodes and 40 million relationships. I have successfully created a three-digit number of Cypher queries, each of which should now be applied to a single graph only, not to the entirety of all graphs. The graph a query applies to is specified by its root node.
Challenge to be solved
How can I realize a kind of pseudo multi-client capability for a graph, if all graphs have to remain in a common Neo4j database for reasons of reporting and pattern matching?
Approach to the problem / preliminary result
Add a shortest-path match to the given root element at the beginning of literally every query, purely for selection purposes? Cons:
huge expected performance losses
high development costs
Expand each graph with a separate, additional label? Cons:
complex queries, high development effort
For these cases, adding a specific label per tenant/client to all nodes in the subgraph tends to be the approach taken. It requires you to ensure that, when you match the relevant nodes in a query, the nodes you're working with also carry the client's label.
As a note for the future, native multi-tenancy support is one of the key features we're working on for the next year.
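To illustrate the label-per-tenant pattern described above, here is a minimal sketch using the Neo4j Python driver; the connection details, labels, properties, and tenant name are all illustrative assumptions:

from neo4j import GraphDatabase

# Sketch of the label-per-tenant pattern: add the tenant label to every node in a
# client's subgraph, then filter on it in every query. Names below are illustrative.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

TENANT_LABEL = "ClientAcme"  # hypothetical tenant label

with driver.session() as session:
    # Tag an existing subgraph with the tenant label. Labels can't be parameterized,
    # so the label is interpolated; validate it against a whitelist in real code.
    # The :Root node, id property, and :BELONGS_TO relationship are assumptions.
    session.run(
        f"MATCH (root:Root {{id: $rootId}})<-[:BELONGS_TO*0..]-(n) "
        f"SET n:{TENANT_LABEL}",
        rootId="acme-root",
    )

    # Every subsequent query additionally matches on the tenant label.
    result = session.run(
        f"MATCH (n:Person:{TENANT_LABEL}) RETURN count(n) AS people"
    )
    print(result.single()["people"])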

Neo4j partition

Is there a way to physically separate Neo4j partitions?
Meaning, the following query would go to node 1:
Match (a:User:Facebook)
While this query would go to another node (maybe hosted in Docker):
Match (b:User:Google)
This is the use case:
I want to store data for several clients in Neo4j, hopefully lots of them.
Now, I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer will forget the ":Partition1" label, for example, in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another client has a small amount, or if a "heavy" query for one client is currently running, I don't want the "light" queries of another client to suffer from slow performance.
In other words, I think that storing everything on a single node will run into scalability problems at some point in the future, when I have more clients.
By the way, is it common to run a few clusters?
Also, what's the advantage of partitioning over creating a different label for each client, for example Users_client_1, Users_client_2, etc.?
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines and then serve many requests against that copy quickly, but they don't partition a really huge graph so that some of it is stored here and other parts there, connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at Wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations and then need to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is: how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote taken from the Wikipedia article mentioned above):
Typically, graph partition problems fall under the category of NP-hard problems. Solutions to these problems are generally derived using heuristics and approximation algorithms.[3] However, uniform graph partitioning or a balanced graph partition problem can be shown to be NP-complete to approximate within any finite factor.[1] Even for special graph classes such as trees and grids, no reasonable approximation algorithms exist,[4] unless P=NP. Grids are a particularly interesting case since they model the graphs resulting from Finite Element Model (FEM) simulations. When not only the number of edges between the components is approximated, but also the sizes of the components, it can be shown that no reasonable fully polynomial algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of the relevant Facebook research. But to summarize and cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something the software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some sense; if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well... cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly because all of those partitions have to be traversed, which is exactly the performance problem you're usually trying to avoid by partitioning wisely in the first place.

What is the advantage of RDF and triple stores over Neo4j?

Neo4j is a really fast and scalable graph database; it seems it can be used for business projects, and it is free, too!
At the same time, there are no RDF triple stores that work well with large data or deliver high-speed access. And what is more, free RDF triple stores perform even worse.
So what is the advantage of RDF and RDF triple stores over Neo4j?
The advantage of using a triple store for RDF rather than Neo4j is that that's what triple stores are designed for. Neo4j is pretty good for many use cases, but in my experience its performance for loading and querying RDF is well below that of all dedicated RDF databases.
It's a fallacy that RDF databases don't scale or aren't fast. Sure, they're not yet up to the performance and scale of relational databases, but those have a 50-year head start. Many triple stores scale into the billions of triples, provide 'standard' enterprise features, and deliver great performance for many use cases.
If you're going to use RDF for a project, use a triple store; it's going to provide the best performance and set of features/APIs for working with RDF to build your application.
RDF and SPARQL are standards, so you have a choice of multiple implementations, and can migrate your data from one RDF store to another.
Additionally, version 1.1 of the SPARQL query language is quite sophisticated (more expressive than most SQL implementations) and can do all kinds of queries that would require a lot of code to be written in Neo4j.
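To give a feel for that, here is a minimal sketch running a SPARQL 1.1 property-path query with rdflib in Python; the file name and the FOAF-based data model are illustrative assumptions:

from rdflib import Graph

# Minimal sketch: load some RDF and run a SPARQL 1.1 query with rdflib.
# "people.ttl" and the FOAF vocabulary usage are illustrative.
g = Graph()
g.parse("people.ttl", format="turtle")

# The property path foaf:knows+ expresses an arbitrary-depth traversal
# declaratively, inside the query itself.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?friendName WHERE {
    ?person foaf:name ?name ;
            foaf:knows+ ?friend .
    ?friend foaf:name ?friendName .
}
"""
for row in g.query(query):
    print(row["name"], "->", row["friendName"])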
If you are going to do graph mining (e.g. graph traversal) over triples, Neo4j is a good choice. For loading large numbers of triples, you might want to use its BatchInserter, which is fairly fast.
So I think it's all about your use case. Both technologies can and do overlap.
In my mind, it's mostly about the use case. Do you want a full knowledge graph, including all the ecosystems of the Semantic Web? Then go for the triple store.
If you need a general-purpose graph (e.g. to store big data as a graph), use the property graph model. My reasoning is that the underlying philosophies are very different, and this starts with how the data is stored, which has implications for your usage scenario.
Let's do some off-the-top-of-my-head bullet points to compare. Please take them with a grain of salt, as this is not a benchmark paper, just a five-minute write-up based on experience.
Property graph (Neo4j):
Think of nodes/edges as documents
Implemented on top of e.g. linked lists and key-value stores (deep searches and large data, e.g. via Gremlin)
Support for OWL/RDF, but not natively (as I see it, this sits on a meta layer)
Really great when it comes to having the data in the graph and doing ML (it stores the data as linked lists, which gives you nice vectors out of the box, which is cool for ML)
Made for large data at scale.
Use cases (focus is on the data entities, not their classes):
Social graphs and other scenarios where you need deep traversal
Large data graphs, where you have a lot of documents that need to be searched in a schema-free, graph-like manner.
Analyzing customer funnels from click data, etc. You want to move out of your relational schema because you are actually in a graph use case...
Triple store (e.g. RDF4J):
Think of data in maximum normal form, as triples (no redundant data at all)
Triples are stored with their context (as quads); the store works heavily with indexes.
Good for broad searches and specific knowledge extraction; deep searches are sometimes cumbersome.
Scale is impressive: triple stores can scale to trillions of triples with fast performance. But I would not recommend storing big data (e.g. time series) in the graph. The reason is the special way indexes are used, and in order to scale horizontally you may have to consider working with subgraphs...
Support for the whole ecosystem: SPARQL, SHACL, SWRL, etc. This is a big plus if you need them.
Use cases:
It's really about knowledge graphs. Do you need shape testing, rule evaluation, inference, and reasoning? Go for it, because you have to focus on the ontology and class structure!
Also, for example, you have IoT and want to configure relations for logistics and a smart factory, while the telemetry is stored somewhere else and only referenced in the graph.
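If shape testing is part of your use case, here is a minimal sketch of SHACL validation with pySHACL; the file names and the use of RDFS inference are illustrative assumptions:

from pyshacl import validate
from rdflib import Graph

# Minimal sketch of SHACL shape testing with pySHACL.
# "data.ttl" and "shapes.ttl" are illustrative file names.
data_graph = Graph().parse("data.ttl", format="turtle")
shapes_graph = Graph().parse("shapes.ttl", format="turtle")

conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    inference="rdfs",   # optionally run RDFS inference before validation
)
print("Conforms:", conforms)
if not conforms:
    print(results_text)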
I have heard rumors that it takes a whole day to load 10M triples into Neo4j (it is actually the slowest of them, because it is not built primarily for RDF).
Sesame and 4store are the fastest ones, but Jena has a powerful API.
