CPU-GPU hardware acceleration for large node graph visualizations - cytoscape

I want to visualize a large node network (well over a million nodes and edges) and am looking for the right tool to do this. Cytoscape has much of the functionality we're looking for, but it isn't clear whether it will support a network of this scale, especially for interactive use.
The most likely way to make this feasible is to parallelize the visualization across multiple CPUs or GPUs. Can this be done in Cytoscape?

It's not clear whether you have over a million nodes plus their edges, or a total count of over a million (nodes + edges). If the latter, then as long as you have enough memory on your machine (and a little patience), Cytoscape can handle that fine. I've rendered graphs well over that size on my 32 GB Mac laptop. The initial read and render of the graph might take a bit, but once it's rendered, manipulating the graph is quite fast. For really large graphs (>5,000,000 edges) I generally use a 64 GB desktop machine. Oh, and for layouts, I recommend the Prefuse force-directed layout -- it scales the best...
-- scooter


Is there any way to calculate DRAM access latency (cycles) from data size?

I need to calculate DRAM access latency from a given data size to be transferred between DRAM and SRAM.
The data is separated into a "load size" and a "store size", and the number of load and store iterations is given.
I think there are many factors I need to consider, such as first DRAM access latency, per-word transfer latency, address load latency, etc.
Is there a commonly used equation for computing this from the given information?
Thank you in advance.
Your question has many parts; I think I could help better if I knew the ultimate goal. If it's simply to measure access latency:
If you are using an x86 processor, maybe the Intel Memory Latency Checker will help:
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.
If not x86, I think the gem5 simulator has what you are looking for; here is the main page, but more specifically, for your needs, I think this config for gem5 will be the most helpful.
Now, regarding a popular equation, the best I could find is this Carnegie Mellon paper, which goes over my head: https://users.ece.cmu.edu/~omutlu/pub/chargecache_low-latency-dram_hpca16.pdf However, it looks like your main "features", as you put it, revolve around cores and memory channels. The equation from the paper:
Storage_bits = C * MC * Entries * (EntrySize_bits + LRU_bits)
is used to size a cache that ultimately reduces access latency in DRAM (the goal of ChargeCache). I'm sure this isn't the equation you are looking for, just a piece of the puzzle. LRU_bits relates to the cache this mechanism creates (in the memory controller; no DRAM modification is necessary).
EntrySize_bits is determined by the equation EntrySize_bits = log2(R) + log2(B) + log2(Ro) + 1, where
R, B, and Ro are the number of ranks, banks, and rows in DRAM, respectively.
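To make the two formulas above concrete, here is a minimal Python sketch that just evaluates them. I'm assuming C is the number of cores and MC the number of memory channels, as the surrounding text suggests, and every number in the example configuration is made up for illustration rather than taken from the paper.

```python
def log2_ceil(n: int) -> int:
    """Bits needed to index n items (exact log2 when n is a power of two)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def entry_size_bits(ranks: int, banks: int, rows: int) -> int:
    # EntrySize_bits = log2(R) + log2(B) + log2(Ro) + 1
    return log2_ceil(ranks) + log2_ceil(banks) + log2_ceil(rows) + 1


def storage_bits(cores: int, channels: int, entries: int,
                 entry_bits: int, lru_bits: int) -> int:
    # Storage_bits = C * MC * Entries * (EntrySize_bits + LRU_bits)
    return cores * channels * entries * (entry_bits + lru_bits)


# Hypothetical configuration, chosen only to show the arithmetic.
example_entry_bits = entry_size_bits(ranks=2, banks=8, rows=65536)
example_storage = storage_bits(cores=8, channels=2, entries=128,
                               entry_bits=example_entry_bits, lru_bits=1)
print(example_entry_bits, example_storage)
```

Note that this gives the size of the added cache, not a latency in cycles, so it is only one input to the kind of equation the question asks about.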
I was surprised to learn highly charged rows (recently accessed) will have a significantly lower access latency.
If this goes over your head as well, maybe the 2007 paper by Ulrich Drepper titled What Every Programmer Should Know About Memory will help you find the elements you need for your equation. I'm still working through this paper myself, and there are some dated references, but how much they matter depends on which CPU you're working with. Hope this helps. I look forward to being corrected on any of this, as I'm new to the topic.
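For what it's worth, the quantities listed in the question can be combined into a very rough first-order estimate. The sketch below is my own simplification and not a standard or published equation: it charges each load and each store a fixed address cost plus a first-access cost, then a per-word transfer cost, and multiplies by the iteration count. It ignores row-buffer hits, refresh, bank parallelism and any overlap between transfers, and all parameter names and values are hypothetical.

```python
def transfer_cycles(size_bytes: int, word_bytes: int, addr_cycles: int,
                    first_access_cycles: int, cycles_per_word: int) -> int:
    """Cycles for one contiguous transfer under the simplified model."""
    if size_bytes == 0:
        return 0
    words = -(-size_bytes // word_bytes)  # ceiling division
    return addr_cycles + first_access_cycles + words * cycles_per_word


def total_cycles(load_bytes: int, store_bytes: int, iterations: int,
                 **timing: int) -> int:
    """Total cycles for `iterations` rounds of one load plus one store."""
    per_iteration = (transfer_cycles(load_bytes, **timing)
                     + transfer_cycles(store_bytes, **timing))
    return iterations * per_iteration


# Hypothetical timing parameters; real values come from the DRAM datasheet
# or from measurement (for example with a latency checker or a simulator).
timing = dict(word_bytes=8, addr_cycles=2,
              first_access_cycles=40, cycles_per_word=1)
print(total_cycles(load_bytes=256, store_bytes=64, iterations=10, **timing))
```

Treat the result as an upper-level sanity check only; a simulator or a measurement tool will capture effects this model leaves out.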

Neo4j partition

Is there a way to physically separate Neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted on Docker):
Match (b:User:Google)
This is the case:
I want to store the data of several clients in Neo4j, hopefully lots of them.
I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer forgets the ":Partition1" label, for example, in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another has a small amount, or if a "heavy" query from one client is currently running, I don't want the "light" queries of another client to suffer from slow performance.
In other words, I think storing everything on one node will become a scalability problem at some point in the future, when I have more clients.
By the way, is it common to have a few clusters?
Also, what's the advantage of partitioning over creating a different label for each client? For example: Users_client_1, Users_client_2, etc.
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters, where you can keep a copy of your entire graph on many machines and then serve many requests against those copies quickly, but they don't partition a really huge graph so that some of it is stored here, some other parts there, and everything is connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at Wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations and then need to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is: how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote stolen):
Typically, graph partition problems fall under the category of NP-hard
problems. Solutions to these problems are generally derived using
heuristics and approximation algorithms.[3] However, uniform graph
partitioning or a balanced graph partition problem can be shown to be
NP-complete to approximate within any finite factor.[1] Even for
special graph classes such as trees and grids, no reasonable
approximation algorithms exist,[4] unless P=NP. Grids are a
particularly interesting case since they model the graphs resulting
from Finite Element Model (FEM) simulations. When not only the number
of edges between the components is approximated, but also the sizes of
the components, it can be shown that no reasonable fully polynomial
algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of relevant Facebook research. But to cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some sense; if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically it is partitioning your graph for you, automagically. Cool, right? Well... cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly because all of those partitions have to be traversed. That is exactly the performance situation you're usually trying to avoid by partitioning wisely in the first place.
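For the multi-client case in the question, one possible "custom partitioning strategy" is to do the split in the application: give each client its own Neo4j instance and pick the instance before any Cypher is run, so no query can mix clients and a heavy client only loads its own server. The sketch below shows just that routing step. The class name, client names and bolt URIs are all hypothetical, and it deliberately does not call any Neo4j driver API; you would hand the returned URI to whatever driver you use.

```python
class TenantRouter:
    """Maps a client identifier to the Neo4j instance that holds its data."""

    def __init__(self, instances):
        # Copy so later changes to the caller's dict do not affect routing.
        self._instances = dict(instances)

    def uri_for(self, client_id: str) -> str:
        try:
            return self._instances[client_id]
        except KeyError:
            # Fail loudly instead of falling back to a shared instance,
            # so data from different clients can never be mixed by accident.
            raise LookupError(
                f"no Neo4j instance registered for client {client_id!r}"
            ) from None


# Hypothetical deployment: one instance (for example one container) per client.
router = TenantRouter({
    "facebook": "bolt://neo4j-facebook:7687",
    "google": "bolt://neo4j-google:7687",
})
print(router.uri_for("google"))
```

This only works because the clients' graphs never need to be traversed together; as soon as relationships cross clients you are back to the hard partitioning problem described above.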

Neo4j or GraphX / Giraph what to choose?

I've just started exploring graph processing methods and tools. What we basically do is compute some standard metrics like PageRank, clustering coefficient, triangle count, diameter, connectivity, etc. In the past we were happy with Octave, but when we started to work with graphs of, say, 10^9 nodes/edges, we got stuck.
So the possible solutions are a distributed cloud setup built with Hadoop/Giraph, Spark/GraphX, Neo4j on top of them, etc.
But since I am a beginner, can someone advise what to choose? I don't understand when to use Spark/GraphX and when to use Neo4j. Right now I'm considering Spark/GraphX, since it has a more Python-like syntax, while Neo4j has its own language, Cypher. Visualization in Neo4j is cool but not useful at such a large scale. I don't understand whether there is a reason to use an additional layer of software (Neo4j) or to just use Spark/GraphX. As I understand it, Neo4j will not save as much time as Giraph, GraphX, or Hive do compared with working with pure Hadoop.
Thank you.
Neo4j: It is a graph database that helps identify relationships and entities in data, usually read from disk. Its popularity and the reasons for choosing it are given in this link. But when it needs to process very large data sets in real time to produce graph results or representations, it needs to scale horizontally. In this case, combining Neo4j with Apache Spark gives significant performance benefits, with Spark serving as an external graph compute solution.
Mazerunner is a distributed graph processing platform that extends Neo4j. It uses a message broker to distribute graph processing jobs to the Apache Spark GraphX module.
GraphX: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. It supports multiple Graph algorithms.
It is always recommended to use the hybrid combination of Neo4j with GraphX, as they are both easy to integrate.
For real-time processing and for processing large data sets, use Neo4j with GraphX.
For simple persistence and for showing entity relationships in a simple graph display, use standalone Neo4j.
Neo4j: I have not used it, but I think it does all of a graph computation (like PageRank) on a single machine. Would that be able to handle your data set? It may depend on whether your entire graph fits into memory and, if not, how efficiently it processes data from disk. It may hit the same problems you encountered with Octave.
Spark GraphX: GraphX partitions graph data (vertices and edges) across a cluster of machines. This gives you horizontal scalability and parallelism in computation. Some things you may want to consider: it only has a Scala API right now (no Python yet). It does PageRank, triangle count, and connected components, but you may have to implement clustering coefficient and diameter yourself, using the provided graph API (Pregel, for example). The programming guide has a list of supported algorithms: https://spark.apache.org/docs/latest/graphx-programming-guide.html
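To show what "implement clustering coefficient yourself" amounts to, here is the definition as a small single-machine Python sketch. It is not GraphX code and will not scale to 10^9 edges; it only illustrates that the local clustering coefficient of a vertex with degree k is 2 * (links among its neighbours) / (k * (k - 1)), so on a cluster the same ratio could presumably be derived from per-vertex triangle counts and degrees. The tiny example graph is made up.

```python
from itertools import combinations


def local_clustering(adj):
    """Local clustering coefficient per vertex.

    adj maps each vertex to the set of its neighbours
    (undirected graph, no self-loops).
    """
    coeffs = {}
    for vertex, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            # Fewer than two neighbours: no pair can form a triangle.
            coeffs[vertex] = 0.0
            continue
        # Count neighbour pairs that are themselves connected;
        # each such pair closes one triangle through `vertex`.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        coeffs[vertex] = 2.0 * links / (k * (k - 1))
    return coeffs


# Made-up graph: a triangle a-b-c with an extra edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_clustering(graph))
```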
GraphX is more of a real-time processing framework for data that can be (and is better when) represented in graph form. With GraphX you can use various algorithms that require large amounts of processing power (both RAM and CPU), and with Neo4j you can (reliably) persist and update that data. This is what I'd suggest.
I know for sure that #kennybastani has made some pretty interesting advancements in that area; you can take a look at his Mazerunner solution. It's also shipped as a Docker image, so you can poke at it with a stick and find out for yourself whether you like it or not.
This image deploys a container with Apache Spark and uses GraphX to
perform ETL graph analysis on subgraphs exported from Neo4j. The
results of the analysis are applied back to the data in the Neo4j

Why do queries with shorter inverted lists perform better on CPUs than on GPUs?

Moreover, why do queries with longer inverted lists perform better on GPUs?
I read this result in a paper called Using Graphics Processors for High Performance IR querying.
Queries with longer lists work better on GPUs, because GPUs are highly parallel, and search is a mostly parallel problem.
However, GPUs (and other massively parallel computers) don't process things the same way few-core CPUs do. Like with any other problem, there is non-negligible work to be done to set up the problem for the GPU. For small problem sizes, this overhead swamps out any speedup provided by the GPUs.
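A toy cost model makes the crossover visible. This is only a sketch under assumptions of my own: a fixed setup overhead for the GPU (kernel launch, data transfer), a constant processing rate for each device, and invented numbers that do not come from the paper.

```python
def cpu_time_us(n: int, cpu_rate: float) -> float:
    """Time in microseconds for the CPU to process n list entries."""
    return n / cpu_rate


def gpu_time_us(n: int, gpu_rate: float, overhead_us: float) -> float:
    """Time for the GPU: fixed setup overhead plus the (faster) processing."""
    return overhead_us + n / gpu_rate


def break_even_length(cpu_rate: float, gpu_rate: float,
                      overhead_us: float) -> float:
    """List length at which both devices take the same time."""
    if gpu_rate <= cpu_rate:
        raise ValueError("the GPU never catches up unless it is faster")
    # Solve n / cpu_rate = overhead + n / gpu_rate for n.
    return overhead_us * cpu_rate * gpu_rate / (gpu_rate - cpu_rate)


# Invented rates in entries per microsecond, and an invented overhead.
CPU_RATE, GPU_RATE, OVERHEAD_US = 100.0, 1000.0, 100.0

for length in (1_000, 1_000_000):
    cpu = cpu_time_us(length, CPU_RATE)
    gpu = gpu_time_us(length, GPU_RATE, OVERHEAD_US)
    print(length, "CPU wins" if cpu < gpu else "GPU wins")
```

Below the break-even length the fixed overhead dominates and the CPU wins; above it the GPU's higher throughput takes over.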

Reasons for NOT scaling-up vs. -out?

As a programmer I make revolutionary findings every few years. I'm either ahead of the curve, or behind it by about π in the phase. One hard lesson I learned was that scaling OUT is not always better, quite often the biggest performance gains are when we regrouped and scaled up.
What reasons do you have for scaling out vs. up? Price, performance, vision, projected usage? If so, how did this work for you?
We once scaled out to several hundred nodes that would serialize and cache necessary data out to each node and run maths processes on the records. Many, many billions of records needed to be (cross-)analyzed. It was the perfect business and technical case to employ scale-out. We kept optimizing until we processed about 24 hours of data in 26 hours wallclock. Really long story short, we leased a gigantic (for the time) IBM pSeries, put Oracle Enterprise on it, indexed our data and ended up processing the same 24 hours of data in about 6 hours. Revolution for me.
So many enterprise systems are OLTP and the data are not sharded, but the desire of many is to cluster or scale out. Is this a reaction to new techniques or to perceived performance?
Do applications in general today, or our programming mantras, lend themselves better to scale-out? Do we/should we always take this trend into account in the future?
Because scaling up:
Is limited ultimately by the size of box you can actually buy.
Can become extremely cost-ineffective, e.g. a machine with 128 cores and 128 GB of RAM is vastly more expensive than 16 machines with 8 cores and 8 GB of RAM each.
Some things don't scale up well, such as IO read operations.
By scaling out, if your architecture is right, you can also achieve high availability. A 128-core, 128 GB RAM machine is very expensive, but to have a second redundant one is extortionate.
And also to some extent, because that's what Google do.
Scaling out is best for embarrassingly parallel problems. It takes some work, but a number of web services fit that category (thus the current popularity). Otherwise you run into Amdahl's law, which means that to gain speed you have to scale up, not out. I suspect you ran into that problem. IO-bound operations also tend to do well with scaling out, largely because waiting for IO increases the percentage of the work that is parallelizable.
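Amdahl's law is short enough to write down. The sketch below uses the usual form, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers; the example fractions are arbitrary.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Arbitrary example: even with 95% parallel work, adding nodes flattens out,
# because the speedup can never exceed 1 / (1 - 0.95) = 20.
for nodes in (10, 100, 1000):
    print(nodes, round(amdahl_speedup(0.95, nodes), 2))
```

Once the curve flattens, the only way to go faster is to make the serial part itself quicker, which is what a bigger box does.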
The blog post Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood has some interesting points to consider, such as software licensing and power costs.
Not surprisingly, it all depends on your problem. If you can easily partition it into subproblems that don't communicate much, scaling out gives trivial speedups. For instance, searching for a word in 1B web pages can be done by one machine searching 1B pages, or by 1M machines doing 1000 pages each without a significant loss in efficiency (so with a 1,000,000x speedup). This is called "embarrassingly parallel".
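The word-search example can be sketched in a few lines. Here threads in one process stand in for separate machines, which is purely an assumption for illustration (it shows the partitioning, not a real speedup); the function names and the tiny page list are made up. The point is that each chunk is searched without talking to any other chunk, and the only communication is summing the per-chunk counts at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(pages, word):
    """Number of pages that contain `word` as a whitespace-separated token."""
    return sum(1 for page in pages if word in page.split())


def partition(items, parts):
    """Split items into `parts` independent chunks (round-robin)."""
    return [items[i::parts] for i in range(parts)]


def parallel_count(pages, word, workers=4):
    chunks = partition(pages, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker searches its own chunk; no worker needs another's data.
        partial_counts = pool.map(lambda chunk: count_matches(chunk, word),
                                  chunks)
        return sum(partial_counts)


# Made-up stand-in for the web pages.
pages = ["a graph db", "gpu search", "graph graph", "nothing here"]
print(parallel_count(pages, "graph") == count_matches(pages, "graph"))
```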
Other algorithms, however, do require much more intensive communication between the subparts. Your example requiring cross-analysis is the perfect example of where communication can often drown out the performance gains of adding more boxes. In these cases, you'll want to keep communication inside a (bigger) box, going over high-speed interconnects, rather than something as 'common' as (10-)Gig-E.
Of course, this is a fairly theoretical point of view. Other factors, such as I/O, reliability, and ease of programming (one big shared-memory machine usually gives far fewer headaches than a cluster), can also have a big influence.
Finally, due to the (often extreme) cost benefits of scaling out using cheap commodity hardware, the cluster/grid approach has recently attracted much more (algorithmic) research. As a result, new ways of parallelization have been developed that minimize communication and thus do much better on a cluster -- whereas common knowledge used to dictate that these types of algorithms could only run effectively on big iron machines...
