JanusGraph capabilities and future - Neo4j

The project I am working on currently uses Neo4j Community. We process 1-5M vertices with 5-20M edges today, but we aim to handle a volume of 10-20M vertices with 50-100M edges.
We are discussing the idea of switching to an open-source graph database project that would enable us to scale to these proportions. Currently our minds are set on JanusGraph with Cassandra.
We have some questions regarding the capabilities and development of JanusGraph; we would be glad if someone could answer! (Maybe Misha Brukman or Aaron Ploetz?)
On JanusGraph capabilities:
We did some experiments using the JanusGraph ready-to-use Docker image, with queries issued through a Java program. The Java program and Docker image run on the same machine. At the magnitude of 10k-20k vertices with 50k-100k edges inserted, a query retrieving all the vertices possessing a given property takes 8 to 10 seconds (mean time over 10 identical queries, measured as the time elapsed before and after the command in the Java program). The command itself is really simple:
g.V().has("secText", "some text").inE().outV();
Moreover, the Docker image seems to break down when I try to insert more records (approaching 100k vertices).
We wonder whether this is due to the limited nature of the Docker image, a problem on our side, or simply normal behavior. In any case, it seems really, really slow.
We also set up a two-node Cassandra cluster (on two different VMs) with JanusGraph on top; again, the results were quite slow.
From what I read on the Internet, people seem to run JanusGraph deployments with millions of vertices in production, so I guess they can execute simple queries in a matter of milliseconds. What is the secret there? Do you need something like 128GB of RAM for the whole thing to perform correctly? Or maybe there is a guide of good practices to follow that I am unaware of? I tried my best using the official JanusGraph documentation and user comments on forums like this one, but that isn't much, I'm afraid :/
On JanusGraph's future:
JanusGraph seemed to evolve quite quickly over its first years (2016-2018), but in the past few months I haven't seen much activity from the JanusGraph community, except for the release of version 0.5 a few months ago. For example, there has been no meeting since last year.
So I'm wondering: is JanusGraph on the right track to last and be maintained for many years to come? Did things slow down a bit because of COVID, or is there something else going on?
Is backward compatibility a consideration in JanusGraph? From what I can read in the docs, many things have changed from versions 0.2/0.3 to 0.4 and 0.5. More changes are coming; for example, Cassandra Thrift and embedded Cassandra are being deprecated. So, in a production environment where we can't always afford to update versions every year, let alone make the code modifications required when a component is deprecated, do the JanusGraph developers plan to provide some backward compatibility soon, or should we wait for the 1.0 release for that?
Thank you for reading all this; I look forward to any answers you can give me :) Have a nice day!
Mael

JanusGraph with Cassandra has design limitations at the storage layer that make performance slow. In practice, it's a large, scalable, but slow graph database that offers the replication and redundancy benefits of Cassandra.
Cassandra shards data and is very good at distributing it randomly across the cluster; however, this destroys the data locality needed to make traversals fast and efficient. JanusGraph also supports several back-end storage options in addition to Cassandra, which means it's not tightly tuned to any particular storage architecture.
Memory can make a difference, so verify how much memory you have allocated to the JVM on each node, use G1GC, and disable swap. VisualVM is helpful for profiling your memory headroom.
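One more thing worth checking, which the points above don't cover: in JanusGraph, a has() filter on a property that is not indexed falls back to a full graph scan, which alone can explain multi-second lookups. Below is a minimal sketch of defining a composite index with JanusGraph's management API; the config file path and index name are assumptions for illustration, and on a graph that already contains data the index must additionally be reindexed and enabled before queries use it.

    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;
    import org.janusgraph.core.PropertyKey;
    import org.janusgraph.core.schema.JanusGraphManagement;

    public class IndexSetup {
        public static void main(String[] args) throws Exception {
            // Connection properties file is a placeholder for this sketch.
            JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties");

            JanusGraphManagement mgmt = graph.openManagement();
            // Reuse the property key if it already exists, otherwise define it.
            PropertyKey secText = mgmt.containsPropertyKey("secText")
                    ? mgmt.getPropertyKey("secText")
                    : mgmt.makePropertyKey("secText").dataType(String.class).make();
            // A composite index serves exact-match lookups such as
            // has("secText", "some text") without scanning every vertex.
            mgmt.buildIndex("bySecText", Vertex.class).addKey(secText).buildCompositeIndex();
            mgmt.commit();

            graph.close();
        }
    }

Ideally the index is created before any data is loaded, which avoids the reindexing step entirely.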

Hello, I know this might be late, but please tell me: are you accessing all the vertices for analysis or for transactional queries, i.e., OLAP or OLTP? How many vertices and edges you query, and how you do it, has a major effect. For example, do you tell JanusGraph to return a vertex that has millions of edges with all those edges in one shot, or only a few of them? This is what is referred to as a hot vertex (a vertex with so many edges that they can't be stored on one server instance).
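For illustration, a minimal sketch in the same style as the query quoted above (vertexId is a placeholder):

    // All incident edges of a (possibly hot) vertex in one shot -- expensive:
    g.V(vertexId).inE().outV();

    // Only the first 10 incident edges -- typical for OLTP-style access:
    g.V(vertexId).inE().limit(10).outV();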

Related

Azure CI container per customer

I have a monolithic application based on .NET; the application itself is a web-based app.
I am looking at multiple articles and trying to figure out whether Azure CI (Container Instances) or something similar would be the right service to use.
The application will run 24/7, and I guess this is where the confusion comes in: wouldn't it be normal to have an always-on application running on CI?
What I am trying to achieve is a container per customer, where each customer gets one or more instances that they own. The other question would be costs and scalability: I would expect to have thousands of containers, so perhaps I should be looking at Kubernetes?
Thanks.
Here is my understanding. I'm pretty new to both ACI and Kubernetes, so treat this as a suggestion and not a definitive answer 🙂.
Azure Container Instances is a quick, easy, and cheap way to run a single instance of a container in Azure. However, it doesn't scale very well on its own (it can scale up, but not out, and not automatically), and it lacks the many container-orchestration features that Kubernetes offers.
Kubernetes offers a lot more, such as zero-downtime deployments, scaling out with multiple replicas, and many other features. It is also a lot more complex, costs more, and takes much longer to set up.
I think ACI is a bit too simple to meet your use-case.

Neo4j and Hugepages

Since Neo4j works primarily in memory, I was wondering if it would be advantageous to enable hugepages (https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt) in my Linux kernel, and then use -XX:+UseLargePages or maybe -XX:+UseHugeTLBFS in the (OpenJDK 8) JVM.
If so, what rule of thumb should I use to decide how many hugepages to configure?
The Neo4j Performance Guide (http://neo4j.com/docs/stable/performance-guide.html) does not mention this, and Google didn't turn up anyone else discussing it (in the first couple of search pages, anyway), so I thought I'd ask.
I'm struggling to get acceptable performance from my new Neo4j instance (2.3.2-community), and every little bit helps. I want to know if this is worth trying before I bring down the database to change JVM flags. I'm hoping someone else has already done some experiments along these lines.
Since Neo4j does its own file paging and doesn't rely on the OS for it, huge pages should be advantageous, or at least not hurt. Huge pages reduce the probability of TLB cache misses when you use a large amount of memory, which Neo4j often wants to do when there is a lot of data stored in it.
However, Neo4j does not directly use hugepages, even though it could, and it would be a nice addition. This means you have to rely on transparent huge pages and whatever features the JVM provides. Transparent huge pages can cause more or less short stalls when smaller pages are merged.
If you have a representative staging environment then I advise you to make the changes there first, and measure their effect.
Transparent huge pages are mostly a problem for programs that use mmap, because I think it can change the size of the unit of IO, which makes the hard-pagefault latency much higher. I'm not entirely sure about this, though, so please correct me if I'm wrong.
The JVM actually does use mmap for telemetry and tooling, through a file in /tmp, so make sure this directory is mounted on tmpfs to avoid gnarly IO stalls, for instance during safe-points (!!!). Always do this, even if you don't use huge pages.
Also make sure you are using the latest Linux kernel and the latest Java version.
You may be able to squeeze a few percentage points out of it by tuning G1, but this is a bit of a black art.
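On the "how many hugepages" question from the original post, a common rule of thumb (my own addition, not something the answer above states) is to provision enough huge pages to cover the JVM heap, plus a little headroom if the code cache and metaspace are also backed by large pages. With the default 2 MiB huge pages on x86-64, a 16 GiB heap works out to

$$ \texttt{vm.nr\_hugepages} \approx \frac{\text{JVM heap size}}{\text{huge page size}} = \frac{16 \times 1024~\mathrm{MiB}}{2~\mathrm{MiB}} = 8192 $$

pages, set via the vm.nr_hugepages sysctl before starting the JVM with -XX:+UseLargePages.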

How mature is the SD Erlang project?

Do you have any experience with the SD Erlang project?
It seems to implement many interesting concepts regarding communication-mesh optimizations, and I'm curious whether some of you have already used it in production, or at least in some real project.
SD Erlang repo
Thanks!
The project finished a week ago. The main ideas behind SD Erlang are reducing the number of connections Erlang nodes maintain, while keeping transitivity and a common namespace for groups of nodes. The benchmarks we used (Orbit, Ant Colony Optimization (ACO), and Instant Messenger) showed very promising results. Unfortunately, we didn't have enough human resources to refactor the Sim-Diasca simulation engine. So, no, SD Erlang hasn't been used in a real application yet.
At the moment we are writing up the last deliverable, which will provide an overview of what has been achieved. It will appear here in a few weeks (D6.2). In general we are happy with the results we have obtained using SD Erlang, so there are plans for a follow-up project to continue working on it, but this is currently work in progress.
This is not a direct answer, but I will use SD Erlang in an embedded application which needs to scale to hundreds of nodes (small embedded CPUs). From what I have seen, it's ready to be tried out in a real application. To evaluate further, let's consider the alternatives:
You have only a few distributed nodes: then you probably don't need it and can just connect all the nodes, and for the name registry use either the global module (slow but sturdy) or gproc with the new locks_leader branch, which avoids the quite broken gen_leader that has so far prevented using gproc in distributed mode in production.
You need many nodes (how many depends on your hardware and requirements, but you start to get into interesting territory with > 70 nodes):
Use SD Erlang and fix whatever problems you encounter in production, or at least report them. It certainly solves a lot of the problems you get with normal Erlang distribution.
Roll your own solution, either by playing with different cookie values or with hidden nodes (hint: you can set different cookie values for different peer nodes). But then you need to roll your own global name registry and management code; this looks like a variant of Greenspun's tenth rule, or closer to Erlang, Virding's first rule: you will probably end up implementing half of SD Erlang yourself.
Don't use Erlang distribution at all. The industry standard seems to be that for anything involving more nodes or crossing data centers, you shouldn't use Erlang distribution but run your own protocols. My personal opinion is to fix Erlang distribution's problems rather than just ditch it; it's much too useful and time-saving when it works for a use case to just give up on it. And I see SD Erlang as the fix for the "too many nodes" problem; it's at least the right starting point.

What language/framework (technology) to use for a website (Flash games portal)

----edit-----
QUICK QUESTION: Does Grails take too many resources for a high-traffic website, and is it more expensive to host?
For example: if I can build a site that gets millions of users per month more easily in CakePHP, is it worth building it in Grails just to save some web server resources, or will it need more servers?
---------------
Hello,
I know there are a lot of similar questions on the net, but because I am a newbie in web development, I didn't find the solution to my specific problem.
I am planning to create a Flash games portal from scratch. There is a big chance that there will be heavy traffic from the beginning (millions of pageviews). I want to reduce server costs as much as possible, but at the same time not be tied to an expensive contract, as there is a chance that the project will not be as successful as I want, and in that case the budget would be very small.
The question is: what technology should I use? I don't know any web dev technology yet, so it doesn't matter which one I learn. My web dev experience is a little PHP eight years ago; since then I have programmed in C++ and Java (game and mobile development). I like Java and C syntax very much, and I tend to dislike dynamic typing and non-robust scripting (like PHP), but I can get along if these are the best choices.
The candidates are now:
- Grails (my best bet for now)
- Ruby on Rails
- CakePHP
- Other technologies (Google App Engine, Python/Django, etc.)
At first I considered using pure C and compiling the web app on the server, just to squeeze more out of the hardware, but I soon understood that this was overkill.
Then my eyes landed on Ruby, as there is a lot of buzz about its ease of use. Next I discovered Grails and looked at Java, because it is said to be "faster". But I don't know what "faster" really means for my needs, so here comes the first question:
1) What will consume the most on the server, other than bandwidth, when serving a lot of Flash content requests? Is it memory? I have heard that Java needs a lot of memory but is faster. Is it CPU? I am planning to take some daily VPS.NET nodes at first to see if there is demand; if the "spike" is permanent, I will move to a dedicated server (serverloft.com has some good offers), otherwise I will stay with fewer nodes.
I was also considering developing on Google App Engine: cheap or free hosting at first, so I could test my assumptions, and very easy to use (no need for system administration). But the costs become high with heavier use (> 3 million games played/month at x MB each), and the issue with Google is that it locks me into this technology.
My other concern is scalability (not only traffic/users, but also adding functionality). My plan is to release a functional site in just 4 weeks (just a basic frontend and some quick basic backend, so I can modify some things and add games manually), and then to grow it and add more things. I am planning to take a slightly different approach than other portals, so I need to write it from scratch (an off-the-shelf script will not do).
2) Will Grails take many more resources than RoR or PHP, server-wise? I have heard that building on a Java stack is hardware-expensive and overkill unless you are making a banking application. My application will not be very complex (I hope, and I will try to keep it that way), but it will have a lot of traffic.
I also took into account using a CDN for files, but the cheapest CDN I found was 5c/GB (vps.net), while the cost per GB on serverloft (http://www.serverloft.com/dedizierte-server/server-details.php?products=4) is only 1.79 cents/GB and comes with the other resources as well.
I am new to this domain (the web). I have been learning the ropes and searching the web for about half a year, but I don't have any real practical experience, so I know I must have some naive assumptions and other blind spots I'm not aware of yet; please give me any advice you have regarding anything, not just the specific questions asked.
And thank you so much for such a great community!
This is how I view web performance (as I wrote on my blog), especially for highly abstracted frameworks like Grails:
I don't understand the obsession with runtime performance. In most project scenarios, your primary focus should be on your own performance, as in your ability to get things done with a chosen technology.
For example, you will get more done in a given period of time with Groovy than with Java any day. Often one line of Groovy code will equate to 10 lines of Java code, etc.
Very rarely will bytecode execution time be your performance issue; most often it's:
- bad algorithm implementation or design,
- bad DB design and/or queries,
- taking too long to get things done, and then having all sorts of commercial-relationship issues because of it.
With web applications, you are usually not performing lots of long-running, CPU-bound operations. Most of your request/response time is spent on the wire (internet routing, etc.) and in the DB (executing queries).
Choose a technology that takes a load off your mind and frees you from writing mountains of boilerplate code, so that you can instead concentrate on designing and implementing good algorithms, DBs, and queries.
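To make the "one line of Groovy versus ten lines of Java" claim concrete, here is an illustration of my own (not the author's; the Person class and sample data are invented for the example). The Groovy one-liner `def names = people.findAll { it.age > 18 }*.name` filters a list and collects a property; the pre-Java-8 equivalent looks like this:

    import java.util.ArrayList;
    import java.util.List;

    public class FilterExample {
        static class Person {
            final String name;
            final int age;
            Person(String name, int age) { this.name = name; this.age = age; }
        }

        public static void main(String[] args) {
            List<Person> people = new ArrayList<Person>();
            people.add(new Person("Alice", 30));
            people.add(new Person("Bob", 12));

            // Hand-rolled filter-and-collect: what Groovy's
            // people.findAll { it.age > 18 }*.name says in one line.
            List<String> names = new ArrayList<String>();
            for (Person person : people) {
                if (person.age > 18) {
                    names.add(person.name);
                }
            }
            System.out.println(names); // prints [Alice]
        }
    }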

How different is Amazon SimpleDB from Apache CouchDB?

Other than the monetary aspects, how different is Amazon's SimpleDB from Apache's CouchDB in the following respects?
1. Interfacing with programming languages like Java, C++, etc.
2. Performance and scalability
3. Installation and maintenance
I'm a fairly heavy SimpleDB user (I'm the developer of http://www.backupsdb.com/), but I am currently migrating some projects off SimpleDB and onto CouchDB, so I guess I can see this from both sides now.
1. Interfacing with programming languages like Java, C++ etc
Easier with CouchDB, as you can talk to it very easily using JSON over HTTP. SimpleDB is a bit more work, largely due to the complexity of signing each request for security, and the lower-level access you get, which requires you to implement exponential back-off in the case of busy signals, etc. You can now get good SimpleDB libraries in many languages, though, and these take the pain away in many respects.
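For a sense of how lightweight the CouchDB side is, here is a minimal sketch of creating a document over plain HTTP from Java (the host, database name, and document are placeholders; it uses the java.net.http client from Java 11+, which postdates this discussion):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CouchDbHello {
        public static void main(String[] args) throws Exception {
            // A default local CouchDB listens on port 5984; "mydb" is a placeholder.
            String url = "http://localhost:5984/mydb/my-first-doc";
            String json = "{\"type\": \"post\", \"body\": \"hello couch\"}";

            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(json))
                    .build();

            // CouchDB answers in JSON too, e.g. {"ok":true,"id":"my-first-doc","rev":"1-..."}
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

No per-request signing is involved, which is the main contrast with SimpleDB's signed requests.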
2. Performance and Scalability
I don't have any benchmarks, but for my own use case, CouchDB outperforms SimpleDB. CouchDB is harder to scale, though; SimpleDB is great at that: you chuck more at it and it autoscales around you.
There are lots of occasionally irritating limits in SimpleDB, though: limits on the number of attributes, the size of attributes, the number of domains, etc. The main annoyance for many applications is the attribute size limit, which means you can't store large forum posts, for example. The workaround is to offload those into something else such as S3, but it's a bit annoying at times. Obviously CouchDB doesn't have that issue, and indeed the fact that you can attach large files to documents is one thing that particularly attracts me to it.
Scaling-wise, you should possibly also look at BigCouch, which gives you a distributed cluster and is closer to what you get with SDB.
3. Installation and Maintenance
I actually found this much easier with CouchDB. I suspect it depends on which library you need to use for SimpleDB, but when I was starting out, the Amazon-supplied libraries weren't very mature, and the open-source community ones had various issues, so getting up and running and doing something serious took more time than I would have liked. I suspect this is much better now.
CouchDB was surprisingly easy to install, and I love its web interface. Indeed, that would be my major criticism of SimpleDB: Amazon still doesn't have any form of web console for it, despite having web consoles for almost every other service. That's why we wrote the very basic BackupSDB, just so we could extract data in XML and run queries from a web browser. I'd like to have seen Amazon do something similar (but more powerful and better) by now, and I have been very surprised that they haven't. There are lots of third-party Firefox plugins and some applications for it, though, but I have the impression that SimpleDB isn't that widely used; this is only a hunch, really.
4. Other Observations
The biggest issue, I think, is that with SimpleDB you are entrusting all your data to a third party with no easy way of getting it out (you'll need to write something to do that), plus your costs keep gently rising. When you get to the point where the cost is comparable to a powerful dedicated database server, you kind of feel you'd get better value that way, but by this point the migration headache is non-trivial, as you'll have a large commitment to the cloud.
I started off as a huge Amazon evangelist, and for most things I still am, but when it comes to SDB, I feel it's a bit of a hobby project for Amazon, the way the Apple TV was for Steve Jobs.
