Neo4j vs "traditional" RDBMS & Memcache / Messaging / Redis solutions - ruby-on-rails

I'm in the middle of prototyping a social network (using Rails 3) and decided to check out Neo4j. While it looks great, I have a question about scaling and performance in terms of design.
I've researched how Etsy puts together an activity feed (see http://www.slideshare.net/danmckinley/etsy-activity-feeds-architecture ), and I understand how messaging queues can fan out activities (such as sharing a picture and making that activity available to your 500 or so friends in their news feeds). I also understand how news feeds can be cached (Memcached) and how lookups can be performed against Redis.
All in all, it seems that to make a high-performance activity feed that scales well (and a social network in general), the common pattern is to use sharding, horizontal scaling, Memcached, RabbitMQ, Redis, MongoDB, InnoDB (MySQL), etc., all in an attempt to compensate for high volumes, disk reads, and so on. But this is quite a bit of overhead in terms of design.
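For concreteness, here is a minimal sketch of that fan-out-on-write pattern in Python, assuming the redis-py client; the key names and the 1000-entry cap are illustrative, not anything from the Etsy deck:

    import json
    import time

    import redis

    r = redis.Redis()

    def share_photo(user_id, photo_id):
        """Fan one activity out to every follower's cached feed."""
        activity = json.dumps({"actor": user_id, "photo": photo_id, "ts": time.time()})
        # In a real system this loop is what gets handed off to a queue
        # worker (RabbitMQ etc.), since a popular user may have thousands
        # of followers and you don't want to block the web request.
        for follower_id in r.smembers(f"followers:{user_id}"):
            feed_key = f"feed:{int(follower_id)}"
            r.lpush(feed_key, activity)
            r.ltrim(feed_key, 0, 999)  # cap each cached feed at 1000 entries

    def read_feed(user_id, count=20):
        """Reading a feed is now a single cheap list read."""
        return [json.loads(a) for a in r.lrange(f"feed:{user_id}", 0, count - 1)]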
Can Neo4j eliminate the need, at least early on, for such an arrangement? I mean, is it so fast that I don't need to set up a message queue for fan-outs and messaging, don't need to set up an "activities" cache for every action a user performs, and can use it to handle both ordering and storing messages? Can a news feed like Facebook's be created with such a system, or are high-performance activity feeds limited to basic status updates?
If those questions are too broad, let me ask it a different way: could I write Facebook or Twitter using Neo4j and eliminate the need for message queuing to fan out updates (instead getting a live stream of updates on the fly), Memcached for news feeds, and cached activity feed objects? Or would I find myself doing the same thing, or even more, to handle hundreds of requests per second?
I ask because it would save quite a bit of time to use Neo4j if it can indeed handle high volumes without the tricks Etsy, Twitter, and Facebook employ to maintain high performance.

Yes. In fact, it's been done already by Rene Pickhardt.
http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
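To make the contrast with the fan-out machinery concrete: in Neo4j a feed can be assembled at read time by traversing the graph from the user, instead of precomputing per-user caches. A rough sketch with the official Python driver, assuming a hypothetical (:User)-[:FRIEND]->(:User), (:User)-[:POSTED]->(:Activity) schema; note this is the naive traversal, not Graphity's optimized index structure:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    # Pull the 20 most recent activities from everyone the user follows,
    # computed on the fly from the graph at read time.
    FEED_QUERY = """
    MATCH (me:User {id: $user_id})-[:FRIEND]->(friend)-[:POSTED]->(a:Activity)
    RETURN friend.id AS actor, a.text AS text, a.ts AS ts
    ORDER BY a.ts DESC
    LIMIT 20
    """

    with driver.session() as session:
        for record in session.run(FEED_QUERY, user_id=42):
            print(record["actor"], record["text"], record["ts"])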

Related

GraphQL and round trips. Is this just an iOS app issue?

I'm taking another look at GraphQL, and I'm trying to understand why saving round trips is a benefit to developers. Is the cost of making a request so expensive? I'm coming from a web development background.
Let's compare a standard REST API with a GraphQL API.
I need to retrieve a user's personal info, and a list of their friends. A traditional REST API might require two calls: one to get the personal info, and one to get the friends.
With GraphQL, I get the same results with one request.
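Sketched out (Python with requests; the endpoint URLs and field names are made up for illustration):

    import requests

    # REST: two round trips from the client.
    user = requests.get("https://api.example.com/users/42").json()
    friends = requests.get("https://api.example.com/users/42/friends").json()

    # GraphQL: one round trip, shaped exactly like what the page needs.
    query = """
    {
      user(id: 42) {
        name
        email
        friends { name avatarUrl }
      }
    }
    """
    data = requests.post("https://api.example.com/graphql", json={"query": query}).json()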
But as a frontend developer, I want my page to spend as little time as possible in a blank state. I would rather render a portion of the page as quickly as possible than wait for all the information I need to arrive in one go before rendering.
Now, from my understanding, GraphQL was partly created to solve mobile application API issues. Is there something about iOS apps that makes it more beneficial to load all the data at once, as opposed to making parallel requests? Or is there something else that I'm missing?
So, generally speaking, network traffic inside the system you control (i.e. your backend) is fast. A request that may take 205ms (200ms network, 5ms data) to the outside world might take just 6ms (1ms network, 5ms data) internally.
If you want to build a screen of your app, and that requires making two REST requests because the second depends on the result of the first, you're looking at (given these crude numbers) 410ms to get the data you need to render your screen.
With GraphQL (or any other gateway layer that consolidates the data), you'd get everything in slightly over 212ms (200ms latency to the GraphQL server + 6ms for each internal call).
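Plugging the answer's crude numbers into code makes the arithmetic explicit:

    EXTERNAL_RTT_MS = 200  # client <-> public API (the answer's crude figure)
    INTERNAL_RTT_MS = 1    # service <-> service inside the backend
    DATA_MS = 5            # time to actually produce the data

    # Two dependent REST calls from the client: each pays the external round trip.
    rest_sequential = 2 * (EXTERNAL_RTT_MS + DATA_MS)                    # 410 ms

    # One GraphQL request: one external round trip, two cheap internal calls.
    graphql_gateway = EXTERNAL_RTT_MS + 2 * (INTERNAL_RTT_MS + DATA_MS)  # 212 ms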
In scenarios where you could make all your requests in parallel (i.e. they don't depend on the data from other requests), the performance benefits aren't quite as apparent, but you'll find that these situations are actually pretty rare as the complexity of your app grows.
The general rule of thumb with GraphQL is that your initial query fetches enough data for the page to be functional, if there's less important content you can always fetch that with another query.
Alongside the performance benefits, letting mobile devices make fewer network requests is a big win when it comes to battery life. Network usage is expensive, and should be avoided as much as possible.
Before using GraphQL instead of a REST API, one needs to understand the benefits of GraphQL over REST.
GraphQL is a query language, and it uses the type system you define (GraphQLInt, GraphQLString, custom types...).

REST:
multiple round trips - slow
over-fetched data
many endpoints

GraphQL:
one endpoint
declarative: types, queries, mutations
the exact shape of the data you want in one call
no underfetching, no overfetching of data
query == result, better performance
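As an illustration of "query == result" - a sketch using the graphene library for Python; the types and resolver are made up:

    import graphene

    class User(graphene.ObjectType):
        name = graphene.String()
        friends = graphene.List(lambda: User)

    class Query(graphene.ObjectType):
        user = graphene.Field(User, id=graphene.Int(required=True))

        def resolve_user(root, info, id):
            # Stand-in resolver; a real one would hit your data store.
            return User(name="Ada", friends=[User(name="Grace", friends=[])])

    schema = graphene.Schema(query=Query)  # the single endpoint serves this schema

    # The query names exactly the fields it wants, and the result mirrors it.
    result = schema.execute("{ user(id: 1) { name friends { name } } }")
    print(result.data)  # {'user': {'name': 'Ada', 'friends': [{'name': 'Grace'}]}}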

How to do Neo4j Cache-based Sharding?

I've been reading Neo4j's Operational Manual on cache sharding, and posts all over the web, but I can hardly find any detailed example of how to configure HAProxy for cache sharding (yes, the one in the Operational Manual is rather brief) on a real-world graph, which may contain multiple node labels.
Has anyone ever done this before? Would be lovely if you could share your experience.
Moreover, I'm a bit confused about the mechanism for sharding the graph using HAProxy. How do sub-graphs get cached on certain slaves merely by providing rules in HAProxy? It surprised me to learn that cache sharding isn't handled by Neo4j itself.
The goal is to always send queries hitting the same region of your graph to the same instance. This of course means that the request data must indicate the region. What to use as the "region indicator" depends heavily on the structure and shape of your graph.
In a lot of customer-facing applications, people have successfully used the current user id, setting it as an additional HTTP header which is then evaluated by HAProxy.
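A minimal sketch of what that might look like, assuming the application sends the user id in a custom X-User-Id header (the header name, addresses and ports are illustrative):

    frontend neo4j_http
        bind *:7474
        default_backend neo4j_cluster

    backend neo4j_cluster
        # Hash the user id header so requests for the same user (and hence
        # the same region of the graph) consistently land on the same
        # instance, keeping that sub-graph warm in its cache.
        balance hdr(X-User-Id)
        hash-type consistent
        server neo4j1 10.0.0.11:7474 check
        server neo4j2 10.0.0.12:7474 check
        server neo4j3 10.0.0.13:7474 check

So nothing in Neo4j itself pins a sub-graph to a slave; the routing rule just makes the same slice of traffic hit the same instance, and that instance's cache ends up holding the corresponding slice of the graph.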

Erlang fault-tolerant application: PA or CA of CAP?

I have already asked a question regarding a simple fault-tolerant soft real-time web application for a pizza delivery shop.
I got really nice comments and answers there, but I disagree that it is a true web service. Rather than a web service, it is more of a real-time system to accept orders from customers, control the dispatching of these orders, and control the vehicles that deliver those orders in real time.
Moreover, unlike a 'true' web service, this system is not intended to have many users - it is just a few dispatchers (telephone operators) and a few delivery drivers that will use it (as of now I have no requirement to provide direct access to the service to the actual customers; only the dispatchers and delivery drivers will have direct access).
Hence this question is a bit more general.
I have found that in order to make the right choice of NoSQL data storage for this application, the first thing I have to do is choose between CA, PA and CP according to the CAP theorem.
Now, the Building Web Applications with Erlang book says that "while it [Mnesia] is not a SQL database, it is a CA database like a SQL database. It will not handle network partition". The same book says that the CouchDB database is a PA database.
With that in mind, I think the very first thing I need to do for my application is decide what the term 'fault tolerance' means with regard to CAP.
The simple requirement that I have is that the application be available 24/7 (R1). The other one is that there is no need to scale; the application will have a very modest number of users (it is probably not possible to have thousands of dispatchers) (R2).
Now, does R1 require the application to provide Consistency, Availability and Partition Tolerance and with what priorities?
What type of data storage option will better handle the following issues:
1. Providing 24/7 availability for a dispatcher (a person who accepts phone calls from customers and who uses a CRM) to look up customer records and put orders into the system;
2. Looking up currently ongoing orders and their status (placed, baking, dispatched, delivering, delivered) in real time;
3. Keeping track of all working vehicles' locations and their payloads in real time;
4. Recovering any part of the system after a system crash or network crash, to continue providing 1, 2 and 3.
To sum it up: what kind of data storage (CA, PA or CP) will suit the system described above best? What kind of data storage will better satisfy the R1 requirement?
For your 24/7 requirement you are looking for a database with (high) availability, because you want your requests to succeed every time (even if they are only error results).
A netsplit would bring your whole system down if you have no partition tolerance.
Consistency is nice to have, but you can only have two of the three.
Your best bet will be a PA solution. I highly recommend a solution inspired by Amazon's Dynamo. The best-known Dynamo implementations are Riak and CouchDB. Riak even allows you to shift PA towards other trade-offs by tuning the read and write replicas.
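To make "tuning the read and write replicas" concrete, here is the Dynamo-style quorum arithmetic this refers to (illustrative numbers, not Riak's actual API):

    N = 3  # replicas stored per key
    R = 2  # replicas that must answer a read
    W = 2  # replicas that must acknowledge a write

    # Overlapping quorums: every read set intersects every write set,
    # so a read always sees the latest successful write.
    assert R + W > N

    # Leaning toward availability instead: single-ack reads and writes
    # stay fast and available during a netsplit, but may return stale data.
    R, W = 1, 1
    assert R + W <= N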
First, don't confuse CAP "Availability" with "High Availability". They have nothing to do with each other. The A in CAP simply means "All DB nodes can answer queries". To get High Availability, you must be in multiple data centers, you must have robust documented procedures for maintenance, expansion, etc. None of that depends on your CAP choice.
Second, be realistic about your requirements. A stock-trading application might have a requirement for 100% uptime, because every second of downtime could lose millions of dollars. On the other hand, I'm guessing your pizza joint might lose tens of dollars for every minute it's down. So it doesn't make sense to spend millions trying to keep it up. Try to compute your actual costs.
Third, always evaluate your choice against the mainstream. You could just go CA (MySQL) and quickly fail over to the slaves when problems happen. Be realistic about the costs (and risks) of building on new technology. If you really expect your system to run for 5 years without downtime, ask for proof that someone else has run that database for 5 years without downtime.
If you go "AP" and have remote people (drivers, etc.), then you'll need to write an app that stores their data on their phone and sends it in the background (with retries). Of course, you could do this regardless of whether your database is CA or AP.
If you want high uptimes, you can either:
Increase MTBF (Mean Time Between Failures) - buy redundant power supplies, dual Ethernet cards, etc.
Decrease MTTR (Mean Time To Recovery) - just make sure that when failure happens you can recover quickly (fail over to a slave).
I've seen people spend tens of thousands of dollars on MTBF, only to be down for 8 hours while they restore their backup. It makes more sense to ensure MTTR is low before attacking MTBF.
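A quick way to see why MTTR dominates (illustrative numbers):

    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Expensive hardware, slow restore: fails every ~3 months, down 8h each time.
    print(availability(2000, 8.0))    # ~0.9960 -> roughly 35 hours down per year

    # Cheaper hardware, fast failover: fails monthly, recovery takes 5 minutes.
    print(availability(720, 5 / 60))  # ~0.9999 -> roughly 1 hour down per year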

Websocket scalability, broadcasting concerns

If you have a complex requirement set with many users (and servers), how will your WebSocket infrastructure (server[s]) scale, especially with broadcasting?
Of course, broadcasting is not part of the WebSocket spec, but it's there even in basic chat examples (a.k.a. hello world for WebSockets).
A client-side solution (asking for new data) still seems more scalable than a server-side solution (broadcasting), even given WebSockets' low latency and relatively cheap (HTTP-headerless) nature.
Edit:
OK, just consider that you want to replace all your AJAX code with WebSocket implementations, which may mean many connections within many different contexts. This adds enormous complexity to your system if you want to keep track of every possible scenario for broadcasting.
Low-level (network/thread, etc.) implementation suggestions are also part of the problem, not the solution, because they mean you have to code a special server, unlike general HTTP servers.
Moreover, broadcasting brings a stateful nature to the table, which can't easily scale. Think about adding more servers and load balancing.
Scaling realtime web solutions can be a complex problem, but it is one that services like Pusher (who I work for) have solved, and one for which there are definitely established solutions if you self-host. The PubSub paradigm is well understood and has been implemented many times; to solve the problem there needs to be some state (who is subscribing to what). This paradigm is what handles broadcasting in the types of scenarios you are talking about.
Realtime web technologies have been built with large numbers of simultaneous connections in mind - many from the ground up. If you wanted to create a scalable solution you would most likely use an existing realtime web server that supports WebSockets; in the same way that it's highly unlikely you would decide to implement your own HTTP server, you are unlikely to want to implement your own WebSocket-capable server from scratch.
Dedicated realtime web servers also let you separate your application logic from your realtime communication mechanism (separation of concerns). Your application might need to maintain some state, but the realtime technology deals with managing subscriptions and connections. How communication between the application and the realtime web technology is achieved is up to you, but frequently message queues are used, and Redis specifically is very popular in this space.
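As a sketch of that separation of concerns in Python, assuming the websockets and redis packages (the channel name is made up): the application publishes to Redis without knowing who is connected, and the realtime tier just relays to its local connections.

    import asyncio

    import redis.asyncio as redis
    import websockets

    connected = set()  # live WebSocket connections on this node

    async def handler(ws):
        # The realtime tier only tracks connections; no app logic here.
        connected.add(ws)
        try:
            await ws.wait_closed()
        finally:
            connected.discard(ws)

    async def relay():
        # Relay everything published on one Redis channel to local clients.
        # Run one relay per WebSocket node: Redis pub/sub delivers each
        # message to every node, so you can scale out behind a balancer.
        pubsub = redis.Redis().pubsub()
        await pubsub.subscribe("broadcast")
        async for msg in pubsub.listen():
            if msg["type"] == "message":
                websockets.broadcast(connected, msg["data"].decode())

    async def main():
        async with websockets.serve(handler, "0.0.0.0", 8765):
            await relay()

    asyncio.run(main())

The application side is then a single publish("broadcast", ...) call against Redis, with no knowledge of sockets at all.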
HTTP polling may conceptually be easier to understand - you can maintain statelessness and with each HTTP poll request you specify exactly what you are looking for. But it most definitely means that you will need to start scaling much sooner (adding more resource to handle the load).
WebSocket polling is something I've not considered before, and I don't think I've seen it suggested anywhere either; the idea that the client should say "I'm ready for my next set of data and here's what I want" is an interesting one. WebSockets have generally taken a leap away from the request/response paradigm, but there may be scenarios where combining the increased efficiency of WebSockets with request/response has some benefits. The SocketStream application framework might be worth a look, as it might be relevant; after the initial application load, all communication is performed over WebSockets, which means that even basic request/response functionality uses WebSockets.
However, since we are talking about broadcasting data we need to go back to the PubSub paradigm where it makes much more sense to have active subscriptions and when new data is available that new data is distributed to those active subscriptions (pushed). All your application needs to know is if there are any active subscriptions or not in order to decide whether to publish the data or not. That problem has been solved.
The idea of websockets is that you keep a persistent connection with each client. When there is new data that you want to send to every client, you already know who all the clients are so you should just send it.
It sounds like you want each client to constantly send requests to the server for new data. Why? It seems like that would waste everyone's bandwidth, and I don't know why you think it would be more scalable. Maybe you could add more detail to your question, like what kind of information you are broadcasting, how often, how many bytes, how many clients, etc.
Why not just consider an open websocket connection to be like a standing request from the client for more data?

How do you build a torrent file indexer?

I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
You search for keywords, and it brings up a list of potential titles matching your search.
Then you pick one of these and it provides you with another list of potential servers hosting the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect then aggregate the data?
Is it a submission base service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge and understanding of the bittorrent protocol is not very elaborate, but the documentation that I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.
To begin with, start by indexing their RSS feeds and gathering data from them. The next step would be indexing the portals' pages (Mininova, TPB, etc.), but watch out for the fact that you can be banned (IP-based) for doing so, since that would generate a huge volume of requests to their servers (I don't think they would be too happy about that).
That said, I doubt that they have access to other servers' databases; rather, it's crawling + RSS.
Another thing you can do: when somebody makes a query for an item which you don't have in your database, run the query against the main BT portals, cache the result in your DB, and then display the results. Then if another user makes the same query (which is a pretty common scenario), you can show them the cached data plus any new data from the RSS feeds.
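A toy version of the RSS approach in Python, assuming the feedparser package; the feed URL and schema are made up:

    import sqlite3

    import feedparser

    FEEDS = ["https://example-torrent-portal.org/rss"]

    db = sqlite3.connect("index.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS torrents "
        "(title TEXT, link TEXT PRIMARY KEY, published TEXT)"
    )

    def poll_feeds():
        """Pull each portal's RSS feed and add unseen entries to the index."""
        for url in FEEDS:
            for entry in feedparser.parse(url).entries:
                db.execute(
                    "INSERT OR IGNORE INTO torrents VALUES (?, ?, ?)",
                    (entry.title, entry.link, entry.get("published", "")),
                )
        db.commit()

    def search(keyword):
        """Naive substring search; a real indexer would use full-text search."""
        pattern = "%" + keyword + "%"
        return db.execute(
            "SELECT title, link FROM torrents WHERE title LIKE ?", (pattern,)
        ).fetchall()

Run poll_feeds() on a schedule, and the cached-query trick above becomes a fallback for anything the feeds haven't surfaced yet.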
