Neo4j: what is the fastest structure when you have 3 main entities?

Say you have zip codes, services and customers. Given a zip and service, I want to find the corresponding customers as fast as possible.
Options:
1. Customers are connected to zips via a "service" relationship. This seems like the smallest version: you search for a particular zip and only one type of relationship (the targeted service).
2. Customers are connected to service areas, which point to different zips and services. Here we search for all service areas that point to both the targeted service and the targeted zip.
3. Zips each connect to a service node unique to them, which is then connected to customers. So when you search, you go to the zip you want, then to the service, and anything connected there is what you want (this feels like I may be overly hand-holding Neo4j).
Do these different versions have different performance? I am having trouble understanding the theoretical difference between search formats in Neo4j. Option 2 is an example where the results are limited on two sides at once, whereas with options 1 and 3 you can travel linearly along the graph as you filter. Does that make a difference?
Thanks,
Brian

Approach 1 has several major drawbacks. All data about a "service" (which I assume is a company that provides services) would have to be duplicated in every associated service relationship. That wastes storage space in the DB. Also, if you wanted to find all the customers for a specific service (regardless of zip code), you'd have to scan every service relationship.
Approach 2 introduces an extra "service area" layer to the data model that seems to provide no advantage and just makes processing your use case more complicated and slower.
Approach 3 (in which I assume every "service" has a unique node) should be the way to go. There is no data duplication, and no scanning is needed to find the desired customers (whether you start from a zip code, or from a service).
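As a rough sketch of what that model could look like with one node per service, here it is via the official Neo4j Python driver. The labels (Zip, Service, Customer), relationship types (IN_ZIP, USES), and property names are illustrative assumptions, not taken from the question:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # With indexes on Zip.code and Service.name, the planner can anchor on
    # either end and expand through the customers; nothing is scanned.
    FIND_CUSTOMERS = """
    MATCH (z:Zip {code: $zip})<-[:IN_ZIP]-(c:Customer)-[:USES]->(s:Service {name: $service})
    RETURN c.name AS customer
    """

    with driver.session() as session:
        session.run("CREATE INDEX zip_code IF NOT EXISTS FOR (z:Zip) ON (z.code)")
        session.run("CREATE INDEX service_name IF NOT EXISTS FOR (s:Service) ON (s.name)")
        for record in session.run(FIND_CUSTOMERS, zip="90210", service="plumbing"):
            print(record["customer"])

    driver.close()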

Related

database solution for multiple isolated graphs

I have an interesting problem that I don't know how to solve.
I have collected a large dataset of 80 million graphs (they are CFGs, as in Control Flow Graphs, produced from programs I have analysed from GitHub) which I need to be able to search efficiently.
I looked into existing solutions like Neo4j, but they are all designed to store a single global graph.
In my case it is the opposite: all graphs are independent - like rows in a table - but I need to search through all of them efficiently.
For example, I want to find all CFGs that have a particular IF condition or a WHILE loop with a particular condition.
What's the best database for this use case?
I don't think that there's a reason not to simply store all those graphs in a single graph, whether it's Neo4j or a different graph database. It's not a problem to have many disparate graphs in a single graph where the disparate graphs are disconnected from one another.
As for searching them efficiently, you would either (1) identify properties in your CFGs that you want to search on and convert them to some indexed value of the graph or (2) introduce some graph structure (additional vertices/edges) between the CFGs that will allow you to do the searches you want via graph traversal.
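To make approach (1) concrete, here is a minimal sketch with the Neo4j Python driver; the Stmt label and the kind/cond/graph_id properties are hypothetical names for whatever your CFG vertices carry:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Index the properties you expect to search on, so a lookup does not
        # scan vertices across all 80 million graphs.
        session.run("CREATE INDEX stmt_kind_cond IF NOT EXISTS "
                    "FOR (s:Stmt) ON (s.kind, s.cond)")

        # Every vertex stores the id of the CFG it belongs to, so matching
        # one statement immediately identifies its whole (disconnected) graph.
        result = session.run(
            "MATCH (s:Stmt {kind: 'WHILE', cond: $cond}) "
            "RETURN DISTINCT s.graph_id AS cfg",
            cond="i < n")
        for record in result:
            print(record["cfg"])

    driver.close()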
Depending on what you need to search on, approach 1 may not be flexible enough for you, especially if what you intend to search on is not completely known at the time of loading the data. Also, it is important to note that with approach 2 you do not really lose the fact that you have 80 million distinct graphs just because you provided some connection between them. Those physical connections don't change that basic logical fact. You just need to consider those additional connections when you write traversals that you expect to stay within a single CFG.
I'm not sure what Neo4j supports in this area, but with Apache TinkerPop (an open source graph processing framework that lets you write vendor-agnostic code over different graph databases, including Neo4j), you might consider doing some form of graph partitioning to help with approach 2. Or you might subgraph() the larger graph to only contain the CFG and then operate on that purely in memory when querying. Both of these approaches will help you scope your query to just the individual CFG you want to traverse.
Ultimately, however, I see this issue as a modelling problem. You will just need to make some choices on how to best establish the schema for your use case and virtually any graph database should be able to support that.
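Approach (2) might look something like the following: each CFG's entry vertex is linked to a shared "feature" vertex, so the search becomes a short traversal. The Entry/Feature labels and the OCCURS_IN relationship are again hypothetical, one possible schema among many:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Build step: link every CFG that contains a WHILE statement to a
        # shared Feature vertex describing that statement.
        session.run("""
            MATCH (s:Stmt {kind: 'WHILE'})
            MATCH (entry:Entry {graph_id: s.graph_id})
            MERGE (f:Feature {kind: 'WHILE', cond: s.cond})
            MERGE (f)-[:OCCURS_IN]->(entry)
        """)

        # Query step: a two-hop traversal instead of a scan over 80M graphs.
        result = session.run("""
            MATCH (:Feature {kind: 'WHILE', cond: $cond})-[:OCCURS_IN]->(entry)
            RETURN entry.graph_id AS cfg
        """, cond="i < n")
        print([record["cfg"] for record in result])

    driver.close()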

Master Data Management using Graph Database

I am building a master database to store all relevant information about our customers. I am using Neo4j.
Below is a sample of our model. We have a Person, who can be registered in 3 of our mobile applications (App 01, App 02, App 03 - we use a CPF key, which is like an SSN). In those apps the user can register with an email, represented by an Email entity. Those users can have multiple addresses, represented by an Address entity.
The question is:
As I am building master data, IMO, if someone queries the MDM database asking for the "best" information about a person, I would return, for example:
Name: John
Best email: email2 (because it has two apps using it)
Best address: addr1 (because it has two apps using it)
So I am going to build some heuristics to define what the "best" email and address are.
For this purpose, I have some options:
1. I could create an edge from John to email2 and to addr1, so it would be easy for a user of the MDM to get the "best" address/email for John.
2. I could build a REST API endpoint and apply the heuristic at query time.
Does anyone have experience using graph databases or designing an MDM database?
Is it a good approach?
This question is a complement to the question: Using Neo4j to build a Master Data Management
The graph data model is good for storing your master data; however, your master data will most likely co-exist with operational and reference data in the form of dimensions.
If you decide to go with a graph model for your MDM, make sure that you have a well-defined semantic model for the core dimensions in MDM, usually:
Products
Customers
Employees
Assets
Locations
These core dimensions become attributes of your nodes.
Also, decide which MDM architecture style you are going to adopt; some popular ones are:
The Registry - graph fits very well with this style because your master data remains in the system of record (SOR) and the references can be represented in the graph very nicely.
Master Data Hub - extra transformations are required to transpose your system of record from tabular form to the graph.
Master-Master - this style fits well with MDM in the graph if you do not have too many legacy apps that depend on your MDM.
Approach 1 would add a lot of essentially redundant information (about 2N extra relationships, where N is the number of people), and also require more complex coding to handle changes to a person's apps. And, as always when information is stored redundantly, you would have to be especially careful that inconsistencies do not creep in. But, it should be faster when querying for the "best" contact info.
Approach 2 keeps the DB the same size, but requires a more complex and slower query to get the "best" contact info. However, changing a person's apps and contact info is straightforward.
To decide which approach to use, you should consider whether DB size is an issue, and also look at your use cases and how frequently they will be performed.
Here is a simple heuristic if DB size is not an issue. Suppose G is the frequency at which you need to get a person's "best" contact info, and M is the frequency at which you need to modify a person's apps or contact info. You would pick approach 1 if the value of G/M exceeds some threshold K, which you would have to decide on, taking the above considerations into account.
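For approach 1, one way to keep the redundant "best" edges from drifting out of sync is to recompute them in the same transaction that modifies a person's apps. Below is a minimal sketch using the Neo4j Python driver; the labels, relationship types (HAS_APP, USES, BEST_EMAIL), and the cpf property are guesses at the model, not taken from the question:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Re-point BEST_EMAIL at whichever email the most apps currently use.
    UPDATE_BEST_EMAIL = """
    MATCH (p:Person {cpf: $cpf})-[:HAS_APP]->(:App)-[:USES]->(e:Email)
    WITH p, e, count(*) AS apps
    ORDER BY apps DESC
    LIMIT 1
    OPTIONAL MATCH (p)-[old:BEST_EMAIL]->()
    DELETE old
    MERGE (p)-[:BEST_EMAIL]->(e)
    """

    with driver.session() as session:
        # Run this right after any change to the person's apps or emails.
        session.run(UPDATE_BEST_EMAIL, cpf="123.456.789-00")

    driver.close()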

WCF Data Services timeout on large queries

I have a self-hosted WCF Data Service (OData) that I'm developing. As I've been testing it, I noticed that most client applications I'm using (Excel, browsers, etc.) time out on a request that pulls a particular query from my service. There are about 140k records in the query. The applications just crash after a long query.
Right now, the only workaround is to do client-side paging, but if I can simply increase the limit then I would be most grateful for the answer.
Note that my Entity Model is mapped to database views and not actual tables, just in case that is related to the issue.
Cheers!
Do you really need to transfer such a large amount of data?
I think OData is not a protocol for data replication.
The main advantage of OData is the opportunity to query and thus limit the amount of data to be transferred.
In an application that handles a lot of data, a common approach is to first present aggregations, then refine the query (depending, for example, on successive choices made by the user).
The AdaptiveLINQ component I developed can help you implement this type of service. It is based on the notion of a cube: dimensions and measures are defined as C# expressions.
For example, one can imagine a service for browsing a product catalog (containing lots of data) as follows:
List of product categories, and for each of them the number of products available:
http://.../catalogService?$select=Category,ItemQuantity
List of available colors in category "shirt":
http://.../catalogService?$select=Color,ItemQuantity&$filter=Category eq 'shirt'
List of "green shirts":
http://.../catalogService?$select=ProductLabel,ProductID&$filter=Category eq 'shirt' and Color eq 'green'
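If the client really does need all 140k rows, fetching them in pages keeps each request short enough to avoid the timeout. Here is a minimal sketch in Python using requests; the service URL, entity set, key property, and the verbose JSON envelope are assumptions about a typical WCF Data Services (OData v2) endpoint, so adjust them to your service:

    import requests

    BASE = "http://localhost:8080/MyService.svc/MyView"
    PAGE = 5000  # rows per request; size this to stay well under the timeout

    rows, skip = [], 0
    while True:
        resp = requests.get(
            BASE,
            # $top/$skip are standard OData options; $orderby gives the
            # pages a stable, deterministic order.
            params={"$top": PAGE, "$skip": skip, "$orderby": "Id"},
            headers={"Accept": "application/json;odata=verbose"},
        )
        resp.raise_for_status()
        batch = resp.json()["d"]["results"]  # verbose-JSON envelope
        rows.extend(batch)
        if len(batch) < PAGE:
            break  # final short page reached
        skip += PAGE

    print(len(rows), "rows fetched")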

p2p long term storage

We are building a long-term preservation cluster made of 3 geographically distant nodes of 32 TB each.
The 3 nodes must have the same files (3-way redundancy).
My idea is to use a p2p protocol to keep the 3 nodes synchronized. I mean: if someone puts a file (a document) on one node (using a specific web-based app), the other 2 nodes must take a copy of it (asynchronously) automatically.
I searched for p2p file systems, but it seems, in general, that they split files across many nodes and optimize access performance, which is not our case. We need only an automated replication system. We anticipate a large number of files.
Does anyone know of an open source project that can help?
Thanks.
P2P is overkill in your case, especially if your servers all have a public address. rsync or something similar would be much easier to implement.
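As a minimal sketch of that idea, the web app's upload hook (or a cron job) on the ingest node could push the archive to the two mirrors with rsync. The hostnames and paths below are hypothetical:

    import subprocess

    MIRRORS = ["node2.example.org", "node3.example.org"]
    ARCHIVE = "/srv/preservation/"  # trailing slash: sync contents, not the dir

    def replicate():
        """Push the local archive to each mirror, asynchronously to uploads."""
        for host in MIRRORS:
            # -a preserves metadata, -z compresses over the WAN link, and
            # --partial lets interrupted transfers of large files resume.
            subprocess.run(
                ["rsync", "-az", "--partial", ARCHIVE, f"{host}:{ARCHIVE}"],
                check=True,
            )

    if __name__ == "__main__":
        replicate()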

How do you build a scalable infrastructure for a family of related but separate websites?

So, my company is talking about building an ecommerce platform that will serve many different clients. Each client would have a different look and feel, and its own set of users, but the backing code (i.e., administration services, authentication servers, checkout services, possibly admin pages, etc.) and some users would be shared, so a bug fix could be applied to all sites at the same time, and a primary admin could log into all websites.
As the entire Stack Exchange set of websites (with pretty high traffic) runs off a small number of servers (two, I believe), I wonder what it would involve to serve up many unrelated (but similar) websites through one webapp, or even one database.
In order to have one database, I imagine that every table would have columns identifying which realm the entity belonged to, and every SQL call would filter by that column. That seems like it would become a maintenance nightmare, and (less importantly for me) a DBA's hell.
Another option, with one webapp but multiple databases: I imagine the realm could be tied to a specific data source, where all non-shared data could be specified. Then when any request was made, the appropriate data source could be loaded and the webapp would run as if there were only a single source. This would have the added benefit of being easily horizontally scalable, since the exact same webapp, but with a different set of realms and data sources, could be spawned when necessary. Websites could be easily moved to new servers as well by simply copying the webapp and moving the database.
I'm wondering what other possibilities there are, as well as specific examples if they're out there.
Note: I'm not talking about Twitter-scale scalability, nor about hardware/languages/etc, but rather design methodologies and patterns.
The architecture you are talking about is called "multi-tenant" architecture. There are different ways of going about architecting a multi-tenant application. Broadly speaking, the data tier can be architected in 3 ways:
1. Separate database for each client - easier to code, more difficult to maintain.
2. One database, a different schema for each client.
3. One database, one schema (with a clientid in each table, except for metadata tables) - more time spent coding, easier to maintain.
Each has its own advantages and disadvantages. Take a look at this article by Microsoft on multi-tenancy: http://msdn.microsoft.com/en-us/library/aa479086.aspx
Broadly speaking, I would suggest option 3, as it offers true multi-tenancy. If you have some tables that you expect to become very large, you can partition those tables based on the clientid (e.g., if you want 10 partitions, you could partition on clientid mod 10).
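As a small illustration of option 3, here is a sketch using Python's built-in sqlite3; the table, columns, and helper function are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Option 3: one schema, and every tenant-owned table carries clientid.
        CREATE TABLE orders (
            id       INTEGER PRIMARY KEY,
            clientid INTEGER NOT NULL,
            total    REAL NOT NULL
        );
        -- Composite index keeps per-tenant queries fast as the table grows.
        CREATE INDEX idx_orders_client ON orders (clientid, id);
    """)

    def orders_for(clientid):
        """Every query filters by clientid; one helper keeps that uniform."""
        return conn.execute(
            "SELECT id, total FROM orders WHERE clientid = ?",
            (clientid,),
        ).fetchall()

    conn.executemany(
        "INSERT INTO orders (clientid, total) VALUES (?, ?)",
        [(1, 9.99), (1, 20.00), (2, 5.00)],
    )
    print(orders_for(1))  # only tenant 1's rows

A realm-per-database variant (the questioner's second option) would instead map each realm to its own connection string and skip the clientid column entirely.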
