Storing entire documents as node properties - neo4j

I'm creating a knowledge graph from text documents. This will include document nodes.
I want to store the entire text of these documents as a node property.
Is there anything inherently wrong with this?
Would it be better to store this information elsewhere and then just store a reference to that instead?
Are there any limitations to the amount of data you should add to nodes or relationships?

You can check these two options:
Store the relationships in a graph database and the document information in a different database, such as CouchDB (cons: managing the two stores and keeping them in sync).
If you have text, then you need to think more about what kind of graph you're trying to get out of it. Do you want person/place/thing information linked together? Then you might check out the GraphAware NLP plugins for Neo4j, but they work on the text, not on docs/PDFs.
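The "store a reference instead" idea from the question can be sketched as follows. This is a minimal illustration, not a Neo4j or CouchDB API: the `document_store` dict stands in for whatever external store is used, and the names `put_document`, `make_document_node`, `load_text` and the `text_ref` property are all hypothetical.

```python
import hashlib

# Hypothetical stand-in for an external document store (CouchDB, files, ...).
document_store: dict[str, str] = {}


def put_document(text: str) -> str:
    """Store the full text externally and return a content-derived key."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    document_store[key] = text
    return key


def make_document_node(title: str, text: str) -> dict:
    """Build the property map for a document node.

    Only small, queryable values and the reference go on the node;
    the full text stays in the external store.
    """
    return {
        "title": title,
        "text_ref": put_document(text),
        "length": len(text),
    }


def load_text(node: dict) -> str:
    """Follow the reference on the node back to the full text."""
    return document_store[node["text_ref"]]


node = make_document_node("Report", "Full text of a long document ...")
print(node["title"], node["length"])
```

Using a content hash as the key means the same text always maps to the same reference, which makes it easier to notice when the two stores have drifted apart.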

Related

Managing Schema/Data In Static/Fixed-Content Dimensions with Lakehouse

In the absence of DML (not leveraging Delta Lake as of yet), I'm looking for ways to manage Static/Fixed-Content Dimensions in a Data Lakehouse (e.g. Gender, OrderType, Country).
Ideally the schema and data within these dimensions would be managed by non-technical staff, but at this point I'm just looking for development patterns that support the concept technically without being able to use DML, preferably with history on the source (who added 10 rows to this dimension?).
Thank you in advance for any assistance.
The Lake/Lakehouse/Warehouse should not be the system-of-record for any of your data. So you have another system where that data is "mastered" and then you copy it. Just like anything else.
The "system-of-record" can be a SharePoint List, an Excel workbook on OneDrive, a Dataverse table, or whatever you find convenient and is accessible to the people managing the data.
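One way to picture the copy step, assuming the mastered list can be exported as CSV: each load is a full snapshot stamped with audit columns, and comparing two snapshots shows what changed between loads. The function names, the `_loaded_at` and `_loaded_by` columns, and the sample data are illustrative assumptions, not part of any lakehouse product.

```python
import csv
import io
from datetime import datetime, timezone


def read_dimension(csv_text: str, key: str) -> dict[str, dict]:
    """Parse a CSV export of the mastered list into {key value: row}."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def snapshot_rows(rows: dict[str, dict], loaded_by: str) -> list[dict]:
    """Stamp every row with audit columns before writing a full snapshot."""
    stamp = datetime.now(timezone.utc).isoformat()
    return [{**row, "_loaded_at": stamp, "_loaded_by": loaded_by}
            for row in rows.values()]


def diff_snapshots(old: dict[str, dict], new: dict[str, dict]) -> dict[str, list]:
    """Report which keys were added, removed or changed between two loads."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }


yesterday = read_dimension("code,name\nM,Male\nF,Female\n", key="code")
today = read_dimension("code,name\nM,Male\nF,Female\nX,Unspecified\n", key="code")
print(diff_snapshots(yesterday, today))
```

No DML is needed on the lake side because every load overwrites or appends a whole snapshot. The "who" in the audit trail still has to come from the system of record.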

Is Core Data a kind of graph database?

I am required to develop a large application that requires knowledge of graph database concepts (see http://sparsity-technologies.com/UserManual/API.html#transactions). I am planning to use Core Data instead of the framework at the above link. I would like answers to the following questions.
1) What exactly is a graph database? Please explain with a simple, general example of something that cannot be done with SQLite.
2) Is Core Data a relational database or not? Please explain.
3) Is Core Data a graph database? The Apple documentation says that Core Data is for object graph management. Does object graph management mean a graph database? If I want to create relationships and weighted edges between objects, is Core Data suitable?
1) What exactly is a graph database? Please explain with a simple, general example of something that cannot be done with SQLite.
Well, since this is all Turing complete, you can do any database operation with any other database; the real question is a matter of efficiency.
In conventional "relational" databases the "relationships" are nothing but pointers to entries in other tables. They don't inherently communicate any information other than "A is connected to B". To capture and structure anything more complex than that, you have to build a lot of pseudo-structure.
A1-->B1 // e.g. first-name, last-name
Which is fine, but the relationship doesn't necessarily have a reciprocal, nor does the data in each table cell have to be names. To make the relationship always make sense, you've got to build a lot of logic to put the data into the tables directly. Ditto for getting it out.
In a GraphDB you have "nodes" and "relationships". Nodes are not entries in a table. They can be arbitrarily complex objects, persisted or not, and persisted in a variety of ways. Nodes generally model some "real-world" object like a person.
"Relationships" in GraphDBs, owing to the term's previous meaning in SQL et al., really need another name, because instead of being simple pointers, they too can be arbitrarily complex objects. In a graph of names (way too simple to actually justify it):
Node-Name-A--(comes before)-->Node-Name-B
Node-Name-B--(comes after)-->Node-Name-A
In SQLite, to find first and last names you query both tables. In a graph, you grab one of the nodes and follow its relationship to the other node.
(Come to think of it, graph theory in math started out as a way to model the bridges of Königsberg connecting the islands that made up the city. So maybe a transportation map would be a better example.)
If cities are nodes, the roads are relationships. The road objects/descriptors would not just connect the two but would contain their own logic and data, such as their direction, length, condition, traffic, susceptibility to weather, and so on.
When you want to find an optimum route between two widely separated city nodes for any particular time, traffic, weather, etc., you start with the node representing the start city and then follow the relationship/road descriptors. In a complex model, any two nearby city nodes might have several roads connecting them, each best in certain circumstances.
All you have to do computationally, though, is compare the relationships between any two nodes. This is called "walking the graph". The huge benefit is that no matter how big the overall DB is, you only have to process the relationships coming out of the first node, say 3, and can utterly ignore the millions of other nodes and relationships that might be in the DB.
You can't do that in SQLite: the more data and the more "relationships" there are, the more you have to process.
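The city/road example can be sketched in a few lines. This is only an in-memory illustration of walking a graph whose relationships carry their own data; the `roads` network, the `km` and `open` properties and the `shortest_route` function are made up for the example, and the route search is plain Dijkstra rather than anything specific to a particular graph database.

```python
import heapq

# Hypothetical road network: each relationship (road) carries its own data.
roads = {
    "A": [("B", {"km": 5, "open": True}), ("C", {"km": 2, "open": True})],
    "B": [("D", {"km": 1, "open": True})],
    "C": [("B", {"km": 1, "open": True}), ("D", {"km": 7, "open": True})],
    "D": [],
}


def shortest_route(start, goal):
    """Walk outward from the start city, following only usable roads."""
    queue = [(0, start, [start])]
    best = {}
    while queue:
        dist, city, path = heapq.heappop(queue)
        if city == goal:
            return dist, path
        if city in best and best[city] <= dist:
            continue  # already reached this city by a route at least as short
        best[city] = dist
        # Only the relationships leaving the current node are examined.
        for neighbour, road in roads[city]:
            if road["open"]:
                heapq.heappush(
                    queue, (dist + road["km"], neighbour, path + [neighbour])
                )
    return None


print(shortest_route("A", "D"))  # (4, ['A', 'C', 'B', 'D'])
```

Closing a road (setting its `open` property to `False`) changes the answer without touching any node, which is the point of letting relationships hold data.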
2) Does Core Data come under relational databases or not? Explain.
No, but if you hum a few bars you can fake it. By default, Core Data is an object graph, which means it does connect objects/nodes, but the relationships are not themselves objects; instead they are defined by information contained in the class of each object. E.g. you could have a Core Data model of the usual company, manager and employee:
CompanyClass
    set_of_manager_objects
        min_managers==1, max_managers==undefined
        delete_Company_Object_delete_all_manager_objects
        reciprocal_relationship_from_manager_is_company
ManagerClass
    one company object
        min_companies==1, max_companies==1
        delete_manager_object_nullify (remove from set in company class)
        reciprocal_relationship_from_company_is_manager
So, Core Data is a kind of "missing link" in the evolution of GraphDBs. It has relationships, but they're not objects in themselves. They're inside the object/node. The relationship properties are hard-coded into the classes themselves, and just a few, but not all, values can be changed. Still, Core Data does have the advantage of walking the graph. To find the employees of one manager at one company, you just start at the company object, go through a small set of managers to find the right one, then walk down to the employee set. Even if you had hundreds of companies, thousands of managers and tens of thousands of employees, you can find one employee out of tens of thousands with a couple of hops.
But you can fake a GraphDB by creating relationship objects and putting them between any two objects/nodes. Because Core Data allows any subclasses of the relationship's defined class to be in the same relationship set, e.g. ManagerClass --> LowManager, MidManager, HighManager, you can define a simple relationship in any given class and then populate it with objects of arbitrary complexity as long as they are subclasses. These are usually termed "linking classes" or "linking relationships".
The normal pattern is to have the linking class have a relationship to the two or more classes it might have to link (which can be generic as well; I've started class trees with a base class with nothing but relationship properties, although there is a performance penalty if you get huge).
If you give each node/object several relationships all defined on separate base linking classes, you can link the same nodes together in multiple ways.
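The linking-class pattern described above can be illustrated outside Core Data. The sketch below uses plain Python dataclasses rather than managed objects, and every name in it (`Node`, `Link`, `Road`, `neighbours`, the `weight` and `lanes` fields) is hypothetical; it only shows the shape of the idea: the relationship is an object of its own, sitting between two nodes, and subclasses can add arbitrary data such as an edge weight.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    name: str
    links: list = field(default_factory=list)  # every Link touching this node


@dataclass(eq=False)
class Link:
    """Base linking class: an object placed between two nodes."""
    source: Node
    target: Node
    weight: float = 1.0

    def __post_init__(self):
        # Register the link on both ends so either node can be walked from.
        self.source.links.append(self)
        self.target.links.append(self)


@dataclass(eq=False)
class Road(Link):
    """A subclass can carry extra data while living in the same link set."""
    lanes: int = 2


def neighbours(node: Node):
    """Return (other node, link) pairs reachable from the given node."""
    return [(l.target if l.source is node else l.source, l) for l in node.links]


a, b = Node("A"), Node("B")
road = Road(a, b, weight=3.5, lanes=4)
for other, link in neighbours(a):
    print(other.name, link.weight)
```

In Core Data terms, `Link` would be an entity with to-one relationships to the two linked entities, and each node entity would hold a to-many relationship back to its links.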
3) Does Core Data come under graph databases?
No, because the fundamental task of a database is persistence, saving the data. The fundamental task of Core Data is modeling the logic of the data inside the app.
Two different things. For example, when I start building a Core Data model, I start with an in-memory store, usually with test data. The model graph is built from scratch on every run, in memory, and never touches the disk. As it progresses, I will shift to an XML store on disk, so I can examine it if necessary. The XML and binary stores are written out whole in one go and read in the same way. Only at the end do I change the store to SQLite or something custom.
In a GraphDB, the nodes, relationships and the general graph are innately tied to the persistence system, AFAIK, and can't be altered. When you walk the graph, you walk the persistence, every time (except for caching).
The usual question people ask is when to use Core Data and when to use SQL in the Apple Ecosystem.
The answer is pretty simple:
Core Data handles complexity inside the running app. The more complex the data model interactions, the more you get free with Core Data.
SQL-derived solutions handle volumes of simple data. Use them if the data model inside the app has little or no logic and there's a lot of data.
If your app is displaying something that would fit on a bunch of index cards (library book records, baseball cards, etc.), then an SQL solution is best, because the logic is just getting particular cards in and out of persistence.
If your app is a complex vector drawing app, where every document will be different and of arbitrary complexity, or you're modeling a V8 engine, then most of the logic is in the active data model while the app is running and persistence is trivial, so Core Data is the better choice.
Graph databases are catching on because our data is getting 1) really, really big and 2) increasingly complex. We need to model the complexity in the node-relationship graph in persistence so we don't have to chew through the entire DB to find the data and then add an additional layer of logic.
Core Data is nothing but a data model layer; Core Data is NOT a database and is far from being a graph database.
Core Data only helps you to create:
Tables (entities)
Columns in a table (attributes)
Relationships (such as primary key, foreign key, one to one, one to many)
Core Data uses SQLite to store data and make queries.
Core Data is used in iOS mobile apps; I believe what you want is a backend database solution.

Permissions to be stored as a Node or a property

We have six different types of permissions for content nodes. If we want to query neo4j for the content by the permission type, is it better to store the permissions as an attribute for each content node, or as a separate node to which each piece of content has a relationship?
This is a good data modeling question, and the truth is it depends.
I'm personally in favor of storing them as a separate node, so you don't have to traverse all nodes (or at least all user nodes) in order to find all the permissions you are looking for, especially if you start to get a lot of users and will be looking for all users of permission X.
This also adds a level of normalization, as well as the ability to perform counts easily.
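The trade-off can be illustrated with a small in-memory model; this is not Neo4j code. The `content` data and the function names are invented for the example. With the permission stored as a property, every content node is inspected; with a separate permission node, the query starts at that node and follows its relationships. (An index on the property would also avoid the full scan, so the comparison assumes no such index.)

```python
from collections import defaultdict

# Hypothetical content nodes, each carrying the permission as a property.
content = {
    "doc1": {"permission": "read"},
    "doc2": {"permission": "edit"},
    "doc3": {"permission": "read"},
    "doc4": {"permission": "admin"},
}


def by_property(permission):
    """Property approach: every content node is inspected."""
    return sorted(cid for cid, props in content.items()
                  if props["permission"] == permission)


# Node approach: one node per permission type, with relationships to content.
has_permission = defaultdict(set)
for cid, props in content.items():
    has_permission[props["permission"]].add(cid)


def by_node(permission):
    """Node approach: start at the permission node and follow its edges."""
    return sorted(has_permission.get(permission, set()))


def count_by_node(permission):
    """Counting is just the size of the permission node's relationship set."""
    return len(has_permission.get(permission, set()))


print(by_property("read"), by_node("read"), count_by_node("read"))
```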

How to store countries and counties list in Umbraco?

We want to store a list of countries and their counties/states for our latest Umbraco project.
These country and county ids are required in other modules for filtering and searching.
We are not using any custom database tables or custom sections in any of the modules.
One option we found is to store each country and its counties as Umbraco Content Library nodes, but we are not sure about the performance impact.
Is there any other suitable way to overcome this situation?
Umbraco content library nodes are perfect for this:
The number of countries is limited, therefore no risk of having thousands of entries all of a sudden
The data is probably not updated frequently.
This will be published to umbraco.config which is accessible via xslt and cached in memory - performance impact: very fast!
States can be stored as child nodes of each country
Other content nodes can be linked with built-in content pickers to countries/states (and filter/search etc).
Integrated Umbraco functionality (publishing, node order, etc.) can be used since they are just nodes
No need for a developer to add a state/country (though you probably want to import the first batch...)
You may consider grouping countries in regions (or similar) because approx. 250 nodes is still a lot of nodes to look through in the content library.
There is another way to store this data: a static file, such as XML.
But this approach has some limitations:
1) You cannot manage the data in Umbraco.
2) You have to write your own code to read the data.
I'd go with the Content Library option. But you may also find something useful here:
http://ucomponents.codeplex.com/documentation

When developing web applications when would you use a Graph database versus a Document database?

I am developing a web-based application using Rails. I am debating between using a Graph Database, such as InfoGrid, or a Document Database, such as MongoDB.
My application will need to store both small sets of data, such as a URL, and very large sets of data, such as Virtual Machines. This data will be tied to a single user.
I am interested in learning about people's experiences with either graph or document databases and why they would use either of the options.
Thank you
I don't feel experienced enough with both worlds to properly and fully answer your question, but I have been using a document database for some time, so here are some personal hints.
Document databases are based on the concept of key/value pairs and static views, and are pretty cool for finding a set of documents that have a particular value.
They don't conceptualize the relations between documents.
So if your software has to provide advanced "queries" where selection criteria act on several 'types of document', or if you simply need to perform a selection using several elements, the [key,value] concept is not appropriate.
There are also a number of other cases where document databases are inappropriate: presenting large datasets in "paged" tables that are sortable on several columns is one case where performance is low and disk space usage is huge.
So in many cases you'll have to perform "server-side" processing in order to pick up the pieces, and with Rails, or any other Ruby-based framework, you might run into performance issues.
Graph databases are based on the concept of a triplestore, meaning that they also conceptualize the relations between the entities.
The graph can be traversed using the relations (and entity roles), which might be more convenient when performing searches across relation-structured data.
As I have no experience with graph databases, I don't know whether a graph database can be easily queried/traversed with several criteria; if an informed reader has such information, I'd really appreciate any examples of such queries/traversals.
I'm currently reading about InfoGrid and trying to figure out whether such databases could be handy for performing complex requests on a very large set of data, relations included.
From what I can read, InfoGrid should be considered a "data federator" able to search/mine data from several sources (Stores), which can also be a NoSQL database such as Mongo.
Which means that you could use a Mongo store for updates and InfoGrid for data searching, and maybe spare a lot of CPU and disk when it comes to complex searches inside a NoSQL database.
Of course, it might seem a little "overkill" if your app simply stores a large set of huge binary files in a database and all you need is to perform simple key queries and retrieve the result. In that case a NoSQL database such as Mongo or Couch would probably be handy.
Hope some of this helps ;)
When connecting related documents by edges, will you get a shallow or a deep graph? I think the answer to that question is important when deciding between graphdbs and documentdbs. See Square Pegs and Round Holes in the NOSQL World by Jim Webber for thoughts along these lines.
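A rough way to answer the shallow-versus-deep question for your own data is to sketch the edges and measure how many hops the longest shortest path from a root takes. The example below is a plain breadth-first search over made-up data; the `shallow` and `deep` structures and the `max_depth` function are illustrative assumptions, not taken from the question.

```python
from collections import deque


def max_depth(edges: dict, root: str) -> int:
    """Breadth-first search returning the hop count to the farthest node."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in depth:  # first visit is the shortest path in BFS
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return max(depth.values())


# Everything hangs directly off the user: fits a document per user.
shallow = {"user": ["url1", "url2", "vm1"]}

# Items reference further items: queries need multi-hop traversals.
deep = {
    "user": ["vm1"],
    "vm1": ["snapshot1"],
    "snapshot1": ["disk1"],
    "disk1": ["block1"],
}

print(max_depth(shallow, "user"), max_depth(deep, "user"))  # 1 4
```

If the depth stays around one or two, embedding related data in a document is usually enough; the deeper it gets, the more a traversal-oriented store pays off.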
