Building and extending a Knowledge Graph with entity extraction, using Neo4j as my database

My goal is to build an automated Knowledge Graph. I have decided to use Neo4j as my database, and I intend to load a JSON file from my local directory into Neo4j. The data I will be using is the Yelp dataset (the JSON files are quite large).
I have seen some Neo4j examples with GraphAware and OpenNLP. I read that Neo4j has good support for Java apps, and I have also read that Neo4j supports Python (I intend to use NLTK). Is it advisable to use Neo4j with Java (Maven/Gradle) and OpenNLP? Or should I use it with py2neo and NLTK?
I am really sorry that I don't have any prior experience with these tools. Any advice or recommendation will be greatly appreciated. Thank you so much!

Welcome to Stack Overflow! Unfortunately, this is a suggestion/opinion question, so it isn't appropriate for this forum.
However, this is an area I have worked in, so I can confidently say that Java (or Kotlin) is the best way to go for Neo4j. It is Neo4j's native language, and there is significantly more support in terms of community answers and available libraries.
That said, NLTK is much more powerful than OpenNLP. So, if your use case is simple enough for OpenNLP, a pure Java/Kotlin approach is perfect. Alternatively, you can use Java as the interfacing layer for the stored graph, but use Python with NLTK for the language work feeding into the graph. This would, of course, dramatically increase the complexity of your project.
Ultimately, the best approach depends on your exact use-case and which trade-offs make the most sense for you.
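For a sense of what the pure-Java route could look like, here is a minimal sketch (not a production pipeline) that runs an OpenNLP name finder over a snippet of review text and MERGEs the extracted entities into Neo4j through the official Java driver. The model file, connection URI, credentials, and the Entity label are all placeholder assumptions; a real loader would stream the large Yelp JSON files and batch the writes.

```java
import java.io.FileInputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class EntityLoader {
    public static void main(String[] args) throws Exception {
        // Pre-trained OpenNLP NER model; the file path is a placeholder.
        TokenNameFinderModel model =
                new TokenNameFinderModel(new FileInputStream("en-ner-person.bin"));
        NameFinderME finder = new NameFinderME(model);

        String review = "Great pasta at Luigi's, writes John Smith.";
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(review);
        Span[] spans = finder.find(tokens);

        // Connection URI and credentials are placeholders for your setup.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            for (String name : Span.spansToStrings(spans, tokens)) {
                // MERGE keeps each extracted entity unique in the graph.
                session.run("MERGE (e:Entity {name: $name})",
                        Values.parameters("name", name));
            }
        }
    }
}
```

The same structure works with py2neo and NLTK if you go the Python route; only the driver calls and the tagger change.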

Related

What's the purpose and mechanism of Ontology in D3WEB

In the expert system d3web, it is possible to insert/develop/use ontologies. However, I cannot see what the purpose of introducing an ontology in d3web is.
The nice example on this page, https://www.d3web.de/Wiki.jsp?page=Demo%20-%20Ontology , shows how to develop an ontology in d3web. In my opinion, an ontology can be developed more efficiently using Protégé. And if the contents have to change in a real application, say an ontology about 'dog', the application could have instances dog A, B, C, D, and it might not be feasible to 'insert' those instances into the d3web knowledge base. Moreover, if the ontology changes over time, how is the ontology in d3web to be used then?
In my opinion, the best way is to develop an ontology outside of d3web using Java code. However, I believe the designers of d3web must have had a good reason to introduce ontologies in d3web. I would appreciate it if someone could let me know.
This is a somewhat common question we get regarding d3web-KnowWE; one reason might be that our naming is somewhat misleading. So let me explain.
First, there is d3web, the Java framework for running knowledge bases with strong problem-solving knowledge, including rules, decision trees, flow-charts, covering lists, cost-benefit dialog strategies, time-based reasoning, and so on. At its core, this framework does not provide any GUIs; it is meant to integrate problem-solving capabilities into other applications/expert systems. It also does not provide a way to properly create/author the knowledge bases it runs, aside maybe from doing so in Java code at the API level.
To also provide proper means to author and develop a knowledge base, including some basic dialogs to run, demo, test, and debug the authored knowledge bases, we began working on the wiki system KnowWE, which today is basically a heavily extended JSPWiki. The site d3web.de itself, for example, is also just a build of KnowWE with specific content.
While we were working on and with KnowWE, we came to really like this 'wiki way' of editing and authoring large knowledge bases, where you automatically support multiple distributed users working on the same knowledge base, have automatic versioning, can add nice documentation directly beside the actual formal knowledge, can generate knowledge using scripts (because it's all just simple text markup), and so forth. The underlying architecture of KnowWE also became quite good and mature over the years.
So after some time of this, we found ourselves needing to author large ontologies as well. And yes, Protégé is a nice tool for developing ontologies, but for our use cases it was just not well suited, and we also found that it does not scale very well. So we began to implement some simple markups to allow developing ontologies in KnowWE too. After recognizing that authoring ontologies the 'wiki way' indeed works pretty nicely, we decided to again share these tools with everybody else on d3web.de. And that is why today you can author/develop both d3web knowledge bases and ontologies in KnowWE, although there is no actual connection/interoperability between the two as of now. That would of course be nice, and maybe we will add it in the future, but for now KnowWE is just a development environment for these two knowledge representations.
Maybe you can see KnowWE as similar to an IDE like Eclipse or IntelliJ, where the same application can be used to develop in many different programming languages. KnowWE does the same for different knowledge representations.
A problem is maybe that, historically, we didn't differentiate very well between KnowWE and d3web, because KnowWE was mostly used to build d3web knowledge bases. We also like to call KnowWE and its distribution package d3web-KnowWE, for example. But maybe this should change...
Thanks for pointing this out; I will try to correct/clarify this on d3web.de.

Which should I use to implement a collaborative filtering on top of Neo4j?

I'm working on a project (a social network) which uses Neo4j (v1.9) as the underlying datastore, together with Spring Data Neo4j.
I'm trying to add a tag system to the project and I'm searching for ways to efficiently implement tag recommendation using collaborative filtering strategies.
After a lot of research, I've come up with these options:
Cypher. Neo4j's own query language. No other framework is needed, and computation times may well be better than with the other options. I could probably implement the queries easily using Spring Data Neo4j.
Apache Mahout. It offers machine learning algorithms focused primarily on collaborative filtering, clustering, and classification. However, it isn't designed for graph databases and could potentially be slow.
Apache Giraph. Open source counterpart of Google Pregel.
Apache Spark. It is a fast and general engine for large-scale data processing.
reco4j. It is the best-suited solution so far, but the project seems dead.
Apache Spark GraphX + Mazerunner. Suggested by @johnymontana's answer. I'm reading up on it; the main issue is that I don't know whether it supports collaborative filtering.
GraphAware Reco. Suggested by @ChristopheWillemsen in a comment. From the official site: "an extensible high-performance recommendation engine skeleton for Neo4j, allowing for computing and serving real-time as well as pre-computed recommendations."
However, I haven't yet figured out whether it works with older versions of Neo4j (I can't upgrade the Neo4j version at the moment).
So, what do you suggest and why? Feel free to suggest other interesting frameworks not listed above.
Cypher is very fast when it comes to local traversals, but it is not optimized for global graph operations. If you want to do something like compute similarity metrics between all pairs of users, then a graph processing framework (like Apache Spark GraphX) would be better. There is a project called Mazerunner that connects Neo4j and Spark that you might want to take a look at.
For a pure Cypher approach, here and here are a couple of recent blog posts demonstrating Cypher queries for recommendations.
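To make the pure-Cypher option a bit more concrete, below is a minimal sketch of an item co-occurrence query ("users who share tags with me also used these tags"), run through the embedded ExecutionEngine that ships with Neo4j 1.9. The store path, the "users" legacy index, and the TAGGED relationship type are assumptions about your domain model, so treat it as a starting point rather than a drop-in solution.

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class TagRecommender {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
        ExecutionEngine engine = new ExecutionEngine(db);

        // Rank candidate tags by how often they co-occur with the tags
        // already used by the given user. This is Neo4j 1.9 syntax:
        // START plus a legacy index, since labels only arrive in 2.0.
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("name", "alice");
        ExecutionResult result = engine.execute(
                "START me=node:users(name={name}) " +
                "MATCH me-[:TAGGED]->t<-[:TAGGED]-other-[:TAGGED]->rec " +
                "WHERE NOT(me-[:TAGGED]->rec) " +
                "RETURN rec.name AS tag, count(*) AS freq " +
                "ORDER BY freq DESC LIMIT 10", params);

        for (Map<String, Object> row : result) {
            System.out.println(row.get("tag") + " (" + row.get("freq") + ")");
        }
        db.shutdown();
    }
}
```

This is simple co-occurrence rather than a fully similarity-weighted collaborative filter, but it runs entirely inside Neo4j with no extra framework.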

neo4j core java api sample projects

We need to write some RESTful services to access / create data on Neo4j. I have found many examples using the Traversal framework, but I would like to explore the Java Core API, as it is mentioned that the performance of the Java Core API is far better than the Traversal framework's, as per this link.
Is it true that the Java Core API is better than the Traversal framework? Can someone point me to useful tutorials on the Java Core API for Neo4j?
Consider asking a different question here.
I don't dispute the performance finding that the Traversal API is slower than the core API, but keep in mind that this holds only for the kinds of things they were trying to do in that test.
Which API you should use depends on what you're trying to do. Without providing information on that, we can't suggest which will be the fastest for you.
Here are your trade-off options: if you use the core API, you can perform exactly the low-level operations on the graph that you want. On the flip side, you have to do all of the work. If the operations you're trying to do are complex, far-reaching, or order-sensitive, you'll find yourself writing so much code that you'll re-implement a buggy version of the Traversal API on your own. Avoid this at all costs! The performance of the Traversal API is almost certainly better than what you'll write on your own.
On the other hand, if the operations you're performing are very simple (look up a node, grab its immediate neighbors by some edge type, then return them) then the core API is an excellent choice. In this (very simple) case, you don't need all the bells and whistles that Traversal gives you.
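As a rough illustration of that simple case, a core API lookup-and-expand is only a few lines. The KNOWS relationship type and the lookup by node id below are placeholders for whatever your domain actually uses:

```java
import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class Neighbours {
    // Placeholder relationship type for the example.
    enum RelTypes implements RelationshipType { KNOWS }

    // Look up a node and return its immediate outgoing neighbours.
    static List<Node> friendsOf(GraphDatabaseService db, long userId) {
        List<Node> friends = new ArrayList<Node>();
        Transaction tx = db.beginTx();
        try {
            Node user = db.getNodeById(userId);
            for (Relationship r :
                    user.getRelationships(RelTypes.KNOWS, Direction.OUTGOING)) {
                friends.add(r.getEndNode());
            }
            tx.success();
        } finally {
            tx.finish();
        }
        return friends;
    }
}
```

Anything much more involved than this (variable depth, uniqueness rules, pruning) is where the Traversal API starts paying for itself.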
Bigger than just your question though: in general it's good to avoid "premature optimization". If a library or framework gives you a technique like the Traversal API, as a starting point it's a good bet to learn that abstraction and use it, because the developers gave it to you to make your life easier, not to make your code slower. If it turns out to be more than you need, or performance is really lagging -- then consider using the core API.
In the long run, if you're going to write RESTful services over top of Neo4J, you'll probably end up knowing both APIs. Bottom line - it's not a matter of choosing which one you should use, it's a matter of understanding their differences and which situations play to their strengths.

Coding Standards for Gremlin/Cypher

We are in the process of developing a review tool for Gremlin/Cypher, as we predominantly work with Neo4j graph databases in our project, aiming to reduce manual review effort and deliver quality code.
Are there any lists of coding standards (formatting, performance tips, etc.) for Gremlin and Cypher scripts which can be used as a checklist when reviewing these scripts?
I don't think you're going to find one specific answer, as discussing coding standards can lead to very subjective (and debate-laden) answers. That said, I'll go with something more objective.
The first step would be to decide between Gremlin and Cypher, since they are neither the same thing nor the same style. When making that decision (and maybe the decision is to use both), you should take a close look at Neo4j 2.0 development (currently at Milestone 4), as Cypher is maturing rapidly and a lot of work is being put into it, from both an expressiveness and a performance point of view.
Assuming you go with Cypher, I'd suggest you look at the samples published by Neo Technology, especially the Cypher learning module. I don't know of any published guidelines, but I'd think most of them would be similar to any scripting guidelines you already have (naming conventions, spacing, etc.). Going further, you're likely to use Cypher via code as well as through the console, so you'll want to continue following your traditional programming style guidelines, as well as standardizing on the language-specific library you'll be using.
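There is no official Cypher style guide that I know of, but as an illustration of the kind of conventions you see in published examples (uppercase keywords, UPPER_SNAKE_CASE relationship types, camelCase property names, one clause per line), here is a hypothetical query as you might embed it in Java; the User/Tag schema is made up for the example:

```java
public final class CypherStyle {

    // One clause per source line, keywords uppercase, parameters rather
    // than string-concatenated literals; the schema is hypothetical.
    static final String RECOMMEND_TAGS =
            "MATCH (me:User {name: {name}})-[:TAGGED]->(t:Tag) " +
            "MATCH (t)<-[:TAGGED]-(other:User)-[:TAGGED]->(rec:Tag) " +
            "WHERE NOT (me)-[:TAGGED]->(rec) " +
            "RETURN rec.name AS tag, count(*) AS freq " +
            "ORDER BY freq DESC " +
            "LIMIT 10";
}
```

A checklist built from rules like these (plus things such as always using parameters instead of concatenated literals) is the sort of thing your review tool could enforce mechanically.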
I can only give you an answer related to Gremlin. First, it is important to note that almost all examples (in the Gremlin wiki, GremlinDocs, the gremlin-users mailing list, etc.) are meant for the REPL. Example traversals from these sources tend to be lengthy one-liners, written that way so they can easily be copied and pasted to the command line for execution. It is somewhat satisfying to get your answer in one line in the REPL, but for production code that must be maintained over time, consider avoiding the temptation to do so unless there is some specific reason that dictates it.
Second, from a style/formatting perspective, Gremlin is a DSL built on Groovy, so whatever style you like for Groovy should generally work for Gremlin. Start with the Groovy recommendations and tweak them to your needs/liking. I would expect a tool like CodeNarc to help with general style checks and with identifying common Groovy coding problems.
Also see this message from the mailing list: https://groups.google.com/forum/#!searchin/neo4j/coding$20standards/neo4j/JYz2APHV-_k/V1BKRwyv5hAJ

Metamodelling tools

What tools are available for metamodelling?
Especially for developing diagram editors; at the moment I am trying out Eclipse GMF.
What other options are out there?
Is any comparison available?
Your question is simply too broad for a single answer, for several reasons.
First, meta-modelling is not a well-defined term but rather a very fuzzy one, covering the modelling of models of models and reaching out to terms like MDA.
Second, there are numerous options for developing diagram editors; going the Eclipse way is surely a nice option.
To get you at least started in the Eclipse department:
have a look at MOF, the architecture for "meta-modelling" from the OMG (the guys that maintain UML)
from there, approach EMOF, a subset which is supported by the Eclipse Modeling Framework in the incarnation of Ecore
building something on top of GMF might indeed be a good idea, because that's the route existing diagram editors for the Eclipse platform take (e.g. Omondo's EclipseUML)
there are a lot of tools in the Eclipse environment that can utilize Ecore; GMF itself builds on top of Ecore
Dia has an API for this; I was able, fairly trivially, to hack their UML editor into a basic ER modelling tool by changing the arrow styles. With a DB reverse-engineering tool I found on SourceForge (it took the schema and spat out Dia files), you could use this to document databases. While what I did was fairly trivial, the API was quite straightforward, and it didn't take me long to work out how to make the change.
If you're of a mind to try out Smalltalk, there is a Smalltalk meta-CASE framework called DOME which does this sort of thing. If you download VisualWorks, DOME is one of the contributed packages.
GMF is a nice example. At its core sits EMF/Ecore, as computerkram says. Ecore is also used as the base of Eclipse's UML2. The prestige use case and proof of concept for GMF is certainly UML2 Tools.
Although it is generally a UML tool, I would look at StarUML. It supports additional modules beyond what is already built in. If it doesn't have what you need built in or as a module, I suppose you could make your own, but I don't know how difficult that is.
Meta-modeling is mostly done in Smalltalk.
You might want to take a look at MOOSE (http://moose.unibe.ch). There are a lot of tools for program understanding being developed around it; most are Smalltalk-based, with some Java and C++ work as well.
Two of the most impressive tools are CodeCity and Mondrian. CodeCity can visualize code development over time; Mondrian provides scriptable visualization technology.
And of course there is the classic HotDraw, which is also available in Java.
For web development there is also Magritte, providing meta-descriptions for Seaside.
I would strongly recommend you look into DSM (Domain-Specific Modeling) as a general topic; meta-modeling is directly related. There are Eclipse-based tools like GMF that currently require Java coding but integrate nicely with other Eclipse tools and UML. However, there are two other classes of tools out there:
MetaCase, which I would call a pure DSM tool, as it focuses on letting a developer/modeler create a usable graphical model without nearly as much coding. Additionally, it can be easily deployed for others to use. GMF and Microsoft's beta software factory/DSM tool fall into this category.
Pure meta-modeling tools, which are not intended for DSM tooling, code generation, and the like. I do not follow these as closely, as I am interested in applications that generate tooling for SMEs, domain experts, and others to use and contribute value to an active project, not in modeling for modeling's sake or just documentation and theory.
If you want to learn more about the first category, the tooling applications for DSMs/meta-modeling, check out my post "DSMForum.org great resources, worth a look", or just navigate directly to DSMForum.org.
In case you are interested in something related to modelling rather than code generation, have a look at adoxx.org. As a metamodelling platform it provides functionality and mechanisms to quickly develop your own DSL and lets you focus on the model's needs (business requirements, conceptual-level design/specification). There is an active community from academia and practice developing prototypical as well as commercial applications based on the platform. Could be interesting...
