I have a JSON-LD ontology that I'd like to convert to HTML documentation.
I've found a couple of tools that can do that: Ontodocs (Python, generates nice documentation in different themes) and Widoco (Java). They seem to do the job pretty well, but both have some issues, so I'm now trying to find other, similar tools.
Many of the tools I have found are not really suited to JSON-LD, but rather to RDF and OWL ontologies.
It seems that the Hydra specification uses ReSpec for this purpose, but I'm not too familiar with it, and I'd prefer Python for generating the documentation.
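As a possible workaround (not a documentation generator in itself), the JSON-LD could presumably be converted to Turtle first with rdflib, which bundles a JSON-LD parser in recent versions, so that the RDF/OWL-oriented tools become usable. A minimal sketch, with placeholder file names:

    # Minimal sketch: convert a JSON-LD ontology to Turtle with rdflib so that
    # RDF/OWL-oriented tools can consume it. Requires rdflib >= 6.0 (bundled
    # JSON-LD parser); the file names are placeholders.
    from rdflib import Graph

    g = Graph()
    g.parse("ontology.jsonld", format="json-ld")
    g.serialize(destination="ontology.ttl", format="turtle")
    print(f"Converted {len(g)} triples")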
So, does anyone know of any alternatives to Ontodocs and Widoco?
My goal is to build an automated knowledge graph. I have decided to use Neo4j as my database, and I intend to load JSON files from my local directory into Neo4j. The data I will be using are the Yelp datasets (the JSON files are quite large).
I have seen some Neo4j examples with GraphAware and OpenNLP. I read that Neo4j has good support for Java apps. I have also read that Neo4j supports Python (I intend to use NLTK). Is it advisable to use Neo4j with Java (Maven/Gradle) and OpenNLP? Or should I use it with py2neo and NLTK?
I am really sorry that I don't have any prior experience with these tools. Any advice or recommendation will be greatly appreciated. Thank you so much!
Welcome to Stack Overflow! Unfortunately, this question is a suggestion/opinion question so isn't appropriate for this forum.
However, this is an area I have worked in, so I can confidently say that Java (or Kotlin) is the best way to go for Neo4j. Java is Neo4j's native language, and there is significantly more community support and more libraries available for it.
However, NLTK is much more powerful than OpenNLP. So if your use case is simple enough for OpenNLP, then a pure Java/Kotlin approach is a perfect fit. Alternatively, you can use Java as the interfacing layer for the stored graph, but use Python with NLTK for the language work feeding into the graph. This would, of course, dramatically increase the complexity of your project.
Ultimately, the best approach depends on your exact use-case and which trade-offs make the most sense for you.
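To give a feel for what the Python/NLTK side of that hybrid approach could look like, here is a rough sketch using py2neo. The connection details, credentials, and the simple (Business)-[:MENTIONS]->(Term) model are assumptions for illustration, not the actual Yelp schema:

    # Rough sketch: NLTK pulls candidate terms out of review text, py2neo
    # writes them into Neo4j. Requires nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger'); credentials and the graph
    # model are assumed.
    import json
    import nltk
    from py2neo import Graph, Node, Relationship

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed connection

    def load_reviews(path):
        with open(path, encoding="utf-8") as f:
            for line in f:  # the Yelp dumps are one JSON object per line
                review = json.loads(line)
                tokens = nltk.word_tokenize(review["text"])
                nouns = {w.lower() for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}
                business = Node("Business", id=review["business_id"])
                graph.merge(business, "Business", "id")
                for noun in nouns:
                    term = Node("Term", name=noun)
                    graph.merge(term, "Term", "name")
                    graph.create(Relationship(business, "MENTIONS", term))

Whether that extra moving part is worth it over a pure Java pipeline is exactly the trade-off described above.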
I'm a member of a group of enthusiast writers who have decided to collaborate on a cookbook-style book for a programming language.
We're trying to pick a pipeline for the collaboration.
I like how ProGit is made: Markdown plus some custom pre-processing, run through Pandoc. But I'm concerned that Markdown is too simple for our case.
I have looked at Sphinx, but I have no experience using it.
I know that LaTeX would work, but I'm afraid it would scare off contributors. It may also be too powerful: it's too easy to build a byzantine pipeline if you don't have the necessary experience (which I do not).
Please do not suggest solutions where a person has to write XML by hand or must use a specific GUI (optional GUIs are fine, of course). Commercial and non-cross-platform solutions are not an option either.
It's hard to say whether pandoc's extended version of markdown would be too simple for your case unless you say what features you need. Note also that, if you're able to do a bit of very simple Haskell scripting, you can use the pandoc API to add features.
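If Haskell isn't appealing, pandoc can also run filters written in Python through the pandocfilters package. A minimal sketch (essentially that package's classic "behead" example, which turns sub-headers into emphasized paragraphs):

    # Minimal pandoc JSON filter in Python using the pandocfilters package:
    # turn any header below level 1 into an emphasized paragraph.
    from pandocfilters import toJSONFilter, Emph, Para

    def behead(key, value, fmt, meta):
        if key == 'Header' and value[0] >= 2:
            return Para([Emph(value[2])])

    if __name__ == "__main__":
        toJSONFilter(behead)

You would run it with something like pandoc --filter ./behead.py input.md -o output.html.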
Are there any tools available for generating RDF from natural language? A list of RDFizers compiled by the SIMILE project only mentions one, the Monrai Cypher. Unfortunately, it seems to have been a proprietary tool developed by Monrai Technologies, which has since disappeared, and I can't find any download links. Has anyone seen anything similar?
You want some ontology learning and population tools.
This online article lists four different systems:
Text2Onto
Abraxas
KnowItAll
OntoLearn
You may also want to check out this book, which reviews several ontology learning tools:
Ontology Learning from Text: Methods, Evaluation and Applications, by Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini
You might look into OpenCalais, Zemanta, and Hakia, which all have nice APIs for extracting semantic data from internet resources. I'm not familiar with Monrai Cypher, but these might possibly help.
You could use the Python NLTK to parse the text and emit the RDF triples.
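A very rough sketch of that idea, pairing NLTK's part-of-speech tagger with rdflib; the noun-verb-noun heuristic and the example.org namespace are purely illustrative:

    # Rough sketch: naive noun-verb-noun extraction with NLTK, emitted as RDF
    # via rdflib. Requires nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger'); the heuristic is illustrative.
    import nltk
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    def sentence_to_rdf(sentence):
        g = Graph()
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        nouns = [i for i, (_, t) in enumerate(tagged) if t.startswith("NN")]
        verbs = [i for i, (_, t) in enumerate(tagged) if t.startswith("VB")]
        if nouns and verbs:
            subj = next((i for i in nouns if i < verbs[0]), None)
            obj = next((i for i in nouns if i > verbs[0]), None)
            if subj is not None and obj is not None:
                g.add((EX[tagged[subj][0]], EX[tagged[verbs[0]][0]], Literal(tagged[obj][0])))
        return g

    print(sentence_to_rdf("Cats chase mice").serialize(format="turtle"))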
We extract various information from e-mails: flights, car rentals, hotels, and more. The method is to extract the body of the mail, usually in HTML form but sometimes plain text, or to use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) to get the information, which is provided in tabular form (think of a flight table, a hotel table, etc.). Note that even though we parse HTML, this is not web scraping.
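To make the method concrete, here is roughly what a single extraction step looks like, sketched in Python purely for illustration (the actual pipeline is WebQL, and the flight pattern and field names below are invented):

    # Illustration only: strip the HTML body to text, then apply a field-level
    # regular expression to produce tabular rows. The pattern is made up.
    import re
    from html import unescape

    TAG_RE = re.compile(r"<[^>]+>")
    FLIGHT_RE = re.compile(
        r"Flight\s+(?P<number>[A-Z]{2}\d{2,4}).*?"
        r"(?P<origin>[A-Z]{3})\s*(?:->|to)\s*(?P<dest>[A-Z]{3})",
        re.S,
    )

    def extract_flights(html_body):
        text = unescape(TAG_RE.sub(" ", html_body))                # step 1: HTML to plain text
        return [m.groupdict() for m in FLIGHT_RE.finditer(text)]   # step 2: field extraction

    print(extract_flights("<p>Flight LH1234 FRA -> JFK</p>"))
    # [{'number': 'LH1234', 'origin': 'FRA', 'dest': 'JFK'}]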
Currently we are using QL2's WebQL engine, but we are looking to replace it for business reasons. Can you recommend another engine? It must run on Linux and be accessible from Java (a Java API would be best, but a web service is a good solution as well). It must also support regular expressions for text extraction, not just extraction based on the HTML structure.
I recommend that you have a look at R. It has a large number of text mining packages: have a look at the Natural Language Processing task view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysing the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
In addition, R provides many tools for parsing HTML or XML. Have a look at this question for an example using the RCurl and XML packages.
Edit: You can integrate R with Java using JRI. It's a very widely used package, with many examples. You can also see these related questions.
Have a look at:
LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Just wanted to update: our final decision was to implement the parsing in Groovy and to add the required functionality (HTML to text, PDF to text, whitespace cleanup, etc.) either by implementing it in Java or by relying on third-party libraries.
I use a custom parser made with Flex and C++ for similar purposes. I'd suggest you take a look at parser generators in Java (JavaCC .jj files); see the javacc-faq. Nutch does it this way (NutchAnalysis.jj).
I love Ruby and its framework, but I don't think that Ruby on Rails is the best choice for developing a feed parser and indexer.
Maybe Python or Java would be a better choice. What language do you suggest?
I think Ruby is just fine for these kinds of tasks:
http://rubyrss.com/
http://www.ruby-doc.org/stdlib/libdoc/rss/rdoc/index.html
http://railscasts.com/episodes/173-screen-scraping-with-scrapi
If you are comfortable with Ruby, I see no reason to switch to Java, Python, et al. for most tasks. Keep in mind that many of the Ruby libraries sit on top of native implementations.
A feed (RSS?) is usually pretty well structured (compared to a regular web page, at least). Check out Web Harvest, a Java/BeanShell-based DOM parser (among other things). You can use it to automate grabbing data off the internet. There is a domain-specific language (defined in XML) that you'll have to learn. Its learning curve might be a bit steep, but I felt it was well worth the effort.
I am not very familiar with Java, but I can say Python is very well suited for the job.
There is a very fast XML parser called BeautifulStoneSoup, which is part of the BeautifulSoup library. And if you're only looking for a simple indexer, Python has a built-in SQLite engine (the sqlite3 module), which is also lightweight and very fast.
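A minimal sketch of that combination, using the current BeautifulSoup 4 API (where the old BeautifulStoneSoup is replaced by the "xml" parser mode, which needs lxml installed) together with the built-in sqlite3 module; the table layout is just an example:

    # Sketch: parse an RSS/XML feed with BeautifulSoup's XML mode and index the
    # items into SQLite. Requires beautifulsoup4 and lxml; the schema is illustrative.
    import sqlite3
    from bs4 import BeautifulSoup

    def index_feed(xml_text, db_path="feed_index.db"):
        soup = BeautifulSoup(xml_text, "xml")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, link TEXT, description TEXT)")
        for item in soup.find_all("item"):
            conn.execute(
                "INSERT INTO items VALUES (?, ?, ?)",
                tuple(item.find(tag).get_text() if item.find(tag) else None
                      for tag in ("title", "link", "description")),
            )
        conn.commit()
        conn.close()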