Are there any utilities to import a database from Neo4j into ArangoDB? The arangoimp utility expects the data for edges and vertices in a different format than what Neo4j exports.
Thanks!
Note: This is not an answer per se, but a comment wouldn't allow me to structure the information I gathered in a readable way.
Resources online seem to be scarce with respect to the transition from neo4j to arangodb.
One possible way is to combine APOC (https://github.com/neo4j-contrib/neo4j-apoc-procedures) and neo4j-shell-tools (https://github.com/jexp/neo4j-shell-tools):
1. Use APOC to create a Cypher export file for the database (see https://neo4j.com/developer/kb/export-sub-graph-to-cypher-and-import/).
2. Use the neo4j-shell-tools Cypher import with the -o switch; this should generate CSV files.
3. Analyse the CSV files and either massage them with csvtool, or create JSON data with one of the numerous csv2json converters available (npm, ...) and massage those files with jq.
4. Feed the files to arangoimp (a conversion sketch follows below); repeat step 3 if necessary.
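If you go the csv2json route in step 3, the sketch below shows the kind of massaging involved, done in plain Java. The column names _start, _end and _type and the collection names vertices and edges are assumptions about your export, not anything Neo4j or arangoimp guarantees, so adjust them to your data:

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: turn a Neo4j edge CSV export into JSON lines for arangoimp.
// Assumes a header row and simple comma-separated values without embedded
// commas; real exports may need a proper CSV parser (e.g. Commons CSV).
public class EdgeCsvToJson {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("edges.csv"));
        String[] header = lines.get(0).split(",");
        Map<String, Integer> col = new HashMap<>();
        for (int i = 0; i < header.length; i++) {
            col.put(header[i].trim(), i);
        }
        try (PrintWriter out = new PrintWriter("edges.jsonl")) {
            for (String line : lines.subList(1, lines.size())) {
                String[] f = line.split(",");
                // ArangoDB edges reference their endpoints as "collection/key".
                out.printf("{\"_from\":\"vertices/%s\",\"_to\":\"vertices/%s\",\"type\":\"%s\"}%n",
                        f[col.get("_start")], f[col.get("_end")], f[col.get("_type")]);
            }
        }
    }
}

The result can then be fed to arangoimp with something like arangoimp --file edges.jsonl --collection edges --create-collection true --type json; check arangoimp --help first, as the exact flag names differ between versions.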
There is also a GraphML-to-JSON converter (https://github.com/uskudnik/GraphGL/blob/master/examples/graphml-to-json.py) available, so you could use the aforementioned neo4j-shell-tools to export to GraphML, convert that representation to JSON, and massage the resulting files into the necessary format.
I'm sorry that I can't be of more help, but maybe these thoughts get you started.
Related
APOC procedures support the following export/import formats:
CSV
JSON
Cypher script
GraphML
Gephi
From what I've tried, CSV and Cypher script exports are very slow compared to GraphML, so I haven't followed up on those.
Using GraphML is fast, but it seems that some information gets lost between export and import: field types like integers suddenly become strings, even though useTypes is set to true on export. It is also not feasible to track changes with git (when exporting/importing schema and structural nodes only, no data), because exporting -> wiping the database -> importing (from the first export) -> exporting again completely changes the order of the exported data between the first and second export files.
JSON has an option to select between four formats: JSON_LINES (default), ARRAY_JSON, JSON, JSON_ID_AS_KEYS
Is JSON as fast as GraphML, and is it a typesafe format? Are any of the JSON format options objectively better than the others?
I am using Graph() in RDFLib and I am correctly getting results from the graph using SPARQL. Is it possible to get the results directly in HTML table format?
rdflib is a library for working with RDF in Python, not an HTML rendering engine. Usually, when you run a SPARQL query with graph.query(), you want to access the result in Python itself.
That said, there is a fork focusing on hosting RDF called rdflib-web. In it you can find htmlresults.py, which does pretty much what I think you want.
I'm trying to extract data from the Penn Treebank Wall Street Journal corpus. Most of it already has parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the javadocs, but I cannot figure out how to get the output the way I described.
For POS, I tried using the LexicalizedParser, and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate even these sub-optimal XML files, I had to write a script to strip all of the brackets from the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like for the WSJ files that already have the trees. I tried what is said here and looked at the corresponding javadocs, using a command similar to the one described. But I had to write a Python program to capture the stdout from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Yes, this is possible, but it's a bit tricky, and there is no out-of-the-box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with code of your own that loads these annotations from your annotated files.
On a very high level you have to do the following:
1. Load your trees with MemoryTreebank.
2. Loop through all the trees and for each tree create a sentence CoreMap, to which you add a TokensAnnotation, a TreeAnnotation and the SemanticGraphCoreAnnotations.
3. Create an Annotation object with a list containing the CoreMap objects for all sentences.
4. Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
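To make these steps concrete, here is a minimal sketch. The classes and methods used (MemoryTreebank, taggedLabeledYield, the GrammaticalStructure machinery) exist in CoreNLP, but exact signatures vary between releases, and the file names are placeholders, so treat this as a starting point rather than a drop-in solution:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class PtbToCoreNlpXml {
    public static void main(String[] args) throws Exception {
        // Step 1: load the gold trees instead of running the parser.
        MemoryTreebank treebank = new MemoryTreebank();
        treebank.loadPath("wsj_0001.mrg"); // placeholder path

        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

        // Step 2: build one sentence CoreMap per tree.
        List<CoreMap> sentences = new ArrayList<>();
        for (Tree tree : treebank) {
            List<CoreLabel> tokens = tree.taggedLabeledYield(); // words + POS tags
            CoreMap sentence = new ArrayCoreMap();
            sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
            sentence.set(TreeCoreAnnotations.TreeAnnotation.class, tree);

            // Derive the dependency graphs from the gold tree.
            GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
            sentence.set(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class,
                    new SemanticGraph(gs.typedDependencies()));
            sentence.set(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class,
                    new SemanticGraph(gs.typedDependenciesCollapsed()));
            sentence.set(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class,
                    new SemanticGraph(gs.typedDependenciesCCprocessed()));
            sentences.add(sentence);
        }

        // Step 3: wrap all sentences in a single Annotation.
        Annotation doc = new Annotation("");
        doc.set(CoreAnnotations.SentencesAnnotation.class, sentences);

        // Step 4: run only the remaining annotators, without requirement checks.
        Properties props = new Properties();
        props.setProperty("annotators", "lemma, ner, dcoref");
        props.setProperty("enforceRequirements", "false");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(doc);
        pipeline.xmlPrint(doc, new FileOutputStream("wsj_0001.xml"));
    }
}

Depending on your CoreNLP version, dcoref may additionally expect character offsets and sentence/token index annotations on the tokens; if the pipeline complains, set those annotations yourself while building the CoreMaps.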
I just started learning how to use Mahout. I'm not a Java programmer, however, so I'm trying to stay away from having to use the Java library.
I noticed there is a shell tool called regexconverter. However, the documentation is sparse and uninstructive. What exactly does specifying a regex option do, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list uses regexconverter to convert HTTP log requests to sequence files, I believe. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regex to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any need for Java coding.
Incidentally, the arff.vector command seems to allow me to convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert my CSV log files into. Should I use this method instead and skip the sequence file step completely?
Thanks for the help.
Can I convert Neo4J Database files to XML?
I agree, GraphML is the way to go, if you don't have a problem with the verbosity of XML. A simple way to do it is to open the Neo4j graph from Gremlin, where GraphML is the default import/export format, something like:
peters: ./gremlin.sh
gremlin> $_g := neo4j:open('/tmp/neo4j')
==>neograph[/tmp/neo4j, vertices:2, edges:1]
gremlin> g:save('graphml-export.xml')
As described here
Does that solve your problem?
With Blueprints, simply do:
Graph graph = new Neo4jGraph("/tmp/mygraph");
GraphMLWriter.outputGraph(graph, new FileOutputStream("mygraph.xml"));
graph.shutdown(); // cleanly release the embedded Neo4j store
Or, with Gremlin (which does the same thing under the hood):
g = new Neo4jGraph('/tmp/mygraph');
g.saveGraphML('mygraph.xml');
Finally, you can also pass a GraphDatabaseService instance to the Neo4jGraph constructor.
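For example (a short sketch; GraphDatabaseFactory and the path stand in for however you obtain your service instance):

GraphDatabaseService rawDb = new GraphDatabaseFactory().newEmbeddedDatabase("/tmp/mygraph");
Graph graph = new Neo4jGraph(rawDb); // wrap the existing service instead of a path
GraphMLWriter.outputGraph(graph, new FileOutputStream("mygraph.xml"));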
I don't believe anything exists out there for this, at least not as of a few months ago when I was messing with it. From what I saw, there are two main roadblocks:
1. XML is hierarchical; you can't readily represent graph data in this format.
2. Lack of explicit IDs for nodes. Even though implicit IDs exist, relying on them would be like using ROWID in Oracle for import/export: not guaranteed to be the same.
Some people have suggested that GraphML would be the proper format for this, and I'm inclined to agree. If you don't have graph-like structures and your data would be fine represented in an XML/hierarchical format... well, then that's just bad luck. Since the majority of users who would tackle this sort of enhancement are working with data that wouldn't store that way, I don't see an XML solution coming out; it's more likely we'll see a format supporting all use cases first.
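For reference, this is roughly what a GraphML document looks like (a minimal hand-written example, not an actual Neo4j export): nodes and edges are flat sibling elements with explicit IDs, which sidesteps both roadblocks above.

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"><data key="name">Alice</data></node>
    <node id="n1"><data key="name">Bob</data></node>
    <edge id="e0" source="n0" target="n1"/>
  </graph>
</graphml>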
Take a look at NoSqlUnit
It has tools for converting GraphML to neo4j and back again.
In particular, there are com.lordofthejars.nosqlunit.graph.parser.GraphMLWriter and com.lordofthejars.nosqlunit.graph.parser.GraphMLReader, which write/read GraphML XML files from/to a neo4j database.