CSV files often use a tab delimiter; how can the Univocity Parsers CSV parser be configured to allow a tab delimiter?

CSV files often use a tab delimiter. How can Univocity Parsers be configured so that the following uses a tab delimiter?
CsvParserSettings parserSettings = new CsvParserSettings();
Parsing .csv files delimited by tabs is required here; although Univocity Parsers has a TSV reader, having more than one settings instance creates coding obstacles.
The code and stack trace are below.
Any help would be greatly appreciated.
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.common.processor.*;
import com.univocity.parsers.csv.*;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.lang.IllegalStateException;
import java.lang.String;
import java.util.List;
public class UnivocityParsers {
public Reader getReader(String relativePath) {
try {
return new InputStreamReader(this.getClass().getResourceAsStream(relativePath), "Windows-1252");
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("Unable to read input", e);
}
}
public void columnSelection() {
RowListProcessor rowProcessor = new RowListProcessor();
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setSkipEmptyLines(true);
parserSettings.getFormat().setDelimiter('\t');
// Here we select only the columns "AUTHOR" and "ISBN".
// The parser just skips the other fields
parserSettings.selectFields("AUTHOR", "ISBN");
CsvParser parser = new CsvParser(parserSettings);
parser.parse(getReader("list4.csv"));
List<String[]> rows = rowProcessor.getRows();
String[] strings = rows.get(0);
System.out.print(strings[0]);
}
public static void main(String arg[]) {
UnivocityParsers univocityParsers = new UnivocityParsers();
univocityParsers.columnSelection();
}
}
Stack trace:
Exception in thread "JavaFX Application Thread" java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:97)
at parse.Controller.getReader(Controller.java:34)
at parse.Controller.columnSelection(Controller.java:107)
... 56 more
Here is the file being parsed:
"REVIEW_DATE" "AUTHOR" "ISBN" "DISCOUNTED_PRICE"
"1985/01/21" "Douglas Adams" 345391802 5.95
"1990/01/12" "Douglas Hofstadter" 465026567 9.95
"1998/07/15" "Timothy ""The Parser"" Campbell" 968411304 18.99
"1999/12/03" "Richard Friedman" 60630353 5.95
"2001/09/19" "Karen Armstrong" 345384563 9.95
"2002/06/23" "David Jones" 198504691 9.95
"2002/06/23" "Julian Jaynes" 618057072 12.5
"2003/09/30" "Scott Adams" 740721909 4.95
"2004/10/04" "Benjamin Radcliff" 804818088 4.95
"2004/10/04" "Randel Helms" 879755725 4.5

The problem comes from the getReader method. It is not finding the file in your classpath.
This line is producing a null:
this.getClass().getResourceAsStream(relativePath)
Maybe you should use this (note the leading slash on the file name):
parser.parse(getReader("/list4.csv"));
Also note that the TSV parser is a different implementation. TSV is not just CSV with tab delimiters (it's fine if it works in your case). Just keep in mind that trying to read a TSV using a CSV parser is a bad idea, as characters such as '\n' or '\t' may be escaped as literal sequences of '\' and 'n'. When a CSV parser reads this, you will get the two characters ('\' + 'n') instead of the newline character ('\n').
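If you do decide to treat the file as a TSV, the library ships a separate parser for that. A minimal sketch using the univocity TSV classes, with the same column selection as above:
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;

TsvParserSettings tsvSettings = new TsvParserSettings();
tsvSettings.setHeaderExtractionEnabled(true);
tsvSettings.selectFields("AUTHOR", "ISBN");       // same column selection as the CSV settings
TsvParser tsvParser = new TsvParser(tsvSettings);
List<String[]> allRows = tsvParser.parseAll(getReader("/list4.csv"));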

Related

Can we do image processing with Palantir Foundry?

I'm exploring the Palantir Foundry platform and it seems to have a ton of options for rectangular or structured data. Does anyone have experience working with unstructured big data on the Foundry platform? How can we use Foundry for image analysis?
Although most examples are given using tabular data, in reality a lot of use cases use Foundry for both unstructured and semi-structured data processing.
You should think of a dataset as a container of files with an API to access and process the files.
Using the file-level API you can get access to the files in the dataset and process them as you like. If these files are images you can extract information from them and use it as you like.
A common use case is to have PDFs as files in a dataset and to extract information from the PDFs and store it as tabular information, so you can do both structured and unstructured search over it.
Here is an example of file access to extract text from PDFs:
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.util.syntacticpath.Paths;
import com.google.common.collect.AbstractIterator;
import com.palantir.spark.binarystream.data.PortableFile;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.UUID;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public final class ExtractPDFText {
private static String pdf_source_files_rid = "SOME RID";
private static String dataProxyPath = "/foundry-data-proxy/api/dataproxy/datasets/";
private static String datasetViewPath = "/views/master/";
@Compute
public void compute(
@Input("/Base/project_name/treasury_pdf_docs") FoundryInput pdfFiles,
@Output("/Base/project_name/clean/pdf_text_extracted") FoundryOutput output) throws IOException {
Dataset<PortableFile> filesDataset = pdfFiles.asFiles().getFileSystem().filesAsDataset();
Dataset<String> mappedDataset = filesDataset.flatMap((FlatMapFunction<PortableFile, String>) portableFile ->
portableFile.convertToIterator(inputStream -> {
String pdfFileName = portableFile.getLogicalPath().getFileName().toString();
return new PDFIterator(inputStream, pdfFileName);
}), Encoders.STRING());
Dataset<Row> dataset = filesDataset
.sparkSession()
.read()
.option("inferSchema", "false")
.json(mappedDataset);
output.getDataFrameWriter(dataset).write();
}
private static final class PDFIterator extends AbstractIterator<String> {
private InputStream inputStream;
private String pdfFileName;
private boolean done;
PDFIterator(InputStream inputStream, String pdfFileName) throws IOException {
this.inputStream = inputStream;
this.pdfFileName = pdfFileName;
this.done = false;
}
@Override
protected String computeNext() {
if (done) {
return endOfData();
}
try {
String objectId = pdfFileName;
String appUrl = dataProxyPath.concat(pdf_source_files_rid).concat(datasetViewPath).concat(pdfFileName);
PDDocument document = PDDocument.load(inputStream);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
String strippedText = text.replace("\"", "'").replace("\\", "").replace("“", "'").replace("”", "'").replace("\n", "").replace("\r", "");
done = true;
return "{\"id\": \"" + String.valueOf(UUID.randomUUID()) + "\", \"file_name\": \"" + pdfFileName + "\", \"app_url\": \"" + appUrl + "\", \"object_id\": \"" + objectId + "\", \"text\": \"" + strippedText + "\"}\n";
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
}
Indeed you can do image analysis on Foundry, since you have access to files and can use arbitrary libraries (for example Pillow or skimage for Python). This can be done at scale as well, since it can be parallelised.
A simple Python snippet to stitch two pictures together should get you started:
from transforms.api import transform, Input, Output
from PIL import Image
@transform(
output=Output("/processed/stitched_images"),
raw=Input("/raw/images"),
image_meta=Input("/processed/image_meta")
)
def my_compute_function(raw, image_meta, output, ctx):
image_meta = image_meta.dataframe()
def stitch_images(clone):
left = clone["left_file_name"]
right = clone["right_file_name"]
image_name = clone["image_name"]
with raw.filesystem().open(left, mode="rb") as left_file:
with raw.filesystem().open(right, mode="rb") as right_file:
with output.filesystem().open(image_name, 'wb') as out_file:
left_image = Image.open(left_file)
right_image = Image.open(right_file)
(width, height) = left_image.size
result_width = width * 2
result_height = height
result = Image.new('RGB', (result_width, result_height))
result.paste(im=left_image, box=(0, 0))
result.paste(im=right_image, box=(width, 0))  # offset by the left image's width
result.save(out_file, format='jpeg', quality=90)
image_meta.rdd.foreach(stitch_images)
The image_meta dataset is just a dataset that has 2 file names per row. To extract filenames from a dataset of raw files you can use something like:
@transform(
output=Output("/processed/image_meta"),
raw=Input("/raw/images"),
)
def my_compute_function(raw, output, ctx):
file_names = [(file_status.path, 1) for file_status in raw.filesystem().ls(glob="*.jpg")]
# create and write a spark dataframe based on the array
# (assuming the standard transforms ctx.spark_session and output.write_dataframe APIs)
df = ctx.spark_session.createDataFrame(file_names, ["file_name", "count"])
output.write_dataframe(df)
As others have mentioned, Palantir Foundry's focus is on tabular data, and it doesn't currently provide GPU or other tensor processing unit access. So doing anything intensive like an FFT transform or deep learning would be ill advised at best, if not downright impossible at worst.
That being said, you can upload image files into dataset nodes for read/write access. You could also store their binary information as a blob type in a DataFrame in order to store files in a given record field. Given that there is a multitude of Python image-processing and matrix-math libraries available on the platform, and given that it's also possible to upload library packages to the platform manually through the Code Repo app, it is conceivable that someone could make use of simple manipulations at a somewhat large scale, as long as it wasn't overly complex or memory intensive.
Note that Foundry seems to have no GPU support at the moment, so if you are thinking about running deep learning based image processing, this will be quite slow on CPUs.

Could someone share how to delete a paragraph from a textbox

I am currently working on a project to manipulate docx files with the Apache POI project. I have used the API to remove text from a run inside of a text box, but cannot figure out how to remove a paragraph inside a text box. I assume that I need to use the class CTP to obtain the paragraph object to remove. Any examples or suggestions would be greatly appreciated.
In Replace text in text box of docx by using Apache POI I have shown how to replace text in Word text-box contents. The approach is getting a list of XML text run elements from the XPath .//*/w:txbxContent/w:p/w:r using an XmlCursor which selects that path from /word/document.xml.
The same of course can be done using the path .//*/w:txbxContent/w:p, which gets the text paragraphs in the text-box contents. Having those low-level paragraph XML objects, we can convert them into XWPFParagraphs to get the plain text out of them. Then, if the plain text matches some criterion, we can simply remove the paragraph's XML.
Code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import java.util.List;
import java.util.ArrayList;
public class WordRemoveParagraphInTextBox {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordRemoveParagraphInTextBox.docx"));
for (XWPFParagraph paragraph : document.getParagraphs()) {
XmlCursor cursor = paragraph.getCTP().newCursor();
cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p");
List<XmlObject> ctpsintxtbx = new ArrayList<XmlObject>();
while(cursor.hasNextSelection()) {
cursor.toNextSelection();
XmlObject obj = cursor.getObject();
ctpsintxtbx.add(obj);
}
for (XmlObject obj : ctpsintxtbx) {
CTP ctp = CTP.Factory.parse(obj.xmlText());
//CTP ctp = CTP.Factory.parse(obj.newInputStream());
XWPFParagraph bufferparagraph = new XWPFParagraph(ctp, document);
String text = bufferparagraph.getText();
if (text != null && text.contains("remove")) {
obj.newCursor().removeXml();
}
}
}
FileOutputStream out = new FileOutputStream("WordRemoveParagraphInTextBoxNew.docx");
document.write(out);
out.close();
document.close();
}
}

Xtext problem referencing grammar A from validator of grammar B

In Xtext, how do I follow a reference from grammar B to grammar A, within a validator of grammar B (which is in the ui-plugin)? Consider the following example.
Grammar A is org.xtext.people.People
grammar org.xtext.people.People with org.eclipse.xtext.common.Terminals
generate people "http://www.xtext.org/people/People"
People:
people+=Person*;
Person:
'person' name=ID ';';
and an instance
person Alice {citizenship "MN"; id "12345"; }
person Bob {citizenship "CH"; id "54321";}
person Malice {citizenship "XXX"; id "66666"; }
At an airport, entries of people are recorded.
enter Alice;
enter Bob;
enter Malice;
Entries are modelled with a second grammar B org.xtext.entries.Entries
grammar org.xtext.entries.Entries with org.eclipse.xtext.common.Terminals
generate entries "http://www.xtext.org/entries/Entries"
import "http://www.xtext.org/people/People"
Entries:
entries+=Entry*;
Entry:
'enter' person=[Person] ';';
After ensuring that the Eclipse project org.xtext.entries has the project org.xtext.people on its classpath, and ensuring that the org.xtext.entries plugin has org.xtext.people as a dependency, all works as expected.
There is a travel ban on people from country XXX, although certain deserving people are excluded. Only the CIA knows who is excluded from the ban. Entries must not be allowed for people from XXX unless excluded.
The updated grammar is
grammar org.xtext.entries.Entries with org.eclipse.xtext.common.Terminals
generate entries "http://www.xtext.org/entries/Entries"
import "http://www.xtext.org/people/People"
Entries:
entries+=Entry*;
Entry:
travelBanOverride=TravelBanOverride?
'enter' person=[Person] ';';
TravelBanOverride: '#TravelBanOverride' '(' code=STRING ')';
with validator
package org.xtext.entries.validation
import org.eclipse.xtext.validation.Check
import org.xtext.entries.entries.EntriesPackage
import org.xtext.entries.entries.Entry
import org.xtext.entries.CIA
class EntriesValidator extends AbstractEntriesValidator {
public static val BAN = 'BAN'
public static val ILLEGAL_OVERRIDE = 'ILLEGAL_OVERRIDE'
@Check
def checkBan(Entry entry) {
if (entry.person.citizenship == "XXX") {
if (entry.travelBanOverride === null) {
error('Violation of Travel Ban', EntriesPackage.Literals.ENTRY__PERSON, BAN)
}
else {
val overridecode = entry.travelBanOverride.code;
val valid = CIA.valid(entry.person.name, entry.person.id, overridecode)
if (!valid) {
error('Illegal override code', EntriesPackage.Literals.ENTRY__TRAVEL_BAN_OVERRIDE, ILLEGAL_OVERRIDE)
}
}
}
}
}
where the driver for the external CIA web-app is modelled for example by
package org.xtext.entries;
public class CIA {
public static boolean valid(String name, String id, String overrideCode) {
System.out.println("UNValid["+name+","+overrideCode+"]");
return name.equals("Malice") && id.equals("66666") && overrideCode.equals("123");
}
}
The validations work as expected.
I now wish to provide a quick-fix for BAN that checks for an override code with the CIA.
package org.xtext.entries.ui.quickfix
import org.eclipse.xtext.ui.editor.quickfix.DefaultQuickfixProvider
import org.eclipse.xtext.ui.editor.quickfix.Fix
import org.xtext.entries.validation.EntriesValidator
import org.eclipse.xtext.validation.Issue
import org.eclipse.xtext.ui.editor.quickfix.IssueResolutionAcceptor
import org.xtext.entries.entries.Entry
import org.xtext.entries.Helper
class EntriesQuickfixProvider extends DefaultQuickfixProvider {
@Fix(EntriesValidator.BAN)
def tryOverride(Issue issue, IssueResolutionAcceptor acceptor) {
acceptor.accept(issue, 'Try override', 'Override if CIA says so.', 'override.png')
[element ,context |
val entry = element as Entry
// val person = entry.person // no such attribute
//val person = Helper.get(entry); // The method get(Entry) from the type Helper refers to the missing type Object
]
}
}
The first commented line does not compile: there is no attribute person. The second commented line is an attempt to solve the problem by getting a helper class in org.xtext.entries to get the person, but this does not compile either, giving a "The method get(Entry) from the type Helper refers to the missing type Object" error message.
For completeness, here is that helper.
package org.xtext.entries
import org.xtext.people.people.Person
import org.xtext.entries.entries.Entry
class Helper {
static def Person get(Entry entry) {
return entry.person;
}
}
Further, entry.travelBanOverride compiles fine, but entry.person does not. Clicking on Entry in Eclipse takes one to the expected code, which has both travelBanOverride and person.
The issue does not occur with a Java class in the same project and package.
package org.xtext.entries.ui.quickfix;
import org.xtext.entries.entries.Entry;
import org.xtext.people.people.Person;
public class Test {
public static void main(String[] args) {
Entry entry = null;
Person p = entry.getPerson();
}
}
Rewriting the quickfix in Java solves the problem.
package org.xtext.entries.ui.quickfix;
import org.eclipse.xtext.ui.editor.quickfix.DefaultQuickfixProvider;
import org.eclipse.xtext.ui.editor.quickfix.Fix;
import org.xtext.entries.validation.EntriesValidator;
import org.eclipse.xtext.validation.Issue;
import org.eclipse.xtext.ui.editor.quickfix.IssueResolutionAcceptor;
import org.xtext.entries.entries.Entry;
import org.xtext.entries.Helper;
import org.eclipse.xtext.ui.editor.model.edit.IModificationContext;
import org.eclipse.xtext.ui.editor.model.edit.ISemanticModification;
import org.eclipse.emf.ecore.EObject;
import org.xtext.entries.entries.Entry;
import org.xtext.people.people.Person;
public class EntriesQuickfixProvider extends DefaultQuickfixProvider {
@Fix(EntriesValidator.BAN)
public void tryOverride(final Issue issue, IssueResolutionAcceptor acceptor) {
acceptor.accept(issue,
"Try to override",
"Override",
"override.gif",
new ISemanticModification() {
public void apply(EObject element, IModificationContext context) {
Entry entry = (Entry) element;
System.out.println(entry.getPerson());
}
}
);
}
}
How do I follow a reference from grammar B (Entries) to grammar A (People), within a validator of grammar B?
My mistake is the following.
After ensuring that the Eclipse project org.xtext.entries has the
project org.xtext.people on its classpath, and ensuring that the
org.xtext.entries plugin has org.xtext.people as a dependency, all
works as expected.
The org.xtext.entries.ui ui-plugin must also have org.xtext.people on its Java (Eclipse project) build path. Exporting it and making it a plugin dependency is not enough.
Note that this setting should be made early, before crafting the quick-fix, because the Xtend editor has refreshing issues.
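For illustration, the required entry in the org.xtext.entries.ui project's .classpath file would look roughly like the line below (a sketch only; it is normally added through the Java Build Path > Projects dialog rather than by hand, and the project name is the one from this example):
<classpathentry combineaccessrules="false" kind="src" path="/org.xtext.people"/>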

Optimizations for merging a graph with a million nodes and a csv with a million rows

Consider I have graph with following nodes and relationships:
(:GPE)-[:contains]->(:PE)-[:has]->(:E)
with following attributes:
GPE: joinProp1, outProp1, randomProps
PE : joinProp2, outProp2, randomProps
E : joinProp3, outProp3, randomProps
Now consider I have csv of following format:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, randomProps
Now consider I have a million rows in this csv file. I also have a million instances of each of (:GPE), (:PE), (:E) in the graph. I want to merge the graph and the csv into a new csv. For that I want to map / equate
joinCol1 with joinProp1
joinCol2 with joinProp2
joinCol3 with joinProp3
something like this (pseudo cypher) for each row in csv:
MATCH (gpe:GPE {joinProp1:joinCol1})-[:contains]->(pe:PE {joinProp2:joinCol2})-[:has]->(e:E {joinProp3:joinCol3}) RETURN gpe.outProp1, pe.outProp2, e.outProp3
So the output csv format would be:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, outProp1, outProp2, outProp3
What is a rough minimum execution time estimate (minutes or hours) in which I can complete this task if I create indices on all joinProps and use parameterized Cypher (considering I am implementing this simple logic with the Java API)? I just want rough estimates. We have a similar (possibly un-optimized) task implemented and it takes several hours. The challenge is to bring down that execution time. What can I do to optimize and bring the execution time down to minutes? Any quick optimization points / links? Will using an approach other than the Java API provide performance improvements?
Well I tried some things which improved performance considerably.
Some Neo4j performance guidelines that apply to my scenario:
Process in batches: Avoid making a Cypher call (through a Bolt API call) for each csv row. Iterate through a fixed number of csv rows at a time, forming a list of maps where each map represents one csv row. Then pass this list of maps to Cypher as a parameter, UNWIND the list inside the Cypher statement, and perform the desired action. Repeat for the next set of csv rows (see the sketch after this list).
Don't return node / relationship objects from Cypher to the Java side. Instead try returning the list of maps required as the final output. When we return a list of nodes / relationships, we have to iterate through them again to merge their properties with the csv columns to form the final output row (or map).
Pass the csv column values to Cypher: To achieve point 2, send the csv column values (to be merged with graph properties) to Cypher. Perform the matching in Cypher and form the output map by merging the matched nodes' properties and the input csv columns.
Index the node / relationship properties which are to be matched (Official docs)
Parameterize the Cypher (API Example, Official docs)
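As a compact illustration of points 1 to 3, the batching pattern looks roughly like this. This is a sketch only, using the same 1.x Bolt driver classes as the experiment further below; the label and property names are taken from the question and would need to match your graph. The methods could sit in a class like the Csv2CsvUtil2 shown later:
static void mergeInBatches(Session session, List<Map<String, String>> csvRows, StringBuilder out) {
    // One Cypher statement per batch: UNWIND the row maps, match the pattern, return plain values.
    String cypher = "UNWIND {rows} AS row"
            + " MATCH (gpe:GPE {joinProp1: row.joinCol1})-[:contains]->(pe:PE {joinProp2: row.joinCol2})-[:has]->(e:E {joinProp3: row.joinCol3})"
            + " RETURN row.joinCol1 AS j1, row.joinCol2 AS j2, row.joinCol3 AS j3,"
            + " row.outCol1 AS c1, row.outCol2 AS c2, row.outCol3 AS c3,"
            + " gpe.outProp1 AS p1, pe.outProp2 AS p2, e.outProp3 AS p3";
    List<Map<String, Object>> batch = new ArrayList<Map<String, Object>>();
    for (Map<String, String> csvRow : csvRows) {
        batch.add(new HashMap<String, Object>(csvRow));
        if (batch.size() == 10000) {                 // flush a full batch
            appendRecords(session.run(cypher, Collections.<String, Object>singletonMap("rows", batch)), out);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {                          // flush the remainder
        appendRecords(session.run(cypher, Collections.<String, Object>singletonMap("rows", batch)), out);
    }
}

static void appendRecords(StatementResult result, StringBuilder out) {
    for (Record r : result.list()) {                 // plain values come back, no node objects to re-walk
        out.append(String.format("%s,%s,%s,%s,%s,%s,%s,%s,%s%n",
                r.get("j1").asString(), r.get("j2").asString(), r.get("j3").asString(),
                r.get("c1").asString(), r.get("c2").asString(), r.get("c3").asString(),
                r.get("p1").asString(), r.get("p2").asString(), r.get("p3").asString()));
    }
}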
I did a quick and dirty experiment, as explained below.
My input csv looked like this.
inputDataCsv.csv
csv-id1,csv-id2,csv-id3,outcol1,outcol2,outcol3
gppe0,ppe1,pe2,outcol1-val-0,outcol2-val-1,outcol3-val-2
gppe3,ppe4,pe5,outcol1-val-3,outcol2-val-4,outcol3-val-5
gppe6,ppe7,pe8,outcol1-val-6,outcol2-val-7,outcol3-val-8
...
To create the graph, I first created csvs of the form:
gppe.csv (entities)
gppe0,gppe_out_prop_1_val_0,gppe_out_prop_2_val_0,gppe_prop_X_val_0
gppe3,gppe_out_prop_1_val_3,gppe_out_prop_2_val_3,gppe_prop_X_val_3
gppe6,gppe_out_prop_1_val_6,gppe_out_prop_2_val_6,gppe_prop_X_val_6
...
ppe.csv (entities)
ppe1,ppe_out_prop_1_val_1,ppe_out_prop_2_val_1,ppe_prop_X_val_1
ppe4,ppe_out_prop_1_val_4,ppe_out_prop_2_val_4,ppe_prop_X_val_4
ppe7,ppe_out_prop_1_val_7,ppe_out_prop_2_val_7,ppe_prop_X_val_7
...
pe.csv (entities)
pe2,pe_out_prop_1_val_2,pe_out_prop_2_val_2,pe_prop_X_val_2
pe5,pe_out_prop_1_val_5,pe_out_prop_2_val_5,pe_prop_X_val_5
pe8,pe_out_prop_1_val_8,pe_out_prop_2_val_8,pe_prop_X_val_8
...
gppeHasPpe.csv (relationships)
gppe0,ppe1
gppe3,ppe4
gppe6,ppe7
...
ppeContainsPe.csv (relationships)
ppe1,pe2
ppe4,pe5
ppe7,pe8
...
I loaded these csvs into Neo4j as follows:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppe.csv' AS line
CREATE (:GPPocEntity {id:line[0],gppe_out_prop_1: line[1], gppe_out_prop_2: line[2],gppe_out_prop_X: line[3]})  
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppe.csv' AS line
CREATE (:PPocEntity {id:line[0],ppe_out_prop_1: line[1], ppe_out_prop_2: line[2],ppe_out_prop_X: line[3]})
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///pe.csv' AS line
CREATE (:PocEntity {id:line[0],pe_out_prop_1: line[1], pe_out_prop_2: line[2],pe_out_prop_X: line[3]})
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppeHasPpe.csv' AS line
MATCH(gppe:GPPocEntity {id:line[0]})
MATCH(ppe:PPocEntity {id:line[1]})
MERGE (gppe)-[:has]->(ppe)
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppeContainsPe.csv' AS line
MATCH(ppe:PPocEntity {id:line[0]})
MATCH(pe:PocEntity {id:line[1]})
MERGE (ppe)-[:contains]->(pe)
Next I created indexes on the lookup properties:
CREATE INDEX ON :GPPocEntity(id)
CREATE INDEX ON :PPocEntity(id)
CREATE INDEX ON :PocEntity(id)
Below is a configuration utility class; the CsvReader helper that actually reads the csv into a list of maps is referenced later from the merging class:
package csv2csv;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.w3c.dom.stylesheets.LinkStyle;
public class Config {
String configFilePath;
Map csvColumnToGraphNodeMapping;
List<Map> mappingsGraphRelations;
Map<String,Map<String,String>> mappingsGraphRelationsMap = new HashMap<String, Map<String,String>>();
List<String> outputColumnsFromCsv;
Map outputColumnsFromGraph;
public Config(String pConfigFilePath) {
configFilePath = pConfigFilePath;
JSONParser parser = new JSONParser();
try {
Object obj = parser.parse(new FileReader(configFilePath));
JSONObject jsonObject = (JSONObject) obj;
csvColumnToGraphNodeMapping = (HashMap) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
.get("mappings");
mappingsGraphRelations = (ArrayList) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
.get("mappings-graph-relations");
for(Map m : mappingsGraphRelations)
{
mappingsGraphRelationsMap.put(""+ m.get("start-entity") + "-" + m.get("end-entity"), m);
}
outputColumnsFromCsv = (ArrayList) ((HashMap) jsonObject.get("output-csv-columns"))
.get("columns-from-input-csv");
outputColumnsFromGraph = (HashMap) ((HashMap) jsonObject.get("output-csv-columns"))
.get("columns-from-graph");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
}
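The CsvReader helper used by the merging class below is not included in the post. A minimal hypothetical sketch of what such a helper might look like, assuming a plain comma-separated file with a header row and no quoted fields:
// Hypothetical helper, not part of the original post.
public class CsvReader {
    private final String csvFilePath;

    public CsvReader(String csvFilePath) {
        this.csvFilePath = csvFilePath;
    }

    // Reads the csv into a list of maps, one map per row, keyed by the header names.
    public List<Map<String, String>> getMapListFromCsv() throws IOException {
        List<Map<String, String>> rows = new ArrayList<Map<String, String>>();
        try (BufferedReader reader = new BufferedReader(new FileReader(csvFilePath))) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return rows;
            }
            String[] headers = headerLine.split(",");
            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",", -1);
                Map<String, String> row = new HashMap<String, String>();
                for (int i = 0; i < headers.length && i < values.length; i++) {
                    row.put(headers[i].trim(), values[i].trim());
                }
                rows.add(row);
            }
        }
        return rows;
    }
}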
The class below performs the merging and creates the output csv:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.neo4j.driver.internal.value.MapValue;
import org.neo4j.driver.internal.value.NodeValue;
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.apache.commons.lang3.time.StopWatch;
public class Csv2CsvUtil2 {
static String inCsvFilePath = "D:\\Mahesh\\work\\files\\inputDataCsv.csv";
static String outCsvFilePath = "D:\\Mahesh\\work\\files\\csvout.csv";
private final static Driver driver = GraphDatabase.driver(
"bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));
public static void main(String[] args) throws FileNotFoundException, IOException {
merge(); // the non-batch variant (mergeNonBatch) mentioned below is not included in this post
}
private static void merge() throws FileNotFoundException, IOException
{
List<Map<String,String>> csvRowMapList = new CsvReader(inCsvFilePath).getMapListFromCsv();
Session session = driver.session();
String cypherFilter = "";
String cypher;
PrintWriter pw = new PrintWriter(new File(outCsvFilePath));
StringBuilder sb = new StringBuilder();
List<Map<String, String>> inputMapList = new ArrayList<Map<String,String>>();
Map<String,Object> inputMapListMap = new HashMap<String,Object>();
Map<String, Object> params = new HashMap<String, Object>();
cypher = "WITH {inputMapList} AS inputMapList"
+ " UNWIND inputMapList AS rowMap"
+ " WITH rowMap"
+ " MATCH (gppe:GPPocEntity {id:rowMap.csvid1})-[:has]->(ppe:PPocEntity {id:rowMap.csvid2})-[:contains]->(pe:PocEntity {id:rowMap.csvid3})"
+ " RETURN {id1:gppe.id,id2:ppe.id,id3:pe.id"
+ ",gppeprop1: gppe.gppe_out_prop_1,gppeprop2: gppe.gppe_out_prop_2"
+ ",ppeprop1: ppe.ppe_out_prop_1,ppeprop2: ppe.ppe_out_prop_2"
+ ",peprop1: pe.pe_out_prop_1,peprop2: pe.pe_out_prop_2"
+ ",outcol1:rowMap.outcol1,outcol2:rowMap.outcol2,outcol3:rowMap.outcol3}";
int i;
for(i=0;i<csvRowMapList.size();i++)
{
Map<String, String> rowMap = new HashMap<String, String>();
rowMap.put("csvid1", csvRowMapList.get(i).get("csv-id1"));
rowMap.put("csvid2", csvRowMapList.get(i).get("csv-id2"));
rowMap.put("csvid3", csvRowMapList.get(i).get("csv-id3"));
rowMap.put("outcol1", csvRowMapList.get(i).get("outcol1"));
rowMap.put("outcol2", csvRowMapList.get(i).get("outcol2"));
rowMap.put("outcol3", csvRowMapList.get(i).get("outcol3"));
inputMapList.add(rowMap);
if(i%10000 == 0) //run in batch
{
inputMapListMap.put("inputMapList", inputMapList);
StatementResult stmtRes = session.run(cypher,inputMapListMap);
List<Record> retList = stmtRes.list();
for (Record record2 : retList) {
MapValue retMap = (MapValue) record2.get(0);
sb.append(retMap.get("id1")
+","+retMap.get("id2")
+","+retMap.get("id3")
+","+retMap.get("gppeprop1")
+","+retMap.get("gppeprop2")
+","+retMap.get("ppeprop1")
+","+retMap.get("ppeprop2")
+","+retMap.get("peprop1")
+","+retMap.get("peprop2")
+","+retMap.get("outcol1")
+","+retMap.get("outcol2")
+","+retMap.get("outcol3")
+"\n"
);
}
inputMapList.clear();
}
}
if(inputMapList.size() != 0) // ingest the remaining rows that did not fill a complete batch of 10000 records
{
inputMapListMap.put("inputMapList", inputMapList);
StatementResult stmtRes = session.run(cypher,inputMapListMap);
List<Record> retList = stmtRes.list();
for (Record record2 : retList) {
MapValue retMap = (MapValue) record2.get(0);
sb.append(retMap.get("id1")
+","+retMap.get("id2")
+","+retMap.get("id3")
+","+retMap.get("gppeprop1")
+","+retMap.get("gppeprop2")
+","+retMap.get("ppeprop1")
+","+retMap.get("ppeprop2")
+","+retMap.get("peprop1")
+","+retMap.get("peprop2")
+","+retMap.get("outcol1")
+","+retMap.get("outcol2")
+","+retMap.get("outcol3")
+"\n"
);
}
}
pw.write(sb.toString());
pw.close();
}
}
Running in non-batch mode slowed things down about 10 times. When guidelines 2 to 4 weren't followed, it was a lot slower still. I would appreciate it if someone could confirm whether all of the above is correct, point out any mistakes or anything I am missing, and suggest whether this can be improved further.

Xtext Grammar mixins result in Guice injection error

I am writing an Xtext grammar that can access documentation that is declared before functions.
Our current grammar defines hidden(ML_COMMENT, SL_COMMENT,...) with:
ML_COMMENT: '/*' -> '*/'
SL_COMMENT: '//' -> EOL
I have now created a second Xtext project, with the following grammar:
grammar my.DocumentationGrammar with my.OriginalGrammar hidden(WS, FUNCTION_BODY, EOL, SL_COMMENT)
import "http://www.originalGrammar.my"
generate documentationGrammar "http://www.documentationGrammar.my"
/* Parser rules */
TranslationUnit:
eds+=DoxExternalDefinition*
;
DoxExternalDefinition:
def = Definition
| lib = CtrlLibUsage
| comment=ML_COMMENT
;
FunctionDefinition:
aml=AccessModifiersList ts=TypeSpecifier? f=Function '(' pl=ParameterTypeList? ')' /* cs=CompoundStatement */ // the compound statement is ignored
;
//terminal DOXYGEN_COMMENT: ML_COMMENT;
terminal FUNCTION_BODY: '{' -> '}';
I have created the dependency in the plugin and added the following to the generator workflow:
bean = StandaloneSetup {
scanClassPath = true
platformUri = "${runtimeProject}/.."
// The following two lines can be removed, if Xbase is not used.
registerGeneratedEPackage = "org.eclipse.xtext.xbase.XbasePackage"
registerGenModelFile = "platform:/resource/org.eclipse.xtext.xbase/model/Xbase.genmodel"
// we need to register the super genmodel
registerGeneratedEPackage = "my.OriginalGrammar.OriginalGrammarPackage"
registerGenModelFile = "platform:/resource/my.OriginalGrammar/model/generated/OriginalGrammar.genmodel"
}
Now in my third plugin project, I want to access this parser in a standalone fashion. So I created the following parser file (based on this example: http://davehofmann.de/blog/?p=101):
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Path;
import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.xtext.parser.IParseResult;
import org.eclipse.xtext.parser.IParser;
import org.eclipse.xtext.parser.ParseException;
import org.eclipse.xtext.resource.XtextResource;
import org.eclipse.xtext.resource.XtextResourceSet;
import my.DocumentationGrammar.DocumentationGrammarStandaloneSetup;
import com.google.inject.Inject;
import com.google.inject.Injector;
public class DoxygenParser {
@Inject
private IParser parser;
private Injector injector;
public DoxygenParser() {
setupParser();
}
private void setupParser() {
injector = new DocumentationGrammarStandaloneSetup().createInjectorAndDoEMFRegistration();
injector.injectMembers(this);
}
/**
* Parses data provided by an input reader using Xtext and returns the root node of the resulting object tree.
* @param reader Input reader
* @return root object node
* @throws IOException when errors occur during the parsing process
*/
public EObject parse(Reader reader) throws IOException
{
IParseResult result = parser.parse(reader);
if(result.hasSyntaxErrors())
{
throw new ParseException("Provided input contains syntax errors.");
}
return result.getRootASTElement();
}
}
However, when I try to run it, I receive Guice Injection errors saying that
com.google.inject.ProvisionException: Guice provision errors:
1) Error injecting constructor, org.eclipse.emf.common.util.WrappedException: java.lang.RuntimeException: Cannot create a resource for 'classpath:/my/documentationGrammar/DocumentationGrammar.xtextbin'; a registered resource factory is needed
I know that the parser "should" be correct, since when I use the OriginalGrammarStandaloneSetup it works perfectly fine.
You have to make sure that your sublanguage also invokes the standalone setup of your super language. Usually this is generated in the DocumentationGrammarStandaloneSetupGenerated class, but please make sure that this was properly set up by Xtext. In the end, there should be something like
if (!Resource.Factory.Registry.INSTANCE.getExtensionToFactoryMap().containsKey("xtextbin"))
Resource.Factory.Registry.INSTANCE.getExtensionToFactoryMap().put(
"xtextbin", new BinaryGrammarResourceFactoryImpl());
in the call chain of your setup's createInjectorAndDoEMFRegistration method.
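If the generated setup does not already do this, one workaround is to run the super language's setup before creating the injector in DoxygenParser. A sketch, assuming the generated OriginalGrammarStandaloneSetup exposes the usual static doSetup() method (names are illustrative):
private void setupParser() {
    // Make sure the super language's EMF registrations (including the
    // .xtextbin resource factory) are in place before the sub language's setup runs.
    OriginalGrammarStandaloneSetup.doSetup();
    injector = new DocumentationGrammarStandaloneSetup().createInjectorAndDoEMFRegistration();
    injector.injectMembers(this);
}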
