Jena multiline string output

Is it possible to output Turtle with strings that are multi-line, that is, have Jena keep the multi-line form when outputting the result?
I'm currently struggling to do so. I have properties that contain multi-line text, but Jena keeps outputting each one as a single-line string with the escapes in it, i.e. "\n" all over.
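For illustration (the ex: prefix is just a made-up example), this is roughly what I currently get versus what I would like:
@prefix ex: <http://example.com/> .

# what I currently get: one line, newline escaped
ex:note  ex:text  "first line\nsecond line" .

# what I would like: a long (triple-quoted) literal keeping the real line break
ex:note  ex:text  """first line
second line""" .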
Edit 1
Found the following discussion (curious to know where it landed):
https://mail-archives.apache.org/mod_mbox/jena-users/201512.mbox/%3c56798EBD.9010807#apache.org%3e
I actually tried
ARQ.getContext().set(RIOT.multilineLiterals, true)
at the beginning of my code, but with no success:
val program = for {
  _     <- IO { ARQ.getContext().set(RIOT.multilineLiterals, true) }
  model <- IO { ModelFactory.createDefaultModel() }
  _     <- IO { model.read("xxxxx.ttl") }
  _     <- IO { model.write(System.out, Lang.TURTLE.getName) }
} yield ()
Note: the file contains strings that span multiple lines.

These ways to write work with context settings:
RDFDataMgr.write(System.out, model, Lang.TURTLE);
RDFWriter.create(model).lang(Lang.TURTLE).output(System.out);
Looking at the code, model.write involves backwards compatibility with setting writer properties (for RDF/XML). It does not need to ignore global settings for the other writers; this has been filed as bug JENA-2148.
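Putting the pieces together, here is a minimal Java sketch of the working combination described above (it reuses the RIOT.multilineLiterals context symbol and the placeholder xxxxx.ttl path from the question):

import org.apache.jena.query.ARQ;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RIOT;

public class MultilineTurtleExample {
    public static void main(String[] args) {
        // global context setting, as in the question
        ARQ.getContext().set(RIOT.multilineLiterals, true);

        Model model = ModelFactory.createDefaultModel();
        model.read("xxxxx.ttl"); // placeholder path from the question

        // RDFDataMgr.write (or RDFWriter.create(model)...) honours the context setting;
        // model.write currently does not, which is the subject of JENA-2148
        RDFDataMgr.write(System.out, model, Lang.TURTLE);
    }
}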

Related

Preserving whitespace in Rascal when transforming Java code

I am trying to add instrumentation (e.g. logging some information) to methods in a Java file. I am using the following Rascal code which seems to work mostly:
import ParseTree;
import lang::java::\syntax::Java15;
// .. more imports
// project is a loc
M3 model = createM3FromEclipseProject(project);
set[loc] projectFiles = { file | file <- files(model) };
for (pFile <- projectFiles) {
    CompilationUnit cunit = parse(#CompilationUnit, pFile);
    cUnitNew = visit(cunit) {
        case (MethodBody) `{<BlockStm* post>}`
            => (MethodBody) `{
               'System.out.println(new Throwable().getStackTrace()[0]);
               '<BlockStm* post>
               '}`
    }
    writeFile(pFile, cUnitNew);
}
I am running into two issues regarding whitespace, which might be unrelated.
The line of code that I am inserting does not preserve whitespace that was there previously. If there was a tab character, it will now be removed. The same is true for the line directly following the line I am inserting and the closing brace. How can I 'capture' whitespace in my pattern?
Example before transforming (all lines start with a tab character, lines 2 and 3 with two):
	void beforeFirst() throws Exception {
		rowIdx = -1;
		rowSource.beforeFirst();
	}
Example after transforming (the inserted line, the line directly after it, and the closing brace have lost their indentation):
	void beforeFirst() throws Exception {
System.out.println(new Throwable().getStackTrace()[0]);
rowIdx = -1;
		rowSource.beforeFirst();
}
An additional issue regarding whitespace; if a file ends on a newline character, the parse function will throw a ParseError without further details. Removing this newline from the original source will fix the issue, but I'd rather not 'manually' have to fix code before parsing. How can I circumvent this issue?
Alas, capturing whitespace with a concrete pattern is not a feature of the current version of Rascal. We used to have it, but now it's back on the TODO list. I can point you to papers about the topic if you are interested. So for now you have to deal with this "damage" later.
You could write a Tree to Tree transformation on the generic level (see ParseTree.rsc), to fix indentation issues in a parse tree after your transformation, or to re-insert the comments that you lost. This is about matching the Tree data-type and appl constructors. The Tree format is a form of reflection on the parse trees of Rascal that allow any kind of transformation, including whitespace and comments.
The parse error you talked about is caused by not using the start non-terminal. If you use parse(#start[CompilationUnit], ...) then whitespace and comments before and after the CompilationUnit are accepted.

How to access shadowed DXL variables/functions?

I encountered an error in a script I was debugging because somebody had created a variable with a name matching a built-in function, rendering the function inaccessible. I got strange errors when I tried to use the function, like:
incorrect arguments for (-)
incorrect arguments for (by)
incorrect arguments for ([)
incorrect arguments for (=)
Example code:
int length
// ...
// ...
string substr
string str = "big long string with lots of text"
substr = str[0:length(str)-2]
Is there a way to access the original length() function in this situation? I was actually just trying to add debug output to the existing script, not trying to modify the script, when I encountered this error.
For now I have just renamed the variable.
Well, in the case that you have no chance to modify the code, e.g. because it is encrypted, you could do something like:
int length_original (string s) { return length s }
<<here is the code of your function>>
int length (string s) {return length_original s }

How should I terminate a func in F#

let Trans(DirPath:string) =
    if ( numOfVmFiles(DirPath) > 1) then //if there are more than 1 vm files in the directory
        Init("")
        let VmFiles = ListOfVmFiles(DirPath)
        for VmFile in VmFiles do // for each vm file
            ReadFile(VmFile)
I got this error:
Error Block following this 'let' is unfinished. Expect an expression.
What should I write? Thank you.
The existing answers give you good hints on how to write your code better. You said you are getting an error:
Block following this 'let' is unfinished. Expect an expression.
This typically indicates a missing = or wrong indentation after the end of your function (or some similarly easy-to-miss syntax error). In the snippet you posted, all the syntax looks good to me, so I suspect there is something wrong elsewhere in your code. The following gives no errors:
let numOfVmFiles a = 0
let Init a = ()
let ListOfVmFiles a = []
let ReadFile a = ()

let Trans(DirPath:string) =
    if ( numOfVmFiles(DirPath) > 1) then
        Init("")
        let VmFiles = ListOfVmFiles(DirPath)
        for VmFile in VmFiles do // for each vm file
            ReadFile(VmFile)
You get two warnings (because variable names should be camelCase rather than PascalCase), but no error. As others said, you should probably make Init and ReadFile return something and then collect the results (to make your code more functional), but that's a separate problem.
The problem is that the function does not return a value. Functions must always return a value. If there is nothing to return, return unit. You can return unit as ().
I made some possibly incorrect assumptions here but I tried to make clear what they were. On the trans function I also show how you can specify the return type. It is usually best to let the compiler infer the type until it cannot. Hover over the functions and see what the compiler is telling you about the types. string -> int -> string list means a function takes a string and an int and returns a list of strings.
let init dirName = () // unit is returned... kind of like void, but it is actually a return value
let listOfVmFiles dirName = ["some";"files"] // list of string
let readFile path = "content of file" // string

let trans(dirPath:string) : string list = // takes a string and returns a list of string, represented as string -> string list
    let vmFiles = listOfVmFiles(dirPath) // get files from path
    if (vmFiles.Length > 1) then init("") // init if more than 1 file
    List.map readFile vmFiles // return a list of the content of the files
If a function performs a side effect and does not return anything, it can be written like so:
let trans(dirPath:string) : unit =
    let vmFiles = listOfVmFiles(dirPath)
    if (vmFiles.Length > 1) then init("")
    List.map readFile vmFiles |> ignore // ignore the result
    ()
This ignores the result of mapping the readFile function over the list and then returns unit using ().
I recommend F# for fun and profit for learning F#.
Hope this helps, and good luck. Although the syntax seems weird initially, stick with it. It's great!
In F# a function returns the value of the last expression it evaluated.
In your particular case it returns unit (), because a for loop returns unit unless its body yields values (in which case you would need to wrap it in a seq).
As mentioned in the comments, this code parses OK; your issue is an unfinished expression elsewhere in your code.
The rewrite by Devon Buriss is a good example of best practices:
- Explicitly declare your function's return value in the signature.
- ignore function return values if the function is only called for side effects (e.g. readFile and init("")).
- Prefer functional behaviors such as map over imperatives such as for .. do.
As an aside, relying so heavily on side effects is likely to cause you difficulties elsewhere. A more common practice with data crunching is to have a function like readFile return file contents as a seq, and pipe the result to downstream processing:
List.map readFile vmFiles
|> Seq.concat // concatenate the file outputs
|> processContents
Whether or not this is the right thing for you depends on what exactly you intend to do with the contents of each file.

SideInputs corrupt the data in DataFlow's Pipeline

I have a Dataflow pipeline (SDK 2.1.0, Apache Beam 2.2.0) which simply reads RDF (in N-Triples, so it's just text files) from GCS, transforms it somehow and writes it back to GCS, but in a different bucket. In this pipeline I employ side inputs which are three single files (one file per side input) and use them in a ParDo.
To work with RDF in Java I use Apache Jena, so each file is read into an instance of the Model class. Since Dataflow doesn't have a Coder for it, I developed one myself (RDFModelCoder, see below). It works fine in a number of other pipelines I created.
The problem with this particular pipeline is that when I add the side inputs, execution fails with an exception indicating data corruption, i.e. some garbage is added. Once I remove the side inputs, the pipeline finishes execution successfully.
The exception (it's thrown from RDFModelCoder, see below):
Caused by: org.apache.jena.atlas.RuntimeIOException: java.nio.charset.MalformedInputException: Input length = 1
at org.apache.jena.atlas.io.IO.exception(IO.java:233)
at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)
at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:154)
at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:137)
at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:235)
at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:229)
at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:151)
at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:92)
at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:48)
at org.apache.jena.riot.lang.RiotParsers.createParser(RiotParsers.java:57)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:198)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:298)
at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:288)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:237)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:417)
at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:870)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:268)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:254)
at org.apache.jena.riot.adapters.RDFReaderRIOT.read(RDFReaderRIOT.java:69)
at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)
And here you can see the garbage (at the end):
<http://example.com/typeofrepresentative/08> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> . ������** �����I��.�������������u�������
The pipeline:
val one = p.apply(TextIO.read().from(config.getString("source.one")))
        .apply(Combine.globally(SingleValue()))
        .apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))

val two = p.apply(TextIO.read().from(config.getString("source.two")))
        .apply(Combine.globally(SingleValue()))
        .apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))

val three = p.apply(TextIO.read().from(config.getString("source.three")))
        .apply(Combine.globally(SingleValue()))
        .apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))

val sideInput = PCollectionList.of(one).and(two).and(three)
        .apply(Flatten.pCollections())
        .apply(View.asList())

p.apply(RDFIO.Read
            .from(options.getSource())
            .withSuffix(RDFLanguages.strLangNTriples))
        .apply(ParDo.of(SparqlConstructETL(config, sideInput))
            .withSideInputs(sideInput))
        .apply(RDFIO.Write
            .to(options.getDestination())
            .withSuffix(RDFLanguages.NTRIPLES))
And just to provide the whole picture, here are the implementations of the SingleValue and ConvertToRDFModel ParDos:
class SingleValue : SerializableFunction<Iterable<String>, String> {
    override fun apply(input: Iterable<String>?): String {
        if (input != null) {
            return input.joinToString(separator = " ")
        }
        return ""
    }
}

class ConvertToRDFModel(outputLang: Lang) : DoFn<String, Model>() {
    private val lang: String = outputLang.name

    @ProcessElement
    fun processElement(c: ProcessContext?) {
        if (c != null) {
            val model = ModelFactory.createDefaultModel()
            model.read(StringReader(c.element()), null, lang)
            c.output(model)
        }
    }
}
The implementation of RDFModelCoder:
class RDFModelCoder(private val decodeLang: String = RDFLanguages.strLangNTriples,
                    private val encodeLang: String = RDFLanguages.strLangNTriples)
    : AtomicCoder<Model>() {

    private val LOG = LoggerFactory.getLogger(RDFModelCoder::class.java)

    override fun decode(inStream: InputStream): Model {
        val bytes = StreamUtils.getBytes(inStream)
        val model = ModelFactory.createDefaultModel()
        model.read(ByteArrayInputStream(bytes), null, decodeLang) // the exception is thrown from here
        return model
    }

    override fun encode(value: Model, outStream: OutputStream?) {
        value.write(outStream, encodeLang, null)
    }
}
I checked the side input files multiple times, they're fine, they have UTF-8 encoding.
Most likely the error is in the implementation of RDFModelCoder. When implementing encode/decode, one has to remember that the provided InputStream and OutputStream are not exclusively owned by the instance currently being encoded or decoded. E.g. there might be more data in the InputStream after the encoded form of your current Model. When using StreamUtils.getBytes(inStream) you are grabbing both the data of the current encoded Model and anything else that was in the stream.
Generally, when writing a new Coder it's a good idea to only combine existing Coders rather than hand-parsing the stream: that is less error-prone. I would suggest converting the model to/from byte[] and using ByteArrayCoder.of() to encode/decode it.
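A minimal sketch of that suggestion, written here in Java against the Beam Coder API (untested): the Model is still serialized as N-Triples, but ByteArrayCoder.of() takes care of framing the bytes, so decode reads back exactly what encode wrote for one element and nothing else from the stream.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.beam.sdk.coders.AtomicCoder;
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFLanguages;

public class RDFModelCoder extends AtomicCoder<Model> {

    private static final ByteArrayCoder BYTES = ByteArrayCoder.of();
    private static final String LANG = RDFLanguages.strLangNTriples;

    @Override
    public void encode(Model value, OutputStream outStream) throws IOException {
        // serialize the model into its own buffer first ...
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        value.write(buffer, LANG, null);
        // ... then let ByteArrayCoder write it with its own framing, so it does not
        // bleed into whatever else follows in the stream
        BYTES.encode(buffer.toByteArray(), outStream);
    }

    @Override
    public Model decode(InputStream inStream) throws IOException {
        // read back exactly the bytes encode() wrote for this element
        byte[] bytes = BYTES.decode(inStream);
        Model model = ModelFactory.createDefaultModel();
        model.read(new ByteArrayInputStream(bytes), null, LANG);
        return model;
    }
}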
Apache Jena provides the Elephas IO modules, which have Hadoop IO support. Since Beam supports Hadoop InputFormat IO, you should be able to use that to read in your NTriples file.
This will likely be far more efficient, since the NTriples support in Elephas is able to parallelise the IO and avoid caching the entire model into memory (in fact it won't use Model at all):
Configuration myHadoopConfiguration = new Configuration(false);
// Set Hadoop InputFormat, key and value class in configuration
myHadoopConfiguration.setClass("mapreduce.job.inputformat.class",
        NTriplesInputFormat.class, InputFormat.class);
myHadoopConfiguration.setClass("key.class", LongWritable.class, Object.class);
myHadoopConfiguration.setClass("value.class", TripleWritable.class, Object.class);
// Set any other Hadoop config you might need

// Read data only with Hadoop configuration.
p.apply("read",
        HadoopInputFormatIO.<LongWritable, TripleWritable>read()
                .withConfiguration(myHadoopConfiguration));
Of course this may require you to refactor your overall pipeline somewhat.

CoreNLP: provide POS tags

I have text that is already tokenized, sentence-split, and POS-tagged.
I would like to use CoreNLP to additionally annotate lemmas (lemma), named entities (ner), constituency and dependency parses (parse), and coreference (dcoref).
Is there a combination of commandline options and option file specifications that makes this possible from the command line?
According to this question, I can ask the parser to view whitespace as delimiting tokens, and newlines as delimiting sentences by adding this to my properties file:
tokenize.whitespace = true
ssplit.eolonly = true
This works well, so all that remains is to specify to CoreNLP that I would like to provide POS tags too.
When using the Stanford Parser standing alone, it seems to be possible to have it use existing POS tags, but copying that syntax to the invocation of CoreNLP doesn't seem to work. For example, this does not work:
java -cp *:./* -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props my-properties-file -outputFormat xml -outputDirectory my-output-dir -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -file my-annotated-text.txt
While this question covers programmatic invocation, I'm invoking CoreNLP from the command line as part of a larger system, so I'm really asking whether it is possible to achieve this with command-line options.
I don't think this is possible with command line options.
If you want, you could go the route of making a custom annotator and including it in your pipeline.
Here is some sample code:
package edu.stanford.nlp.pipeline;

import edu.stanford.nlp.util.logging.Redwood;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.concurrent.MulticoreWrapper;
import edu.stanford.nlp.util.concurrent.ThreadsafeProcessor;

import java.util.*;

public class ProvidedPOSTaggerAnnotator {

    public String tagSeparator;

    public ProvidedPOSTaggerAnnotator(String annotatorName, Properties props) {
        tagSeparator = props.getProperty(annotatorName + ".tagSeparator", "_");
    }

    public void annotate(Annotation annotation) {
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            int tagSeparatorSplitLength = token.word().split(tagSeparator).length;
            String posTag = token.word().split(tagSeparator)[tagSeparatorSplitLength - 1];
            String[] wordParts = Arrays.copyOfRange(token.word().split(tagSeparator), 0, tagSeparatorSplitLength - 1);
            String tokenString = String.join(tagSeparator, wordParts);
            // set the word with the POS tag removed
            token.set(CoreAnnotations.TextAnnotation.class, tokenString);
            // set the POS
            token.set(CoreAnnotations.PartOfSpeechAnnotation.class, posTag);
        }
    }
}
This should work if you provide your tokens with the POS tag attached, separated by "_". You can change the separator with the forcedpos.tagSeparator property.
If you set customAnnotator.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator in the property file, include the above class in your CLASSPATH, and then include "forcedpos" in your list of annotators after "tokenize", you should be able to pass in your own POS tags.
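Pulling that together, a sketch of what the properties file might look like (the exact annotator list and the placement of forcedpos are assumptions based on the question's requirements):

# whitespace tokenization and one sentence per line, as in the question
tokenize.whitespace = true
ssplit.eolonly = true

# register the custom annotator above and its tag separator
customAnnotator.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator
forcedpos.tagSeparator = _

# run forcedpos right after tokenize, then the rest of the pipeline
annotators = tokenize, forcedpos, ssplit, lemma, ner, parse, dcoref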
I may clean this up some more and actually include it in future releases for people!
I have not had time to actually test this code out; if you try it and find errors, please let me know and I'll fix it!

Resources