Delimiter for CSV file in IIB - mapping

I am developing an integration in IIB and one of the requirements for the output (multiple CSV files) is a comma delimiter instead of a semicolon. The semicolon is on the input. I'm using two mapping nodes to produce separate files from one input, but I'm struggling to find the option for the delimiter.
There are two mapping nodes that use XSD schemas and .map files to produce the output.
The first mapping creates a canonical DFDL format that is ready to be parsed into multiple files in the second mapping node.
There is not much code, just configuration in IIB.
I would like to produce comma-separated CSV instead of semicolon-separated.
Thanks in advance

I found a solution. You can simply view and edit the XSD source in a text editor and change the delimiter there.
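For reference, in a DFDL schema the field delimiter typically appears as a dfdl:separator attribute on the record's sequence, so switching from ";" to "," is a one-attribute edit. A sketch of what that looks like (the element names here are illustrative, not from a real generated schema):

<xs:element name="record">
  <xs:complexType>
    <xs:sequence dfdl:separator="," dfdl:separatorPosition="infix">
      <xs:element name="field1" type="xs:string"/>
      <xs:element name="field2" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>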

Related

Reading multiline files in Apache Beam separated with custom delimiters

I have a text file whose fields are separated by a two-character delimiter (#*), and one of the fields contains multiline values. For example:
test#*123#*"contain
multiline"
test#*321#*"contain
multiline"
Those are actually 2 records, but in the text file they span 4 lines. The approach I was trying is to retrieve the files with FileIO and then use a ParDo to open each file, find the last character of each line, and if the line does not end with " find the next line and append it to the first. My concern is that Beam processes the file in bundles, so if the 2 lines are not in the same bundle it will fail.
Is my understanding correct? Please let me know the best way to handle this.
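One common way around the bundle concern is to read each file as a single element with FileIO.readMatches() and do the line stitching inside one DoFn call, since a single element is never split across bundles. A minimal sketch with the Beam Java SDK (the file pattern and the quote-counting helper are my own illustrations, not from the question):

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class MultilineRecords {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    PCollection<String> records = p
        .apply(FileIO.match().filepattern("/path/to/input/*.txt")) // hypothetical pattern
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void process(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
              throws IOException {
            // one element = one whole file, so reassembly cannot cross a bundle boundary
            String contents = file.readFullyAsUTF8String();
            StringBuilder record = new StringBuilder();
            for (String line : contents.split("\r?\n")) {
              if (record.length() > 0) {
                record.append('\n');
              }
              record.append(line);
              // a record is complete once its double quotes are balanced
              if (record.length() > 0 && countQuotes(record) % 2 == 0) {
                out.output(record.toString());
                record.setLength(0);
              }
            }
          }
        }));
    p.run().waitUntilFinish();
  }

  private static int countQuotes(CharSequence s) {
    int n = 0;
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) == '"') n++;
    }
    return n;
  }
}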

SSIS pipe delimiter issue for CRLF CSV file

I am facing the below pipe delimiter issue in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2  |Col3
1   |A/C No|2015
2   |A|C No|2016
Because of the pipe embedded within a field value, SSIS is failing to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, having field values surrounded by double quotes) in addition to the Field Delimiter (pipe). The Text Delimiter will help because a program (like SSIS) can tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
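For example, with double quotes as the Text Delimiter, the same data would be written unambiguously as:
Col1|Col2|Col3
1|"A/C No"|2015
2|"A|C No"|2016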
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.

Is it bad to parse the data for white spaces, tabs and other non-printable characters before passing it for indexing to Solr master?

I am talking about a custom parsing phase happening in some program not related to Solr, even before the Solr tokenizers can work on the data. If I strip out, say, whitespace, tabs and other non-printable characters, then when that data actually comes to the Solr master for indexing, how would the Solr tokenizers differentiate between separate words that were previously separated by spaces, tabs or other non-printable characters?
Example code and output from the pre-processor:
<?php
$text = '<div>This is a sample text to be indexed</div>';
// Remove HTML tags
$text_refined1 = strip_tags($text);
// Remove non-printable Unicode characters
$text_refined2 = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/u', '', $text_refined1);
// Remove all whitespace (spaces, tabs, line feeds and carriage returns)
$text_refined3 = preg_replace('/\s+/', '', $text_refined2);
echo $text_refined3;
---output---
Thisisasampletexttobeindexed
Based on the example you give, i.e. the output Thisisasampletexttobeindexed, Solr's existing query analyzers will not be able to tokenize it correctly.
Solr (Lucene) needs some way to separate the individual words in the input.
You can use Solr's Analysis admin UI to test this string with different analyzers. In my Solr test instance, they all return the original string.
You can configure which Tokenizer to use in Solr. There is a list at https://cwiki.apache.org/confluence/display/solr/Tokenizers
Indexing a stream of non-delimited English words properly is not supported by any existing Tokenizer in Solr. You could conceivably build a custom one with a dictionary, but it would produce errors as the input is ambiguous. Or you could use the N-Gram Tokenizer and accept a lot of false positives when you search.
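For illustration, an N-Gram field type in schema.xml might look like the sketch below (the field type name and gram sizes are arbitrary choices, not recommendations):

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="5"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>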
The right solution is not to feed such a stream in the first place. If you need the tightly concatenated string for something internal, then produce a separate version for indexing, where you replace the offending characters with a space instead of the empty string (in the example above, that means using ' ' rather than '' as the replacement in the final preg_replace).

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the Javadocs but I cannot figure out how to get the output in the form I described.
For POS, I tried using the LexicalizedParser and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate the (sub-optimal) XML files here, I had to write a script to remove all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used a command similar to what is described there, but I had to write a Python program to retrieve the stdout data resulting from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo-command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but a bit tricky, and there is no out-of-the-box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level you have to do the following:
- Load your trees with MemoryTreebank
- Loop through all the trees, and for each tree create a sentence CoreMap to which you add
  - a TokensAnnotation
  - a TreeAnnotation and the SemanticGraphCoreAnnotations
- Create an Annotation object with a list containing the CoreMap objects for all sentences
- Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
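A rough sketch of those steps against the CoreNLP 3.x Java API is below. The file paths are placeholders, and the SemanticGraphCoreAnnotations step is only indicated in a comment because the exact ParserAnnotatorUtils call varies between versions:

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class TreebankToCoreNlpXml {
  public static void main(String[] args) throws Exception {
    // 1. load the gold trees (path is a placeholder)
    MemoryTreebank treebank = new MemoryTreebank();
    treebank.loadPath("wsj_0001.mrg");

    // 2. build one sentence CoreMap per tree
    List<CoreMap> sentences = new ArrayList<>();
    int sentIndex = 0;
    for (Tree tree : treebank) {
      CoreMap sentence = new ArrayCoreMap();
      List<CoreLabel> tokens = tree.taggedLabeledYield(); // words plus gold POS tags
      sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
      sentence.set(TreeCoreAnnotations.TreeAnnotation.class, tree);
      sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentIndex++);
      // TODO: also fill in the SemanticGraphCoreAnnotations here,
      // e.g. via ParserAnnotatorUtils (method signature differs across versions)
      sentences.add(sentence);
    }

    // 3. wrap the sentences in a single Annotation
    Annotation annotation = new Annotation(sentences);

    // 4. run only the remaining annotators, skipping tokenize/ssplit/pos/parse
    Properties props = new Properties();
    props.setProperty("annotators", "lemma,ner,dcoref");
    props.setProperty("enforceRequirements", "false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(annotation);
    pipeline.xmlPrint(annotation, new FileOutputStream("wsj_0001.xml"));
  }
}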

Recommended column delimiter for clickstream data to be consumed by SSIS

I am working with some clickstream data and I need to give specifications to the vendor regarding a preferred format to be consumed by SSIS.
Since it's URL data in a text file, which column delimiter would you recommend? I was thinking pipe "|", but I realize that pipes can be used within a URL.
I did some testing to specify multiple characters as the delimiter, like |^|, but when creating a flat file connection there is no such option in SSIS; I had to type these characters in. When I went back to edit the flat file connection manager it had changed to {|}^{|}. It made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas as to what would be a safe column delimiter to use.
Probably tab-delimited would be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
