Is it possible to adjust a csv file using xslt? - xslt-2.0

If you take this basic comma-separated input:
replace,something,incsvfile
Can you use XSLT to transform the file, for example to put each value on its own row, or to replace the word 'replace' with 'replaced'?
Can you point me to a reference that would help me learn how to do this?
The reason for the question is that I'm working with an application that outputs CSV files and lets you apply XSLT (2.0) to the output, but I can't work out how to convert the CSV into another CSV as a post-processing step.
Any help would be appreciated.

Related

Best way to ingest a list of csv files with dataflow

I'm looking for a way to read from a list of CSV files and convert each row into JSON format. Since I cannot get the header names beforehand, I must ensure that each worker reads from the beginning of a CSV file; otherwise we don't know the header names.
My plan is to use FileIO.readMatches to get ReadableFile elements and, for each element, read the first line as the header and combine the header with every other line to produce JSON. My questions are:
Is it safe to assume ReadableFile will always contain a whole file, not a partial file?
Will this approach require worker memory to be larger than file size?
Any other better approaches?
Thanks!
Yes, ReadableFile will always give you a whole file.
No. As you go through the file line-by-line, you first read one line to determine the columns, then you read each line to output the rows - this should work!
This seems like the right approach to me, unless you have only a few files that are very large (GBs, TBs). If you have at least a dozen or a few dozen files, you should be fine.
An extra tip: it may be convenient to insert an apply(Reshuffle.viaRandomKey()) between your CSV parser and your next transform. This redistributes the output of each file across multiple workers, which gives you more parallelism downstream.
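The header-then-rows idea can be sketched outside of any pipeline framework. The following is plain Python, not Beam code: `rows_as_json` is a hypothetical helper name, and the in-memory sample stands in for an opened file. It only illustrates that the header is read once and that a single row is held in memory at a time.

```python
import csv
import io
import json
from typing import Iterator, TextIO


def rows_as_json(handle: TextIO) -> Iterator[str]:
    """Yield one JSON string per data row, using the first row as the header."""
    reader = csv.reader(handle)
    header = next(reader, None)  # the first line names the columns
    if header is None:
        return  # empty file: nothing to emit
    for row in reader:
        # Only the header and the current row are held in memory.
        yield json.dumps(dict(zip(header, row)))


if __name__ == "__main__":
    # Hypothetical sample data standing in for one CSV file.
    sample = io.StringIO("id,name\n1,alice\n2,bob\n")
    for line in rows_as_json(sample):
        print(line)
```

Because the function is a generator, memory use depends on the longest row rather than on the file size, which matches the answer to the second question.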
Good luck! Feel free to ask follow-up questions in the comments.

Delimiter for CSV file in IIB

I am developing an integration in IIB, and one of the requirements for the output (multiple CSV files) is a comma delimiter instead of a semicolon. The input uses a semicolon. I'm using two mapping nodes to produce separate files from one input, but I am struggling to find the option for the delimiter.
There are two mapping nodes that use XSD schemas and .map files to produce the output.
The first mapping creates a canonical DFDL format that is ready to be parsed into multiple files in the second mapping node.
There is not much code, just setup in IIB.
I would like to produce comma-separated CSV instead of semicolon-separated.
Thanks in advance
I found a solution. You can simply view and edit the XSD code in a text editor and change the delimiter there.

How to parse a JSON file line by line in objective c

I am working with very large JSON files, so I do not want to read the entire file and then iterate and parse each data entry.
Instead, I would like to iterate on the JSON file itself (for example: line-by-line/one object at a time).
I thought about holding the next line location as part of the current line data, so the JSON is a semi linked list, but I did not manage to extract a specific line from the JSON file.
Am I missing an easier way to achieve that? Is it even possible to extract and parse a specific line from a JSON file?
Thanks a lot!
JSON is not a line oriented format, so the idea of parsing "line by line" doesn't really make sense.
That said, there is at least one event-driven JSON parser for iOS that I know of, https://github.com/stig/json-framework. The built-in parser NSJSONSerialization only works on entire files.
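One case where line-by-line parsing does work is when the file is written with one complete JSON value per line. That layout is an assumption here, not something the question states. The sketch below is in Python rather than Objective-C, and `iter_json_lines` is a hypothetical helper name; it only illustrates the idea of parsing each line as its own small document.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_json_lines(handle: TextIO) -> Iterator[Any]:
    """Parse a stream that holds one complete JSON value per line."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample: two records separated by a blank line.
    sample = io.StringIO('{"id": 1}\n\n{"id": 2, "tags": ["a", "b"]}\n')
    for record in iter_json_lines(sample):
        print(record["id"])
```

This does not help with a single large JSON document that spans many lines; that case still needs an event-driven parser such as the one linked above.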

Convert a Text file in to ARFF Format

I know how to convert a set of text or web page files into an ARFF file using TextDirectoryLoader.
I want to know how to convert a single text file into an ARFF file.
Any help will be highly appreciated.
Please be more specific. Anyway:
If the text in the file corresponds to a single document (that is, a
single instance), then all you need to do is replace all "new lines"
with the escape code \n so that the full text is on a single line,
then manually format it as an ARFF file with a single text attribute
and a single instance.
If the text corresponds to several instances (e.g. documents), then I
suggest writing a script to break it into several files and applying
TextDirectoryLoader. If there is any specific formatting (e.g.
instances are enclosed in XML tags), you can either do the same (by
taking advantage of the XML format), or write a custom Loader
class in WEKA to recognize your format and build an Instances object.
If you post an example, it would be easier to get a more precise suggestion.
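For the single-document case, the manual steps can be sketched as a small script. This is a minimal Python sketch under stated assumptions: the relation name `document`, the attribute name `text`, and the helper names are all hypothetical, and the set of escaped characters is a guess at what is needed for a quoted ARFF string, so check the result against your WEKA version.

```python
def escape_arff_string(text: str) -> str:
    """Escape a value so it fits on one line inside single quotes."""
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ("'", "\\'"),
        ("\r", "\\r"),
        ("\n", "\\n"),
        ("\t", "\\t"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


def text_to_arff(text: str, relation: str = "document") -> str:
    """Build ARFF content with one string attribute and one instance."""
    return "\n".join([
        f"@relation {relation}",
        "",
        "@attribute text string",
        "",
        "@data",
        f"'{escape_arff_string(text)}'",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical two-line document containing a quote character.
    print(text_to_arff("First line\nIt's the second line"))
```

To convert a real file, read its contents into a string, pass the string to `text_to_arff`, and write the result to a file with the .arff extension.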

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use Mahout. I'm not a Java programmer, however, so I'm trying to stay away from having to use the Java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and uninstructive. Exactly what does specifying a regex option do, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list uses regexconverter to convert HTTP log requests to sequence files, I believe. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regex to match each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any need for Java coding.
Incidentally, the arff.vector command seems to allow me to convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert CSV log files into. Should I use this method instead and skip the sequence file step completely?
Thanks for the help.
