I have a text file that uses a two-character delimiter (#*), and one of the fields contains multiline values. Example:
test#*123#*"contain
multiline"
test#*321#*"contain
multiline"
Those are actually 2 rows, but in the text file they span 4 lines. The approach I was trying is to retrieve the files with FileIO and then use a ParDo to open each file, look at the last character of a line, and if the line does not end with " then append the next line to it. My concern is that Beam processes the file in bundles, so if the 2 lines are not in the same bundle this will fail.
Is my understanding correct? Please let me know the best way to handle this.
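For illustration, here is a rough sketch of the approach described above using the Beam Python SDK (the file pattern and the closing-quote heuristic are assumptions, not part of the original question): read each matched file whole with fileio.ReadMatches() and split it into logical records in one function, so Beam's bundling afterwards happens per record rather than per physical line.

import apache_beam as beam
from apache_beam.io import fileio

def split_records(readable_file):
    # Split one whole file into logical records: a record is considered
    # complete when its last field's closing quote appears at end of line.
    record = []
    for line in readable_file.read_utf8().splitlines():
        record.append(line)
        if line.rstrip().endswith('"'):
            yield '\n'.join(record)
            record = []
    if record:
        # trailing lines without a closing quote; emit rather than drop them
        yield '\n'.join(record)

with beam.Pipeline() as pipeline:
    records = (
        pipeline
        | fileio.MatchFiles('input/*.txt')   # hypothetical file pattern
        | fileio.ReadMatches()               # yields whole-file handles
        | beam.FlatMap(split_records)        # downstream bundles hold complete records
    )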
I am developing an integration in IIB, and one of the requirements for the output (multiple CSV files) is a comma delimiter instead of a semicolon. The semicolon is used on the input. I'm using two mapping nodes to produce separate files from one input, but I'm struggling to find an option for the delimiter.
There are two mapping nodes that use XSD schemas and .map files to produce the output.
The first mapping creates a canonical DFDL format that is ready to be parsed into multiple files in the second mapping node.
There is not much code, just setup in IIB.
I would like to produce comma-separated CSV instead of semicolon-separated.
Thanks in advance
I found a solution: you can simply view and edit the XSD code in a text editor and change the delimiter there.
I have been trying to import a dataset into AutoML Natural Language Text Classification. However, the UI gave me an error of "Invalid row in CSV file", with error details: "FILE_TYPE_NOT_SUPPORTED".
I am uploading a CSV file; what should I do?
Please make sure there are no hidden quotes in your dataset. The complete requirements can be found on the “Preparing your training data” page. A rough sanity-check sketch follows the list below.
Common .csv errors:
Using Unicode characters in labels. For example, Japanese characters are not supported.
Using spaces and non-alphanumeric characters in labels.
Empty lines.
Empty columns (lines with two successive commas).
Missing quotes around embedded text that includes commas.
Incorrect capitalization of Cloud Storage text paths.
Incorrect access control configured for your text files. Your service account should have read or greater access, or files must be publicly-readable.
References to non-text files, such as JPEG files. Likewise, files that are not text files but that have been renamed with a text extension will cause an error.
The URI of a text file points to a different bucket than the current project. Only files in the project bucket can be accessed.
Non-CSV-formatted files.
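As a rough illustration, here is a hypothetical sanity check for a simple two-column "text,label" CSV (the file name and layout are assumptions); it flags empty lines, empty columns, and labels containing spaces or non-alphanumeric characters:

import csv
import re

def check_csv(path):
    # Hypothetical checks against some of the common errors listed above.
    problems = []
    with open(path, newline='', encoding='utf-8') as f:
        for i, row in enumerate(csv.reader(f), start=1):
            if not row:
                problems.append(f"line {i}: empty line")
                continue
            if any(field.strip() == '' for field in row):
                problems.append(f"line {i}: empty column")
            label = row[-1]
            if not re.fullmatch(r'[A-Za-z0-9_]+', label):
                problems.append(f"line {i}: label {label!r} uses spaces or non-alphanumeric characters")
    return problems

for issue in check_csv('training_data.csv'):   # hypothetical file name
    print(issue)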
I am facing the below pipe-delimiter issue in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2 |Col3
1 |A/C No|2015
2 |A|C No|2016
Because of the pipe embedded within a field value, SSIS is failing to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, having field values surrounded by double quotes) in addition to the Field Delimiter (pipe). The Text Delimiter will help because a program (like SSIS) can tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.
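To illustrate the point with Python's csv module (quotes added to the sample rows by hand, the way a producer using a text delimiter would write them), a parser can then keep the embedded pipe inside a single field:

import csv
import io

# Same rows as in the question, but with a text delimiter (double quotes) added.
sample = io.StringIO('Col1|"Col2"|Col3\n1|"A/C No"|2015\n2|"A|C No"|2016\n')

for row in csv.reader(sample, delimiter='|', quotechar='"'):
    print(row)   # the pipe inside "A|C No" stays within one field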
I have two FASTA files (one has about 50,000 sequences and the other has 150,000) with two kinds of header formats. I want to replace sequences of interest in one file based on the header name (I have two lists of headers, one per FASTA file, in txt format). Could you please advise me on what I should do?
For example, the header formats for files 1 and 2 are >contig10002|m.12543 and >c26528_g1_i1|m.14066, respectively, and I want to replace the sequence of >c26528_g1_i1|m.14066 in file 2 with the sequence of >contig10002|m.12543 from file 1.
Thanks in advance
One suggestion is to use BioPython. It can parse FASTA files and format them, and it can possibly handle headers in different formats.
For example, here is how you can read a FASTA file and loop over the IDs:
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open('file1.fasta'), 'fasta')
for fasta in fasta_sequences:
    # do something with fasta.id, e.g. c26528_g1_i1|m.14066
    print(fasta.id)
Here is how you would write a fasta record:
with open(output_file, 'w') as output_handle:
    for fasta in fasta_sequences:
        SeqIO.write([fasta], output_handle, "fasta")
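Putting the two together, here is a rough sketch of the full replacement. It assumes the two header lists are line-aligned (line N of headers2.txt names the file-2 record whose sequence should come from the file-1 header on line N of headers1.txt), and all file names are placeholders:

from Bio import SeqIO

# Map file-2 headers to file-1 headers, stripping any leading '>'.
with open('headers1.txt') as h1, open('headers2.txt') as h2:
    pairs = {b.strip().lstrip('>'): a.strip().lstrip('>')
             for a, b in zip(h1, h2)}

# Index file 1 so sequences can be looked up by ID without loading everything.
file1 = SeqIO.index('file1.fasta', 'fasta')

def replaced(records):
    for rec in records:
        if rec.id in pairs:                      # e.g. c26528_g1_i1|m.14066
            rec.seq = file1[pairs[rec.id]].seq   # swap in the file-1 sequence
        yield rec

with open('file2_replaced.fasta', 'w') as out:
    SeqIO.write(replaced(SeqIO.parse('file2.fasta', 'fasta')), out, 'fasta')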
You might want to start by reading the BioPython Tutorial and Cookbook.
I know how to convert a set of text or web page files into an ARFF file using TextDirectoryLoader.
I want to know how to convert a single text file into an ARFF file.
Any help will be highly appreciated.
Please be more specific. Anyway:
If the text in the file corresponds to a single document (that is, a single instance), then all you need is to replace all newlines with the escape code \n so that the full text is on one line, then manually format it as an ARFF file with a single text attribute and a single instance (a minimal sketch of this is below).
If the text corresponds to several instances (e.g. documents), then I suggest writing a script to break it into several files and applying TextDirectoryLoader. If there is any specific formatting (e.g. instances are enclosed in XML tags), you can either do the same (taking advantage of the XML structure) or write a custom Loader class in WEKA to recognize your format and build an Instances object.
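For the single-document case, a minimal sketch might look like this (file names and the relation name are placeholders):

def text_to_arff(text_path, arff_path, relation='single_doc'):
    # Write one text file as an ARFF with a single string attribute and one instance.
    with open(text_path, encoding='utf-8') as f:
        text = f.read()
    # Put the whole document on one line and escape backslashes and quotes.
    one_line = (text.replace('\\', '\\\\')
                    .replace('\n', '\\n')
                    .replace("'", "\\'"))
    with open(arff_path, 'w', encoding='utf-8') as out:
        out.write(f'@relation {relation}\n\n')
        out.write('@attribute text string\n\n')
        out.write('@data\n')
        out.write(f"'{one_line}'\n")

text_to_arff('document.txt', 'document.arff')   # hypothetical file names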
If you post an example, it would be easier to get a more precise suggestion.