DataStage: defining ; as a delimiter and !; as not a delimiter

I have a text data file that looks like the following:
1;2;3;4;5
1;2;3!;4;4;5
I expect the output to look like this after reading the sequential file:
1 2 3 4 5
1 2 34 4 5
Since DataStage only lets me define what the delimiter is, it doesn't recognize that !; is not a delimiter.
Could someone tell me how I can overcome this problem?

One option could be to import each line as a single column, remove the "!;" in a Transformer, and then use a Column Import stage to divide it up into columns.

Read data as a single string. Convert "!;" to "" using Ereplace() or Change() function. Then parse using Transformer loop or Column Import stage.
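To make the logic of that answer concrete, here is a minimal sketch in Python rather than DataStage. It assumes, as the expected output in the question suggests, that the escape sequence "!;" is dropped entirely before splitting. The function name parse_line and its parameters are hypothetical, chosen only for illustration.

```python
def parse_line(line, delimiter=";", escape="!"):
    """Drop every escaped delimiter ("!;"), then split on the real one."""
    # Step 1: treat the whole line as a single string and remove "!;"
    cleaned = line.replace(escape + delimiter, "")
    # Step 2: split what is left on the plain delimiter
    return cleaned.split(delimiter)


print(parse_line("1;2;3;4;5"))      # ['1', '2', '3', '4', '5']
print(parse_line("1;2;3!;4;4;5"))   # ['1', '2', '34', '4', '5']
```

In DataStage the first step would correspond to the Ereplace() or Change() call and the second to the Column Import stage.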

Related

How can I split a string and sum all numbers from that string?

I'm making a list for buying groceries in Google Sheets and have the following value in cell B4.
0.95 - Lemon Juice
2.49 - Pringle Chips
1.29 - Baby Carrots
9.50 - Chicken Kebab
What I'm trying to do is split using the dash character and combine the costs (0.95+2.49+1.29+9.50).
I've tried to use Index(SPLIT(B22,"-"), 7) and SPLIT(B22,"-") but I don't know how to use only numbers from the split string.
Does someone know how to do this? Here's a sample sheet.
Answer
The following formula should produce the result you desire:
=SUM(ARRAYFORMULA(VALUE(REGEXEXTRACT(SPLIT(B4,CHAR(10)),"(.*)-"))))
Explanation
The first thing to do is to split the entry in B4 into its component parts. This is done by using the =SPLIT function, which takes the text in B4 and returns a separate result every time it encounters a specific delimiter. In this case, that is =CHAR(10), the newline character.
Next, all non-number information needs to be removed. This is relatively easy in your sample data because the numbers always appear to the left of a dash. =REGEXEXTRACT uses a regular expression to only return the text to the left of the dash.
Before the numbers can be added together, however, they must be converted to be in a number format. The =VALUE function is used to convert each result from a text string containing a number to an actual number.
All of this is wrapped in an =ARRAYFORMULA so that =VALUE and =REGEXEXTRACT parse each returned value from =SPLIT, rather than just the first.
Finally, all results are added together using =SUM.
Functions used:
=CHAR
=SPLIT
=REGEXEXTRACT
=VALUE
=ARRAYFORMULA
=SUM
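For readers who want to check the steps outside of Sheets, the same pipeline (split on newlines, extract the leading number, convert, sum) can be sketched in Python. This is only an illustration under the assumption that every line starts with its cost; sum_costs is a hypothetical name, and Decimal is used here to avoid floating-point rounding in the total.

```python
import re
from decimal import Decimal


def sum_costs(cell_text):
    """Sum the number at the start of each line of a multi-line cell."""
    total = Decimal("0")
    for line in cell_text.split("\n"):          # like SPLIT(B4, CHAR(10))
        match = re.match(r"\s*(-?\d*\.?\d+)", line)  # like REGEXEXTRACT
        if match:
            total += Decimal(match.group(1))    # like VALUE, then SUM
    return total


cell = (
    "0.95 - Lemon Juice\n"
    "2.49 - Pringle Chips\n"
    "1.29 - Baby Carrots\n"
    "9.50 - Chicken Kebab"
)
print(sum_costs(cell))  # 14.23
```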
First, you can wrap each number in commas with the following formula:
REGEXREPLACE(B4,"([0-9\.]+)",",$1,")
Then split the result on the comma:
SPLIT(A8, ",")
Try the formula below (see your sheet):
=SUM(ArrayFormula(--REGEXEXTRACT(SPLIT(B4,CHAR(10)),"-*\d*\.?\d+")))

Split or tokenize within Stata program with using statement?

I am trying to use a program to speed up a repetitive Stata task. This is the first part of my program:
program alphaoj
syntax [varlist] , using(string) occ_level(integer) ind_level(integer)
import excel `using', firstrow
display "`using'"
split "`using'", parse(_)
local year = `2'
display "`year'"
display `year'
When I run this program with the line alphaoj, ind_level(4) occ_level(5) using("nat4d_2002_dl.xls"), I receive the error "factor-variable and time-series operators not allowed r(101);".
I am not sure what is being treated as a factor-variable or time-series operator.
I have also tried replacing split with tokenize, and the parse option with parse("_"), but I still run into errors. In that case, it says "_ not found r(111);".
Ideally, I would have it take the year from the filename and use that year as the local.
I am struggling with how I should perform this seemingly simple task.
An error is returned because the split command only accepts string variables. You can't pass a string directly to it. See help split for more details.
You can achieve your goal of extracting the year from the filename and storing that as a local macro. See below:
program alphaoj
syntax [varlist], using(string)
import excel `using', firstrow
gen stringvar = "`using'"
split stringvar, parse(_)
local year = stringvar2
display `year'
end
alphaoj, using("nat4d_2002_dl.xls")
The last line prints "2002" to the console.
Alternative solution that avoids creating an extra variable:
program alphaoj
syntax [varlist], using(string)
import excel `using', firstrow
local year = substr("`using'",7,4)
di `year'
end
alphaoj, using("nat4d_2002_dl.xls")
Please note that this solution is reliant on the Excel files all having the exact same character structure.

Google-spreadsheets automatically cancels the code's indentation, how to recover?

I use Google Sheets and want to copy Python code into a spreadsheet, but I find that Google Sheets automatically removes the code's indentation.
My code:
import pandas as pd
import csv
rs = pd.read_csv(r'D:/Clustering_TOP.csv',encoding='utf-8')
with open('D:/Clustering_TOP.csv','r') as csvfile:
    reader = csv.reader(csvfile)
    rows = [row for row in reader]
    csv_title = rows[0]
    csv_title = csv_title[1:]
    len_csv_title = len(csv_title)
    for i in range(len_csv_title):
        for j in range(i+1):
            print(str(csv_title[j])+'_'+str(csv_title[i]) + " = " + str(rs[csv_title[i]].corr(rs[csv_title[j]])), end='\t')
        print()
When I paste the code into Google Docs, it turns into this:
And there is no indentation in the paste option.
How can I keep the indentation of the code?
How about this workaround? Your title asks how to recover the indentation. Do you want to add indentation back to a script that has already been pasted without it, or do you want to keep the indentation when you paste the script? If it is the latter, I think there are two patterns, so please choose one of them. There are several possible workarounds, so please think of this as just one of them.
Pattern 1
Paste the script line by line.
Pattern 2
Paste the script in the cell "A1".
Paste this formula in the cell "A2".
=TRANSPOSE(SPLIT(A1, char(10)))
You can then get just the script, with its indentation, by copying the resulting cells and pasting them as values only.
If this is not what you wanted, I'm sorry.

FastqGeneralIterator Output

I'm using FastqGeneralIterator, but I find that it removes the @ from the 1st line of each FASTQ record and also the information on the 3rd line (it removes the entire 3rd line).
I added the @ back to the 1st line in the following way:
for line in open("prova_FiltraN_CE_filt.fastq"):
    fout.write(line.replace('SEQ', '@SEQ'))
I also want to add the 3rd line, which starts with + and has nothing after it. For example:
@SEQILMN0
TCATCGTA....
+
#<BBBFFF.....
Can someone help me?
You can use the string formatting operator %:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
with open("prova_FiltraN_CE_filt.fastq", "r") as handle:
    for (title, sequence, quality) in FastqGeneralIterator(handle):
        print("@%s\n%s\n+\n%s" % (title, sequence, quality))
This prints each record in FASTQ format using FastqGeneralIterator:
@SEQILMN0
TCATCGTA....
+
#<BBBFFF....
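If the records need to go to a file rather than the console, the same formatting can be wrapped in a small helper. The sketch below is a hedged illustration: format_fastq_record and write_fastq are hypothetical names, and the code deliberately accepts any iterable of (title, sequence, quality) tuples so that it runs without Biopython installed. FASTQ title lines begin with @, and the third line is written as a bare +.

```python
import io


def format_fastq_record(title, sequence, quality):
    """Return one four-line FASTQ record, restoring '@' and the bare '+' line."""
    return "@%s\n%s\n+\n%s\n" % (title, sequence, quality)


def write_fastq(records, handle):
    """Write (title, sequence, quality) tuples to handle; return the count."""
    count = 0
    for title, sequence, quality in records:
        handle.write(format_fastq_record(title, sequence, quality))
        count += 1
    return count


# Stand-in data; in practice the tuples would come from FastqGeneralIterator.
demo_records = [("SEQILMN0", "TCATCGTA", "#<BBBFFF")]
buffer = io.StringIO()
written = write_fastq(demo_records, buffer)
print(buffer.getvalue(), end="")
```

With Biopython available, passing FastqGeneralIterator(handle) as the records argument should work the same way, since it yields tuples of that shape.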

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon, it is "escaped" by wrapping it in double quotes, like this: ";".
I have been assured that there will never be two adjacent fields with trailing/leading double quotes, so this format should technically be OK.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
One approach I've used for this type of data before is to just split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
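The split-then-merge idea described above could be sketched in Python roughly as follows. This is an illustration, not VBScript, and split_and_merge is a hypothetical name; it relies on the asker's assurance that genuine adjacent fields never end and start with a double quote.

```python
def split_and_merge(line, sep=";", quote='"'):
    """Split on every separator, then rejoin pieces around an escaped one."""
    parts = line.split(sep)
    fields = [parts[0]]
    for part in parts[1:]:
        if fields[-1].endswith(quote) and part.startswith(quote):
            # The separator between these two pieces was escaped as ";":
            # drop both quotes and put the separator back into the field.
            fields[-1] = fields[-1][:-1] + sep + part[1:]
        else:
            fields.append(part)
    return fields


print(split_and_merge('Pax;is;a;good;guy";" so;says;his;wife.'))
# ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```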
In pseudo-code, given:
input: A string; the first character is input[0] and the last
       character is input[length-1]. Further, assume one dummy
       character, input[length]. It can be anything except
       ; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field in the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length-1:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
        found field consisting of half-open range [start,c); do whatever
        with it. Note that in the case of empty fields, start≥c, leaving
        an empty range
        set start = c+1
    endif
end foreach
the last field is the half-open range [start,length); do whatever with it
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
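A runnable version of this single-pass scan might look like the Python sketch below. It is an interpretation of the pseudo-code rather than a translation of it: bounds checks replace the dummy character and the input[0] special case, and the final replace step (turning ";" back into ; inside each field) is an added assumption about what the caller wants. The name split_escaped is hypothetical.

```python
def split_escaped(line, sep=";", quote='"'):
    """Split on sep, treating quote + sep + quote as an escaped separator."""
    fields = []
    start = 0
    n = len(line)
    for c, ch in enumerate(line):
        if ch != sep:
            continue
        # Bounds checks stand in for the dummy character and input[0] case.
        prev_is_quote = c > 0 and line[c - 1] == quote
        next_is_quote = c + 1 < n and line[c + 1] == quote
        if prev_is_quote and next_is_quote:
            continue  # part of the escape sequence ";", not a separator
        fields.append(line[start:c])  # half-open range [start, c)
        start = c + 1
    fields.append(line[start:])  # the last field runs to the end of the line
    # Assumption: unescape ";" back to a plain separator inside each field.
    return [f.replace(quote + sep + quote, sep) for f in fields]


print(split_escaped('Pax;is;a;good;guy";" so;says;his;wife.'))
# ['Pax', 'is', 'a', 'good', 'guy; so', 'says', 'his', 'wife.']
```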
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
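using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
    // keep everything up to and including the character before the separator
    ret.Add(line.Substring(0, m.Index + 1));
    // continue with the text after the separator
    line = line.Substring(m.Index + 2);
}
// whatever remains is the last field
ret.Add(line);
(Sorry about the C#, I don't know VBScript)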
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob
