Google Sheets SPLIT: remove characters and unwanted words

I have this data as a sample in a column:
3 PACK BAG 1500 ML CONTAIN 600 ML AMINO ACID, 600 ML GLUCOSE, 300 ML LIPID EMULSION
I am using this formula to remove unwanted characters: =SPLIT(A2:A,"1234567890-=[]\;',./!##$%^&*()")
So it returns me:
PACK BAG ML C NTAIN ML AMIN ACID ML GLUC SE ML LIPID EMULSI N
Now I would like to extend my formula =SPLIT(A2:A,"1234567890-=[]\;',./!##$%^&*()") so that it also removes "MC", "C", or "SE".
How can I update my SPLIT formula to remove specific strings of characters (words)?

=SPLIT(REGEXREPLACE(A2:A, "(MC|C|SE)", " "),"1234567890-=[]\;',./!##$%^&*()")
You could pre-process your string with REGEXREPLACE, substituting a single character (e.g. a space) for these specific words before applying the SPLIT function. Note that the pattern is unanchored, so it also matches these letters inside longer words.
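The two-step idea (replace the unwanted words first, then split on every delimiter character) can be sketched outside the spreadsheet. This is a minimal Python illustration, not the Sheets implementation: `strip_words_then_split` is a hypothetical helper, and trimming and dropping empty pieces is an assumption made here to keep the output tidy.

```python
import re

# Delimiter characters copied from the formula in the question.
DELIMITERS = "1234567890-=[]\\;',./!##$%^&*()"

def strip_words_then_split(text, words=("MC", "C", "SE"), delimiters=DELIMITERS):
    # Step 1 (stand-in for REGEXREPLACE): replace each unwanted word with a space.
    pattern = "|".join(re.escape(word) for word in words)
    cleaned = re.sub(pattern, " ", text)
    # Step 2 (stand-in for SPLIT): split on any single delimiter character.
    splitter = "[" + re.escape(delimiters) + "]"
    pieces = re.split(splitter, cleaned)
    # Trim each piece and drop empty ones (an assumption for readability).
    return [piece.strip() for piece in pieces if piece.strip()]

print(strip_words_then_split("3 PACK BAG 1500 ML GLUCOSE"))
# → ['PA K BAG', 'ML GLU O']
```

The output shows the side effect of an unanchored pattern: the "C" in "PACK" and "GLUCOSE" is replaced too, which may or may not be what you want.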

Related

How can I split a string and sum all numbers from that string?

I'm making a list for buying groceries in Google Sheets and have the following value in cell B4.
0.95 - Lemon Juice
2.49 - Pringle Chips
1.29 - Baby Carrots
9.50 - Chicken Kebab
What I'm trying to do is split using the dash character and combine the costs (0.95+2.49+1.29+9.50).
I've tried to use Index(SPLIT(B22,"-"), 7) and SPLIT(B22,"-") but I don't know how to use only numbers from the split string.
Does someone know how to do this? Here's a sample sheet.
Answer
The following formula should produce the result you desire:
=SUM(ARRAYFORMULA(VALUE(REGEXEXTRACT(SPLIT(B4,CHAR(10)),"(.*)-"))))
Explanation
The first thing to do is to split the entry in B4 into its component parts. This is done by using the =SPLIT function, which takes the text in B4 and returns a separate result every time it encounters a specific delimiter. In this case, that is =CHAR(10), the newline character.
Next, all non-number information needs to be removed. This is relatively easy in your sample data because the numbers always appear to the left of a dash. =REGEXEXTRACT uses a regular expression to only return the text to the left of the dash.
Before the numbers can be added together, however, they must be converted to be in a number format. The =VALUE function is used to convert each result from a text string containing a number to an actual number.
All of this is wrapped in an =ARRAYFORMULA so that =VALUE and =REGEXEXTRACT parse each returned value from =SPLIT, rather than just the first.
Finally, all results are added together using =SUM.
Functions used:
=CHAR
=SPLIT
=REGEXEXTRACT
=VALUE
=ARRAYFORMULA
=SUM
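The same pipeline (split on newlines, keep the text left of the dash, convert to a number, sum) can be sketched in Python to make each stage visible. This is only an illustration of the logic; `sum_costs` is a hypothetical name, and the sample text is the cell content from the question.

```python
import re

def sum_costs(cell_text):
    total = 0.0
    # SPLIT(B4, CHAR(10)): one piece per line of the cell.
    for line in cell_text.split("\n"):
        # REGEXEXTRACT(..., "(.*)-"): capture everything left of the dash.
        match = re.search(r"(.*)-", line)
        if match:
            # VALUE(...): convert the captured text to a number.
            total += float(match.group(1))
    # SUM(...): the accumulated total.
    return total

groceries = (
    "0.95 - Lemon Juice\n"
    "2.49 - Pringle Chips\n"
    "1.29 - Baby Carrots\n"
    "9.50 - Chicken Kebab"
)
print(round(sum_costs(groceries), 2))  # → 14.23
```

Because `(.*)` is greedy, an item name that itself contains a dash would break the capture; the sample data has no such case.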
First, you can wrap each number in commas with the following formula:
REGEXREPLACE(B4,"([0-9\.]+)",",$1,")
Then split the result on the comma:
SPLIT(A8, ",")
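A rough Python sketch of this wrap-then-split approach follows. It assumes item names contain no digits, dots, or commas, and it adds a filtering step (keeping only the pieces that parse as numbers) that the formulas above leave to the reader; `numbers_via_comma_wrap` is a hypothetical name.

```python
import re

def numbers_via_comma_wrap(cell_text):
    # REGEXREPLACE(B4, "([0-9\.]+)", ",$1,"): surround each number with commas.
    wrapped = re.sub(r"([0-9.]+)", r",\1,", cell_text)
    # SPLIT(..., ","): break the text apart at every comma.
    pieces = wrapped.split(",")
    numbers = []
    for piece in pieces:
        try:
            numbers.append(float(piece))
        except ValueError:
            # Not a number (an item name or an empty piece), so skip it.
            pass
    return numbers

print(numbers_via_comma_wrap("0.95 - Lemon Juice\n2.49 - Pringle Chips"))
# → [0.95, 2.49]
```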
Try the formula below (see your sheet):
=SUM(ArrayFormula(--REGEXEXTRACT(SPLIT(B4,CHAR(10)),"-*\d*\.?\d+")))

Is there a function to split cells that will ignore delimiters within quotation marks?

I am trying to separate CSV text into columns using a formula in Google Sheets, but when I do, it also splits on commas that sit inside quoted strings.
https://docs.google.com/spreadsheets/d/1bqG82qVNv8_VaSarVHFJ4dn78khf_2nNajRe9-ulXL4/edit?usp=sharing
For example when I use
=split(A1,","):
119,"6.65","","en","Ezuri, Renegade Leader","6734497c-16f0-4c4b-ba24-337333511fc6","1","rare","e9544132-bbb5-4ec4-af82-dad56e5091af","som","Scars of Mirrodin"
Gets turned into:
119
6.65
en
Ezuri
Renegade Leader
6734497c-16f0-4c4b-ba24-337333511fc6
1
rare
e9544132-bbb5-4ec4-af82-dad56e5091af
som
Scars of Mirrodin
Any help would be much appreciated!
=SPLIT(REGEXREPLACE(A1,"""?,""","🤑"),"🤑")
Steps:
Replace "," and ," with 🤑:
"119🤑6.65🤑🤑en🤑Ezuri, Renegade Leader🤑6734497c-16f0-4c4b-ba24-337333511fc6🤑1🤑rare🤑e9544132-bbb5-4ec4-af82-dad56e5091af🤑som🤑Scars of Mirrodin"
and split by 🤑:
119
6.65
en
Ezuri, Renegade Leader
6734497c-16f0-4c4b-ba24-337333511fc6
1
rare
e9544132-bbb5-4ec4-af82-dad56e5091af
som
Scars of Mirrodin"
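For comparison, here is a small Python sketch of the same replace-then-split trick next to Python's standard `csv` module, which understands quoted fields natively. The `csv` variant is a different technique from the formula above, shown only to illustrate what a quote-aware split returns; both function names are hypothetical, and dropping empty pieces in the first function is an assumption chosen to match the output listed above.

```python
import csv
import re

SEPARATOR = "🤑"  # placeholder character used in the answer

def split_like_answer(line):
    # Replace `","` and `,"` with the placeholder, then split on it.
    replaced = re.sub(r'"?,"', SEPARATOR, line)
    # Drop empty pieces, as in the output listed above.
    return [piece for piece in replaced.split(SEPARATOR) if piece]

def split_with_csv_module(line):
    # csv.reader honours quotes, so commas inside quoted fields are kept.
    return next(csv.reader([line]))

sample = '119,"6.65","","en","Ezuri, Renegade Leader","som","Scars of Mirrodin"'
print(split_like_answer(sample))
print(split_with_csv_module(sample))
```

The regex version leaves the final closing quote attached to the last field, while the `csv` version strips all quotes and keeps the empty field.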

Extracting numbers with REGEXEXTRACT that might have a comma or dot

I have a list of numbers in a few formats that may or may not include a dot and a comma. The numbers are locked in a string. For example:
hello 1,000 goodbye
hola 2,000.12 ciao
Hallo 3000.00 Auf Wiedersehen
How can I extract the numbers?
I don't care if the comma is added but the dot is obviously important.
I need the regular_expression to be used in REGEXEXTRACT (and the rest of the REGEX formulas).
The output should be:
1000
2000.12
3000.00
Supposing that your raw data is in A2:A, use this in B2 (or the second cell) of an otherwise empty column:
=ArrayFormula(IF(A2:A="",,IFERROR(VALUE(REGEXEXTRACT(A2:A,"\d[\d,\.]*\d")))))
The REGEX portion reads, in plain English, "Extract any portion that starts with a digit followed by any number of digits, commas or periods (or none of these) and ends with a digit."
You will likely want to apply Format > Number > Currency to the results column.
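The pattern can be tried out in Python, whose `re` module accepts the same expression. This is a sketch only: `extract_number` is a hypothetical helper, and the commas are stripped explicitly here because Python's `float` does not accept thousands separators.

```python
import re

# Starts with a digit, then any run of digits, commas, or dots, ends with a digit.
NUMBER_PATTERN = re.compile(r"\d[\d,.]*\d")

def extract_number(text):
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return None  # no number of two or more digits found
    # Remove thousands separators before converting.
    return float(match.group(0).replace(",", ""))

for row in ["hello 1,000 goodbye", "hola 2,000.12 ciao", "Hallo 3000.00 Auf Wiedersehen"]:
    print(extract_number(row))
# → 1000.0
# → 2000.12
# → 3000.0
```

Since the pattern requires both a leading and a trailing digit, a lone single-digit number such as "5" is not matched.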

Using Stata's esttab, add dollar sign to cell format, export to Latex

My advisor wants me to add dollar signs to my table of summary statistics. I generate this table and export it to Latex using Stata's esttab command.
I need to 1) Add dollar signs to some of the results cells (not all) and 2) Make sure that Latex can handle the dollar signs.
I think that 2 might be accomplished using the substitute option, but I can't figure out how to do 1. Here is some minimal code that I am trying to use to solve this problem.
sysuse auto, clear
estpost summarize price mpg weight length if foreign==0
est store A
estpost summarize price mpg weight length if foreign==1
est store B
esttab A B using $root/Outputs/test.tex, replace /// //a file path on my machine
cells("mean (fmt(%9.0fc %9.2fc %9.0fc))" "sd(par fmt(%9.0fc %9.2fc %9.0fc))") ///
mtitle("Domestic" "Foreign") ///
mgroups("Type", pattern(1 0) prefix(\multicolumn{#span}{c}{) suffix(}) span erepeat( \cmidrule(lr){#span})) ///
nonumber booktabs f label collabels(none)
eststo clear
This produces:
&\multicolumn{2}{c}{Type} \\\cmidrule(lr){2-3}
&\multicolumn{1}{c}{Domestic}&\multicolumn{1}{c}{Foreign}\\
\midrule
Price & 6,072& 6,385\\
& (3,097)& (2,622)\\
Mileage (mpg) & 19.83& 24.77\\
& (4.74)& (6.61)\\
Weight (lbs.) & 3,317& 2,316\\
& (695)& (433)\\
Length (in.) & 196& 169\\
& (20)& (14)\\
\midrule
Observations & 52& 22\\
I'd like to get it so the output would have \$ in front of the 6,072 and the 6,385
I see some discussion on the Statalist regarding workarounds for graphs, but nothing for esttab. Someone also mentions creating "custom formats" but I can't seem to find documentation on that anywhere.
I had a similar problem once: I wanted to color cells differently depending on the significance level. In the end, the easiest automated solution I could come up with was modifying the esttab code... That is easier done than it sounds, in fact.
Look for the following code in estout.ado:
if `:length local thevalue'<245 {
local thevalue: di `fmt_m' `"`macval(thevalue)'"'
}
Just after that you can insert, e.g.
local thevalue `"\$`macval(thevalue)'\$"'
That would produce:
&\multicolumn{2}{c}{Type} \\\cmidrule(lr){2-3}
&\multicolumn{1}{c}{Domestic}&\multicolumn{1}{c}{Foreign}\\
\midrule
Price &$ 6,072$&$ 6,385$\\
&$ (3,097)$&$ (2,622)$\\
Mileage (mpg) &$ 19.83$&$ 24.77$\\
&$ (4.74)$&$ (6.61)$\\
Weight (lbs.) &$ 3,317$&$ 2,316$\\
&$ (695)$&$ (433)$\\
Length (in.) &$ 196$&$ 169$\\
&$ (20)$&$ (14)$\\
\midrule
Observations & 52& 22\\
(Don't forget to program drop estout before exporting, so that the .ado reloads)
So all the numeric values in the main table are wrapped in $ signs. If you want only specific values, you can add a simple string condition. E.g., if you care about capturing only those values that contain a comma (for whatever reason), you can do:
if strpos("`macval(thevalue)'", ",") {
local thevalue `"\$`macval(thevalue)'\$"'
}
You can also add your own option (near the beginning of estout.ado), so that the modified behaviour is not triggered all the time.

pyBrain using letters as input

With pybrain, it's not possible to use letters as input in a dataset. For example, if I do this:
from pybrain.datasets import ClassificationDataSet
ds = ClassificationDataSet(2)
ds.addSample(('a','b'),1)
I get:
ValueError: could not convert string to float: a
Does it make sense to convert each letter to an integer and make those integers be the features for pybrain? For example, the letter a would be 1 and the letter z would be 26.
My concern with this is that there is 0 relation between letters, and I'm not sure whether a number replacing each position in the string would be incorrectly treated as greater/less quantities of some feature by the neural network.
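One common way to avoid implying an order between letters is one-hot encoding, where each letter becomes a vector with a single 1. The sketch below is plain Python and assumes only the letters a to z occur; `one_hot_letter` and `encode_letters` are hypothetical helpers, not part of pybrain.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: inputs are the letters a-z

def one_hot_letter(letter):
    # 26 zeros with a single 1.0 at the letter's position.
    vector = [0.0] * len(ALPHABET)
    vector[ALPHABET.index(letter.lower())] = 1.0
    return vector

def encode_letters(letters):
    # Concatenate one vector per input position.
    features = []
    for letter in letters:
        features.extend(one_hot_letter(letter))
    return features

features = encode_letters(("a", "b"))
print(len(features))  # → 52
```

With this encoding, the two-letter sample from the question would need a dataset with 52 input features instead of 2, and no letter is numerically "greater" than another.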
