How to extract number from text in Google Sheets - google-sheets

My product names contain different sizes: e.g. 100ml, 50ml, 30ml.
Sample product name:
Babaria Aloe Vera Shaving Gel Sensitive Skin 200ml
From this I try extract capacity: 200ml.
To make it easier to do this, I am sharing a document.
https://drive.google.com/file/d/1gPuNiNwvK1bG5WxD-4kFgrYjSF72SkwX/view?usp=sharing
Could somebody help me and give me some example?
If it's hard to extract numbers with "ml", it also might be helpful for me to extract only numbers.

You can use:
=INDEX(IF(A2:A="","",REGEXEXTRACT(A2:A,"\b\d+ml\b")))
Where \b\d+ml\b means:
\b - A word-boundary to assert position has no leading word-characters.
\d+ml - 1+ Digits followed by literally "ml".
\b - A word-boundary to assert position has no trailing word-characters.
If you just want the numbers without "ml" then try to change the pattern to \b(\d+)ml\b where the only difference is the use of a capture group which will be the result of the REGEXEXTRACT() function here.

Related

How can I split a string and sum all numbers from that string?

I'm making a list for buying groceries in Google Sheets and have the following value in cell B4.
0.95 - Lemon Juice
2.49 - Pringle Chips
1.29 - Baby Carrots
9.50 - Chicken Kebab
What I'm trying to do is split using the dash character and combine the costs (0.95+2.49+1.29+9.50).
I've tried to use Index(SPLIT(B22,"-"), 7) and SPLIT(B22,"-") but I don't know how to use only numbers from the split string.
Does someone know how to do this? Here's a sample sheet.
Answer
The following formula should produce the result you desire:
=SUM(ARRAYFORMULA(VALUE(REGEXEXTRACT(SPLIT(B4,CHAR(10)),"(.*)-"))))
Explanation
The first thing to do is to split the entry in B4 into its component parts. This is done by using the =SPLIT function, which takes the text in B4 and returns a separate result every time it encounters a specific delimiter. In this case, that is =CHAR(10), the newline character.
Next, all non-number information needs to be removed. This is relatively easy in your sample data because the numbers always appear to the left of a dash. =REGEXEXTRACT uses a regular expression to only return the text to the left of the dash.
Before the numbers can be added together, however, they must be converted to be in a number format. The =VALUE function is used to convert each result from a text string containing a number to an actual number.
All of this is wrapped in an =ARRAYFORMULA so that =VALUE and =REGEXEXTRACT parse each returned value from =SPLIT, rather than just the first.
Finally, all results are added together using =SUM.
Functions used:
=CHAR
=SPLIT
=REGEXEXTRACT
=VALUE
=ARRAYFORMULA
=SUM
Firstly you can add , symbols start and ends of numbers with below code:
REGEXREPLACE(B4,"([0-9\.]+)",",$1,")
Then split it based of , sign.
SPLIT(A8, ",")
Try below formula (see your sheet)-
=SUM(ArrayFormula(--REGEXEXTRACT(SPLIT(B4,CHAR(10)),"-*\d*\.?\d+")))

Extracting numbers with REGEXEXTRACT that might have a comma or dot

I have a list of numbers in a few formats that may or may not include a dot and a comma. The numbers are locked in a string. For example:
hello 1,000 goodbye
hola 2,000.12 ciao
Hallo 3000.00 Auf Wiedersehen
How can I extract the numbers?
I don't care if the comma is added but the dot is obviously important.
I need the regular_expression to be used in REGEXEXTRACT (and the rest of the REGEX formulas.
The output should be:
1000
2000.12
3000.00
Supposing that your raw data is in A2:A, use this in B2 (or the second cell) of an otherwise empty column:
=ArrayFormula(IF(A2:A="",,IFERROR(VALUE(REGEXEXTRACT(A2:A,"\d[\d,\.]*\d")))))
The REGEX portion reads, in plain English, "Extract any portion that starts with a digit followed by any number of digits, commas or periods (or none of these) and ends with a digit."
You will likely want to apply Format > Number > Currency to the results column.

Ruby: Split a string into substring of maximum 40 characters

I have some strings with a sentence and i need to subdivise it into a substring of maximum 40 characters.
But i don't want to split the sentence in the middle of a word.
I tried with .gsub function but it's return 40 characters maximum and avoid to cut the string in the middle of a word. But it's return only the first occurence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried with split but i can select only the fist 40 characters and split in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has
been opened."]
Do you have a solution ? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches on 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
\b is a word boundary look-around. It is a zero-width matcher, on beginning/end of words.
/mi modifiers will include case insensitive (i.e. A-Z) and multi-line matches.
One very minor note is that because this regex is matching 1 to 40 characters (rather than zero), it is possible to get a null result. (Although this is seemingly very unlikely, since you'd need a 1-word, 41+ letter string!!) To account for this edge case, call .to_s on the result if needed.
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.

Extracting text from APA citation

I have a spreadsheet containing APA citation style text and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the name of the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns the title partially - the output should show every character between ).and the following ..
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works but does not stop at the first ". " - Like with this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: Get the substring starting after the first occurrence of )., up to and including the first occurrence of . following.
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]), which greedily includes any number of characters followed by a character in the character class not (backslash or dot), which means that multiple sentences can be captured. The expression finished with \.\s, which wasn't captured, so the capture group would end before a period-then-space, rather than including it.
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, it thinks it is a negative number -2010.
For your Title try adding index split to your existing formula:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

Checking whether a string contains a phone number

Trying to work out how to parse out phone numbers that are left in a string.
e.g.
"Hi Han, this is Chewie, Could you give me a call on 02031234567"
"Hi Han, this is Chewie, Could you give me a call on +442031234567"
"Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567"
"Hi Han, this is Chewie, Could you give me a call on 0207-123-4567"
"Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
And be able to consistently replace any one of them with some other item (e.g. some text, or a link).
Am assuming it's a regex type approach (I'm already doing something similar with email which works well).
I've got to
text.scan(/([^A-Z|^"]{6,})/i)
Which leaves me a leading space I can't work out how to drop (would appreciate the help there).
Is there a standard way of doing this that people use?
It also drops things into arrays, which isn't particularly helpful
i.e. if there were multiple numbers.
[["02031234567"]["+44207-1234567"]]
as opposed to
["02031234567","+44207-1234567"]
Adding in the third use-case with spaces is difficult. I think the only way to successfully meet that acceptance criteria would be to chain a #gsub call on to your #scan.
Thus:
text.gsub(/\s+/, "").scan(/([^A-Z|^"|^\s]{6,})/i)
The following code will extract all the numbers for you:
text.scan(/(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/)
For the examples you supplied this results in:
["02031234567"]
["+442031234567"]
["+44 (0) 203 123 4567"]
["0207-123-4567"]
["02031234567", "+44207-1234567"]
To understand this regex, what we are matching is:
[\d \-+()]+ which is a sequence of one or more digits, spaces, minus, plus, opening or closing brackets (in any order - NB regex is greedy by default, so it will match as many of these characters next to each other as possible)
that must be preceded by a space (?<=[ ]) - NB the space in the positive look-behind is not captured, and therefore this makes sure that there are no leading spaces in the results
and is either at the end of the string $, or | is followed by a space then a word character (?=[ ]\w) (NB this lookahead is not captured)
This pattern will get rid of the space but not match your third case with spaces:
/([^A-Z|^"|^\s]{6,})/i
This is what I came to in the end in case it helps somebody
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
That gives me an array of
["+442031234567", "02031234567"]
I'm sure there is a more elegant way of doing this and possibly you'd want to check the numbers for likelihood of being phonelike - e.g. using the brilliant Phony gem.
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
real_numbers = numbers.keep_if{|n| Phony.plausible? PhonyRails.normalize_number(n, default_country_code: "GB")}
Which should help exclude serial numbers or the like from being identified as numbers. You'll obviously want to change the country code to something relevant for you.

Resources