Find a time in some text, allowing for multiple formats - google-sheets

I have the following formula.
=INDEX(Lookups!$L$1:$L$726,MAX(IF(ISERROR(FIND(Lookups!$L$1:$L$726,$A1)),-1,1)*(ROW(Lookups!$L$1:$L$726)-ROW(Lookups!$L$1)+1)))
The idea is to pick up the time for a certain item from an email (already parsed into Google Sheets). The emails come in various formats, so I'm unable to specify the location in the text string to look at specifically.
The times are not always written in a conventional time format either so as you can see from the formula there are 726 possibilities that I work with. For example, sometimes the time could be written as 13:15 and others as 1:15 or even 1.15 or 1-15 etc etc.
The issue I have is that the above formula seems to start with the smallest string possible and work 'upwards', therefore picking up 3:15 from the email string rather than the full time string, which is 13:15. Is there a way I can amend the formula to search for the longest string first, in that example looking for 13:15 and then only searching for 3:15 if the prior is not found?
Hope that makes sense. Thanks in advance for any assistance.

One way is to reorder those 726 possibilities so that you have the longer ones first. You can do it by creating another column with =len(L1), copying that formula down, and sorting the range by this new column in descending order.
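The longest-first idea can also be sketched outside of Sheets. The Python snippet below is only an illustration: the candidate list is a made-up handful of the 726 lookup strings, and the function name find_longest_first is hypothetical.

```python
# Hypothetical subset of the lookup strings (the real list has 726 entries).
CANDIDATES = ["3:15", "13:15", "1:15", "1.15", "1-15"]


def find_longest_first(text, candidates):
    """Return the longest candidate that occurs in text, or None."""
    # Sorting by length in descending order mirrors the helper column
    # with =len(L1) sorted from longest to shortest.
    for candidate in sorted(candidates, key=len, reverse=True):
        if candidate in text:
            return candidate
    return None


print(find_longest_first("Pickup is at 13:15 tomorrow", CANDIDATES))  # 13:15
print(find_longest_first("Pickup is at 3:15 tomorrow", CANDIDATES))   # 3:15
```

Because the longer strings are tried first, 3:15 is only returned when 13:15 is not present in the text.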
But it would be easier to use regexextract instead, because regular expressions are designed to solve the problem you are facing. For example,
=regexextract(L1, "\b\d{1,2}[:.-]\d{1,2}\b")
picks up all of the variants 1:15, 13:15, 1-15 or 13.15. (It looks for the following sequence: word boundary, 1-2 digits, one of characters :, ., -, then 1-2 digits, and another word boundary.) The match is greedy, so it will find 13:15 when it's there, not just 3:15.
A more complex form
=regexextract(L1, "(?i)\b\d{1,2}[:.-]\d{1,2} ?(?:am|pm)?\b")
also supports "am" or "pm" after the time, case-insensitive and possibly separated by a space from the digits.
This can be refined further, for example the hours part would be more precisely stated as [0-2]?\d instead of \d{1,2}, and the minutes part as [0-5]?\d (so that values such as 65 are not accepted as minutes).
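For experimenting with these patterns outside of Sheets, here is a rough Python sketch. It assumes Python's re module treats these particular patterns the same way Sheets does, which should hold for the simple constructs used here but is not guaranteed in general. The helper name first_time and the sample sentences are made up, and the refined pattern restricts minutes to 00-59.

```python
import re

# Basic pattern from the answer: 1-2 digits, a separator, 1-2 digits.
BASIC = re.compile(r"\b\d{1,2}[:.-]\d{1,2}\b")

# Same pattern with an optional am/pm suffix, case-insensitive.
WITH_AMPM = re.compile(r"(?i)\b\d{1,2}[:.-]\d{1,2} ?(?:am|pm)?\b")

# Tighter variant: hours as [0-2]?\d and minutes as [0-5]?\d (00-59).
REFINED = re.compile(r"\b[0-2]?\d[:.-][0-5]?\d\b")


def first_time(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


print(first_time(BASIC, "Arrive 13:15 at gate"))        # 13:15
print(first_time(BASIC, "Arrive 1-15 at gate"))         # 1-15
print(first_time(WITH_AMPM, "Call at 1:15 PM sharp"))   # 1:15 PM
print(first_time(REFINED, "Code 99:99 then 23:59"))     # 23:59
```

One thing to watch in Python at least: when no am/pm follows the digits, the optional space in the second pattern can end up inside the match, so trimming the result may be worthwhile.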

Related

Why are hyphens screwing up my duplicate finding conditional formatting?

I have a sheet I use as a database of scientific papers. I copy journal article titles from different sources (some could be from an email, others are links on a web page, or just the title from the article page). I have conditional formatting set to let me know if I'm adding a title that is already in the list. I've noticed that there are some titles that are "ignoring" the conditional formatting, and it looks like there are hyphens in all of the offenders. If I remove the hyphens, the conditional formatting works. So there is some 'difference' in the hyphens originating from the same title that is preventing the conditional formatting from viewing them as identical.
Shared sheet
Examples of offending titles:
End-to-end continuous bioprocessing: impact on facility design, cost of goods and cost of development for monoclonal antibodies
End‐to‐end continuous bioprocessing: impact on facility design, cost of goods and cost of development for monoclonal antibodies
End‐to‐end continuous bioprocessing: Impact on facility design, cost of goods, and cost of development for monoclonal antibodies
What is this difference, and is there a way to fix it? Do I need to write a script to find/replace the hyphens to get this to work?
TIA
Just because characters appear identical, it does not mean that they are identical. You have fallen foul of the similarity between the hyphen and dashes. Visually, they are almost identical - dashes are slightly wider than the hyphen.
Dashes are regarded as "special characters" (i.e. they aren't keys on the keyboard) but they are used widely in HTML. So if, for instance, you copied an item from a website then you might unwittingly have copied dashes rather than hyphens.
You can identify the exact nature of a character by using the CODE function.
You ask "What is this difference, and is there a way to fix it? Do I need to write a script to find/replace the hyphens to get this to work?"
WHAT IS THIS DIFFERENCE?
It's important to recognise that though these examples appear identical, there are other differences that are more than just hyphens vs dashes.
Example#1 - Hyphen - CODE returns "45"
Example#2 - Dash - CODE returns "8208"
Example#3 - Dash - CODE returns "8208".
But for Example#3, other factors also prevent the conditional formatting rule from triggering:
Length = 128 (vs 127 for the other examples). There is an additional comma (after "cost of goods")
the word "Impact" is spelled with an upper case "I" (lower case for the other examples)
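The code values above can be cross-checked outside of Sheets. As a small sketch, Python's ord plays the role of CODE here; the shortened titles are just the first few words of the examples from the question.

```python
# The two characters look alike but have different code points.
ascii_hyphen = "-"          # typed from the keyboard
unicode_hyphen = "\u2010"   # the character with code 8208

print(ord(ascii_hyphen))    # 45
print(ord(unicode_hyphen))  # 8208

title_a = "End-to-end continuous bioprocessing"
title_b = "End\u2010to\u2010end continuous bioprocessing"

# Visually the same, but not equal as strings.
print(title_a == title_b)   # False

# List the positions and code points where the titles differ.
differences = [
    (i, ord(a), ord(b))
    for i, (a, b) in enumerate(zip(title_a, title_b))
    if a != b
]
print(differences)          # [(3, 45, 8208), (6, 45, 8208)]
```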
MOVING FORWARD
Do you need a script? No (IMHO)
Is there a way to fix it? As outlined above, there are more differences than just hyphens and dashes. And, as time goes by, the number & type of differences might increase. However, there is a solution to the "Hyphen Vs Dash" problem which is the focus of this question.
FORMULA AND FORMATTING
Your data is currently in Column A and Column A is also subject to conditional formatting.
Remove the conditional formatting rules from Column A
Insert this formula in cell B2
=arrayformula(if(LEN($A2:A)-LEN(SUBSTITUTE($A2:A, char(8208), ""))=0,A2:A,arrayformula(substitute(A2:A,char(8208),char(45)))))
Conditional Formatting for Column B
select the range in Column B
select, Format, Conditional Formatting.
Select "Custom Formula is" and enter this formula: =countif($B$2:$B2,B2)>1
Select a preferred Formatting Style and then click Done.
FORMULA LOGIC
arrayformula enables the formula to automatically populate all the relevant cells in the column.
LEN($A2:A)-LEN(SUBSTITUTE($A2:A, char(8208), ""))=0
a test for dashes in a string. It substitutes an empty string for any/all instances of a dash (char(8208)), then subtracts the adjusted length from the original length. If the difference is zero, then there are no dashes in the string.
IF: Test for any dashes,
if the string doesn't contain any dashes then use that value
else, the string must contain dashes, so replace any dashes with hyphens, and use the substituted value
arrayformula(substitute(A2:A,char(8208),char(45)))
The conditional formatting rule then looks for duplicate values in the column, and formats any/all duplicate values.
You'll note that Example#3 is not flagged as a duplicate despite containing dashes. This is because of the spelling of "Impact" and the extra comma after "cost of goods".
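The same normalise-then-count logic can be sketched in Python. The function names normalize and flag_duplicates are made up, the titles are shortened versions of the examples, and the comparison in this sketch is case-sensitive, which may not match how COUNTIF treats upper and lower case.

```python
def normalize(title):
    """Replace the character with code 8208 by the ASCII hyphen (code 45)."""
    return title.replace(chr(8208), chr(45))


def flag_duplicates(titles):
    """Return True for each title that already appeared earlier in the list."""
    seen = set()
    flags = []
    for title in titles:
        key = normalize(title)
        # Same idea as countif($B$2:$B2,B2)>1: flag second and later copies.
        flags.append(key in seen)
        seen.add(key)
    return flags


titles = [
    "End-to-end continuous bioprocessing: impact on facility design",
    "End\u2010to\u2010end continuous bioprocessing: impact on facility design",
    "End\u2010to\u2010end continuous bioprocessing: Impact on facility design",
]
print(flag_duplicates(titles))  # [False, True, False]
```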

Updating hours with text and dashes in Google Sheets

I am trying to figure out how to add 8 hours to a schedule formatted with text in front and a dash between the times. Please see the example below:
Training 01:45AM - 02:45AM
Convert that time to +8 hours.
Training 09:45AM - 10:45AM.
I just can't seem to figure out the best formula to use in this scenario.
Supposing the raw data string from your post were in A2, this should work:
=REGEXREPLACE(REGEXREPLACE(A2,REGEXEXTRACT(A2,"(\d.+M) -"),TEXT(VALUE(REGEXEXTRACT(A2,"(\d.+M) -"))+TIME(8,0,0),"hh:mmAM/PM")),REGEXEXTRACT(A2,"- (\d.+M)"),TEXT(VALUE(REGEXEXTRACT(A2,"- (\d.+M)"))+TIME(8,0,0),"hh:mmAM/PM"))
In short, this formula uses REGEXEXTRACT to pull each time from the string, convert it to a value, add eight hours, convert it back to TEXT and finally reinsert that transformed substring back into the original string with REGEXREPLACE. Because this happens twice, you'll see two such setups, one wrapped within the other.
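The extract, convert, add, and reinsert steps can be illustrated in Python as a rough sketch. The pattern and the function name shift_times are assumptions for the example, and the AM/PM parsing assumes the default English locale.

```python
import re
from datetime import datetime, timedelta

# Matches times such as 01:45AM or 10:45PM.
TIME_PATTERN = re.compile(r"\d{1,2}:\d{2}(?:AM|PM)")


def shift_times(text, hours):
    """Add `hours` to every hh:mmAM/PM time found in text."""
    def shift(match):
        # Convert the substring to a time value, add the offset,
        # then format it back to text.
        parsed = datetime.strptime(match.group(0), "%I:%M%p")
        shifted = parsed + timedelta(hours=hours)
        return shifted.strftime("%I:%M%p")

    # Reinsert each transformed time in place of the original.
    return TIME_PATTERN.sub(shift, text)


print(shift_times("Training 01:45AM - 02:45AM", 8))
# Training 09:45AM - 10:45AM
```

Unlike the nested REGEXREPLACE, this handles any number of times in the string in one pass.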

Why is this Google Sheets Concatenate Formula giving me weird results?

I'm trying to use Google Sheets to concatenate a bit of data. It works 90% of the time, however on certain numbers, I get an odd result. I have to copy the result of this data and paste it into a financial program in a specific format and am using the concatenate formula to do this. The format the program requires is that each field be separated by one period, even if it is a dollar amount as the program will automatically move the decimal point two places to the left while it is evaluating the information. The issue is that on some numbers the formula adds two periods between the fields, which stops the evaluation of the data in our financial program.
Here is a screenshot including the formula
You can see that it works with most numbers in the amount column, but with two of the amounts and several others, it adds two periods after the amount.
Would you please take a look at this and see if you can help me find the issue?
Thank you!!!!
This looks like a floating-point calculation error in Google Sheets: for certain numbers, the multiplication by 100 does not return an exact integer but a value with a very small extra fractional part. That's why there's an additional period in the result.
As a workaround, use ROUND() upon multiplying by 100 to "snap" it to an integer.
References:
Floating Point Calculation Error
Just use:
=B2&"."&ROUND(C2*100)&"."&D2
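The same kind of floating-point artifact is easy to reproduce in Python, which also uses binary floating point. This is only an illustration of the effect and of the rounding fix; the amount 1.1 and the field values are made up, and Sheets may not produce the artifact for exactly the same inputs.

```python
amount = 1.1

# Binary floating point cannot represent 1.1 exactly, so the product
# is slightly off from the integer 110.
raw = amount * 100
print(raw)            # 110.00000000000001
print(raw == 110)     # False

# Rounding snaps the product back to the intended integer.
cents = round(raw)
print(cents)          # 110

# Hypothetical field values standing in for columns B, C and D.
record = "ACME" + "." + str(cents) + "." + "2024"
print(record)         # ACME.110.2024
```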

Google Sheets - Split Data

I have these data in Google Sheets
$71,675_x000d_
$80,356_x000d_
$107,361_x000d_
$123,393_x000d_
$116,878
I want them to be split into different columns.
However, when I do so using Data > Split Data into Different Columns, it separates $71 and 675_x000d_, but I need $71,675 with the _x000d_ removed.
Please note that the last number doesn't have those extra characters.
Please help.
Your post says you want to "remove the x000d" (that is, extract only the dollar amounts). That said, let's say your raw data starts in A2 (i.e., the data is in A2:A). Place the following formula into the first cell of another otherwise empty column (e.g., B1):
=ArrayFormula({"Extracted";IF(A2:A="",,IFERROR(REGEXEXTRACT(SUBSTITUTE(A2:A&"_",",",""),"\d+")))})
How It Works:
ArrayFormula(...) signifies that we'll be processing an entire range and not just one cell.
The outer curly brackets {...} signify that a virtual array will be formed from non-like or non-contiguous pieces.
The first piece of the virtual array is the header. Here, that is "Extracted"; but you can change it as you like.
The semicolon means "place the next information below the previous part."
IF(A2:A="",, ...) is a standard check that basically says "Don't try to process any blank cells in Column A"; or alternatively worded, "If any cell in A2:A is blank/null, do nothing."
Skipping the REGEXEXTRACT for now, A2:A&"_" appends an underscore to every entry in A2:A. This allows entries in A2:A that are just a dollar amount (e.g., from the post, $116,878) to have a consistent symbol following them if not already there. (And adding the underscore to anything that already has an underscore won't matter, because we won't be extracting that far out.)
Now that we've got the new strings, we SUBSTITUTE every comma for a null (i.e., delete all commas).
Finally, REGEXEXTRACT will take all of the virtually modified strings and extract \d+, which means only digits (\d) in an unbroken sequence of any length greater than 0 (+). Note that REGEXEXTRACT will only return the first such match it encounters as written, so 000 will not be extracted.
An IFERROR wrap is placed around the REGEXEXTRACT, just in case you have any situations in real life that don't have any sequence of numbers at all. In these cases, nothing will be returned (whereas, without the IFERROR, an error would have been returned).
Once the extraction is done, you can apply Format > Number > Currency (rounded) to the entire column.
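For anyone who wants to test the logic outside of Sheets, the steps above map onto a short Python sketch. The function name extract_amount is made up, and converting the result to an integer is an addition of this sketch (REGEXEXTRACT itself returns text).

```python
import re


def extract_amount(cell):
    """Return the first run of digits after removing commas, or None."""
    if cell == "":
        return None  # skip blank cells, like IF(A2:A="",, ...)
    # Append an underscore, then delete all commas.
    cleaned = (cell + "_").replace(",", "")
    # \d+ matches the first unbroken run of digits.
    match = re.search(r"\d+", cleaned)
    # Returning None when nothing matches plays the role of IFERROR.
    return int(match.group(0)) if match else None


cells = ["$71,675_x000d_", "$80,356_x000d_", "$116,878", "", "n/a"]
print([extract_amount(c) for c in cells])
# [71675, 80356, 116878, None, None]
```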
Addendum:
After an additional comment (below), it appears that the raw data is in Column T, that all five entries are in one cell and that the OP would like all five amounts extracted across each row. That being the case, assuming that Columns U:Y are empty to start, place the following in cell U1 (not U2):
=ArrayFormula({"Val1","Val2","Val3","Val4","Val5";IF(T2:T="",,IFERROR(REGEXEXTRACT(SUBSTITUTE(T2:T&"_",",",""),REPT("\$(\d+)[^\$]*",5))))})
This works much the same way as the previous formula. The differences:
There are five headers now.
You'll see REPT(...,5) here. This is an easy way to repeat the same extraction five times.
That repeated extraction is now the following:
\$(\d+)[^\$]*
The backslash in front of the dollar signs means to treat those symbols as literals instead of as their usual meaning (i.e., end-of-string). So the extraction reads as follows:
\$ anything that starts with a dollar sign
(\d+) extract what is between the ( ), which is any group of digits
[^\$]* followed by any number (including 0) of characters that are not dollar signs
As I said, the REPT will repeat this five times; so five groups matching this pattern will be extracted.
Understand that if you have any groups that don't follow the pattern exactly, resulting in five matching extractions, nothing will be returned.
Be sure to format U:Y as currency rounded, or you will wind up with some of those numbers translating as raw dates and therefore being completely off.
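The repeated-pattern trick can be tried out in Python, where multiplying a string by 5 stands in for REPT(...,5). This is a sketch under the assumption that all five amounts sit in one cell separated by line breaks; the function name extract_five is made up.

```python
import re

# One amount: a literal dollar sign, a captured run of digits, then any
# characters that are not dollar signs.
ONE_AMOUNT = r"\$(\d+)[^$]*"

# Repeating the string five times plays the role of REPT(..., 5).
FIVE_AMOUNTS = re.compile(ONE_AMOUNT * 5)


def extract_five(cell):
    """Return the five captured amounts, or None if the pattern does not fit."""
    cleaned = (cell + "_").replace(",", "")
    match = FIVE_AMOUNTS.search(cleaned)
    return match.groups() if match else None


cell = "$71,675_x000d_\n$80,356_x000d_\n$107,361_x000d_\n$123,393_x000d_\n$116,878"
print(extract_five(cell))
# ('71675', '80356', '107361', '123393', '116878')
print(extract_five("$1,000_x000d_\n$2,000"))
# None
```

The second call shows the all-or-nothing behavior: with fewer than five amounts, nothing is returned.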
Please use the following formula and format cells to your needs.
=ArrayFormula(IFERROR(SPLIT(REGEXREPLACE(A2:A,"\n|_x000d_","√"),"√")))
The big advantage of the above formula compared to others is that it works for any number of lines included within a single cell.
Functions used:
ArrayFormula
IFERROR
SPLIT
REGEXREPLACE
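As a rough Python equivalent of this replace-then-split approach (the function name split_cell is made up, and re.split does both steps at once instead of going through a placeholder character):

```python
import re


def split_cell(cell):
    """Split on newlines or the _x000d_ marker and drop empty pieces."""
    parts = re.split(r"\n|_x000d_", cell)
    # Adjacent delimiters produce empty strings, which are discarded here.
    return [part for part in parts if part]


print(split_cell("$71,675_x000d_\n$80,356_x000d_\n$116,878"))
# ['$71,675', '$80,356', '$116,878']
```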
You can use the SPLIT function:
=ArrayFormula(IF(LEN(A:A),SPLIT(A:A,"_x000d_",FALSE),""))

This regex matches in BBEdit and regex.com, but not on iOS - why?

I am trying to "highlight" references to law statutes in some text I'm displaying. These references are of the form <number>-<number>-<number>(char)(char), where:
"number" may be whole numbers 18 or decimal numbers 12.5;
the parenthetical terms are entirely optional: zero or one or more;
if a parenthetical term does exist, there may or may not be a space between the last number and the first parenthesis, as in 18-1.3-401(8)(g) or 18-3-402 (2).
I am using the regex
((\d+(\.\d+)*-){2}(\d+(\.\d+)*))( ?(\([0-9a-zA-Z]+\))*)
to find the ranges of these strings and then highlight them in my text. This expression works perfectly, 100% of the time, in all of the cases I've tried (dozens), in BBEdit, and on regex101.com and regexr.com.
However, when I use that exact same expression in my code, on iOS 12.2, it is extremely hit-or-miss as to whether a string matching the regex is actually found. So hit-or-miss, in fact, that a string of the exact same form of two other matches in a specific bit of text is NOT found. E.g., in this one paragraph I have, there are five instances of xxx-x-xxx; the first and the last are matched, but the middle three are not matched. This makes no sense to me.
I'm using the String method func range(of:options:range:locale:) with options of .regularExpression (and nil locale) to do the matching. I see that iOS uses ICU-compatible regexes, whereas these other tools use PCRE (I think). But, from what I can tell, my expression should be compatible and valid for my case with the ICU parsing. But, something is definitely different, and I cannot figure out what it is.
Anyone? (I'm going to give NSRegularExpression a go and see if it behaves differently, but I'd still like to figure out what's going on here.)
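As a cross-check outside of iOS, the pattern can be exercised with Python's re module. That is a different engine from ICU, so this says nothing about the iOS behavior itself; it only confirms what the pattern matches on references of the forms described above. The sample sentence is invented for illustration.

```python
import re

STATUTE = re.compile(r"((\d+(\.\d+)*-){2}(\d+(\.\d+)*))( ?(\([0-9a-zA-Z]+\))*)")

# Made-up sample text containing the reference forms described above.
sample = "See 18-1.3-401(8)(g), then 18-3-402 (2), and finally 12.5-4-101."

# Collect every match together with its character range.
matches = [m.group(0) for m in STATUTE.finditer(sample)]
spans = [m.span() for m in STATUTE.finditer(sample)]

print(matches)
# ['18-1.3-401(8)(g)', '18-3-402 (2)', '12.5-4-101']
```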
