So I have a document with 30k+ emails. The problem is that, during the export, random characters were appended to the emails, something like name#email.com2019-10-10T0545152019-10-10T054515f or name#email.com00000000000700392019-11-28T070033f
My question is: how do I remove everything after ".com" or ".fr" in all the cells?
You could try using REGEXREPLACE, capturing the suffix so that it is kept and only the characters after it are removed:
=REGEXREPLACE(A1,"(\.com|\.fr).*", "$1")
Try
=REGEXEXTRACT(A1,".+\.com|.+\.fr")
Building on the other answers, you can take all the emails in column A and use a regular expression to extract the clean values. With ARRAYFORMULA you can do it in a single formula:
=ARRAYFORMULA(IF(A:A<>""; REGEXEXTRACT(A:A; ".+\.(?:com|fr)"); ""))
Rundown
ARRAYFORMULA applies the formula to the entire column
REGEXEXTRACT extracts part of the string using regular expressions
IF is a conditional. Here it skips empty cells, which prevents an error.
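If you want to sanity-check the pattern outside Sheets, here is a minimal Python sketch of the same logic. It is only an illustration: Python's re module is not the engine Sheets uses, although this particular pattern should behave the same in both, and the name clean_column is made up for the example.

```python
import re

# Same pattern as in the formula: everything up to and including ".com" or ".fr".
EMAIL_PREFIX = re.compile(r".+\.(?:com|fr)")

def clean_column(cells):
    """Mimic ARRAYFORMULA(IF(A:A<>""; REGEXEXTRACT(...); "")) on a list of strings."""
    cleaned = []
    for cell in cells:
        if cell == "":
            # The IF branch: leave empty cells empty instead of producing an error.
            cleaned.append("")
            continue
        match = EMAIL_PREFIX.search(cell)
        # REGEXEXTRACT gives #N/A when nothing matches; None stands in for that here.
        cleaned.append(match.group(0) if match else None)
    return cleaned

column_a = [
    "name#email.com2019-10-10T0545152019-10-10T054515f",
    "",
    "name#email.com00000000000700392019-11-28T070033f",
]
print(clean_column(column_a))  # ['name#email.com', '', 'name#email.com']
```

Because .+ is greedy, the match runs to the last ".com" or ".fr" in the cell, which is fine as long as the appended junk never contains one of those suffixes.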
References
ARRAYFORMULA (Docs Editor Help)
REGEXEXTRACT (Docs Editor Help)
IF (Docs Editor Help)
Supposing your raw-data email list is in A2:A, try this in row 2 of an otherwise empty column (e.g., B2):
=ArrayFormula(IF(A2:A="",,REGEXEXTRACT(A2:A,"^.+\.\D+")))
In plain English, this means "Extract everything up to the last dot found that is followed by some number of non-digits."
This should pull up to any suffix (e.g., .com, .co, .biz, .org, .ma.gov, etc.).
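To see what that pattern keeps and what it drops, here is a small Python sketch. The second and third inputs are hypothetical values I made up, not data from the question, and extract_address is just a helper name for the example.

```python
import re

# Same pattern as in the formula: greedy up to the last dot that is
# followed by at least one non-digit, then all consecutive non-digits.
SUFFIX_PATTERN = re.compile(r"^.+\.\D+")

def extract_address(raw):
    match = SUFFIX_PATTERN.search(raw)
    return match.group(0) if match else None

samples = [
    "name#email.com2019-10-10T0545152019-10-10T054515f",   # from the question
    "name#email.co.uk00000000000700392019-11-28T070033f",  # hypothetical multi-part suffix
    "name#email.com-2019",                                 # hypothetical junk starting with a non-digit
]
for raw in samples:
    print(extract_address(raw))
```

The last sample comes back as name#email.com- because the hyphen is a non-digit too, so this approach relies on the appended junk starting with a digit, as it does in the examples from the question.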
I'm scraping this site, specifically the content of the tables inside the div tags with class containing 'ranking-data'. So for the first td that would be:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[1]/text()
This works fine for all columns in all tables (with the needed modifications) except for a cell in column 2 that contains an i tag: in Google Spreadsheets it adds an extra blank cell below the cell with the text itself. I first tried to scrape it with:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()
Then I tried something like *[not(i[contains(@class,'info-circle')])]/text() after the td[2], and some other variants, but that doesn't work either.
How can I avoid this i tag?
try:
=QUERY(IMPORTXML(A1, "//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()"), "where Col1 <>' '", )
The answer given by @player0 works for my case, and since it was the first answer I won't remove the "accepted" mark from it; but I'm stubborn and I've found an alternative with just XPath (which may be useful for other cases). It was as simple as adding [1] at the end of my first query:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()[1]
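For anyone wondering why the extra blank cell shows up at all, here is a rough Python sketch. The cell markup is my guess (the real page's HTML is not shown in the question), and ElementTree is only used to make the two text nodes visible; it is not what IMPORTXML runs.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell markup: a value, then an <i> icon, then trailing whitespace.
cell = ET.fromstring('<td>42<i class="fa fa-info-circle"></i>\n</td>')

# td/text() selects every text node directly inside the td.
# With an <i> in the middle there are two of them:
first_text = cell.text             # text before the <i> element
second_text = cell.find("i").tail  # whitespace after the <i> element

print(repr(first_text))
print(repr(second_text))
```

Under that assumption, text() returns both nodes (hence the blank cell) and text()[1] keeps only the first one.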
emails
vera#mail.com
estebangarrido#mail.c
hurtado#mail com
jmariano2mail.com
How can I write a function that corrects all domains to #mail.com? I know I could use =RIGHT(,9), but it does not work for the last error.
Try the formula below:
=ArrayFormula(IF(A2:A="",,QUERY(SPLIT(SUBSTITUTE(SUBSTITUTE(A2:A,"mail","|"),"#",""),"|"),"select Col1",0)&"#mail.com"))
This should also work.
=INDEX(IF(LEN(A2:A),QUERY(SPLIT(SUBSTITUTE(SUBSTITUTE(A2:A,"mail","|"),"#",""),"|"),"select Col1")&"#mail.com",""))
Answer
The following formula should produce the results you desire. It assumes that the data you provide is in cells A2:A5 of your spreadsheet. If this is not the case, adjust the A2:A5 portion of the formula appropriately.
=ARRAYFORMULA(REGEXREPLACE(A2:A5,"[#|2].*","#mail.com"))
Explanation
This formula uses REGEXREPLACE to get rid of all rogue characters and replace them with #mail.com. The first argument of REGEXREPLACE is the string to be evaluated. In this case, that is the range from A2 through A5. The second argument is which characters to look for. In this case that is all characters (done using .*) that follow either an at-sign or a numeral two (done using [#|2]). The third argument is which new string to replace the found characters with. In this case that is #mail.com, the correct domain without typos.
The REGEXREPLACE is wrapped in =ARRAYFORMULA because normally REGEXREPLACE can only be used with a single cell rather than a range of cells.
Please note that this solution relies on the assumption you stated that "Everything before # or 2 is correct."
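As a quick check of what the pattern does to the four sample rows, here is a minimal Python sketch. The function name fix_domain and the extra address at the end are made up for illustration.

```python
import re

def fix_domain(address):
    # Same pattern as in the formula: from the first "#", "|" or "2"
    # to the end of the string. Inside [...] the "|" is a literal
    # character, not an "or".
    return re.sub(r"[#|2].*", "#mail.com", address)

emails = ["vera#mail.com", "estebangarrido#mail.c", "hurtado#mail com", "jmariano2mail.com"]
fixed = [fix_domain(e) for e in emails]
print(fixed)

# Hypothetical address with a "2" in the name part:
print(fix_domain("anna2020#mail.com"))
```

All four sample rows come out with the #mail.com domain. The hypothetical last address is cut at its first "2" and becomes anna#mail.com, which is why the stated assumption matters.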
A user pastes in a value to see if there is a full or partial match. I need to do a vlookup and keep removing characters until there is a match. A full match of something like test1.test2.test3 is no problem because it's a full match to my list. But if someone pastes in something like test1.test2.test3.test4, I need to remove a character one at a time from the end until there is a match. So in this example, it would match test1.test2.test3 and return that result.
Conceptually I see this as a for loop that counts the characters using len, using left to remove the number of characters from the end based on the current iteration, and doing vlookups until returning the value when true. But I'm not sure how to do this in Google Sheets.
This formula will give you the matching value that was found in the data (e.g., test1.test2.test3):
=FILTER([column_with_data], REGEXMATCH([cell_with_pasted_value_to_look], [column_with_data]))
This formula will give you the matching data and the cell reference where it was found (e.g., test1.test2.test3 # $A$4):
=FILTER([column_with_data], REGEXMATCH([cell_with_pasted_value_to_look], [column_with_data]))&" # "&CELL("address",INDEX([column_with_data],MATCH(FILTER([column_with_data], REGEXMATCH([cell_with_pasted_value_to_look], [column_with_data])),[column_with_data],0),1))
Simply copy & paste either of the above formulas next to the cell where users paste the value to look up. Then replace the two references in square brackets [ ] with the proper coordinates in your sheet:
replace [column_with_data] with the coordinates of the column containing all the stored data (e.g., A1:A)
replace [cell_with_pasted_value_to_look] with the absolute ($col$row) coordinates of the cell where users paste the value to look up (e.g., $B$1)
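One thing worth knowing about this approach: each stored value is used as a regular expression that is searched for anywhere in the pasted value. Here is a small Python sketch of that behaviour, next to a stricter literal-prefix variant. The function names and the sample data are made up for the example.

```python
import re

def filter_regexmatch(pasted, column):
    # Mirrors FILTER(column, REGEXMATCH(pasted, column)):
    # every stored value is treated as a pattern searched for inside `pasted`.
    return [stored for stored in column if re.search(stored, pasted)]

def filter_prefix(pasted, column):
    # Stricter variant: the stored value must be a literal prefix of `pasted`.
    return [stored for stored in column if pasted.startswith(stored)]

column = ["test1.test2.test3", "test2", "unrelated"]

print(filter_regexmatch("test1.test2.test3.test4", column))
print(filter_prefix("test1.test2.test3.test4", column))

# A "." in a stored value acts as a wildcard in the regex version:
print(filter_regexmatch("test1-test2-test3", column))
print(filter_prefix("test1-test2-test3", column))
```

So the regex version can return more than one row (here "test2" matches as a substring), and dots in the stored values match any character. Whether that matters depends on your data.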
Would it be a problem to download the data from Google sheets, transform the file type to use the for loop in another software, and re-upload? I think your idea for a for loop would work.
It might be quicker if this is a long term project, but not so great if the client is continually monitoring/uploading.
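If you do go the external route, the loop could look roughly like this in Python. This is a sketch under the assumption that the stored values are already loaded into a list; longest_prefix_match and the sample data are hypothetical.

```python
def longest_prefix_match(pasted, known_values):
    """Return the longest prefix of `pasted` found in `known_values`, or None."""
    known = set(known_values)
    # Try the full string first, then drop one character from the end each time.
    for length in range(len(pasted), 0, -1):
        candidate = pasted[:length]
        if candidate in known:
            return candidate
    return None

stored = ["test1.test2", "test1.test2.test3", "other.value"]
print(longest_prefix_match("test1.test2.test3.test4", stored))  # test1.test2.test3
```

Starting from the full string means an exact match is found on the first iteration, and a partial match always returns the longest stored prefix.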
I have the following sheet where I need to retrieve only the duplicates based on column K in this example. Please bear in mind that I actually have over 10k rows of data and I need to retrieve them from a different spreadsheet, but I could use some help with the formula.
Thank you.
This formula should work for you:
=ArrayFormula({J1:L1; FILTER(J2:L,J2:J<>"",COUNTIF(K2:K,K2:K)>1)})
The curly brackets { } allow us to build a virtual array.
J1:L1 will place your original headers at the top.
The semicolon means "move down to the next row" (i.e., place the results underneath the headers).
FILTER keeps only entries where Col J is not blank and where the COUNTIF from Col K is more than 1 (i.e., where there are duplicates).
If the formula does not work, you are likely in an international locale that uses semicolons as argument separators. In that case, use this version of the formula:
=ArrayFormula({J1:L1; FILTER(J2:L;J2:J<>"";COUNTIF(K2:K;K2:K)>1)})
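If it helps to see the FILTER/COUNTIF logic spelled out, here is a minimal Python sketch of the same idea. The column titles and rows are made up, and one difference to keep in mind is that COUNTIF ignores case while this sketch does not.

```python
from collections import Counter

def duplicate_rows(header, rows, name_col=0, key_col=1):
    """Return the header plus every row whose key occurs more than once."""
    # COUNTIF(K2:K, K2:K): how often each key value appears in the key column.
    key_counts = Counter(row[key_col] for row in rows)
    kept = [
        row for row in rows
        if row[name_col] != "" and key_counts[row[key_col]] > 1
    ]
    # {J1:L1; FILTER(...)}: stack the header row on top of the filtered rows.
    return [header] + kept

header = ("Name", "Key", "Value")  # hypothetical column titles
rows = [
    ("a", "x", 1),
    ("b", "y", 2),
    ("c", "x", 3),
]
result = duplicate_rows(header, rows)
print(result)
```

Only the two rows sharing the key "x" are kept, with the header on top.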
I have a column XXX like this :
XXX
A
Aruin
Avolyn
B
Batracia
Buna
...
I would like to count a cell only if the string in the cell has a length > 1.
How to do that?
I'm trying :
COUNTIF(XXX1:XXX30, LEN(...) > 1)
But what should I write instead of ... ?
Thank you in advance.
For ranges that contain strings, I have used a formula like below, which counts any value that starts with one character (the ?) followed by 0 or more characters (the *). I haven't tested on ranges that contain numbers.
=COUNTIF(range,"=?*")
To do this in one cell, without needing to create a separate column or use ARRAYFORMULA, you can use SUMPRODUCT.
=SUMPRODUCT(LEN(XXX1:XXX30)>1)
If you have an array of True/False values then you can use -- to force them to be converted to numeric values like this:
=SUMPRODUCT(--(LEN(XXX1:XXX30)>1))
Credit to @greg, who posted this in the comments. I think it is arguably the best answer and should be displayed as such. SUMPRODUCT is a powerful function that can often be used to get around shortcomings in COUNTIF-type formulas.
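The idea behind the -- trick (turn each TRUE/FALSE into 1/0, then add them up) can be sketched in Python with the sample column from the question:

```python
values = ["A", "Aruin", "Avolyn", "B", "Batracia", "Buna"]

# LEN(range)>1 gives one True/False per cell.
flags = [len(v) > 1 for v in values]

# "--" coerces each flag to 1 or 0; SUMPRODUCT then adds them up.
count = sum(int(flag) for flag in flags)
print(count)  # 4
```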
Create another list using an =ARRAYFORMULA(len(XXX1:XXX30)>1) and then do a COUNTIF based on that new list: =countif(XXY1:XXY30,true()).
A simple formula that works for my needs is =ROWS(FILTER(range,LEN(range)>X))
The Google Sheets criteria syntax seems inconsistent, because the expression that works fine with FILTER() gives an erroneous zero result with COUNTIF().
Here's a demo worksheet
Another approach is to use the QUERY function.
This way you can write a simple SQL like statement to achieve this.
For example:
=QUERY(XXX1:XXX30,"SELECT COUNT(X) WHERE X MATCHES '.{1,}'")
To explain the MATCHES criteria:
It is a regex that matches every cell that contains 1 or more characters.
The . operator matches any character.
The {1,} specifies that you only want to match cells that have 1 or more characters in them. To count only strings longer than one character, as the question asks, use {2,} instead.
Here is a link to another SO question that describes this method.
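To compare the two quantifiers on the sample column, here is a short Python sketch. It assumes MATCHES has to match the whole cell, which is my understanding of the query language, so re.fullmatch is used; count_matching is a made-up helper name.

```python
import re

values = ["A", "Aruin", "Avolyn", "B", "Batracia", "Buna"]

def count_matching(pattern, cells):
    # Count the cells whose entire content matches the pattern.
    return sum(1 for cell in cells if re.fullmatch(pattern, cell))

print(count_matching(r".{1,}", values))  # 6: every non-empty cell
print(count_matching(r".{2,}", values))  # 4: only cells longer than one character
```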