I'm scraping this site, specifically the content of the tables inside the div tags with class containing 'ranking-data'. So for the first td that would be:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[1]/text()
This is working fine for all columns in all tables (with the needed modifications) except for a cell in column 2 that contains an i tag: in Google Spreadsheets it adds an extra blank cell below the cell with the text itself. I first tried to scrape it with:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()
Then I tried something like *[not(i[contains(@class,'info-circle')])]/text() after the td[2], and some other variants, but it doesn't work.
How can I avoid this i tag?
Try:
=QUERY(IMPORTXML(A1, "//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()"), "where Col1 <>' '", )
The answer given by @player0 works for my case, and since it was the first answer I won't remove the "accepted" mark from it; but I'm stubborn and I've found an alternative with just XPath (which may be useful for other cases). It was as simple as adding a [1] at the end of my first query:
//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()[1]
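To see why the [1] helps, it may be useful to look at how a cell's text is split around a child element. The snippet below is only an illustration in Python with the standard library, not something Sheets runs. The HTML fragment is a made-up stand-in for the real page, and it assumes the extra blank cell comes from a whitespace-only text node that follows the i tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell: visible text, then an <i> icon, then trailing whitespace.
cell = ET.fromstring('<td>12.5%<i class="fa fa-info-circle"></i>\n</td>')

# XPath's text() returns every text node that is a direct child of <td>.
# In ElementTree those are cell.text plus the tail of each child element.
text_nodes = [cell.text] + [child.tail for child in cell if child.tail is not None]

# text()[1] keeps only the first of those nodes.
first_only = text_nodes[0]

print(text_nodes)
print(first_only)
```

Under that assumption, text() yields two nodes (the value and the whitespace after the icon), which would explain the second, blank cell; picking the first node drops it.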
So I have two rows:
ID | TagDog | TagCat | TagChair | TagArm | Grouped Tags (need help with this)
1  | TRUE   |        |          | TRUE   | TagDog,TagArm
Row 1 consists mainly of Tags, while rows 2+ are entries. This data ties ENTRIES to TAGS.
What I'm needing to do is concatenate/join the tag names per entry. For example, look at the last column above.
I suspect we could write a formula that would:
Create an array of non-empty cells in the row. (IE: [2,4])
Return it with the header row A (IE: [A2,A4])
Then join them together by a comma
But I am unsure how to write the formula, or if this is even the best approach.
Here's the formula:
={
"Grouped Tags (need help with this)";
ARRAYFORMULA(
REGEXREPLACE(TRIM(
TRANSPOSE(QUERY(TRANSPOSE(
IF(NOT(B2:E11),, B1:E1)
),, COLUMNS(B1:E1)))
), "\s+", ",")
)
}
The trick used is called a vertical query smash. This is the relevant part:
TRANSPOSE(QUERY(TRANSPOSE(...),, number_of_columns))
You can find a brief description of this one and its relatives here.
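For readers who want to follow the logic outside of Sheets, here is a rough Python sketch of what the formula does per row: keep the header name where the cell is TRUE, join the row with spaces (the "smash"), trim, then turn each run of whitespace into a comma. The second data row and the function name group_tags are made up for illustration.

```python
import re

headers = ["TagDog", "TagCat", "TagChair", "TagArm"]
rows = [
    [True, False, False, True],   # entry 1 from the question
    [False, True, False, False],  # hypothetical entry 2
]

def group_tags(headers, row):
    # IF(NOT(cell),, header): blank for FALSE/empty, header name for TRUE.
    kept = [name if flag else "" for name, flag in zip(headers, row)]
    # The query smash joins the cells of a row with single spaces.
    smashed = " ".join(kept)
    # TRIM, then REGEXREPLACE(..., "\s+", ",").
    return re.sub(r"\s+", ",", smashed.strip())

grouped = [group_tags(headers, row) for row in rows]
print(grouped)
```

One consequence of this approach is that a tag name containing a space would be split into two entries, since every run of whitespace becomes a comma.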
I wasn't able to create a single formula that would do this for me, so instead, I utilized a formula inside of Sheets' Find/Replace tool, and it worked like a charm!
I did a find/replace, replacing all instances of TRUE with the following formula:
=INDIRECT(SUBSTITUTE(LEFT(ADDRESS(ROW(),COLUMN()),3),"$","")&"$1")
This formula finds the column letter of the cell it sits in, then uses INDIRECT to fetch the value in row 1 of that column.
Breaking down the formula:
ADDRESS(ROW(),COLUMN()) returns the direct reference: $H$1
LEFT("$H$1",3) returns $H$
SUBSTITUTE("$H$","$","") removes the dollar signs ($) and returns H
INDIRECT("H"&"$1") references the exact cell H$1
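The breakdown above can be mimicked in Python to check what reference text the formula builds for a given cell. This is only an emulation of the string steps, not Sheets itself, and the helper names column_letters and header_ref are invented for the sketch.

```python
def column_letters(n):
    """Convert a 1-based column number to A1-style letters (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def header_ref(row, col):
    address = f"${column_letters(col)}${row}"  # ADDRESS(ROW(), COLUMN())
    left3 = address[:3]                         # LEFT(..., 3)
    letters = left3.replace("$", "")            # SUBSTITUTE(..., "$", "")
    return letters + "$1"                       # text handed to INDIRECT

print(header_ref(5, 8))    # column H
print(header_ref(5, 27))   # column AA
print(header_ref(5, 703))  # column AAA
```

Because LEFT takes exactly three characters, the emulation gives the expected reference for one-letter and two-letter columns, but for a three-letter column such as AAA it is cut down to AA$1, so the approach would need adjusting on very wide sheets.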
Now, I can replace all instances of TRUE with that formula and the magic happens!
Here is a video explanation: https://youtu.be/SXXlv4JHDA8
Hopefully, that helps someone -- however, I would still be interested in seeing what the formula is for this solution.
So I have a document with 30k+ emails. The problem is that during the export, random characters appeared after the emails, something like name@email.com2019-10-10T0545152019-10-10T054515f or name@email.com00000000000700392019-11-28T070033f
My question is: how do I remove everything after ".com" or ".fr" in all the cells?
You could try using REGEXREPLACE.
=REGEXREPLACE(A1,"(\.com|\.fr).*", "$1")
Try
=REGEXEXTRACT(A1,".+\.com|.+\.fr")
Working from what other people added, you can get all emails from the column A and use regular expressions to get the values. Using ARRAYFORMULA you can do it in a single formula:
=ARRAYFORMULA(IF(A:A<>""; REGEXEXTRACT(A:A; ".+\.(?:com|fr)"); ""))
Rundown
ARRAYFORMULA applies the formula to the entire column.
REGEXEXTRACT extracts part of the string using regular expressions.
IF is a conditional. In this case it's used to skip empty cells, which prevents an error.
References
ARRAYFORMULA (Docs Editor Help)
REGEXEXTRACT (Docs Editor Help)
IF (Docs Editor Help)
Supposing your raw-data email list were in A2:A, try this in row 2 of an otherwise empty column (e.g., B2):
=ArrayFormula(IF(A2:A="",,REGEXEXTRACT(A2:A,"^.+\.\D+")))
In plain English, this means "Extract everything from the start of the text through the last dot that is followed by one or more non-digits, including those non-digits."
This should pull up to any suffix (e.g., .com, .co, .biz, .org, .ma.gov, etc.).
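If you want to sanity-check these patterns before running them over 30k rows, a quick Python sketch like the one below can help. It uses Python's re module rather than the engine Sheets uses, so treat it as an approximation; the two sample strings are the ones from the question, and the function names are made up.

```python
import re

samples = [
    "name@email.com2019-10-10T0545152019-10-10T054515f",
    "name@email.com00000000000700392019-11-28T070033f",
]

def clean_known_suffix(value):
    # Same idea as REGEXEXTRACT(A1, ".+\.(?:com|fr)")
    m = re.search(r".+\.(?:com|fr)", value)
    return m.group(0) if m else ""

def clean_any_suffix(value):
    # Same idea as REGEXEXTRACT(A2, "^.+\.\D+")
    m = re.search(r"^.+\.\D+", value)
    return m.group(0) if m else ""

for s in samples:
    print(clean_known_suffix(s), clean_any_suffix(s))
```

Both functions return an empty string when nothing matches, which plays the role of the IF guard in the array formulas.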
The following query runs great and uses a space as the separator between search words; however, it only searches one column.
=QUERY(Data!A1:O, "SELECT * WHERE LOWER(N) LIKE LOWER(""%" &JOIN("%"") AND LOWER(N) LIKE LOWER(""%", SPLIT(B1," "))&"%"")",1)
However, I then found this snippet, which searches the entire sheet but cannot separate words within a cell.
=ARRAY_CONSTRAIN(IFERROR(QUERY({Data!A:O, TRANSPOSE(QUERY(TRANSPOSE(Data!A:O),,99^99))}, "where lower(Col16) contains '"&LOWER(B1)&"'", 1)), 99^99, COLUMNS(A:O))
The issue is that I want to search several of my columns, namely D, E, G, H, M and N, where M contains multiple words that should be searched, separated by a comma and a space since the data comes from a form.
Is there a way to achieve this?
Here is a link to a heavily obfuscated sheet. The data is somewhat similar, but the document is very simplified and shortened.
https://docs.google.com/spreadsheets/d/1PxdObZsn62rQ3QeYVdy9HjToIHZsmw1d0cXK2jOuiC4/edit?usp=sharing
Solution
In this case you should use the same approach you were using before, but apply it to the constructed Col16:
=ARRAY_CONSTRAIN(IFERROR(QUERY({Data!A1:O, TRANSPOSE(QUERY(TRANSPOSE(Data!A1:O),,99^99))}, "where lower(Col16) LIKE LOWER(""%" &JOIN("%"") AND LOWER(Col16) LIKE LOWER(""%", SPLIT(B1," "))&"%"")", 1)), 99^99, COLUMNS(A:O))
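The idea behind the combined formula can be sketched in Python as follows: glue each row into one lowercase string (the role of Col16), split the search box on spaces, and keep the rows that contain every word. The sample data and the function name row_matches are invented for illustration.

```python
def row_matches(row, search):
    # Equivalent of the smashed Col16: all cells of the row in one string.
    haystack = " ".join(str(cell) for cell in row).lower()
    # Equivalent of SPLIT(B1, " ") joined with AND ... LIKE '%word%'.
    return all(word in haystack for word in search.lower().split())

data = [
    ["Alice", "Paris", "red, blue"],
    ["Bob", "Berlin", "green"],
]

hits = [row for row in data if row_matches(row, "blue paris")]
print(hits)
```

Note that, like the formula, this looks at every column of the row. To restrict the search to specific columns you would build the joined string from those columns only.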
I'm trying to get the separate <td>'s of a <tr> that I'm importing through IMPORTXML to show up in separate cells in Google Sheets.
This code should get my match data based on the match ID I provide and my player ID. I feel that simply adding /* or /td to the end of the XPath should work, but that's the end of my knowledge.
I tried adding /*, /td and other variants to the end of the XPath query, but it doesn't seem to work.
Even disabled JavaScript and inspected website again but to no avail.
FORMULA:
=IMPORTXML("https://www.dotabuff.com/matches/5011379854";"//tr[contains(@class,'9764136')]")
Also tried:
=IMPORTXML("https://www.dotabuff.com/matches/5011379854";"//td[parent::tr[contains(@class,'9764136')]]")
Which only gives the first of all the td's and not the rest.
The current output is all mushed together:
"19LemthTop (Off)ZeusCoreTop (Off) Roaminglost27108.7k127933650626.5k-183-/-5m7m21m31m"
The output that I want is separate <td> on separate lines:
"19
LemthTop (Off)ZeusCoreTop (Off) Roaminglost
2
7
10
8.7k
127
9
336
506
26.5k
-
183
-/-
5m7m21m31m"
Issue and workaround:
I tried to parse the values of each row, but unfortunately it seems that the td elements cannot be retrieved directly as a row using an XPath with IMPORTXML. Fortunately, each table can be retrieved with IMPORTHTML, and each tab can also be accessed. Using that, how about the following workaround?
Retrieve a table from the URL using IMPORTHTML.
Retrieve the row that includes the name corresponding to 9764136 using a query.
Modified formula:
=TRANSPOSE(SPLIT(TEXTJOIN("#",TRUE,QUERY(IMPORTHTML(A1,"table",1), "where Col4 contains '"&IMPORTXML(A1,"//a[contains(@href,'9764136')]")&"'", 0)),"#",TRUE,TRUE))
The URL https://www.dotabuff.com/matches/5011379854 is put in cell "A1".
After the table is retrieved, the row is selected from it by the query.
The important point of this workaround is the methodology. I think there are various formulas for retrieving the value, so please think of the above sample formula as just one of them.
Result:
Note:
If you use the above formula for other URLs, an error might occur. Please be careful about this.
References:
IMPORTHTML
IMPORTXML
TEXTJOIN
SPLIT
TRANSPOSE
I would like to use spreadsheets to get all unique names from Column A in a table but in the same time I would like blank cells to be ignored. So far I've got this formula that returns all of the unique names from column A but I don't know how to go about ignoring blank cells and not repeating values that have once been added previously.
Here is how my document looks so far. As you can see everything stops after Megan because there is a blank cell.
=IFERROR(INDEX($A$2:$A$90, MATCH(0, COUNTIF($I$10:I10, $A$2:$A$90), 0)), "")
Searched long and wide but came up with nothing, if anyone has any idea how one could do that I would really appreciate it. Thanks!
=unique(A2:A) should work
=unique(filter(A2:A,A2:A<>"")) to also ignore blanks
Yet another hack
=SORT(UNIQUE(A2:A))
Technically, this does not remove the blank result. But nonetheless puts it at the end of the list. You'll also benefit from the sort if you need it. 😁
You can use query:
=unique(query(A2:A,"select A where A<>''"))
You can use this code:
=IFERROR(INDEX($A$2:$A$90, MATCH(0, INDEX(COUNTIF($I$10:I10, $A$2:$A$90)+($A$2:$A$90=""), ), 0)), "")
should work
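As a cross-check of what the formulas above are meant to return, here is the same "unique values, skipping blanks, in first-seen order" logic as a small Python sketch. The names other than Megan are placeholders, and unique_non_blank is a made-up helper name.

```python
def unique_non_blank(values):
    seen = set()
    result = []
    for value in values:
        # Skip blank cells and anything already collected.
        if value == "" or value in seen:
            continue
        seen.add(value)
        result.append(value)
    return result

column_a = ["Anna", "Megan", "", "Anna", "Tom", "", "Megan"]
print(unique_non_blank(column_a))
```

This mirrors =unique(filter(A2:A,A2:A<>"")): the blank after Megan no longer stops the list, and repeated names appear only once.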