Using =IMPORTXML in Google Spreadsheets to extract a table by descriptions - google-sheets

From the website http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396 I am trying to extract just the text data, such as the birth, death, bio, location, and created-by sections, into different rows/columns. I want a spreadsheet where I can input a FindAGrave URL and have it extract the above data for me. I read in "Using =importXML in Google Docs" that it's possible to do this by descriptions. From there I learned to omit tbody from the XPath. That got my import to work, but without using the descriptions. I'm not sure whether using descriptions would be more efficient or not. I just want to learn how other people would go about importing data from tables.
Thanks
Here is what I have so far. This extracts the Birth information and puts it in rows. One problem is that it adds an extra cell between each piece of data.
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//html/body/table/tr/td[3]/table/tr[4]/td[1]/table/tr/td/table/tr/td/table/tr[1]/td[2]")
Result
Dec. 2, 1882 Humphreys County Tennessee, USA
Update: I think I made some progress with the code. This is what I'm working with now.
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr'][1]//tr/td/table/tr/td/table/tr[1]/td[1]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[1]/td[2]/text()[1]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[1]/td[2]/text()[2]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[1]/td[2]/text()[3]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[1]/td[2]/text()[4]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr'][1]//tr/td/table/tr/td/table/tr[2]/td[1]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[2]/td[2]/text()[1]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[2]/td[2]/text()[2]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[2]/td[2]/text()[3]")
=IMPORTXML("http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396", "//*[@class='gr']//tr/td/table/tr/td/table/tr[2]/td[2]/text()[4]")
Results:
Birth:
Nov. 8, 1948
Benton
Saline County
Arkansas, USA
Death:
Jan. 6, 2006
Tulsa
Tulsa County
Oklahoma, USA
Is there a way to split this data up within the code?

The following formula
=IMPORTXML(
"http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396",
"//html/body/table/tr/td[3]/table/tr[4]/td[1]/table/tr/td/table/tr/td/table/tr[position()<=2]/td/text()"
)
returns
Birth:
Nov. 8, 1948
Benton
Saline County
Arkansas, USA
Death:
Jan. 6, 2006
Tulsa
Tulsa County
Oklahoma, USA
A shorter alternative,
=IMPORTXML(
"http://www.findagrave.com/cgi-bin/fg.cgi?page=gr&GScid=97961&GRid=22682396",
"//tr[4]/td[1]//tr[position()<=2]/td/text()"
)
returns the same result
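To make it easier to see what tr[position()<=2]/td/text() selects, here is a minimal Python sketch of the same idea. The markup is a hypothetical, simplified stand-in for the nested table (it assumes the lines inside a cell are separated by br tags), and the helper name first_rows_text is made up for illustration. Python's standard ElementTree does not support text() or position() in paths, so the row limit and the text-node collection are done by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the table on the page.
# Assumption: each line inside a cell is separated by a <br/> tag.
SAMPLE = """
<table>
  <tr><td>Birth:</td><td>Nov. 8, 1948<br/>Benton<br/>Saline County<br/>Arkansas, USA</td></tr>
  <tr><td>Death:</td><td>Jan. 6, 2006<br/>Tulsa<br/>Tulsa County<br/>Oklahoma, USA</td></tr>
  <tr><td>Burial:</td><td>Not part of the first two rows</td></tr>
</table>
""".strip()


def first_rows_text(markup, max_rows=2):
    """Collect the direct text nodes of every cell in the first `max_rows` rows."""
    table = ET.fromstring(markup)
    values = []
    for row in table.findall("tr")[:max_rows]:      # like tr[position()<=2]
        for cell in row.findall("td"):               # like /td
            # Direct text nodes of a cell: its leading text plus the text
            # that follows each child element (the "tail" of each <br/>).
            pieces = [cell.text] + [child.tail for child in cell]
            values.extend(p.strip() for p in pieces if p and p.strip())
    return values


result = first_rows_text(SAMPLE)
print(result)
```

Each text node becomes its own value, which is why IMPORTXML spreads the birth and death details over separate cells instead of returning one cell per row.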

You can get multiple fields by simplifying your XPaths. You can also combine several XPaths in a single function call by separating them with a | :
`=ARRAYFORMULA(TRIM(TRANSPOSE(IMPORTXML($A3,"//td[@align='left']/text()|//tr[6]/td/a|//tr[3]/td/text()[1]"))))`
The three XPaths used are:
//td[@align='left']/text()
//tr[6]/td/a
//tr[3]/td/text()[1]
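As a rough illustration of combining several paths into one result, here is a small Python sketch. The markup and the helper name select_many are hypothetical. Python's standard ElementTree has no | operator, so the sketch runs each path separately and drops duplicate nodes. Unlike a real XPath union, which returns nodes in document order, this version returns them in the order the paths are listed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, only to show attribute tests and combined paths.
SAMPLE = """
<table>
  <tr><td align="left">Name</td><td align="right">ignored</td></tr>
  <tr><td align="left">Plot</td><td><a href="#">Cemetery</a></td></tr>
</table>
""".strip()


def select_many(root, paths):
    """Run each path in turn and return the matched elements without duplicates."""
    seen = set()
    matched = []
    for path in paths:
        for element in root.findall(path):
            if id(element) not in seen:      # a node matched twice is kept once
                seen.add(id(element))
                matched.append(element)
    return matched


root = ET.fromstring(SAMPLE)
# Note the @ in front of the attribute name: [@align='left'] tests an attribute.
matches = select_many(root, [".//td[@align='left']", ".//td/a"])
texts = [element.text for element in matches]
print(texts)
```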

Related

Data Scraping Google Sheets Formula N/a and Incomplete

Right now I am scraping certain product information from bol.com.
The product data is getting scraped, but out of every 30 products, about 5 either come back incomplete (for instance, the EAN is missing even though it is in the same content block as usual) or just return N/A, even though the information is there.
Any tips?
P.S. This is my current formula: =importxml(C27;"//*[@id='mainContent']/div/div[1]/div[5]/div[1]/div[4]/div/div/div[1]/div/div[1]/dl")
C27 is the URL. (https://www.bol.com/nl/nl/p/adroitgoods-hondenriem-180-cm-hondenlijn-looplijn-hondenlijn-reflecterend-rood-lange-lijn-hond/9300000101425619/)
It should import the following product information:
EAN
Kleur
Materiaal
Maat
Reflecterend
Speciaal voor hardlopen
Type uitlaatriem
Verpakkingsinhoud
Try:
=INDEX(TRIM(IMPORTXML(A1, "//dl[@class='specs__list']/div")))

Sorting and removing non-duplicate rows in google sheet and keeping non-duplicate rows and duplicate rows

I am fairly new to Google Sheets. Essentially, I am trying to remove all rows whose values are not listed in another sheet or column, and also store those removed rows somewhere else.
In my example sheet, I am trying to keep only the Alcohol names that are listed in column G.
So in my case, I only want to keep the following records:
Alcohol Name Alcohol Type Origin
Martell Cognac France
Captain Morgans Rum Jamaica
Wray & Nephew Rum Jamaica
Hennessey Cognac France
Barcardi Rum Cuba
Courvoiser Cognac France
Famous Grouse Scotch Scotland
Jack Daniels Whisky USA
Grants Scotch Scotland
Ciroc Vodka France
I also want to keep any that did not appear in the list in a separate table like this:
Alcohol Name Alcohol Type Origin
Russian Standard Vodka Russia
Southern Comfort Bourbon USA
Ciroc Whisky France
At the moment I am having to manually check a longer list one by one, and it is taking a lot of time and my arm hurts.
If someone could help me sort it so that it looks like this, that would be great! I don't know if there are formulas we can use.
Use this formula to keep only the Alcohol names that are listed in column G:
=QUERY(A1:C," where A matches '"&TEXTJOIN("|",1,G2:G)&"' ",1)
To order them use
=QUERY(A1:C," where A matches '"&TEXTJOIN("|",1,G2:G)&"' order by A",1)
Use this to keep any that did not appear in the list in a separate table.
The only change is adding not to the formula:
=QUERY(A1:C," where not A matches '"&TEXTJOIN("|",1,G2:G)&"' ",1)
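The formulas above build one regular expression by joining the names in column G with |, then keep or reject each row depending on whether column A matches it. A minimal Python sketch of the same technique follows. The rows are a small subset of the question's table, the wanted list is a shortened stand-in for column G, and the function name split_by_names is made up. The call to re.escape is an addition that the formula does not have: it makes characters such as "." or "(" in a name match literally.

```python
import re

# A few rows from the question; `wanted` is a shortened stand-in for column G.
rows = [
    ("Martell", "Cognac", "France"),
    ("Russian Standard", "Vodka", "Russia"),
    ("Wray & Nephew", "Rum", "Jamaica"),
    ("Southern Comfort", "Bourbon", "USA"),
]
wanted = ["Martell", "Wray & Nephew", "Grants"]


def split_by_names(rows, names):
    """Split rows into (listed, not listed) by matching the first column."""
    # Same idea as TEXTJOIN("|", 1, G2:G): one alternation of all names,
    # skipping empty entries.
    pattern = re.compile("|".join(re.escape(name) for name in names if name))
    kept, rest = [], []
    for row in rows:
        # fullmatch: the whole name must match, not just a part of it.
        (kept if pattern.fullmatch(row[0]) else rest).append(row)
    return kept, rest


kept, rest = split_by_names(rows, wanted)
print(kept)
print(rest)
```

The first list corresponds to the "where A matches" query and the second to the "where not A matches" query.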

Import 'Name' and 'Expected Return' list according to team name

Link:
https://www.sportsgambler.com/injuries/football/england-premier-league/
Test Fail (link in row A2 and team name in A1):
A1 = Aston Villa
A2 = https://www.sportsgambler.com/injuries/football/england-premier-league/
=IMPORTXML(A2,"//h3[contains(@class,'"&A1&"')]//span[@class='inj-player'] | //h3[contains(@class,'"&A1&"')]//span[contains(@class,'inj-return')]")
But it returns an error. I also need help finding the best way to import these two sets of data and split them into two columns in the spreadsheet, because a plain IMPORTXML puts all the imported data into a single column.
Expected Result:
Emiliano Martinez Doubtful
Jack Grealish Early March
Matty Cash Mid March
Kortney Hause Mid March
Wesley Moraes Late March
I believe your goal is as follows.
You want to retrieve the following values.
Emiliano Martinez Doubtful
Jack Grealish Early March
Matty Cash Mid March
Kortney Hause Mid March
Wesley Moraes Late March
When you want to use one IMPORTXML, in this case, how about the following sample formula?
Sample formula:
=QUERY(IMPORTXML(A2,"//div[./h3/a[text()='"&A1&"']]/div/div[@class='inj-container']"),"SELECT Col2,Col8")
In this formula, the cells "A1" and "A2" are Aston Villa and https://www.sportsgambler.com/injuries/football/england-premier-league/, respectively.
When I tested your formula, no values were returned. Looking at the HTML of the URL, the team name that A1 supplies to h3[contains(@class,'"&A1&"')] appears in the text of the a tag, not in a class attribute. I think this is why no values are returned.
References:
IMPORTXML
QUERY
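To illustrate the difference between looking for the team name in a class attribute and looking for it in the link text, here is a small Python sketch. The markup is a hypothetical simplification and not the site's real HTML (the class name team-title and the Arsenal placeholder row are invented), and injuries_for is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more deeply nested.
SAMPLE = """
<section>
  <div>
    <h3 class="team-title"><a href="#">Aston Villa</a></h3>
    <div class="inj-container">
      <span class="inj-player">Jack Grealish</span>
      <span class="inj-return">Early March</span>
    </div>
    <div class="inj-container">
      <span class="inj-player">Matty Cash</span>
      <span class="inj-return">Mid March</span>
    </div>
  </div>
  <div>
    <h3 class="team-title"><a href="#">Arsenal</a></h3>
    <div class="inj-container">
      <span class="inj-player">Placeholder Player</span>
      <span class="inj-return">Unknown</span>
    </div>
  </div>
</section>
""".strip()


def injuries_for(root, team):
    """Return (player, expected return) pairs for the block whose h3/a text is `team`."""
    rows = []
    for block in root.iter("div"):
        link = block.find("h3/a")
        # The team name lives in the link text, so compare against that.
        if link is None or (link.text or "").strip() != team:
            continue
        for item in block.findall(".//div[@class='inj-container']"):
            player = item.find("span[@class='inj-player']")
            expected = item.find("span[@class='inj-return']")
            if player is not None and expected is not None:
                rows.append((player.text, expected.text))
    return rows


root = ET.fromstring(SAMPLE)
by_class = root.findall(".//h3[@class='Aston Villa']")   # nothing: wrong place to look
by_text = injuries_for(root, "Aston Villa")
print(by_class, by_text)
```

Returning pairs keeps the player and the expected return side by side, which corresponds to the two columns selected by QUERY in the formula.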

How to sum specific information w/ multiple criteria including dates from form submission

I have a sheet that is linked to a Google Form, so when a person submits the form, the information is populated into the sheet automatically with a timestamp, e.g. 1/17/2020 17:26:16. I'm trying to sum information based on multiple criteria, and one is to pull only a full day's worth of data, but the formula is reading the time as well, so I keep getting 0.
For example, here is some data
1/8/2020 17:38:49 Danny PM Beetlejuice on Broadway 1144
1/8/2020 17:38:49 Danny PM Oklahoma! on Broadway 1181
1/8/2020 17:38:49 Danny PM Oklahoma! on Broadway 1000.5
1/8/2020 12:47:18 Jeff PM To Kill a Mockingbird 1675
1/8/2020 12:48:19 Jeff PM Jagged Little Pill 2390
On another tab I'm trying to calculate how much was spent by each person on this day. This new tab looks at a person's shift and name and uses SUMIFS to total their spend:
=SUMIFS('Form Responses 1'!$E:$E,'Form Responses 1'!$D:$D,B$5,'Form Responses 1'!$B:$B,$A9,'Form Responses 1'!$C:$C,$B$2)
I don't believe you'll need to know what each piece in this current code means since I just need to add to it for it to read a range of dates and narrow down to one day.
I've tried adding the date range 'Form Responses 1'!$E:$E and having the criterion be the desired date filled in B2 but this is when it is reading for an exact match of the time from the range which is not going to work since I don't want it to read the time. I want to find a solution that won't involve having to manually update the submission data each time.
I've included a sample sheet here so whoever wants to try and tackle this can better see what it is I'm working with. In the review tab I have my current formula not specifying date and next to it the same but trying to specify the date.
Thank you in advance. My brain is a scattered mess so I hope everything makes sense.
If you want all records for the specified day to be included, you must use the >= and <= operators.
Something like this:
=SUMIFS('Form Responses 1'!$E:$E,
'Form Responses 1'!$B:$B, $A6,
'Form Responses 1'!$C:$C, $B$2,
'Form Responses 1'!$A:$A, ">="&$B$1,
'Form Responses 1'!$A:$A, "<="&$B$1+1)
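The reason an exact date match yields 0 is that each timestamp carries a time of day, so it never equals the bare date. Bounding the timestamp between the start of the day and the start of the next day avoids that. Below is a minimal Python sketch of this idea using the sample rows from the question, with the show column left out. The function name spend_for_day is made up. One deliberate difference: the sketch uses a strict upper bound, so a timestamp at exactly midnight of the following day is excluded, whereas the formula as written with <= would include it.

```python
from datetime import date, datetime, timedelta

# Sample rows from the question: (timestamp, name, shift, amount).
responses = [
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1144.0),
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1181.0),
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1000.5),
    (datetime(2020, 1, 8, 12, 47, 18), "Jeff", "PM", 1675.0),
    (datetime(2020, 1, 8, 12, 48, 19), "Jeff", "PM", 2390.0),
]


def spend_for_day(rows, name, shift, day):
    """Sum amounts for one person and shift on a single calendar day."""
    start = datetime(day.year, day.month, day.day)   # midnight at the start of the day
    end = start + timedelta(days=1)                  # midnight at the start of the next day
    return sum(
        amount
        for timestamp, row_name, row_shift, amount in rows
        if row_name == name and row_shift == shift and start <= timestamp < end
    )


danny_total = spend_for_day(responses, "Danny", "PM", date(2020, 1, 8))
jeff_total = spend_for_day(responses, "Jeff", "PM", date(2020, 1, 8))
print(danny_total, jeff_total)
```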
In addition to the accepted correct answer, you could also try the following QUERY formula so you can get everything with just one formula instead of 5.
=IFERROR({QUERY(A:E,"select B, sum(E) where not A='' and C='"&H2&"' group by B label B 'Runner Name', sum(E) 'Total Spend' ",1),
QUERY(A:E,"select sum(E) where not A='' and C='"&H2&"' and todate(A)=date '"&text(H1,"yyyy-mm-dd")&"' group by B label sum(E) 'Totals per day' ",1)},
"No data")
(Please adjust ranges to your needs)
By using todate(A) we extract the date value from a timestamp.
The big advantage of using a single query is that, since you use the data from a form, your results will update automatically as new answers come in.
Please feel free to ask if you need further information.
Another query that should work for you is below; I've left it on the new MK.Help tab in cell A5.
=ARRAYFORMULA(QUERY({INT('Form Responses 1'!A:A),'Form Responses 1'!B1:E},"select Col2,SUM(Col5) where Col1="&B1&" and Col3='"&B2&"' group by Col2 order by SUM(Col5) desc label SUM(Col5)'Total'"))
I agree with the previous poster that a query is the way to go, since it will populate automatically. It also lets you display the table in any order you like. I chose to sort by total, with the highest totals at the top.
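Both query answers rest on the same step: strip the time from each timestamp (todate(A) or INT(A)), filter on the resulting date, then group by person. A rough Python equivalent is sketched below using the sample rows from the question. The function name totals_by_person is made up for illustration.

```python
from collections import defaultdict
from datetime import date, datetime

# Sample rows from the question: (timestamp, name, shift, amount).
responses = [
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1144.0),
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1181.0),
    (datetime(2020, 1, 8, 17, 38, 49), "Danny", "PM", 1000.5),
    (datetime(2020, 1, 8, 12, 47, 18), "Jeff", "PM", 1675.0),
    (datetime(2020, 1, 8, 12, 48, 19), "Jeff", "PM", 2390.0),
]


def totals_by_person(rows, day, shift):
    """Group one day's rows by person and sort by total, highest first."""
    totals = defaultdict(float)
    for timestamp, name, row_shift, amount in rows:
        # timestamp.date() drops the time of day, like INT() on a sheet timestamp.
        if timestamp.date() == day and row_shift == shift:
            totals[name] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = totals_by_person(responses, date(2020, 1, 8), "PM")
print(ranking)
```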

How to group data by age range?

Given a data list with two columns: 'Division' and 'Age.'
username year_of_birth
Albert Albo 1977
Bob Bilo 1974
Conan Cornic 1989
Don Duan 1954
Etan Etin 1967
Fabio Forio 1976
I want to put this data into a Pivot Table and group the ages into specified ranges; however, I'm having issues figuring out how to get around grouping them into set increments that don't vary. My first range would need to be 18-24, my next would be 25-29, then 30-34, 35-39, and so on until I hit 64. Then, I would have 65+ all grouped into one, like so:
How could I make it work ?
A simpler approach (also a single formula) might be:
=ArrayFormula(vlookup(year(now())-B2:B+1,Larry,2))
where year of birth is in ColumnB. This though does require a named range (Larry) of:
This repeats the assumption that, lacking month, day, and time, everyone is treated as having been born at the very start of the year_of_birth.
A contingency is included for under-18s, where 0-17 in the array might be replaced by "invalid" or something similar.
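VLOOKUP on a sorted table without an exact-match flag returns the row with the largest key that does not exceed the lookup value, which is what turns an age into a range label. Here is a minimal Python sketch of that lookup using bisect. The lookup table is an assumption reconstructed from the ranges listed in the question (the contents of the named range are not shown here), the reference year 2020 is a hypothetical fixed value standing in for year(now()), and the function names are made up.

```python
from bisect import bisect_right

# Assumed lookup table, rebuilt from the ranges described in the question.
LOWER_BOUNDS = [0, 18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
LABELS = ["0-17", "18-24", "25-29", "30-34", "35-39", "40-44",
          "45-49", "50-54", "55-59", "60-64", "65+"]

REFERENCE_YEAR = 2020  # hypothetical stand-in for year(now())


def age_from_birth_year(year_of_birth, reference_year=REFERENCE_YEAR):
    """Mirror the formula's arithmetic: year(now()) - year_of_birth + 1."""
    return reference_year - year_of_birth + 1


def age_group(age):
    """Return the label of the last range whose lower bound is <= age."""
    if age < 0:
        raise ValueError("age must not be negative")
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]


birth_years = {"Albert Albo": 1977, "Bob Bilo": 1974, "Conan Cornic": 1989,
               "Don Duan": 1954, "Etan Etin": 1967, "Fabio Forio": 1976}
groups = {name: age_group(age_from_birth_year(year))
          for name, year in birth_years.items()}
print(groups)
```

Because the boundaries are listed explicitly, uneven ranges such as 18-24 followed by five-year steps need no special handling.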
Just for fun, let's see if we can make it in a single formula
Creating a pivot from here is trivial.
