Data Scraping Google Sheets Formula N/a and Incomplete - google-sheets

Right now I am scraping certain product information from bol.com.
Most of the product data is scraped correctly, but out of every 30 products, about 5 return either incomplete data (for instance, the EAN is missing even though it sits in the same content block as usual) or just N/A, even though the information is there.
Any tips?
PS: This is my current formula: =importxml(C27;"//*[@id='mainContent']/div/div[1]/div[5]/div[1]/div[4]/div/div/div[1]/div/div[1]/dl")
C27 is the URL. (https://www.bol.com/nl/nl/p/adroitgoods-hondenriem-180-cm-hondenlijn-looplijn-hondenlijn-reflecterend-rood-lange-lijn-hond/9300000101425619/)
It should import the following product information:
EAN
Kleur
Materiaal
Maat
Reflecterend
Speciaal voor hardlopen
Type uitlaatriem
Verpakkingsinhoud

Try:
=INDEX(TRIM(IMPORTXML(A1, "//dl[@class='specs__list']/div")))
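The shorter XPath selects each div inside the specs list by its class instead of walking a long positional path, so it is less likely to break when the layout differs between products. Here is a minimal Python sketch of the same selection, assuming a simplified, hypothetical markup fragment (the real bol.com page is not reproduced here, and the EAN value is a placeholder):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the specs block of a product page.
SAMPLE = """
<html><body>
  <dl class="specs__list">
    <div><dt> EAN </dt><dd> 0000000000000 </dd></div>
    <div><dt> Kleur </dt><dd> Rood </dd></div>
  </dl>
</body></html>
"""

def read_specs(markup):
    """Return one trimmed text row per <div> inside the specs list."""
    root = ET.fromstring(markup)
    rows = []
    # Same idea as //dl[@class='specs__list']/div in the formula.
    for div in root.findall(".//dl[@class='specs__list']/div"):
        # Join all nested text and collapse whitespace, similar to TRIM.
        rows.append(" ".join("".join(div.itertext()).split()))
    return rows

specs = read_specs(SAMPLE)
print(specs)
```

ElementTree only handles well-formed markup, so this illustrates the XPath and nothing more; it is not a scraper for the live page.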

Related

How to check for overlapping dates

I am looking for a solution in either Google Sheets or Apps Script to check for overlapping dates for the same account. There will be multiple accounts and the dates won't be in any particular order. Here is an example below. I am trying to produce the right-hand column "Check" with some formula or automation. Any suggestions would be greatly appreciated.
Start Date    End Date      Account No.   Check
2023-01-01    2023-01-02    123           ERROR
2023-01-02    2023-01-05    123           ERROR
2023-02-25    2023-02-27    456           OK
2023-01-11    2023-01-12    456           OK
2023-01-01    2023-01-15    789           ERROR
2023-01-04    2023-01-07    789           ERROR
2023-01-01    2023-01-10    012           OK
2023-01-15    2023-01-20    012           OK
I also found some similar past questions, but they don't have the "for the same account" component and/or require some sort of chronological order, which my sheet will not have.
How to calculate the overlap between some Google Sheet time frames?
How to check if any of the time ranges overlap with each other in Google Sheets
Another approach (to be entered in D2):
=arrayformula(lambda(last_row,
lambda(acc_no,start_date,end_date,
if(isnumber(match(acc_no,unique(query(query(split(flatten(acc_no&"|"&split(map(start_date,end_date,lambda(start_date,end_date,join("|",sequence(1,end_date-(start_date-1),start_date)))),"|")),"|"),"select Col1,count(Col2) where Col2 is not null group by Col1,Col2",0),"select Col1 where Col2>1",1)),0)),"ERROR","OK"))(
C2:index(C2:C,last_row),A2:index(A2:A,last_row),B2:index(B2:B,last_row)))(
counta(A2:A)))
Briefly, we are creating a sequence of dateserial numbers between the start & end dates for each row, doing some string manipulation to turn it into a table of account number against each date, then QUERYing it to get each account number which has dateserials with count>1 (i.e. overlaps), using UNIQUE to get the distinct list of those account numbers, then finally matching this list against the original list of account numbers to give the ERROR/OK output.
(1) Here is one way, considering each case which could result in an overlap separately:
=ArrayFormula(if(A2:A="",,
if((countifs(A2:A,"<="&A2:A,B2:B,">="&A2:A,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
+countifs(A2:A,"<="&B2:B,B2:B,">="&B2:B,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
+countifs(A2:A,">="&A2:A,B2:B,"<="&B2:B,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
)>0,"ERROR","OK")
)
)
(2) Here is the method using the Overlap formula
min(end1,end2)-max(start1,start2)+1
which results in
=ArrayFormula(if(byrow(A2:index(C:C,counta(A:A)),lambda(r,sum(text(if(index(r,2)<B2:B,index(r,2),B2:B)-if(index(r,1)>A2:A,index(r,1),A2:A)+1,"0;\0;\0")*(C2:C=index(r,3))*(row(A2:A)<>row(r)))))>0,"ERROR","OK"))
(3) Most efficient is to use the original method of comparing previous and next dates, but then you need to sort and sort back like this:
=lambda(data,sort(map(sequence(rows(data)),lambda(c,if(if(c=1,0,(index(data,c-1,2)>=index(data,c,1))*(index(data,c-1,3)=index(data,c,3)))+if(c=rows(data),0,(index(data,c+1,1)<=index(data,c,2))*(index(data,c+1,3)=index(data,c,3)))>0,"ERROR","OK"))),index(data,0,4),1))(SORT(filter({A2:C,row(A2:A)},A2:A<>""),3,1,1,1))
However, this only checks for local overlaps, not global ones. You can see what I mean if you change the dataset slightly:
Clearly the first and third pairs of dates overlap, but G4 contains "OK". This is because each pair of dates is only checked against the adjacent pairs of dates. This also applies to the original reference cited by the OP; here's an example where it would give a similar result:
The formula posted by @The God of Biscuits gives the correct (global) result :-)
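To make the local versus global distinction concrete, here is a minimal Python sketch. It applies the overlap formula min(end1,end2)-max(start1,start2)+1 once to adjacent rows only (after sorting by account and start date) and once to every pair of rows within an account. The function names and the sample rows are hypothetical and are not the dataset from the screenshots.

```python
from datetime import date
from itertools import combinations

def overlap_days(start1, end1, start2, end2):
    """Inclusive overlap in days; zero or negative means no overlap."""
    return (min(end1, end2) - max(start1, start2)).days + 1

def _labels(rows, flagged):
    return ["ERROR" if i in flagged else "OK" for i in range(len(rows))]

def check_adjacent(rows):
    """Sort by account and start date, then compare neighbours only."""
    order = sorted(range(len(rows)), key=lambda i: (rows[i][2], rows[i][0]))
    flagged = set()
    for prev, cur in zip(order, order[1:]):
        a, b = rows[prev], rows[cur]
        if a[2] == b[2] and overlap_days(a[0], a[1], b[0], b[1]) > 0:
            flagged.update((prev, cur))
    return _labels(rows, flagged)

def check_global(rows):
    """Compare every pair of rows that share an account."""
    flagged = set()
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if a[2] == b[2] and overlap_days(a[0], a[1], b[0], b[1]) > 0:
            flagged.update((i, j))
    return _labels(rows, flagged)

# Hypothetical rows: (start date, end date, account number)
rows = [
    (date(2023, 1, 1), date(2023, 1, 10), "123"),
    (date(2023, 1, 2), date(2023, 1, 3), "123"),
    (date(2023, 1, 5), date(2023, 1, 6), "123"),
    (date(2023, 2, 25), date(2023, 2, 27), "456"),
]

print(check_adjacent(rows))  # third row is missed
print(check_global(rows))    # third row is caught
```

In this sample the third row lies inside the first row's range, but its only same-account neighbour after sorting is the second row, so the adjacent check reports it as OK. The pairwise check costs more comparisons but does not depend on row order.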

Web Scraping Google-Sheets ImportXML - xpath - specific Number in URL

I am trying to get a specific number from a URL that is hyperlinked on the website.
Please see here a copy of my spreadsheet.
In column I, I built a formula that goes directly to the eBay search page and appends the EAN number: ="https://www.ebay.de/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw="&""&D2
This is the outcome:
https://www.ebay.de/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=8713439712292
Up to here it works.
On that page, I want the eBay category ID for the article, which can be found as a hyperlink in the category navigation on the left.
In the URL it is always the first number, e.g. https://www.ebay.de/sch/**158817**/i.html?_from=R40&_nkw=650135421227
All I want now is to put the number 158817 in my Google spreadsheet.
With this formula
=IMPORTXML(I2;"//*[@id='x-refine__group__0']/ul/li/ul/li/ul")
I only get the category name, but I need the number to make my CSV upload work.
What formula do I need? Can someone please guide me?
Thank you
Lisa
With A1 = https://www.ebay.de/sch/**158817**/i.html?_from=R40&_nkw=650135421227, try this:
=regexextract(IMPORTXML(A1;"//*[@id='x-refine__group__0']/ul/li/ul/li/ul/li/a/@href");"[0-9]+")
assuming that the link is always at the same position in the page hierarchy,
or, to get all numbers:
=arrayformula(regexextract(IMPORTXML(A1;"//*[@id='x-refine__group__0']/ul/li//a/@href");"[0-9]+"))
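The pattern "[0-9]+" returns the first run of digits in the link, which is why the category ID has to come before any other number in the URL. A minimal Python sketch of that behaviour follows, using the two URLs from the question (emphasis markers removed). The stricter category_id helper, which anchors on the /sch/ path segment, is a hypothetical variant and not part of the answer.

```python
import re

def first_number(url):
    """Mimic REGEXEXTRACT(url; "[0-9]+"): first run of digits, or None."""
    match = re.search(r"[0-9]+", url)
    return match.group(0) if match else None

def category_id(url):
    """Hypothetical stricter variant: only digits directly after /sch/."""
    match = re.search(r"/sch/([0-9]+)/", url)
    return match.group(1) if match else None

category_url = "https://www.ebay.de/sch/158817/i.html?_from=R40&_nkw=650135421227"
search_url = "https://www.ebay.de/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=8713439712292"

print(first_number(category_url))  # 158817
print(first_number(search_url))    # 40, taken from "_from=R40"
print(category_id(search_url))     # None, there is no category segment
```

If a link without a category segment ever ends up in the import, the plain pattern silently returns an unrelated number, whereas the anchored pattern returns nothing.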

Import site-specific data

The data on the page is delivered as follows:
https://int.soccerway.com/international/europe/uefa-champions-league/20192020/group-stage/r54142/
1 - Below each schedule is a link to the match.
2 - I would like to import all the data at once.
3 - The result I seek would be as follows:
4 - I can import each piece separately, but because these are separate formulas it takes a long time. I would like to import everything at once, with a single formula if possible.
5 - The XPaths are:
"//*[@class='date no-repetition']"
"//*[@class='score-time status']/a"
"//*[@class='score-time status']/a/@href"
6 - An important detail: I specified 'score-time status' because there are games that appear as 'score-time score', but these cannot be imported.
7 - Another complicating detail: the time comes with spaces around the colon, so for that I use =SUBSTITUTE(," ","")
Is there any way to do what I want?
I've tried using ={;;} to import the data, but can't make calls to more than two =IMPORTXML().
I also tried =IMPORTHTML(), but it can't fetch the links under each match time, and the date also appears in only one of the games...
How about this answer? I think that there are several answers for your situation. So please think of this as just one of several possible answers.
xpath:
Unfortunately, I couldn't find a single XPath that directly retrieves the 3 values in your question. So in this answer, the following XPaths are used.
Date: //td[@class='date no-repetition']/span
Time: //td[@class='score-time status']/a/span
URL: //td[@class='score-time status']/a/@href
Sample formula:
=ARRAYFORMULA({IMPORTXML(A1,"//td[@class='date no-repetition']/span"),IMPORTXML(A1,"//td[@class='score-time status']/a/span"),"https://"&IMPORTXML(A1,"//td[@class='score-time status']/a/@href")})
In this formula, the URL https://int.soccerway.com/international/europe/uefa-champions-league/20192020/group-stage/r54142/ is in cell "A1".
The 3 retrieved values are placed in columns "A", "B" and "C".
Result:
Note:
In the above case, I think the time zone may be that of the location from which IMPORTXML retrieves the values.
If you want to convert the values to the time zone of your own Spreadsheet, how about the following sample formula?
=ARRAYFORMULA({IMPORTXML(A1,"//td[@class='date no-repetition']/span/@data-value")/86400+DATE(1970,1,1),IMPORTXML(A1,"//td[@class='date no-repetition']/span/@data-value")/86400+DATE(1970,1,1),"https://"&IMPORTXML(A1,"//td[@class='score-time status']/a/@href")})
In this case, please set the date and time formats for columns "A" and "B".
In the above formula, the date and time are retrieved as Unix time. This value is converted to a serial number, so the converted value can be used as a date and time in the Spreadsheet.
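The conversion value/86400+DATE(1970,1,1) can be checked outside the Spreadsheet. The minimal Python sketch below assumes the usual day-count serial whose day 0 is 1899-12-30, which makes DATE(1970,1,1) equal to 25569. The timestamp is a hypothetical example and was not taken from the page.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86400
SERIAL_EPOCH = datetime(1899, 12, 30)  # day 0 of the serial number
# Serial number of 1970-01-01, i.e. what DATE(1970,1,1) evaluates to.
UNIX_EPOCH_SERIAL = (datetime(1970, 1, 1) - SERIAL_EPOCH).days

def unix_to_serial(unix_seconds):
    """Same arithmetic as value/86400+DATE(1970,1,1) in the formula."""
    return unix_seconds / SECONDS_PER_DAY + UNIX_EPOCH_SERIAL

def serial_to_datetime(serial):
    """Turn a serial number back into a calendar date and time."""
    return SERIAL_EPOCH + timedelta(days=serial)

timestamp = 1568746500  # hypothetical value, 2019-09-17 18:55 UTC
serial = unix_to_serial(timestamp)

print(UNIX_EPOCH_SERIAL)
print(serial)
print(serial_to_datetime(serial))
```

The integer part of the serial is the date and the fractional part is the time of day, which is why the same expression can feed both column "A" and column "B" with different cell formats.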
References:
IMPORTXML
ARRAYFORMULA
If this was not the direction you want, I apologize.

How to get child nodes through importxml xpath query?

I'm trying to get the separate <td>'s of a <tr> that I'm importing through IMPORTXML to show up separately in Google Sheets.
This formula should get my match data based on the match ID and player ID I provide. I feel that simply adding /* or /td to the end of the XPath should work, but that's the end of my knowledge.
I tried adding /*, /td and others to the end of the XPath query, but it doesn't seem to work.
I even disabled JavaScript and inspected the website again, but to no avail.
FORMULA:
=IMPORTXML("https://www.dotabuff.com/matches/5011379854";"//tr[contains(@class,'9764136')]")
Also tried:
=IMPORTXML("https://www.dotabuff.com/matches/5011379854";"//td[parent::tr[contains(@class,'9764136')]]")
Which only gives the first of all the <td>'s and not the rest.
Current output is all mushed together:
"19LemthTop (Off)ZeusCoreTop (Off) Roaminglost27108.7k127933650626.5k-183-/-5m7m21m31m"
The output that I want is separate <td> on separate lines:
"19
LemthTop (Off)ZeusCoreTop (Off) Roaminglost
2
7
10
8.7k
127
9
336
506
26.5k
-
183
-/-
5m7m21m31m"
Issue and workaround:
Although I tried to parse the values for each row, unfortunately it seems that the td elements of a row cannot be directly parsed with an XPath using IMPORTXML. Fortunately, each table can be retrieved with IMPORTHTML, and each tab can also be accessed. Using them, how about the following workaround?
Retrieve a table from the URL using IMPORTHTML.
Retrieve the row that includes the name corresponding to 9764136 using a query.
Modified formula:
=TRANSPOSE(SPLIT(TEXTJOIN("#",TRUE,QUERY(IMPORTHTML(A1,"table",1), "where Col4 contains '"&IMPORTXML(A1,"//a[contains(@href,'9764136')]")&"'", 0)),"#",TRUE,TRUE))
The URL https://www.dotabuff.com/matches/5011379854 is in cell "A1".
After the table is retrieved, the row is selected from it by the query.
The important point of this workaround is the methodology. I think that there are various formulas for retrieving the value. So please think of above sample formula as just one of them.
Result:
Note:
If you use the above formula for another URL, an error might occur. Please be careful about this.
References:
IMPORTHTML
IMPORTXML
TEXTJOIN
SPLIT
TRANSPOSE

Google query function

I am trying to return multiple records from a logbook into a final monthly statement... I'm using the query function but I do not get multiple records, it only displays the first match.
My sheets are from 1-31 for days of the month, then the last sheet labeled 717 is for Unit #717's monthly statement.
On Sheet 717, I would like to display information from sheets 1 through 31. Where column A=717, display values from columns B,C,D. Currently, it will only show me the first match. The amount column should show the corresponding rate for that row.
I hope my explanation is not confusing, any help is much appreciated. Thanks.
Here is a link to sample spreadsheet.
As you are concatenating the output of QUERY functions, you are actually performing an "array calculation", and you'll need use an "array calculation enabler", otherwise you will indeed only get the first applicable result.
=ArrayFormula(QUERY('1'!A3:G60;"select B where A=717")&QUERY('1'!A3:G60;"select C where A=717")&QUERY('1'!A3:G60;"select D where A=717"))

Resources