Scraping bylines from web page using ImportXML in Google Sheets - google-sheets

Looking to extract the author name from articles. Currently using =IMPORTXML(G2,"//*[@class='author-details']")
When I do this, it creates 4 cells underneath which contain the word 'By', which I can't get rid of.
Very new to code - what am I doing wrong?
Attached example: https://docs.google.com/spreadsheets/d/1Mi1D5G1-_gNsQwVQ6I_ealDqcWixKA2p-hFqJpjlGt4/edit?usp=sharing

You can use:
=index(IMPORTXML(G2,"//*[@class='author-details']"),1,2)
This displays only the first row of the second column of what is returned, which is the information you are after.
Edit:
Additionally, since you highlighted that you want the author name: if all the names are in the format "By FIRST LAST @TwitterHandle Affiliation", you can use this to get just the author's name:
=trim(split(right(index(IMPORTXML(G2,"//*[@class='author-details']"),1,2),len(index(IMPORTXML(G2,"//*[@class='author-details']"),1,2))-3),"@",true,true))
It will likely look like voodoo, but paste it in; it works. It removes the first 3 characters ("By "), splits the text at the "@" symbol, and then keeps only the text on the left side of it, the name.
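To make the three steps easier to follow, here is a minimal Python sketch of the same logic outside Sheets. The function name and the sample byline are made up for illustration, and it assumes every byline really starts with "By " and puts the handle after an "@".

```python
def author_from_byline(byline: str, delimiter: str = "@") -> str:
    """Drop the leading "By ", then keep the text left of the first delimiter."""
    without_prefix = byline[3:]                         # like RIGHT(text, LEN(text) - 3)
    left_of_handle = without_prefix.split(delimiter)[0]  # like SPLIT, keeping the first piece
    return left_of_handle.strip()                        # like TRIM


# Hypothetical byline in the "By FIRST LAST @handle Affiliation" shape
print(author_from_byline("By Jane Doe @janedoe Example News"))
```

If a byline has no handle at all, the split leaves the text unchanged, so the whole remainder after "By " is returned.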

Related

GoogleSheets - bringing back last "n" characters that are uppercase only

I have data that brings back a name plus the ticker code. I need to extract only the ticker code in Google Sheets.
Using =IMPORTXML("https://www.coingecko.com/?page=5","//tr") I bring back 100 coins, but the first column looks like, for example:
Dogecoin9DOGE
or
Ethereum2ETH
or
Ethereum ClassicETC
I would like to create a column that simply has DOGE or ETH or ETC, anyone know of a way to manipulate the 1st column to get to that?
Thanks
The webpage looks like it has 2 separate columns, but the extract does not come through that way. I was trying to think of a way to count the uppercase characters and then maybe use RIGHT with that count, but I'm not sure how to get there.
You can use REGEXEXTRACT.
=REGEXEXTRACT(A1,"[A-Z]+$")
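The pattern [A-Z]+$ means "the run of uppercase letters at the very end of the text". As a rough check of how it behaves on the three sample values from the question, here is a small Python sketch; the function name is hypothetical, and Python's re module stands in for the regex engine Sheets uses.

```python
import re
from typing import Optional


def trailing_ticker(cell: str) -> Optional[str]:
    """Return the uppercase run at the end of the text, or None if there is none."""
    match = re.search(r"[A-Z]+$", cell)
    return match.group(0) if match else None


for value in ["Dogecoin9DOGE", "Ethereum2ETH", "Ethereum ClassicETC"]:
    print(value, "->", trailing_ticker(value))
```

Note that a ticker containing digits or lowercase letters would only be partly captured by this pattern.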

Remove text between square bracket in complicated array formula

I have a complicated formula that copies data from other cells on a different tab and I want to add the removal of brackets and text between them. This information is private so I don't want it displayed on the new tab. I am working in Google Sheets.
The formula below results in copying first name, last name, and a course that meets specific criteria on another tab. The problem is that we are indicating Legal Names with brackets on the primary tab, but those legal names should NOT be displayed on the new tab. I'm sure there is a better way to write this, but it works extremely well. I just need to add the removal of the brackets and texts between them. I know there is a way, but I can't seem to make it work with the current formula. HELP!
=IFERROR(INDEX('7th Class List'!$C$3:$E$66,SMALL(IF('7th Class List'!$A$3:$A$66="A",ROW('7th Class List'!$A$3:$A$66)-ROW('7th Class List'!$C$3)+1),ROW(1:1))),"")
It's done like this:
=REGEXREPLACE(A1, "\[.*\]", )
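One thing to watch: .* is greedy, so if a cell ever contains more than one bracketed group, everything from the first "[" to the last "]" is removed. The Python sketch below shows the difference using a made-up name; the per-group pattern \[[^\]]*\] is a suggested alternative, not something from the answer above, and the function name is hypothetical.

```python
import re

SAMPLE = "Sam [Samuel] Lee [Leigh] Math"  # made-up cell with two bracketed groups

# Greedy pattern from the answer: removes from the first "[" to the last "]"
greedy = re.sub(r"\[.*\]", "", SAMPLE)


def strip_bracketed(text: str) -> str:
    """Remove each [...] group on its own, then collapse the leftover double spaces."""
    without_brackets = re.sub(r"\[[^\]]*\]", "", text)
    return re.sub(r"\s{2,}", " ", without_brackets).strip()


print(repr(greedy))                   # "Lee" is lost along with the bracketed text
print(repr(strip_bracketed(SAMPLE)))  # only the bracketed text is removed
```

With a single bracketed group per cell, both patterns give the same result.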

Parse current cell contents into URL based on the content of the same cell in Google Sheets

I would like to parse the content of a cell into a URL based on the entered content of the same cell in Google Sheets. I need the entire column to be processed.
Right now I can only accomplish this with two columns like this...
Column A has an ID number. Column B uses an array to parse a URL based on the ID number in Column A. The array formula I'm using in B2 is...
=ARRAYFORMULA(HYPERLINK("http://www.website.com/content/"&B3:B, ""&B3:B))
So A3 might have the ID number entry: 216856
And this creates the URL in B3: http://www.website.com/content/216856
But what I would really love, is a way to do this with one column. Perhaps through a script? Can anyone help me with this, please? Thank you!
Highlight the column you want to transform, then go to Edit and choose Find and replace (pressing Command+Shift+H also brings it up). In the Find field enter ^ and in the Replace field enter http://www.website.com/content/, then check the "Search using regular expressions" checkbox.
Once you click Replace all, it adds that part of the URL to the beginning of every cell, turning each one into a URL. It is easy and generally quick, depending on how many rows you have; I have done this with tens of thousands of rows and more.
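The trick works because ^ matches the empty position at the start of the text, so "replacing" it inserts the prefix without touching the ID. A minimal Python sketch of the same idea, with a hypothetical function name and the example ID from the question:

```python
import re

PREFIX = "http://www.website.com/content/"


def add_prefix(cell: str) -> str:
    """Insert PREFIX at the start-of-text position matched by ^."""
    return re.sub(r"^", PREFIX, cell)


print(add_prefix("216856"))
```

Running the replacement a second time would add the prefix again, so it is a one-off transformation rather than something to repeat on the same cells.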

How can I SUM alphanumeric values in Google Sheets?

It's not the best system but I've been using ImportXML to pull in YouTube view counts for my videos so I can keep track. I knew at some point YouTube would make a front end change that would break this. So now recently instead of just the number inside of the DIV I am referencing they are always showing the word "views". So here's what I'm using now
=IMPORTXML("https://www.youtube.com/watch?v=qXnr03IIPTI","//div[@class='watch-view-count']")
This will output "300,000 views" (or whatever the current view count is)
Before they added the word "views" in this DIV I was able to add up this column. So I added another column to strip out the text.
(where G7 is the cell containing the above value)
=SUBSTITUTE(G7, "views", "")
I thought for sure I would then be able to run a SUM on this column but alas the SUM is 0. I believe this is something Excel would be able to deal with.
Any ideas here?
Bonus points if you watch the video :)
If the result is in G7, use formula:
=JOIN("",REGEXEXTRACT(G7,"([0-9]+),([0-9]+),?([0-9]+)?"))*1
It will convert the string "300,000 views" into a number. It also handles bigger numbers, such as "1,368,142 views". The formula strips the commas because in some countries the comma is used as the separator between the integer and decimal parts of a number.
You can convert it to a value:
=VALUE(REGEXREPLACE(IMPORTXML("https://www.youtube.com/watch?v=qXnr03IIPTI","//div[@class='watch-view-count']")," views",""))
or
=VALUE(REGEXREPLACE(IMPORTXML("https://www.youtube.com/watch?v=qXnr03IIPTI","//div[@class='watch-view-count']"),"[, views]",""))
If you really want the commas, wrap the final amount back into text at the very end, such as =TEXT("FINAL SUM","#,#")
Try this:
=REGEXEXTRACT(IMPORTXML("https://www.youtube.com/watch?v=qXnr03IIPTI","//div[@class='watch-view-count']"), "[0-9]*\,[0-9]+[0-9]+")
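The underlying issue is that the cell still holds text after the word "views" is stripped, and SUM ignores text, so the total stays at 0 until the value is converted to a number. Here is a rough Python sketch of that conversion and the summing step; the function name is hypothetical and the two sample strings are the ones mentioned in this thread.

```python
import re


def views_to_int(text: str) -> int:
    """Keep only the digits of a string like "300,000 views" and convert to int."""
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


counts = ["300,000 views", "1,368,142 views"]
total = sum(views_to_int(c) for c in counts)
print(total)
```

This assumes the commas are thousands separators, as in the examples; a value with a decimal part would need different handling.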

Getting inconsistent tab delimiter width when pasting from Google docs spreadsheet

I am trying to create a gadget for some people, where all they need to do is really copy the contents of a spreadsheet, then paste it in a textbox, which will in turn create a nice table for them to embed in their articles.
I managed to do everything. However, when copying data from Google Docs and pasting it into a text editor, the width of the tab delimiter between values seems to come out wrong. Instead of the default 4 spaces, I am getting 2 in some cases, and so far I have found that this happens when some of the cells contain strings with spaces. For some reason, this seems to confuse Google Docs, which then supplies the wrong spacing and in turn ruins my script.
I know I can use comma-separated values here, but the issue is that we are trying to give people the ability to simply copy and paste. Look at the example output below:
School Name Location Type No. eligible pupils
In this example, School Name is one cell, Location is another, Type is another and No. eligible pupils is the last one. It is clear that the first cell does not have the necessary space on the right.
Any ideas? I thought about converting all blank spaces that take more than 1 space to commas, but this might lead to a situation users might actually use 2... which would not work again.
For some reason, it was the code editor that was actually not showing the tabs right. Using a regexp and another code editor (vim) showed that all of them were actual tabs. :)
