Apple Automator to view a list of URLs from a text file

I run into scenarios where I want to compile a list of URLs in a text file and then use Apple Automator to view each of those URLs in a browser. My Mac is on OS X 10.9.4 with Automator v2.4. I can use
Get Specified URLs > Display Web Pages
but I have to cut and paste the URLs into Get Specified URLs. I want to read the URLs from a text file. I saw some posts that say to use "Get Specified Finder Items" > "Combine Text Files" > "Filter Paragraphs" but this does not work either. The flow I am trying is
Get Specified Finder Items > Combine Text Files > Filter Paragraphs > Get Specified URLs > Display Web Pages
When I run this workflow I see that Filter Paragraphs outputs the list of URLs, but with parentheses around them:
(
URL 1
URL 2
URL3
)
I see that the output of Get Specified URLs is empty, which causes the error "The action Display Webpages was not supplied the required data".
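One workaround that skips the Get Specified Finder Items > Combine Text Files > Filter Paragraphs chain altogether is a single Run Shell Script action that reads the file and opens each URL itself. A minimal sketch in Python (the file path is a placeholder; adjust it, or have Automator pass the file in as an argument):

#!/usr/bin/env python
# Sketch only: read one URL per line from a text file and open each
# in the default browser. URL_FILE is a placeholder path.
import webbrowser

URL_FILE = "/Users/you/Desktop/urls.txt"

with open(URL_FILE) as f:
    for line in f:
        url = line.strip()
        if url:                      # skip blank lines
            webbrowser.open_new_tab(url)

This bypasses Display Web Pages entirely, so it only helps if opening the pages in the default browser is acceptable.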

Related

Use VBScript to Parse a Webpage's Text

I'm currently working on a VBScript that will open multiple URLs in order to update documents on a server. I was wondering if there was a way to parse a webpage's content for a specific string, in this case being the updateResult SUCCESS line shown below:
I need to be able to record the success of this webpage text as opposed to the failure page below:
This is all that is on the webpage. How would I go about parsing the text of both these types of pages in order to know that the document has updated correctly or not?
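The question asks for VBScript, but the underlying check is just "fetch the page and look for the marker text". A rough sketch of that idea in Python, assuming the update pages are reachable over plain HTTP and that "updateResult SUCCESS" is the exact marker (both are assumptions; the URL is a placeholder):

# Sketch only: fetch each update URL and decide success/failure by
# searching the returned page text for the marker string.
from urllib.request import urlopen

MARKER = "updateResult SUCCESS"          # assumed marker text

def update_succeeded(url):
    body = urlopen(url).read().decode("utf-8", errors="replace")
    return MARKER in body

for url in ["http://server.example/update?doc=1"]:   # hypothetical URL
    print(url, "OK" if update_succeeded(url) else "FAILED")

The same pattern carries over to VBScript with MSXML2.XMLHTTP and InStr.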

CommonCrawl: How to find a specific web page?

I am using Common Crawl to restore pages I should have archived but have not.
In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a given URL has been archived.
A simple script downloads all indices from the available crawls:
./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on
Afterwards I have 112 MB of data and simply grep:
grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r
The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that Common Crawl should have archived them?
Update: Thanks to Sebastian, two links are left...
Two URLs are:
http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html
They even proposed a "URL Search Tool" which answers with a 502 - Bad Gateway...
You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename to read the content in your code. See the details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
The latest version of the search on the CC index provides the ability to search and get results for all the URLs from a particular domain.
In your case, you can use http://index.commoncrawl.org, select the index of your choice, and search for http://www.thesun.co.uk/*.
Hopefully you will get all the URLs for the domain and can then filter the ones you want from the JSON response.
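That lookup can also be scripted against the CDX API behind index.commoncrawl.org. A minimal sketch (the collection id CC-MAIN-2016-18 is only an example; the list of collections is at http://index.commoncrawl.org/collinfo.json):

# Sketch: query the Common Crawl CDX index API for a URL prefix and
# print the capture location of each match. A 404 response simply
# means the collection has no captures for that pattern.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

params = urlencode({
    "url": "thesun.co.uk/sol/homepage/news/*",   # prefix / wildcard search
    "output": "json",
})
endpoint = "http://index.commoncrawl.org/CC-MAIN-2016-18-index?" + params

with urlopen(endpoint) as resp:
    for line in resp:                             # one JSON record per line
        rec = json.loads(line)
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])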
AFAIK pages are crawled once and only once, so the pages you're looking for could be in any of the archives.
I wrote a small piece of software that can be used to search all archives at once (there is also a demonstration showing how to do this). In your case I searched all archives (2008 to 2019), entered your URLs in the Common Crawl editor, and found these results for your first URL (I couldn't find the second, so I guess it's not in the database?):
FileName Offset Length
------------------------------------------------------------- ---------- --------
parse-output/segment/1346876860877/1346943319237_751.arc.gz 7374762 12162
crawl-002/2009/11/21/8/1258808591287_8.arc.gz 87621562 20028
crawl-002/2010/01/07/5/1262876334932_5.arc.gz 80863242 20075
Not sure why there are three results. I guess they do re-scan some URLs.
Or, if you open any of these URLs in the application I linked, you should be able to see the pages in a browser (this is a custom scheme that includes the filename, offset and length in order to load the HTML from the Common Crawl database):
crawl://page.common/parse-output/segment/1346876860877/1346943319237_751.arc.gz?o=7374762&l=12162&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz?o=87621562&l=20028&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2010/01/07/5/1262876334932_5.arc.gz?o=80863242&l=20075&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
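If you prefer not to go through that application, the filename, offset and length are enough to pull the record yourself with an HTTP Range request. A sketch, assuming these old ARC files are still served under https://data.commoncrawl.org/ at the paths shown above (the values below are copied from the second row of the table):

# Sketch: fetch one gzipped ARC record by byte range and decompress it.
# Assumes the file is reachable under https://data.commoncrawl.org/.
import gzip
from urllib.request import Request, urlopen

filename = "crawl-002/2009/11/21/8/1258808591287_8.arc.gz"
offset, length = 87621562, 20028

req = Request("https://data.commoncrawl.org/" + filename)
req.add_header("Range", "bytes={}-{}".format(offset, offset + length - 1))

with urlopen(req) as resp:
    record = gzip.decompress(resp.read())    # ARC header + HTTP response + HTML
print(record[:500].decode("utf-8", errors="replace"))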

Are URLs with &amp; instead of & treated the same by search engines?

I'm validating one of my web pages and its throwing up errors as below :
& did not start a character reference. (& probably should have been escaped as &amp;.)
This is because on my page I am linking to internal web pages which have &'s in the URL, as below:
www.example.com/test.php?param1=1&param2=2
My question is: if I change the URLs in the a hrefs to use &amp; as below:
www.example.com/test.php?param1=1&amp;param2=2
will Google and other search engines treat the two URLs above as separate pages, or will they treat them both as the one below:
www.example.com/test.php?param1=1&param2=2
I don't want to lose my search engine rankings.
There is no reason to assume that search engines would knowingly ignore how HTML works.
Take, for example, this hyperlink:
<a href="http://example.com/test.php?param1=1&amp;param2=2">…</a>
The URL is not http://example.com/test.php?param1=1&amp;param2=2!
That is just how the URL http://example.com/test.php?param1=1&param2=2 is stored in attributes in an HTML document.
So when a conforming consumer comes across this hyperlink, it never visits http://example.com/test.php?param1=1&amp;param2=2.
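You can watch a conforming consumer do exactly this in a few lines of Python; html.parser hands back the href with the &amp; already decoded, so the &amp; form of the URL is never what gets requested:

# Demo: an HTML parser decodes &amp; in attribute values before your
# code ever sees them.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            print(dict(attrs)["href"])

LinkCollector().feed(
    '<a href="http://example.com/test.php?param1=1&amp;param2=2">link</a>'
)
# prints: http://example.com/test.php?param1=1&param2=2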

Is the format of URLs imported to IP4.1 from IP3.9 correct?

I have now successfully imported the text, pictures and pages from IP3.9 to IP4.1.
IP4.1 on localhost has truncated URLs.
For example, this URL in IP3.9
localhost/ip39/en/top/graphene/cvd-graphen/cvd-on-metals/multilayer-graphene-on-nickel-foil/
when imported to IP4.1 becomes
localhost/ip41/multilayer-graphene-on-nickel-foil
Is this normal? If IP4 changes the format of the URLs, then I think all the Google links will be lost.
Alan
Since v4.x we removed the requirement to have the language, zone or parent page in the path of a page. This means that each page (regardless of its location in the menu tree) can have any URL starting from the root. You can use a URL with slashes, too.
You have two options:
Manually change back all paths to the format you need (you can do that in the archive before import, too).
Create the required redirects so that search engines understand what happened (a rough sketch follows this list).
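For the redirect option, the mapping in the example above is mechanical (the new URL is just the last segment of the old path), so the rules can be generated. A hypothetical sketch that emits Apache-style Redirect lines from a list of old IP3.9 paths (it assumes the last-segment rule holds for all pages, which is worth verifying first):

# Hypothetical sketch: turn old nested IP3.9 paths into 301 redirects
# pointing at the new single-slug IP4.1 URLs.
old_paths = [
    "/en/top/graphene/cvd-graphen/cvd-on-metals/multilayer-graphene-on-nickel-foil/",
]

for old in old_paths:
    slug = old.rstrip("/").rsplit("/", 1)[-1]    # last path segment
    print("Redirect 301 {} /{}".format(old, slug))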

AppleScript create URL & anchor text, pass to clipboard as URL type data

Is there a way to take a text label and a URL and use AppleScript to add them to the Mac clipboard, such that rich-text apps see the data as a URL and render linked anchor text when pasting the data?
I have untitled reference URLs (generated via code) that need to have a 'screen' title (anchor text). I can create the URL and the anchor text. But how do I combine them so that the Mac clipboard treats the data as a URL? I tried:
set the clipboard to "<a href=\"" & theURI & "\">" & theAnchor & "</a>"
…but when using this data from other apps I get the HTML string, not a link with the anchor text as the visible screen text.
StandardAdditions has 'URL' and 'web page' classes but I can't see how to apply them. This compiles but fails when run:
set myURL to theURI as URL
set myLinkAnchor to theAnchor as text
set linkURL to {URL: myURL, name: myLinkAnchor} as web page
AppleScript does support putting hex-encoded HTML onto the clipboard:
set the clipboard to «data HTML3c6120687265663d22687474703a2f2f7777772e69727261646961746564736f6674776172652e636f6d22207461726765743d225f626c616e6b223e4972726164696174656420536f6674776172653c2f613e»
It's pretty roundabout, but here's how to do what you want:
set theURI to "http://www.irradiatedsoftware.com"
set theAnchor to "Irradiated Software"
set the clipboard to "<a href=\"" & theURI & "\">" & theAnchor & "</a>"
set theHEX to do shell script "pbpaste | hexdump -ve '1/1 \"%.2x\"'"
if theHEX is "" then
beep
else
run script "set the clipboard to «data HTML" & theHEX & "»"
end if
YMMV. I tried pasting this into Word 2011 and it worked fine. I couldn't get Pages to take the paste. Also, I tried pasting into a new email message (Mail.app), and the link is fine, but you can't click on it. However, the recipient can click on the link.
You can't assume that all applications accept HTML and convert it automatically into a clickable string. Your ability to do this depends on how the target application handles rich text. Some may use Cocoa's built-in rich-text area classes while others may have their own completely custom text areas, each handling URLs in its own way. You can look in the target application's dictionary to see if it allows you to create URLs in an identifiable way.
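If AppleScript isn't a hard requirement, the hexdump round-trip can be avoided by writing the HTML string to the pasteboard as HTML-typed data directly. A sketch using PyObjC (an assumption: PyObjC must be importable, either from the system Python on older macOS or via pip install pyobjc):

# Sketch: put an HTML fragment on the macOS pasteboard as HTML-typed
# data, plus a plain-text fallback, so rich-text apps paste a link.
from AppKit import NSPasteboard, NSPasteboardTypeHTML, NSPasteboardTypeString

the_uri = "http://www.irradiatedsoftware.com"
the_anchor = "Irradiated Software"
html = '<a href="{}">{}</a>'.format(the_uri, the_anchor)

pb = NSPasteboard.generalPasteboard()
pb.clearContents()
pb.setString_forType_(html, NSPasteboardTypeHTML)          # rich-text apps
pb.setString_forType_(the_anchor, NSPasteboardTypeString)  # plain-text fallback

The same caveat as above applies: whether the paste comes out as a clickable link still depends on the target application.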
