CommonCrawl: How to find a specific web page? - search-engine

I am using Common Crawl to recover pages I should have archived but did not.
In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a URL has been archived.
A simple script downloads all indices from the available crawls:
./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on
Afterwards I have 112 MB of data and simply grep:
grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r
The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume Common Crawl should have archived them?
Update: Thanks to Sebastian, two links are left...
The two URLs are:
http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html
They even offer a "URL Search Tool", but it answers with a 502 - Bad Gateway...

You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename to read the content in your code. See the details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
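Once an index query (Athena or the CDX API) has given you a filename, offset and length, the read step can be sketched like this in Python. The https://data.commoncrawl.org/ prefix and the requests library are assumptions on my part; older crawls may sit under different prefixes, and you can just as well read the same byte range from S3 directly.

import gzip
import requests

def fetch_record(warc_filename, offset, length):
    # Ask the server for exactly the bytes of this one record.
    url = "https://data.commoncrawl.org/" + warc_filename
    byte_range = f"bytes={offset}-{offset + length - 1}"
    resp = requests.get(url, headers={"Range": byte_range})
    resp.raise_for_status()
    # Each record is an independent gzip member, so it decompresses on its own.
    return gzip.decompress(resp.content).decode("utf-8", errors="replace")

# The returned string contains the WARC headers, the HTTP headers and the HTML body.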

The latest version of the search on the CC index provides the ability to search for and retrieve all the URLs from a particular domain.
In your case, go to http://index.commoncrawl.org, select the index of your choice, and search for http://www.thesun.co.uk/*.
You should get all the URLs from that domain and can then filter the URLs of your choice from the JSON response.
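The same lookup can be scripted against the public CDX endpoint. A minimal sketch, assuming the requests library and using CC-MAIN-2016-18 as the example collection (any other crawl name works the same way):

import json
import requests

api = "https://index.commoncrawl.org/CC-MAIN-2016-18-index"
resp = requests.get(api, params={"url": "www.thesun.co.uk/*", "output": "json"})
if resp.status_code == 404:
    print("no captures in this crawl")         # the server answers 404 when nothing matches
else:
    resp.raise_for_status()
    for line in resp.text.splitlines():        # one JSON object per line
        rec = json.loads(line)
        print(rec["url"], rec.get("filename"), rec.get("offset"), rec.get("length"))
# Large domains are paginated; add a page parameter (and showNumPages=true first) if needed.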

AFAIK pages are crawled once and only once, so the pages you're looking for could be in any of the archives.
I wrote a small tool that can be used to search all archives at once (here's also a demonstration showing how to do this). So in your case I searched all archives (2008 to 2019), typed your URLs into the Common Crawl editor, and found these results for your first URL (I couldn't find the second one, so I guess it is not in the database?):
FileName Offset Length
------------------------------------------------------------- ---------- --------
parse-output/segment/1346876860877/1346943319237_751.arc.gz 7374762 12162
crawl-002/2009/11/21/8/1258808591287_8.arc.gz 87621562 20028
crawl-002/2010/01/07/5/1262876334932_5.arc.gz 80863242 20075
Not sure why there are three results. I guess they do re-scan some URLs.
Or, if you open any of these URLs in the application I linked, you should be able to see the pages in a browser (this is a custom scheme that includes the filename, offset and length in order to load the HTML from the Common Crawl database):
crawl://page.common/parse-output/segment/1346876860877/1346943319237_751.arc.gz?o=7374762&l=12162&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz?o=87621562&l=20028&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2010/01/07/5/1262876334932_5.arc.gz?o=80863242&l=20075&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html

Related

URL decoding and understanding

Recently I started learning web scraping. For this purpose I need to focus on URLs and their basic structure. I picked two URLs, from Amazon and Priceline, for homework purposes.
Some basic concepts of a URL:
A query string comes at the end of a URL, starting with a single question mark, "?".
Parameters are provided as key-value pairs and separated by an ampersand, "&".
The key and value are separated by an equals sign, "=".
Most web frameworks allow us to define "nice looking" URLs that just include the parameters in the path of the URL.
Amazon URL
https://www.amazon.com/books-used-books-textbooks/b/?ie=UTF8&node=283155&ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
As per my understanding:
Parameters
ie=UTF8
node=283155
ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
Key    Value
----   ----------------------------------------------
ie     UTF8
node   283155
ref_   nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
Priceline URL
https://www.priceline.com/relax/in/3000005381/from/20210310/to/20210317/rooms/1?vrid=16e829a6d7ee5b5538fe51bb7e6925e8
This URL is for a hotel booking in Chicago from 03/10/2021 to 03/17/2021.
As per my understanding:
Key     Value      Meaning
-----   --------   ----------
from    20210310   2021-03-10
to      20210317   2021-03-17
rooms   1
I did not find out anything more than that. I just want to make sure: am I missing something? Can these URLs be analyzed more precisely?
Tips that may help are:
Data can be sent via GET or POST. What you are describing with URLs is GET. With POST you don't see anything in the URL.
In both cases, getting familiar with your browser's developer console will help you explore how websites work. In Chrome, you can hit F12 or right-click any element and select "Inspect element". This is especially helpful when trying to inspect data that is passed using POST, since you can't see it in the URL. Use the "Network" tab while clicking around to see what the website is doing in the background.
Lastly, just play around with websites. For example, when you browse Amazon you might notice the URLs look like https://www.amazon.com/Avalon-Organics-Creme-Radiant-Renewal/dp/B082G172GL/?_encoding=UTF8, but if you play around with it you notice you can delete the title and the URL still works: https://www.amazon.com/dp/B082G172GL
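If you want to take the breakdown above further programmatically, Python's standard library already does the splitting for you. A minimal sketch (the comments show the expected pieces):

from urllib.parse import urlparse, parse_qs

url = ("https://www.amazon.com/books-used-books-textbooks/b/"
       "?ie=UTF8&node=283155&ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230")
parts = urlparse(url)
print(parts.netloc)           # www.amazon.com
print(parts.path)             # /books-used-books-textbooks/b/
print(parse_qs(parts.query))  # {'ie': ['UTF8'], 'node': ['283155'], 'ref_': ['nav_cs_books_...']}

# "Nice looking" URLs like the Priceline one keep their values in the path,
# so you have to split parts.path on "/" yourself; parse_qs only sees the query string.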

Nutch : Filter URLs at Indexing time

I am crawling sites using Nutch and integrating it with Solr.
I am crawling all the URLs on the site, but want to index only a few of them.
Adding a URL pattern to regex-urlfilter.txt would filter the URLs at crawl time. That, however, isn't what I am looking for: I want to crawl everything, but index only a few pages.
Is there something like regex-urlfilter.txt at index time rather than at crawl time?
When running the crawl step by step:
Don't supply the filter until the dedup step. Once your URLs have been updated in the crawlDb and you are ready for indexing, add your pattern to regex-urlfilter.txt and run the index step with the -filter option:
bin/nutch index .... -filter

Are URLs with &amp; instead of & treated the same by search engines?

I'm validating one of my web pages and it's throwing up errors like the one below:
& did not start a character reference. (& probably should have been escaped as &amp;.)
This is because on my page I am linking to internal web pages which have &'s in the URL, as below:
www.example.com/test.php?param1=1&param2=2
My question is: if I change the URLs in the a hrefs to use &amp; as below:
www.example.com/test.php?param1=1&amp;param2=2
Will Google and other search engines treat the 2 URLs above as separate pages or will they treat them both as the one below:
www.example.com/test.php?param1=1&param2=2
I don't want to lose my search engine rankings.
There is no reason to assume that search engines would knowingly ignore how HTML works.
Take, for example, this hyperlink:
…
The URL is not http://example.com/test.php?param1=1&amp;param2=2!
It's just the way the URL http://example.com/test.php?param1=1&param2=2 is stored in attributes in an HTML document.
So when a conforming consumer comes across this hyperlink, it never visits http://example.com/test.php?param1=1&amp;param2=2.
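For a quick sanity check of that round trip, here is a small sketch using Python's html module: what sits in the href attribute is the escaped form, and an HTML parser turns it back into the real URL before anything is requested.

from html import escape, unescape

real_url = "http://www.example.com/test.php?param1=1&param2=2"
in_attribute = escape(real_url)              # what belongs in the HTML source
print(in_attribute)                          # ...param1=1&amp;param2=2
print(unescape(in_attribute) == real_url)    # True: the parser restores the real URL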

Custom URL format for news in Expression Engine

Our site is migrating from MovableType to ExpressionEngine, and there is one small issue we are having. MT uses a date based URL structure, e.g. www.site.com/2012/03/post-title.html, while EE uses a category based structure, e.g. www.site.com/index.php/news/comments/post-title. The issue is that our MT page used Disqus for comments, and as such comments are tied to a specific URL, meaning that we'd lose all of our comments if we were to migrate. I am wondering if there's a way to change the URL structure in EE to match MT's, thus allowing us to keep the comments. Thanks in advance.
Correction: EE uses a Template Group/Template based structure for URLs, not categories - just to clarify.
You've got a couple of options here.
One is to create an .htaccess rule which internally redirects all requests matching YYYY/MM/ to your EE template which displays your posts (say, /news/entry/). I don't know exactly what those rewrite rules would look like off the top of my head, my mod_rewrite-fu is pretty shallow. But it could definitely work.
Another is to export all of your comments from Disqus via their XML export tool, then do a grep-based find-and-replace using something like BBEdit, replacing all /YYYY/MM/ strings in that file with /news/entry/; delete all of your existing comments on Disqus; then import your newly-modified XML file.
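If a script is easier than an editor for that find-and-replace, here is a rough sketch in Python. The file names and the /news/entry/ target are only the assumptions from the answer above; you may want to anchor the pattern to your own domain so it doesn't touch dates inside comment text.

import re

with open("disqus-export.xml", encoding="utf-8") as f:
    xml = f.read()

# Rewrite date-based path prefixes like /2012/03/ to the EE template path.
xml = re.sub(r"/\d{4}/\d{2}/", "/news/entry/", xml)

with open("disqus-export-rewritten.xml", "w", encoding="utf-8") as f:
    f.write(xml)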

Efficient design of crawler4J to get data

I am trying to get data from various websites. After searching on Stack Overflow, I am using crawler4j, as many suggested it. Below is my understanding/design:
1. Get the sitemap.xml location from robots.txt.
2. If sitemap.xml is not referenced in robots.txt, look for sitemap.xml directly.
3. Now, get the list of all URLs from sitemap.xml.
4. Now, fetch the content for all of the above URLs.
5. If sitemap.xml is not available at all, then scan the entire website.
Now, can you please let me know: is crawler4j able to do steps 1, 2 and 3?
If so, can you please guide me on how to do it?
Otherwise, please suggest a better design if one is available (assuming no feeds are available).
Thanks
Venkat
Crawler4j is not able to perform steps 1, 2 and 3; however, it performs quite well for steps 4 and 5. My advice would be to use a Java HTTP client, such as the one from HttpComponents, to get the sitemap. Parse the XML using any Java XML parser, add the URLs into a collection, and then populate your crawler4j seeds with the list:
for (String url : sitemapsUrl) {
    controller.addSeed(url);   // seed the crawler with every URL found in the sitemap
}
controller.start(YourCrawler.class, nbThreads);
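For steps 1-3 themselves, the logic is small enough to sketch outside crawler4j. Here it is in Python purely to illustrate the idea (the same approach maps directly onto a Java HTTP client and XML parser); note that it does not handle nested sitemap index files.

import requests
import xml.etree.ElementTree as ET

def sitemap_urls(site):
    # Step 1: look for Sitemap: lines in robots.txt.
    robots = requests.get(site.rstrip("/") + "/robots.txt").text
    sitemaps = [line.split(":", 1)[1].strip()
                for line in robots.splitlines()
                if line.lower().startswith("sitemap:")]
    # Step 2: fall back to the default location if robots.txt names none.
    if not sitemaps:
        sitemaps = [site.rstrip("/") + "/sitemap.xml"]
    # Step 3: collect the <loc> entries from each sitemap.
    urls = []
    ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    for sm in sitemaps:
        root = ET.fromstring(requests.get(sm).content)
        urls += [loc.text for loc in root.iter(ns + "loc")]
    return urls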
I have never used crawler4j, so take my opinion with a grain of salt:
I think it can be done by the crawler, but it looks like you would have to modify some code. Specifically, take a look at RobotstxtParser.java and HostDirectives.java. You would have to modify the parser to extract the sitemap and create a new field in the directives to return the sitemap.xml. Step 3 can then be done in the fetcher if no sitemap directives were returned.
However, I'm not sure exactly what you gain by checking the sitemap: it seems to be a useless thing to do unless you're looking for something specific.
