Nutch: Filter URLs at indexing time

I am crawling sites using Nutch and integrating it with Solr.
I am crawling all the URLs on the site, but want to index only a few of them.
Adding a URL pattern to regex-urlfilter.txt would filter the URLs at crawl time, but that isn't what I am looking for. I want to crawl all the pages, but index only a few.
Is there something like regex-urlfilter.txt that applies at index time rather than at crawl time?

When running the crawl step by step:
Don't supply the filter until after the dedup step. Once your URLs have been updated in the crawlDb and you are ready for indexing, add your pattern to regex-urlfilter.txt.
Then run it as bin/nutch index .... -filter
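For example, a typical Nutch 1.x step-by-step run might look like this (paths and segment names are illustrative):
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch parse crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch dedup crawl/crawldb
Only now add the restrictive patterns to regex-urlfilter.txt and run the indexing job with the filter switched on:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/<segment> -filter
That way the crawl itself sees every URL, and the URL filters are applied only when documents are pushed to Solr.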

Related

CommonCrawl: How to find a specific web page?

I am using Common Crawl to restore pages I should have archived but have not.
In my understanding, the Common Crawl index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a URL has been archived.
A simple script downloads all indices from the available crawls:
./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on
Afterwards I have 112 MB of data and simply grep:
grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r
The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that Common Crawl should have archived them.
Update: Thanks to Sebastian, two links are left. The two URLs are:
http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html
They even proposed a "URL Search Tool" which answers with a 502 - Bad Gateway...
You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename to read the content in your code. See the details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
The latest version of the search on the Common Crawl index provides the ability to search for and retrieve all the URLs from a particular domain.
In your case, you can use http://index.commoncrawl.org and then select the index of your choice. Search for http://www.thesun.co.uk/*.
You should get all the URLs from that domain, and you can then filter the URLs of your choice from the JSON response.
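The same lookup can also be done programmatically against the CDX API that backs that page. Here is a rough sketch in Java (the index name, query parameters and response fields follow the public CDX server API as I understand it, so treat them as assumptions; for large domains the response may also be split over several pages):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CdxSearch {
    public static void main(String[] args) throws Exception {
        // Ask one crawl's CDX index for every capture under thesun.co.uk.
        String index = "CC-MAIN-2016-18-index";   // pick the crawl you want
        String pattern = URLEncoder.encode("www.thesun.co.uk/*", StandardCharsets.UTF_8);
        String endpoint = "https://index.commoncrawl.org/" + index
                + "?url=" + pattern + "&output=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The body is one JSON object per line; keep only the records of interest.
        for (String line : response.body().split("\n")) {
            if (line.contains("/50569/") || line.contains("Locals-tell-of-terror-shock")) {
                System.out.println(line);   // contains url, filename, offset, length, ...
            }
        }
    }
}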
AFAIK pages are crawled once and only once, so the pages you're looking for could be in any of the archives.
I wrote a small piece of software that can be used to search all the archives at once (there is also a demonstration showing how to do this). So in your case I searched all the archives (2008 to 2019) and typed your URLs into the Common Crawl editor, and found these results for your first URL (I couldn't find the second, so I guess it is not in the database?):
FileName Offset Length
------------------------------------------------------------- ---------- --------
parse-output/segment/1346876860877/1346943319237_751.arc.gz 7374762 12162
crawl-002/2009/11/21/8/1258808591287_8.arc.gz 87621562 20028
crawl-002/2010/01/07/5/1262876334932_5.arc.gz 80863242 20075
I'm not sure why there are three results; I guess they do re-crawl some URLs.
Or, if you open any of these URLs in the application I linked, you should be able to see the pages in a browser (this is a custom scheme that includes the filename, offset and length in order to load the HTML from the Common Crawl database):
crawl://page.common/parse-output/segment/1346876860877/1346943319237_751.arc.gz?o=7374762&l=12162&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz?o=87621562&l=20028&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2010/01/07/5/1262876334932_5.arc.gz?o=80863242&l=20075&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
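If you would rather pull the raw record yourself, the filename/offset/length triple is all you need: each record is stored as its own gzip member, so a single byte-range request can fetch and decompress exactly that slice. A minimal sketch in Java, assuming the legacy ARC files are still served under https://data.commoncrawl.org/ (that path is an assumption for the old crawl-002 data):

import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

public class FetchRecord {
    public static void main(String[] args) throws Exception {
        // Filename, offset and length as reported by the index lookup above.
        String filename = "crawl-002/2009/11/21/8/1258808591287_8.arc.gz";
        long offset = 87621562L;
        long length = 20028L;

        // Request only the bytes of this record; it is a self-contained gzip member.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://data.commoncrawl.org/" + filename))
            .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
            .GET()
            .build();

        HttpResponse<InputStream> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofInputStream());

        try (InputStream in = new GZIPInputStream(response.body())) {
            System.out.write(in.readAllBytes());   // ARC headers followed by the HTML
        }
    }
}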

Google Indexing and URLs

On my website, brand URLs like this
.../shopbybrands.aspx?auth=430&brand=AQUATALIA by MARVIN K
have no hyphen or underscore in the brand name ("AQUATALIA by MARVIN K"), but when we open the URL in a browser it looks like this:
.../shopbybrands.aspx?auth=430&brand=AQUATALIA%20by%20MARVIN%20K
The %20 appears in the URLs automatically. The issue is that if I search for the first URL in Google, I will not get any results, but if I search for the second URL I do.
Also, the XML sitemap includes the URLs in the first (unencoded) form, and I'm seeing in Google Webmaster Tools that only 35 out of 105 URLs are indexed. When I search Google with the %20 URLs, I get indexed results for them, but when I search for the normal URL, like the first one, I don't.
So, please suggest what I need to do to fix this.
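For what it's worth, the two forms differ only in the percent-encoding of the spaces, and the %20 form is what the browser (and Google) actually see. A minimal sketch of generating the encoded form in Java, e.g. for the sitemap entries (purely illustrative; the hostname is a placeholder):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BrandUrl {
    public static void main(String[] args) {
        String brand = "AQUATALIA by MARVIN K";
        // URLEncoder targets form encoding, so spaces become '+';
        // swap them for %20 to match what the address bar shows.
        String encoded = URLEncoder.encode(brand, StandardCharsets.UTF_8).replace("+", "%20");
        System.out.println("https://example.com/shopbybrands.aspx?auth=430&brand=" + encoded);
        // https://example.com/shopbybrands.aspx?auth=430&brand=AQUATALIA%20by%20MARVIN%20K
    }
}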

Add # hashtag before parameters in URL

I'm creating a dynamic website which displays many products. There are also filters like price (from-to), year (from-to), etc. I need to put a # symbol before the filter parameters in the URL because of Googlebot indexing, but I have no idea how to do it and have found no documentation on the internet.
I think it could be done with an AJAX script, but I don't know where to start.
The question is:
How do I insert a # (hash) symbol before the parameters in a URL?
I've got this:
http://domain.com/pd/?rps=100&a=2001
and I need to make it look like
http://domain.com/pd/#rps=100&a=2001
Why do you want to replace the "?" with "#"? For a well-optimised, SEO-friendly URL, you can leave them unchanged. You can also use Google Webmaster Tools to declare your URL parameters. Here is another resource for optimising your URLs: Faceted navigation

Efficient design of crawler4j to get data

I am trying to get data from various websites. After searching on Stack Overflow, I am using crawler4j, as many suggested it. Below is my understanding/design:
1. Get sitemap.xml from robots.txt.
2. If sitemap.xml is not available in robots.txt, look for sitemap.xml directly.
3. Now, get the list of all URLs from sitemap.xml.
4. Now, fetch the content for all of the above URLs.
5. If sitemap.xml is also not available, then scan entire website.
Now, can you please let me know: is crawler4j able to do steps 1, 2 and 3?
Please also suggest if a better design is available (assuming no feeds are available).
If so, can you please guide me on how to do it?
Thanks
Venkat
crawler4j is not able to perform steps 1, 2 and 3; however, it performs quite well for steps 4 and 5. My advice would be to use a Java HTTP client, such as the one from HttpComponents,
to get the sitemap. Parse the XML using any Java XML parser, add the URLs to a collection, and then populate your crawler4j seeds with the list:
for (String url : sitemapUrls) {
    controller.addSeed(url);
}
controller.start(YourCrawler.class, nbThreads);
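Here is a rough sketch of the first part of that advice (Java 11+; the sitemap URL and class name are illustrative, and it ignores gzipped sitemaps and sitemap index files):

import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapSeeds {
    // Steps 1-3: download sitemap.xml and collect every <loc> entry as a seed URL.
    static List<String> sitemapUrls(String sitemapUrl) throws Exception {
        HttpResponse<byte[]> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(sitemapUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(response.body()));

        List<String> urls = new ArrayList<>();
        NodeList locs = doc.getElementsByTagName("loc");
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}

The returned list can then be fed into controller.addSeed(...) exactly as in the loop above; if the site has no sitemap at all, fall back to seeding the home page and letting crawler4j discover the links itself (step 5).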
I have never used crawler4j, so take my opinion with a grain of salt:
I think that it can be done by the crawler, but it looks like you have to modify some code. Specifically, you can take a look at RobotstxtParser.java and HostDirectives.java. You would have to modify the parser to extract the sitemap and create a new field in the directives to return the sitemap.xml location. Step 3 can be done in the fetcher if no sitemap directive was returned from robots.txt.
However, I'm not sure exactly what you gain by checking the sitemap: it seems to be a useless thing to do unless you're looking for something specific.

ColdFusion - What's the best URL naming convention to use?

I am using ColdFusion 9.
I am creating a brand new site that uses three templates. The first template is the home page, where users are prompted to select a brand or a specific model. The second template is where the user can view all of the models of the selected brand. The third template shows all of the specific information on a specific model.
A long time ago... I would make the URLs like this:
.com/Index.cfm // home page
.com/Brands.cfm?BrandID=123 // specific brand page
.com/Models.cfm?ModelID=123 // specific model page
Now, for SEO purposes and for easy reading, I might want my URLs to look like this:
.com/? // home page
.com/?Brand=Worthington
.com/?Brand=Worthington&Model=TX193A
Or, I might want my URLs to look like this:
.com/? // home
.com/?Worthington // specific brand
.com/?Worthington/TX193A // specific model
My question is: are there really any SEO, readability, or security benefits to either naming convention?
Is there a best URL naming convention to use?
Is there a real benefit to having a URL like this?
http://stackoverflow.com/questions/7113295/sql-should-i-use-a-junction-table-or-not
Use URLs that make sense for your users. If you use sensible URLs which humans understand, it'll work with search engines too.
i.e. Don't do SEO, do HO. Human Optimisation. Optimise your pages for the users of your page and in doing so you'll make Google (and others) happy.
Do NOT stuff keywords into URLs unless it helps the people your site is for.
To decide what your URL should look like, you need to understand what the parts of a URL are for.
So, given this URL: http://domain.com/whatever/you/like/here?q=search_terms#page-fragment
It breaks down like this:
http - the protocol used to deliver the page
: - divides the protocol from the rest of the URL
//domain.com - indicates which server to load from
/whatever/you/like/here - everything between the domain and the ? should identify which page to load
? - divides the query string from the rest of the URL
q=search_terms - everything between the ? and the # can be used for a dynamic search query or setting
# - divides the page fragment from the rest of the URL
page-fragment - everything between the # and the end of the line indicates which part of the page to focus on
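As a quick illustration of the same breakdown (a minimal sketch in Java; the URL is the hypothetical one above):

import java.net.URI;

public class UrlParts {
    public static void main(String[] args) {
        URI uri = URI.create("http://domain.com/whatever/you/like/here?q=search_terms#page-fragment");
        System.out.println(uri.getScheme());    // http - the protocol
        System.out.println(uri.getHost());      // domain.com - which server
        System.out.println(uri.getPath());      // /whatever/you/like/here - which page
        System.out.println(uri.getQuery());     // q=search_terms - dynamic query or setting
        System.out.println(uri.getFragment());  // page-fragment - which part of the page
    }
}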
If your system setup lets you, a system like this is probably the most human friendly:
domain.com
domain.com/Worthington
domain.com/Worthington/TX193A
However, sometimes a unique ID is needed to ensure there is no ambiguity (on SO, there might be multiple questions with the same title, which is why the ID is included, whilst the question title is included because it's easier for humans that way).
Since all models must belong to a brand, you don't need both ID numbers though, so you can use something like this:
domain.com
domain.com/123/Worthington
domain.com/456/Worthington/TX193A
(where 123 is the brand number, and 456 is the model number)
You only need extra things (like /questions/ or /index.cfm or /brand.cfm or whatever) if you are unable to disambiguate different pages without them.
Remember: this part of the URL identifies the page - it needs to be possible to identify a single page with a single URL - to put it another way, every page should have a unique URL, and every unique URL should be a different page. (Excluding the query string and page fragment parts.)
Again, using the SO example - there are more than just questions here, there are users and tags and so on too, so they couldn't just do stackoverflow.com/7275745/question-title because it's not clearly distinct from stackoverflow.com/651924/evik-james - which they solve by inserting /questions and /users into each of those to make it obvious what each one is.
Ultimately, the best URL system to use depends on what pages your site has and who the people using your site are - you need to consider these and come up with a suitable solution. Simpler URLs are better, but too much simplicity may cause confusion.
Hopefully this all makes sense?
Here is an answer based on what I know about SEO and what we have implemented:
The first thing that gets searched and considered is your domain name, so picking something related to your domain is very important.
A URL with a query string has lower priority than one without. The reason is that a query string is associated with dynamic content that could change over time. Search engines might also deprioritise URLs with query strings, fearing that they could be used for spam, which would dilute the SEO results.
As for using a URL such as
http://stackoverflow.com/questions/7113295/sql-should-i-use-a-junction-table-or-not
since the search engine looks at both the domain and the path, having the question in the path will help the search engine and elevate the page as more relevant when someone types part of the question into a search.
I am not an SEO expert, but the company I work for has a dedicated department managing the SEO of our site. They much prefer the params to be in the URI path, rather than in the query string, and I'm sure they prefer this for a reason (not simply to make the web team's job slightly trickier... although there could be an element of that ;-)
That said, the bulk of what they concern themselves with is the content within and composition of the page. The domain name and URL are insignificant compared to having good, relevant content in a well defined structure.
