I am trying to get data from various websites. After searching on Stack Overflow, I am using crawler4j, as many suggested it. Below is my understanding/design:
1. Get sitemap.xml from robots.txt.
2. If sitemap.xml is not available in robots.txt, look for sitemap.xml directly.
3. Now, get the list of all URLs from sitemap.xml.
4. Now, fetch the content for all of the above URLs.
5. If sitemap.xml is also not available, then scan entire website.
Now, can you please let me know whether crawler4j is able to do steps 1, 2 and 3?
Please suggest a better design if one is available (assuming no feeds are available).
If so, can you please guide me on how to do it?
Thanks
Venkat
Crawler4j is not able to perform steps 1, 2 and 3, but it performs quite well for steps 4 and 5. My advice would be to use a Java HTTP client, such as the one from Apache HttpComponents, to get the sitemap. Parse the XML using any Java XML parser and add the URLs into a collection. Then populate your crawler4j seeds with the list:
    for (String url : sitemapsUrl) {
        controller.addSeed(url);
    }
    controller.start(YourCrawler.class, nbThreads);
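For illustration, here's a rough sketch of fetching and parsing the sitemap that way, using Apache HttpClient and the JDK's DOM parser. The class name and the hard-coded /sitemap.xml location are my own assumptions, not crawler4j API:

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class SitemapSeeder {
        // Fetch http://<host>/sitemap.xml and collect every <loc> entry.
        public static List<String> fetchSitemapUrls(String host) throws Exception {
            List<String> urls = new ArrayList<>();
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet get = new HttpGet("http://" + host + "/sitemap.xml");
                try (InputStream body = client.execute(get).getEntity().getContent()) {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(body);
                    NodeList locs = doc.getElementsByTagName("loc");
                    for (int i = 0; i < locs.getLength(); i++) {
                        urls.add(locs.item(i).getTextContent().trim());
                    }
                }
            }
            return urls;
        }
    }

Each returned URL then goes into controller.addSeed(url) as shown above; if the fetch fails, you fall back to seeding the host's root page and letting crawler4j scan the whole site (your step 5).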
I have never used crawler4j, so take my opinion with a grain of salt:
I think that it can be done by the crawler, but it looks like you have to modify some code. Specifically, you can take a look at RobotstxtParser.java and HostDirectives.java. You would have to modify the parser to extract the sitemap and create a new field in the directives to return sitemap.xml. Step 3 can be done in the fetcher if no directives were returned for the sitemap.
However, I'm not sure exactly what you gain by checking sitemap.xml: it seems to be a useless thing to do unless you're looking for something specific.
I have a field in an RSS item that includes a URL such as:
https://www.facebook.com/9999249845065110
https://www.yelp.com/biz/bix-berkeley-2?hrid=TaFUhHhVrhEJdCPjaB6RUQ
https://www.google.com/search?q=hello%20Signs%20&%20Graphics&ludocid=1720220414695611454#lrd=0x0:0x17df735a614e9c3e,1
I'm trying to set up a Zap in Zapier using the Formatter tool to essentially extract the root domain without the .com. So:
facebook
yelp
google
I have no clue how to use the Formatter Extract Pattern tool, though; I can't figure out the syntax.
Best case scenario, it can look at any URL and extract the name of the site (e.g. facebook/google/yelp). If that's too complicated, then I could provide a finite list of what terms to look for and have it return the first (and only) one found. So it would check if the URL contained facebook or google or yelp and if so return that name as a value.
Any help would be appreciated. Thanks.
David here, from the Zapier Platform team.
This is totally possible. The input is the text you want to search (the full URL) and the pattern is your regular expression.
In your case, you want to find the word between www. and .com. Use the regular expression www\.(\w+)\.com.
That worked for me, and pulled out yelp.
You can see each part of the regex explained here: https://regex101.com/r/KmwMAV/1
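If you want to sanity-check the pattern outside Zapier, here's a minimal sketch using Java's regex engine (the class name is mine, purely for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DomainExtractor {
        public static void main(String[] args) {
            // Same pattern as in the Zap: capture the word between "www." and ".com".
            Pattern p = Pattern.compile("www\\.(\\w+)\\.com");
            Matcher m = p.matcher("https://www.yelp.com/biz/bix-berkeley-2?hrid=TaFUhHhVrhEJdCPjaB6RUQ");
            if (m.find()) {
                System.out.println(m.group(1)); // prints "yelp"
            }
        }
    }

Note that the expression will miss URLs without a www. prefix or with a different TLD, so if your list grows beyond .com sites you'd need to loosen the pattern.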
Let me know if you've got any other questions!
I am using Common Crawl to restore pages I should have archived but have not.
In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl. Thus, it should tell me whether a URL has been archived.
A simple script downloads all indices from the available crawls:
./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on
Afterwards I have 112 MB of data and simply grep:
grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r
The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that Common Crawl should have archived them.
Update: Thanks to Sebastian, two links are left...
Two URLs are:
http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html
http://www.thesun.co.uk/sol/homepage/news/54032/Sir-Ians-raid-apology.html
They even proposed a "URL Search Tool", which answers with a 502 Bad Gateway...
You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename to read the content in your code. See details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
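To illustrate the second half of that, here is a rough sketch of reading one record: once an Athena query over the columnar index gives you warc_filename, warc_record_offset and warc_record_length (column names per the linked post), an HTTP Range request can fetch just that record. The helper class name and the data.commoncrawl.org endpoint are my assumptions, so double-check them against the current docs:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class WarcRecordFetcher {
        // The three arguments come from an Athena query such as:
        //   SELECT warc_filename, warc_record_offset, warc_record_length
        //   FROM ccindex WHERE url = 'http://...' AND crawl = 'CC-MAIN-2016-18'
        public static String fetchRecord(String filename, long offset, long length) throws Exception {
            URL url = new URL("https://data.commoncrawl.org/" + filename);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Each WARC record is individually gzipped, so we can request only its bytes.
            conn.setRequestProperty("Range", "bytes=" + offset + "-" + (offset + length - 1));
            try (InputStream in = new GZIPInputStream(conn.getInputStream())) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toString("UTF-8");
            }
        }
    }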
The latest version of the search on the CC index provides the ability to search and get results for all the URLs from a particular domain.
In your case, you can use http://index.commoncrawl.org and then select the index of your choice. Search for http://www.thesun.co.uk/*.
Hopefully you get all the URLs from the domain, and you can then filter the URLs of your choice out of the JSON response.
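The same search can also be scripted against the CDX endpoint, which returns one JSON record per line. A minimal sketch (CC-MAIN-2016-18 is just one example index; substitute whichever crawl you want):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class CdxSearch {
        public static void main(String[] args) throws Exception {
            String query = URLEncoder.encode("www.thesun.co.uk/*", "UTF-8");
            URL url = new URL("http://index.commoncrawl.org/CC-MAIN-2016-18-index?url="
                    + query + "&output=json");
            // Each line is a JSON object with fields like url, filename, offset and length.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }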
AFAIK pages are crawled once and only once, so the pages you're looking for could be in any of the archives.
I wrote a small piece of software that can be used to search all archives at once (here's also a demonstration showing how to do this). So in your case I searched all archives (2008 to 2019), typed your URLs into the Common Crawl editor, and found these results for your first URL (I couldn't find the second, so I guess it's not in the database?):
FileName Offset Length
------------------------------------------------------------- ---------- --------
parse-output/segment/1346876860877/1346943319237_751.arc.gz 7374762 12162
crawl-002/2009/11/21/8/1258808591287_8.arc.gz 87621562 20028
crawl-002/2010/01/07/5/1262876334932_5.arc.gz 80863242 20075
Not sure why there are three results; I guess they do re-scan some URLs.
Or, if you open any of these URLs in the application I linked, you should be able to see the pages in a browser (this is a custom scheme that includes the filename, offset and length in order to load the HTML from the Common Crawl database):
crawl://page.common/parse-output/segment/1346876860877/1346943319237_751.arc.gz?o=7374762&l=12162&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz?o=87621562&l=20028&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2010/01/07/5/1262876334932_5.arc.gz?o=80863242&l=20075&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
Our site is migrating from MovableType to ExpressionEngine, and there is one small issue we are having. MT uses a date-based URL structure, e.g. www.site.com/2012/03/post-title.html, while EE uses a category-based structure, e.g. www.site.com/index.php/news/comments/post-title. The issue is that our MT page used Disqus for comments, and as such comments are tied to a specific URL, meaning that we'd lose all of our comments if we were to migrate. I am wondering if there's a way to change the URL structure in EE to match MT's, thus allowing us to keep the comments. Thanks in advance.
Correction: EE uses a Template Group/Template-based structure for URLs, not categories - just to clarify.
You've got a couple of options here.
One is to create an .htaccess rule which internally redirects all requests matching YYYY/MM/ to your EE template which displays your posts (say, /news/entry/). I don't know exactly what those rewrite rules would look like off the top of my head; my mod_rewrite-fu is pretty shallow. But it could definitely work.
Another is to export all of your comments from Disqus via their XML export tool, then do a grep-based find and replace using something like BBEdit, replacing all /YYYY/MM/ strings in that file with /news/entry/; delete all of your existing comments on Disqus; then import your newly-modified XML file. A scripted version of that rewrite is sketched below.
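For instance, a minimal sketch of that find and replace in Java, assuming the MT paths look like /YYYY/MM/post-title.html and that /news/entry/ is your EE path (the file names are mine, purely for illustration):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DisqusUrlRewriter {
        public static void main(String[] args) throws Exception {
            String xml = new String(Files.readAllBytes(Paths.get("disqus-export.xml")),
                    StandardCharsets.UTF_8);
            // Turn /2012/03/post-title.html into /news/entry/post-title, keeping the slug.
            String rewritten = xml.replaceAll("/\\d{4}/\\d{2}/([^/\"<]+)\\.html", "/news/entry/$1");
            Files.write(Paths.get("disqus-export-rewritten.xml"),
                    rewritten.getBytes(StandardCharsets.UTF_8));
        }
    }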
I have a news section where the pages resolve to URLs like
newsArticle.php?id=210
What I would like to do is use the title from the database to create SEO-friendly URLs like
newsArticle/joe-goes-to-town
Any ideas how I can achieve this?
Thanks,
R.
I suggest you actually include the ID in the URL, before the title part, and ignore the title itself when routing. So your URL might become
/news/210/joe-goes-to-town
That's exactly what Stack Overflow does, and it works well. It means that the title can change without links breaking.
Obviously the exact details will depend on what platform you're using - you haven't specified - but the basic steps will be:
When generating a link, take the article title and convert it into something URL-friendly; you probably want to remove all punctuation, and you should consider accented characters etc. Bear in mind that the title won't need to be unique, because you've got the ID as well.
When handling a request to anything starting with /news, take the next part of the path, parse it as an integer and load the appropriate article. A sketch of both steps follows.
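Here is a minimal sketch of both steps in Java (the class and method names are my own; loading the article itself depends on your platform):

    import java.text.Normalizer;
    import java.util.Locale;

    public class ArticleUrls {
        // Step 1: turn "Joe goes to town!" into "joe-goes-to-town".
        public static String slugify(String title) {
            return Normalizer.normalize(title, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "")        // strip accents
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("[^a-z0-9]+", "-")   // collapse punctuation and spaces
                    .replaceAll("(^-|-$)", "");      // trim leading/trailing hyphens
        }

        // Step 2: pull the ID out of a path like "/news/210/joe-goes-to-town".
        public static int parseArticleId(String path) {
            String[] parts = path.split("/");        // ["", "news", "210", "joe-goes-to-town"]
            return Integer.parseInt(parts[2]);       // the slug itself is ignored
        }
    }

The slug is purely cosmetic: /news/210/anything-at-all would still resolve to article 210, which is exactly why retitled articles don't break old links.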
Assuming you are using PHP and can alter your source code (this is pretty much mandatory, since you need the article's title), I'd do the following:
First, you'll need to have a function (or maybe a method in an object-oriented architecture) to generate the URLs for you in your code. You'd supply the function with the article object or the article ID and it returns the friendly URL with the ID and the friendly title.
Basically function url(Article $article) => URL.
You will also need some URL rewriting rules to remove the PHP script from the URL. For Apache, refer to the mod_rewrite documentation for details (RewriteEngine, RewriteRule, RewriteCond).
Suppose you are working on an API, and you want nice URLs. For example, you want to provide the ability to query articles based on author, perhaps with sorting.
Standard:
GET http://example.com/articles.php?author=5&sort=desc
I imagine a RESTful way of doing this might be:
GET http://example.com/articles/all/author/5/sort/desc
Am I correct? Or have I got this REST thing all wrong?
I'm afraid your question really misses the point of REST. From a purely theoretical perspective there is absolutely no advantage or disadvantage to either of those URLs from a REST perspective. In practice, those URLs may behave differently with different caches, and certainly server frameworks are going to parse them differently. Despite what you hear from framework developers, there is no such thing as a RESTful URL.
From the perspective of REST, those two URLs are simply identifiers that can be dereferenced. If you want to start building REST APIs that will benefit from the characteristics described in the dissertation, you need to start thinking in terms of the content that is returned when you dereference the URL and how that content is linked together using URLs embedded in the content.
I realize this does not help you much in trying to resolve what you consider to be your problem. What I can tell you is that one of the major intents of REST is to allow your URLs to be completely under the control of the server, able to change without impacting your client applications. Therefore, my recommendation is to pick whatever URL structure works most easily with the framework you are using to serve the resource representations. Certainly do not look to the REST dissertation to tell you the right and wrong way of formatting your URLs, and anyone who tells you that your URLs are not RESTful is confused. Probably what they are telling you is that the server framework they are used to using for creating RESTful interfaces requires URLs to be structured this way.
It's not what your URI looks like that matters, it is what you do with it that matters.
Using a query string is not more or less RESTful than using path components. The URI Generic Syntax (RFC 3986, January 2005) defines that they're just as important in identifying the resource. So yes, as others point out, it's not important to REST. (Note that in the obsoleted-by-RFC-3986 RFC 2396, the query string was not defined to be identifying the resource, but rather a string of information to be interpreted by the resource.)
However, URI design is important, because as an owner of a URI namespace (i.e. the holder of the domain name where the URIs will live) you want the URIs to be long lived. As wise men have stated earlier: Cool URIs don't change!
The choice of using query strings vs path components depends on how your resources are identified, and how they will be identified in years to come. If there's a hierarchy that stands out, then it might be that this should be reflected in the URI, at least if that hierarchy is relatively permanent, and that things don't move around all the time.
It's also important to note that the actual URIs are only meaningful to two parties:
Servers, who need to forge and parse URIs
Human beings, who might see a URI in passing and learn things from it.
By contrast, client applications are usually not allowed to do URI introspection. So your choice of query strings vs path components boils down to what you think you can live with ten (or 100) years from now.
You are mostly right. The thing with REST APIs is to focus on the nouns.
What does the noun "all" do in this case? Wouldn't you expect your API to always return all articles, unless you filter it?
I would make sort a query string parameter; further, I would make any and all filtering query string parameters. If you look at how Stack Overflow is implemented, when you click on the "Newest" questions link you get a query string that filters the questions.
So perhaps something like:
GET http://example.com/articles/authors/5?sort=desc
But also think about what happens with each URL:
GET http://example.com/articles/ might return all current articles
GET http://example.com/articles/authors/ What does this URL do? Does it return all authors of all articles, or all the articles for all authors (which is essentially the same functionality as the URL above)?
GET http://example.com/articles/authors/5/ might return all articles by author 5, or does it return author 5's information?
I would maybe change it to:
http://example.com/articles returns all articles
http://example.com/articles/5 returns all articles from author 5
http://example.com/authors returns all authors
http://example.com/authors/5 returns information for author 5
Alan is mostly right but his URLs are misleading. I believe the correct routes / urls should reflect the following behavior:
[GET] http://domain.com/articles #=> returns all articles (index action)
[GET] http://domain.com/articles/5 #=> returns article ID 5 (show action)
[GET] http://domain.com/authors #=> returns all authors (index action)
[GET] http://domain.com/authors/5 #=> returns author ID 5 (show action)
[GET] http://domain.com/authors/5/articles OR http://domain.com/articles/authors/5 #=> depending on the hierarchy of your routes (both belong to the index action)
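If it helps to see those routes in code, here's a sketch using JAX-RS annotations (my choice of framework; handlers return placeholder strings for brevity):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.QueryParam;

    @Path("/")
    public class CatalogResource {

        @GET
        @Path("articles")
        public String allArticles(@QueryParam("sort") String sort) {
            return "all articles, sorted " + sort; // index action
        }

        @GET
        @Path("articles/{id}")
        public String article(@PathParam("id") int id) {
            return "article " + id; // show action
        }

        @GET
        @Path("authors")
        public String allAuthors() {
            return "all authors"; // index action
        }

        @GET
        @Path("authors/{id}")
        public String author(@PathParam("id") int id) {
            return "author " + id; // show action
        }

        @GET
        @Path("authors/{id}/articles")
        public String articlesByAuthor(@PathParam("id") int id) {
            return "articles by author " + id; // nested index action
        }
    }

The nested authors/{id}/articles route expresses the hierarchy directly, while sort stays a query string parameter, as Alan suggested.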
Best regards,
DBA