We are building a vertical search engine that will search within the computing domain, so we want all Wikipedia URLs that belong to the computer category. Is there any such database available? If not, how can we fetch all URLs from Wikipedia belonging to the Computer category? We need only the URLs, not the complete web pages.
Is there any such database available?
You can try http://dbpedia.org.
how can we fetch all URLs from Wikipedia belonging to Computer category?
Check the categorymembers API. You will, however, need to recursively traverse the subcategories and filter out a lot of pages manually.
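For example, here is a minimal sketch in Python of that approach, assuming "Category:Computing" is the root category you care about (adjust as needed) and using the requests library for the HTTP calls. Expect to prune irrelevant branches by hand, since the category graph is broad and not a strict tree.

# Minimal sketch: walk the MediaWiki categorymembers API recursively and
# print article URLs. Assumes "Category:Computing" as the root category.
import urllib.parse
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, visited=None, depth=0, max_depth=3):
    """Yield article URLs under `category`, recursing into subcategories."""
    if visited is None:
        visited = set()
    if category in visited or depth > max_depth:
        return
    visited.add(category)

    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            title = member["title"]
            if member["ns"] == 14:          # namespace 14 = subcategory
                yield from category_members(title, visited, depth + 1, max_depth)
            else:                           # ordinary article page
                yield "https://en.wikipedia.org/wiki/" + urllib.parse.quote(
                    title.replace(" ", "_"))
        if "continue" not in data:          # no further result pages
            break
        params.update(data["continue"])     # carry the cmcontinue token forward

for url in category_members("Category:Computing"):
    print(url)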
I'm building a large news site that will have tens of thousands of articles; so far we have over 20,000. We plan on having a main menu containing category links, which will display articles matching those categories. Therefore, clicking "baking" will show all articles related to "baking", and "baking/cakes" will show everything related to cakes.
Right now, we're weighing whether or not to use hierarchical URLs for each article. If I'm on the "baking/cakes" page, and I click an article that says "Chocolate Raspberry Cake", would it be best to put that article at a specific, hierarchical URL like this:
website.com/baking/cakes/chocolate-raspberry-cake
or a generic, flat one like this:
website.com/articles/chocolate-raspberry-cake
What are the pros and cons of doing each? I can think of cases for each approach, but I'm wondering what you think.
Thanks!
It really depends on the structure of your site. There's no one correct answer for every site.
That being said, here's my recommendation for a news site: instead of embedding the category in the URL, embed the date. For example: website.com/article/2016/11/18/chocolate-raspberry-cake or even website.com/2016/11/18/chocolate-raspberry-cake. This allows you to write about Chocolate Raspberry Cake more than once, as long as you don't do it on the same day. When I'm browsing news I find it helpful to identify the date an article was written as quickly as possible; embedding it in the URL is very helpful.
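For illustration, here is a tiny sketch of how such a dated path could be generated. It is not tied to any particular CMS, and the slug rule is deliberately simplistic (ASCII-only):

# Sketch: build a dated, slugged article path from a title and publish date.
import re
from datetime import date

def article_path(title, published):
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"/article/{published:%Y/%m/%d}/{slug}"

print(article_path("Chocolate Raspberry Cake", date(2016, 11, 18)))
# -> /article/2016/11/18/chocolate-raspberry-cake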
Hierarchical URLs based on categories lock you into a single category for each article, which may be too limiting. There may be articles which fit multiple categories. If you've set up your site to require each article to have a single primary category, then this may not be an issue for you.
Hierarchical URLs based on categories can also be problematic if any of the categories ever change. For example, in the case of typos, changes to pluralization, a new term coming into vogue and replacing an existing term, or even just a change in wording (e.g. "baking" could become "baked goods"). The terms as they existed when you created the article will be forever immortalized in your URL structure, unless you retroactively change them all (invalidating old links, so make sure to use Drupal's Redirect module).
If embedding the date in the URL is not an option, then my second choice would be the flat URL structure because it will give you URLs which are shorter and easier to remember. I would recommend using "article" instead of "articles" in the URL because it saves you a character.
How can I identify ad links on a website? I am doing research on malvertising. As part of that, I need to extract all the advertisement URLs from a website. How can I do that?
(Of course, it’s impossible to correctly identify all ad URLs.)
You could make use of the filter lists of various ad filtering tools. They typically contain absolute URLs (submitted by the community) and strings that often appear in such URLs.
For example, Adblock Plus hosts some filter lists.
Examples from EasyList (a big text file):
&adbannerid=
.com/js/adsense
/2013/ads/*
/60x468.
/ad-rotator-
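As a rough illustration, here is a sketch of how you could flag likely ad URLs using entries like these. It is not a full Adblock Plus parser: it ignores anchors (||, ^), option suffixes and element-hiding rules, and simply treats '*' as a wildcard and everything else as a substring.

# Sketch: first-pass ad-URL detection with EasyList-style patterns.
import fnmatch

patterns = [
    "&adbannerid=",
    ".com/js/adsense",
    "/2013/ads/*",
    "/60x468.",
    "/ad-rotator-",
]

def looks_like_ad(url):
    for p in patterns:
        if "*" in p:
            # wildcard entry: match it anywhere inside the URL
            if fnmatch.fnmatch(url, "*" + p + "*"):
                return True
        elif p in url:
            return True
    return False

print(looks_like_ad("http://example.com/js/adsense/show_ads.js"))  # True
print(looks_like_ad("http://example.com/recipes/cake"))            # False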
I enter this reference into the Google search field:
Nature 2008 May 8;453(7192):164-6
I expect at least one link to be from the nature.com website; after all, it says "Nature" in the query. All results are from NCBI, which only collects abstracts. Is that so hard? References in journals are usually given in this or a similar format. How come it isn't recognized as such?
Please redirect the question to the appropriate Stack Exchange site if necessary.
When Google ranks query results, it takes website popularity into consideration. The paper is on both NCBI and nature.com, and because the NCBI website is ranked 361st for internet traffic on alexa.com while nature.com is ranked 2,984th, your Google results come out that way.
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. Another thought was to use Google's or Yahoo's Search API, but that's not really a great solution because it requires a search to be performed before I get results.
Other thoughts include querying DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of querying DNS servers is quite limited at the moment, so I don't know whether this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
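Once downloaded, the list can be turned into URLs with a few lines. A sketch, assuming the archive contains top-1m.csv and each row is "rank,domain", as the file was distributed:

# Sketch: read the Alexa top-1m archive and print one URL per domain.
import csv
import io
import zipfile

with zipfile.ZipFile("top-1m.csv.zip") as zf:
    with zf.open("top-1m.csv") as f:
        reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        for rank, domain in reader:
            print("http://" + domain)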
You can register for access to the entire .com and .net zone files at Verisign.
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query "link:url" for each URL in the list.
The modern terms are URI and URN; "URL" is the narrower, older one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers and bots, as well as RFC 3986 (Appendix B, p. 50), which defines a regular expression for parsing URIs.
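A minimal sketch of the sitemap idea, assuming the standard sitemaps.org schema; sitemap index files and gzipped sitemaps would need extra handling, and the URL below is only a placeholder:

# Sketch: fetch a sitemap and extract the addresses from its <loc> elements.
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    root = ET.fromstring(requests.get(sitemap_url).content)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

for url in sitemap_urls("https://example.com/sitemap.xml"):
    print(url)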
Our application (C#/.NET) needs to run a lot of search queries, and Google's limit of 50,000 queries per day is not enough. We need something that will crawl Internet websites according to specific rules we set (for example, country domains) and gather URLs, text, keywords, and website names to build our own internal catalogue, so we aren't limited by a massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
No point in reinventing the wheel.
DataparkSearch might be the one you need. Or review this list of other Open Source Search Engines.