Sorry for the bad title and description, but I was wondering if there is any way I could search/list products from other sites (say Express, American Eagle) from a web app I create, even if the site doesn't have an API.
Thanks
Sure. How do you think Google and every other search engine do it? They just spider the sites and index the contents. The devil, of course, is in the details. But it's certainly possible to do.
I don't think so, unless you only want to fetch some data from a certain HTML page; for that you can use some regular expressions (or, better, an HTML parser). But searching the site's database is not possible if you can't connect to it directly or via some API.
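If you go the "fetch data from an HTML page" route, an HTML parser is usually far more robust than regular expressions. Here is a minimal sketch using the requests and BeautifulSoup libraries; the listing URL and the CSS class names are hypothetical placeholders, and you should check the target site's robots.txt and terms of use before scraping it.

```
# Minimal scraping sketch. The URL and the "product"/"product-name"/
# "product-price" class names are placeholders; inspect the real page
# to find the actual selectors.
import requests
from bs4 import BeautifulSoup

def fetch_products(listing_url):
    html = requests.get(listing_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product"):
        name = item.select_one(".product-name")
        price = item.select_one(".product-price")
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})
    return products

if __name__ == "__main__":
    for product in fetch_products("https://www.example.com/mens/shirts"):
        print(product)
```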
Which URL structure should I use for my Web-app?
Clean URLs like this
http://dashboard.company.com/sales/john-doe/2017/32
or with URL parameters?
http://dashboard.company.com/sales?person=john.doe&year=2017&week=32
Are there any guidelines for this?
Edit to explain my question better: From the user's perspective, the two ways are identical for sharing the URL. For the programming part they are not; I use Flask. I want to know if there's a standard way of handling it, and which way is better.
Background
I am developing a Sales Dashboard for internal use at my company. It displays the sales of every salesperson. I want to make the reports shareable, so that my colleagues can send each other their own page for a certain week number, or the boss can easily pull up the page for a meeting with a salesperson.
No SEO
Just to stress this point: I don't need clean URLs for SEO.
Technically it doesn't matter much: whether you pass the values as path segments or as GET parameters, they will be visible either way. But if you use a framework for your app, keep the URLs as clean as possible, because route parameters then map onto specific controller arguments rather than onto a loose bag of query data. If it's not a big project you can get away with query parameters, but make sure you won't soon end up with something like ?lang=en creeping in as a main parameter. It's up to you; read up on the differences between GET and POST parameters and you'll figure out what fits better.
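For the Flask side, both styles are easy to support, and the clean one is what Flask's routing is built around. A minimal sketch (the get_sales helper is hypothetical, standing in for whatever data access the dashboard really uses):

```
from flask import Flask, request

app = Flask(__name__)

def get_sales(person, year, week):
    # Hypothetical placeholder for the real data lookup.
    return f"Sales for {person}, {year} week {week}"

# Clean URL: /sales/john-doe/2017/32
@app.route("/sales/<person>/<int:year>/<int:week>")
def sales_clean(person, year, week):
    return get_sales(person, year, week)

# Query-string URL: /sales?person=john.doe&year=2017&week=32
@app.route("/sales")
def sales_query():
    person = request.args.get("person")
    year = request.args.get("year", type=int)
    week = request.args.get("week", type=int)
    return get_sales(person, year, week)

if __name__ == "__main__":
    app.run(debug=True)
```

Either route works for sharing links internally; the path-based one is simply easier to read aloud or paste into a meeting invite.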
I want to make a torrent search engine which will provide links to other torrent sites, so I need data from other sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do it?
Depending on the site, crawling it without permission is not legal.
You might wish to investigate Common Crawl, a project that has already crawled large portions of the web. Check their Terms of Use to confirm the legality of it all.
I've been doing some programming off and on for my brother, who is a stock trader. I'm wondering if it is possible to receive a push notification when a site adds a new page. For example, smallcapfortunes.com regularly adds pages that are simple extensions off the main URL, such as /neca/, /stev/, etc.
Are there existing methods to execute this? Or is this something I need to write myself? Has anyone here written anything like that?
I know there are existing sites to track basic updates to a single page. In my research, though, I haven't found anything like this.
Please let me know if there are any other details I need to provide.
Generally you can only get a push notification if a specific website offers that service.
Some websites publish a structured (XML) site map. If the one you're interested in does that, you could pull that sitemap on a regular basis and look for differences.
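A minimal sketch of that sitemap-diff idea, assuming the site exposes a standard sitemap.xml (the sitemap URL and the state file name are assumptions):

```
# Pull the sitemap on a schedule and report URLs not seen before.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://smallcapfortunes.com/sitemap.xml"  # assumed location
SEEN_FILE = "seen_urls.txt"

def fetch_sitemap_urls(url):
    root = ET.fromstring(requests.get(url, timeout=10).text)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return {loc.text.strip() for loc in root.findall(".//sm:loc", ns)}

def load_seen():
    try:
        with open(SEEN_FILE) as f:
            return {line.strip() for line in f}
    except FileNotFoundError:
        return set()

def main():
    current = fetch_sitemap_urls(SITEMAP_URL)
    for url in sorted(current - load_seen()):
        print("New page:", url)  # hook an email/SMS notification in here
    with open(SEEN_FILE, "w") as f:
        f.write("\n".join(sorted(current)))

if __name__ == "__main__":
    main()
```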
You're most likely going to want to use http://scrapy.org/ to go through the site and find new /neca/ and /stev/ URLs, etc., then just trigger the script every so often.
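A bare-bones Scrapy sketch of that approach; the spider just collects internal links, and an outer script can diff each run's output against the previous one to spot new pages:

```
import scrapy

class NewPageSpider(scrapy.Spider):
    name = "newpages"
    allowed_domains = ["smallcapfortunes.com"]
    start_urls = ["http://smallcapfortunes.com/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if "smallcapfortunes.com" in url:
                yield {"url": url}
                # Follow internal links to reach pages that aren't
                # linked directly from the front page.
                yield response.follow(href, callback=self.parse)
```

Run it with something like scrapy runspider newpages_spider.py -o urls.json and schedule it with cron.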
Is there a way to collect web content for use in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index; a rough sketch of the idea follows below. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
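A rough illustration of the meta-search idea, with every endpoint URL and the response layout being hypothetical (each real API has its own authentication and JSON format):

```
# Query several upstream search APIs and merge their results,
# de-duplicating by URL. Everything here is a placeholder.
import requests

UPSTREAM_APIS = [
    "https://api.search-one.example.com/search",
    "https://api.search-two.example.com/search",
]

def metasearch(query):
    merged, seen = [], set()
    for endpoint in UPSTREAM_APIS:
        resp = requests.get(endpoint, params={"q": query}, timeout=10)
        for result in resp.json().get("results", []):  # assumed layout
            if result["url"] not in seen:
                seen.add(result["url"])
                merged.append(result)
    return merged
```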
If you want to be kept up to date with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use paid services like superfeedr that make use of the same protocol.
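For reference, a PubSubHubbub/WebSub subscription request is just a form-encoded POST to the feed's hub; the hub, topic and callback URLs below are placeholders, and the callback must be a publicly reachable endpoint that echoes back the hub.challenge verification parameter:

```
import requests

HUB_URL = "https://hub.example.com/"               # the hub declared by the feed
TOPIC_URL = "http://example.com/feed.xml"          # the feed you want updates for
CALLBACK_URL = "https://yourapp.example.com/push"  # your publicly reachable endpoint

resp = requests.post(HUB_URL, data={
    "hub.mode": "subscribe",
    "hub.topic": TOPIC_URL,
    "hub.callback": CALLBACK_URL,
})
print(resp.status_code)  # 202 Accepted means the hub will verify the callback
```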
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam, a +/- system can be introduced to vote useful sites or tags up and useless ones down.
To keep spammers from mass-voting the SERPs, you can weight votes by user reputation (a small sketch of this follows below).
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
And considering other abuse patterns too.
Well, you got the point, I think.
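Just to make the reputation-weighting idea concrete, here is one arbitrary (not prescriptive) way to dampen the influence of low-reputation accounts:

```
# Illustrative reputation-weighted voting; the log dampening is just
# one possible choice.
import math

def weighted_score(votes):
    """votes: iterable of (vote, reputation) pairs, vote is +1 or -1."""
    return sum(v * math.log1p(max(rep, 0)) for v, rep in votes)

# One trusted upvote can outweigh a handful of fresh-account downvotes.
print(weighted_score([(+1, 1000), (-1, 1), (-1, 1), (-1, 1)]))
```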
As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and while the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/
I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet and the bandwidth required would be huge. The other thought would be using Google's Search API or Yahoo's Search API, but that's not really a great solution as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
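Once that archive is downloaded, each line of the CSV is rank,domain, so turning it into seed URLs takes only a few lines (the file name inside the zip is assumed to be top-1m.csv):

```
import csv
import io
import zipfile

# Read the Alexa top-1m list straight out of the downloaded zip.
with zipfile.ZipFile("top-1m.csv.zip") as zf:
    with zf.open("top-1m.csv") as f:
        reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        seed_urls = ["http://" + row[1] for row in reader]

print(len(seed_urls), seed_urls[:5])
```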
You can register to get access to the entire .com and .net zone files at Verisign.
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query "link:url" for each URL in the list.
The modern terms now are URI and URN; URL is the narrower, somewhat outdated one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic spiders, wanderers, brokers and bots, as well as RFC 3986 (Appendix B, p. 50), which defines a regular expression for parsing URIs.
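That regular expression (RFC 3986, Appendix B) is easy to use directly; a small sketch:

```
import re

# The URI-parsing regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = URI_RE.match("http://www.example.com/path/page?item=1#top")
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)
# -> http www.example.com /path/page item=1 top
```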