Want to make a Search Engine

I want to make a torrent search engine that will provide links to other torrent sites, so I need data from other sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do that?

Depending on the site, it is not legal without permission.
You might wish to investigate Common Crawl, a project that regularly crawls a large portion of the web and publishes the data. Check its Terms of Use to confirm the legality of what you plan to do.
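If you go the Common Crawl route, a minimal sketch of querying its CDX index for captures of a domain might look like this; the crawl ID is just an example (pick a current one from index.commoncrawl.org), and the requests library is assumed to be installed.

```python
import requests

# The crawl ID below is an assumption; pick a current one from https://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2023-50"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def list_captures(domain, limit=10):
    """Return up to `limit` index records for pages under the given domain."""
    params = {"url": f"{domain}/*", "output": "json", "limit": limit}
    resp = requests.get(INDEX_URL, params=params, timeout=30)
    resp.raise_for_status()
    # The index returns one JSON object per line.
    return [line for line in resp.text.splitlines() if line]

if __name__ == "__main__":
    for record in list_captures("example.com"):
        print(record)
```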

Related

Can malicious code, virus, etc be loaded onto a site that accepts embed links?

I operate a CMS site (a YouTube-like video server) and it permits users to embed links to videos elsewhere on the web, e.g. www.vimeo.com/videos/sjek3469df
Is there any way someone could input a URL "link" that could infect my website?
Thanks in advance all!
It really depends on how your site is set up, but yes, there would be XSS concerns. At the very least, I'd suggest a whitelist for allowed video hosts (with particular URL patterns, not just acceptable domains). You should also consider parsing the URLs to obtain the video IDs, and using those to generate your own embedding code on a per-host basis. That would give you more customization power, not just more security.
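As a rough illustration of that whitelist-plus-ID-extraction approach, here is a minimal sketch in Python; the hosts, URL patterns, and embed templates are assumptions for illustration and would need to be checked against each provider's actual formats.

```python
import re

# Hypothetical whitelist: map each allowed host to a URL pattern that captures the video ID
# and a template for generating our own embed URL. Patterns/templates are illustrative only.
ALLOWED_HOSTS = {
    "vimeo.com": (
        re.compile(r"^https?://(?:www\.)?vimeo\.com/(\d+)$"),
        "https://player.vimeo.com/video/{id}",
    ),
    "youtube.com": (
        re.compile(r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]{11})$"),
        "https://www.youtube.com/embed/{id}",
    ),
}

def build_embed_url(user_url: str) -> str | None:
    """Return a site-generated embed URL, or None if the link is not whitelisted."""
    for pattern, template in ALLOWED_HOSTS.values():
        match = pattern.match(user_url.strip())
        if match:
            return template.format(id=match.group(1))
    return None  # reject anything that does not match a known host/pattern

print(build_embed_url("https://vimeo.com/76979871"))  # generated embed URL
print(build_embed_url("javascript:alert(1)"))         # None: rejected
```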

How do BitTorrent search engines work?

For a normal search engine, I understand that it regularly travels across the internet to gather web page information, and sometimes web pages voluntarily submit their latest updates to the engines. But what about BT search engines? Torrents cannot simply be found by viewing web pages. So how do they work? Do users submit them?
A publisher submits their torrent to a tracker and then distributes a link to the .torrent file on that tracker. Users use that file to connect to the specified tracker, which gives them a list of peers who are sharing the content. Torrent search sites simply list which trackers are available and which files can be found on which trackers, as submitted by publishers.
However, I think this may be better suited to Super User rather than Stack Overflow...
No, users do not necessarily submit torrents. For our torrent search site http://tornado.li/, we created robots that scan the torrent sites we've added for new torrents and insert them into the database. The whole process is fully automated; only this way is it possible to offer a good choice of torrents.
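A minimal sketch of what such a robot could look like, assuming a placeholder listing page, a simple magnet-link regex, and an SQLite table (the requests library is assumed to be installed); a real robot would be politer (rate limits, robots.txt) and cover many sites.

```python
import re
import sqlite3
import requests

# Placeholder source page to scan; a real robot would loop over many listing pages.
LISTING_URL = "https://www.example.com/latest-torrents"

MAGNET_RE = re.compile(r'href="(magnet:\?xt=urn:btih:[^"]+)"')

def scan_once(db_path="torrents.db"):
    """Fetch one listing page and store any new magnet links."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS torrents (magnet TEXT PRIMARY KEY)")
    html = requests.get(LISTING_URL, timeout=30).text
    new = 0
    for magnet in MAGNET_RE.findall(html):
        # INSERT OR IGNORE makes repeated scans idempotent.
        cur = conn.execute("INSERT OR IGNORE INTO torrents (magnet) VALUES (?)", (magnet,))
        new += cur.rowcount
    conn.commit()
    conn.close()
    return new

if __name__ == "__main__":
    print(f"added {scan_once()} new torrents")
```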

Search product websites without an API

Sorry for the bad title and description, but I was wondering if there is any way I could search/list products from other sites (say Express or American Eagle) from a web app I create, even if the site doesn't have an API.
Thanks
Sure. How do you think Google and every other search engine does it? They just spider the sites and index the contents. The devil, of course, is in the details. But it's certainly possible to do.
Not directly. If you only want to fetch some data from a particular HTML page, you can scrape it with regular expressions or an HTML parser. But querying the site's own database is not possible unless you can connect to it directly or through some API.
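To illustrate the scraping approach, here is a minimal sketch assuming a hypothetical listing page whose markup puts each product link in an `a.product-name` element; requests and BeautifulSoup are assumed to be installed, and a real retailer's markup (and terms of service) would need to be checked.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder listing URL and CSS class; a real retailer's markup will differ.
LISTING_URL = "https://www.example.com/search?q=jeans"

def search_products(url=LISTING_URL):
    """Fetch a listing page and pull out product names and links."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for node in soup.select("a.product-name"):
        products.append({"name": node.get_text(strip=True), "url": node.get("href")})
    return products

if __name__ == "__main__":
    for product in search_products():
        print(product["name"], "->", product["url"])
```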

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and OpenSearch. DuckDuckGo uses Yahoo's BOSS API (and Yahoo now uses Bing...) as part of its search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
If you want to be updated with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use paid services like Superfeedr that use the same protocol.
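As a rough sketch of the meta-search idea, the snippet below fans a query out to several search APIs and merges the results; the endpoint URLs, parameters, and response shape are placeholders, since each provider's real API (and its terms) would have to be checked.

```python
import requests

# Placeholder endpoints standing in for real search APIs.
# Their URLs, parameters, and response format are assumptions for illustration.
BACKENDS = [
    "https://search-api-one.example.com/search",
    "https://search-api-two.example.com/search",
]

def meta_search(query, limit=10):
    """Query each backend and merge results, de-duplicating by URL."""
    seen, merged = set(), []
    for endpoint in BACKENDS:
        try:
            resp = requests.get(endpoint, params={"q": query, "count": limit}, timeout=10)
            resp.raise_for_status()
            results = resp.json().get("results", [])
        except requests.RequestException:
            continue  # skip backends that are down or unreachable
        for item in results:
            url = item.get("url")
            if url and url not in seen:
                seen.add(url)
                merged.append(item)
    return merged[:limit]

if __name__ == "__main__":
    for hit in meta_search("web crawling"):
        print(hit.get("title"), hit.get("url"))
```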
Directly or indirectly, you have to crawl the web in order to get the content.
Well, if you don't want to crawl, you can follow a wiki-like approach, where users submit links to sites (with a title, description and tags), so a collaborative link collection can be built.
To avoid spam, a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To keep spammers from mass-voting SERPs, you can weight votes by user reputation (see the sketch below).
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
And by considering other abuse patterns too.
Well, you get the point, I think.
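A minimal sketch of such reputation-weighted voting; the logarithmic weighting is just one plausible choice, picked so that high-reputation users count for more without completely dominating.

```python
import math

# Illustrative reputation-weighted vote scoring; the log dampening is an assumption,
# chosen so that very high reputation cannot completely dominate the score.
def vote_weight(reputation: int) -> float:
    return 1.0 + math.log10(max(reputation, 1))

def score_link(votes):
    """votes: iterable of (direction, voter_reputation) with direction +1 or -1."""
    return sum(direction * vote_weight(rep) for direction, rep in votes)

# Example: two low-reputation downvotes vs. one established user's upvote.
votes = [(+1, 5000), (-1, 1), (-1, 1)]
print(round(score_link(votes), 2))  # positive: the reputable upvote outweighs the spam
```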
As spammers gradually discover the weaknesses of traditional search engines (see Google bombs, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and while the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange have not been spammed to useless levels so far...
PS: http://xkcd.com/810/

How to get private pages crawled by Google

How can I get private pages of my web site crawled and indexed by Google?
Maybe it's not very "conventional", but I want my private page "links" displayed in Google's index, while still requiring registration to display the page.
EDIT: Based on the addition of "maybe it's not very 'conventional', but I want my private page 'links' displayed in Google's index, while still requiring registration to display the page" to the question:
You can check the user agent in your PHP code to allow Google to see pages as if it were a registered user (Google's crawler identifies itself with a user agent containing "Googlebot"; you can search to find the user agents of other common engines).
However, this behavior is specifically against Google's rules, and they can and will remove your site from the index if they catch you doing it. Their policy is that you should not treat Googlebot any differently than you treat any random person who visits your site.
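For completeness, the user-agent check described above might look like the sketch below (in Python rather than PHP); as noted, serving crawlers different content from what visitors see is cloaking and risks removal from the index, so this only illustrates the mechanism.

```python
# Sketch of the user-agent check described above (Python rather than PHP).
# As noted, treating Googlebot differently from other visitors is cloaking and
# violates Google's guidelines; this is shown only to illustrate the mechanism.
KNOWN_CRAWLER_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")  # common tokens, not exhaustive

def is_search_crawler(user_agent: str) -> bool:
    return any(token.lower() in user_agent.lower() for token in KNOWN_CRAWLER_TOKENS)

def may_view_private_page(user_agent: str, logged_in: bool) -> bool:
    # Registered users see the page; crawlers would be let through by the (discouraged) check.
    return logged_in or is_search_crawler(user_agent)

print(may_view_private_page("Mozilla/5.0 (compatible; Googlebot/2.1)", logged_in=False))  # True
print(may_view_private_page("Mozilla/5.0 (Windows NT 10.0)", logged_in=False))            # False
```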
(Original answer) One way is to use a sitemap to show Google how to find all of your pages.
In general, even with a sitemap, if the content you want indexed is not linked from a page that can be reached from the root (/) (i.e. there is no way for the public to find it), then it probably won't get indexed. The only way to get it indexed is to link to it from somewhere.
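A sitemap is just an XML file listing your URLs; a minimal sketch of generating one (with placeholder URLs) could be:

```python
# Minimal sitemap generation sketch; the URLs listed are placeholders.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(build_sitemap([
        "https://www.example.com/",
        "https://www.example.com/articles/1",
    ]))
```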
The question is, though: why do you want your private pages in Google anyway?
They'll get crawled if and only if they're publicly accessible and your robots.txt file allows it. That's pretty much all you need to do.
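If you want to verify that, the standard library's robots.txt parser can check whether a given crawler is allowed to fetch a URL; the site used here is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; point this at your own robots.txt to check crawler access.
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for path in ("/", "/private/report.html"):
    allowed = rp.can_fetch("Googlebot", f"https://www.example.com{path}")
    print(path, "->", "crawlable" if allowed else "blocked by robots.txt")
```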
Are you asking how to get Google to index your pages?
There are a couple of ways. You need to ensure the pages are properly search engine optimised (SEO), with title text and description keywords in your meta data.
You can also submit your site to Google; it's a free service, and it'll be placed in a queue of things that Google will index. It may take some time, though.
By far the best way to get your pages indexed is using the meta data in the pages themselves.
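If you want to check that a page actually exposes that meta data, a quick sketch using the standard library (with a placeholder URL) might be:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Quick check that a page exposes the title and meta description used by search engines.
class MetaChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Placeholder URL; point this at one of your own pages.
html = urlopen("https://www.example.com/").read().decode("utf-8", errors="replace")
checker = MetaChecker()
checker.feed(html)
print("title:", checker.title.strip() or "(missing)")
print("description:", checker.description or "(missing)")
```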
Google will only index what is (a) linked from somewhere already in Google's index and (b) accessible to its crawler via normal (unauthenticated) HTTP. It will also make the contents available in search results to anyone. This may conflict with your idea of a "private" page.
I'm going to assume that all the previous answerers are misunderstanding you. As I read it, you aren't asking how to get Google to index your pages, but rather how to get a list of all the pages on your site that Google has already indexed. If that is the case, you should have a look at Google Webmaster Tools.
