Where do search engines start crawling?

What do search engine bots use as a starting point? Is it DNS look-up, or do they start with some fixed list of well-known sites? Any guesses or suggestions?

Your question can be interpreted in two ways:
Are you asking where search engines start their crawl from in general, or where they start to crawl a particular site?
I don't know how the big players work, but if you were to make your own search engine you'd probably seed it with popular portal sites. DMOZ.org seems to be a popular starting point. Since the big players have so much more data than we do, they probably start their crawls from a variety of places.
If you're asking where a SE starts to crawl your particular site, it probably has a lot to do with which of your pages are the most popular. I imagine that if you have one super popular page that lots of other sites link to, then that would be the page that SEs enter from, because there are so many more entry points from other sites.
Note that I am not in SEO or anything; I just studied bot and SE traffic for a while for a project I was working on.

You can submit your site to search engines using their site submission forms; this will get you into their system. Exactly when you get crawled after that is impossible to say; from experience it's usually about a week or so for an initial crawl (homepage, plus a couple of other pages one link deep from there). You can increase how many of your pages get crawled and indexed by using a clear semantic link structure and submitting a sitemap. A sitemap lets you list all of your pages and weight them relative to one another, which helps the search engines understand how important you consider each part of the site relative to the others.
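For reference, a sitemap is just an XML file following the sitemaps.org protocol; the optional priority element (0.0 to 1.0, and relative to your own pages only) is what carries the weighting mentioned above. A minimal example with placeholder URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/some-deep-page</loc>
    <priority>0.5</priority>
  </url>
</urlset>
```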
If your site is linked from other crawled websites, then your site will also be crawled, starting with the linked page and eventually spreading to the rest of your site. This can take a long time and depends on the crawl frequency of the linking sites, so URL submission is the quickest way to let Google know about you!
One tool I can't recommend highly enough is Google Webmaster Tools. It lets you see how often you've been crawled and any errors Googlebot has stumbled across (broken links, etc.), and it has a host of other useful tools in there.

In principle they start with nothing. Only when somebody explicitly tells them to include a website can they start crawling that site and use its links to find more.
However, in practice the creator(s) of a search engine will seed it with whatever arbitrary sites they can think of: for example, their own blogs or the sites they have in their bookmarks.
In theory one could also just pick random addresses and see if there is a website there. I doubt anyone does this, though; the above method works just fine and does not require extra coding just to bootstrap the search engine.
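To make that concrete, here is a toy Ruby sketch of the bootstrap-from-a-seed-list idea: a frontier queue starts with a few hand-picked sites (the URLs below are placeholders), and every link found is pushed back onto the queue. A real crawler would add politeness delays, robots.txt handling, and proper HTML parsing.

```ruby
require 'net/http'
require 'uri'
require 'set'

# Hand-picked seeds stand in for "sites the creators can think of".
seeds   = ['https://example.com/', 'https://example.org/']
queue   = seeds.dup
visited = Set.new

until queue.empty? || visited.size >= 100 # small cap for the toy example
  url = queue.shift
  next if visited.include?(url)
  visited << url

  begin
    html = Net::HTTP.get(URI(url))
  rescue StandardError
    next # dead hosts are simply skipped
  end

  # Every absolute link discovered joins the crawl frontier.
  html.scan(%r{href="(https?://[^"]+)"}) { |(link)| queue << link }
end

puts "Crawled #{visited.size} pages"
```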

Related

Mautik hosting best practice

I am new to Mautik and therefore need some guidance.
Where should we set up Mautik: in a folder, on a subdomain of the main site, or on a separate domain? How do the landing pages and forms get their URLs? Can they be embedded on a site on another domain, or do they have to be hosted where Mautik is hosted?
Moreover, can a single installation of Mautik be used for two or more different business sites that are unrelated to each other, mainly different customers of a marketing company? Or is it better to install Mautik per business?
Can we track interactions from a mobile app too using Mautik?
First thing, I expect you are talking about Mautic and not Mautik.
You are free to choose whatever type of hosting you want. Personally I like to use an independent container (it can be lightweight), but I have seen people host on shared hosting as well.
If you are hosting on, say, example.com, the landing page URL will be example.com/landing-page; the same goes for all elements of Mautic.
Yes, forms can be embedded on other websites with a completely different domain, say example-something-else.com; you will need to put your tracking script in the other site's head to make it work better. For example, check out this small tutorial, https://tutorialsjoint.com/mautic-wordpress-integration/, which shows how you can use it in WordPress.
No, it is not required that the site where you use a Mautic form be on the same host or domain.
However, I usually recommend using a subdomain, just to save the hassle of buying a new domain and to keep the landing page URLs more relevant. This video shows how tracking works and will help you understand it a little better: https://www.youtube.com/watch?v=K8lWaCabH1w. Also, here's the official documentation: https://docs.mautic.org/en/contacts/manage-contacts/contact-monitoring.
You can use one instance to manage multiple businesses; I know people who are doing it. But when the number of contacts, segments, campaigns, forms, emails, and landing pages grows over time, it becomes a hassle to keep everything clean. You can use categories and a specific naming convention to keep them organized, but in the long run I recommend keeping separate instances.
I am not sure about mobile apps, but ideally it should be possible using the tracking script or a tracking pixel; you may need to adjust CORS restrictions.
I hope it was helpful.
Cheers!
No, you should use a VPS with Debian or Ubuntu; shared hosting can cause problems if you send many emails.
Landing pages can be made in HTML and pasted into or edited in Mautic.
To use it on more sites you must create a user for each one, each with a different email address.

Integrating GCS on a staging Jekyll website

We currently have a staging website, which is reached by an IP address like xx.xx.xxx.xxx, and we would like to integrate and test GCS on it before pushing it live. Is that possible?
Otherwise, is there an alternative to GCS for adding a search bar to a Jekyll blog without using a plugin?
Thanks!
PART I: Google
Google Custom Search cannot index an application that isn't available on the internet.
No, that's not entirely true. You can arrange something with Google (in theory; I've never done it), but it doesn't look easy. Or cheap.
You could set up a custom search for an unrelated site and embed those results in your local page, if you want to test out CSS prior to launch.
Remember, Google Custom Search also comes with ads, unless you pay. And the results tend to look like they came from Google.
PART II: Alternatives
I've looked into this extensively and I haven't come up with a good answer. Here are some not-so-good answers:
1) Tapir Search. This actually worked pretty well, but appears to have died. They do have recent Twitter activity, however, so it may be worth checking back in a bit. It's basically a (free) front end for an Elasticsearch server, I think. Neat service, but obviously not super-dependable.
2) Go JavaScript. Lunr, for example. There are many, many similar solutions available. Sadly, they are client-side, and doing a full-text search on even a smallish blog-type site can be very slow. It works okay if you limit the search to titles, but then... you're only searching titles.
3) Build a search engine server. Maybe some breed of Lucene. Upside: very robust search while keeping the snappy response of a flat HTML site. Downside: building and maintaining a search engine server is difficult, expensive and probably overkill.
4) Hosted search engine. Algolia for example. They're basically doing 3) for you. Relatively expensive (~$50/month) but well worth the cost, because, seriously, search engine servers are finicky and prone to explosions. I've never gone this route with Jekyll because I've never had a Jekyll project I was quite that serious about, but I did consider it.
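As a half-measure between 2) and 3), you can shrink the client-side problem by building the index at compile time and keeping only titles. A rough Ruby sketch (the file layout and field names are assumptions, not from any particular plugin) that walks Jekyll's _posts directory and writes a JSON index for a client-side library like Lunr to load:

```ruby
#!/usr/bin/env ruby
# Build-time sketch: emit a titles-only JSON index from Jekyll posts.
require 'yaml'
require 'json'

index = Dir.glob('_posts/*.md').map do |path|
  # Front matter sits between the first pair of "---" lines.
  front_matter = YAML.safe_load(File.read(path).split(/^---\s*$/)[1].to_s) || {}
  slug = File.basename(path, '.md').sub(/^\d{4}-\d{2}-\d{2}-/, '')
  { title: front_matter['title'], url: "/#{slug}/" }
end

File.write('search.json', JSON.pretty_generate(index))
```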
If anyone has anything to add, I'd love to hear it. This question has been irritating me for a while.

Make a sitemap for many pages in Rails

I just noticed that Google hasn't crawled quite a lot of my web pages. When I looked into why, I figured the lack of a sitemap could explain part of the missing crawling, because I had no sitemap, robots.txt, or anything of that kind.
I have a website with a very simple structure: index, QnA, login (no user pages yet), and search pages. But the search page can lead to many restaurant pages, up to almost 160,000 of them, from example.com/restaurants/1000001 to example.com/restaurants/1160000.
I just learned the basic concepts of sitemaps, and there are plenty of dynamic sitemap examples on Google. But I saw on Google's sitemap help page that more than 50,000 pages require additional sitemaps.
My website has a very, very simple structure, but would generating this put a burden on my server (and is it even necessary)? Also, I have no clear standard for splitting those 160,000 pages, so what would be a good way to split them for Googlebot?
Any tips would be a huge help for me. Thanks!
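For what it's worth, the common Rails answer to both concerns is the sitemap_generator gem: it batches through your records, rolls over to a new sitemap file at the 50,000-URL limit automatically, and writes a sitemap index pointing at the pieces, so you don't need to invent your own splitting standard. A minimal sketch (the paths mirror the question; the host and model details are illustrative):

```ruby
# config/sitemap.rb -- used by the sitemap_generator gem,
# regenerated with `rake sitemap:refresh`.
SitemapGenerator::Sitemap.default_host = 'https://example.com'

SitemapGenerator::Sitemap.create do
  add '/qna',    changefreq: 'weekly'
  add '/search', changefreq: 'daily'

  # find_each walks the table in batches of 1,000, so 160,000 rows
  # won't exhaust memory; the gem starts a new sitemap file and
  # updates the sitemap index on its own every 50,000 URLs.
  Restaurant.find_each do |restaurant|
    add "/restaurants/#{restaurant.id}", lastmod: restaurant.updated_at
  end
end
```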

SEO Strategies: Directory, separate domain, or sub-domain?

What is the optimal SEO strategy for storing a blog?
1) In a directory: www.example.com/blog
2) In a separate domain: www.exampleblog.com
3) In a sub-domain: www.blog.example.com
With a directory, the reputation earned by the blog is directly transferred to the main domain (www.example.com). With a separate domain, any links to my site would count as backlinks.
I'm leaning towards option 1. What other pros and cons should I consider?
This question is covered comprehensively in "Sub-domain versus sub-directory" (via Webmasters SE), which was updated in November 2012. Look at that answer too, as it specifically describes, with a huge chart, what effect sub-folders (meaning sub-directories in this context) versus sub-domains have on SEO, and how the use of a reverse proxy can affect blog SEO. The gist of it is that a sub-domain is preferable to a sub-directory.
EDIT
I may have mis-read the question. If the choice is between
mywebsitename.com/blog
versus
mywebsitenameblog.com
then I would definitely recommend using the sub-directory. This is why:
If you use an entirely different domain name, even if it is your website's domain with the four letters blog concatenated, it will be confusing to users, as no one does that!
You will need to pay for a second domain and that costs more money.
You'll be doing something that is inconsistent with typical website naming conventions, which I'd avoid if I were concerned about SEO and were developing an e-commerce website. I don't know if it will negatively affect SEO ranking, but it won't help, as it will be an entirely different domain name, without any of the positive reputation or credibility of your primary domain name.
It will be four characters longer, which is never good, as it will be less convenient, more difficult to remember, etc.
Better yet, use a sub-domain of your primary website for your blog. To summarize, you should do the following, in order of best to worst:
blog.mywebsitename.com
mywebsitename.com/blog
mywebsitenameblog.com
There is a slight difference depending on which search engine looks at your blog and adds or subtracts value on the basis of this decision.
Read this blog post from Matt Cutts, or watch this video for a summary.
If you go for another domain, search engines expect it to be separate content, with not much relation to your main domain's rank.
I would install the blog in a sub-directory called blog and stop worrying about the actual juice from search engines, as it may vary from one engine to another.

What is the best conceptual approach, in Rails, to managing content areas in what is otherwise a web application?

A while back I read this great post, http://aaronlongwell.com/2009/06/the-ruby-on-rails-cms-dilemma.html, discussing the "Rails CMS Dilemma". It describes conceptual approaches to managing content in websites vs. web apps. I'm still a beginner with Rails, but I have a bit of a PHP background, and I still have trouble wrapping my brain around this.
A lot of what I run into is customers who want a website that is not 100% website and not 100% web app... That is, perhaps there are several pages of public-facing business content, but then there are application elements, and the whole overall look is supposed to be cohesive. This was always fairly simple in PHP, as you just kind of dropped your app code into the PHP "script", etc. (though I know there are plenty of cons to this platform and approach).
So I am wondering, what is the best approach in Rails for doing this?
Say you have an application with user authentication and some sort of CRUD stuff going on, where users collaborate on projects or something. What, then, is the optimal approach for managing the text/images of the "How This Site Works" and "Our Company" pages that visitors may also want to view? Is it simply having a pages controller and several text fields, with an admin panel on the back end that lets you edit those fields? Or is it perhaps a common approach to start off with something like Refinery and then build the non-content-driven areas of the site on top of it?
Sorry if this is a dumb question. It's just that I've read Hartl's book and others, and they never address this practical low-level stuff for a beginner... Sure, I can build a Twitter feed now, but what about Twitter's "About" page (http://twitter.com/about)? I can't just throw text into a view and hand that to a client... They want a super easy way to see the site tree, edit content areas, AND administrate/run their Twitter feed or whatever.
Thanks for your help.
I think you're looking for a CMS that runs as a plugin in your Rails application. If that's the case, I'd suggest that you try http://github.com/twg/comfortable-mexican-sofa
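If a full CMS plugin feels like too much, the lighter approach the question mentions (a pages controller over a few editable text fields) can be sketched in a handful of lines. All names here are illustrative, not from any particular gem:

```ruby
# app/models/page.rb -- one row per editable page
# (columns: slug:string, title:string, body:text)
class Page < ApplicationRecord
  validates :slug, presence: true, uniqueness: true
end

# app/controllers/pages_controller.rb
class PagesController < ApplicationController
  def show
    @page = Page.find_by!(slug: params[:slug]) # renders app/views/pages/show
  end
end

# config/routes.rb would map it with something like:
# get "pages/:slug", to: "pages#show", as: :page
```

An admin namespace with a standard CRUD controller over Page then gives the client their "edit content areas" panel, while the rest of the app stays ordinary Rails.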
