Does Google ever use guesses to find webpages? - url

If I have a webpage (www.example.com/very_long_random_name.htm) on a site that's already been indexed by google, will it ever be found if it has no incoming links?
Or can google find such unlinked pages by some other method?

No, it will not be indexed. At least in theory, that is. The crucial thing is that there are no incoming links; the fact that the name is long and random does not matter.
However, it is hard to be sure that it really won't be indexed, since incoming links can come from anywhere and without your knowledge. For example, some email that contains a link might get indexed (especially if one participant in the conversation uses a questionable mail provider), or someone might post a link on some forum, etc.

Related

Twitter streaming API not tracking URLs

Have gone through https://dev.twitter.com/docs/streaming-apis/parameters
Per the documentation, it should be able to track URLs such as example.com/foobarbaz, but I can't seem to get it to track such URLs. It just doesn't return any results when I tweet this URL and track it using the Streaming API. Am I missing something?
Pretty late, but I found this by Google so this might help someone...
There are a few answers to this. The main answer being that Twitter treats URLs differently than anything else.
First, make sure you do NOT include the "www".
Twitter currently canonicalizes the domain “www.example.com” to “example.com” before the match is performed, so omit the “www” from URL track terms.
For me, sending the track parameter as "example.com/foobarz" and then tweeting "a test, please ignore: http://example.com/foobarz" worked perfectly.
You can NOT, in general, ask for substrings of URLs:
URLs are considered words for the purposes of matches which means that the entire domain and path must be included in the track query for a Tweet containing an URL to match.
But if you are willing to take every tweet from the whole domain (and a few more edge cases), Twitter will accommodate:
Finally, to address a common use case where you may want to track all mentions of a particular domain name (i.e., regardless of subdomain or path), you should use “example com” as the track parameter for “example.com” (notice the lack of period between “example” and “com” in the track parameter).
All quotes are from the Twitter docs: https://dev.twitter.com/streaming/overview/request-parameters#track
They have more information, including examples.
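For reference, here is a minimal sketch of passing such a track term to the v1.1 statuses/filter streaming endpoint. The credentials are placeholders, and the requests/requests_oauthlib dependencies are my own assumption, not something from the docs quoted above:

# Minimal sketch: track a URL via the v1.1 streaming API (credentials are placeholders).
import json
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Note: no "www" and no scheme -- just domain plus path, as the docs describe.
params = {"track": "example.com/foobarz"}

resp = requests.post("https://stream.twitter.com/1.1/statuses/filter.json",
                     auth=auth, data=params, stream=True)

for line in resp.iter_lines():
    if line:  # keep-alive newlines are empty
        print(json.loads(line).get("text"))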
Good luck!

How do search engines see dynamic profiles?

Recently search engines have been able to index dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi-frequently? Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but shorthand for a query like select username from users, with the result displayed on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know about. Do I need to be as big as Twitter or Facebook in order for Google to make special ways to crawl my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say..
As far as I know Google isn't able to read and store the actual contents of profiles, because the Google bot doesn't have a Facebook account, and it would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the page it hits, it stores. So even if it follows a dynamic url like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully in that particular case, it isn't all the private data of said user.
Additionally, facebook can and does provide special instructions that search bots can follow, so that google results don't include a bunch of login pages.
Profiles can be linked from outside;
the site may provide a sitemap (see the sketch below).
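As a rough illustration of the sitemap point (the usernames, domain, and output path below are made up), a site could publish its public profile URLs so a crawler can find them without inbound links:

# Sketch: write a sitemap.xml listing public profile URLs (hypothetical names and domain).
usernames = ["alice", "bob", "carol"]

entries = "\n".join(
    "  <url><loc>https://www.example.com/%s</loc></url>" % name
    for name in usernames
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries + "\n</urlset>\n"
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)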

How would I find all the short urls that link to a particular long url?

Basically I want to know how many people have tweeted a link to a URL, but since there are dozens of link shorteners out there, I don't see any way to do this without having access to all of their URL maps. I found a previous question here, but it was over a year old and didn't have any new answers.
So #1, does anyone know of a service/API for doing this?
And #2, can anyone think of a way to accomplish this task other than submitting the long url in question to all the popular link shortening sites?
ps- I'm also open to comments about why this is impossible or impractical.
You could perform a Google search (or the equivalent via API) for any pages that link to your page. This is done with the link: keyword. So if you're trying to figure out how many people link to www.example.com (regardless of whether it's through a link shortener URL), then you would just do a Google search for link:www.example.com.
e.g.: http://www.google.com/search?q=link:www.example.com
Note that this will only find pages that have been indexed, so pages that haven't been crawled, or pages that get crawled infrequently, will not show up in the results until a later date (if at all).
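If you want to do this programmatically, one option is a sketch along these lines using the Custom Search JSON API; API_KEY and CX are placeholders, and whether the link: operator is still honored can vary:

# Sketch: ask Google for pages linking to a URL (placeholder key and search engine ID).
import requests

params = {
    "key": "API_KEY",
    "cx": "CX",  # custom search engine ID
    "q": "link:www.example.com",
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in resp.json().get("items", []):
    print(item["link"])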
Since all sites have different algorithms for shortening the URLs, and these are different sites that most likely do not share their data with each other, how can you hope to find all of them in a single or small number of queries?
All you can do is brute-force it, and even then that might not be any good if a site happily creates a new short value for the same long-form URL each time (especially if you send a slightly different long-form URL that maps to the same place, like http://www.stackoverflow.com/ rather than http://stackoverflow.com/).
In order to really get this to work, there would have to be a site that ALREADY automatically collects all of this information from every site, which the URL shortening sites voluntarily call. And even if you wrote such a site, that doesn't account for the URL-shortening sites already out there who already have data!
In short, I do not see how this is remotely possible, unless I'm wrong about there being such a database somewhere out there.
So months after asking this question, I came across a solution to a similar question: how to tell how many times a link has been shared on Facebook. The solution, via a simple new API call:
http://graph.facebook.com/http://stackoverflow.com
returns the following JSON data:
{
"id": "http://stackoverflow.com",
"shares": 1627
}
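Reading that count from code is straightforward; here's a small sketch (passing the long URL as the id query parameter, which should be equivalent to putting it in the path):

# Sketch: fetch the share count for a URL via the Graph API call shown above.
import requests

resp = requests.get("http://graph.facebook.com/", params={"id": "http://stackoverflow.com"})
print(resp.json().get("shares"))  # e.g. 1627 at the time of the answer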

Retweets by a given user

I am trying to get a full list of a user's tweets. I do not particularly care what order they come in, but I need all of them, including what the user has retweeted. Essentially, I would like to have status/retweeted_by_me, but for a specified user.
Is this at all possible?
This was addressed recently by the Twitter devs. You can now add an include_rts=true parameter to your call to user_timeline. See the full discussion here: http://groups.google.com/group/twitter-development-talk/browse_thread/thread/7a4be385ff549ed0
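For example, a sketch of that call against the later v1.1 REST endpoint (the screen name and credentials are placeholders, and requests/requests_oauthlib are my own choice of client):

# Sketch: fetch a user's timeline including native retweets via include_rts=true.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

params = {"screen_name": "someuser", "count": 200, "include_rts": "true"}
resp = requests.get("https://api.twitter.com/1.1/statuses/user_timeline.json",
                    auth=auth, params=params)
for tweet in resp.json():
    print(tweet["text"])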
Nope. Twitter's API leaves a lot to be desired (especially in terms of actual RESTfulness), but this particular issue is also my biggest gripe with it. You can only get your own retweets, but not the retweets of others. Look at any Desktop Twitter client, and compare a user's timeline in there with the actual timeline on the web. The latter contains all the retweets. This has been a big problem ever since they introduced the new retweets, and in my opinion is one of the reasons why acceptance of new-style RTs is slower than it could be among Twitter users. I've tried to make #twitterapi aware of the issue but never got a reply; maybe you (and anyone reading this) could do the same thing.
Their argument regarding backwards compatibility is, of course, utter nonsense. It does not break backwards compatibility at all, since these retweets never showed up in the first place. And even if they did, a ?retweets=true query argument would be enough to fix that. I really have no clue why they're not implementing this; their own website shows the retweets fine already, they just need to expose it in the API.

Can an "SEO Friendly" url contain a unique ID?

I'd like to start using "SEO Friendly Urls" but the notion of generating and looking up large, unique text "ids" seems to be a significant performance challenge relative to simply looking up by an integer. Now, I know this isn't as "human friendly", but if I switched from
http://mysite.com/products/details?id=1000
to
http://mysite.com/products/spacelysprokets/sproket/id
I could still use the ID alone to quickly lookup the details, but the URL itself contains keywords that will display in that detail. Is that friendly enough for Google? I hope so as it seems a much easier process than generating something at the end that is both unique and meaningful.
Thanks!
James
Be careful with allowing a page to render using the same method as Stack Overflow.
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
Black hats can use this to cause a duplicate content penalty for long-tail competitors (trust me).
Here are two things you can do to protect yourself from this.
HTTP 301 redirect any inbound URL that matches your ID but doesn't match the text to the URL with the correct text.
Example:
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
301 ->
http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id
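A minimal sketch of that redirect rule, assuming a Flask-style route and a hypothetical canonical_slug() lookup (neither is from the original answer):

# Sketch: 301-redirect any request whose slug doesn't match the canonical one.
from flask import Flask, redirect, abort

app = Flask(__name__)

def canonical_slug(question_id):
    # Hypothetical lookup of the correct slug for this ID (normally a database query).
    slugs = {820493: "can-an-seo-friendly-url-contain-a-unique-id"}
    return slugs.get(question_id)

@app.route("/questions/<int:question_id>/<slug>")
def question(question_id, slug):
    correct = canonical_slug(question_id)
    if correct is None:
        abort(404)
    if slug != correct:
        return redirect("/questions/%d/%s" % (question_id, correct), code=301)
    return "render the question page here"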
Use canonical URLs.
<link rel="canonical"
href="http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id"
/>
I'd say you're fine.
Have a look at the URLs that StackOverflow uses. They have a unique id, then they have the SEO-friendly stuff. You can omit the SEO-friendly stuff and the URL still works.
You are making a devil's bargain here: you are trading away business goals for technology goals.
If you were to ask "From a purely business and SEO perspective, is it better to include unique IDs in the URL or not?", the answer would clearly be not to use them.
The question then becomes: if you do use them, how much does it hurt you in the search engines? The answer is that it definitely has some negative impact. How much is yet to be determined.
In terms of "user friendly", no, they are definitely not user friendly.
In terms of Google, they state "Whenever possible, shorten URLs by trimming unnecessary parameters." See their URL structure document.
I'm not aware of any problems caused by adding an ID to a URL. In fact it can be extremely useful, as it allows the human/search engine friendly part of the URL to be changed without causing a broken link to a page that a search engine has already indexed. Using SO as an example, here's a link to your question:
https://stackoverflow.com/questions/820493/you-can-put-any-text-you-want-here
Nothing wrong with that. An increasing number of services have started to use a hybrid solution as Paul Tomblin already pointed out. In addition to SO, Tumblr uses this pattern too (maybe it was the first).
Furthermore, in certain services—like Google News—the URL must contain a unique numeric ID.
Getting rid of the parameterized URL will definitely help. From my experience, including the ID does not hurt or help, as long as there are no '?key=value' pairs in the url.
I have two seemingly contradictory points to make here:
Nobody looks at URLs! Experience has "trained" browser users to treat the "Address" box contents as invisible; they know the contents will be any two of 'unreadable', 'meaningless' and 'confusing', so they just ignore them completely.
Using a string which can be easily converted to an integer may offer a slight performance advantage over using a longer string which is slightly harder (hash() vs. to_int()) to convert into an integer. However, in the context of the average web application, any performance difference would be negligible.
My advice would be to stick with what you're comfortable with.
Use something like mod_rewrite to parse URLs before they reach your server. So you could convert a slug like http://oorl.com/99942/My-Friendly-Text-For-Search-Engines/ into http://oorl.com/lookup.php?id=99942. This will also let you change the slug and keywords used to optimize certain links without damaging functionality.
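The same idea sketched in Python rather than mod_rewrite config, assuming the numeric ID always leads the path and everything after it is decorative:

# Sketch: pull the numeric ID out of a friendly URL and ignore the slug text.
import re

def extract_id(path):
    match = re.match(r"^/(\d+)(?:/|$)", path)
    return int(match.group(1)) if match else None

print(extract_id("/99942/My-Friendly-Text-For-Search-Engines/"))  # 99942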
Duplicate content causes more negative impact than a friendly URL gains you, so be careful about using fake text with the ID; your competitors could misuse this.
Yes, and in fact it's more SEO-friendly to include a number in your URL, as it implies to Google that you are consistently updating your content.
I am fairly sure that it makes it much more difficult to get indexed in Google News if you don't have an incrementing number attached in some way to your URLs.

Resources