I have some pages with URL rewriting that appear like this:
http://www.xx.com/de/xx/kontakt#.ULTZZuT8JuI
or even without URL rewriting:
http://www.xx.com/index.php?page=2#.ULTZu-T8JuI
This doesn't seem to me like Google sessions.
Does anyone have any idea what that is at the end?
I'm not sure I can fully recover the context of your question, but those URL postfixes look to me like Ajax cache-control tags. (Well, at least if I take it for granted that they are indeed not session IDs. But they could well be that, too; it depends entirely on what that "xx.com" actually is. :) )
See the "Make Ajax Cacheable" section on this (aging) performance research page from Yahoo.
I had not thought very much about using a rather complicated PWA offline, but now I want to try it. However, I am using links like this (inside the PWA, so to speak):
https://example.com/page.html?param=val
When I click a link like that offline in the PWA, I get "This site can't be reached". This link, however, works fine:
https://example.com/page.html
The parameters are all handled in JavaScript in the web browser. What options do I have to handle this? Would it be best to rewrite them as # links? Or would that get me into other trouble?
https://example.com/page.html#param=val
The problem comes from the cache. In your sw.js, you give the list of files you want to cache, but you give the file names without the parameters, which is logical since in many cases you can't know the full value of the parameters.
So in your case you cache "https://example.com/page.html", but when you call "https://example.com/page.html?param=val" the comparison fails and you get the error message.
The way to solve that is to tell the retrieving function in your sw.js file to ignore parameters.
Rather than
caches.match(event.request)
just use
caches.match(event.request, {ignoreSearch: true})
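For context, a minimal sketch of how those two pieces might fit together in sw.js (the cache name and file list are placeholders):

// sw.js -- minimal sketch; the cache name and file list are placeholders
const CACHE_NAME = 'pwa-cache-v1';

self.addEventListener('install', (event) => {
  event.waitUntil(
    // The URL is cached without any query string...
    caches.open(CACHE_NAME).then((cache) => cache.addAll(['/page.html']))
  );
});

self.addEventListener('fetch', (event) => {
  event.respondWith(
    // ...so ignore ?param=val when matching, and fall back to the network.
    caches.match(event.request, { ignoreSearch: true })
      .then((cached) => cached || fetch(event.request))
  );
});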
I'd like to know all subpages of a certain URL. E.g. I have the URL example.com. There might exist the subpages example.com/home, example.com/help, and so on. Is it possible to get all such subpages without knowing their exact names?
I thought I could handle this problem with a web crawler, but it only crawls pages that are mentioned on the page itself.
I hope you understand my problem and can help me with it.
Thank you!
To answer your question: yes. Scrapy "crawl" spiders work by setting rules that can be configured to do exactly what you're trying to do. When in doubt, always go to the docs!
A couple of things to note:
You can create a crawl spider the same way you create a generic spider:
scrapy genspider -t crawl nameOfSpider website.com
With a crawl spider, you then have to set rules to basically tell Scrapy where and where not to go; how's your regex?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']  # PART 1: Domain Restriction
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(allow=(r'.*',)), callback='parse_item'),  # PART 2: Call Back
    )

    def parse_item(self, response):
        self.logger.info('Crawled: %s', response.url)
Now, I copied and pasted this from the official docs and changed it to what it should look like for you, but I haven't checked the code, so yeah... the logic is there though.
This works by getting ALL the links it can see, depending on the rules you set, and doing something with each link.
You want to exclude all domains except the one you're scraping; that's what allowed_domains does.
In the example I set the wildcard to accept literally any and every page in the domain... once you figure out the structure of a website, you can use logic to build out what you need.
You should take a look at the docs more often though. I have been using Scrapy for about 6-7 years and I still find myself going back to the man pages!
No, you can’t.
The way you describe the situation, the website intends those desired URLs to be secret.
Any way to find such URLs would be a security exploit that should be reported to the website owners right away so they can fix it.
Is there a way to hide the URL key for each page in Magento? For example, a site URL is www.test.com/Logitech-bluetooth-keyboard. Can I disable or hide the URL key so it is www.test.com for every page?
I'm using Magento Community 1.7.0.
The short answer is: no. I'm curious why you would want to do something like that; from a user's perspective it would be bad usability, and very bad for SEO, since you would only be able to Google your homepage.
What you could do is rewrite the whole thing so all links load Ajax requests instead. This would take serious time and effort, and I don't think it's worth it because I fail to see the benefit.
I have a pager on a table using Ajax, and I would like each such request to also change the browser's URL, so that when I hit the refresh button I won't skip back to the first page. I was fighting the Url parameter of AjaxOptions, but it keeps winning over me. Please help.
You can safely change the URL past the hash mark without redirecting the page. However, the user can (in most browsers) navigate through these changes with the Back and Forward buttons. This technique is usually called "history."
Because the technique is difficult to get working in all browsers, you'll want to use a framework. Take a look at http://www.mikage.to/jquery/jquery_history.html.
I can also recommend ExtJS's history handling. Take a look at this example:
http://www.extjs.com/deploy/dev/examples/history/history.html#main-tabs:tab2
Again, notice that not only does the URL change when the user does stuff, but changing the URL (via Back and Forward) also affects the page. This is good, awesome even, but means it must be done very carefully.
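Without a framework, the underlying technique looks roughly like this; a minimal sketch where loadPage() and the "page" hash parameter are made up for illustration:

// Minimal sketch; loadPage() and the "page" hash parameter are placeholders.
function loadPage(n) {
  // ...fetch and render page n of the table via Ajax...
}

// Pager clicks record the current page in the hash (no full reload).
function goToPage(n) {
  location.hash = 'page=' + n;
}

// Hash changes (pager clicks, Back/Forward, refresh) drive the content.
window.addEventListener('hashchange', () => {
  const match = location.hash.match(/page=(\d+)/);
  loadPage(match ? Number(match[1]) : 1);
});

// Restore the correct page on initial load.
window.dispatchEvent(new Event('hashchange'));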
There is not really a quick and easy way to do this; here is an article on the topic. The problem is that not only does the Ajax code have to generate the URLs, it also has to take those URLs into account when loading the page in order to serve the appropriate content.
I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that ?src= is telling the destination site that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program-generated links?
EDIT: OK, sorry, I'm not being clear enough. I understand the GET syntax with a question mark and parameters separated by ampersands. I'm wondering what this special src parameter is. Why would one site link to another and tack an src parameter on the end even though there's no indication that the destination site normally uses it?
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons the src is called out explicitly, but in general it is easier and more reliable to trust a query string to determine the referrer than it is to trust the Referer header, since the latter is often broken, deliberately or not. On the other hand, browsers almost never break the query string in a URL, since this, unlike referrers, is pretty important for pages to function. Besides, a referrer is often sent without any deliberate action on the part of the site doing the referring, which some users dislike.
The reason (I do it) is that popular analytics tools sometimes make it easier to filter on query strings than referrers.
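If you do decide to tag your own generated links, it can be as simple as appending the parameter. A hypothetical sketch using the URL from the question (the parameter name and value are whatever the analytics setup expects):

// Hypothetical example: tag an outbound link with a src parameter.
const url = new URL('http://moms4mom.com/');
url.searchParams.set('src', 'stackexchangesites');  // name/value are up to the linking site
console.log(url.toString());  // http://moms4mom.com/?src=stackexchangesites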
There is no standard for the src parameter. Each site has its own, and it's usually up to the site that receives the link to define how it wants to read it (as usually that's the site that's going to pay for the click).
The second is a dynamic link: a URL that a server-side language (like ASP or PHP) interprets as something to do, like in those Google URLs. But I have never used this site (foo.com), so I don't know much about this particular parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, it is just used to pass extra, optional info (such as advertising or tracking info). In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments, I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things; it is usually used for passing information to the server-side code for that URL, but it can also be used in JavaScript.
For more info, see Query String.
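For instance, here is a small client-side sketch of reading such a parameter (using the src name from the question; URLSearchParams does the parsing):

// Read the query string in the browser; 'src' follows the question's example.
const params = new URLSearchParams(window.location.search);
const src = params.get('src');  // e.g. "stackexchangesites", or null if absent
if (src) {
  console.log('Referred from: ' + src);
}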