I recently started learning web scraping. For this I need to understand URLs and their basic structure. As an exercise, I looked at two URLs, one from Amazon and one from Priceline.
Some basic concepts of URLs:
A query string comes at the end of a URL, starting with a single question mark, “?”.
Parameters are provided as key-value pairs and separated by an ampersand, “&”.
The key and value are separated using an equals sign, “=”.
Most web frameworks will allow us to define “nice looking” URLs that just include the parameters in the path of a URL.
Amazon URL
https://www.amazon.com/books-used-books-textbooks/b/?ie=UTF8&node=283155&ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
As per my understanding:
Parameters
ie=UTF8
node=283155
ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
Key Values
ie UTF8
node 283155
ref_ nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230
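For what it's worth, the same breakdown can be reproduced with Python's standard library. This is only a minimal sketch of splitting the URL into its parts; the variable names are my own, and it says nothing about what Amazon actually does with each parameter.

```python
from urllib.parse import urlsplit, parse_qsl

url = ("https://www.amazon.com/books-used-books-textbooks/b/"
       "?ie=UTF8&node=283155&ref_=nav_cs_books_788dc1d04dfe44a2b3249e7a7c245230")

# Split the URL into scheme, host, path, query string and fragment.
parts = urlsplit(url)

# Turn "key=value&key=value" into a dictionary of parameters.
params = dict(parse_qsl(parts.query))

print(parts.scheme)   # protocol
print(parts.netloc)   # host name
print(parts.path)     # path before the "?"
for key, value in params.items():
    print(key, value)
```

The path (`/books-used-books-textbooks/b/`) is worth looking at too, since sites often carry information there as well as in the query string.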
Priceline URL
https://www.priceline.com/relax/in/3000005381/from/20210310/to/20210317/rooms/1?vrid=16e829a6d7ee5b5538fe51bb7e6925e8
This URL is for a hotel booking in Chicago from 03/10/2021 to 03/17/2021.
As per my understanding:
key values
from 20210310 (2021-03-10)
to 20210317 (2021-03-17)
rooms 1
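Here the parameters live in the path rather than in the query string. A rough sketch of pulling them out follows. It assumes that everything after the leading `relax` segment alternates between a key and its value, which is my reading of this one URL and not documented Priceline behavior; under that reading `in` would be a fourth key, and `vrid` is the only real query parameter.

```python
from datetime import datetime
from urllib.parse import urlsplit, parse_qsl

url = ("https://www.priceline.com/relax/in/3000005381/from/20210310"
       "/to/20210317/rooms/1?vrid=16e829a6d7ee5b5538fe51bb7e6925e8")

parts = urlsplit(url)
segments = [s for s in parts.path.split("/") if s]

# Assumption: after "relax", the path alternates key, value, key, value...
path_params = dict(zip(segments[1::2], segments[2::2]))

# The dates look like YYYYMMDD, so parse them into real date objects.
check_in = datetime.strptime(path_params["from"], "%Y%m%d").date()
check_out = datetime.strptime(path_params["to"], "%Y%m%d").date()
nights = (check_out - check_in).days

# The part after "?" is an ordinary query string.
query_params = dict(parse_qsl(parts.query))

print(path_params)
print(check_in, check_out, nights)
print(query_params)
```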
I could not work out anything more than that. Am I missing something? Can these URLs be analyzed more precisely?
Tips that may help are:
Data can be sent via GET or POST. What you are describing with URLs is GET. With POST, the data travels in the request body, so you don't see it in the URL.
In both cases getting familiar with using your browser's developer console will help you explore how websites work. In Chrome, you can hit F12 or right click any element and select "inspect element." This is especially helpful when trying to inspect data that is passed using POST, since you can't see them in the url. Use the "network" tab while clicking around to see what the website is doing in the background.
Lastly, just play around with websites. For example, when you browse Amazon you might notice the urls look like https://www.amazon.com/Avalon-Organics-Creme-Radiant-Renewal/dp/B082G172GL/?_encoding=UTF8 but if you play around with it you notice you can delete out the title and the url still works like this: https://www.amazon.com/dp/B082G172GL
Related
I'm trying to add a parameter to the URL, like in this example:
https://www.google.com/ > https://www.google.com/search?q=qq
Opening the last link you can see "qq" in the "q" input.
For this site it doesn't work (this is the problem):
https://www.calabriasue.it/assistenza/richiesta-assistenza-e-supporto/
https://www.calabriasue.it/assistenza/richiesta-assistenza-e-supporto/?nome=mario
Can I add a URL parameter to the last one as well? I need it.
Thanks!
I tried using different input names, different parameters, etc., but it doesn't work.
Google's server-side code is designed to generate an HTML document with an input field that is prefilled with the current search term, which it reads from the URL. That is why adding q=search+term to the URL populates the input field.
You can't make arbitrary third-party websites prefill inputs. They have to explicitly provide a mechanism to make it possible.
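To make the distinction concrete: building a URL with a query string is purely a client-side string operation, and anyone can do it for any site. Whether the page does anything with the parameter is up to the server. A small sketch, where `build_url` is just a helper name I made up:

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Append params to base as a query string (base is assumed to have none yet)."""
    return f"{base}?{urlencode(params)}"

# Google's search page reads "q", so this URL prefills the search box.
google = build_url("https://www.google.com/search", {"q": "search term"})

# This URL is just as valid, but it only has an effect if the site's own
# code looks for a parameter called "nome".
other = build_url(
    "https://www.calabriasue.it/assistenza/richiesta-assistenza-e-supporto/",
    {"nome": "mario"},
)

print(google)
print(other)
```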
Parameters only work as long as the code for the target website is expecting to handle a parameter named "nome" with a value "mario". In the case of the google website, it is expecting a parameter named "q" and has a form input for it.
Clicking a URL sends a GET request, and the target site may only accept parameters from a POST request. You could consider using the application known as "Postman" to help with that.
Alternately, the target page you are viewing may be forwarded / routed from a different page which accepts parameters.
Our site is migrating from MovableType to ExpressionEngine, and there is one small issue we are having. MT uses a date based URL structure, e.g. www.site.com/2012/03/post-title.html, while EE uses a category based structure, e.g. www.site.com/index.php/news/comments/post-title. The issue is that our MT page used Disqus for comments, and as such comments are tied to a specific URL, meaning that we'd lose all of our comments if we were to migrate. I am wondering if there's a way to change the URL structure in EE to match MT's, thus allowing us to keep the comments. Thanks in advance.
Correction: EE uses a Template Group/Template based structure for URLs, not categories - just to clarify.
You've got a couple of options here.
One is to create an .htaccess rule which internally redirects all requests matching YYYY/MM/ to your EE template which displays your posts (say, /news/entry/). I don't know exactly what those rewrite rules would look like off the top of my head, my mod_rewrite-fu is pretty shallow. But it could definitely work.
Another is to export all of your comments from Disqus via their XML export tool, then do a grep-based find and replace using something like BBEdit, replacing all /YYYY/MM/ strings in that file with /news/entry/; delete all of your existing comments on Disqus; then import your newly modified XML file.
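If you'd rather script the find and replace than do it in an editor, something along these lines should work. It is only a sketch: the sample line is invented for illustration and is not the real Disqus export format, and the target prefix `/news/entry/` is the example template path from above.

```python
import re

# Matches a date-based path fragment such as "/2012/03/".
DATE_PATH = re.compile(r"/\d{4}/\d{2}/")

def migrate_urls(text, new_prefix="/news/entry/"):
    """Replace every /YYYY/MM/ fragment in text with new_prefix."""
    return DATE_PATH.sub(new_prefix, text)

# Made-up sample line, standing in for one entry of the exported file.
sample = "<link>http://www.site.com/2012/03/post-title.html</link>"
migrated = migrate_urls(sample)
print(migrated)
```

Note that this leaves the trailing `.html` in place, so you would need a further replacement if your EE URLs don't end in `.html`. Run it on a copy of the export and check the result before deleting anything on Disqus.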
I'm designing a permalink system and I just noticed that Twitter and Hipmunk both prefix their permalinks with #!. I was wondering why this is, and if the exclamation point in particular is there for a reason. Wouldn't #/ work just as well, since they're no doubt using a framework that lets them redirect queries to certain templates with a regex URL parser?
http://www.hipmunk.com/#!BOS.SEA,Dec15.Jan02
http://twitter.com/#!/dozba
My only guess is it's because browsers use # to link to an anchor element. Is this why the exclamation point is appended?
This is done to make an "AJAX" page crawlable [by Google] for indexing. It does not affect the other well-defined semantics of the fragment identifier at all!
See Making AJAX Applications Crawlable: Getting Started
Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.
I am sure other search-engines are also following this lead/protocol.
Happy coding.
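As far as I recall from that (since deprecated) specification, the "slightly modified form" means the crawler moves everything after `#!` into a query parameter named `_escaped_fragment_`. The sketch below approximates that mapping; the function name is mine, and the escaping is a simplification of the exact character list in the spec.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_crawler_url(pretty_url):
    """Approximate the URL a crawler would request for a #! URL."""
    parts = urlsplit(pretty_url)
    if not parts.fragment.startswith("!"):
        return pretty_url  # not a hash-bang URL, nothing to rewrite

    # Percent-escape characters such as "&", "+", "#", "%" and spaces;
    # the characters listed in `safe` are left as they are.
    escaped = quote(parts.fragment[1:], safe="!$'()*,/:;=?@")
    extra = "_escaped_fragment_=" + escaped
    query = parts.query + "&" + extra if parts.query else extra

    # Rebuild the URL with the new query string and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

print(to_crawler_url("http://twitter.com/#!/dozba"))
print(to_crawler_url("http://www.hipmunk.com/#!BOS.SEA,Dec15.Jan02"))
```

The point is that the fragment never reaches the server in a normal request, so the crawler needs some agreed convention like this to ask for the content behind it.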
Also, it is actually perfectly valid, at least per HTML5, to have an element with an ID of "!foo", so the reasoning in the post is invalid. See the article "The id attribute just got more classy":
HTML5 gets rid of the additional restrictions on the id attribute. The only requirements left — apart from being unique in the document — are that the value must contain at least one character (can’t be empty), and that it can’t contain any space characters.
My guess is that both pages use this in their JavaScript to differentiate between # (a link to an anchor) and their custom #! which loads some additional content using Ajax.
In that case pretty much everything else would work after the # sign.
I have a news section where the pages resolve to urls like
newsArticle.php?id=210
What I would like to do is use the title from the database to create seo friendly titles like
newsArticle/joe-goes-to-town
Any ideas how I can achieve this?
Thanks,
R.
I suggest you actually include the ID in the URL, before the title part, and ignore the title itself when routing. So your URL might become
/news/210/joe-goes-to-town
That's exactly what Stack Overflow does, and it works well. It means that the title can change without links breaking.
Obviously the exact details will depend on what platform you're using - you haven't specified - but the basic steps will be:
When generating a link, take the article title and convert it into something URL-friendly; you probably want to remove all punctuation, and you should consider accented characters etc. Bear in mind that the title won't need to be unique, because you've got the ID as well
When handling a request to anything starting with /news, take the next part of the path, parse it as an integer and load the appropriate article.
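The two steps above could be sketched roughly like this. I'm using Python purely for illustration since no platform was specified; the function names and the exact `/news/<id>/<title>` layout are assumptions, not anything a framework requires.

```python
import re
import unicodedata

def slugify(title):
    """Convert an article title into a URL-friendly slug."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of anything that isn't a letter or digit with one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

def article_url(article_id, title):
    """Step 1: generate the link, ID first, then the readable title."""
    return f"/news/{article_id}/{slugify(title)}"

# Step 2: when routing, only the numeric ID matters; the title part is optional.
ROUTE = re.compile(r"^/news/(\d+)(?:/[^/]*)?/?$")

def article_id_from_path(path):
    """Return the article ID from a request path, or None if it doesn't match."""
    match = ROUTE.match(path)
    return int(match.group(1)) if match else None

print(article_url(210, "Joe Goes to Town!"))
print(article_id_from_path("/news/210/an-older-title"))
```

Here punctuation is turned into hyphens rather than dropped outright, which is one choice among several. Because the router ignores the title, old links keep working after a title is edited.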
Assuming you are using PHP and can alter your source code (this is quite mandatory to get the article's title), I'd do the following:
First, you'll need to have a function (or maybe a method in an object-oriented architecture) to generate the URLs for you in your code. You'd supply the function with the article object or the article ID and it returns the friendly URL with the ID and the friendly title.
Basically function url(Article $article) => URL.
You will also need some URL rewriting rules to remove the PHP script from the URL. For Apache, refer to the mod_rewrite documentation for details (RewriteEngine, RewriteRule, RewriteCond).
Suppose you are working on an API, and you want nice URLs. For example, you want to provide the ability to query articles based on author, perhaps with sorting.
Standard:
GET http://example.com/articles.php?author=5&sort=desc
I imagine a RESTful way of doing this might be:
GET http://example.com/articles/all/author/5/sort/desc
Am I correct? Or have I got this REST thing all wrong?
I'm afraid your question really misses the point of REST. From a purely theoretical perspective there is absolutely no advantage or disadvantage to either of those urls from a REST perspective. In practice, those urls may behave differently with different caches, and certainly server frameworks are going to parse them differently. Despite what you hear from the framework developers, there is no such thing as a RESTful URL.
From the perspective of REST those two URLs are simply identifiers that can be dereferenced. If you want to start building REST apis that will benefit from the characteristics described in the dissertation, you need to start thinking in terms of content that is returned when you dereference the URL and how that content is linked together using URLs embedded in the content.
I realize this does not help you much in trying to resolve what you consider to be your problem. What I can tell you is that one of the major intents of REST is to allow your URLs to be completely under the control of the server, so that they can change without impacting your client applications. Therefore, my recommendation is to pick whatever URL structure works most easily with the framework you are using to serve the resource representations. Certainly do not look to the REST dissertation to tell you what is the right and wrong way of formatting your URLs, and anyone who tells you that your URLs are not RESTful is confused. Probably what they are telling you is that the server framework they are used to using for creating RESTful interfaces requires URLs to be structured this way.
It's not what your URI looks like that matters, it is what you do with it that matters.
Using a query string is not more or less RESTful than using path components. The URI Generic Syntax (RFC 3986, January 2005) defines that they're just as important in identifying the resource. So yes, as others point out, it's not important to REST. (Note that in the obsoleted-by-RFC-3986 RFC 2396, the query string was not defined to be identifying the resource, but rather a string of information to be interpreted by the resource.)
However, URI design is important, because as an owner of a URI namespace (i.e. the holder of the domain name where the URIs will live) you want the URIs to be long lived. As wise men have stated earlier: Cool URIs don't change!
The choice of using query strings vs path components depends on how your resources are identified, and how they will be identified in years to come. If there's a hierarchy that stands out, then it might be that this should be reflected in the URI, at least if that hierarchy is relatively permanent, and that things don't move around all the time.
It's also important to note that the actual URIs are only meaningful to two parties:
Servers, who need to forge and parse URIs
Human beings, who might see a URI in passing and learn things from it.
By contrast, client applications are usually not allowed to do URI introspection. So your choice of query strings vs path components boils down to what you think you can live with ten (or 100) years from now.
You are mostly right. The thing with REST APIs is to focus on the nouns.
What does the noun all do in this case? Wouldn't you expect your API to always return all articles, unless you filter it?
I would make sort a query string parameter. Further, I would make any and all filtering query string parameters. If you look at how Stack Overflow is implemented, when you click on the "Newest" questions link, you get a query string to filter the questions.
So perhaps something like:
GET http://example.com/articles/authors/5?sort=desc
But also think about what happens with each URL:
GET http://example.com/articles/ might return all current articles
GET http://example.com/articles/authors/ What does this URL do? Does it return all authors of all articles, or does it return all the articles for all authors (which is essentially the same functionality as the URL above)?
GET http://example.com/articles/authors/5/ might return all articles by author 5, or does it return author 5's information?
I would maybe change it to:
http://example.com/articles returns all articles
http://example.com/articles/5 returns all articles from author 5
http://example.com/authors returns all authors
http://example.com/authors/5 returns information for author 5
Alan is mostly right but his URLs are misleading. I believe the correct routes / urls should reflect the following behavior:
[GET] http://domain.com/articles #=> returns all articles (index action)
[GET] http://domain.com/articles/5 #=> returns article ID 5 (show action)
[GET] http://domain.com/authors #=> returns all authors (index action)
[GET] http://domain.com/authors/5 #=> returns author ID 5 (show action)
[GET] http://domain.com/authors/5/articles OR http://domain.com/articles/authors/5 #=> depending on the hierarchy of your routes (both belong to the index action)
Best regards,
DBA
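For anyone who wants to see the routes listed in this answer as something executable, here is a minimal, framework-free sketch of a route table. The action names (`articles.index` and so on) and the `resolve` helper are invented for illustration; a real framework would give you its own way of declaring these.

```python
import re

# Each entry pairs a path pattern with a hypothetical "resource.action" name.
ROUTES = [
    (re.compile(r"^/articles/?$"), "articles.index"),
    (re.compile(r"^/articles/(?P<id>\d+)/?$"), "articles.show"),
    (re.compile(r"^/authors/?$"), "authors.index"),
    (re.compile(r"^/authors/(?P<id>\d+)/?$"), "authors.show"),
    # Nested route: articles filtered by author, still an index action.
    (re.compile(r"^/authors/(?P<author_id>\d+)/articles/?$"), "articles.index"),
]

def resolve(path):
    """Return (action, params) for the first matching route, else None."""
    for pattern, action in ROUTES:
        match = pattern.match(path)
        if match:
            params = {name: int(value) for name, value in match.groupdict().items()}
            return action, params
    return None

for path in ["/articles", "/articles/5", "/authors", "/authors/5", "/authors/5/articles"]:
    print(path, resolve(path))
```

Sorting and other filters would then stay in the query string, for example `/authors/5/articles?sort=desc`, without needing extra routes.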