web scraping/parsing of college course site - parsing

Trying to parse/scrape the course site for memphis. The site is "https://spectrumssb2.memphis.edu/pls/PROD/bwckgens.p_proc_term_date". It appears to be some sort of javascript issue, or dynamic generation of the text. I can see the underlying DOM structure using livehttpdheaders/Firefox, but not when I simply view the underlying source/text of the page..
Thoughts/Comments/Pointers would be appreciated...

Well this modern days the site may be assembled in few steps. First the main structure is pulled in and then, often based on identity of the user additional AJAX calls are executed. Your best bet is to sniff HTTP to see what kind of requests are issued between the site is initially requested and when it's fully built
Since you are using firebug you can get HttpFox add-on which gives you what you need

Related

Reverse engineering website : cannot find the form inputs in the post request

I need to interact with an external website I don't own. This external website requires credentials that I have. My goal is to add a user but the external website does not offer an external API. It looks like they are using Vaadin.
So to add a new user I need to manually fill in a form. Yet I have been searching for the route the "form" takes to post the input I give but could not find any.
Here is my issue : when I look at the HTML source code in browser I cannot see any form tag. The buttons have all the same id "button". When I fill in the form and look at the network tab in the developer tools, in the "parameters" section I cannot see the inputs I just gave although the POST request does appear. The cookies tab does not show the inputs either.
Consequently my questions are : why can't I find the inputs in the POST request and where can they be ?
Please note : this external website is a medical site so I prefer not share the url and they don't offer a mobile app, so there is no mobile API I could reverse engineer.
Any help appreciated :-)
Not stating the Vaadin version makes that a tad harder give an exact
answer, but at the core both the Vaadin 8 and 10+ behave the same way.
And the short answer to your question is: without another entry-point,
like an API, this can not simply be done using just some POST-Request.
Vaadin is not simply a html-form/request/response-html based framework;
it holds the scenegraph on the serverside in a session. All
communication is done via a single endpoint to the server and state
changes only are communicated back to the client.
For what you are after, your best bet is to use test automation
frameworks like selenium, geb, cypress, ...

Sending data to form, but cant work out encrypted post data - work around

Im trying to send some data to a form on a site were im a member using cURL, but when i look at the headers being sent, they seem to have been encrypted.
Is there a way i can get around this by making the computer / server visit the site and actual add the data to the inputs on the form and then hit submit, so that it would generate the correct data and post the form ?
You have got a few options:
reverse engineer the JavaScript that does the encryption (or possibly just encoding) process
get a browser engine (e.g. the Gecko engine), and add some scripting to it to fill in the forms and push the submit button - of course you would need JavaScript support within the page itself
parse the HTML using an HTML parser, feed the JavaScript in it to a JavaScript runtime with the correct libraries, fill in the "form" and hit the submit button
It's probably easiest to go for the first option. The JavaScript must be in the open to be able to be executed in the browser. But it may take some time to reverse-engineer as it is likely obfuscated.
You can use a framework to automate user interaction on the web pages, like Selenium.
This would enable you to not bother reverse engineering anything.
Selenium has binding in various languages, including Python and java.
Provided the javascript is visible on the website in question, you should be able to simply copy and paste their encryption routines to prepare the headers exactly as they do
A hacky fix if you can isolate the function that encodes the data you type in the form - is to use something like PyV8 to execute the JS inside python.
Use AutoHotKeyIt and actually have it use the Browser Normally. It can read from files, and do repetitive tasks infinitely. Also you can push a flag to make it only happen within that application, which means you can have it minimized and yet still preform the action.
You seem to be having issues with the problem of them encrypting the headers and such, so why not simply use that too your advantage? Your still pushing the same data in, but now your working around their system. With little to no side effect too you.

How is this URL modification possible?

Could anyone please tell how the site http://www.outsharked.com/imagemapster/default.aspx?what.html is working in such way? Modifying the url without loading/reloading the page. I think this is not done by html5. Because it works in IE6 which doesn't support html5.
I created that site. The commenter is correct, it uses Javascript to change the URL. There's nothing about how that navigation works that is different for IE6 - that browser supports the necessary client-side functionality to do this kind of thing. The basic functionality involves:
capturing click events on the nav, and loading the inner content via AJAX
update the URL to reflect a working direct URL to target.
The links also are valid anchor links that, in the absence of Javascript, would go to the same page (but load the whole thing). This is your basic AJAX web site setup with one minor difference. It's common practice to use a URLs like this in AJAX/single page web sites:
http://mysite.com/home#somepage
or even just
http://mysite.com/#somepage
Where the hashtag part represents the actual page a user has navigated to. If someone accessed that url directly, e.g. from outside the site, the site would use Javascript to load the correct content based on the hashtag, after the page had loaded. This means that there might be a little delay for the inner content to reflect the correct page, since it has to run another request after the initial page has loaded from the browser to get the inner content via AJAX.
I was trying to avoid that by creating a setup that worked completely with and without Javascript. If you go directly to a URL within the site such as http://www.outsharked.com/imagemapster/default.aspx?faq.html you will notice it loads the content directly. This URL will work even if Javascript is disabled. You can't actually do this using hashtags, since hashtag content is not sent to the server. Only the client knows what's after the hashtag in a URL. That's why I was using query strings to represent inner pages.
This site architecture was sort of an experiment at the time. It works pretty well but the code isn't fantastic, I didn't really do anything else with it, and I'm sure there are other better-fleshed-out/tested/full-featured frameworks out there to do much the same thing.
But it might not be a bad example of the nuts and bolts of creating a basic AJAX navigation setup, as a learning tool, since it's pretty concise, and also does HTML5 history navigation (e.g. so the back button works on modern browsers).

Why do some websites have "#!" in the URL [duplicate]

I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, like for a certain Ajax framework or something since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
The octothorpe/number-sign/hashmark has a special significance in an URL, it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of an URL. If you use Wikipedia, you will see that most pages have a table of contents and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';) to a different URL or without specifying an anchor, the browser will load the entire page from the URL. Try this in Firebug or Safari's Javascript console. Load http://minimal-github.gilesb.com/raganwald. Now in the Javascript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of the The Single Page Interface Manifesto cited by raganwald
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used in FaceBook and Twitter is the use of hash # in URLs
The character ! is added only for Google purposes, this notation is a Google "standard" for crawling web sites intensive on AJAX (in the extreme Single Page Interface web sites). When Google's crawler finds an URL with #! it knows that an alternative conventional URL exists providing the same page "state" but in this case on load time.
In spite of #! combination is very interesting for SEO, is only supported by Google (as far I know), with some JavaScript tricks you can build SPI web sites SEO compatible for any web crawler (Yahoo, Bing...).
The SPI Manifesto and demos do not use Google's format of ! in hashes, this notation could be easily added and SPI crawling could be even easier (UPDATE: now ! notation is used and remains compatible with other search engines).
Take a look to this tutorial, is an example of a simple ItsNat SPI site but you can pick some ideas for other frameworks, this example is SEO compatible for any web crawler.
The hard problem is to generate any (or selected) "AJAX page state" as plain HTML for SEO, in ItsNat is very easy and automatic, the same site is in the same time SPI or page based for SEO (or when JavaScript is disabled for accessibility). With other web frameworks you can ever follow the double site approach, one site is SPI based and another page based for SEO, for instance Twitter uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
To have a good follow-up about all this, Twitter - one of the pioneers of hashbang URL's and single-page-interface - admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
Article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.

Progressive enhancement - what to do when JavaScript is off?

I understand what progressive enhancement is, I'm just fuzzy on some of the details in actually pulling it off. Of course, that could be because I'm looking at it in the wrong way. Let me try to explain my difficulty with a hypothetical:
ASP.NET MVC site. I have a view that has tabbed navigation. Each tab is for a movie category/genre which displays 5-10 links to movies in that category. The movie data is obtained through Netflix's Odata.
My initial thought is to use Ajax to pull and parse the JSON from the proper OData GET requests when each tab is clicked. How would I provide a non-JavaScript version of that? Is it even possible?
For simpler requests where JSON isn't necessary - like, say, having a user log into the system - I see how I could simply set a cookie and dynamically change the page based on it to reflect the change. But what if I need to return and parse JSON? How do I provide an alternative?
The deal with progressive enhancement is that your server side must be fully capable of generating every last bit of HTML that appears in all of your pages. This is obvious, since otherwise (if JS is turned off) there will be no part of your application capable of doing said rendering.
Since the server side must know how to render everything, it doesn't make much sense to generate things (DOM elements/HTML) on the client side from JSON responses the server gives you. Why repeat yourself?
This brings us to the logical conclusion that when doing dynamic updates on the client, you need to get ready-made HTML from the server (since the rendering logic is over there) and insert it into the DOM as appropriate. You are then free to work on the newly inserted elements with jQuery and enhance them all you want.
So -- forget about parsing JSON on the client, otherwise you 're locking yourself out of progressive enhancement. If you want to call a third party, have the server be your intermediary: call the server with all the necessary information for it to call the third party and get ready-made HTML back.
If you do this, then the server can of course provide non-JS versions of everything on your site with no problem. Total non-reliance on JS achieved.
There is no JSON without JS, by definition (JavaScript Object Notation). Without JS you won't make AJAX calls. Your pages will render as is, just like oldschool sites.
If you need to do this progressively, you will have to call the odata service server-side, and provide .net objects to the site in viewdata, or your viewmodel, and have your views/partials render it.
In ASP.Net MVC actions, the httpcontext available via the controller will have a property on this path: this.HttpContext.Request.IsAjaxRequest() and can be used to test whether you want to return a view or just json data, or whatever type of ActionResult you want. This can be an excellent timesaver for building progressive enhancement style sites.

Resources