How to solve a captcha with phantomjs? - parsing

My parser uses PhantomJS: it sends many POST requests to a website and reads the page contents. From time to time this website displays a captcha that I have to solve.
So the question is: is there any way to solve captchas manually (e.g. in another web browser such as Firefox) when the website displays them? For example, when my program receives the captcha request, it pauses (e.g. waits for a keypress), I open the webpage in another web browser, solve the captcha manually, and then resume the program.
Will it work? If not, what is the way to make it work?
I haven't tried it yet, because I'm afraid of a possible ban by the website I work with.

Related

Sending data to a form, but can't work out encrypted POST data - workaround

I'm trying to send some data to a form on a site where I'm a member using cURL, but when I look at the headers being sent, they seem to have been encrypted.
Is there a way I can get around this by making the computer / server visit the site, actually add the data to the inputs on the form and then hit submit, so that it generates the correct data and posts the form?
You have got a few options:
reverse engineer the JavaScript that does the encryption (or possibly just encoding) process
get a browser engine (e.g. the Gecko engine), and add some scripting to it to fill in the forms and push the submit button - of course you would need JavaScript support within the page itself
parse the HTML using an HTML parser, feed the JavaScript in it to a JavaScript runtime with the correct libraries, fill in the "form" and hit the submit button
It's probably easiest to go for the first option. The JavaScript must be in the open to be able to be executed in the browser. But it may take some time to reverse-engineer as it is likely obfuscated.
You can use a framework to automate user interaction on the web pages, like Selenium.
This would enable you to not bother reverse engineering anything.
Selenium has bindings in various languages, including Python and Java.
Provided the JavaScript is visible on the website in question, you should be able to simply copy and paste their encryption routines to prepare the headers exactly as they do.
A hacky fix, if you can isolate the function that encodes the data you type in the form, is to use something like PyV8 to execute the JS inside Python.
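As a rough sketch of the Selenium approach, assuming the Selenium 4 Python bindings: the helper below loads the page in a real browser, types values into inputs located by their name attribute, and clicks the submit button, so the page's own JavaScript produces the POST data. The function name fill_and_submit, the default "[type=submit]" selector and any field names passed in are placeholders for illustration, not details of the site in question.

```python
def fill_and_submit(driver, url, fields, submit_css="[type=submit]"):
    """Load `url`, fill each named input, click the submit button.

    `driver` is expected to be a Selenium WebDriver instance.
    `fields` maps input `name` attributes to the text to type (hypothetical names).
    `submit_css` is an assumed CSS selector for the form's submit button.
    """
    driver.get(url)
    for name, value in fields.items():
        # "name" is the locator strategy string behind selenium's By.NAME.
        element = driver.find_element("name", name)
        element.clear()
        element.send_keys(value)
    # Clicking in the browser lets the page's own scripts encode the data.
    # "css selector" is the strategy string behind selenium's By.CSS_SELECTOR.
    driver.find_element("css selector", submit_css).click()
    # No explicit wait is done here; dynamic pages may need one before reading.
    return driver.page_source
```

With Selenium installed, a driver such as webdriver.Firefox() could be passed in as driver; remember to call driver.quit() when finished.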
Use AutoHotKeyIt and actually have it use the browser normally. It can read from files and repeat tasks indefinitely. You can also set a flag to make it act only within that application, which means you can have it minimized and yet still perform the action.
You seem to be having issues with them encrypting the headers and such, so why not simply use that to your advantage? You're still pushing the same data in, but now you're working around their system, with little to no side effect to you.

How is this URL modification possible?

Could anyone please explain how the site http://www.outsharked.com/imagemapster/default.aspx?what.html works this way, modifying the URL without loading or reloading the page? I don't think this is done with HTML5, because it works in IE6, which doesn't support HTML5.
I created that site. The commenter is correct, it uses Javascript to change the URL. There's nothing about how that navigation works that is different for IE6 - that browser supports the necessary client-side functionality to do this kind of thing. The basic functionality involves:
capturing click events on the nav, and loading the inner content via AJAX
updating the URL to reflect a working direct URL to the target.
The links also are valid anchor links that, in the absence of Javascript, would go to the same page (but load the whole thing). This is your basic AJAX web site setup with one minor difference. It's common practice to use URLs like this in AJAX/single page web sites:
http://mysite.com/home#somepage
or even just
http://mysite.com/#somepage
Where the hashtag part represents the actual page a user has navigated to. If someone accessed that url directly, e.g. from outside the site, the site would use Javascript to load the correct content based on the hashtag, after the page had loaded. This means that there might be a little delay for the inner content to reflect the correct page, since it has to run another request after the initial page has loaded from the browser to get the inner content via AJAX.
I was trying to avoid that by creating a setup that worked completely with and without Javascript. If you go directly to a URL within the site such as http://www.outsharked.com/imagemapster/default.aspx?faq.html you will notice it loads the content directly. This URL will work even if Javascript is disabled. You can't actually do this using hashtags, since hashtag content is not sent to the server. Only the client knows what's after the hashtag in a URL. That's why I was using query strings to represent inner pages.
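To make the point about hashtags concrete, here is a small Python sketch (the helper name request_target is made up for illustration) that computes the part of a URL a browser actually sends to the server. The fragment after # is dropped, while a query string is kept:

```python
from urllib.parse import urlsplit

def request_target(url):
    """Return the path (plus query string) that goes into the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately left out: it never leaves the client.
    return target

print(request_target("http://mysite.com/home#somepage"))
# -> /home
print(request_target("http://www.outsharked.com/imagemapster/default.aspx?faq.html"))
# -> /imagemapster/default.aspx?faq.html
```

In the first case the server cannot tell which inner page was requested; in the second it can, which is why the query-string form still works with Javascript disabled.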
This site architecture was sort of an experiment at the time. It works pretty well but the code isn't fantastic, I didn't really do anything else with it, and I'm sure there are other better-fleshed-out/tested/full-featured frameworks out there to do much the same thing.
But it might not be a bad example of the nuts and bolts of creating a basic AJAX navigation setup, as a learning tool, since it's pretty concise, and also does HTML5 history navigation (e.g. so the back button works on modern browsers).

How could I make an app log in to a website and get info in the background?

I think I am mostly struggling with this problem because I do not know what to search for.
I want to make an app that allows the user to enter their gift card number and use that number to log in to this website:
https://www.getmybalance.com
I have no idea how to do this without control over the website. Is it even possible to do so?
I don't want to use a UIWebView to show the page.
You should read up on NSURLConnection; you're going to have to execute a POST request to log in. Then you're going to have to determine whether or not you logged in successfully, probably by parsing the returned page. NSURLConnection will handle storing the login cookie the site returns. After you've logged in, you're probably going to need to execute another POST request to query their system. Once again, you will probably have to parse the result out of the HTML page that is returned.
NSURLConnection:
https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/nsurlconnection_Class/Reference/Reference.html
NSURLConnection Delegate Protocol:
https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSURLConnectionDelegate_Protocol/Reference/Reference.html#//apple_ref/occ/intf/NSURLConnectionDelegate
This all of course assumes that this website doesn't have an API you can use.
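The flow described in this answer (POST the login form, keep the cookie, POST again, parse the returned HTML) is not specific to iOS. As a language-neutral illustration, here is a minimal sketch in Python rather than Objective-C, using only the standard library. The helper names and the card_number field are hypothetical; the real form fields would have to be read from the site's login page.

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def build_post(url, fields):
    """Build a form-encoded POST request for the given fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")

def post_form(opener, url, fields, timeout=10):
    """Send the POST and return the response body as text for later parsing."""
    with opener.open(build_post(url, fields), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling post_form twice with the same opener (once for the login, once for the query) reuses whatever cookies the first response set, which mirrors the cookie handling the answer attributes to NSURLConnection.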
Looks like you need to programmatically POST over HTTPS to the server; you will then get back some DOM document, or JSON, or some weird thing, which you then parse.
POSTing with iOS is pretty easy; look at something like LRResty https://github.com/lukeredpath/LRResty or similar.
When you get the data back, the first thing to do is look at it with NSLog. Then, if the data is HTML, you will need to wade into the HTML to get the result.
The problem with that approach is that the company hosting the page may change their API at any time. You should ask them either to never change anything (if they want to change, they should make a new page and leave the old one working) or, better, to support a simple REST API, which would also help them build nice AJAX/HTML5 web sites in the future.

Quickly and accurately grabbing webpage titles

I'm looking to get the title of a webpage, a common feature of many IRC bots that I want to incorporate into an IRC client I'm writing for fun.
The method that I currently have working basically connects and sends a GET request for the entire webpage, then seeks out the <title> tags and reads what is between them. For larger webpages this can be slower than I'd like. An additional problem I've noticed is that webpages with dynamic titles (such as some phpBB forums) will not return the accurate title as it would show in a browser, because I don't execute any JavaScript etc.
It seems one way to get an accurate title is to dump the HTML into a browser control (such as the IE COM control) and pull the title, but this is just going to make it even more time consuming.
Is there a simple method I am unaware of?
In a word, no, not really.
I guess rather than downloading the whole document you could stream the HTTP file into your application and just stop downloading when you reach </title> - that would save you waiting for the whole HTML document to download.
However that doesn't help the situation if you need to read the title after it's been changed by some client-side javascript. As you say, the only way I can think of doing that is by using a browser control.
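The "stop at </title>" idea from this answer could look roughly like the following Python sketch. It reads a binary stream in small chunks and returns as soon as a complete title element has arrived. The function name read_title, the chunk size, the 64 KB cut-off and the UTF-8 decoding are all assumptions made for illustration; HTML entities are not decoded, and titles set later by client-side JavaScript are still out of reach.

```python
import re

# Matches a complete <title>...</title> element, case-insensitively.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)

def read_title(stream, chunk_size=1024, max_bytes=65536):
    """Read from a binary file-like object until a complete title is seen.

    Returns the title text, or None if none appears within max_bytes.
    """
    buffer = b""
    while len(buffer) < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of document
        buffer += chunk
        match = TITLE_RE.search(buffer)
        if match:
            # Stop here: the rest of the document is never read.
            raw = match.group(1).decode("utf-8", errors="replace")
            return " ".join(raw.split())  # collapse whitespace and newlines
    return None
```

Any object with a read(n) method works as stream, including the response returned by urllib.request.urlopen, so the download can be abandoned once the title has been found.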

Prevent URL being stored to browser history

I need to open a popup window to a url with certain parameters. The parameters contain information that I would like to prevent from showing up in the browser history. The url points to a 3rd party site and I can't affect the way those parameters are transferred to them (can't use POST for example).
Currently I have worked around this so that I have a page on our server that loads the content of the third party page to an iframe and this seems to work.
However, I was wondering if there are any other ways of doing this and are they maybe somehow better or worse? Javascript or something? The negative side of this iframe thing is that it is not XHTML Strict compliant, which is something we are aiming for.
There are other similar questions here but I couldn't find a good answer.
Edit: Apparently this does not work as expected in IE. It might be that I still keep the solution for another reason, but it would be nice to know if there is a "bulletproof" solution.
