Connection refused while web scraping using HTMLUnit

I am trying to build a Java application to scrape a website using HTMLUnit. After extracting some data, the application encounters the following exception:
java.lang.RuntimeException: org.apache.http.conn.HttpHostConnectException: Connection to siteURL refused.
If I run the application again, it is able to extract some more data before failing with the same exception. Probably the server sees a lot of requests from the same client IP and refuses the connection.
Also, while the application is hitting this problem, I am still able to connect to the site using a browser.
How can I overcome this problem? How are such problems approached and resolved in web scraping applications?

This is how I debug such issues:
Download Fiddler.
By default, Fiddler listens on port 8888. All you have to do is configure the WebClient to use Fiddler as a proxy; all requests being sent can then be seen (and analyzed, modified and re-sent) in Fiddler.
client.getOptions().setProxyConfig(new ProxyConfig("127.0.0.1", 8888));
In my experience, the target site starts blocking the client after some time. You can try adding a pause between requests or rotating proxies and user agents. You can also try clearing cookies.
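The "adding a pause" suggestion can be sketched as a small retry helper that waits longer after each failure. This is only a minimal sketch: the class name RetryWithBackoff, its parameters, and the doubling delay are illustrative choices, not part of HtmlUnit, and suitable delays depend on the target site.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of HtmlUnit): retries a task, pausing between attempts.
final class RetryWithBackoff {
    private RetryWithBackoff() {}

    // Runs the task up to maxAttempts times, doubling the pause after each failure.
    static <T> T call(Callable<T> task, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                 // remember the failure, e.g. a refused connection
                if (attempt == maxAttempts) {
                    break;                // no attempts left, give up
                }
                Thread.sleep(delay);      // pause before the next attempt
                delay *= 2;               // back off: wait twice as long next time
            }
        }
        throw last;
    }
}
```

With HtmlUnit on the classpath, a page fetch could then be wrapped as RetryWithBackoff.call(() -> client.getPage(url), 5, 2000), where client is the WebClient from the snippet above and url is the page being scraped.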

Related

SignalR Issue when Load Balanced on Netscalers

We are attempting to deploy a SignalR site on a Citrix NetScaler, as opposed to the current deployment on a single server. There are three servers in the farm. If you navigate to any single server, SignalR comes up fine. If you go to the NetScaler address, you get this:
WebSocket connection to
'wss://mysite.com/myapp/signalr/connect?transport=webSockets&clientProtocol=1.5&connectionToken=(token_displayed_here)'
failed: Error during WebSocket handshake: net::ERR_CONNECTION_RESET
After this error, there is a delay of about 10-15 seconds, and then it starts working. I have read that NetScalers still have issues with WebSockets, but if I disable WebSockets, the error goes away while the delay remains. I believe the delay is caused by the client trying to connect with ServerSentEvents and failing at that as well. It appears that only long polling works over the NetScaler.
We have checked the NetScaler WebSocket settings, made sure the servers have the correct machine keys, set up a backplane (we tried Redis and an Oracle NuGet package, as Oracle is our typical DB), and checked the OWIN versions and web.config settings. That is everything I could find through Google, but we still get this error and delay. One thing I did find is that NetScalers have issues with wss, but I haven't been able to find anything about how to account for this. Most of the information I found was for people using other load balancing technology.
Is using SignalR (or more specifically, WebSockets or ServerSentEvents) with a NetScaler even doable, and if so what could be causing this problem?

WSO2 ESB connections issue

I am working with WSO2 ESB 4.9.0 and have around 160 HTTP services that are called frequently. Initially, when the server is started, everything is fine and all service requests and responses behave as expected.
After 10-12 days the ESB server hangs: it does not process any requests, and no exceptions appear in the log file. There may be some requests piled up in the server that prevent new connections from being processed.
When I restart the server, all the connections are released and it works for another 10-12 days.
But restarting the server is not a good solution. Where can I find these connections and close them, if possible? Am I missing any WSO2 ESB configuration changes?
I am trying to inspect the connection counts using JMX, and would also like to know whether anyone has faced this issue and found a possible solution.

sap.ui.model.odata.ODataModel returns 501 error on services.odata.org

I am trying to create an OData model in SAP UI5 this way:
new sap.ui.model.odata.ODataModel("http://services.odata.org/Northwind/Northwind.svc/");
but I am getting a 501 Not Implemented error!
Could you please check what's wrong?
Thanks
As far as I can see, the service is not really CORS-enabled. I have the same problem with my own examples here: as soon as I am not using some kind of proxy, I get this error.
The reason is that when you send a complex (non-simple, in CORS terms) request to the service, your browser automatically sends a so-called preflight request before the actual GET. The preflight is not a GET request but an HTTP OPTIONS request.
All the odata.org sample services return a 501 error at the moment for such requests.
You can, for example, use the simpleProxyServlet which is shipped with UI5, or of course any other proxy, to work around this.
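To check from outside the browser whether a service answers OPTIONS requests at all, a plain OPTIONS call can be sent with the JDK's HttpURLConnection. This is a rough diagnostic sketch, not a full preflight: the JDK drops CORS headers such as Origin by default, so only the HTTP method is exercised. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical diagnostic helper: reports how a service responds to HTTP OPTIONS.
final class OptionsProbe {
    private OptionsProbe() {}

    // Sends a bare OPTIONS request and returns the HTTP status code (e.g. 200 or 501).
    static int optionsStatus(String serviceUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(serviceUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("OPTIONS"); // same method the browser uses for a preflight
            conn.setConnectTimeout(5000);     // arbitrary 5 s limits so the probe cannot hang
            conn.setReadTimeout(5000);
            return conn.getResponseCode();    // does not throw for 4xx/5xx status codes
        } finally {
            conn.disconnect();
        }
    }
}
```

If optionsStatus returns 501 for the service URL, the browser's preflight will fail in the same way, and routing the request through a same-origin proxy is the practical workaround.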
You are getting this error because your browser refuses the request due to the same-origin policy. Here is what you can do:
Deploy the app on the same server or domain as the service that you want to call, so that both resources are in the same origin (if possible).
Disable the same-origin policy in the browser for local testing. Start Chrome with the following command, making sure that all instances of Chrome are closed before you run it:
[your-path-to-chrome-installation-dir]\chrome.exe --disable-web-security --user-data-dir
This allows all web sites to break out of the same-origin policy and connect to the remote service directly. Don't do this in your production app, as it poses a security risk.
Use a proxy.
The following documentation should help you understand and implement this:
Connecting with OData Service
Request failing due to Same-Origin Policy sharing (CORS)
Please use "proxy/http/services.odata.org/Northwind/Northwind.svc" as the service URL; I think it will solve your problem!

Inspect HTTP requests made by Apache Wink client

I'm using Apache Wink to access a service, and trying to debug a problem where the server apparently does not receive my request in the intended format (details below, but they are probably immaterial). Is there a way to make the Wink client log the HTTP requests that it makes to the server, so that I can see what is being sent down the wire?
Details: I'm using Eclipse Lyo to create a ChangeRequest in RTC (Rational Team Concert) using their OSLC v2 REST APIs. (Eclipse Lyo internally uses Apache Wink.) Even though I've set a "Filed Against" property in the ChangeRequest being submitted, RTC does not recognize it and complains that it is missing.
I think it's better to use a proxy to monitor the traffic. If your client runs on Windows, Fiddler is a very nice tool.
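One way to point a Java client at such a proxy is through the standard JVM proxy system properties. This is a sketch under two assumptions: Fiddler is running on its default port 8888, and the HTTP transport used by Wink honors the JVM-wide proxy settings (the JDK's HttpURLConnection does; other transports may need their own proxy configuration). The class name is made up for illustration.

```java
// Hypothetical helper: routes JVM-wide HTTP(S) traffic through a local debugging proxy.
final class FiddlerProxy {
    private FiddlerProxy() {}

    // Sets the standard JVM proxy properties; call this before the first request is made.
    static void enable(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", Integer.toString(port));
    }
}
```

Calling FiddlerProxy.enable("127.0.0.1", 8888) at startup is equivalent to passing -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888 (and the https variants) on the command line. For HTTPS traffic, the JVM also has to trust the proxy's certificate before the requests can be inspected.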

Grails handling network connection stall

I am using the Grails WS-Client plugin, but my application blocks while waiting for the SOAP response from the server whose web service I am consuming. The wait happens at this line:
def proxy = webService.getClient(wsdlUrl)
This mostly occurs when the server is down or the network connection is slow.
The wait also continues when the web service has been temporarily removed from the server and the WSDL URL redirects to the website's home page when opened in a web browser.
How can I detect whether the WSDL is present, and how can I set a timeout so that the client waits at most 10 seconds for a response and then continues executing normally in case of a stall?
I also don't get any exception or error.
Sounds like there are no read and/or connect timeouts set on the client by default. This should help if the web service is down: proxy.setConnectionTimeout(value_in_milliseconds)
I'm not sure about setting the read timeout though, which is what you'd see if the host was up and accepting connections but the web service wasn't available or not responding. The best solution we found for this was to use the Apache Commons HTTP client instead of the default client, which gave us much more granular configuration over the client's connection settings. It's possible they're in the WS-Client plugin also, but the relevant documentation (actually the GroovyWS documentation) doesn't appear to mention anything about read timeouts.
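Independently of the plugin, the WSDL URL can be checked up front with explicit connect and read timeouts using only the JDK. This is a minimal sketch: the class and method names are hypothetical, and it treats anything other than a direct 200 response (including a redirect to the home page) as "not available".

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical pre-check: is the WSDL reachable within a bounded time?
final class WsdlProbe {
    private WsdlProbe() {}

    // Returns true only if the URL answers with HTTP 200 before the timeouts expire.
    static boolean isWsdlAvailable(String wsdlUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(wsdlUrl).toURL().openConnection();
            conn.setConnectTimeout(timeoutMillis);   // covers "server is down or unreachable"
            conn.setReadTimeout(timeoutMillis);      // covers "connected, but no response"
            conn.setInstanceFollowRedirects(false);  // a redirect to the home page is not a WSDL
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            return false;                            // refused, timed out, or otherwise failed
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

Calling WsdlProbe.isWsdlAvailable(wsdlUrl, 10000) before webService.getClient(wsdlUrl) keeps each phase of the check to at most 10 seconds, so the application can skip the call or report an error instead of stalling.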
