Troubleshooting nightmare: non-replicatable errors while webscraping - ruby-on-rails

I'm trying to run a webscraper that scrapes indeed.com and applies for jobs. What really gets me are the inconsistent, seemingly random errors. I'm not a programmer, but as far as I understand, if 2+2=4, then it should always be 4.
Here is the script I'm trying to run:
https://github.com/jmopr/job-hunter/blob/master/scraper.rb
It seems to only work with Firefox v45.0.2 because of the geckodriver.
My own fixes in scraper.rb if you wish to execute the script yourself:
config.allow_url("indeed.com")
JobScraper.new('https://www.indeed.com/', ARGV[0], ARGV[3]).scrape(ARGV[1], ARGV[2])
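(For context, the allow_url call goes inside a Capybara::Webkit.configure block. A minimal sketch, assuming the script uses capybara-webkit 1.7 or later; the wildcard line is just an extra illustration, not part of the original fix:)

  # Sketch of where the allow_url fix would live, assuming capybara-webkit >= 1.7.
  require 'capybara/webkit'

  Capybara::Webkit.configure do |config|
    config.allow_url("indeed.com")    # the fix mentioned above
    config.allow_url("*.indeed.com")  # hypothetical: also allow subdomains such as www.indeed.com
  end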
ERRORS
Example 1
def perform_search
  # For indeed
  save_and_open_page
  fill_in 'q', :with => @skillset
  fill_in 'l', :with => @region
  find('#fj').click
  sleep(1)
end
Error: unable to find '#fj'. So it was able to find q and l, but not fj. q and l are form fields while fj is a button. How was it able to find the form fields but not the button? I re-executed the code (via rails server in the terminal) and the error went away; later it came back again, seemingly at random. How is this possible? I can't even predict when it will happen so that I can save_and_open_page.
Example 2: this error comes up when you run a search; no jobs get listed.
Error: block passed to #window_opened_by opened 0 windows instead of 1 (Capybara::WindowError)
Re-execute the code and the error goes away, then later it comes back...
To clarify on example 2:
That error sometimes comes up because I have a Canadian IP address and it redirects me to indeed.ca. However, when I used a US IP address via a VPN, that error was consistent 100% of the time. In an attempt to work around this, I've modified the code to go to the US version of the site; again, the error is consistent 100% of the time. Any idea why this window is not popping up when I'm on the US version of indeed.com?
Summary:
I'm not necessarily looking for solutions, but rather an understanding of what is going on. Why the randomness in the errors?

2+2=4 under a given set of assumptions and conditions. Browsers and scrapers unfortunately aren't that predictable, with random delays, page throttling, changing pages, varying support levels for different technologies, etc.
In your current case, the reason for the window_opened_by error could be that Capybara.default_max_wait_time isn't set long enough (it controls how long Capybara will wait for the window to open). However, if you try the search manually you'll see that indeed.com no longer opens the job description in a new window when the current window is wide enough to show it in a right-hand panel. Basically, the code you're trying to use is no longer fully compatible with indeed.com due to changes in how the site works. You could fix this by setting the driver's window size to a size at which indeed.com will always open a new window, or by setting the window size big enough that job descriptions open on the same page and rewriting the code not to look for a new window.
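For example (a minimal sketch only; the wait time and window dimensions here are guesses, and it assumes a driver that supports window resizing):

  # Sketch: values are guesses, not tested against indeed.com.
  Capybara.default_max_wait_time = 10   # give window_opened_by more time to notice the new window

  # Force a narrow window so indeed.com falls back to opening job descriptions in a new window.
  page.current_window.resize_to(800, 600)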
As for the no '#fj' issue, the easiest way to debug that is to put
save_and_open_screenshot if page.has_no_css?('#fj')
before the find('#fj').click and see what the page looks like when there is no '#fj' element on it. Doing that shows indeed.com is randomly returning the mobile site. Why this is happening I have no idea, but it could just be what indeed.com does when it doesn't recognize the current user agent. If that's the case you can probably work around that by setting the user agent the capybara-webkit driver uses, or you could just switch to calling click_button('Find Jobs') which should click the button on both the mobile and non-mobile pages.
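A rough sketch of both workarounds, assuming the capybara-webkit driver (the user-agent string below is just an example desktop UA, not something the original code uses):

  # Pretend to be a desktop browser so indeed.com doesn't serve the mobile site.
  page.driver.header('User-Agent',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')

  # Or sidestep the mobile/desktop markup difference and click by label instead of id:
  save_and_open_screenshot if page.has_no_css?('#fj')   # keep the debug line while investigating
  click_button('Find Jobs')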

Related

Is it right that ASP.NET bundles get generated on every request?

We hit a performance issue recently that highlighted something that I need to confirm.
When you include a bundle like this:
@Scripts.Render("~/jquery")
This appears to run through the following (identified using dotTrace):
Microsoft.Ajax.Utilities.MinifyJavascript()
for every single request, both to the page that has the include and to the script itself.
I appreciate that in a real-world scenario there will only be one hit to the script, as the client will cache it. However, it seems inefficient to say the least.
The question is: is this expected behavior? If it isn't, I'd like to fix it (so any suggestions are welcome), but if it is, we can pre-minify the scripts.
UPDATE
So, even if I change the compilation mode to debug, it still fires the minify method. It outputs the individual URLs, but still tries to minify them.
However, if I remove all the references to the render methods, it doesn't try to minify anything, runs rapidly, doesn't balloon the app pool, and doesn't max out the CPU on the web server.

Delphi 5 application partially loaded in task manager, takes forever to actually display

I have an application written in Delphi 5, which runs fine on most (windows) computers.
However, occasionally the program begins to load (you can see it in task manager, uses about 2.5-3 MB of memory), but then stalls for a number of minutes, sometimes hours.
If you leave it long enough, the FormShow event will eventually fire and the application window will pop up, but it seems like some other application or Windows setting is preventing it from initially using all the memory it needs to run (approx. 35-40 MB).
Also, on some of my client's workstations, if they have MS Outlook running, they can close it and my application will pop up. Does anyone know what is going on here, and/or how to fix it?
Since nobody has given a better answer I'll take a stab at how to solve this:
There's something in your initialization that is locking it up somehow. Without seeing your code I do not know what it is so I'll only address how to go about finding it:
You need to log what you accomplish during startup. If you have any kind of screen showing I find the window title useful for this, but it sounds like you don't--that means you need to write the log to a file. Let it hang, kill the task, and see how far it got.
Note that this means you need to cleanly write your data despite an abnormal program termination. How to go about this:
A) Append, write your line, close.
B) Write your line, then flush the file handle.
C) Initially write your file to consist of a large number of blanks--ensure this is larger than the actual log will be. Write your line. In case of abnormal termination it will retain the original larger file size.
I would write a timestamp on every log item so you can see if it's just processing something too slowly.
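The question is Delphi, but the pattern is language-agnostic; here is a rough Ruby-flavoured sketch of option A plus timestamps (the file name and messages are made up):

  # Option A: open in append mode, write one timestamped line, close immediately,
  # so the line survives even if the process is killed right afterwards.
  def log_step(message)
    File.open("startup.log", "a") do |f|
      f.puts "#{Time.now.strftime('%H:%M:%S.%L')} #{message}"
      f.flush   # option B as well: flush before the block closes the handle
    end
  end

  log_step "before reading config"
  # ... one initialization step ...
  log_step "after reading config"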
If examining the log shows you where the problem is, fine. If, as usually happens, it's not enough, put a bunch more logging between the last item that did get logged and the next one that didn't--I've been known to log every line when hunting a cryptic problem that only happened on someone else's system.
If finding the line isn't enough to pinpoint the problem also dump the value of relevant variables.
Finally, if such intense scrutiny makes the bug go away start looking for an uninitialized variable. (While a memory stomp is also an option I doubt it's the culprit here.)

Getting strange Capybara issues

So I'm using capybara to test my backbone app. The app uses jquery animations to do slide transitions.
So I have been getting all kinds of weird issues, like elements not being found (even when using the waiting finders and disabling the jQuery animations). I switched from the Chrome driver back to Firefox and that fixed some of the issues. My current issues include:
Sometimes it doesn't find elements if the browser window is not maximized, even though they return true for .visible? if I inspect with pry (this is a fixed-width slide with no responsive stuff),
and the following error:
Failure/Error: click_link "Continue"
Selenium::WebDriver::Error::StaleElementReferenceError:
Element not found in the cache - perhaps the page has changed since it was looked up
Basically, my questions are:
What am I doing wrong to trigger these issues?
Can you tell me if I have any other glaring issues in my code?
And when using a waiting finder, do I need to chain my click to the returned element to ensure it has waited correctly, or can I just find the element and call the click on another line?
Do I have to chain like this
page.find('#myDiv a').click_link('continue')
Or does this work?
page.find('h1').should have_content('Im some headline')
click_link('continue')
Here is my code: http://pastebin.com/z94m0ir5
I've also seen issues with off-screen elements not being found. I'm not sure exactly what causes this, but it might be related to the overflow CSS property of the container. We've tried to work around this by ensuring that windows are opened at full size on our CI server, or in some cases scrolling elements into view by executing JavaScript. This seems to be a Selenium limitation: https://code.google.com/p/selenium/issues/detail?id=4241
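For example, something along these lines (the selector is a placeholder) scrolls an off-screen element into view before interacting with it:

  # Placeholder selector; execute_script runs arbitrary JavaScript in the page.
  page.execute_script("document.querySelector('#continue-link').scrollIntoView()")
  click_link 'Continue'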
It's hard to say exactly what's going wrong, but I'm suspicious of the use of sleep statements and the heavy use of evaluate_script/execute_script. These are often bad signs. With the waiting finder and assertion methods in Capybara, sleeps shouldn't be necessary (though you may need to set longer wait times for some actions). JavaScript execution, besides being a poor simulation of how the user interacts with the page, doesn't wait at all, and when you use jQuery, actions on selectors that don't match anything will silently fail, so that could result in the state of the page not being correct.
You do not have to chain. Waiting methods in Capybara are all synchronous.
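For example, something like this (with placeholder text and selectors) relies only on Capybara's built-in waiting, with no sleep and no chaining:

  # Each step waits up to the configured Capybara wait time on its own.
  page.should have_css('h1', text: 'Im some headline')   # waits for the slide transition to finish
  click_link('continue')                                  # a separate line is fine - no chaining needed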

How to fail gracefully and get notified if screen scraping fails in ruby on rails

I am working on a Rails 3 project that relies heavily on screen scraping to collect data, mainly using Nokogiri. I'm aggregating essentially all the same data, but I'm grabbing it from many different sources, and as time goes on I will be adding more and more. However, I am acutely aware that screen scraping can be notoriously unreliable.
As such I am interested in how other people have handled the problem of verifying the data and then also getting notified if it is failing.
My current plan is as follows.
I am going to have validation on my model for most of the fields. If they fail I won't get bad data into my system. Although logging this failure in a meaningful way is still a problem.
I was thinking of some kind of counter where, after so many failures from a particular source, I somehow turn it off. I'm not sure how to keep track of that. I guess the only way is to have a field on my Source model that counts failures and can be reset (roughly sketched below).
Logging is the 800-pound gorilla I'm not sure how to deal with. I could just do standard writing to logs, but if something fails I'd like to store the entire HTML so I can figure it out. Also I need to notify myself somehow so I can address the issues. I thought of maybe just creating a model for all this and storing it in the database. If I did this I'd probably have to store the HTML on S3 or something. I'm running this on Heroku, so that influences what I can do.
Set up begin and rescue blocks around every field. I was trying to figure out a way to code this in a nicer Ruby way so I just don't have a page of them. Although some fields are just a straight-up doc.at_css("#whatever"), there are quite a number that require various formatting or calculations, so I think it makes sense to rescue those so I can then log what went wrong (a rough sketch of what I mean follows below). The other option is to let the exception bubble up and catch it when I try to create the model.
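A rough sketch of the failure counter and the per-field rescue wrapper mentioned above (all model, column, and helper names here are hypothetical, not working code):

  # Hypothetical names throughout.
  class Source < ActiveRecord::Base
    FAILURE_LIMIT = 5

    def record_failure!
      increment!(:failure_count)
      update_attribute(:active, false) if failure_count >= FAILURE_LIMIT   # turn the source off after repeated failures
    end
  end

  # Wrap each field extraction so one bad field is logged instead of aborting the whole scrape.
  def extract(source, field_name)
    yield
  rescue => e
    Rails.logger.warn("#{source.name} #{field_name} failed: #{e.message}")
    source.record_failure!
    nil
  end

  title  = extract(source, :title)  { doc.at_css("#title").text.strip }
  salary = extract(source, :salary) { parse_salary(doc.at_css(".salary").text) }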
Anyway I'm sure I'm not even thinking of everything but that is why I'm trying to figure out how other people have handled this problem.
Our team does something similar to this, so here are some ideas:
We use a really high-level begin/rescue around a transaction to make sure we don't get into weird half-loaded states:
begin
  ActiveRecord::Base.transaction do
    # ...try to load a data source...
  end
rescue
  # ...error handling...
end
Email/page yourself when certain errors occur. We use exception_notifier, but if you're sitting on Heroku the Exceptional plugin also seems like a good option. I've also heard of people having success with Hoptoad.
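For instance, recent versions of the exception_notification gem let you trigger a notification manually from a rescue block (a sketch only; the loader method and data hash are made up):

  begin
    ActiveRecord::Base.transaction do
      load_data_source(source)   # hypothetical loader
    end
  rescue => e
    ExceptionNotifier.notify_exception(e, data: { source: source.name })
    raise
  end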
Capturing state is VERY important for troubleshooting issues. Something that's worked quite well for us is GMail. Our loaders effectively have two phases:
capture data and send it to our gmail account
log into gmail, download latest data and parse it
The second phase is the complex one, and if it fails a developer can simply log into the gmail account and easily inspect the failed message. This process has some limitations (per email and per mailbox storage limits, two phase pipeline, etc.) and we started out doing it because we had no other option, but it's proven shockingly resilient and convenient. Keep email in mind as a cheap/easy way to store noncritical state. We didn't start out thinking of using it that way and are now really glad we do. Logging into GMail feels better than digging through log files.
Build a dashboard UI. We have a simple dashboard with a grid of sources by day; each box is colored either red or green based on whether the load for that source on that day succeeded. You can go one step further and set up a monitor on this UI (mon.itor.us or equivalent) that alarms if some error threshold is met.

How do I get webrat / selenium to "wait for" the CSS of the page to load?

When I use webrat in selenium mode, visit returns quickly, as expected. No prob.
I am trying to assert that my styles get applied correctly (by looking at background images on different elements). I am able to get this information via JS, but it seems like the stylesheets have not loaded and/or gotten applied during my test.
I see that you can "wait" for elements to appear, but I don't see how I can wait for all the styles to get applied. I can put in a general delay, but that seems like built-in flakiness or slowness, which I am trying to avoid.
Obviously since I know what styles I'm looking for I can wait for them to appear. I'll write such a helper, but I was thinking there might be a more general mechanism already in place that I haven't seen.
Is there an easy way detect that the page is really really "ready"?
That's strange. I know that wait_for_page_to_load waits for the whole page, stylesheets included.
If you still think it's not waiting as it should, you can use wait_for_condition, which will execute a piece of JavaScript and wait until it returns true. Here's an example:
@selenium.wait_for_condition "selenium.browserbot.getCurrentWindow().document.body.style.backgroundColor == 'white'", "60000"
We ran into this when a page reported as loaded even though a ColdFusion portion was still querying a database for info to display. Subsequent processing would then occur too soon.
Look at the abstract Wait class in the Selenium API. You can write your own custom until() clause that could test for certain text to appear, text to go away (in the case of a floating message that goes away when the loading is done) or any other event that you can test for in the Selenium repertoire. The API page even has a nice example that helps a lot getting it set up.
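If you're working from Ruby, the selenium-webdriver gem exposes the same pattern through Selenium::WebDriver::Wait (note this is the WebDriver client, not the webrat/Selenium RC setup in the question); a rough sketch with placeholder URL, selector, and style check:

  require 'selenium-webdriver'

  driver = Selenium::WebDriver.for :firefox       # placeholder setup
  driver.get "http://localhost:3000/some_page"    # hypothetical page under test

  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until do
    driver.execute_script(
      "return getComputedStyle(document.querySelector('#header')).backgroundImage"
    ).include?("logo.png")
  end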
