I have the following section of code from my screen scraping script (in a Rails 3.1 application):
# Add each row to a new call record
page = agent.page.search("table tbody tr").each do |row|
next if (!row.at('td'))
time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
call = Call.find_or_create_by_time(time)
call.update_attributes({:time => time, :source => source, :destination => destination, :duration => duration})
end
This was working but I think a few changes have been made on the remote site (they don't currently have an API).
The new HTML code is as follows:
<tr class='o'>
<td class='checkbox'><input class="bulk-check" id="recordings_13877" name="recordings[13877]" type="checkbox" value="1" /></td>
<td>09 Feb 11:37</td>
<td>Danny McClelland</td>
<td>01772123573</td>
<td>00:00:28</td>
<td></td>
<td class='opt recording'>
<img alt="" class="icon recordings" src="/images/icons/recordings.png?1313703677" title="" />
<img alt="" class="icon recording-remove" src="/images/icons/recording-remove.png?1317304112" title="" />
</td>
</tr>
Since the suspected changes the data is being imported in the wrong fields or being missed completely. Currently the only part of the data I want/need is:
<td>09 Feb 11:37</td>
<td>Danny McClelland</td>
<td>01772123573</td>
<td>00:00:28</td>
Sadly, those rows don't have any unique identifiers though.
Any help/advice is appreciated!
Is there a better way to write the script that is more 'future' proof?
The first td is a checkbox now, so skip it and take the next four cells.
So just change it to:
time, source, destination, duration = row.search('td')[1..4].map{ |td| td.text.strip }
There's really no way to future proof a scraper (unless you're psychic)
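That said, a scraper can at least fail loudly instead of silently importing data into the wrong fields. Below is a minimal plain-Ruby sketch of that idea; the helper name extract_call_fields and the two format constants are made up for illustration, and the formats are guessed from the single sample row in the question, so they are assumptions rather than a known spec of the remote site.

```ruby
# Hypothetical formats, inferred from the one sample row in the question.
TIME_FORMAT     = /\A\d{2} [A-Z][a-z]{2} \d{2}:\d{2}\z/   # e.g. "09 Feb 11:37"
DURATION_FORMAT = /\A\d{2}:\d{2}:\d{2}\z/                 # e.g. "00:00:28"

# Hypothetical helper: takes the text of every td in a row and returns
# the four call fields, or raises if the row no longer looks as expected.
def extract_call_fields(cells)
  # Cells without text (checkbox and icon columns) are dropped first.
  values = cells.map(&:strip).reject(&:empty?)
  time, source, destination, duration = values

  unless TIME_FORMAT.match?(time.to_s) && DURATION_FORMAT.match?(duration.to_s)
    raise ArgumentError, "unexpected row layout: #{cells.inspect}"
  end

  { :time => time, :source => source, :destination => destination, :duration => duration }
end

row = ["", "09 Feb 11:37", "Danny McClelland", "01772123573", "00:00:28", "", ""]
puts extract_call_fields(row).inspect
```

In the scraping loop, the cells array would come from row.search('td').map { |td| td.text }, and the returned hash could be passed to update_attributes. This does not make the scraper future proof, it only turns a layout change into an error you notice.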
I'm running into an issue, and I think the issue is with how my page.all is pulling radio button questions in.
So here is the HTML for the table itself (multiple questions with 5 radio button choices apiece):
<table class="table table-striped table-stuff table-collapsible">
<colgroup>
<thead>
<tbody>
<input id="0_answer_question_id" value="9966" name="response[answers][0][answer_id]" type="hidden">
<tr>
<td class="heading">
<td class="option">
<div class="radio-inline radio-inline--empty">
<input id="question_1_1" value="1" name="response[answers_attributes][0][answer_opinion]" type="radio">
<label for="question_1_1">Strongly Disagree</label>
</div>
</td>
<td class="option">
<td class="option">
<td class="option">
<td class="option">
</tr>
<input id="response_1_question_id" value="9966" name="response[answers_attributes][1][answer_question_id]" type="hidden">
<tr>
<input id="response_1_id" value="<a number>" name="response[answers_attributes][1][id]" type="hidden">
<Same as above repeated 5 times with numbers changed>
</tbody>
</table>
I'm using:
page.all('table.table-stuff tbody tr', minimum: 6).each do |row|
row.all("td label").sample.trigger('click')
end
To get each row and select one option from it. HOWEVER, I notice that "sometimes" a row will not have one selected. My theory is that the "heading" cell (which has a <label> itself) is perhaps accepting one of the clicks. From my understanding of how page.all works, it's grabbing every tbody tr within the table, but maybe it is grabbing the heading too, since it contains a td label?
Also, when a table is named something like table table-striped table-stuff table-collapsible, how can you tell what the actual table "name" is when putting it in page.all('table.<etc>')? (I didn't write this website, I'm just writing tests for it.)
If the heading td (it's not expanded in your example) also contains a label element (so it would be included in the results of your all call) then you just need to change the CSS selector so it wouldn't be included - something like
row.all("td.option label").sample.trigger('click') # only choose labels contained in tds with the class of 'option'
or
row.all("td:not(.heading) label").sample.trigger('click') # choose labels contained in tds without the class of 'heading'
On your second question about table names, I don't really understand what you're asking. Tables don't have name attributes, they could have an id attribute or a caption containing some text which could then be used to find them with capybara via find(:table, 'id or caption text') or within_table('id or caption text') { code to execute within scope of the table }. Rather, you seem to be talking about the classes on the element which are specified in a CSS selector with '.'. Therefore a CSS selector to match a table element with all the classes you listed would be - 'table.table.table-striped.table-stuff.table-collapsible'
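To make the class-to-selector mapping concrete, here is a minimal Ruby sketch. The helper name css_selector_for is made up for illustration and is not part of Capybara; it only shows that each whitespace-separated class in the attribute becomes one '.class' segment of the selector.

```ruby
# Hypothetical helper (not a Capybara API): builds a CSS selector that
# requires every class listed in an element's class attribute.
def css_selector_for(tag, class_attribute)
  classes = class_attribute.split          # classes are whitespace-separated
  ([tag] + classes).join(".")
end

selector = css_selector_for("table", "table table-striped table-stuff table-collapsible")
puts selector
```

Any subset of the classes is also a valid selector, which is why 'table.table-stuff' in the question already matches this table.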
Note: If you're sure there are always exactly 5 choices, you could add the :count option to your all call to make sure your selector is only finding those items
row.all("td.option label", count: 5).sample.trigger('click')
I have a table like this with many numbers; below is one particular row from this table.
<tr>
<td>5100<td>
<td> Number description<td>
<td>long number description<td>
<td>
<input class="checkbox" type="checkbox" checked="" value="5100" name="id_rola[]">
<td>
<td>
<a href="javascript:documeny.frolazak_5100.submit();">
<b><img src="temp/img/foto.gif"></b>
</a>
<td>
<tr>
What I am trying to do is check the checkbox for this number and then click the link, but I have a problem with both steps. First, when I call check("5100"), I get unable to find checkbox "5100". Then, when I try to click the link with find(:xpath, "//a[@href="javascript:document.frolazak_5100.submit();]").click, I get unable to find xpath.
Thanks for any suggestions.
EDIT: ANSWERS
Check checkbox with particular value:
find(:css, ".checkbox[value='5100']").set(true)
Click on image in link in row:
To do this I wrote:
all('tr').each do |tr|
next unless tr.has_text?('5100')
@href = tr.all('td a')[0][:href]
end
visit @href
But this solution was slow, so I first printed @href with puts; after that I could work directly with the params in that href and simply visit it instead of using all. BTW, why didn't I think of that earlier :)
I am working on a Rails 3.2.11 app using angular 1.0.5.
Currently, a user will select a Cycle from a dropdown, and that will return a bunch of JSON from my controller using ng-resource.
Here is the method
$scope.update = function(cycleId) {
Cycle.get({action: cycleId}, function(resource) {
$scope.selectedCycle = resource;
$scope.tasks = resource.tasks;
$scope.newTask = {cycle_id: resource.cycle.id};
});
};
Here is an example of what json my controller is returning, which is 'resource' in above function: https://gist.github.com/anonymous/01ffe5a37e370661f6fb
Basically I need to use ng-repeat twice (one of them nested), so that I can get the task_type_name in there as a header. I'm getting some weird results. See the shortened code from my view below and the full thing here
<section ng-repeat="(task_type_name,task_type) in tasks ">
<h2>{{task_type_name}}</h2>
<table>
<tr>
<th>Task Name</th>
</tr>
<section id="task-edit">
<tr ng-repeat="task in task_type">
<td>
<%= link_to "{{task.name}}", '', "ng-click"=>"toggleShowHistory(task.id)" %>
</td>
</tr>
</section>
</table>
So my problem occurs at this part here
<section id="task-edit">
<tr ng-repeat="task in task_type">
If I try to combine that section and tr, OR change tr to ANYTHING but a tr, {{task}} no longer becomes available.
<section class="task-edit" ng-repeat="task in task_type">
{{task}} is available right here
<tr>
{{task}} is not available right here
<td>
{{task}} is not available right here
</td>
</tr>
</section>
I tested the same concept on the first loop, and it seems to be fine on that loop just not the second, nested loop.
I'm assuming it has something to do with the scope. But I'm just not getting it.
Also, if you have any tips, I'm very new to Angular and would love them.
I created a demo, and I am not seeing any issue. Please check your data source and make sure you plug in the tasks value of the json.
You need to change the nested section to tbody.
Demo on jsfiddle
I am using this example for file uploader.
Now it works this way:
I upload a file. After the file is saved, the function do_picture_analyse calls R and produces a histogram (this is the simplest version; in the more complicated version 2 packages have to be installed in R), and a picture of the histogram is saved. The problem is that if I want to upload 50 files, it takes a lot of time to load the 2 packages in R for each file separately (after_save callback).
What I need:
I upload a file, the file is saved, I click on a button "Histogram", and the function do_picture_analyse is called on all files that are in the database. (It doesn't matter if some of the files have already been analyzed.)
So I need only to know how to make an interaction between a button and a call of the function and nothing more.
My show.html.erb:
<script id="template-download" type="text/x-tmpl">
{% for (var i=0, file; file=o.files[i]; i++) { %}
<tr class="template-download fade">
<td></td>
<td class="name">
{%=file.name%}
</td>
<td class="nam">
{%=file.name%}
</td>
<td class="size"><span>{%=o.formatFileSize(file.size)%}</span></td>
<td class="Pic">
<button class="btn btn-mini btn-info">Pic</button>
</td>
<td class="Hist">
<button class="btn btn-mini btn-primary" >Hist</button>
</td>
<td class="delete">
<button class="btn btn-mini btn-danger" data-type="{%=file.delete_type%}" data-url="{%=file.delete_url%}">
<i class="icon-trash icon-white"></i>
</button>
<input type="checkbox" name="delete" value="1">
</td>
</tr>
{% } %}
</script>
my upload.rb:
def to_jq_upload
{
"name" => (read_attribute(:upload_file_name)).split(".").first,
"size" => read_attribute(:upload_file_size),
"url" => upload.url(:original),
"delete_url" => upload_path(self),
"delete_type" => "DELETE",
"url_chip_image"=>read_attribute(:chip_image),
}
end
after_save :do_picture_analyse
def do_picture_analyse
if read_attribute(:chip_image)==nil
require 'rinruby'
myr = RinRuby.new(echo=false)
myr.filepath=upload.path(:original)
myr.fileurl=upload.url(:original)
myr.eval <<EOF
s=read.table(filepath)
for(j in nchar(filepath):1){
if(substr(filepath,j,j)=="/"){
savepath<-substr(filepath,1,j-1)
file.name<-filepath
file.name<-substr(file.name,j+1,nchar(filepath)-4)
break
}
}
file.name1<-paste(file.name,"image.jpeg",sep="_")
savepath<-paste(savepath,file.name1,sep="/")
jpeg(filename=savepath,width=250, height=250)
hist(s$V1)
dev.off()
EOF
self.update_attributes(
:chip_image => (((myr.fileurl).split("?").first)[6..-5]+'_image.jpeg')
)
end
end
EDIT:
do_picture_analyse can take a folder as a parameter and analyse all files inside it, loading the packages only once for the entire folder. There are only two folders for the files (two different types of files, let's say .txt and .blabla files, which are saved either in the txt-Folder or in the blabla-Folder). The type of the folder is saved in the database as well. When the button is clicked, the two folders should be passed to do_picture_analyse and it will do everything.
Thanks in advance
You need to create a new route for this:
resources :name_of_your_controller do
# use this if you want a route like resources/:id/analyze (single file)
get :analyze, on: :member
# use this if you want a route like resources/analyze (multiple files)
get :analyze, on: :collection
end
Then create a new action on your controller:
def analyze
# for single file analysis do something like this :
@file = File.find( params[:id] )
@file.do_picture_analyse
respond_to do |format|
# render what you need to render, js or html
end
# ... or do something like this for multiple file analysis :
@files = File.where( params[:search] )
@files.each { |f| f.do_picture_analyse }
# etc.
end
You can then link your button to your action:
# single file
<%= link_to "Histogram", analyze_file_path( file ) %>
# multiple files
<%= link_to "Histogram", analyze_files_path( search: your_search_conditions ) %>
PS: if your method needs a lot of processing power (since you use R, I assume complex calculations are involved), you should consider extracting it into a worker that runs as a background task.
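As a rough illustration of the background-task idea, here is a plain-Ruby sketch that uses only Thread and Queue from the standard library. In a real Rails app you would more likely reach for a dedicated job library; the file names and the "analysed ..." string are placeholders standing in for the slow R call, not part of the original code.

```ruby
require 'thread'

jobs    = Queue.new   # paths waiting to be analysed
results = Queue.new   # finished work, collected by the caller

# The worker pulls paths off the queue until it sees the :done marker.
worker = Thread.new do
  while (path = jobs.pop) != :done
    results << "analysed #{path}"   # placeholder for do_picture_analyse
  end
end

# The caller only enqueues work, so it is not blocked by the analysis itself.
%w[a.txt b.txt].each { |path| jobs << path }
jobs << :done

worker.join   # only needed in this demo, to wait before reading the results
puts results.size
```

The design point is that the controller action would only enqueue the work and respond immediately, while the analysis runs elsewhere.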
edit
response to your comments :
I think you should extract this method and make it a class method that accepts one or more paths.
Then create a collection route that points to your controller; in your controller action, load the files according to some params and do something like this:
# find the directories to be processed :
paths = @files.map(&:folder_type).uniq
# pass them to your class method :
File.do_picture_analyse(paths)
It is even possible to create class methods that automatically handle these two steps for all files in a relation:
def self.perform_analysis!
paths = all.map(&:folder_type).uniq # or uniq.pluck(:folder_type) on rails >= 3.2.1
do_picture_analyse(paths)
end
def self.do_picture_analyse( *paths )
# call R from here
end
then you can do :
File.where( your_search_params ).perform_analysis!
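The point of collecting the unique folders first is that the expensive setup (loading the R packages) can run once per batch rather than once per file. Here is a minimal plain-Ruby sketch of that idea, with no Rails or R involved; UploadRecord, BatchAnalyser, perform_analysis and load_packages are hypothetical stand-ins, and only folder_type mirrors a name used above.

```ruby
# Stand-in for a database record; folder_type mirrors the attribute above.
UploadRecord = Struct.new(:name, :folder_type)

# Hypothetical analyser that counts how often the expensive setup runs.
class BatchAnalyser
  attr_reader :package_loads, :analysed_folders

  def initialize
    @package_loads = 0
    @analysed_folders = []
  end

  # Analyse each distinct folder once, loading the packages a single time.
  def perform_analysis(records)
    folders = records.map(&:folder_type).uniq
    return analysed_folders if folders.empty?

    load_packages
    folders.each { |folder| @analysed_folders << folder }
    analysed_folders
  end

  private

  # Placeholder for the slow step (starting R and loading its packages).
  def load_packages
    @package_loads += 1
  end
end

records = [
  UploadRecord.new("a.txt", "txt-Folder"),
  UploadRecord.new("b.txt", "txt-Folder"),
  UploadRecord.new("c.blabla", "blabla-Folder")
]
analyser = BatchAnalyser.new
analyser.perform_analysis(records)
puts analyser.package_loads
puts analyser.analysed_folders.inspect
```

With three files in two folders, the setup runs once and each folder is visited once, which is the saving the question is after.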
The short and sweet answer: in your show.html.erb write:
<td class="Hist">
<%= link_to 'Histogram', histogram_path, :method => :post, :class => 'btn btn-mini btn-primary' %>
</td>
In your config/routes.rb add the following line
post '/histogram' => 'your-controller#histogram', :as => 'histogram'
This means that histogram_path will point to a controller named your-controller and call the action histogram. Please replace those with your own names.
And then you should be good to go.
I have taken the liberty to propose a POST action, because I am assuming the action is not idempotent. If it is, you should use a GET.
Hope this helps.
I have the following HTML code :
<table class="report" width="100%">
<thead>
</thead>
<tbody>
<tr class="alt">
<td>
<a onclick="window.open(this.href);return false;" href="/search/searches/1563/reports/946">56175-746-45619568-noor.fli.zip</a>
</td>
<td class="_"> Report </td>
<td class="_"> 09 Apr 2012</td>
<td class="_"> Noor</td>
<td class="_"> 2.8 MB</td>
<td class="_">Ready</td>
</tr>
I want to click on the link 56175-746-45619568-noor.fli.zip (href="/search/searches/1563/reports/946"), but I do not want to use XPath. I tried a lot of things but failed. Is there a way to click on this link without using XPath? Thanks a lot.
You can use the href
br.link(:href => '/search/searches/1563/reports/946').click
or the text
br.link(:text => '56175-746-45619568-noor.fli.zip').click
or you can use variations with regex matches
br.link(:href => /reports/).click
or
br.link(:text => /noor.fli.zip/).click
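One detail worth knowing about the regex variants: an unescaped "." matches any character, so /noor.fli.zip/ is looser than it looks. A small plain-Ruby sketch of the difference follows; the second test string is invented for illustration, and the idea of passing the stricter pattern to :text is an assumption that it behaves like any other Regexp there.

```ruby
link_text = "56175-746-45619568-noor.fli.zip"

loose  = /noor.fli.zip/                              # "." matches any character
strict = Regexp.new(Regexp.escape("noor.fli.zip"))   # "." must be a literal dot

puts loose.match?(link_text)         # true
puts loose.match?("noor-fli-zip")    # true, probably not intended
puts strict.match?(link_text)        # true
puts strict.match?("noor-fli-zip")   # false
```

For this page the loose pattern is very likely fine; escaping only matters if similarly named links could appear.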
Is it the only link in that table? Or always the first link in that table?
browser.table(:class => 'report').a.click
If there are multiple tables, then you have to figure out how to find the one you want, perhaps by the text inside the table. If in your example the text Noor is unique to that table, then you could try something like this
browser.table(:class => 'report', :text => /Noor/).a.click
or, if you know the structure above will persist (where the link and the info about the report are on a single table row):
browser.row(:text => /Noor/).a.click
You'd have to try them and decide which is going to be the most robust or least brittle.